HireFinch — rubric-grounded voice interviewing at production SLOs

Role: CTO & Lead Architect · Scope: realtime voice (OpenAI⇄Gemini⇄TTS), ATS/email, eval harness.

We built an agentic voice interviewer that maps to job-specific rubrics; every score cites the transcript span and rubric level, and managers receive an executive summary in minutes. Reliability comes from a provider-agnostic inference layer with circuit breakers and multi-model failover; safety from evidence-grounded scoring, proctoring, and audit-logged diffs when managers refine rubrics.
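Evidence citation is enforced at the data level: a score that arrives without a transcript span never reaches a manager. A minimal TypeScript sketch of that shape; the interfaces and field names are illustrative, not the production schema.

```ts
// Hypothetical shape for an evidence-cited score; names are
// illustrative, not HireFinch's actual schema.
interface TranscriptSpan {
  startMs: number;  // offset into the interview audio
  endMs: number;
  text: string;     // quoted candidate utterance
}

interface RubricScore {
  competency: string;          // e.g. "system design"
  level: number;               // rubric level awarded
  rubricAnchor: string;        // verbatim rubric-level text matched
  evidence: TranscriptSpan[];  // every score must cite at least one span
  rationale: string;           // model-written justification, audit-logged
}

// Reject any score that arrives without grounding evidence.
function assertGrounded(score: RubricScore): void {
  if (score.evidence.length === 0) {
    throw new Error(`ungrounded score for "${score.competency}"`);
  }
}
```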

  • ~96% reduction in human screening per 100 roles
  • p95 ≤ 1.2 s realtime reply latency with failover
  • 99.95% availability SLO with error-budget guardrails

What changed

Before: ad-hoc early screening buried recruiters and hiring managers in manual work. After: rubric-grounded voice interviews with evidence-cited scoring, manager summaries in minutes, and eval-gated releases. A provider-agnostic shim with circuit breakers handles OpenAI⇄Gemini failover; realtime p95 ≤ 1.2 s; availability SLO 99.95%. Proctoring blends stylometry, latency profiles, and webcam snapshots (≤7-day retention) with deepfake/TTS cues routed to reviewer queues.
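A minimal sketch of that failover pattern with a simple count-based circuit breaker; the thresholds, cooldown, and `Provider` signature are assumptions, not the production shim.

```ts
// Try the primary provider, trip a breaker after repeated failures,
// and fall back to the secondary. Thresholds are illustrative.
type Provider = (prompt: string) => Promise<string>;

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  isOpen(): boolean {
    if (this.failures < this.threshold) return false;
    if (Date.now() - this.openedAt > this.cooldownMs) {
      this.failures = 0; // half-open: let a call through after cooldown
      return false;
    }
    return true;
  }
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = Date.now();
  }
  recordSuccess(): void {
    this.failures = 0;
  }
}

async function replyWithFailover(
  prompt: string,
  primary: Provider,    // e.g. OpenAI realtime
  secondary: Provider,  // e.g. Gemini
  breaker: CircuitBreaker,
): Promise<string> {
  if (!breaker.isOpen()) {
    try {
      const reply = await primary(prompt);
      breaker.recordSuccess();
      return reply;
    } catch {
      breaker.recordFailure();
    }
  }
  return secondary(prompt); // failover path keeps the turn alive
}
```

Falling back to a second provider rather than retrying the first keeps turn latency bounded, which is what protects the p95 target.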

Safeguards & evals

  • Bias controls & explainability: scores must reference evidence; PII prompts avoided; manager notes scrubbed pre-model.
  • Proctoring: stylometry, latency, and webcam snapshots cross-check identity; deepfake/TTS cues route to a reviewer queue; snapshots auto-delete within seven days.
  • Release gates: WER ≤ 8% (p50) / 15% (p95), EOT ≤ 250 ms (p90), rubric-adherence F1 ≥ 0.85, PII leakage = 0 on the policy suite; a sketch of the gate check follows this list.
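A sketch of how those gates could be enforced in CI. The thresholds are the ones listed above; the `EvalReport` shape is an assumption, not the real harness output.

```ts
// Compare a nightly eval report against the release-gate thresholds.
interface EvalReport {
  werP50: number;    // word error rate, percent
  werP95: number;
  eotP90Ms: number;  // end-of-turn latency, milliseconds
  rubricF1: number;
  piiLeaks: number;  // leak count on the policy suite
}

function gateRelease(r: EvalReport): string[] {
  const failures: string[] = [];
  if (r.werP50 > 8) failures.push(`WER p50 ${r.werP50}% > 8%`);
  if (r.werP95 > 15) failures.push(`WER p95 ${r.werP95}% > 15%`);
  if (r.eotP90Ms > 250) failures.push(`EOT p90 ${r.eotP90Ms} ms > 250 ms`);
  if (r.rubricF1 < 0.85) failures.push(`rubric F1 ${r.rubricF1} < 0.85`);
  if (r.piiLeaks !== 0) failures.push(`${r.piiLeaks} PII leak(s)`);
  return failures; // empty array ⇒ release may proceed
}
```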

Eval harness snapshot

  • Coverage: WER p50/p95, EOT p90, rubric F1, PII leakage, cost per hire, transcript audit trail.
  • Automation: GitHub Actions triggers nightly evals; failures open PagerDuty tasks with failing spans attached.
  • Ops: managers can tweak rubrics in-product; diffs run through canary evals before auto-merge.
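A sketch of the paging path, assuming Node 18+ (global `fetch`) and PagerDuty's Events API v2; the parameter names and wiring to the gate check above are illustrative.

```ts
// Page the on-call when the nightly gate check fails.
async function pageOnFailure(
  routingKey: string,  // PagerDuty integration key for the eval service
  failures: string[],  // messages from gateRelease() above
  spans: string[],     // failing transcript spans to attach
): Promise<void> {
  if (failures.length === 0) return; // green run: nothing to page
  await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: routingKey,
      event_action: "trigger",
      payload: {
        summary: `Nightly evals failed: ${failures.join("; ")}`,
        source: "eval-harness",
        severity: "error",
        custom_details: { failingSpans: spans },
      },
    }),
  });
}
```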

Micro timeline

Problem

Recruiters drowned in early screens, and hiring managers lacked structured signal.

Design

Voice interviews grounded in rubrics, explainable scoring, and proctoring signals.

Evals

Regression harness across WER, latency, rubric F1, and policy prompts; red-team scripts exercised deepfake cues.

Impact

~96% less manual screening, realtime latency held within SLO, and recruiter trust earned through the audit trail.