Escaping the AI Hiring Doom Loop (Part III)
A “real” recruiter–candidate interview, with memory and receipts
In Part I of this series, I argued that we’ve created a self-reinforcing hiring loop: candidates use AI to generate perfectly optimized resumes; companies use AI to screen them at scale; both sides get louder, and the actual signal gets harder to hear.
In Part II, I tried a different primitive: a short screening conversation between two agents—a recruiter agent and a candidate agent—so the first filter isn’t a keyword scan, but a dialogue that leaves a trace.
This post is the third installment: the step where the prototype stopped behaving like a scripted demo and started behaving like… an interview.
What changed is simple to state, but a bit more complex to build:
The recruiter is no longer forced through a fixed number of questions.
The interview isn’t a linear four-step flow anymore.
Every turn is persisted, inspectable, and exportable.
The system can recall prior interactions (lightly) across runs.
The output is still structured, but it now has context.
Now we have an open-ended recruiter–candidate conversation runner, grounded in my “Professional Digital Twin”, with traceability as a first-class feature.
The goal: stop grading “documents” and start grading “interaction”
The hiring doom loop is powered by static artifacts:
a resume optimized for ATS parsing,
a cover letter optimized for pattern matching,
a screening model optimized for throughput.
So the obvious counter-move isn’t “better prompts” or “better parsing”.
It’s switching the object we evaluate: from a document to a short, structured, reviewable interaction. That idea was already present in Part II. The real challenge is making it realistic enough to be useful, while staying traceable enough to be trusted.
What I built since Part II
1) The recruiter can now decide when it has enough signal
Instead of running a pre-set “ask 3 questions then score” pipeline, the recruiter agent runs in a loop and can stop whenever it’s satisfied—by emitting a simple sentinel string. That one change matters a lot. It’s the difference between a questionnaire and an interview.
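The mechanism can be sketched in a few lines. This is an illustrative stub, not the actual code: the sentinel value, the turn cap, and the function names are all assumptions for the example.

```python
# Illustrative sketch of the stopping mechanism. The sentinel string,
# the turn cap, and the stub functions are assumptions, not the real code.
INTERVIEW_COMPLETE = "<<INTERVIEW_COMPLETE>>"
MAX_TURNS = 12  # hard cap so a confused model can't loop forever

def stub_recruiter(transcript):
    # Stand-in for the LLM call: asks two questions, then stops.
    questions = ["Tell me about your last role.", "What was your biggest win there?"]
    asked = sum(1 for turn in transcript if turn["role"] == "recruiter")
    return questions[asked] if asked < len(questions) else INTERVIEW_COMPLETE

def stub_candidate(question):
    return f"(answer to: {question})"

def run_interview(recruiter, candidate):
    transcript = []
    for _ in range(MAX_TURNS):
        question = recruiter(transcript)
        if question == INTERVIEW_COMPLETE:
            break  # the recruiter decided it has enough signal
        transcript.append({"role": "recruiter", "text": question})
        transcript.append({"role": "candidate", "text": candidate(question)})
    return transcript
```

The hard cap matters as much as the sentinel: without it, a model that never emits the stop string interviews forever.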
2) Every run is fully logged (and queryable)
I wanted the opposite of the ATS black hole. So the system persists:
every recruiter and candidate turn,
timestamps,
metadata (job_id, candidate_id, interview_id),
the final assessment JSON.
This is stored in SQLite and also written as a timestamped JSON log file.
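The JSON side of that dual write can be sketched like this (field names, the file-name pattern, and the temp directory standing in for logs/ are all assumptions):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Illustrative sketch of the timestamped JSON log; field names are assumptions.
run_meta = {"job_id": "cfo_financial_test", "candidate_id": "carlo_twin",
            "interview_id": "iv-001"}
turns = [
    {"role": "recruiter", "text": "Tell me about your last role.",
     "ts": datetime.now(timezone.utc).isoformat()},
]

logs_dir = Path(tempfile.mkdtemp())  # stands in for the real logs/ directory
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
log_path = logs_dir / f"interview_{run_meta['interview_id']}_{stamp}.json"
log_path.write_text(json.dumps({**run_meta, "turns": turns}, indent=2))
```

The point of the duplicate write is that the JSON file is human-diffable and shareable, while the SQLite rows are queryable.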
3) Long-term memory exists
No fancy “magic memory” here. The interview runner can pull a handful of the candidate’s most recent utterances from prior runs and use them as a lightweight recall mechanism. It’s intentionally simple—and intentionally auditable.
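Because the recall is just a SQL query over the interaction log, it stays auditable. A minimal sketch, assuming a flat interactions table (schema and names are mine, not the project's):

```python
import sqlite3

# Illustrative recall sketch: fetch the candidate's most recent utterances
# from *prior* interviews. Schema and column names are assumptions.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE interactions
              (interview_id TEXT, candidate_id TEXT, role TEXT, text TEXT, ts TEXT)""")
db.executemany("INSERT INTO interactions VALUES (?, ?, ?, ?, ?)", [
    ("iv-001", "carlo_twin", "candidate", "I led the data platform migration.", "2025-01-10T10:00:00Z"),
    ("iv-001", "carlo_twin", "recruiter", "What was the hardest part?", "2025-01-10T10:01:00Z"),
    ("iv-002", "carlo_twin", "candidate", "I partnered closely with finance on budgeting.", "2025-02-02T09:00:00Z"),
])

def recall(candidate_id, current_interview_id, limit=5):
    # Most recent first; excludes the interview currently in progress.
    return [row[0] for row in db.execute(
        """SELECT text FROM interactions
           WHERE candidate_id = ? AND role = 'candidate' AND interview_id != ?
           ORDER BY ts DESC LIMIT ?""",
        (candidate_id, current_interview_id, limit))]
```

The recalled snippets can simply be prepended to the recruiter's context; no embeddings, no ranking model, nothing you can't inspect with a SELECT.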
The architecture
There are three parts. Same as Part II, but with more separation, more persistence, and a more realistic loop.
1) Recruiter agent (LangChain + OpenAI)
The recruiter’s job is to behave like a senior recruiter:
ask one question at a time,
probe and branch as needed,
decide when it has enough signal,
then produce a structured assessment.
It’s invoked in two modes:
next-question mode (during the interview)
assessment mode (at the end, producing a JSON scorecard)
The recruiter’s model settings are configurable via environment variables (model, temperature, max_tokens).
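The two modes plus the env-driven settings can be sketched as follows. The variable names, prompts, and sentinel here are assumptions for illustration, not the actual contract:

```python
import os

# Illustrative sketch of the two recruiter modes and env-driven settings.
# Env var names, defaults, prompts, and the sentinel are all assumptions.
MODEL = os.environ.get("RECRUITER_MODEL", "gpt-4o-mini")
TEMPERATURE = float(os.environ.get("RECRUITER_TEMPERATURE", "0.3"))
MAX_TOKENS = int(os.environ.get("RECRUITER_MAX_TOKENS", "800"))

NEXT_QUESTION_PROMPT = (
    "You are a senior recruiter. Ask ONE follow-up question, "
    "or reply with <<INTERVIEW_COMPLETE>> if you have enough signal."
)
ASSESSMENT_PROMPT = (
    "Produce a JSON scorecard with skills_score, domain_score, "
    "constraints_score, overall_recommendation, should_interview."
)

def recruiter_prompt(mode):
    # One agent, two prompt contracts: questioning vs. final assessment.
    return NEXT_QUESTION_PROMPT if mode == "question" else ASSESSMENT_PROMPT
```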
2) Candidate agent (my Professional Digital Twin via the OpenAI Agents SDK)
The candidate isn’t “generic ChatGPT pretending to be a candidate”.
It’s my professional twin: grounded in a curated knowledge base of real experience and evidence, with its own session scope per interview.
That grounding has an important side effect: when the twin doesn’t have relevant evidence, it fails safely (it says it doesn’t have that experience) instead of hallucinating.
3) Orchestrator + memory layer (the boring part that makes it trustworthy)
The orchestrator coordinates the loop:
recruiter asks a question,
candidate answers,
both turns are printed and logged,
repeat until the assessment is ready,
then generate the final assessment and persist it.
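The loop above, end to end, looks roughly like this. Every name here is a stub standing in for the real component (the LangChain recruiter, the twin, the persistence layer):

```python
import json

# Illustrative end-to-end orchestration sketch; every name is an assumption,
# and each stub stands in for a real component (recruiter, twin, storage).
SENTINEL = "<<INTERVIEW_COMPLETE>>"

def ask_recruiter(transcript):  # stub for the LangChain recruiter call
    return SENTINEL if len(transcript) >= 2 else "Walk me through your last role."

def ask_candidate(question):    # stub for the digital-twin call
    return "I led platform engineering for four years."

def assess(transcript):         # stub for the recruiter's assessment mode
    return {"overall_recommendation": "maybe", "should_interview": True}

def persist(kind, payload):     # stub for the SQLite + JSON-log writes
    print(f"[{kind}] {json.dumps(payload)}")

def run():
    transcript = []
    while True:
        question = ask_recruiter(transcript)
        if question == SENTINEL:
            break
        answer = ask_candidate(question)
        transcript += [("recruiter", question), ("candidate", answer)]
        persist("turn", {"q": question, "a": answer})
    result = assess(transcript)
    persist("assessment", result)
    return transcript, result
```

The orchestrator knows nothing about prompts or models; it only sequences calls and persists the trace, which is what makes it swappable for a web backend later.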
The memory layer is a small SQLite-backed store with two tables:
interactions (one row per turn)
assessments (one row per final evaluation)
It’s intentionally simple because “simple and inspectable” beats “clever and opaque” in hiring contexts.
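A plausible minimal schema for those two tables (the column names are my assumptions; the real DB is interview_memory.db):

```python
import sqlite3

# Illustrative two-table schema; column names are assumptions.
db = sqlite3.connect(":memory:")  # the real store is interview_memory.db
db.executescript("""
CREATE TABLE interactions (
    id           INTEGER PRIMARY KEY,
    interview_id TEXT NOT NULL,
    job_id       TEXT NOT NULL,
    candidate_id TEXT NOT NULL,
    role         TEXT CHECK (role IN ('recruiter', 'candidate')),
    text         TEXT NOT NULL,
    ts           TEXT NOT NULL   -- ISO-8601 UTC timestamp
);
CREATE TABLE assessments (
    id              INTEGER PRIMARY KEY,
    interview_id    TEXT NOT NULL,
    assessment_json TEXT NOT NULL,  -- the final structured scorecard
    ts              TEXT NOT NULL
);
""")
```

Everything downstream (the recall query, the viewer, the evals) is a plain SELECT over these two tables.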
How it runs (yes, from a CLI right now)
At the moment this is still a developer-first workflow, run from the command line.
A typical run looks like:
python orchestrator.py \
  --job-id cfo_financial_test \
  --job-title "Chief Financial Officer (CFO)" \
  --job-description-text "[...]" \
  --candidate-id carlo_twin

And the output is:
a real-time transcript in the terminal,
a JSON log file under logs/,
a SQLite DB (interview_memory.db) with turns + assessment.
A concrete example: when the role doesn’t match, the system shouldn’t pretend it does
One of the best “test cases” I ran was intentionally adversarial: I gave the recruiter a CFO job description, while the candidate is… me.
The recruiter opens with the obvious CFO-screen question (“tell me about your most recent finance leadership role in SaaS/tech”), and the twin responds with a clean constraint:
“I have not held a finance-specific leadership role in a SaaS or tech company…” — my agent candidly admits it.
Then something interesting happens: the recruiter adapts. It pivots to:
“tell me about partnering with finance”
“how did you influence budgeting/forecasting”
“how would you approach the first 90 days”
and so on.
At the end, the recruiter generates a structured assessment—not a yes/no, but a scorecard with rationale and a “maybe” recommendation.
A simplified version of the returned structure:
{
  "skills_score": 50,
  "domain_score": 40,
  "constraints_score": 50,
  "overall_recommendation": "maybe not",
  "should_interview": false
}

This is the kind of behavior hiring systems rarely incentivize today:
be explicit about missing signal,
ask better follow-ups,
leave evidence behind,
produce a reviewable rationale.
What this approach can unlock (organizationally, not just technically)
I care about this experiment because it’s not “HR tech”. It’s an organizational design problem. When hiring becomes a black box, three things happen:
Candidates optimize for the machine, not for authenticity.
Recruiters optimize for throughput, not for signal.
Organizations lose institutional learning, because decisions aren’t traceable.
A persisted, inspectable interview artifact changes the incentives:
You can audit screening behavior.
You can improve prompts and rubrics based on real transcripts.
You can run evals over time (by role, by recruiter persona, by candidate profile version).
You can keep humans in the loop with better inputs, not with vibes.
That’s the real “beyond HR” piece: you’re creating a transparent argumentation space around hiring decisions—one where claims, evidence, and rationale can be inspected and improved.
The uncomfortable lessons so far
A few things became even clearer once the flow became open-ended.
Prompt dependency doesn’t go away, it just moves up a level
If the recruiter prompt is vague, you get meta-questions and hedging.
If it’s too rigid, you get premature scoring.
If you don’t define stopping rules, the interview never ends.
The recruiter agent is not “an LLM”. It’s a product surface.
Grounding is only as good as curation
The twin’s biggest strength is also its constraint: it can only be as good as its knowledge base and retrieval quality. That’s not a bug. It’s the point.
If we want authenticity, knowledge curation becomes part of the system, not an afterthought.
Reliability needs guardrails (schema validation, retries, monitoring)
The assessment output is structured JSON, but real systems need:
validation,
repair strategies,
monitoring of drift,
cost and latency observability.
That’s “LLMOps for hiring”, and it’s non-optional if you want to integrate with enterprise ATS.
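A minimal guardrail sketch for the first two items—validate the scorecard JSON and retry with a “repair” nudge when it is malformed. The schema, prompts, and stub model are all assumptions:

```python
import json

# Illustrative guardrail sketch: validate the assessment JSON and retry with a
# repair prompt when it fails. Schema, prompts, and stub are assumptions.
REQUIRED = {"skills_score": int, "domain_score": int, "constraints_score": int,
            "overall_recommendation": str, "should_interview": bool}

def validate(raw):
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

def assess_with_repair(call_model, max_attempts=3):
    prompt = "Produce the assessment JSON."
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return validate(raw)
        except ValueError as err:
            # Feed the failure back as a repair instruction and retry.
            prompt = f"Your last output was invalid ({err}). Return ONLY valid JSON."
    raise RuntimeError("assessment failed validation after retries")

# Stub model: fails once, then returns a valid scorecard.
attempts = {"n": 0}
def stub_model(prompt):
    attempts["n"] += 1
    if attempts["n"] == 1:
        return "not json at all"
    return json.dumps({"skills_score": 50, "domain_score": 40,
                       "constraints_score": 50,
                       "overall_recommendation": "maybe",
                       "should_interview": False})
```

In production you would also count those retries and their token cost—that is where drift and latency monitoring hook in.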
What’s next
The design doc already has a roadmap, but the next steps are pretty clear:
Separate the interview engine from the UI (so this can power a web app, not just a CLI).
Richer memory (vector retrieval, per-role summaries, versioning).
Configurable pipelines (role templates, recruiter personas, batch runs).
Validation + monitoring (token/cost tracking, eval harness).
Viewer + analytics (list runs, filter by job/date/recommendation, compute distributions).
And then the bigger question: how to make this safe, fair, privacy-aware, and operationally usable inside real hiring workflows without recreating the doom loop in a new form.
Closing thought
The AI hiring doom loop isn’t inevitable. But breaking it requires changing what we optimize.
If we keep optimizing for “perfect documents” and “fast screening”, we’ll keep amplifying noise.
If we instead optimize for grounded interaction, traceable rationale, and human reviewability, we at least have a chance to make hiring feel less like a slot machine—for candidates and for recruiters.
If you’re working on hiring, HR tech, talent ops, or agentic systems: I’d genuinely love to hear what you think is missing here (or what you think is naïve).

