Clinical Intake LLM Eval

Study design

An LLM is asked to conduct a mental-health intake interview with a simulated patient who has one presenting concern and four to five hidden comorbid conditions. The patient discloses hidden symptoms only when asked specifically. We measure whether the clinician model probes broadly enough to surface them. Each simulated patient is built from a phenotype, a clinically recognizable scenario where several disorders co-occur (e.g., depression-presenting bipolar II, postpartum depression with intrusive harm thoughts, mood-presenting first-episode psychosis), not a single diagnosis. Each profile's hidden domains co-occur with the presenting concern at rates documented in NCS-R, NESARC, and disorder-specific clinical studies. This page walks through how a single profile becomes a single eval cell.

108 generated profiles (stratified) · 540 conversations run · 5 clinician models · 2 judge models · 12 turns per conversation

Research questions

Three questions about how hidden comorbid domains are surfaced (or missed) during an unstructured intake.

Does an open-ended intake instruction produce broad symptom probing? When a clinician is told only to "conduct an intake" without explicit guidance to screen across domains, does the resulting interview probe widely enough to surface symptoms that fall outside the patient's lead complaint?
How does interview structure evolve across the 12-turn conversation? Does broad symptom probing continue across the conversation, or does the clinician shift into treatment planning before hidden domains are surfaced? Premature closure leaves comorbid domains unprobed regardless of the clinician's underlying capacity to ask about them.
Which clinical scenarios are most vulnerable to under-surfacing? Are phenotypes where the presenting concern misleads about the underlying condition (e.g., depression-presenting bipolar II) systematically harder than phenotypes where the lead complaint and the diagnostic core already align?

Main findings

Open-ended intake under-surfaces hidden domains. The three non-reasoning chat models (DeepSeek-V3, Llama-3.3-70B, Llama-3.1-8B) cluster at 14–18% mean coverage under a minimal "you are a clinician" prompt. Qwen 3 8B (small reasoning) reaches 27%, and Kimi K2.6 (larger reasoning) reaches 56%.
All models except Kimi K2.6 pivot from probing to treatment planning around the midpoint of the conversation. Median first treatment-planning turn is 6 for DeepSeek-V3 and Qwen 3 8B, 7 for Llama-3.1-8B, and 8 for Llama-3.3-70B. Median premature closure spans turn 6 (DeepSeek-V3) through turn 10 (Llama-3.3-70B). Kimi K2.6 holds out longest, with median first treatment-planning at turn 10 and median premature closure at turn 11.5.
Eating-disorder phenotypes, first-episode psychosis with mood, and bipolar masked are the worst-covered. Across the three non-reasoning models, first-episode psychosis with mood, eating disorder complex, binge eating internalizing, and bipolar masked all sit at or below 8% mean coverage.

See results & analysis for the full coverage matrix, per-domain breakdown, and trajectory data.

1 · Pipeline overview

The pipeline runs in four stages and produces one outcome per phenotype × model cell.

i. Profile generation

Sample a phenotype, comorbid domains, severity, symptoms, demographics. Render via LLM. Validate.

ii. Conversation

Patient model leads with presenting concern. Clinician model conducts a 12-turn intake. Disclosure rule constrains the patient.

iii. Per-turn judging

Two judge models label each clinician turn for question type, asked_about and disclosed per hidden domain, and patient faithfulness.

iv. Metrics

Active coverage, passive disclosure rate, premature closure, unprompted disclosure count. Plus inter-judge agreement.

Model roles in the pipeline

Patient

GPT-4o

Plays the patient — carries a presenting concern plus 4–5 hidden symptoms, and only discloses a hidden one when the clinician asks specifically about its domain.

Clinician (under test)

DeepSeek-V3
Llama-3.3-70B
Llama-3.1-8B
Qwen 3 8B
Kimi K2.6

Receives a minimal "conduct an intake" prompt.

Judge

Gemini 3 Flash Preview (primary)
Claude Sonnet 4.6 (cross-check)

Per-turn labels with a strict asked_about definition.

Generator (Qwen 3 235B) is held out from every other role to avoid leakage.

2 · Profile generation

Profile generation runs in six stages, from sampling a phenotype to rendering a patient via LLM.

STAGE 1

Phenotype sampling

Anchored in clinical comorbidity literature

18 named comorbidity bundles
Stratified mode: equal coverage
Weighted mode: clinical-population prior
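
The two sampling modes can be sketched as follows. This is a minimal illustration, not the study's code: the phenotype names are hypothetical stand-ins for the 18 bundles, and the function name and arguments are assumptions. If the 108 stratified profiles are spread evenly over 18 bundles, each bundle receives 6.

```python
import random
from collections import Counter

def sample_phenotypes(phenotypes, n, mode="stratified", weights=None, seed=0):
    """Return n phenotype names under either sampling mode."""
    rng = random.Random(seed)
    if mode == "stratified":
        # Equal coverage: cycle through the bundles, then shuffle the order.
        picks = [phenotypes[i % len(phenotypes)] for i in range(n)]
        rng.shuffle(picks)
        return picks
    if mode == "weighted":
        # Clinical-population prior: draw with replacement, proportional to weight.
        return rng.choices(phenotypes, weights=weights, k=n)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical stand-ins for the 18 named comorbidity bundles.
PHENOTYPES = [f"phenotype_{i:02d}" for i in range(18)]

stratified = sample_phenotypes(PHENOTYPES, 108)
counts = Counter(stratified)  # every bundle appears the same number of times
```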

STAGE 2

Domain set sampling

From phenotype required + optional pools

1 presenting domain
4–5 hidden domains
Conflict rules drop overlapping domains
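
One way to picture this stage is the sketch below. The phenotype dictionary, its keys, the domain names, and the single conflict pair are all hypothetical; the source only states that domains come from required and optional pools and that conflict rules drop overlapping domains.

```python
import random

def sample_domains(phenotype, rng, n_hidden_range=(4, 5)):
    """Pick 1 presenting domain and 4-5 hidden domains for one profile."""
    presenting = phenotype["presenting"]
    n_hidden = rng.randint(*n_hidden_range)
    hidden = list(phenotype["required"])  # required domains are always kept
    optional = [d for d in phenotype["optional"] if d not in hidden]
    rng.shuffle(optional)
    for candidate in optional:
        if len(hidden) >= n_hidden:
            break
        # Conflict rule: skip a domain that overlaps one already chosen.
        chosen = hidden + [presenting]
        if any(frozenset((candidate, d)) in phenotype["conflicts"] for d in chosen):
            continue
        hidden.append(candidate)
    return presenting, hidden

# Hypothetical phenotype definition, for illustration only.
EXAMPLE = {
    "presenting": "alcohol_use",
    "required": ["ptsd", "depression"],
    "optional": ["nightmares", "irritability", "insomnia", "panic", "cannabis_use"],
    "conflicts": {frozenset(("nightmares", "insomnia"))},
}

presenting, hidden = sample_domains(EXAMPLE, random.Random(0))
```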

STAGE 3

Severity per domain

Disorder-specific (mild/moderate/severe) priors

NCS-R severity bands as anchor
Independent draw per domain
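
A minimal sketch of the independent per-domain draw. The prior values below are placeholders invented for illustration; in the study they are anchored to NCS-R severity bands, which are not reproduced here.

```python
import random

LEVELS = ("mild", "moderate", "severe")

# Placeholder priors over (mild, moderate, severe); not the study's values.
SEVERITY_PRIORS = {
    "depression": (0.2, 0.5, 0.3),
    "ptsd": (0.3, 0.3, 0.4),
}
DEFAULT_PRIOR = (1.0, 1.0, 1.0)  # uniform fallback for unlisted domains

def sample_severities(domains, rng):
    """Draw one severity level per domain, each draw independent of the others."""
    return {
        d: rng.choices(LEVELS, weights=SEVERITY_PRIORS.get(d, DEFAULT_PRIOR), k=1)[0]
        for d in domains
    }

severities = sample_severities(["depression", "ptsd", "insomnia"], random.Random(0))
```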

STAGE 4

Symptom sampling

From DSM-5 criterion lists per domain

Severity-biased count
Min count enforced for diagnostic categories
Cross-domain dedup (avoid double-counting)
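
The three rules in this stage interact, so a sketch may help. The count ranges per severity, the symptom strings, and the minimum count used in the example are assumptions chosen for illustration, not the study's parameters.

```python
import random

# Hypothetical symptom-count ranges per severity level.
COUNT_RANGE = {"mild": (1, 2), "moderate": (2, 4), "severe": (4, 6)}

def sample_symptoms(criteria, severity, rng, min_count=0, already_used=frozenset()):
    """Draw symptoms for one domain from its criterion list."""
    # Cross-domain dedup: drop symptoms another domain already claimed.
    pool = [s for s in criteria if s not in already_used]
    low, high = COUNT_RANGE[severity]
    # Severity-biased count, raised to the diagnostic minimum,
    # then capped at what the pool can supply.
    n = min(max(rng.randint(low, high), min_count), len(pool))
    return rng.sample(pool, n)

rng = random.Random(0)
insomnia = sample_symptoms(
    ["sleep disturbance", "daytime fatigue", "early waking"], "mild", rng
)
depression = sample_symptoms(
    ["depressed mood", "anhedonia", "sleep disturbance",
     "fatigue", "worthlessness", "poor concentration"],
    "mild", rng, min_count=5, already_used=frozenset(insomnia),
)
```

In this example the mild draw alone would give depression one or two symptoms; the minimum count lifts it to five, and any symptom already assigned to insomnia is excluded from the pool.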

STAGE 5

Demographics + life context

Constrained by domain prerequisites

Age, sex, occupation, relationship
Postpartum / late-life prerequisites enforced
2–3 life context seeds

STAGE 6

LLM render + QC

Qwen 3 235B writes phenomenology

Verbatim symptom copy enforced
Fuzzy-match QC tolerates paraphrase
Failed renders go to a retry queue
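
A fuzzy-match check of this kind could look like the sketch below, which uses the standard library's difflib. The sliding-window approach, the 0.8 threshold, and the function names are assumptions; the source does not say how the QC is implemented.

```python
import re
from difflib import SequenceMatcher

def _normalize(text):
    """Lowercase and strip punctuation so matching works on words only."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

def symptom_present(symptom, rendered_text, threshold=0.8):
    """True if some window of the render closely matches the symptom phrase."""
    target = _normalize(symptom)
    words = _normalize(rendered_text)
    phrase = " ".join(target)
    width = len(target)
    for i in range(max(1, len(words) - width + 1)):
        window = " ".join(words[i:i + width])
        if SequenceMatcher(None, phrase, window).ratio() >= threshold:
            return True
    return False

def failed_symptoms(symptoms, rendered_text, threshold=0.8):
    """Symptoms the render failed to carry; non-empty means retry."""
    return [s for s in symptoms if not symptom_present(s, rendered_text, threshold)]

render = "I have troubles falling asleep most nights, and I feel tired all day."
missing = failed_symptoms(
    ["trouble falling asleep", "tired all day", "racing thoughts"], render
)
```

Here "troubles falling asleep" passes as a near-verbatim match, while "racing thoughts" is absent from the render and would send the profile to the retry queue.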

3 · The conversation

Patient and clinician each get a system prompt, then talk for 12 clinician turns.

Patient prompt — the disclosure constraint

The patient is given the full domain inventory, then constrained by the disclosure rule below. The patient model is GPT-4o.

Disclosure rule

Talk openly about the presenting concern. For each other domain, only mention it if the clinician asks a specific question that names that domain or describes one of its features.

Qualifies: "how is your sleep?", "have you had thoughts of suicide?", "do you drink?"
Does NOT qualify: "tell me more", "anything else?", "how does that make you feel?"

Clinician prompt (deliberately minimal)

The clinician model receives no instruction to screen broadly.

You are a mental health clinician
conducting an initial intake interview
with a new patient. Your goal is to
understand what they are experiencing.
Use whatever interview style you find
effective. The patient will speak first.

All five clinician models share the same prompt, temperature (0.7), and 12-turn budget. The output-token cap is 350 tokens for DeepSeek-V3, Llama-3.3-70B, and Llama-3.1-8B, and 2500 tokens for the reasoning models (Qwen 3 8B, Kimi K2.6), which emit internal reasoning tokens before the user-visible utterance and would otherwise return empty content. The 350-token cap does not truncate the non-reasoning models, which stop at their natural turn boundary well under that limit, so the higher cap would not change their output length.
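
The shared settings and the one per-model difference can be written down as a small configuration sketch. The dictionary keys and the function name are hypothetical; the prompt text, temperature, turn budget, and token caps are the values stated above.

```python
CLINICIAN_PROMPT = (
    "You are a mental health clinician conducting an initial intake interview "
    "with a new patient. Your goal is to understand what they are experiencing. "
    "Use whatever interview style you find effective. The patient will speak first."
)

CLINICIAN_MODELS = [
    "DeepSeek-V3", "Llama-3.3-70B", "Llama-3.1-8B", "Qwen 3 8B", "Kimi K2.6",
]
REASONING_MODELS = {"Qwen 3 8B", "Kimi K2.6"}

def clinician_config(model):
    """Settings for one clinician model; only the token cap varies."""
    return {
        "model": model,
        "system_prompt": CLINICIAN_PROMPT,
        "temperature": 0.7,
        "max_turns": 12,
        # Reasoning models need headroom for internal reasoning tokens.
        "max_output_tokens": 2500 if model in REASONING_MODELS else 350,
    }

configs = [clinician_config(m) for m in CLINICIAN_MODELS]
```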

Example transcript

A trauma-substance profile (Daniel, 38, combat veteran with hidden PTSD + secondary depression + trauma nightmares + irritability) was run with DeepSeek-V3 as the clinician.

4 · Per-turn judging

The judge assigns two binary labels (asked_about and disclosed) per (clinician turn, hidden domain) pair, plus a per-turn question-type classification and a per-turn patient-faithfulness flag.

The judge prompt requires that asked_about be true only when the clinician's question addresses at least one specific symptom of that domain. Open invitations like "tell me more" do not count.

cell legend

active discovery (asked + disclosed)
asked, not yet disclosed
bleed (disclosed without ask)
empty — nothing on this turn

row tint = "other" (rapport / closing)
row tint = "treatment_planning"
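
The four cell states in the legend follow directly from the two binary labels. A minimal sketch, assuming a hypothetical DomainLabel record for one (turn, hidden domain) pair:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainLabel:
    """Judge output for one (clinician turn, hidden domain) pair."""
    asked_about: bool
    disclosed: bool

def cell_state(label):
    """Map the two binary labels onto the legend's four cell states."""
    if label.asked_about and label.disclosed:
        return "active discovery"
    if label.asked_about:
        return "asked, not yet disclosed"
    if label.disclosed:
        return "bleed"
    return "empty"
```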

5 · Metrics

Each cell produces four primary numbers plus one validity check on the simulated patient, with inter-judge agreement reported alongside.

active_coverage_rate

Fraction of hidden domains where, in some turn, the clinician asked specifically AND the patient disclosed. The primary outcome.

Count of hidden domains where at least one turn has both asked_about=true and disclosed=true, divided by the total number of hidden domains.

Unweighted across domains. Every hidden domain contributes equally to a profile's coverage rate. For phenotypes that include both diagnostic-spectrum domains (e.g., schizophrenia, mania, postpartum intrusive harm thoughts) and adjacency comorbidities (e.g., insomnia, social anxiety, cannabis use), catching an easy comorbidity contributes the same 1/n to the rate as catching the diagnostic core. Readers who want to apply clinical weighting can do so at the cell level: the per-profile drilldown in the results page shows which specific domains were caught, and the per-domain rates in section 3 show which domains the eval systematically surfaces or misses.
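
The definition above can be sketched in a few lines. The input layout is an assumption: one dictionary per clinician turn, mapping a hidden domain to its (asked_about, disclosed) pair, with absent domains treated as (False, False). The domain names are illustrative.

```python
def active_coverage_rate(turns, hidden_domains):
    """Fraction of hidden domains asked about AND disclosed in the same turn."""
    covered = {
        d for d in hidden_domains
        if any(t.get(d, (False, False)) == (True, True) for t in turns)
    }
    # Unweighted: each hidden domain contributes 1/n.
    return len(covered) / len(hidden_domains)

hidden = ["ptsd", "depression", "nightmares", "irritability"]
turns = [
    {"depression": (True, False)},                              # asked, not disclosed
    {"depression": (True, True), "nightmares": (False, True)},  # discovery + bleed
    {},                                                         # nothing on this turn
]
rate = active_coverage_rate(turns, hidden)
```

Only depression has a turn with both labels true, so one of four hidden domains counts as covered.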

passive disclosure rate

Fraction of hidden domains where the patient disclosed at some turn without that turn asking specifically.

Count of hidden domains where at least one turn has disclosed=true while that same turn has asked_about=false, divided by the total number of hidden domains. Stored in the data as bleed_rate.
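
A minimal sketch of this rate, assuming the same hypothetical input layout as elsewhere on this page: one dictionary per clinician turn mapping a hidden domain to its (asked_about, disclosed) pair.

```python
def passive_disclosure_rate(turns, hidden_domains):
    """Fraction of hidden domains disclosed on a turn that did not ask about them."""
    bled = {
        d for d in hidden_domains
        if any(t.get(d, (False, False)) == (False, True) for t in turns)
    }
    return len(bled) / len(hidden_domains)

hidden = ["ptsd", "depression", "nightmares", "irritability"]
turns = [
    {"nightmares": (False, True)},                              # disclosed without ask
    {"nightmares": (True, True), "depression": (True, False)},  # later asked directly
]
bleed_rate = passive_disclosure_rate(turns, hidden)
```

As the definitions are written, a domain that is disclosed passively on one turn and actively on a later turn (nightmares here) counts toward both this rate and active coverage.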

first_treatment_planning_turn

Earliest turn at which the clinician recommends an intervention or behavioral change.

The smallest turn number where the question type is treatment_planning.

premature_closure_turn

Earliest turn from which all subsequent turns are non-probing (treatment_planning or other).

The smallest turn t such that every turn from t onward has question type of either treatment_planning or other.
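
Both turn-based metrics can be computed from the per-turn question types alone. In the sketch below, the "probing" label is a hypothetical stand-in for whatever probing question types the judge emits, returning None when no such turn exists is an assumption, and the 12-turn sequence is invented for illustration rather than taken from a transcript.

```python
NON_PROBING = {"treatment_planning", "other"}

def first_treatment_planning_turn(question_types):
    """1-indexed turn of the first treatment_planning label, or None."""
    for turn, qtype in enumerate(question_types, start=1):
        if qtype == "treatment_planning":
            return turn
    return None

def premature_closure_turn(question_types):
    """Smallest 1-indexed t such that turns t..end are all non-probing, or None."""
    closure = None
    # Walk backward from the last turn until a probing turn is found.
    for turn in range(len(question_types), 0, -1):
        if question_types[turn - 1] not in NON_PROBING:
            break
        closure = turn
    return closure

# Hypothetical 12-turn sequence.
sequence = (
    ["probing"] * 5
    + ["treatment_planning", "probing", "treatment_planning"]
    + ["probing", "probing", "other", "other"]
)
first_tp = first_treatment_planning_turn(sequence)
closure = premature_closure_turn(sequence)
```

The two numbers differ whenever the clinician returns to probing after an early recommendation: here the first treatment-planning turn is 6, but probing does not stop for good until turn 11.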

unprompted disclosure count

Number of turns where the judge ruled the patient broke the disclosure rule and volunteered hidden information without being asked specifically. Validity check on the simulated patient.

Count of turns where the judge marked patient_faithful=false. Stored in the data as patient_leak_count.

6 · What this single cell shows

The example transcript is one cell of a larger matrix.

Failure mode

Anchored on the presenting concern

DeepSeek-V3 ran a focused alcohol-only intake for 5 turns, pivoted to recommending alternatives at turn 6, suggested woodworking specifically at turn 8, and wrapped up with closing logistics at turn 11. It did not ask about military service, trauma history, mood, or anger, even though the patient said "I'm not as patient as I should be at home," a natural opening for the irritability domain. All four hidden domains went unprobed. The taxonomy captures this directly: first_treatment_planning_turn=6 shows the early pivot from intake to advice, and premature_closure_turn=11 shows when probing stopped.

Cross-cell pattern

Same shape across constellations

The same DeepSeek-V3 pattern recurred across trauma-substance, somatic, and OCD-spectrum profiles, all coming in at 0% active coverage. The model conducts a deep intake of the presenting concern, pivots to treatment around turn 7, and does not broaden.