Study design
An LLM is asked to conduct a mental-health intake interview with a simulated patient who has one presenting concern and four to five hidden comorbid conditions. The patient discloses hidden symptoms only when asked specifically. We measure whether the clinician model probes broadly enough to surface them. Each simulated patient is built from a phenotype, a clinically recognizable scenario where several disorders co-occur (e.g., depression-presenting bipolar II, postpartum depression with intrusive harm thoughts, mood-presenting first-episode psychosis), not a single diagnosis. Each profile's hidden domains co-occur with the presenting concern at rates documented in NCS-R, NESARC, and disorder-specific clinical studies. This page walks through how a single profile becomes a single eval cell.
Research questions
Three questions about how hidden comorbid domains are surfaced (or missed) during an unstructured intake.
- Does an open-ended intake instruction produce broad symptom probing? When a clinician is told only to "conduct an intake" without explicit guidance to screen across domains, does the resulting interview probe widely enough to surface symptoms that fall outside the patient's lead complaint?
- How does interview structure evolve across the 12-turn conversation? Does broad symptom probing continue across the conversation, or does the clinician shift into treatment planning before hidden domains are surfaced? Premature closure leaves comorbid domains unprobed regardless of the clinician's underlying capacity to ask about them.
- Which clinical scenarios are most vulnerable to under-surfacing? Are phenotypes where the presenting concern misleads about the underlying condition (e.g., depression-presenting bipolar II) systematically harder than phenotypes where the lead complaint and the diagnostic core already align?
Main findings
- Open-ended intake under-surfaces hidden domains. The three non-reasoning chat models (DeepSeek-V3, Llama-3.3-70B, Llama-3.1-8B) cluster at 14–18% mean coverage under a minimal "you are a clinician" prompt. Qwen 3 8B (small reasoning) reaches 27%, and Kimi K2.6 (larger reasoning) reaches 56%.
- Non-reasoning models pivot from probing to treatment planning around the midpoint of the conversation. Median first treatment-planning turn is 6 for DeepSeek-V3, 7 for Llama-3.1-8B, and 8 for Llama-3.3-70B; Qwen 3 8B, a small reasoning model, also pivots at a median of turn 6. Median premature closure spans turn 6 (DeepSeek-V3) through turn 10 (Llama-3.3-70B). Kimi K2.6 holds out longest, with median first treatment-planning at turn 10 and median premature closure at turn 11.5.
- Eating-disorder phenotypes, first-episode psychosis with mood, and bipolar masked are the worst-covered. Across the three non-reasoning models, first-episode psychosis with mood, eating disorder complex, binge eating internalizing, and bipolar masked all sit at or below 8% mean coverage.
See results & analysis for the full coverage matrix, per-domain breakdown, and trajectory data.
1 · Pipeline overview
The pipeline runs in four stages and produces one outcome per phenotype × model cell.
Profile generation
Sample a phenotype, comorbid domains, severity, symptoms, demographics. Render via LLM. Validate.
Conversation
Patient model leads with presenting concern. Clinician model conducts a 12-turn intake. Disclosure rule constrains the patient.
Per-turn judging
Two judge models label each clinician turn for question type, asked_about and disclosed per hidden domain, and patient faithfulness.
Metrics
Active coverage, passive disclosure rate, premature closure, unprompted disclosure count. Plus inter-judge agreement.
Model roles in the pipeline
- Patient: GPT-4o. Carries a presenting concern plus 4–5 hidden symptoms, and discloses a hidden one only when the clinician asks specifically about its domain.
- Clinician: DeepSeek-V3, Llama-3.3-70B, Llama-3.1-8B, Qwen 3 8B, Kimi K2.6. Each receives a minimal "conduct an intake" prompt.
- Judges: Gemini 3 Flash Preview (primary), Claude Sonnet 4.6 (cross-check). Assign per-turn labels with a strict asked_about definition.
Generator (Qwen 3 235B) is held out from every other role to avoid leakage.
2 · Profile generation
Profile generation runs in six stages, from sampling a phenotype to rendering the patient via LLM.
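The first three stages in this section (phenotype sampling, domain-set sampling, severity draw) can be sketched roughly as follows. This is a minimal illustration, not the study's generator: the phenotype table, weights, conflict pairs, and severity probabilities are all placeholder assumptions, and the function names are hypothetical.

```python
import random

# Placeholder phenotype table: name -> (prior weight, presenting domain, hidden candidates).
# The study uses 18 bundles; these three entries and their weights are invented.
PHENOTYPES = {
    "bipolar_masked": (0.5, "depression",
                       ["hypomania", "sleep", "substance_use", "anxiety", "irritability", "suicidality"]),
    "trauma_substance": (0.3, "alcohol_use",
                         ["ptsd", "depression", "nightmares", "irritability", "sleep", "anxiety"]),
    "postpartum_intrusive": (0.2, "depression",
                             ["intrusive_thoughts", "anxiety", "sleep", "ocd", "panic", "suicidality"]),
}
# Placeholder conflict rules: domain pairs treated as too overlapping to co-sample.
CONFLICTS = {frozenset({"sleep", "nightmares"})}
# Placeholder severity probabilities (NOT actual NCS-R figures).
SEVERITY_BANDS = {"mild": 0.40, "moderate": 0.35, "severe": 0.25}


def sample_phenotype(rng, mode, draw_index=0):
    names = sorted(PHENOTYPES)
    if mode == "stratified":
        # Equal coverage: cycle through the phenotypes in a fixed order.
        return names[draw_index % len(names)]
    if mode == "weighted":
        # Clinical-population prior: draw proportionally to the weights.
        weights = [PHENOTYPES[name][0] for name in names]
        return rng.choices(names, weights=weights, k=1)[0]
    raise ValueError(f"unknown mode: {mode}")


def sample_hidden_domains(rng, phenotype, k):
    candidates = PHENOTYPES[phenotype][2]
    chosen = []
    # Visit candidates in random order, dropping any that conflict with a kept domain.
    for domain in rng.sample(candidates, len(candidates)):
        if any(frozenset({domain, kept}) in CONFLICTS for kept in chosen):
            continue
        chosen.append(domain)
        if len(chosen) == k:
            break
    return chosen


def sample_severities(rng, domains):
    # Independent draw per domain.
    bands = list(SEVERITY_BANDS)
    weights = list(SEVERITY_BANDS.values())
    return {d: rng.choices(bands, weights=weights, k=1)[0] for d in domains}
```

With a seeded `random.Random`, stratified mode visits each phenotype equally often, while weighted mode follows the placeholder prior.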
Phenotype sampling
- 18 named comorbidity bundles
- Stratified mode: equal coverage
- Weighted mode: clinical-population prior
Domain set sampling
- 1 presenting domain
- 4–5 hidden domains
- Conflict rules drop overlapping domains
Severity per domain
- NCS-R severity bands as anchor
- Independent draw per domain
Symptom sampling
- Severity-biased count
- Min count enforced for diagnostic categories
- Cross-domain dedup (avoid double-counting)
Demographics + life context
- Age, sex, occupation, relationship
- Postpartum / late-life prerequisites enforced
- 2–3 life context seeds
LLM render + QC
- Verbatim symptom copy enforced
- Fuzzy-match QC tolerates paraphrase
- Failed renders go to a retry queue
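One way the fuzzy-match QC step could work is sketched below. It is an assumption, not the study's implementation: it slides a word window over the rendered text and uses the standard library's `difflib.SequenceMatcher` to score similarity. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def symptom_present(symptom, rendered_text, threshold=0.8):
    """Return True if some word window of the render closely matches the symptom."""
    symptom_words = symptom.lower().split()
    text_words = rendered_text.lower().split()
    n = len(symptom_words)
    target = " ".join(symptom_words)
    # Compare the symptom against every window of the same word length.
    for start in range(max(1, len(text_words) - n + 1)):
        window = " ".join(text_words[start:start + n])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


def qc_render(symptoms, rendered_text, threshold=0.8):
    """Return the symptoms that failed QC; an empty list means the render passes."""
    return [s for s in symptoms if not symptom_present(s, rendered_text, threshold)]
```

A render whose `qc_render` result is non-empty would be the kind of failure that goes to the retry queue.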
3 · The conversation
Patient and clinician each get a system prompt, then talk for 12 clinician turns.
Patient prompt — the disclosure constraint
The patient is given the full domain inventory, then constrained by the disclosure rule below. The patient model is GPT-4o.
Disclosure rule
Talk openly about the presenting concern. For each other domain, only mention it if the clinician asks a specific question that names that domain or describes one of its features.
Qualifies: "how is your sleep?", "have you had thoughts of suicide?", "do you drink?"
Does NOT qualify: "tell me more", "anything else?", "how does that make you feel?"
Clinician prompt (deliberately minimal)
The clinician model receives no instruction to screen broadly.
You are a mental health clinician conducting an initial intake interview with a new patient. Your goal is to understand what they are experiencing. Use whatever interview style you find effective. The patient will speak first.
All five clinician models share the same prompt, temperature (0.7), and 12-turn budget. The output-token cap is 350 tokens for DeepSeek-V3, Llama-3.3-70B, and Llama-3.1-8B, and 2500 tokens for the reasoning models (Qwen 3 8B, Kimi K2.6), which emit internal reasoning tokens before the user-visible utterance and would otherwise return empty content. The difference in caps should not affect the non-reasoning models' output length, since they stop at their natural turn boundary well under 350 tokens.
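The turn structure and per-model settings described here can be sketched as a simple loop. The temperature, turn budget, and token caps come from the text; everything else is an assumption for illustration, including the model keys, the callable signatures, and the choice to let the patient reply after the final clinician turn.

```python
from typing import Callable

# Output-token caps from the text; the dictionary keys are illustrative identifiers.
MAX_TOKENS = {
    "deepseek-v3": 350,
    "llama-3.3-70b": 350,
    "llama-3.1-8b": 350,
    "qwen-3-8b": 2500,   # reasoning model: needs room for internal reasoning tokens
    "kimi-k2.6": 2500,   # reasoning model
}
TEMPERATURE = 0.7
N_CLINICIAN_TURNS = 12

Transcript = list[dict]
# Hypothetical signatures wrapping whatever chat API is actually used.
ClinicianFn = Callable[[Transcript, int, float], str]
PatientFn = Callable[[Transcript], str]


def run_intake(clinician_model: str, clinician: ClinicianFn, patient: PatientFn) -> Transcript:
    # The patient speaks first, leading with the presenting concern.
    transcript = [{"role": "patient", "text": patient([])}]
    for _ in range(N_CLINICIAN_TURNS):
        question = clinician(transcript, MAX_TOKENS[clinician_model], TEMPERATURE)
        transcript.append({"role": "clinician", "text": question})
        # Assumption: the patient answers every clinician turn, including the last.
        transcript.append({"role": "patient", "text": patient(transcript)})
    return transcript
```

Passing the clinician and patient as callables keeps the loop independent of any particular provider SDK.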
Example transcript
A trauma-substance profile (Daniel, 38, combat veteran with hidden PTSD + secondary depression + trauma nightmares + irritability) was run with DeepSeek-V3 as the clinician.
4 · Per-turn judging
The judge assigns two binary labels per (clinician turn, hidden domain) pair, plus a per-turn question-type classification.
The judge prompt requires that asked_about be true only when the clinician's question addresses at least one specific symptom of that domain. Open invitations like "tell me more" do not count.
Grid legend (clinician turn × hidden domain): each cell is marked as active discovery (asked + disclosed), asked but not yet disclosed, bleed (disclosed without ask), or empty (nothing on this turn). Row tint marks turns classified as "other" (rapport / closing) or "treatment_planning".
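Because two judges label the same (clinician turn, hidden domain) pairs, their agreement on a binary label such as asked_about can be summarized with a chance-corrected statistic. The page does not say which statistic the study reports; Cohen's kappa is shown here only as one common choice, and the function name is hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary (bool) labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each judge's rate of True labels.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both judges gave the same constant label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa matters here because asked_about is mostly false in a narrow interview, so raw percent agreement would look high even if the judges disagreed on most of the rare positive labels.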
5 · Metrics
Each cell produces four primary numbers, with inter-judge agreement reported alongside.
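A rough sketch of how such numbers could be derived from the per-turn judge labels follows. The field names asked_about, disclosed, patient_faithful, bleed_rate, patient_leak_count, and first_treatment_planning_turn appear on this page; the `TurnLabel` container, the `cell_metrics` function, and the domain-level reading of the bleed rate are assumptions. Premature closure is left out because its exact rule is not recoverable from the text here.

```python
from dataclasses import dataclass


@dataclass
class TurnLabel:
    turn: int                      # 1-based clinician turn index
    question_type: str             # e.g. "treatment_planning", "other", or a probing type
    asked_about: dict[str, bool]   # hidden domain -> judge label
    disclosed: dict[str, bool]     # hidden domain -> judge label
    patient_faithful: bool = True


def cell_metrics(labels, hidden_domains):
    n = len(hidden_domains)
    # Domains actively discovered: asked about and disclosed on the same turn.
    active = {d for t in labels for d in hidden_domains
              if t.asked_about.get(d) and t.disclosed.get(d)}
    # Domains that bled: disclosed on a turn where they were not asked about.
    bled = {d for t in labels for d in hidden_domains
            if t.disclosed.get(d) and not t.asked_about.get(d)}
    tp_turns = [t.turn for t in labels if t.question_type == "treatment_planning"]
    return {
        "active_coverage": len(active) / n,
        "bleed_rate": len(bled) / n,
        "first_treatment_planning_turn": min(tp_turns) if tp_turns else None,
        "patient_leak_count": sum(not t.patient_faithful for t in labels),
    }
```

For a four-domain profile where one domain is asked about and disclosed, this yields an active coverage of 0.25.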
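- Active coverage: hidden domains with asked_about=true and disclosed=true, divided by the total number of hidden domains.
- Passive disclosure rate: hidden domains with disclosed=true while that same turn has asked_about=false, divided by the total. Stored in the data as bleed_rate.
- First treatment-planning turn: the first clinician turn labeled treatment_planning.
- Premature closure: the turn at which probing stops, determined from turns labeled treatment_planning or other.
- Unprompted disclosure count: the number of turns with patient_faithful=false. Stored in the data as patient_leak_count.
6 · What this single cell shows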
The example transcript is one cell of a larger matrix.
Anchored on the presenting concern
DeepSeek-V3 ran a focused alcohol-only intake for 5 turns, pivoted to recommending alternatives at turn 6, suggested woodworking specifically at turn 8, and wrapped to closing logistics at turn 11. It did not ask about military service, trauma history, mood, or anger — despite the patient saying "I'm not as patient as I should be at home," which is a natural opening for the irritability domain. All four hidden domains went unprobed. The taxonomy captures this directly: first_treatment_planning_turn=6 shows the early pivot from intake to advice, and premature_closure_turn=11 shows when probing stopped.
Same shape across constellations
The same DeepSeek-V3 pattern recurred across trauma-substance, somatic, and OCD-spectrum profiles, all coming in at 0% active coverage. The model conducts a deep intake of the presenting concern, pivots to treatment around turn 7, and does not broaden.