
Side projects.

Side projects I build to answer questions I'm curious about. Usually quick: make something, run it, see if the idea holds up. Most go nowhere. The ones that found something interesting are below.

LLM evaluation · clinical intake · synthetic patients · Write-up live

What can an LLM intake agent surface that the patient never volunteers?

What I built

I set up a fake patient: a GPT-4o model that role-plays someone with a main complaint plus 4–5 hidden symptoms it will only mention if asked about them directly. Five different models then each play the clinician, running a 12-turn intake from the same plain “you are a clinician” prompt. That's 540 interviews in all (18 patient types × 5 clinician models × 6 patients each). Two other models (Gemini 3 Flash and Claude Sonnet 4.6) act as judges and score how many of the hidden symptoms each interview got the patient to reveal.
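A minimal sketch of how the grid and the coverage score fit together, written in Python. Only the counts (18, 5, 6) come from the description above. The function names, the symptom labels, and the choice to average the two judges are assumptions for illustration, not the code or scoring rule used in the write-up.

```python
from itertools import product

# Counts taken from the description above; everything else is a placeholder.
N_PATIENT_TYPES = 18
N_CLINICIAN_MODELS = 5
N_PATIENTS_PER_TYPE = 6


def build_grid(n_types=N_PATIENT_TYPES,
               n_models=N_CLINICIAN_MODELS,
               n_patients=N_PATIENTS_PER_TYPE):
    """Enumerate every (patient type, clinician model, patient) cell."""
    return list(product(range(n_types), range(n_models), range(n_patients)))


def coverage(hidden, revealed):
    """Fraction of the hidden symptoms that one judge marked as revealed."""
    hidden = set(hidden)
    if not hidden:
        raise ValueError("a patient needs at least one hidden symptom")
    return len(hidden & set(revealed)) / len(hidden)


def mean_coverage(hidden, judge_verdicts):
    """Average coverage across judges (averaging is an assumption)."""
    scores = [coverage(hidden, verdict) for verdict in judge_verdicts]
    return sum(scores) / len(scores)


grid = build_grid()
print(len(grid))  # 540

# Hypothetical patient with four hidden symptoms and two judge verdicts.
hidden = ["insomnia", "weight loss", "low mood", "palpitations"]
judge_a = ["insomnia", "low mood"]   # judge A saw 2 of 4 revealed
judge_b = ["insomnia"]               # judge B saw 1 of 4 revealed
print(mean_coverage(hidden, [judge_a, judge_b]))  # 0.375
```

Scoring against the patient's fixed list of hidden symptoms keeps the number comparable across clinician models, since every model is measured against the same denominator for a given patient.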

What I found

What mattered was whether the model could reason, not its size and not the prompt. Regular chat models found only 14–18% of the hidden symptoms. A small reasoning model (Qwen 3 8B) found 27%, and a large one (Kimi K2.6) found 56%: about twice the next best and roughly three times the best chat model. It asked specific follow-up questions on its own, without being told to. It did best on exactly the cases that are hardest to catch in real life: masked presentations like bipolar, postpartum, and somatic ones, where the chat models found almost nothing.

Limitations

The patients are simulated, so don't read too much into the exact percentages. The gap between the models is the real result: the reasoning ones ask questions the chat models don't.

Read the full analysis · Study design · Prompts

14–18% standard chat models

27% small reasoning (Qwen 3 8B)

56% large reasoning (Kimi K2.6)

hidden-symptom coverage, same prompt

540-cell eval · synthetic patients · LLM-as-judge · reasoning vs. chat