A new independent evaluation raises fresh questions about ChatGPT Health’s emergency triage, after researchers tested whether the tool tells users to seek emergency care when clinicians say they should.
The study used clinician-written medical scenarios and measured how often the system recommended the right level of urgency, a basic safety check for any consumer-facing health guidance product.
OpenAI launched ChatGPT Health, a health-focused experience, in early 2026, and outside researchers have called for routine, independent audits as more people turn to large language models for medical guidance.
Results Show Under-Triage for Emergencies and Over-Triage for Healthy Cases
The paper, published in Nature Medicine, reports that among “gold-standard” emergency cases, the system under-triaged about 52% of scenarios. In practice, it often steered users toward non-emergency options even when physicians judged that the situation required emergency evaluation.
The same evaluation also found the opposite pattern in lower-risk scenarios. The tool frequently recommended high-acuity care for people described as healthy in the vignettes, which could create unnecessary pressure on emergency departments if users followed that advice.
Researchers framed the pattern as a safety issue at both extremes: missed emergencies on one end, and needless alarms on the other.
Context and “Anchoring” Shifted Recommendations
The evaluation also tested how contextual framing changes outputs. When a friend or family member minimized symptoms, a setup that can mirror real conversations at home, the model’s recommendations shifted in some borderline cases. The paper reports a strong association between this kind of minimizing context and a move toward less urgent guidance in edge situations.
That finding matters for real-world use because many people ask for advice while also adding reassurance from someone nearby, or they downplay symptoms to reduce worry. A triage tool has to resist that pull and stick to clinical warning signs, especially when users describe symptoms that clinicians treat as time-sensitive.
What the Findings Mean for Consumer Health AI
The researchers and outside experts cited in coverage argue that consumer health chatbots need clearer safety standards, more transparent testing, and stronger oversight before people rely on them for urgent decisions.
The study also points to a communication challenge. A chatbot can sound confident even when it misses nuance, and that tone can shape user behavior. For that reason, clinical groups often stress that AI tools should not replace professional evaluation, particularly for urgent symptoms.
OpenAI has said, in response to independent evaluations, that models change frequently and that simulated tests may not reflect typical real-world use. Even so, the paper’s authors and several commentators argue that ChatGPT Health’s emergency triage should meet a high bar because users may treat it as a first stop when they feel unsure or scared.

