The model is already in the workflow
A quiet fact of 2026 is that many clinicians are using large language models even when their institutions have not formally approved them. The reason is mundane. Clinicians face a double workload: clinical reasoning plus clerical production. Any tool that reduces the clerical share feels like relief.
LLMs fit this pressure because they handle language at scale. They can draft histories, rewrite assessment and plan sections in coherent prose, and produce patient-facing explanations that sound humane. This utility is independent of whether the model is good at diagnosis. A model can be mediocre at medicine and still be valuable in the language work that medicine requires.
Yet this is precisely where risk enters. Once a tool is embedded, its outputs become part of the clinician’s mental environment. The question becomes less about whether the model can diagnose and more about how it changes the clinician’s reasoning.
What the early performance literature actually says
Benchmarks give comfort. Real clinical tasks give discomfort.
Consider the frequently cited study Use of GPT-4 to Diagnose Complex Clinical Cases in NEJM AI, which reported that GPT-4 correctly diagnosed a substantial share of published clinicopathological conference cases. The paper is noteworthy because it deals with complex cases rather than multiple-choice questions, and because it demonstrates that a general-purpose model can perform surprisingly well when the input is structured.
Now place it beside evidence pointing in the other direction. A 2024 paper in Nature Communications evaluated LLM-generated clinical recommendations from emergency department visits and found poor performance relative to a resident physician, as described in Evaluating the use of large language models to provide clinical recommendations. The results were sobering: prompt engineering improved some elements, yet errors remained substantial.
These studies can coexist because clinical performance is conditional. It depends on task definition, input quality, and the implicit expectations of the evaluation rubric. A model may perform well on rare, narrative-rich puzzles, and poorly on routine decision tasks that depend on subtle context, local practice patterns, and non-textual signals.
A 2025 diagnostic study in JAMA Network Open compared LLMs with a dedicated expert diagnostic system. The specialized system outperformed the LLMs on ranking the correct diagnosis, while the LLMs still performed credibly. The finding supports a pragmatic view: LLMs are not the only AI approach, and they may be best used as flexible interfaces rather than as primary engines.
The most honest reading is that LLMs will be intermittently useful in clinical reasoning and reliably useful in clinical language work. Hospitals and clinicians need to align deployment with that reality.
Clinical decision-making includes documentation, and documentation shapes decisions
Many clinicians treat documentation as an afterthought, an administrative residue of care. That view is dated. Documentation affects billing, continuity, patient comprehension, medicolegal risk, and inter-team coordination. It also affects the clinician’s own memory of the case.
When an LLM drafts a note, the clinician is invited into an editorial stance rather than an authorial stance. Editing is faster, which is the attraction. Editing also encourages a different cognitive posture: a clinician may accept plausible statements, miss subtle inaccuracies, or begin to reason inside the model’s framing.
This is the familiar risk of automation bias, which has been studied for decades in decision support systems. The twist is linguistic. LLM outputs are persuasive because they read like professional prose. They can smuggle assumptions into a note, and those assumptions can become a future clinician’s starting point.
Research on patient portal messaging provides a microcosm of this dynamic. A quality improvement study in JAMA Network Open examined AI-drafted replies for patient messages and highlighted how iterative prompt changes shifted perceived usefulness. The study foregrounds a critical fact: prompts are policy. A health system that uses LLMs without controlling prompts is outsourcing part of its clinical voice.
Governance is moving from optional to required
The U.S. regulatory landscape is adjusting, unevenly.
The FDA has developed a growing public body of guidance and framing for AI-enabled medical software. Its page on Artificial Intelligence in Software as a Medical Device anchors the discussion in a lifecycle view, and the agency’s AI/ML-based SaMD Action Plan signaled an intention to refine oversight mechanisms rather than treat AI as an exception.
For clinical decision support specifically, FDA interpretation has been evolving. The agency published updated guidance on January 6, 2026, titled Clinical Decision Support Software, clarifying the scope of oversight for CDS functions intended for healthcare professionals and distinguishing non-device CDS from functions that remain device-regulated.
Meanwhile, the Office of the National Coordinator for Health Information Technology (ONC) has pushed transparency requirements through its HTI-1 rule. The HTI-1 Final Rule introduced decision support intervention transparency expectations for certified health IT. The practical implication is straightforward: the health IT market is being nudged toward revealing training data, intended uses, and performance constraints.
Outside the U.S., the European Union has enacted a comprehensive legal structure in the EU Artificial Intelligence Act, with special obligations for high-risk systems and explicit connections to medical products. Even U.S. health systems that do not operate in Europe will feel downstream effects, since vendors increasingly standardize compliance practices across markets.
Finally, cross-sector risk frameworks are becoming the lingua franca of institutional governance. NIST’s AI Risk Management Framework (AI RMF 1.0) provides a vocabulary for mapping, measuring, managing, and governing AI risk, and NIST’s generative AI profile, NIST.AI.600-1, extends this into model-specific considerations.
A deployment philosophy that respects clinical cognition
Hospitals do not merely deploy tools. They shape cognition.
A sensible LLM deployment philosophy treats the model as a second reader, a summarizer, and a drafting assistant, while constraining its role as a recommender. In other words, it handles representation before it handles judgment.
Several design principles follow.
First, preserve friction where it matters. Decision points that carry high morbidity should require explicit clinician confirmation and a documentation trace of why a recommendation was followed or rejected.
Second, separate summarization from recommendation. Summarization can be measured against the chart. Recommendation requires alignment with guidelines and local practice. These are different evaluation problems.
Third, treat prompt design as clinical governance. Prompts should embed institutional policies, preferred guidelines, and safety checks.
Fourth, monitor drift in real use. Model performance is not static, and data distributions shift with seasons, coding patterns, and local outbreaks. A hospital that treats model outputs as immutable will eventually be surprised.
Fifth, maintain auditability. If a model contributes to a note or recommendation, the system should retain the prompt, the output, and the clinician edits.
These principles align with the intent of NIST frameworks and with the emerging transparency language in health IT regulation.
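To make the fifth principle concrete, the sketch below shows one shape an audit record could take. The class and function names are invented for illustration; a real deployment would write into the EHR vendor's logging layer with its own identifiers, not a flat file.

```python
# Minimal sketch of an audit record for LLM-assisted documentation.
# All names here (AuditRecord, log_interaction) are illustrative, not a
# standard from any regulation or vendor API.
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json


@dataclass
class AuditRecord:
    """One LLM interaction: what was asked, what came back, what the clinician changed."""
    encounter_id: str
    prompt_template_version: str   # versioned institutional prompt, not ad hoc text
    model_id: str                  # e.g. vendor model name plus version string
    prompt_text: str
    model_output: str
    clinician_final_text: str
    clinician_id: str
    accepted_without_edits: bool
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def fingerprint(self) -> str:
        """Hash of the full record so later tampering or truncation is detectable."""
        payload = json.dumps(self.__dict__, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def log_interaction(record: AuditRecord, sink) -> None:
    """Append the record and its fingerprint to an append-only sink (file, queue, or table)."""
    sink.write(json.dumps({"record": record.__dict__, "sha256": record.fingerprint()}) + "\n")
```

The point is not these specific fields but the discipline: every model contribution leaves a reconstructable trail.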
The likely endpoint: a layered model ecology
The future of clinical decision-making is unlikely to belong to one model. It will belong to a layered ecology.
Specialized diagnostic systems will remain valuable because they can be tested narrowly and audited. LLMs will remain valuable because they can translate across domains and serve as interfaces. Retrieval systems will matter because they tie outputs to source documents. Together, these components can form decision support that is both useful and accountable.
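A minimal sketch of that division of labor might look like the following, where the retrieval layer, the diagnostic engine, and the language model are placeholder interfaces rather than real products, and the LLM is confined to explaining and citing rather than diagnosing.

```python
# Sketch of a layered decision-support flow: retrieval grounds the context,
# a specialized diagnostic system produces the ranked differential, and the
# LLM only summarizes and explains. Every interface here is a placeholder,
# not a real product API.
from typing import Protocol


class Retriever(Protocol):
    def fetch(self, patient_id: str, query: str) -> list[str]: ...          # source documents


class DiagnosticEngine(Protocol):
    def rank(self, findings: list[str]) -> list[tuple[str, float]]: ...     # (diagnosis, score)


class LanguageModel(Protocol):
    def summarize(self, prompt: str) -> str: ...


def layered_support(patient_id: str, question: str,
                    retriever: Retriever, engine: DiagnosticEngine,
                    llm: LanguageModel) -> dict:
    sources = retriever.fetch(patient_id, question)
    differential = engine.rank(sources)
    # The LLM is confined to representation: it explains the engine's output
    # and cites the retrieved sources, but does not add diagnoses of its own.
    prompt = (
        "Summarize the ranked differential below for a clinician, citing the "
        "numbered source passages. Do not introduce diagnoses that are not listed.\n"
        f"Differential: {differential}\nSources: {list(enumerate(sources))}"
    )
    return {"differential": differential, "summary": llm.summarize(prompt), "sources": sources}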
Clinicians, for their part, will need training that treats AI as a cognitive factor. Medical education has long taught evidence appraisal and differential diagnosis. It will now need to teach critique of machine-generated narratives.
In a healthcare system where language has become a primary medium of work, LLMs will not remain optional accessories. They will become infrastructure. The only question is whether that infrastructure will be governed as patient safety technology or treated as office software. The evidence to date supports the former.
A decision-support culture that tolerates dissent
Most failures of clinical decision support come from social dynamics rather than from algorithms. A tool that produces a plausible differential can still be dangerous if its presence discourages the team from arguing. A functioning LLM deployment needs a culture where the trainee, the nurse, and the pharmacist can challenge the model’s suggestion and can challenge the attending who is tempted to accept it.
One operational approach is to formalize a short critique step in the workflow. When the model generates a recommendation, the clinician records a brief counterargument: What would persuade me that this recommendation is wrong? What would I do if the model were unavailable? The objective is cognitive friction. It is a small tax that can prevent automation bias from becoming an institutional habit.
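In software terms, the critique step can be enforced rather than merely encouraged. The sketch below, with invented names and fields, refuses to record a recommendation as accepted until both questions have an answer.

```python
# Sketch of a critique gate: the recommendation cannot be accepted until the
# clinician records a counterargument and a fallback plan. Field names are
# illustrative only.
from dataclasses import dataclass


@dataclass
class Critique:
    counterargument: str   # "What would persuade me this is wrong?"
    fallback_plan: str     # "What would I do if the model were unavailable?"


def accept_recommendation(recommendation: str, critique: Critique | None) -> dict:
    """Refuse to mark a model recommendation as accepted without a completed critique."""
    if critique is None or not critique.counterargument.strip() or not critique.fallback_plan.strip():
        raise ValueError("Critique step incomplete: record a counterargument and a fallback plan.")
    return {
        "recommendation": recommendation,
        "counterargument": critique.counterargument,
        "fallback_plan": critique.fallback_plan,
        "status": "accepted_with_critique",
    }
```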
The regulatory environment is already nudging organizations in this direction. The NIST AI Risk Management Framework and the companion Generative AI profile emphasize mapping context, measuring performance, and sustaining governance over time. In parallel, transparency requirements are becoming more concrete in health IT. Under the ONC HTI-1 final rule, decision support interventions are expected to carry disclosures about training data, limitations, and intended use.
What to measure, beyond accuracy
- Calibration: does the model express uncertainty in a way that matches reality?
- Stability: does output drift when small details change?
- Equity: do recommendations shift in patterned ways across demographic groups?
- Workflow burden: does the tool reduce clinician time, or does it add verification work that cancels the benefit?
- Harm signals: are there recurring failure modes, such as missed sepsis or over-triage of benign symptoms?
The last category matters because some evaluation work has shown a gap between headline diagnostic performance and real-world clinical recommendation quality. The Nature Communications evaluation of GPT models on emergency department recommendations illustrates how models can falter when the task is not merely naming a diagnosis, but translating data into specific actions.
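Several of these measurements are cheap to compute once the logging exists. As an illustrative sketch, the snippet below shows one way to estimate calibration (binned confidence versus observed accuracy) and stability (agreement between paired vignettes that differ only in an irrelevant detail); the data formats are assumptions, not a standard.

```python
# Sketch of two of the measurements above: calibration (do stated confidences
# match observed accuracy?) and stability (does the answer change when an
# irrelevant detail changes?).
from collections import defaultdict


def expected_calibration_error(confidences: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Bin predictions by stated confidence and compare mean confidence to accuracy in each bin."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for members in bins.values():
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def stability_rate(paired_answers: list[tuple[str, str]]) -> float:
    """Fraction of cases where the answer is unchanged after an irrelevant edit to the vignette."""
    if not paired_answers:
        return 1.0
    same = sum(1 for a, b in paired_answers if a.strip().lower() == b.strip().lower())
    return same / len(paired_answers)
```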
The human factor that tool evaluations miss
A clinical environment is an attention economy under pressure. Even when a model performs well in controlled tests, the day-to-day failure mode usually arrives through people and process rather than through a mathematical defect. The most common pattern is a quiet shift in responsibility. When a suggestion appears fluent, it acquires social authority; junior staff begin to treat it as a default; seniors stop interrogating it as closely; the organization gradually redefines diligence downward.
The antidote is deliberate friction. Many health systems already use checklists for anticoagulation, perioperative antibiotics, and infection control. AI deserves similar structure. The novelty is that the checklist needs to guard against automation bias. It should force the clinician to articulate a counterargument, even if only in a short phrase. In addition, the system should log the model output and the clinician’s changes, because drift in edits often signals drift in the underlying model.
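One low-cost way to watch for that drift, sketched below with placeholder thresholds, is to track how heavily clinicians rewrite the model's drafts from week to week and flag sustained increases.

```python
# Sketch of edit-drift monitoring: track how heavily clinicians rewrite model
# drafts, week over week. A rising edit ratio is a cheap early signal that the
# model, the prompts, or the case mix has shifted. The tolerance is a placeholder.
from difflib import SequenceMatcher
from statistics import mean


def edit_ratio(model_draft: str, final_note: str) -> float:
    """0.0 means the draft was kept verbatim; values near 1.0 mean it was largely rewritten."""
    return 1.0 - SequenceMatcher(None, model_draft, final_note).ratio()


def weekly_drift_alert(this_week: list[tuple[str, str]],
                       baseline_mean: float,
                       tolerance: float = 0.10) -> bool:
    """Flag the week if the mean edit ratio exceeds the historical baseline by more than the tolerance."""
    current = mean(edit_ratio(draft, final) for draft, final in this_week)
    return current > baseline_mean + tolerance
```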
This is one reason the transparency focus of the ONC HTI-1 Final Rule matters. Clinical decision support that enters the EHR workflow can change practice. The question becomes whether the clinician understands what the tool saw, what it ignored, and what it inferred.
A practical implementation template
- Keep a registry of approved use cases, by department, and link each to a risk rating aligned with the NIST AI Risk Management Framework (AI RMF 1.0).
- Require a one-page model card that includes training data boundaries, evaluation results, and known failure patterns.
- Treat prompt templates as clinical artifacts. Version them, review them, and retire them when workflows change; a minimal registry is sketched after this list.
- Build a feedback loop that captures clinician edits and patient safety events. A tool that cannot be corrected will eventually be bypassed.
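The prompt-versioning item can be as simple as a small registry, sketched below with illustrative field names, that records who owns each template, when it was last reviewed, and whether it has been retired.

```python
# Sketch of a versioned prompt-template registry, treating prompts as clinical
# artifacts with an owner, a review date, and a retirement flag. All names are
# illustrative.
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PromptTemplate:
    template_id: str        # e.g. "discharge-summary"
    version: str            # bump on any wording change, however small
    text: str
    owning_committee: str   # the clinical group accountable for the wording
    last_reviewed: date
    retired: bool = False


class PromptRegistry:
    def __init__(self) -> None:
        self._templates: dict[tuple[str, str], PromptTemplate] = {}

    def register(self, template: PromptTemplate) -> None:
        self._templates[(template.template_id, template.version)] = template

    def active(self, template_id: str) -> PromptTemplate:
        """Return the latest non-retired version; raise if the template has been retired."""
        candidates = [t for (tid, _), t in self._templates.items()
                      if tid == template_id and not t.retired]
        if not candidates:
            raise LookupError(f"No active version of {template_id}; the workflow may have changed.")
        # Lexicographic version comparison is a simplification for this sketch.
        return max(candidates, key=lambda t: t.version)
```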
When governance is explicit, clinical value follows. The most persuasive argument for LLM decision support will not be a benchmark score. It will be a boring chart that shows fewer missed follow-ups, fewer medication errors, and fewer after-hours clicks per patient panel.