The Human → AI Reasoning Shunt
Thoughts on AI systems that can reason alongside users, rather than taking over all of the reasoning
Note: This post is grounded in views and experiences I’ve developed working as a resident physician while interacting with & building clinical AI tools. I do think that a lot of these thoughts broadly apply across use-cases and domains.
The way I’ve thought about (and prioritized) recalling facts vs. reasoning about them has evolved over the years, as I’ve made the transition from pure computer science to medicine. As a resident physician, my daily decision-making is an extensive combination of recall + reasoning — drawing on clinical, physiological, and pharmacological knowledge to figure out what is going on with a patient.
Over the past couple of years, a few LLM-based clinical reference tools have gained adoption among healthcare providers. At a high level, they pull information from medical journals and pass it to an LLM to answer a user’s query (a typical RAG framework). The thought behind this is that grounding LLM output in legitimate clinical sources improves the knowledge and accuracy of the system, compared to querying a tool like ChatGPT directly.
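For readers less familiar with the pattern, here is a minimal sketch of what such a pipeline looks like. It assumes a hypothetical corpus of journal passages and a generic LLM callable; the names (`Passage`, `retrieve_passages`, `build_prompt`) are placeholders for illustration, not any particular product’s API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    journal: str
    citation: str

def retrieve_passages(query: str, index: list[Passage], k: int = 5) -> list[Passage]:
    """Hypothetical retriever. In a real product this would be a vector or
    keyword search over indexed journal articles; here it just returns the
    first k entries so the sketch stays self-contained."""
    return index[:k]

def build_prompt(query: str, passages: list[Passage]) -> str:
    """Ground the model by prepending the retrieved excerpts to the query."""
    context = "\n\n".join(
        f"[{i + 1}] ({p.journal}) {p.text}" for i, p in enumerate(passages)
    )
    return (
        "Answer the clinical question using only the sources below, citing them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

def answer(query: str, index: list[Passage], llm) -> str:
    """The whole RAG loop: retrieve, assemble a grounded prompt, call the model."""
    return llm(build_prompt(query, retrieve_passages(query, index)))
```

The key point is that the model’s answer is only as balanced as the passages the retriever surfaces and the way the prompt frames them, which is where the issues below creep in.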
Relying on reference resources to locate clinical data and studies that help guide decision-making is not new — we want clinical reasoning to at least be informed by evidence-based guidelines and studies. However, in my view, these LLM-driven tools often subtly cross the line from assisting with “information synthesis” into performing “clinical reasoning”.
I think reflecting on this “reasoning shunt” is important - first, to identify how these tools could end up hurting clinical quality rather than helping, and second, to guide the development of better AI systems.
The Human → AI Reasoning Shunt
Physicians each have their own process for performing clinical reasoning — a structured method for generating clinical hypotheses, evaluating and ranking them, and finally acting on them.
The overall premise of LLM-powered RAG systems — combining a large knowledge base of high-quality sources with strong semantic question-answering capabilities — represents a potential boost for hypothesis generation. A sufficiently capable LLM can reasonably extrapolate what sort of data would be helpful for a clinical workup; given access to a large body of scientific research, it can generate a response that aligns with this data. To that extent, “just” synthesizing a range of facts and stitching them together can greatly speed up the cross-referencing sometimes needed when constructing a clinical plan.
The trouble is when these systems step outside this box, especially in subtle ways that mask the fact. In effect, parts of the clinical reasoning get shunted away from the human to the AI system. This might be acceptable, and potentially even valuable, when it is expected behavior — but done insidiously, it can stunt or misdirect the clinical thought process.
So far, I’ve noticed this behavior falling into two broad categories:
1. The underlying tendency for models to agree with a user’s input
The tendency of ChatGPT and Claude to agree with whatever the user is saying, even when it may not be true, has been widely discussed. In theory, grounding the models in high-quality references should help with this. However, in medicine, it is surprisingly easy to find at least one study that hints at supporting a particular premise, even when the majority of evidence indicates otherwise.
I’ve run into this a few times when asking about the side effects or contraindications of a particular therapy. Although the vast majority of the medical literature shows no demonstrated relationship, there may be one study, done on a very specific, small population with several caveats that make it unlikely to apply to my query. The underlying model’s bias towards agreeing with the user’s question is fed by this “evidence”, which is then used to justify a particular answer — even though a cursory manual review of the source would reveal that the real answer is far more ambiguous.
This isn’t an insurmountable problem, even with today’s systems — more nuanced semantic filtering and processing of the retrieved references could solve many of these issues. I don’t see much of that being done in today’s products, and I worry that seeing the [1][2] and the name of a reputable journal next to a piece of generated text is enough to short-circuit further clinician analysis of the data.
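As a rough illustration of what that filtering might look like, here is a sketch that screens retrieved studies for applicability before the model ever sees them, and makes the overall balance of evidence explicit. The metadata fields (`sample_size`, `population`, `stance`) and the thresholds are assumptions for illustration; extracting them reliably is its own problem.

```python
from dataclasses import dataclass

@dataclass
class Study:
    text: str
    sample_size: int   # hypothetical metadata, extracted upstream
    population: str    # e.g. "adults on hemodialysis"
    stance: str        # "supports", "refutes", or "neutral" toward the queried premise

def screen_references(studies: list[Study], target_population: str, min_n: int = 100) -> list[Study]:
    """Drop studies that are too small or clearly off-population before the LLM
    ever sees them, so a single outlier can't anchor the generated answer."""
    return [
        s for s in studies
        if s.sample_size >= min_n and target_population.lower() in s.population.lower()
    ]

def evidence_balance(studies: list[Study]) -> str:
    """Make the overall weight of evidence explicit, instead of letting one
    cherry-picked citation stand in for 'the literature'."""
    supports = sum(s.stance == "supports" for s in studies)
    refutes = sum(s.stance == "refutes" for s in studies)
    return (
        f"{supports} of {len(studies)} screened studies support the premise; "
        f"{refutes} refute it."
    )
```

Even a crude screen like this reframes a single outlier study from “the evidence” to “one of N studies, most of which disagree.”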
2. Hijacking the reasoning mental model
As I’ve been developing my own process for reasoning, I’ve realized an extremely important component is to always leave room for something I may not have considered. Medicine is a field of long tails, and part of being an effective clinician is recognizing when something is in-distribution vs. out-of-distribution. Because of the information distribution both in the underlying LLMs’ training data and in the references they pull from, clinical AI reference tools are biased towards the more likely etiologies*. This works quite well for the most part — but it does not obviate a clinician’s need to think about the rarer differential diagnoses. (*I haven’t conducted any formal research or literature review to back up this statement; it’s based solely on my n=1 experience trying out these tools.)
When asked to work up a constellation of symptoms, some of these systems respond very definitively: a list of differential diagnoses, a list of workups, and a list of next steps. I worry that, without proper contextualization of where these systems should be used, it becomes very easy to let them guide a clinician’s thought process rather than serve as a tool that enhances it.
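To make the contrast concrete, here is one hypothetical shape a response could take: hypotheses presented as inputs to the clinician’s own reasoning, with explicit room for what the model did not consider. This is purely illustrative; the field names and rendering are my own assumptions, not a prescription.

```python
from dataclasses import dataclass, field

@dataclass
class Differential:
    diagnosis: str
    likelihood: str            # e.g. "high", "moderate", "low"
    cant_miss: bool = False    # rare but dangerous: surface it even at low likelihood

@dataclass
class CoReasoningResponse:
    """A response shape that offers hypotheses as inputs to the clinician's own
    reasoning rather than as a finished plan."""
    differentials: list[Differential]
    open_questions: list[str] = field(default_factory=list)  # what the model did NOT consider
    evidence_gaps: list[str] = field(default_factory=list)   # where the cited sources are thin

def render(resp: CoReasoningResponse) -> str:
    lines = ["Differentials (not a definitive plan):"]
    for d in resp.differentials:
        flag = " [can't-miss]" if d.cant_miss else ""
        lines.append(f"- {d.diagnosis} ({d.likelihood} likelihood){flag}")
    lines.append("Left open for the clinician:")
    lines.extend(f"- {q}" for q in resp.open_questions)
    lines.append("Evidence gaps:")
    lines.extend(f"- {g}" for g in resp.evidence_gaps)
    return "\n".join(lines)
```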
Developing an ideal “Co-Reasoning” System
Ultimately, I think there is a range of things — simple and complex, general and domain-specific, broadly applicable and personalized — that can be done with today’s level of tech to mitigate a lot of the above issues. I’ve started compiling a list of all of these, but in the interest of keeping this post a reasonable length, I’ll leave those for another post :)
Back in medical school, I built an LLM-powered clinical reference tool, which I ultimately shut down after graduating. I’ve been building something new, inspired by the above issues and my experiences thus far as a resident physician.
If you’re interested in learning more — and trying it out in a few weeks :) — feel free to reach out!