Superhuman performance of a large language model on the reasoning tasks of a physician | AI Research Paper Details
www.aimodels.fyi
Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks. However, such benchmarks are highly constrained, saturated with repeated impressive performance by LLMs, and have an unclear relationship to performance in real clinical scenarios. Clinical reasoning, the process by which physicians employ critical thinking to gather and synthesize clinical data to diagnose and manage medical problems, remains an attractive benchmark for model performance. Prior LLMs have shown promise in outperforming clinicians in routine and complex diagnostic scenarios. We sought to evaluate OpenAI's o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response. We characterize the performance of o1-preview with five experiments including differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, adjudicated by physician experts with validated psychometrics. Our primary outcome was comparison of the o1-preview output to identical prior experiments that have historical human controls and benchmarks of previous LLMs. Significant improvements were observed with differential diagnosis generation and quality of diagnostic and management reasoning. No improvements were observed with probabilistic reasoning or triage differential diagnosis. This study highlights o1-preview's ability to perform strongly on tasks that require complex critical thinking such as diagnosis and management while its performance on probabilistic reasoning tasks was similar to past models. New robust benchmarks and scalable evaluation of LLM capabilities compared to human physicians are needed along with trials evaluating AI in real clinical settings.
The doctor, clearly. Who do you sue if the AI doesn’t get it right? Who’s held accountable for the failure? Also, for your scenario to work, either humans have to do that diagnosis enough to generate those stats, or the AI has to fail enough to generate those stats. Either way, people are going to die due to preventable misdiagnosis.
Moreover, all LLMs just ‘hallucinate’. Sometimes those hallucinations happen to line up with reality, but by their very nature they do not deal in factual information. There is a reason no LLMs will ever touch Wikipedia or other knowledge bases.
This is not how this research is done. You can make diagnoses without applying them to patients. You can, for example, go back to a database of past cases, create diagnoses for those cases, and check in the present whether they were right or wrong. This way (just one example) you can create statistics. No one has to die. You don’t know how this is done. (Frankly, I don’t know a lot either … the people who wrote the article probably know much more than you and I do.)
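To make the idea concrete, here is a rough sketch of that kind of retrospective check. Everything in it is made up for illustration: the `PastCase` record, the `retrospective_accuracy` helper, the toy cases, and the two stand-in "diagnosers" are all hypothetical, and exact string matching is a simplification. The paper itself had physician experts adjudicate the outputs rather than matching labels.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PastCase:
    """A historical case: what was known up front, and what it turned out to be."""
    notes: str       # information available at presentation
    confirmed: str   # diagnosis that was later confirmed (ground truth)


def retrospective_accuracy(cases: List[PastCase],
                           diagnose: Callable[[str], str]) -> float:
    """Fraction of past cases where the diagnoser matches the confirmed diagnosis.

    Matching is a naive case-insensitive string comparison (an assumption made
    to keep the sketch short).
    """
    if not cases:
        raise ValueError("need at least one past case")
    hits = sum(
        1 for case in cases
        if diagnose(case.notes).strip().lower() == case.confirmed.strip().lower()
    )
    return hits / len(cases)


# Toy, invented data: no real patients and no real model.
cases = [
    PastCase("fever, cough, infiltrate on chest x-ray", "pneumonia"),
    PastCase("crushing chest pain, ST elevation", "myocardial infarction"),
    PastCase("polyuria, polydipsia, high glucose", "diabetes mellitus"),
    PastCase("right lower quadrant pain, fever", "appendicitis"),
]


def toy_model(notes: str) -> str:
    """Hypothetical stand-in for the system under evaluation."""
    if "chest x-ray" in notes:
        return "pneumonia"
    if "ST elevation" in notes:
        return "myocardial infarction"
    if "glucose" in notes:
        return "diabetes mellitus"
    return "gastroenteritis"


def toy_baseline(notes: str) -> str:
    """Hypothetical stand-in for a comparison baseline."""
    return "pneumonia" if "cough" in notes else "unknown"


model_acc = retrospective_accuracy(cases, toy_model)
baseline_acc = retrospective_accuracy(cases, toy_baseline)
print(f"model: {model_acc:.2f}, baseline: {baseline_acc:.2f}")
```

The point is only that both diagnosers are scored against outcomes that are already known, so the comparison needs no new patients.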
After that, if we know that the AI is superior in these cases (I agree this is a big “if”), then I would choose its diagnosis and take responsibility for my choice. I wouldn’t sue any doctor, and I would still be at an advantage because of this better choice.
But maybe we cannot agree on this topic. I wish you the very best, take care 😌
Maybe in a country without private medical care, but your idea doesn’t work in the US.
AI is already, currently, this second, in use in the medical insurance industry and has statistically killed at least one person.
Expanding that to the part of the medical business that has some scientific backing is essentially societal suicide, unless you’re rich enough to afford a real human doctor.
Whoops, sorry, no … I didn’t have the USA in mind while writing … so there: yes, “healthcare” is completely fucked up.