This is such an annoyingly useless study. 1) The cases they gave ChatGPT were specifically designed to be unusual and challenging; they are basically brain teasers for pediatrics. So all you've shown is that ChatGPT can't diagnose rare cases; we learn nothing about how it does on common cases. It's also not clear that these questions had verifiable answers, since the article only mentions that the magazine they were taken from sometimes explains the answers.
2) Since these are magazine brain teasers and not an actual scored test, we have no idea how ChatGPT's score compares to human pediatricians. Maybe an 83% error rate is better than the average pediatrician's score.
3) Why even run this test with a general-purpose foundation model in the first place, when there are tons of domain-specific medical models already available, many of them open source?
4) The paper is paywalled, but there doesn't seem to be any indication that the researchers used any prompting strategies. Just last month Microsoft released a paper showing GPT-4, using CoT and multi-shot prompting, could score 90% on the medical license exam, surpassing the 86.5% score of the domain-specific Med-PaLM 2 model.
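For what it's worth, "CoT and multi-shot prompting" just means prepending worked examples, each with a visible reasoning chain, before the actual question, instead of asking cold. A toy sketch of the difference (the example case, wording, and function names are mine, not from Microsoft's paper):

```python
# Illustrative only: one hypothetical few-shot example with a reasoning chain.
FEW_SHOT_EXAMPLES = [
    {
        "case": "A 3-year-old with fever and a sandpaper-like rash.",
        "reasoning": "Fever plus a sandpaper rash suggests scarlet fever; "
                     "a rapid strep test would confirm.",
        "answer": "Scarlet fever",
    },
]

def zero_shot_prompt(case: str) -> str:
    # The bare question, which is roughly what the study appears to have done.
    return f"Case: {case}\nWhat is the most likely diagnosis?"

def cot_few_shot_prompt(case: str) -> str:
    # Worked examples with explicit reasoning, then the new case,
    # ending mid-pattern so the model continues with its own chain.
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Case: {ex['case']}\n"
            f"Let's think step by step: {ex['reasoning']}\n"
            f"Diagnosis: {ex['answer']}\n"
        )
    parts.append(f"Case: {case}\nLet's think step by step:")
    return "\n".join(parts)
```

Same model, same question; only the prompt changes, and that alone is apparently worth several points on these benchmarks.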
This paper just smacks of defensive doctors trying to dunk on ChatGPT. Give a general-purpose model super hard questions, no prompting advantage, and no way to compare its score against humans, and then go "hur durr chatbot is dumb." I get it: doctors are worried because specialized LLMs are very likely to take a big chunk of their work in the next five years, so anything they can do now to muddy the water and put some doubt in people's minds is a little job protection.
If they wanted to do something actually useful, they'd give those same questions to a dozen human pediatricians, to GPT-4 zero-shot, to GPT-4 with Microsoft's prompting strategy, and to Med-PaLM 2 or some other high-performing domain-specific model, and then compare the results. Why not throw in a model that can reference an external medical database, for fun? I'd be very interested in those results.
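The comparison described above is trivial to harness, too. A minimal sketch, assuming each arm (human panel or model) can be wrapped as a function from case text to a diagnosis string; the arm names are placeholders and nothing here calls a real API:

```python
def evaluate(arms, cases):
    """arms: {name: fn(case_text) -> diagnosis string}
    cases: list of (case_text, gold_diagnosis) pairs.
    Returns accuracy per arm on the identical question set."""
    scores = {}
    for name, diagnose in arms.items():
        correct = sum(
            diagnose(text).strip().lower() == gold.strip().lower()
            for text, gold in cases
        )
        scores[name] = correct / len(cases)
    return scores

# Placeholder arms standing in for the real comparison groups.
arms = {
    "pediatrician_panel": lambda case: "diagnosis A",
    "gpt4_zero_shot":     lambda case: "diagnosis A",
    "gpt4_with_prompting": lambda case: "diagnosis B",
    "med_palm_2":         lambda case: "diagnosis B",
}
cases = [("toy case 1", "diagnosis A"), ("toy case 2", "diagnosis B")]
results = evaluate(arms, cases)
```

The hard part of such a study is sourcing the cases and graders, not the code; exact-match scoring here is a simplification, since real grading would need a rubric or expert adjudication of free-text answers.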
Edit to add: If you want to read an actually interesting study, try this one from May 2023: https://arxiv.org/pdf/2305.09617.pdf. "Med-PaLM 2 scored up to 86.5% on the MedQA dataset…We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility." For comparison, the average human score is about 60%. This is the domain-specific LLM I mentioned above, which Microsoft got GPT-4 to beat last month just through better prompting strategies.
Ugh, this article and study are annoying.
LLMs are not even the right type of AI to attempt medical diagnosis with. Stop treating LLMs like they can fucking think and reason. They do not.
There are literally probably a dozen LLMs trained exclusively on, or fine-tuned on, medical papers and other medical materials, specifically designed for medical diagnosis. They already perform on par with or better than average doctors in some tests. It's already a thing, and they will get better. Will they replace doctors outright? Probably not, at least not for a while. But they will certainly be very helpful tools for helping doctors make diagnoses and catch blind spots. I'd bet that in 5-10 years it will be considered malpractice (i.e., below the standard of care) not to consult a specialized LLM when making certain diagnoses.
On the other hand, you make a very compelling argument of "nuh uh," so I guess I should take that into account.