Multiple LLMs voting together on content validation catch each other’s mistakes to achieve 95.6% accuracy.

Lugh · 9 months ago

Large language models surpass human experts in predicting neuroscience results

A small study found ChatGPT outdid human physicians when assessing medical case histories, even when those doctors were using a chatbot.

massive_bereavement@fedia.io · 9 months ago

Are you kidding me? How did NYT reach those conclusions when the chair-flipping conclusions of said study quite clearly state that [sic] “The use of an LLM did not significantly enhance diagnostic reasoning performance compared with the availability of only conventional resources.”

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

I mean, c’mon!

On the Nature one:

“we constructed a new forward-looking (Fig. 2) benchmark, BrainBench.”

and

“Instead, our analyses suggested that LLMs discovered the fundamental patterns that underlie neuroscience studies, which enabled LLMs to predict the outcomes of studies that were novel to them.”

and

“We found that LLMs outperform human experts on BrainBench”

Is in reality saying: we made this benchmark, and LLMs know how to cheat their way around it better than experts do, nothing more, nothing else.

Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability