Dave: It turned out to be yet another hoax. Unfortunately, The New York Times ran with it, and the story spiraled out of control: https://sergeiai.substack.com/p/the-ai-outperforms-doctors-claim
I've started absorbing your linked article.
What makes you call it a hoax?
Hi Dave,
I wonder if a different word might work better than “hoax.” My main concern is that, quantitatively, it’s difficult—if not impossible—to draw any solid conclusions from a sample of just six vignettes, especially when evaluating a sophisticated model like GPT-4. Additionally, no traditional evaluation benchmarks were used, and the grading system they implemented seems highly subjective.
Why an 18-point scale? Why does it occasionally become a 19-point scale? Why was reasoning weighted at 78% while the final diagnosis was given only 11%? I’ve outlined all of these issues in Section 5. Initially, I planned to focus solely on the small sample size bias, so I didn’t include a brief summary of my main concerns. However, since you’re not the only one asking, I’m thinking I should add one.
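To make the sample-size concern concrete, here is a minimal back-of-the-envelope sketch. The per-case scores below are invented for illustration (they are not taken from the study); the point is only that with n = 6, even modest case-to-case variation produces a wide 95% confidence interval on an 18-point scale:

```python
# Illustrative only: how wide is a 95% confidence interval for a mean
# score when the sample is just six vignettes? The scores are
# hypothetical values on an 18-point scale, not data from the study.
import math
import statistics

scores = [13, 15, 11, 16, 12, 14]  # hypothetical per-vignette scores, n = 6
n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)   # sample standard deviation
t_crit = 2.571                  # t critical value for 95% CI, df = 5
half_width = t_crit * sd / math.sqrt(n)

print(f"mean = {mean:.1f}, 95% CI = ({mean - half_width:.1f}, {mean + half_width:.1f})")
# prints: mean = 13.5, 95% CI = (11.5, 15.5)
```

With these hypothetical numbers the interval spans roughly plus or minus 2 points, about 11% of the scale in each direction, which is why drawing firm model-versus-physician conclusions from six vignettes is statistically fraught.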
I don't know you, Sergei, but I'm a bit worried about your aggressive and accusatory word choices; they have no place in scientific discourse. So let's see if we can get this down to specifics.
If not "hoax," then what are you saying?
Also, "cherry-picking" is an accusation of scientific malpractice. Why specifically do you assert that this happened?
Hi Dave,
I’m sorry if I came across the wrong way. I’m not an aggressive person. I referred to it as a hoax because, in my opinion, the claim that AI outperforms physicians wasn’t scientifically proven in the JAMA study. Perhaps it was a poor choice of words, but it’s the one I used.
I initially thought I said "hand-picking," but after reviewing the article, I see that I did use "cherry-picking" on one occasion. The authors employed a subjective method to select 6 cases out of 105 for testing, which, in my view, does not align with any established scientific standard. That said, I don’t believe it constitutes scientific malpractice.
My goal is simply to foster a polite and respectful discussion because this is an important issue. If I’m wrong, I’ll own up to it - no questions asked.
I reached out to the authors to address my concerns. To their credit, they replied, but unfortunately, their explanation only strengthened my concerns.
Your assertion that you wanted to foster a "polite and respectful" discussion gets a big "BS" error buzzer from me.
You started here by calling it a hoax, and in your own Substack post your first sentence says "Fake," then "You've got to be kidding me, JAMA!"
If you're going to sling mud, have some spine, dude.
I don’t see a controversy here. I believe that even those with polar opposite opinions can engage in an intelligent, fact-based, and anger-free discussion.
Yes, I may have used harsh words, but those words are grounded in facts. My article wasn’t written in a day - I took the time to do my research, and I stand by it.
Again, without having dug deeply into your writings, this seems like an awful lot of analysis when the only real proof of the pudding ("Is there a 'there' there?") will have to come from replication studies.
I'm not arguing with any of your reasoning - I'm not equipped to, honestly - but I do know that the whole point of scientific reasoning is to expand our ability to predict what will happen next, until we can actually get there.
So we'll see, and I will try to absorb more. Meanwhile please let me know your thoughts on those accusations, per my previous response here.
Sergei Polevikov’s analysis highlights fundamental flaws in the JAMA study, from its inadequate sample size to its questionable methodology and misrepresentation in media coverage. While AI holds promise in diagnostic reasoning, rigorous research design and transparent reporting are essential to substantiate its capabilities. Studies with methodological weaknesses not only undermine scientific credibility but also risk misinforming policymakers, clinicians, and the public.
This critique serves as a reminder of the importance of statistical rigor, peer review, and balanced media reporting in evaluating emerging technologies like AI in healthcare.
@Ediriweera Desapriya was this comment generated by AI? If so, which one? And please share with us the prompt you used so we can learn from it.
I actually think the JAMA study was quite good, and they are very clear that the study did not mean that AI should be used alone. The NYT and others picked up on that finding.
And it is a great one. The core fact is that the AI made a huge difference and the doctors (fairly young and highly educated) were unable to incorporate that into their practice.
I too wrote about this here but think it is a key finding: https://www.linkedin.com/feed/update/urn:li:activity:7264598665728516096/
Damon, I'm late in posting this first note due to travel, but what you wrote is interesting and I hope to return to discuss this weekend.