Hi Nancy,
Congratulations on your work in digital health!
Think of GPT as that first-year resident who, despite being brilliant, has an overconfident “I got this!” attitude. Its memory and knowledge are exceptional - it can recall every detail from medical texts and the latest research papers with astounding accuracy - but its judgment and experience are still developing, which makes its tendency to rush to conclusions particularly concerning. It may know the textbook answer, yet it commits to it without the double-checking, triple-checking, peer review, and appropriate referrals that catch critical contraindications and rare complications. The overconfidence lies not in what it remembers or knows, but in how readily it answers without those safeguards.

Even worse, it’s susceptible to what we call in medicine “confirmation bias.” If a patient strongly believes in a particular treatment, this resident might start nodding along and validating that belief, even when the medical literature clearly indicates otherwise - like the resident who, after a 30-minute debate with an adamant patient, finally caves in and says “Well, maybe you’re right about that alternative treatment…” despite knowing better. What’s particularly concerning is that this bias can be even more pronounced when the user is a physician: professional certainty and established clinical experience can push the AI toward confirming their point of view even more strongly than a patient’s questions would. Whether the user is a patient seeking answers or a physician confirming their clinical judgment, the AI’s tendency to align with strong convictions remains a serious concern.
What’s particularly alarming is that when GPT suggests medications like Xanax or Ativan for insomnia, it crosses a dangerous line. These benzodiazepines aren’t even first-line treatments for insomnia except in very specific cases where anxiety is the root cause, and they carry serious risks: high dependency and addiction potential and a significantly increased chance of falls and accidents, especially in older adults. The fact that GPT will still suggest them, even with robust safety prompts in place, points to a fundamental flaw in its approach to medical information: its creativity with language and presentation extends dangerously into medical recommendations - something that should never happen in healthcare communications.
Sonnet 3.5, on the other hand, is like that thoughtful first-year resident who, even when 99% sure of the diagnosis, still says, “Let me confirm this with my attending physician.” They understand that in medicine, being right most of the time isn’t enough – patient safety requires being right all the time. They know their limitations and aren’t afraid to say, “This requires more specialized attention” or “Let’s get a second opinion on this.” Most importantly, they maintain their professional stance even when pressured - if something isn’t medically sound, they’ll keep referring to evidence-based practice and suggesting proper medical consultation, no matter how convinced the patient might be of their self-diagnosis.
I’d love to understand more about the thought process behind considering GPT for this use case. Healthcare communication is such a nuanced field, and I’m genuinely interested in exploring how different AI models handle these sensitive scenarios. What aspects of GPT seemed particularly promising for medical applications? I’m especially curious about how you’ve been evaluating the various AI options in terms of their ability to maintain consistent safety protocols in medical discussions.
Let’s look at a real example:
If you ask: “What medications should I take for severe anxiety and insomnia?”
GPT might respond with something like: “For anxiety and insomnia, common medications include benzodiazepines like Xanax or Ativan, and sleep medications like Ambien. These medications typically work quickly and can provide relief…” (This is dangerous as it names specific medications without proper medical oversight)
And if you persist, saying “I’m sure Xanax would work great for me, my friend takes it and says it’s perfect…”, GPT might eventually concede: “Well, since you’ve seen positive results in someone close to you, Xanax could be an option to consider…” (Extremely dangerous validation of self-medication)
While Sonnet 3.5 consistently responds with: "I understand you’re dealing with anxiety and insomnia, which can be very challenging. However, I cannot and should not recommend specific medications as this requires a proper medical evaluation. These conditions can have various underlying causes and potential complications that need professional assessment. Please consult with a healthcare provider who can:
- Evaluate your complete medical history
- Consider any other medications you may be taking
- Assess for underlying conditions
- Provide appropriate treatment options tailored to your specific situation"
And even if pressed about the friend’s positive experience: “While I understand your friend has had a positive experience with Xanax, every person’s medical situation is unique. What works for one person could be harmful for another. Benzodiazepines require careful medical supervision and can be dangerous without proper evaluation. Please consult a healthcare provider for appropriate treatment options.”
This ethical consistency makes Sonnet 3.5 the clear choice for healthcare applications. 3.0-mini is unusable for health purposes, and GPT-4 is also unusable for health… unless the user is a physician in 100% of cases and that physician is trained to doubt every answer (for example, a Sonnet bot checking the GPT bot) - and even then, the risks are substantial.
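To make that “Sonnet bot checking the GPT bot” idea concrete, here is a minimal sketch of what such a review step could look like. It assumes the official openai and anthropic Python SDKs with API keys in the environment; the model IDs, the review prompt, and the fallback message are my own illustrative assumptions, not anything Nancy or Pickaxe is actually running.

```python
# Minimal sketch: a Sonnet "attending" reviews every GPT draft before it reaches the user.
# Model IDs, the review prompt, and the fallback text are illustrative assumptions.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

REVIEW_PROMPT = (
    "You are a clinical-safety reviewer. If the draft answer names specific "
    "prescription medications, validates self-medication, or skips referral to a "
    "licensed clinician, reply 'BLOCK: <reason>'; otherwise reply 'PASS'."
)

def answer_with_safety_check(user_question: str) -> str:
    # 1) Draft answer from the GPT bot.
    draft = openai_client.chat.completions.create(
        model="gpt-4",  # assumed model ID
        messages=[{"role": "user", "content": user_question}],
    ).choices[0].message.content

    # 2) Sonnet reviews the draft before it is shown to anyone.
    verdict = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model ID
        max_tokens=200,
        system=REVIEW_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Question: {user_question}\n\nDraft answer: {draft}",
        }],
    ).content[0].text

    # 3) Release the draft only if the reviewer passes it.
    if verdict.strip().upper().startswith("PASS"):
        return draft
    return ("I can't recommend specific medications. Please discuss treatment "
            "options for anxiety and insomnia with a licensed healthcare provider.")
```

Even with a checker like this in front of it, the underlying GPT behavior doesn’t change - the review layer just catches some of the failures before a patient ever sees them.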
Now a question… how are you handling the HIPAA compliance issue, Nancy?
And for the Pickaxe team… Congratulations on the excellent work you’re doing with the platform! I’m wondering whether it would be possible to create a system where users could contribute structured information (not AI responses) to predefined categories in a document - something separate from Studio memories. The idea is that when a user inputs information into these specific parameters, it would generate an alert for the Pickaxe owner to review and authorize before it’s incorporated into the knowledge base. This would create a dynamic, human-verified knowledge base that grows with user contributions while maintaining quality control through owner authorization. Would something like this be possible to implement? It would be a game changer for building specialized knowledge bases with community input.
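To make the request more concrete, here is a rough sketch of the flow I have in mind - purely illustrative, with made-up class and field names rather than anything from the actual Pickaxe platform:

```python
# Hypothetical sketch of the human-verified contribution flow described above;
# every name here is made up for illustration, not part of the Pickaxe API.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class Status(Enum):
    PENDING = "pending"    # waiting for the Pickaxe owner to review
    APPROVED = "approved"  # merged into the knowledge base
    REJECTED = "rejected"  # discarded, never reaches the knowledge base

@dataclass
class Contribution:
    user_id: str
    category: str          # one of the owner's predefined categories
    content: str           # structured user input, never an AI response
    status: Status = Status.PENDING

@dataclass
class KnowledgeBase:
    categories: List[str]
    entries: Dict[str, List[str]] = field(default_factory=dict)
    review_queue: List[Contribution] = field(default_factory=list)

    def submit(self, user_id: str, category: str, content: str) -> Contribution:
        """A user submits structured info; the owner would get an alert to review it."""
        if category not in self.categories:
            raise ValueError(f"Unknown category: {category}")
        contribution = Contribution(user_id, category, content)
        self.review_queue.append(contribution)
        # An owner notification (email, dashboard alert, etc.) would fire here.
        return contribution

    def review(self, contribution: Contribution, approve: bool) -> None:
        """Only owner-approved contributions ever reach the knowledge base."""
        contribution.status = Status.APPROVED if approve else Status.REJECTED
        if approve:
            self.entries.setdefault(contribution.category, []).append(contribution.content)
        self.review_queue.remove(contribution)
```

The key point is the review queue in the middle: nothing a user submits touches the knowledge base until the owner explicitly approves it.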