Evaluating the Quality of AI-Generated Health Information on Perinatal Psychopharmacology
Lydia J Kim1, Kurt R Wharton2
1Oakland University William Beaumont School of Medicine; 2Department of Obstetrics and Gynecology, Corewell Health William Beaumont University Hospital

Introduction
Recent surveys indicate that nearly half of consumers use generative AI for health-related inquiries,1 driven largely by the convenience these platforms offer. As these tools gain popularity, patients are increasingly turning to AI chatbots as sources of medical information,2 a trend with significant implications for patient safety: AI chatbots can produce incorrect information, warranting caution in medical practice.2 Standardized frameworks for evaluating the quality of AI-generated health information remain underdeveloped.

Aims and Objectives
This study aims to evaluate and compare the quality of responses provided by ChatGPT, Gemini, and Claude to a clinically realistic patient query.
Methods
Each of the three AI platforms, ChatGPT (OpenAI), Gemini (Google), and Claude (Anthropic), received a single, identical prompt simulating a realistic patient query: a 30-year-old, 10-week-pregnant first-time mother with a history of major depressive disorder asking whether it is safe to continue sertraline throughout pregnancy and breastfeeding and whether her dose should be adjusted. Each platform’s complete response was evaluated by a single rater using the DISCERN instrument, producing item-level scores for Questions 1-16 and a total score ranging from 16 to 80.
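To make the scoring arithmetic concrete, the sketch below (illustrative only, not the study’s analysis code) sums the 16 DISCERN item ratings and maps the total onto published quality benchmarks. The function names are hypothetical, and the cutoffs outside the “poor” (27-38) and “fair” (39-50) bands cited in this study follow the commonly reported DISCERN benchmarks and should be treated as assumptions.

```python
# Illustrative sketch of DISCERN scoring, not the study's analysis code.
def discern_total(item_scores):
    """Sum the 16 DISCERN item ratings (each 1-5) into a total of 16-80."""
    if len(item_scores) != 16 or not all(1 <= s <= 5 for s in item_scores):
        raise ValueError("DISCERN requires 16 item ratings, each from 1 to 5")
    return sum(item_scores)

def quality_band(total):
    """Map a total DISCERN score to a quality classification band."""
    if total <= 26:
        return "very poor"  # 16-26 (assumed, from commonly published benchmarks)
    if total <= 38:
        return "poor"       # 27-38 (band cited in this study)
    if total <= 50:
        return "fair"       # 39-50 (band cited in this study)
    if total <= 62:
        return "good"       # 51-62 (assumed, from commonly published benchmarks)
    return "excellent"      # 63-80 (assumed, from commonly published benchmarks)

# Example: the totals reported in Table 1
for platform, total in [("Claude", 44), ("Gemini", 41), ("ChatGPT", 37)]:
    print(f"{platform}: {total} ({quality_band(total)})")
```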
Results
Claude received the highest total score (44), followed by Gemini (41) and ChatGPT (37). Against published DISCERN quality benchmarks, Claude and Gemini fell within the “fair quality” range (39-50), while ChatGPT fell just below this threshold, in the “poor quality” range (27-38).
Table 1. Total DISCERN Scores by Platform
Platform | Total DISCERN Score | Quality Classification
Claude   | 44                  | Fair
Gemini   | 41                  | Fair
ChatGPT  | 37                  | Poor
Conclusions
Deficiencies common to all three platforms were identified, particularly in acknowledging uncertainty in the current evidence, presenting alternative treatment options, and directing patients to additional sources of support. These findings suggest that publicly available AI platforms should not be relied upon as standalone sources of perinatal medication information, and they highlight the need for ongoing evaluation of AI-generated health content against validated consumer health information standards.
References
Acknowledgements
I would like to acknowledge the support provided by my mentor, Dr. Kurt Wharton. I am also grateful to Dr. Dwayne Baxa, my advisor, whose guidance was invaluable.