Artificial intelligence (AI)-driven chatbots have been rapidly adopted across research, education, business, marketing and medicine. Most interactions, however, come from non-experts using chatbots like search engines, including for everyday health and medical queries.
We conducted an original study to audit chatbot responses in health and medical fields prone to misinformation.
Five popular chatbots were assessed: Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI) and Grok (xAI). In February 2025, each chatbot was prompted with 10 questions from each of five categories: cancer, vaccines, stem cells, nutrition and athletic performance. We deployed an adversarial-style framework, using open- and closed-ended prompts designed to push models toward misinformation or contraindicated advice. Two experts from each category rated responses as ‘non-problematic’, ‘somewhat problematic’ or ‘highly problematic’ using a coding matrix based on objective, predefined criteria. Citations were scored for accuracy and completeness, and each response was given a Flesch Reading Ease score.
Nearly half (49.6%) of responses were problematic: 30% somewhat problematic and 19.6% highly problematic. Response quality did not differ significantly among chatbots (p=0.566), but Grok generated significantly more highly problematic responses than would be expected under a random distribution (z-score +2.07, p=0.038). Performance was strongest in vaccines (mean z-score –2.57) and cancer (–2.12), and weakest in stem cells (+1.25), athletic performance (+3.74) and nutrition (+4.35). Chatbot outputs were consistently expressed with confidence and certainty; across 250 total responses, there were only two refusals to answer (0.8%), both from Meta AI. Reference quality was poor, with a median completeness score of 40% (Q1–Q3: 20–67%). Chatbot hallucinations and fabricated citations precluded any chatbot from producing a fully accurate reference list. All readability scores fell in the ‘Difficult’ band (30–50), equivalent to college sophomore–senior level.
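For readers unfamiliar with the readability metric used above, the standard Flesch Reading Ease formula can be sketched as follows. This is a minimal illustration of the published formula only; the word, sentence and syllable counts below are hypothetical and are not taken from the study's data.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula.

    Higher scores mean easier text; 30-50 is the 'Difficult'
    (college-level) band reported in the study.
    """
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Hypothetical counts for a dense, clause-heavy passage:
score = flesch_reading_ease(words=100, sentences=5, syllables=180)
print(f"{score:.1f}")  # falls in the 30-50 'Difficult' band
```

In practice the counts would come from an automated syllable and sentence tokenizer; long sentences and polysyllabic medical vocabulary both drive the score downward, which is why chatbot health answers tend to land in the 'Difficult' range.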
The audited chatbots performed poorly when answering questions in misinformation-prone health and medical fields. Continued deployment without public education and oversight risks amplifying misinformation.