How Guideline-Concordant Are Cancer Treatment Recommendations From ChatGPT?
Researchers have found that about one-third of treatment recommendations from the artificial intelligence (AI) model ChatGPT 3.5 were nonconcordant with the NCCN Clinical Practice Guidelines in Oncology® (NCCN Guidelines®), according to a recent study published by Chen et al in JAMA Oncology. The findings highlight the need for awareness of the technology's limitations.
For many patients, the internet may serve as a powerful tool for self-education on medical topics.
“Patients should feel empowered to educate themselves about their medical conditions, but they should always discuss with a clinician, and resources on the internet should not be consulted in isolation,” emphasized senior study author Danielle Bitterman, MD, of the Department of Radiation Oncology at Brigham and Women’s Hospital and the AI in Medicine (AIM) Program at Mass General Brigham. “ChatGPT responses can sound a lot like a human and can be quite convincing. But, when it comes to clinical decision-making, there are so many subtleties for every patient’s unique situation. A right answer can be very nuanced and not necessarily something ChatGPT or another large language model can provide,” she explained.
The emergence of AI tools in health care has been groundbreaking and may offer the potential to reshape the continuum of care.
Study Methods and Results
In the new study, the researchers evaluated the extent to which ChatGPT’s recommendations aligned with NCCN Guidelines® by prompting the AI model to provide a treatment approach for breast cancer, prostate cancer, and lung cancer based on the severity of the disease. In total, the researchers included 26 unique diagnosis descriptions and used four slightly varied prompts—generating a total of 104 prompts.
The researchers found that 98% of ChatGPT's responses included at least one treatment approach that was concordant with the NCCN Guidelines. However, 34% of the responses also included one or more nonconcordant recommendations, meaning the response was only partially correct; these recommendations were sometimes difficult to detect amid otherwise sound guidance. Notably, complete agreement in scoring occurred in only 62% of cases, underscoring both the complexity of the NCCN Guidelines and the extent to which ChatGPT's output could be vague or difficult to interpret.
Additionally, the researchers revealed that in 12.5% of cases, ChatGPT produced "hallucinations": recommendations that were absent from the NCCN Guidelines, such as novel therapy recommendations for noncurative cancer types. The researchers warned that this form of misinformation can incorrectly set patients' expectations about treatment and potentially harm the clinician-patient relationship.
The researchers plan to explore how well both patients and clinicians can distinguish between medical advice written by a clinician vs advice generated by a large language model like ChatGPT. They are also prompting ChatGPT with more detailed clinical cases to further evaluate its clinical knowledge.
They acknowledged that the results of their study may vary if other large language models and/or clinical guidelines are used, but emphasized that many such models are similar in how they are built and in the limitations they possess.
“It is an open research question as to the extent [large language models] provide consistent logical responses as oftentimes hallucinations are observed,” noted lead study author Shan Chen, MS, of the AIM Program at Mass General Brigham. “Users are likely to seek answers from the [models] to educate themselves on health-related topics. At the same time, we need to raise awareness that [large language models] are not the equivalent of trained medical professionals,” she underscored.
The researchers hope their new findings will help inform the responsible incorporation of AI into cancer care delivery, workforce support, and administrative processes.
Disclosure: The research in this study was supported by the Woods Foundation. For full disclosures of the study authors, visit jamanetwork.com.

The content in this post has not been reviewed by the American Society of Clinical Oncology, Inc. (ASCO®) and does not necessarily reflect the ideas and opinions of ASCO®.