Commonly used large language models (LLMs) were able to provide appropriate, guideline-aligned treatment recommendations for patients with straightforward cases of early-stage hepatocellular carcinoma; however, greater disagreement with physician recommendations was seen in cases of late-stage disease, according to findings from a Korean retrospective registry study published in PLOS Medicine.
“Our study shows that [LLMs] can help support treatment decisions for early-stage liver cancer, but their performance is more limited in advanced disease. This highlights the importance of using LLMs as a complement to, rather than a replacement for, clinical expertise,” the study authors, including corresponding author Ji Won Han, MD, PhD, Division of Gastroenterology and Hepatology, Department of Internal Medicine, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea, wrote.
Study Methods
Researchers explored the clinical relevance of treatment recommendations generated by several LLMs (ChatGPT 4o, Gemini 2.0, and Claude 3.5) through comparison with physicians' real-world decisions and the resulting outcomes.
The study authors retrospectively analyzed data from the Korean Primary Liver Cancer Registry for 13,614 patients with treatment-naive hepatocellular carcinoma diagnosed between 2008 and 2020. LLM recommendations were generated for the cohort using standardized prompts that referenced guidelines from the American Association for the Study of Liver Diseases and the European Association for the Study of the Liver. Patients were then classified according to whether the LLM recommendations matched the treatment actually given by their physicians, and decision tree analysis was used to identify factors that influenced treatment choices.
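The paper does not include analysis code; the following is a minimal Python sketch, assuming scikit-learn, of how the concordance labeling and decision tree step could look. The column names, toy values, and thresholds are purely illustrative and are not taken from the study.

```python
# Hypothetical sketch of the concordance-labeling and decision tree analysis
# described above; variable names and values are illustrative only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# One row per patient: the physician's actual treatment and the treatment
# an LLM recommended in response to the standardized prompt
registry = pd.DataFrame({
    "bclc_stage":       ["A", "A", "C", "C"],
    "child_pugh_score": [5, 8, 6, 10],
    "tumor_size_cm":    [2.1, 3.0, 7.5, 9.0],
    "physician_tx":     ["resection", "tace", "systemic", "tace"],
    "chatgpt_tx":       ["resection", "resection", "systemic", "systemic"],
})

# Label each patient as concordant (1) or discordant (0) for a given model
registry["chatgpt_concordant"] = (
    registry["chatgpt_tx"] == registry["physician_tx"]
).astype(int)

# Fit a shallow decision tree to see which clinical factors separate
# concordant from discordant cases
features = ["child_pugh_score", "tumor_size_cm"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(registry[features], registry["chatgpt_concordant"])
print(export_text(tree, feature_names=features))
```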
Key Findings
Gemini 2.0 achieved the highest concordance rate with physician decisions, at 32.7%, followed by ChatGPT 4o at 31.1% and Claude 3.5 at 26.8%.
When stratified by Barcelona Clinic Liver Cancer (BCLC) stage, concordance between LLM recommendations and physician decisions was associated with better survival in patients with BCLC-A disease (ChatGPT 4o hazard ratio [HR] = 0.743; 95% confidence interval [CI] = 0.665–0.831; P < .001). In patients with BCLC-C disease, by contrast, concordance was associated with worse survival (ChatGPT 4o HR = 1.650; 95% CI = 1.523–1.787; P < .001; Gemini 2.0 HR = 1.586; 95% CI = 1.470–1.711; P < .001; Claude 3.5 HR = 1.483; 95% CI = 1.366–1.610; P < .001). Associations with survival were modest or nonsignificant in patients with BCLC-B disease.
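The stage-specific hazard ratios above are consistent with proportional hazards models fit separately within each BCLC stage. The sketch below, assuming the lifelines package and hypothetical variable names and toy data, illustrates that kind of analysis; it is not the authors' actual code.

```python
# Hypothetical sketch of a stage-stratified survival comparison between
# concordant and discordant cases; toy data for illustration only.
import pandas as pd
from lifelines import CoxPHFitter

# One row per patient: follow-up time (months), death indicator, BCLC stage,
# and whether the LLM recommendation matched the physician's choice
df = pd.DataFrame({
    "months":     [12.0, 30.5, 20.3, 48.0, 6.2, 9.1, 15.0, 18.4],
    "death":      [1, 0, 1, 0, 1, 1, 0, 0],
    "bclc_stage": ["A", "A", "A", "A", "C", "C", "C", "C"],
    "concordant": [1, 0, 0, 1, 1, 0, 1, 0],
})

# Fit a separate Cox model within each BCLC stage so the hazard ratio for
# concordance can differ by stage, as reported in the study
for stage, group in df.groupby("bclc_stage"):
    cph = CoxPHFitter()
    cph.fit(group[["months", "death", "concordant"]],
            duration_col="months", event_col="death")
    hr = cph.hazard_ratios_["concordant"]
    print(f"BCLC-{stage}: HR for concordant vs discordant = {hr:.3f}")
```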
Analysis of the factors driving decisions showed that physicians tended to weight liver function parameters more heavily, whereas the LLMs focused more on tumor characteristics. In patients with early-stage hepatocellular carcinoma, physicians avoided curative treatments when hepatic reserve was limited; in patients with advanced-stage disease and preserved liver function, they more often chose locoregional therapies, even when these choices departed from guideline recommendations for systemic therapy.
“While LLMs may serve as adjunctive tools for guideline-concordant decisions in straightforward scenarios, their recommendations may reflect limited contextual awareness in complex clinical situations requiring individualized care,” the study authors suggested. “LLM recommendations should be interpreted cautiously alongside clinical judgment.”
As the study was limited by its retrospective design, lack of imaging information, and focus on guideline-era treatments, the study authors recommended prospective validation of these findings going forward.
DISCLOSURE: This work was supported by a National Research Foundation of Korea grant funded by the Korean government and by the Ministry of Health & Welfare, Republic of Korea. For full disclosures of the study authors, visit journals.plos.org.

