Artificial intelligence (AI) tools that detect molecular biomarker status from histologic images may rely on correlations with clinicopathologic features rather than learning the true causal effect of the biomarker, according to findings published in Nature Biomedical Engineering. According to Dawood et al, this means the AI tools use "shortcuts" to detect biomarkers rather than capturing the underlying biology, potentially making them unreliable in patient care.
“It’s a bit like judging a restaurant’s quality by the queue of people waiting to get in: it’s a useful shortcut, but it’s not a direct measure of what’s happening in the kitchen,” stated study author Fayyaz ul Amir Afsar Minhas, PhD, Associate Professor and Principal Investigator of the Predictive Systems in Biomedicine Lab in the Department of Computer Science, University of Warwick. “Many AI pathology models are doing the same thing, relying on correlations between biomarkers or on obvious tissue features, rather than isolating biomarker-specific signals. And when conditions change, these shortcuts often fall apart.”
“This study highlights a critical point about the rollout of AI in medicine: to deliver real and lasting impact, the value of AI-based clinically important predictions must be judged through rigorous, bias-aware evaluation, rather than relying solely on headline accuracies that fail to account for confounding effects,” said study author Nasir Rajpoot, PhD, Professor of Computational Pathology and the Founding Director of the Tissue Image Analytics Centre at the University of Warwick as well as the Chief Executive Officer of Warwick spin-out Histofy.
Study Methods
The researchers analyzed more than 8,000 tissue samples from patients with breast, colorectal, lung, and endometrial cancers, evaluating the performance of several leading deep-learning approaches for determining biomarker status. Using permutation testing and stratification analyses, they showed how a model's apparent accuracy for one biomarker can be inflated or degraded by the status of other biomarkers, owing to interdependencies such as patterns of mutual exclusivity and co-occurrence.
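The study's code is not reproduced here, but the logic of a within-stratum permutation test can be sketched in a few lines of Python. In this simulation (all cohort sizes, rates, and model scores are illustrative assumptions, not data from the study), biomarker labels are shuffled only inside each subgroup defined by a correlated biomarker, so any performance that survives the shuffling must reflect signal beyond what the confounder already explains:

```python
# Illustrative sketch only (not the study's code): a within-stratum
# permutation test for whether a model's score carries biomarker signal
# beyond a correlated, confounding biomarker.
import random

random.seed(1)

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) identity; assumes few ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n1 = sum(labels)
    n0 = len(labels) - n1
    return (rank_sum - n1 * (n1 + 1) / 2) / (n1 * n0)

def within_stratum_pvalue(scores, labels, strata, n_perm=500):
    """Permute labels only within each stratum of the confounder."""
    observed = auc(scores, labels)
    groups = {}
    for i, g in enumerate(strata):
        groups.setdefault(g, []).append(i)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        for idx in groups.values():
            vals = [shuffled[i] for i in idx]
            random.shuffle(vals)
            for i, v in zip(idx, vals):
                shuffled[i] = v
        if auc(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Simulated cohort: a confounding biomarker enriches for the target one.
n = 600
strata = [i < n // 3 for i in range(n)]
labels = [int(random.random() < (0.6 if s else 0.05)) for s in strata]

# A "shortcut" model scores only the confounder; a "genuine" model
# scores the target biomarker itself.
shortcut = [random.gauss(1.0 if s else 0.0, 0.5) for s in strata]
genuine = [random.gauss(1.0 if y else 0.0, 0.5) for y in labels]

p_shortcut = within_stratum_pvalue(shortcut, labels, strata)
p_genuine = within_stratum_pvalue(genuine, labels, strata)
print(f"shortcut p = {p_shortcut:.3f}, genuine p = {p_genuine:.3f}")
```

In this toy setup, both models achieve similar aggregate AUCs, but only the genuine model's performance survives the within-stratum shuffle: the shortcut model's p-value is large because its score is exchangeable with the permuted labels once the confounder is held fixed.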
Key Findings
Investigators determined that interdependencies between biomarkers can influence the predictive performance of machine-learning models: when these relationships are ignored during development, the models learn the aggregated impact of interdependent biomarkers rather than the true patterns associated with any single biomarker.
When looking specifically at BRAF mutations in colorectal cancer samples, for example, they found that the AI tools exploited the association between BRAF and microsatellite instability (MSI) status to predict the presence of BRAF mutations, rather than identifying a BRAF-specific signal. “A model that cannot disentangle MSI-[high status] from BRAF status may achieve high aggregate Area Under the Receiver Operating Characteristic curve, but lacks clinical utility, as confusing the two would misguide treatment selection. This example underscores the broader need for bias-aware evaluation: predictors must be assessed not only for overall accuracy but also for their ability to distinguish correlated biomarkers with divergent therapeutic pathways,” the study authors wrote in their report.
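This failure mode can be made concrete with a small simulation (all prevalence rates and scores below are illustrative assumptions, not data from the study): a score that tracks MSI status alone still achieves a high overall AUC for BRAF, because the two biomarkers co-occur, while restricting evaluation to the MSI-high subgroup exposes the shortcut.

```python
# Illustrative sketch only (simulated data, not from the study): an
# MSI-only score looks like a good BRAF predictor in aggregate but is
# no better than chance once evaluation is stratified by MSI status.
import random

random.seed(0)

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) identity; assumes few ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n1 = sum(labels)
    n0 = len(labels) - n1
    return (rank_sum - n1 * (n1 + 1) / 2) / (n1 * n0)

# Simulated cohort: BRAF mutations occur mostly in MSI-high tumors
# (the 15%, 60%, and 2% rates are assumptions for illustration).
cohort = []
for _ in range(2000):
    msi = random.random() < 0.15                    # ~15% MSI-high
    braf = int(random.random() < (0.6 if msi else 0.02))
    score = random.gauss(1.0 if msi else 0.0, 0.5)  # model sees MSI only
    cohort.append((score, braf, msi))

overall = auc([s for s, _, _ in cohort], [y for _, y, _ in cohort])
msi_high = [(s, y) for s, y, m in cohort if m]
within = auc([s for s, _ in msi_high], [y for _, y in msi_high])
print(f"overall BRAF AUC: {overall:.2f}; within MSI-high: {within:.2f}")
```

The high aggregate AUC says nothing about the model's ability to separate BRAF-mutant from BRAF-wild-type tumors once MSI status is held fixed, which is precisely the distinction the authors argue a clinically useful predictor must make.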
They also cautioned that if the distribution of these interdependent factors shifts in a test cohort, model performance could be significantly degraded within the specific patient subgroups affected by the shift.
The study authors did suggest that AI tools can still be valuable for cancer research and treatment decision-making, but that they should be used with caution. Going forward, they recommend that a stratification-based evaluation framework be used to report bias and to support the development of more rigorous, trustworthy models in cancer diagnostics.
“This research is not a condemnation of AI in pathology. It is a wake-up call. Current models may perform well in controlled settings but rely on statistical shortcuts rather than genuine biological understanding. Until more robust evaluation standards are in place, these tools should not be seen as replacements for molecular testing, and it is essential that clinicians and researchers understand their limitations and use them with appropriate caution,” Dr. Minhas concluded.
DISCLOSURE: Dr. Branson works for GSK. For full disclosures of the study authors, visit nature.com.

