Investigators have found that patient characteristics such as age and race may influence false-positive results from artificial intelligence (AI)-interpreted screening mammograms, according to a recent study published by Nguyen et al in Radiology.
Background
Although preliminary data suggested that AI algorithms applied to screening mammograms may improve radiologists’ diagnostic performance for breast cancer detection and reduce interpretation time, AI may also perpetuate disparities among certain patient populations.
“AI has become a resource for radiologists to improve their efficiency and accuracy in reading screening mammograms while mitigating reader burnout. However, the impact of patient characteristics on AI performance has not been well studied,” explained lead study author Derek L. Nguyen, MD, Assistant Professor at Duke University. “There are few demographically diverse databases for AI algorithm training, and the [U.S. Food and Drug Administration] (FDA) does not require diverse data sets for validation. Because of the differences among patient populations, it’s important to investigate whether AI software can accommodate and perform at the same level for different patient ages, races, and ethnicities,” he stressed.
Study Methods and Results
In the recent retrospective study, the investigators identified patients with negative digital breast tomosynthesis screening examinations performed between 2016 and 2019. They then followed these patients for 2 years and confirmed that none were diagnosed with a breast malignancy.
The investigators then randomly selected a subset of 4,855 patients (median age, 54 years) who were broadly distributed across four racial and ethnic groups: White (n = 1,316), Black (n = 1,261), Asian (n = 1,351), and Hispanic (n = 927). They used an FDA-approved AI algorithm to evaluate the patients’ screening mammograms and to generate both case and risk scores, defined as the certainty of malignancy and the 1-year subsequent malignancy risk, respectively.
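As a point of reference for how such scores are typically consumed, the minimal Python sketch below shows one way a continuous case score could be reduced to a binary “suspicious” flag. The 0-to-100 scale, the cutoff value, and all field names are assumptions made for illustration; the commercial algorithm’s actual interface and thresholds are not described in the study.

```python
# A minimal, hypothetical sketch of how a continuous AI case score
# could be converted into a binary "suspicious" flag. The 0-100 scale,
# the cutoff value, and the field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ScreeningExam:
    case_score: float  # assumed 0-100: algorithm's certainty of malignancy
    risk_score: float  # assumed 0-100: estimated 1-year subsequent malignancy risk

def is_flagged(exam: ScreeningExam, cutoff: float = 50.0) -> bool:
    """Treat any exam whose case score meets an assumed cutoff as suspicious."""
    return exam.case_score >= cutoff

exam = ScreeningExam(case_score=62.0, risk_score=18.0)
print(is_flagged(exam))  # True -> this exam would be flagged as suspicious
```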
“Our goal was to evaluate whether an AI algorithm’s performance was uniform across age, breast density types, and different patient race [and] ethnicities,” Dr. Nguyen detailed.
Because all of the mammograms in the study were negative for malignancy, any examination flagged as suspicious by the AI algorithm was considered a false-positive result. The investigators found that false-positive case scores were significantly more likely in Black patients and in older patients (aged 71–80) and less likely in Asian patients and in younger patients (aged 41–50), compared with White patients and with patients aged 51 to 60, respectively.
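To make the form of this comparison concrete, the following sketch (not the authors’ analysis) fits a logistic regression to simulated exam-level data, with White race and ages 51 to 60 set as the reference levels, as in the study’s reported comparisons. Exponentiated coefficients are odds ratios; values above 1 indicate groups in which a false-positive flag is more likely than in the reference group. All data values, probabilities, and column names here are invented.

```python
# A hypothetical illustration of comparing false-positive likelihood
# across groups. The per-group probabilities are made up for this sketch.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 4000

# Simulated exams: race and age group drawn at random, with invented
# false-positive probabilities that vary by group.
race = rng.choice(["White", "Black", "Asian", "Hispanic"], size=n)
age = rng.choice(["41-50", "51-60", "61-70", "71-80"], size=n)
base = {"White": 0.10, "Black": 0.14, "Asian": 0.07, "Hispanic": 0.10}
bump = {"41-50": -0.03, "51-60": 0.0, "61-70": 0.02, "71-80": 0.05}
p = np.array([base[r] + bump[a] for r, a in zip(race, age)])
fp = rng.binomial(1, p)  # every exam is cancer-free, so a flag is a false positive

df = pd.DataFrame({"false_positive": fp, "race": race, "age_group": age})

# Logistic model with White and ages 51-60 as the reference levels,
# mirroring the reference groups reported in the study.
model = smf.logit(
    "false_positive ~ C(race, Treatment('White'))"
    " + C(age_group, Treatment('51-60'))",
    data=df,
).fit(disp=False)

print(np.exp(model.params).round(2))  # odds ratios vs the reference groups
```

In practice, such a model would also need to account for breast density and the other patient characteristics the study examined.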
Conclusions
The investigators emphasized that health-care institutions should better understand the patient populations they serve before purchasing AI algorithms intended to interpret screening mammograms.
“This study is important because it highlights that any AI software purchased by a health-care institution may not perform equally across all patient ages, races, ethnicities, and breast densities. Moving forward, I think AI software upgrades should focus on ensuring demographic diversity,” underscored Dr. Nguyen. “Having a baseline knowledge of … institutions’ demographics and asking the vendor about the ethnic and age diversity of their training data may help [health-care providers] understand the limitations [they may] face in clinical practice,” he concluded.
Disclosure: For full disclosures of the study authors, visit pubs.rsna.org.