Big Data and Vulnerable Populations: Addressing the Gap in Lung Cancer Screening

Recent advances in medical imaging have led to more accurate detection and management of early thoracic diseases such as lung cancer, chronic obstructive pulmonary disease (COPD), and cardiovascular disease, three of the top four leading causes of death in the United States. Unfortunately, if not implemented equitably, lung cancer screening also has the potential to exacerbate health disparities, according to David F. Yankelevitz, MD, Professor of Radiology and Director of the Lung Biopsy Service at the Icahn School of Medicine at Mount Sinai.

During the Quantitative Imaging Workshop XVIII, Dr. Yankelevitz discussed the need for broad participation in imaging and clinical data donation to ensure that robust and efficient informatics tools are developed to better implement thoracic screening.1

Screening and Data Parameters

As Dr. Yankelevitz explained, protecting communities with lung cancer screening will require a useful database of low-dose computed tomography (CT) scans suitable for quantitative analysis, but questions about data collection, data quality, and data protection remain. With respect to data collection, Dr. Yankelevitz emphasized the importance of first determining the purpose of the images.

“Whether the focus is on lung cancer, emphysema, coronary calcium, or osteoporosis, researchers should define what they are collecting data for,” said Dr. Yankelevitz. “Is it for advanced image processing, visualization, quantitation, or detection?”

Image quality needs to be established as well. The initial focus of the database may evolve over time as it is used for other health measures, said Dr. Yankelevitz. However, variability in image quality can render a database unusable.

The size of the database itself can also affect utility. Using a large database of more than 50,000 patients, researchers were able to identify COPD in approximately 25% of patients, he reported. What’s more, nearly 75% of these patients with COPD were found to have undiagnosed emphysema too, and many of these patients had severe disease.

Finally, Dr. Yankelevitz noted issues surrounding data protection. Academic institutions cannot typically give away large amounts of data because of the associated costs, he explained, and there are a lot of considerations about who has access to these types of databases. When these databases are built, a certain amount of data is often sequestered, so it can be used for the U.S. Food and Drug Administration.

The Benefits (and Harms) of Data Sharing

In the ensuing panel discussion, Heather Pierce, JD, MPH, Director of Policy at the Association of American Medical Colleges, noted that the industry is increasingly moving toward more data sharing with the goal of advancing health and science. According to Ms. Pierce, federal agencies such as the National Institutes of Health now require data-sharing plans to be submitted by every grantee.

“At some point, plans that do not demonstrate a real effort to share data with other scientists will be deemed insufficient in a grant,” Ms. Pierce observed.

About 5 years ago, the FAIR Guiding Principles for scientific data management and stewardship were published in Scientific Data.2 As Ms. Pierce explained, more and more researchers are becoming aware of these principles, whose acronym stands for Findable, Accessible, Interoperable, and Reusable. As a result, she said, such efforts have helped to “normalize the sharing of data from all different sides.”

“There’s an argument to be made that once data have been used for their clinical purpose, there is almost an ethical imperative to use that data for secondary purposes, if possible,” she said. “However, we also need to have humility and remind ourselves that the data may have been collected from individuals who have concerns or questions about that later.”

Although some of these issues can be addressed contractually with data use agreements and claims filed against those who breach said agreements, this approach may be insufficient. “Once data have been released and someone has been re-identified, it’s not going to play well from a societal perspective, regardless of the contractual resolution,” Ms. Pierce continued. “We would be better off considering the pieces of data that people are okay sharing in the first place.”

Finally, Ms. Pierce acknowledged concerns about credit and ownership by researchers and institutions that collect data. “The altruistic nature of improving science gets us only so far, which is why we’ve been working on how to better reward, credit, and track data sharing,” she said.


According to Ms. Pierce, this approach involves the potential use of persistent identifiers on data sets that could be cited as individual citations in papers and other publications. Combinations of data sets will then be able to trace back to the originators of those data and be incorporated into promotion and tenure packages.
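For readers who work with data, the idea of tracing combined data sets back to their originators can be illustrated with a minimal sketch. It is not a description of any system Ms. Pierce referred to. The `Dataset` class, the `combine`, `trace_originators`, and `citations` helpers, the `pid:example-...` identifier format, and the institution names are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A data set labeled with a persistent identifier (hypothetical scheme)."""
    pid: str                          # identifier string; the format is assumed
    originators: tuple[str, ...]      # who collected the data
    sources: tuple[Dataset, ...] = () # data sets this one was combined from


def combine(pid: str, *parts: Dataset) -> Dataset:
    """Create a derived data set that records which data sets it came from."""
    return Dataset(pid=pid, originators=(), sources=tuple(parts))


def trace_originators(ds: Dataset) -> set[str]:
    """Walk back through every source and collect all originators."""
    found = set(ds.originators)
    for src in ds.sources:
        found |= trace_originators(src)
    return found


def citations(ds: Dataset) -> list[str]:
    """Return the sorted identifiers of the original (non-derived) data sets."""
    if not ds.sources:
        return [ds.pid]
    pids: set[str] = set()
    for src in ds.sources:
        pids.update(citations(src))
    return sorted(pids)


# Illustrative data only: two contributed data sets pooled into one.
site_a = Dataset("pid:example-a", ("Institution A",))
site_b = Dataset("pid:example-b", ("Institution B", "Institution C"))
pooled = combine("pid:example-pooled", site_a, site_b)

print(sorted(trace_originators(pooled)))  # everyone who should receive credit
print(citations(pooled))                  # identifiers a paper could cite
```

Because each derived data set keeps references to its sources, credit can be recovered however many times the data are recombined, which is the property a promotion and tenure package would rely on.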

“The goal is to show not only one’s original work and where it was published, but to focus instead on the impact factor of those journal articles, to be able to demonstrate in a real-world way one’s actual impact on science,” Ms. Pierce concluded. 

DISCLOSURE: Dr. Yankelevitz is a named inventor on a number of patents and patent applications related to the evaluation of diseases of the chest including measurement of nodules and has received financial compensation for licensing of these patents; he is a consultant and co-owner of Accumetra, a private company developing tools to improve the quality of CT imaging; and he serves on advisory boards for Pfizer, Genentech, AstraZeneca, and Carestream. Ms. Pierce reported no conflicts of interest.


1. Yankelevitz D, moderator: Big data and vulnerable populations: Addressing the gap. Quantitative Imaging Workshop XVIII. Presented November 4, 2021.

2. Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3:160018, 2016.