dataset versions

v2.1	Documentation	Data Dictionary
v2.0	Documentation	Data Dictionary
v1	Documentation	Data Dictionary

Identifying high-risk breast cancer using digital pathology images: A Nightingale Open Science dataset

Authors: Carlo Bifulco¹, Brian Piening¹, Tucker Bower¹, Ari Robicsek¹, Roshanthi Weerasinghe¹, Soohee Lee¹, Nick Foster², Nathan Juergens², Josh Risley², Katy Haynes², Ziad Obermeyer²,³

¹ Providence Cancer Institute
² Nightingale Open Science
³ University of California, Berkeley

Lead Nightingale analysts: Nick Foster, Nathan Juergens

When using this resource, please cite:
Carlo Bifulco, Brian Piening, Tucker Bower, Ari Robicsek, Roshanthi Weerasinghe, Soohee Lee, Nick Foster, Nathan Juergens, Josh Risley, Senthil Nachimuthu, Katy Haynes, and Ziad Obermeyer. 2021. Identifying High-risk Breast Cancer Using Digital Pathology Images: A Nightingale Open Science Dataset. DOI:https://doi.org/10.48815/N5159B

Additionally, please cite:
Sendhil Mullainathan and Ziad Obermeyer. 2022. Solving medicine’s data bottleneck: Nightingale Open Science. Nature Medicine 28, 5 (May 2022), 897–899. DOI:https://doi.org/10.1038/s41591-022-01804-4

BibTeX

@dataset{brca-psj-path,
  author = {Bifulco, Carlo and Piening, Brian and Bower, Tucker and Robicsek, Ari and Weerasinghe, Roshanthi and Lee, Soohee and Foster, Nick and Juergens, Nathan and Risley, Josh and Nachimuthu, Senthil and Haynes, Katy and Obermeyer, Ziad},
  title = {Identifying High-risk Breast Cancer Using Digital Pathology Images: A Nightingale Open Science Dataset},
  publisher = {Nightingale Open Science},
  year = {2021},
  doi = {10.48815/N5159B},
  url = {https://doi.org/10.48815/N5159B}
}

APA

Bifulco, C., Piening, B., Bower, T., Robicsek, A., Weerasinghe, R., Lee, S., Foster, N., Juergens, N., Risley, J., Nachimuthu, S., Haynes, K., & Obermeyer, Z. (2021). Identifying High-risk Breast Cancer Using Digital Pathology Images: A Nightingale Open Science Dataset [Data set]. Nightingale Open Science. https://doi.org/10.48815/N5159B

Chicago

Bifulco, Carlo, Brian Piening, Tucker Bower, Ari Robicsek, Roshanthi Weerasinghe, Soohee Lee, Nick Foster, et al. 2021. “Identifying High-Risk Breast Cancer Using Digital Pathology Images: A Nightingale Open Science Dataset.” Nightingale Open Science. https://doi.org/10.48815/N5159B.

Vancouver

Bifulco C, Piening B, Bower T, Robicsek A, Weerasinghe R, Lee S, et al. Identifying High-risk Breast Cancer Using Digital Pathology Images: A Nightingale Open Science Dataset [Internet]. Nightingale Open Science; 2021. Available at: https://doi.org/10.48815/N5159B

BibTeX

@article{nightingale2022,
  author = {Mullainathan, Sendhil and Obermeyer, Ziad},
  title = {Solving medicine's data bottleneck: Nightingale Open Science},
  journal = {Nature Medicine},
  year = {2022},
  month = may,
  day = {01},
  volume = {28},
  number = {5},
  pages = {897-899},
  issn = {1546-170X},
  doi = {10.1038/s41591-022-01804-4},
  url = {https://doi.org/10.1038/s41591-022-01804-4}
}

APA

Mullainathan, S., & Obermeyer, Z. (2022). Solving medicine’s data bottleneck: Nightingale Open Science. Nature Medicine, 28(5), 897–899. https://doi.org/10.1038/s41591-022-01804-4

Chicago

Mullainathan, Sendhil, and Ziad Obermeyer. 2022. “Solving Medicine’s Data Bottleneck: Nightingale Open Science.” Nature Medicine 28 (5): 897–99. https://doi.org/10.1038/s41591-022-01804-4.

Vancouver

Mullainathan S, Obermeyer Z. Solving medicine’s data bottleneck: Nightingale Open Science. Nature Medicine [Internet]. May 1, 2022;28(5):897–9. Available at: https://doi.org/10.1038/s41591-022-01804-4

The problem

Every year, 40 million women get a mammogram; some go on to have an invasive biopsy to better examine a concerning area. Underneath these routine tests lies a deep—and disturbing—mystery. Since the 1990s, we have found far more ‘cancers’, which has in turn prompted vastly more surgical procedures and chemotherapy. But death rates from metastatic breast cancer have hardly changed.

When a pathologist looks at a biopsy slide, she is looking for known signs of cancer: tubules, cells with atypical-looking nuclei, evidence of rapid cell division. These features, first identified in 1928, still underlie critical decisions today: which women must receive urgent treatment with surgery and chemotherapy? And which can be prescribed “watchful waiting”, sparing them invasive procedures for cancers that would not harm them?

Dataset overview

There is already evidence that algorithms can predict, on the basis of the biopsy image, which cancers will metastasize and harm patients, which would help doctors make this decision. Fascinatingly, these algorithms also home in on features that humans neglect, for example the nature of the non-cancerous tissue surrounding the tumor. But to date, the datasets linking biopsy images to patient outcomes (metastasis, death) have been far smaller than what is needed to apply modern approaches.

This dataset will link 175,000 biopsy slides from 11,000 unique patients to cancer registry data on cancer stage, electronic health record data on presence of metastasis, and Social Security data on mortality. Linking these rich biopsy data to outcome labels will allow researchers to train algorithms to identify patients at high risk of poor outcomes (metastasis, death), and compare them to the pathologist’s initial staging decision.

Each observation in the dataset corresponds to a microscopy image from a breast biopsy specimen, collected at the Providence Cancer Institute (Portland, OR) between January 1st, 2010 and December 31st, 2020. We retrieved the physical microscope slides from the Institute’s biospecimen repository and digitized each slide at 40x magnification as a NanoZoomer Digital Pathology Image (NDPI) file on a Hamamatsu NanoZoomer S360. We were very fortunate to work with Hamamatsu on this project: their state-of-the-art scanners are built on a foundation of 15 years of product innovation in digital pathology, and another 65 years of photonics experience. The resulting files contain the high-resolution image of the slide in addition to several down-sampled (lower-resolution) versions. These multiple resolutions allow a pathologist to examine the entire slide and then quickly zoom into areas of interest at a higher resolution.
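As a rough illustration of how the down-sampled versions can be used, the sketch below picks the coarsest stored level that still meets a requested resolution. The downsample factors listed are hypothetical and are not read from an actual NDPI file; slide-reading libraries such as OpenSlide report the real per-file values. The function name is also only illustrative.

```python
from bisect import bisect_right


def best_level_for_downsample(level_downsamples, requested):
    """Return the index of the coarsest level whose downsample factor
    does not exceed the requested one (level 0 is full resolution)."""
    if requested < level_downsamples[0]:
        return 0
    # level_downsamples is assumed to be sorted ascending, e.g. [1, 2, 4, ...]
    return bisect_right(level_downsamples, requested) - 1


# Hypothetical pyramid: each level halves the resolution of the previous one.
levels = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# A 40x scan viewed at roughly 10x corresponds to a downsample factor of 4.
level_10x = best_level_for_downsample(levels, 40 / 10)

# A whole-slide overview can use the coarsest level available.
level_overview = best_level_for_downsample(levels, 1000)
```

Reading a coarse level first and only then fetching full-resolution regions keeps memory use manageable, since a single slide at full resolution can be very large.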

Our partners

Providence is a not-for-profit health care system operating in seven states and serves as the parent organization for 100,000 caregivers. The combined system includes 51 hospitals, 829 clinics, and other health, education and social services across Washington, Oregon, California, Alaska, Montana, New Mexico, and Texas.

This dataset was conceived of and created by Carlo Bifulco, MD, Director of Molecular Pathology and Pathology Informatics, and Brian Piening, PhD, Technical Director of Clinical Genomics, with thanks to the leadership of Ari Robicsek, Chief Medical Analytics Officer at Providence. We are particularly proud of this dataset because it holds the promise of targeting new patterns in breast cancer tumors, providing insight into which patients may be at risk and need preventive treatment.

We are very grateful to Hamamatsu, developers of the NanoZoomer S360 platform, who supported this work with a grant from their Product Marketing Division. Hamamatsu cares deeply about deploying products and technology that can empower researchers and advance patient outcomes and has been a key collaborator here.

Dataset details

Versions

This dataset (v1) contains images and outcomes for 24,939 biopsy slides that correspond to 1,648 cases ranging from late 2017 to 2020. At the time of launch, these images already occupy 37 TB of space, and we’re working on transferring the next batch as soon as possible.

What’s next for v1.1 (released: April 2022): Images and outcomes for 1,500 more biopsies. This will add 41,000 more slides to the dataset and expand it to span cases from 2016 to 2020.

[Figure: number of slides by version. v1: 24,939; v1.1: 68,000; full dataset: 175,000]

What’s next for v2 (released: April 2022): A critical input to these predictions is data on cancer treatments: clearly, women who received treatment for breast cancer must be handled differently in predictions from women who were not treated, because treatment decreases the likelihood of metastasis and death. While coarse treatment approaches can be inferred from the stage assigned, we will add more data on procedures (such as surgery or chemotherapy).

We have currently completed linkage of biopsy specimens to the hospital’s cancer registry data. This captures cancer stage for women who received their initial diagnosis at the cancer center, but not those who were referred after an initial diagnosis elsewhere. In the next version, we will add linkage to the state-level cancer registry, to capture additional staging information from other facilities.

Dataset schema

[Diagram: Dataset Observations Connection to Key Outcomes]

Dataset construction and key outcome variables are shown in the diagram above. A note on color choices: the burnt sienna (orange) indicates the node that corresponds to the observations (rows) in the dataset, and the grape (purple) indicates key patient outcomes.

Key variables

Metastasis:

We identified patients in our cohort who had a metastatic disease diagnosis using the encounters and diagnosis tables in Providence’s EPIC system. To find metastatic diagnoses, we searched for instances with ICD-9 codes 174.xx or 175.xx (for primary breast malignancy) that also had codes 196.xx, 197.xx, or 198.xx recorded on the same date (as previously described in the literature [1, 2]). The earliest such diagnosis for each of these patients is used. Of note, the earliest metastatic diagnosis may predate a given biopsy, because this dataset is representative of all the biopsies for breast cancer at Providence, and a biopsy may be involved in determining whether there is a recurrence or progression of cancer.

This method is ‘strict’ in the sense that it requires a breast cancer diagnosis to be present on the same day as a metastatic diagnosis. A less strict definition would be to look only for the presence of a metastatic code (in case the coding of breast cancer was implied). In any case, the raw data are provided and researchers can use the definitions they prefer.
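The two definitions can be compared with a short sketch like the one below. It is only an illustration of the rule as described, not the code used to build the dataset: the input layout (a list of date and code pairs per patient), the dotted code format, and the function name are all assumptions.

```python
from datetime import date

PRIMARY_PREFIXES = ("174", "175")            # primary breast malignancy
METASTATIC_PREFIXES = ("196", "197", "198")  # metastatic codes named in the text


def first_metastatic_date(diagnoses, strict=True):
    """diagnoses: iterable of (diagnosis_date, icd9_code) for one patient.
    Returns the earliest qualifying date, or None if there is none."""
    codes_by_date = {}
    for dx_date, code in diagnoses:
        codes_by_date.setdefault(dx_date, set()).add(code)

    qualifying = []
    for dx_date, codes in codes_by_date.items():
        has_metastatic = any(c.startswith(METASTATIC_PREFIXES) for c in codes)
        has_primary = any(c.startswith(PRIMARY_PREFIXES) for c in codes)
        # Strict: both code families on the same date. Loose: metastatic code alone.
        if has_metastatic and (has_primary or not strict):
            qualifying.append(dx_date)
    return min(qualifying) if qualifying else None


# Made-up patient history for illustration only.
patient = [
    (date(2018, 3, 1), "174.9"),
    (date(2018, 9, 12), "198.5"),   # metastatic code without a breast code that day
    (date(2019, 1, 20), "174.9"),
    (date(2019, 1, 20), "197.0"),   # both code families on the same date
]

strict_date = first_metastatic_date(patient, strict=True)   # 2019-01-20
loose_date = first_metastatic_date(patient, strict=False)   # 2018-09-12
```

In this made-up history the loose definition dates the first metastatic diagnosis about four months earlier than the strict one, which is the kind of difference researchers may want to check when choosing a definition.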

v1 Dataset metastatic diagnosis and mortality outcomes

N Biopsies	1,648
N Patients	1,436
Years after biopsy	First Metastatic Diagnosis (strict)	Mortality
Before biopsy	0.91%	0%
0-1	3.3%	0.55%
1	1.2%	1.4%
2	0.73%	0.73%
3	0.24%	0.73%
4	0.06%	0.00%
Total	6.4%	3.4%

Notes: Percentages are calculated as percent of biopsies. These cases are drawn from later dates in our overall dataset (2017 to 2020), so are ‘right-censored’ when it comes to longer follow-up times. We will be adding more as they are scanned and transferred.
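One way to account for right-censoring when building labels is to mark an outcome as unknown whenever a biopsy has not yet been followed for the full horizon. The sketch below is a minimal example under stated assumptions: the data cutoff date is hypothetical, the function name is illustrative, and events dated before the biopsy are simply not counted.

```python
from datetime import date


def outcome_within(biopsy_date, event_date, cutoff_date, years):
    """Return True if the event occurred within `years` of the biopsy,
    False if follow-up covers the horizon with no event,
    and None if the biopsy is right-censored for this horizon."""
    horizon_days = round(365.25 * years)
    if event_date is not None:
        days_to_event = (event_date - biopsy_date).days
        if 0 <= days_to_event <= horizon_days:
            return True
    if (cutoff_date - biopsy_date).days >= horizon_days:
        return False
    return None


cutoff = date(2021, 6, 30)  # hypothetical last date of follow-up

event_seen = outcome_within(date(2018, 1, 15), date(2019, 3, 1), cutoff, years=2)
too_recent = outcome_within(date(2020, 9, 1), None, cutoff, years=2)
no_event = outcome_within(date(2018, 1, 15), None, cutoff, years=2)
```

Dropping or separately handling the `None` cases avoids treating a recent biopsy with short follow-up as if it were a confirmed negative.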

Mortality:

We identified patients with records of death using multiple data sources. The three sources are EPIC, the cancer registry, and the Social Security Death Index. Unfortunately, these sources are not death certificates, so we do not know the cause of death.
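How the three sources are reconciled is not described here. As a minimal sketch, assuming a death record in any one source is sufficient and that the earliest recorded date is preferred when sources disagree, the combination could look like this (the function and argument names are hypothetical):

```python
from datetime import date


def combine_death_records(epic=None, registry=None, ssdi=None):
    """Return (died, earliest_date, sources) from per-source death dates.
    Each argument is a date, or None when that source has no record."""
    records = {"epic": epic, "registry": registry, "ssdi": ssdi}
    found = {name: d for name, d in records.items() if d is not None}
    if not found:
        return False, None, []
    # Assumption: any source counts, and the earliest date wins on conflict.
    return True, min(found.values()), sorted(found)


# Made-up example: two sources disagree by one day.
result = combine_death_records(registry=date(2020, 5, 2), ssdi=date(2020, 5, 1))
no_record = combine_death_records()
```

Keeping the list of contributing sources makes it easy to test how sensitive results are to any single source.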

Staging:

Using the Providence cancer registry, we identified the first recorded cancer stage for a patient within a year after the biopsy. Cancer stage typically incorporates information from a biopsy, but additional information is needed to establish the TNM (Tumor, Node, Metastasis) stage. For more information on registries and staging, see the National Cancer Institute’s description of cancer registries and staging.

Staging information is only available for the subset of biopsies in patients whose cancer was first diagnosed, or formally re-staged, at Providence and thus recorded in their cancer registry. For example, if the biopsy is the result of a patient seeking a second opinion after receiving initial treatment at another health system, the stage will not be included in the dataset. Some cases were eligible for staging at Providence, but no stage was recommended (indicated with “No Stage Rec,” as opposed to missing stage information).
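The stage-selection rule can be sketched as follows. This is an illustration only: the record layout and function name are assumptions, as are the choices to treat "within a year" as 365 days and to count a stage recorded on the biopsy date itself.

```python
from datetime import date, timedelta


def first_stage_after_biopsy(biopsy_date, stage_records, window_days=365):
    """stage_records: iterable of (stage_date, stage) from the registry.
    Returns the earliest stage recorded within the window, or None
    when no registry stage falls inside it (missing stage information)."""
    window_end = biopsy_date + timedelta(days=window_days)
    eligible = [(d, s) for d, s in stage_records if biopsy_date <= d <= window_end]
    if not eligible:
        return None
    return min(eligible, key=lambda record: record[0])[1]


# Made-up registry entries for one patient.
records = [
    (date(2019, 2, 1), "II"),
    (date(2018, 6, 10), "I"),
    (date(2017, 1, 1), "0"),   # before the biopsy, so ignored
]

stage = first_stage_after_biopsy(date(2018, 5, 20), records)        # "I"
missing = first_stage_after_biopsy(date(2021, 1, 1), records)       # None
```

A value such as “No Stage Rec” would pass through unchanged as a stage, which keeps it distinct from the `None` returned for missing information.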

v1 First cancer stage recorded for biopsies

	Stage count	Metastatic Diagnosis (strict)	Mortality
No Stage Rec	199	13%	5.5%
0	131	0.8%	0.76%
I	629	3.3%	3.2%
II	77	14%	3.9%
III	19	47%	16%
IV	17	59%	24%
Total	1072	7.2%	3.9%
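As a quick consistency check, the Total row can be reproduced as a count-weighted average of the per-stage rows. The sketch below uses only the rounded figures printed in the table, so agreement is expected to the displayed precision rather than exactly.

```python
# (stage, count, metastatic diagnosis %, mortality %) as printed in the table
rows = [
    ("No Stage Rec", 199, 13.0, 5.5),
    ("0", 131, 0.8, 0.76),
    ("I", 629, 3.3, 3.2),
    ("II", 77, 14.0, 3.9),
    ("III", 19, 47.0, 16.0),
    ("IV", 17, 59.0, 24.0),
]

total_count = sum(count for _, count, _, _ in rows)

# Weight each stage's percentage by the number of biopsies in that stage.
metastatic_total = sum(count * met for _, count, met, _ in rows) / total_count
mortality_total = sum(count * mort for _, count, _, mort in rows) / total_count

print(total_count)                  # 1072
print(round(metastatic_total, 1))   # 7.2
print(round(mortality_total, 1))    # 3.9
```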
