Data dictionary v2.1

Table of contents
  1. File tree
  2. Slide biopsy mapping
  3. Outcomes
  4. Demographics
  5. Social determinants
  6. Comorbidities
  7. Treatments
  8. Pathology items
  9. Cancer diagnosis
  10. Other Diagnosis
  11. CCSR Diagnosis Categories

Dataset Changes

v2.1 Changes (Mar 2024)
  • Major change: An error was identified in the date shifting. The dates were shifted per biopsy and not per patient. This has been fixed, and the date differences between events for a patient will be the same as the date differences before de-identification.
  • All cancer diagnosis codes were added to cancer-dx.csv
  • slide-biopsy-map.csv file name changed to biopsy-slides.csv
  • Two new tables added: other-dx.csv and ccsr-dx.csv
Dataset Reorganization (Feb 2023)
  1. Demographic fields moved from outcomes.csv to demographics.csv.
  2. ndpi/ directory flattened. The year directories were removed.

File tree changes

.
└── brca-psj-path
   ├── ...    
   ├── v2
   │   ├── cancer-dx.csv
   │   ├── comorbidities.csv
+  │   ├── demographics.csv   
   │   ├── outcomes.csv
   │   ├── pathology-items.csv
   │   ├── slide-biopsy-map.csv
   │   ├── social-determinants.csv
   │   └── treatments.csv 
   └── ndpi
-      ├── 2016
       │   ├── 0035de3d-81ec-4945-a760-55518ba8b376.ndpi
       │   └── ...   
-      ├── 2017
       │   ├── 00a94273-e9ab-42f5-a47e-512a13e8603e.ndpi
       │   └── ...    
       ...   

Note

The dates in this dataset have been shifted by a random number of days. All dates for any particular patient have been shifted by the same amount in order to preserve the time duration between events.
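The effect of a per-patient shift can be illustrated with a minimal sketch. The events, the offset value, and the `shift_dates` helper below are hypothetical; the actual shifting procedure used for this dataset is not documented here.

```python
from datetime import date, timedelta


def shift_dates(events, offset_days):
    """Shift every date in one patient's events by the same number of days."""
    delta = timedelta(days=offset_days)
    return {name: d + delta for name, d in events.items()}


# Hypothetical events for a single patient (not real data).
events = {"biopsy": date(2018, 3, 1), "diagnosis": date(2018, 3, 20)}

# Hypothetical offset; one offset is applied to all of the patient's dates.
shifted = shift_dates(events, offset_days=48_900)

# The interval between two events is unchanged by the shift.
interval_before = events["diagnosis"] - events["biopsy"]
interval_after = shifted["diagnosis"] - shifted["biopsy"]
```

Because only differences between dates are preserved, durations within one patient are meaningful, while the absolute calendar dates and comparisons across patients are not.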

Male patients made up less than 2% of biopsy patients and were excluded from the dataset.

File tree

.
└── brca-psj-path
    ├── ...    
    ├── v2.1
    │   ├── biopsy-slides.csv
    │   ├── cancer-dx.csv
    │   ├── comorbidities.csv
    │   ├── demographics.csv
    │   ├── other-dx.csv       
    │   ├── outcomes.csv
    │   ├── pathology-items.csv
    │   ├── social-determinants.csv
    │   └── treatments.csv 
    └── ndpi
        ├── 0035de3d-81ec-4945-a760-55518ba8b376.ndpi
        ├── 00a94273-e9ab-42f5-a47e-512a13e8603e.ndpi
        └── ...    

Slide biopsy mapping

biopsy-slides.csv

This table contains the mapping between the digital pathology image files and the corresponding biopsies. One or more slides are produced from the tissue samples of a biopsy procedure. The number of slides for each biopsy in the dataset can vary from 1 to 100.

Column Name | Description | Sample
slide_id | Unique identifier for each digital pathology image | c9cc2d38-a042-4883-9ab1-141e7b876678
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715
slide_path | File path of the NDPI file for the slide | /path/to/{slide_id}.ndpi
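Since one biopsy can map to many slides, a common first step is to group slide paths by biopsy. The sketch below assumes the file is comma-delimited with a header row matching the column names above; the sample rows and the `slides_by_biopsy` helper are hypothetical, with shortened IDs for readability.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the biopsy-slides.csv layout (not real data).
SAMPLE_CSV = """slide_id,biopsy_id,patient_ngsci_id,slide_path
s1,b1,p1,/path/to/s1.ndpi
s2,b1,p1,/path/to/s2.ndpi
s3,b2,p1,/path/to/s3.ndpi
"""


def slides_by_biopsy(csv_file):
    """Return {biopsy_id: [slide_path, ...]} from an open CSV file object."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["biopsy_id"]].append(row["slide_path"])
    return dict(grouped)


grouped = slides_by_biopsy(io.StringIO(SAMPLE_CSV))
slide_counts = {biopsy_id: len(paths) for biopsy_id, paths in grouped.items()}
```

To run this against the real file, pass `open("biopsy-slides.csv", newline="")` in place of the `io.StringIO` object.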

Outcomes

outcomes.csv

This table contains outcomes for each biopsy case. Some patients have multiple biopsies, so a patient_ngsci_id can appear in more than one row.

Column Name | Description | Sample
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715
case_year | Year of biopsy | 2018
biopsy_dt | Date of biopsy | 2152-01-01
mortality | Whether there is a record of patient death (0: no death record; 1: death) | 1
death_dt | Date of patient death | 2155-08-24
in_registry | Whether entries were found in the Providence cancer registry for the patient that match the time of the biopsy (0: not in cancer registry; 1: in cancer registry) | 1
stage | Cancer stage for patient in the year of the biopsy | IA
strict_metastatic_dx | Whether patient has a strict metastatic diagnosis as described in the documentation (0: no; 1: yes) | 0
strict_metastatic_dx_dt | Date of first strict metastatic disease diagnosis | 2154-01-30
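As an illustration of how the date columns can be combined, the sketch below computes the number of days from biopsy to death. It assumes dates are stored as ISO strings (YYYY-MM-DD, as in the samples above) and that a missing death_dt is an empty field; both are assumptions about the file encoding, and the rows and the `days_to_death` helper are hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical rows using a subset of the outcomes.csv columns (not real data).
SAMPLE_CSV = """biopsy_id,patient_ngsci_id,biopsy_dt,mortality,death_dt
b1,p1,2152-01-01,1,2155-08-24
b2,p2,2150-06-15,0,
"""


def days_to_death(row):
    """Days from biopsy to death, or None when no death is recorded."""
    if row["mortality"] != "1" or not row["death_dt"]:
        return None
    biopsy = date.fromisoformat(row["biopsy_dt"])
    death = date.fromisoformat(row["death_dt"])
    return (death - biopsy).days


survival = {
    row["biopsy_id"]: days_to_death(row)
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
}
```

A value of None means only that no death record was found, not that the patient is known to be alive at any particular date.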

Demographics

demographics.csv

Column Name | Description | Sample
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
sex | Sex of patient | F
race | Self-identified race (1: White or Caucasian; 2: Black or African American; 3: American Indian or Alaska Native; 4: Asian; 5: Native Hawaiian or Pacific Islander; 8: other; 9: unknown) | 1
ethnicity | Self-identified ethnicity (0: Not Hispanic or Latino; 1: Hispanic or Latino; 9: unknown) | 0
birth_dt | De-identified date of birth | 2041-03-15
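The coded fields can be decoded with simple lookup tables, and because all dates for a patient share the same shift, age at biopsy can be derived from birth_dt and the biopsy_dt in outcomes.csv. The following is a minimal sketch: the record, the biopsy date, and the `age_at` helper are hypothetical, and codes are assumed to be read from the CSV as strings.

```python
from datetime import date

# Code tables copied from the column descriptions above.
RACE_LABELS = {
    "1": "White or Caucasian",
    "2": "Black or African American",
    "3": "American Indian or Alaska Native",
    "4": "Asian",
    "5": "Native Hawaiian or Pacific Islander",
    "8": "other",
    "9": "unknown",
}
ETHNICITY_LABELS = {
    "0": "Not Hispanic or Latino",
    "1": "Hispanic or Latino",
    "9": "unknown",
}


def age_at(birth_dt, event_dt):
    """Age in completed years at event_dt; both arguments are ISO date strings."""
    birth = date.fromisoformat(birth_dt)
    event = date.fromisoformat(event_dt)
    had_birthday = (event.month, event.day) >= (birth.month, birth.day)
    return event.year - birth.year - (0 if had_birthday else 1)


# Hypothetical demographics row and matching biopsy date (not real data).
demo = {"biopsy_id": "b1", "sex": "F", "race": "1", "ethnicity": "0",
        "birth_dt": "2090-03-15"}
biopsy_dt = "2152-01-01"

decoded = {
    "race": RACE_LABELS.get(demo["race"], "unrecognized code"),
    "ethnicity": ETHNICITY_LABELS.get(demo["ethnicity"], "unrecognized code"),
    "age_at_biopsy": age_at(demo["birth_dt"], biopsy_dt),
}
```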

Social determinants

social-determinants.csv

Column Name | Description | Sample
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
bmi | The last recording of BMI at or before the date of biopsy (units: kg/m²) | 25
tobacco | Documented smoking (0: no documented smoking; 1: ICD-10 codes F17.XX or Z72.0X) | 0

Comorbidities

comorbidities.csv

Comorbidities are those included in the Charlson Comorbidity Index (CCI) and were obtained from patient charts using ICD-9 and ICD-10 codes. A comorbidity was included only if the patient was diagnosed in the two years before the biopsy date.
For each included comorbidity, 0: does not have diagnosis, 1: has diagnosis.

Treatments

treatments.csv

This table contains the treatments for patients in Providence’s cancer registry. Treatment at another health system would not be recorded in this table. A helpful resource for the codes is the SEER Program Coding and Staging Manual 2021.

Column Name | Description | Sample
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
cancer_registry_dx_dt | Cancer diagnosis date | 2156-01-01
most_definitive_surgical_procedure_cd | For codes and additional detail, see SEER 2021 Manual, Breast Surgery Codes | 22
most_definitive_radiation_modality_cd | For codes and additional detail, see SEER Program Coding and Staging Manual 2021, pg. 191 | 31
surgical_margin_cd | For codes and additional detail, see SEER Program Coding and Staging Manual 2021, pg. 166 | 8
radiation_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 134a | 1
chemo_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 137b | 87
immuno_therapy_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 139b | 1
hormone_summ_cd | For codes and additional detail, see SEER 2003 Code Manual, pg. 138b | 87
{therapeutic modality}_dt | Multiple data items with the date of administered therapy (e.g., rx_chemo_dt, first_surgery_dt) | 2156-01-13
stg_dx_summ_cd | For codes and additional detail, see NAACCR archives | 2

Pathology items

pathology-items.csv

Column Name | Description | Sample
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
grade_clinical | Grade before any treatment. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 2
grade_pathological | Grade after resection. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 2
er_summary | 0: ER negative; 1: ER positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1
pr_summary | 0: PR negative; 1: PR positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1
her2_summary | 0: HER2 negative; 1: HER2 positive. For codes and additional detail, see NAACCR Site Specific Data Items, breast | 0
multigene_signature_method | For codes and additional detail, see NAACCR Site Specific Data Items, breast | 1
multigene_signature_result | For codes and additional detail, see NAACCR Site Specific Data Items, breast | X4
response_neoadjuvant_therapy | For codes and additional detail, see SEER 2021 Manual, Neoadjuvant treatment effect, breast | 2

Cancer diagnosis

cancer-dx.csv

This table contains all cancer diagnosis codes found in the Providence EHR for the patient cohort.

Column Name | Description | Sample
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715
icd9 | ICD-9 diagnosis codes | 174.4
icd10 | ICD-10 diagnosis codes | C50.411
dx_dt | Date of diagnosis (date-shifted) | 2153-04-03

Other Diagnosis

other-dx.csv

This table contains non-cancer diagnosis codes for the patient cohort. Currently, it contains only cardiovascular codes.

Column Name | Description | Sample
patient_ngsci_id | Unique patient identifier | 821a6ba7-f5aa-49d3-a4c6-313ff649b715
icd9 | ICD-9 diagnosis codes | 429.3
icd10 | ICD-10 diagnosis codes | I51.7
dx_dt | Date of diagnosis (date-shifted) | 2153-07-05
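The diagnosis tables (cancer-dx.csv and other-dx.csv) are keyed by patient, while most other tables are keyed by biopsy. One possible way to relate them is to go through the patient_ngsci_id and biopsy_dt columns of outcomes.csv. The sketch below lists, for each biopsy, the diagnosis codes dated strictly before it. The rows and the `diagnoses_before_biopsy` helper are hypothetical, and the choice of a strict "before" comparison and of preferring the ICD-10 code when both are present are assumptions.

```python
from datetime import date

# Hypothetical rows from outcomes.csv and other-dx.csv (not real data).
outcomes = [
    {"biopsy_id": "b1", "patient_ngsci_id": "p1", "biopsy_dt": "2152-01-01"},
    {"biopsy_id": "b2", "patient_ngsci_id": "p1", "biopsy_dt": "2153-09-01"},
]
other_dx = [
    {"patient_ngsci_id": "p1", "icd9": "", "icd10": "I51.7", "dx_dt": "2153-07-05"},
]


def diagnoses_before_biopsy(outcome_rows, diagnosis_rows):
    """Return {biopsy_id: [code, ...]} for diagnoses dated before each biopsy."""
    by_patient = {}
    for dx in diagnosis_rows:
        by_patient.setdefault(dx["patient_ngsci_id"], []).append(dx)

    result = {}
    for biopsy in outcome_rows:
        biopsy_dt = date.fromisoformat(biopsy["biopsy_dt"])
        result[biopsy["biopsy_id"]] = [
            dx["icd10"] or dx["icd9"]  # prefer ICD-10 when present (assumption)
            for dx in by_patient.get(biopsy["patient_ngsci_id"], [])
            if date.fromisoformat(dx["dx_dt"]) < biopsy_dt
        ]
    return result


prior_dx = diagnoses_before_biopsy(outcomes, other_dx)
```

Because a patient can have several biopsies, the same diagnosis can count as prior for one biopsy and not for another, as in the sample rows here.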

CCSR Diagnosis Categories

ccsr-dx.csv

The Clinical Classifications Software Refined (CCSR) for ICD-10-CM diagnoses aggregates more than 70,000 ICD-10-CM diagnosis codes into over 530 clinically meaningful categories. For more details and a quick reference of the CCSR categories, refer to the CCSR documentation published by the AHRQ Healthcare Cost and Utilization Project (HCUP).

This table has one row for each combination of biopsy and CCSR category, with three flags:

  • Prior - Whether the patient had any diagnosis code from the category before the biopsy.
  • Post_1yr - Whether the patient had any diagnosis code from the category within the year after the biopsy.
  • Post_1yr_plus - Whether the patient had any diagnosis code from the category past one year after the biopsy.

If all three flags are 0, the row is omitted.

Column Name | Description | Sample
biopsy_id | Unique identifier for each biopsy case | b2423e0f-b92f-44ad-8d83-c45b0066a68a
ccrs_category | CCSR category | NEO070
prior | The patient had a diagnosis in this category prior to the biopsy | 1
post_1yr | The patient had a diagnosis in this category within one year after the biopsy | 1
post_1yr_plus | The patient had a diagnosis in this category more than one year after the biopsy | 0
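Because rows with all three flags equal to 0 are omitted, a lookup for a biopsy and category that has no row should be read as "no diagnosis in any window" rather than as missing data. A minimal sketch of such a lookup follows. It assumes a comma-delimited file with a header row using the column names above; the sample rows and the `load_flags` and `get_flags` helpers are hypothetical.

```python
import csv
import io

# Hypothetical rows in the ccsr-dx.csv layout (not real data).
SAMPLE_CSV = """biopsy_id,ccrs_category,prior,post_1yr,post_1yr_plus
b1,NEO070,1,1,0
b2,NEO070,0,0,1
"""

# Flags implied by an omitted row.
NO_DIAGNOSIS = {"prior": 0, "post_1yr": 0, "post_1yr_plus": 0}


def load_flags(csv_file):
    """Return {(biopsy_id, category): {flag_name: 0 or 1}} from a CSV file object."""
    flags = {}
    for row in csv.DictReader(csv_file):
        key = (row["biopsy_id"], row["ccrs_category"])
        flags[key] = {name: int(row[name]) for name in NO_DIAGNOSIS}
    return flags


def get_flags(flags, biopsy_id, category):
    """Flags for one biopsy and category; an omitted row means all three are 0."""
    return flags.get((biopsy_id, category), dict(NO_DIAGNOSIS))


flags = load_flags(io.StringIO(SAMPLE_CSV))
present = get_flags(flags, "b1", "NEO070")
absent = get_flags(flags, "b3", "NEO070")
```

Note that this default applies only to biopsies that are part of the cohort; a biopsy_id that does not appear in outcomes.csv has no defined flags.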

Copyright © 2021-2023 Nightingale Open Science. All rights reserved.