Data dictionary v1

File tree

    .
    └── fracture-aimi-xray
        ├── ADT.csv
        ├── DX.csv
        ├── encounter.csv
        ├── flowsheet.csv
        ├── Lab
        │   ├── lab000000000000.csv
        │   ├── lab000000000001.csv
        │   ├── lab000000000002.csv
        │   ├── ...
        │   └── lab000000000027.csv
        ├── procedure.csv
        └── images
            ├── train [64540 entries]
            │   ├── patient00001
            │   │   └── study1
            │   │       └── view1_frontal.jpg
            │   ├── patient00002
            │   │   ├── study1
            │   │   │   ├── view1_frontal.jpg
            │   │   │   └── view2_lateral.jpg
            │   │   └── study2
            │   │       └── view1_frontal.jpg
            │   ├── patient00003
            │   │   └── study1
            │   │       └── view1_frontal.jpg
            │   ├── patient00004
            │   │   └── study1
            │   │       ├── view1_frontal.jpg
            │   │       └── view2_lateral.jpg
            │   ...
            ├── train.csv
            ├── valid [200 entries]
            └── valid.csv
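
The image paths in the tree above encode the patient, the study, and the view. As a minimal sketch (the function name `parse_image_path` and the filename pattern are assumptions inferred from the tree, not part of the dataset's tooling), a path can be split into its parts like this:

```python
import re
from pathlib import PurePosixPath

# Filename pattern inferred from the file tree (e.g. view1_frontal.jpg);
# treat it as an assumption rather than a documented naming rule.
_VIEW_RE = re.compile(r"view(\d+)_(\w+)\.jpg$")


def parse_image_path(path):
    """Split '.../patient00002/study1/view2_lateral.jpg' into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"path too short: {path!r}")
    patient, study, filename = parts[-3:]
    match = _VIEW_RE.match(filename)
    if not (patient.startswith("patient") and study.startswith("study") and match):
        raise ValueError(f"unexpected image path: {path!r}")
    return {
        "patient": patient,                    # kept as a string, e.g. "patient00002"
        "study": int(study[len("study"):]),    # study number within the patient
        "view": int(match.group(1)),           # view number within the study
        "orientation": match.group(2),         # e.g. "frontal" or "lateral"
    }
```

For example, `parse_image_path("images/train/patient00002/study1/view2_lateral.jpg")` yields patient `patient00002`, study 1, view 2, orientation `lateral`.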

EHR


ADT

ADT.csv contains admissions data.

    Column Name              Data Type  Num. Unique
 0  Chexpert_ID              object           50675
 1  event_id                 int64          6642690
 2  department_id            float64            343
 3  room_id                  float64           1399
 4  room_csn_id              float64           1965
 5  bed_id                   float64           4532
 6  bed_csn_id               float64           5052
 7  bed_status_c             float64              3
 8  jittered_effective_time  object            5253
 9  jittered_event_time      object            5020
  • Rows: 6,642,690
  • Columns: 10
  • Median patient has 66 events (25th: 26, 75th: 158, Max: 3936)
  • Most records (80%) fall between 2010 and 2016
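
The per-patient figures above (median, quartiles, maximum) can be recomputed by counting rows per Chexpert_ID. The following is a sketch only: the function names are hypothetical, and the quartile interpolation method used for the published figures is not stated, so the "inclusive" method here is an assumption.

```python
import csv
from collections import Counter
from statistics import quantiles


def per_patient_summary(rows, id_column="Chexpert_ID"):
    """Summarize how many rows each patient has.

    `rows` is any iterable of dict-like records, e.g. a csv.DictReader.
    """
    counts = Counter(row[id_column] for row in rows)
    values = sorted(counts.values())
    if len(values) < 2:
        raise ValueError("need at least two patients to compute quartiles")
    # "inclusive" interpolates linearly between observed counts; whether the
    # figures in this dictionary used the same method is an assumption.
    p25, median, p75 = quantiles(values, n=4, method="inclusive")
    return {"p25": p25, "median": median, "p75": p75, "max": values[-1]}


def summarize_csv(path, id_column="Chexpert_ID"):
    """Stream a CSV from disk so that rows are counted one at a time."""
    with open(path, newline="") as handle:
        return per_patient_summary(csv.DictReader(handle), id_column)
```

The same helper applies to DX.csv and the encounter file, since each has a Chexpert_ID column.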

Diagnoses

DX.csv contains diagnosis data.

    Column Name     Data Type  Num. Unique
 0  Chexpert_ID     object           51056
 1  dx_id           int64           133534
 2  dx_name         object          124375
 3  icd10           object           27896
 4  icd9            object           14698
 5  diagnosis_date  object            7165
  • Rows: 5,290,238
  • Columns: 6

  • Median patient has 61 diagnoses (25th: 24, 75th: 131, Max: 1442)
  • Top 5000 diagnoses make up 80% of all diagnoses
  • Most common diagnoses:
    1. nonspecific abnormal finding of lung field
    2. Unspecified essential hypertension
    3. Nonspecific abnormal electrocardiogram (ECG) (EKG)
    4. Shortness of breath
    5. Unspecified pleural effusion
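
A coverage figure such as "the top 5000 diagnoses make up 80% of all diagnoses" can be checked with a frequency count. This is a sketch with a hypothetical helper name; whether the original figure was computed over dx_name or dx_id is not stated, so the choice of column is left to the caller.

```python
from collections import Counter


def top_k_coverage(values, k):
    """Return the fraction of all records covered by the k most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total
```

Passing the dx_name column of DX.csv with `k=5000` would give the share of diagnosis rows accounted for by the 5000 most frequent names.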

Encounter

encounter.csv contains records of patient encounters with Stanford Hospitals.

    Column Name         Data Type  Num. Unique
 0  Chexpert_ID         object           63973
 1  pt_class            object              52
 2  Contact_Date        object            6110
 3  ADT_Arrival         object            3012
 4  Hospital_Admission  object            5786
 5  Appointment         object            6115
 6  appt_type           object             520
 7  appt_status         object               7
 8  admission_type      object               7
 9  appt_description    object              48
  • Rows: 5,591,038
  • Columns: 10

  • Median patient has 5 encounters (25th: 2, 75th: 18, Max: 3680)
  • Seven (non-NA) admission types (admission_type):
    1. Elective (1.4M)
    2. Emergency (81K)
    3. Urgent (38K)
    4. Trauma Center (3.3K)
    5. Outpatient (129)
    6. Newborn (76)
    7. Emergent (16)
  • Most common encounters are:
    1. Appointment (1.62M)
    2. Orders Only (763K)
    3. BPA (422K)
    4. Scan (284K)
    5. Hx Scan (261K)
    6. Surgery (233K)
    7. Hx Clinic (215K)
    8. Admission (Discharged) (138K)
    9. Anesthesia Event (123K)
    10. TransChart Notes Only (113K)
  • The appointment description (appt_description) only records the reason for cancellation

Flowsheet

flowsheet.csv contains patient health metrics: weight, height, blood pressure, temperature, and BMI. Some values are outliers that are most likely data entry errors.

    Column Name    Data Type  Num. Unique
 0  Chexpert_ID    object           37186
 1  Recorded_Date  object            4933
 2  WEIGHT         float64           4266
 3  WEIGHT_UNITS   object               1
 4  HEIGHT         float64            817
 5  HEIGHT_UNITS   object               1
 6  BP             object            5322
 7  BP_UNITS       object               1
 8  TEMP           float64             58
 9  TEMP_UNITS     object               1
10  BMI            float64          23965
11  BMI_UNITS      object               1
  • Rows: 37,186
  • Columns: 12

Procedure

procedure.csv contains the procedures performed on patients, each with a description and a code.

    Column Name     Data Type  Num. Unique
 0  Chexpert_ID     object           37002
 1  description     object            9389
 2  code_type       object               3
 3  code            object            9228
 4  Procedure_Date  object            2923
  • Rows: 6,476,021
  • Columns: 5

  • Most common procedures:
    description                                      count
 0  RADEX CH 1 VIEW FRNT                           1064973
 1  SUBSEQ HOSPITAL EVAL/MGMT/HIGH COMPLEX/35 MIN   389407
 2  CRITICAL CARE/EVAL/MGTMT; FIRST 30-74 MINUTES   369392
 3  ECG ROUTINE ECG W/LEAST 12 LDS I&R ONLY         239297
 4  SUBSEQ HOSPITAL EVAL/MGMT/MOD COMPLEX/25 MIN    232169

Lab

There are twenty-eight lab files in the Lab directory, lab000000000000.csv through lab000000000027.csv.

    Column Name      Data Type  Num. Unique
 0  Chexpert_ID      object           40645
 1  Lab_Order_Date   object            3811
 2  Lab_Taken_Date   object            3825
 3  Lab_Result_Date  object            3864
 4  order_type       object              18
 5  proc_code        object            1637
 6  group_lab_name   object            3294
 7  lab_name         object            2816
 8  base_name        object            3579
 9  ord_value        object           73117
10  ord_num_value    float64          12756
11  reference_unit   object             195
12  result_flag      object               8
13  ordering_mode    object               1
  • Rows: 6,110,538
  • Columns: 14
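
Because the lab data is split across numbered files, it is convenient to read them as one stream. The sketch below builds the shard names from the pattern shown in the file tree and chains the rows together. The helper names are hypothetical, and it assumes every shard starts with the same header row, which this dictionary does not confirm.

```python
import csv
from pathlib import Path


def lab_shard_paths(lab_dir, n_shards=28):
    """Build the shard paths lab000000000000.csv ... using 12-digit indices."""
    return [Path(lab_dir) / f"lab{i:012d}.csv" for i in range(n_shards)]


def iter_lab_rows(lab_dir, n_shards=28):
    """Yield rows from every shard in order, one dict per row.

    Assumes each shard has its own header row with identical columns.
    """
    for path in lab_shard_paths(lab_dir, n_shards):
        with open(path, newline="") as handle:
            yield from csv.DictReader(handle)
```

Iterating lazily keeps only one row in memory at a time, which matters for files of this size.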

Images

Each patient has a series of studies, each of which corresponds to a group of chest X-rays taken together. Each study has one or more views, e.g. view1_frontal.jpg. From the CheXpert documentation:

CheXpert is a dataset consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination from Stanford University Medical Center between October 2002 and July 2017, in both inpatient and outpatient centers.

Labels

From the CheXpert documentation:

Each report was labeled for the presence of 14 observations as positive, negative, or uncertain. We decided on the 14 observations based on the prevalence in the reports and clinical relevance, conforming to the Fleischner Society’s recommended glossary whenever applicable. We then developed an automated rule-based labeler to extract observations from the free text radiology reports to be used as structured labels for the images.

Our labeler is set up in three distinct stages: mention extraction, mention classification, and mention aggregation. In the mention extraction stage, the labeler extracts mentions from a list of observations from the Impression section of radiology reports, which summarizes the key findings in the radiographic study. In the mention classification stage, mentions of observations are classified as negative, uncertain, or positive. In the mention aggregation stage, we use the classification for each mention of observations to arrive at a final label for the 14 observations (blank for unmentioned, 0 for negative, -1 for uncertain, and 1 for positive).
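
The label encoding described above (blank, 0, -1, 1) can be decoded with a small lookup. This is an illustrative sketch: `decode_label` is a hypothetical helper, and treating NaN as "unmentioned" is an assumption about how a CSV loader may represent blank cells.

```python
import math

# Encoding taken from the quoted CheXpert description.
_MEANINGS = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def decode_label(cell):
    """Map one raw label cell to the meaning given in the quoted text."""
    text = "" if cell is None else str(cell).strip()
    if text == "":
        return "unmentioned"
    value = float(text)  # accepts "1", "1.0", "-1", ...
    if math.isnan(value):
        return "unmentioned"  # assumption: a loader turned the blank into NaN
    if value not in _MEANINGS:
        raise ValueError(f"unexpected label value: {cell!r}")
    return _MEANINGS[value]
```

How to treat "uncertain" labels during training (ignore, map to 0, or map to 1) is a modelling decision that this dictionary does not prescribe.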


Copyright © 2021-2023 Nightingale Open Science. All rights reserved.