Data dictionary `v1`

File tree

    .
    └── fracture-aimi-xray
        ├── ADT.csv
        ├── DX.csv
        ├── encounter.csv
        ├── flowsheet.csv
        ├── Lab
        │   ├── lab000000000000.csv
        │   ├── lab000000000001.csv
        │   ├── lab000000000002.csv
        │   ├── ...
        │   └── lab000000000027.csv
        ├── procedure.csv
        └── images
            ├── train [64540 entries]
            │   ├── patient00001
            │   │   └── study1
            │   │       └── view1_frontal.jpg
            │   ├── patient00002
            │   │   ├── study1
            │   │   │   ├── view1_frontal.jpg
            │   │   │   └── view2_lateral.jpg
            │   │   └── study2
            │   │       └── view1_frontal.jpg
            │   ├── patient00003
            │   │   └── study1
            │   │       └── view1_frontal.jpg
            │   ├── patient00004
            │   │   └── study1
            │   │       ├── view1_frontal.jpg
            │   │       └── view2_lateral.jpg
            │   ...
            ├── train.csv
            ├── valid [200 entries]
            └── valid.csv

EHR

ADT

ADT.csv contains admissions data.

	Column Name	Data Type	Num. Unique
0	Chexpert_ID	object	50675
1	event_id	int64	6642690
2	department_id	float64	343
3	room_id	float64	1399
4	room_csn_id	float64	1965
5	bed_id	float64	4532
6	bed_csn_id	float64	5052
7	bed_status_c	float64	3
8	jittered_effective_time	object	5253
9	jittered_event_time	object	5020

Rows: 6,642,690
Columns: 10
Median patient has 66 events (25th: 26, 75th: 158, Max: 3936)
Most records (80%) between 2010 and 2016
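
The per-patient figures above (median, 25th and 75th percentiles, max) can be reproduced by counting rows per Chexpert_ID and taking quartiles of those counts. The sketch below is one way to do that with the standard library. It assumes linear interpolation between ranks (the "inclusive" method), which may differ from how the numbers in this dictionary were originally computed. The helper name `per_patient_summary` and the toy IDs are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def per_patient_summary(patient_ids):
    """Summarize how many rows each patient contributes.

    `patient_ids` is any iterable of IDs, e.g. the Chexpert_ID column.
    Needs at least two distinct patients, since quartiles are undefined otherwise.
    """
    counts = sorted(Counter(patient_ids).values())
    # Assumption: "inclusive" linear interpolation between ranks.
    p25, median, p75 = quantiles(counts, n=4, method="inclusive")
    return {"p25": p25, "median": median, "p75": p75, "max": counts[-1]}


# Toy IDs standing in for the Chexpert_ID column (not real data).
toy_ids = ["a"] * 1 + ["b"] * 2 + ["c"] * 3 + ["d"] * 4 + ["e"] * 10
summary = per_patient_summary(toy_ids)
```

The same helper applies unchanged to the per-patient counts reported for the diagnosis and encounter files.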

Diagnoses

DX.csv contains diagnosis data.

	Column Name	Data Type	Num. Unique
0	Chexpert_ID	object	51056
1	dx_id	int64	133534
2	dx_name	object	124375
3	icd10	object	27896
4	icd9	object	14698
5	diagnosis_date	object	7165

Rows: 5,290,238
Columns: 6
Median patient has 61 diagnoses (25th: 24, 75th: 131, Max: 1442)
Top 5000 diagnoses make up 80% of all diagnoses
Most common diagnoses:
1. nonspecific abnormal finding of lung field
2. Unspecified essential hypertension
3. Nonspecific abnormal electrocardiogram (ECG) (EKG)
4. Shortness of breath
5. Unspecified pleural effusion
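
A statement such as "the top 5000 diagnoses make up 80% of all diagnoses" is a top-k coverage figure: the share of rows accounted for by the k most frequent values. A minimal sketch follows. This dictionary does not say whether the original figure was computed on dx_id or dx_name, so the choice of column is left to the caller. The function name `top_k_coverage` and the toy values are hypothetical.

```python
from collections import Counter


def top_k_coverage(values, k):
    """Return the fraction of all rows covered by the k most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy diagnosis names (not real data): 5 + 3 + 2 = 10 rows.
toy_dx = ["dx_a"] * 5 + ["dx_b"] * 3 + ["dx_c"] * 2
coverage = top_k_coverage(toy_dx, k=2)  # the two most common cover 8 of 10 rows
```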

Encounter

Encounter.csv contains encounters with Stanford Hospitals.

	Column Name	Data Type	Num. Unique
0	Chexpert_ID	object	63973
1	pt_class	object	52
2	Contact_Date	object	6110
3	ADT_Arrival	object	3012
4	Hospital_Admission	object	5786
5	Appointment	object	6115
6	appt_type	object	520
7	appt_status	object	7
8	admission_type	object	7
9	appt_description	object	48

Rows: 5,591,038
Columns: 10
Median patient has 5 encounters (25th: 2, 75th: 18, Max: 3680)
Seven (non-NA) admission types:
1. Elective (1.4M)
2. Emergency (81K)
3. Urgent (38K)
4. Trauma Center (3.3K)
5. Outpatient (129)
6. Newborn (76)
7. Emergent (16)
Most common encounters are:
1. Appointment (1.62M)
2. Orders Only (763K)
3. BPA (422K)
4. Scan (284K)
5. Hx Scan (261K)
6. Surgery (233K)
7. Hx Clinic (215K)
8. Addmission (Discharged) (138K)
9. Anesthesia Event (123K)
10. TransChart Notes Only (113K)
The appt_description column only contains the reason for cancellation

Flowsheet

Flowsheet.csv contains patient health metrics. There are some outliers that are most likely data entry errors.

	Column Name	Data Type	Num. Unique
0	Chexpert_ID	object	37186
1	Recorded_Date	object	4933
2	WEIGHT	float64	4266
3	WEIGHT_UNITS	object	1
4	HEIGHT	float64	817
5	HEIGHT_UNITS	object	1
6	BP	object	5322
7	BP_UNITS	object	1
8	TEMP	float64	58
9	TEMP_UNITS	object	1
10	BMI	float64	23965
11	BMI_UNITS	object	1

Rows: 37,186
Columns: 12
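
Since the flowsheet contains likely data entry errors, it can help to flag suspicious values before analysis. The sketch below uses Tukey's IQR rule, which is a generic heuristic and not a method this dictionary prescribes. It needs no knowledge of the measurement units. The function name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy numbers are all assumptions.

```python
from statistics import quantiles


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).

    Returns a list of booleans aligned with `values`; True marks an outlier.
    Missing values should be dropped before calling this.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [not (low <= v <= high) for v in values]


# Toy measurements with one implausible entry (no units implied).
toy_values = [10, 11, 12, 12, 13, 120]
mask = iqr_outlier_mask(toy_values)
```

Flagged rows are better reviewed than silently dropped, since an extreme value can also be a genuine measurement.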

Procedure

Procedure.csv contains patient health procedures.

	Column Name	Data Type	Num. Unique
0	Chexpert_ID	object	37002
1	description	object	9389
2	code_type	object	3
3	code	object	9228
4	Procedure_Date	object	2923

Rows: 6,476,021
Columns: 5
Most common procedures:

	description	count
0	RADEX CH 1 VIEW FRNT	1064973
1	SUBSEQ HOSPITAL EVAL/MGMT/HIGH COMPLEX/35 MIN	389407
2	CRITICAL CARE/EVAL/MGTMT; FIRST 30-74 MINUTES	369392
3	ECG ROUTINE ECG W/LEAST 12 LDS I&R ONLY	239297
4	SUBSEQ HOSPITAL EVAL/MGMT/MOD COMPLEX/25 MIN	232169

Lab

There are twenty-eight lab files, e.g. lab000000000000.csv.

	Column Name	Data Type	Num. Unique
0	Chexpert_ID	object	40645
1	Lab_Order_Date	object	3811
2	Lab_Taken_Date	object	3825
3	Lab_Result_Date	object	3864
4	order_type	object	18
5	proc_code	object	1637
6	group_lab_name	object	3294
7	lab_name	object	2816
8	base_name	object	3579
9	ord_value	object	73117
10	ord_num_value	float64	12756
11	reference_unit	object	195
12	result_flag	object	8
13	ordering_mode	object	1

Rows: 6,110,538
Columns: 14
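
Because the lab data is split across shards, it is usually read shard by shard. The sketch below builds the shard file names from the pattern shown in the file tree (the prefix `lab` followed by a 12-digit zero-padded index) and streams rows from whichever shards exist. It assumes every shard carries its own header row with the columns listed above, which this dictionary does not confirm. The function names are hypothetical.

```python
import csv
from pathlib import Path


def lab_shard_paths(lab_dir, n_shards=28):
    """Build paths lab000000000000.csv ... for the given number of shards."""
    return [Path(lab_dir) / f"lab{i:012d}.csv" for i in range(n_shards)]


def iter_lab_rows(lab_dir, n_shards=28):
    """Yield each row as a dict, one shard at a time, skipping missing shards.

    Assumption: every shard starts with a header row.
    """
    for path in lab_shard_paths(lab_dir, n_shards):
        if not path.exists():
            continue
        with path.open(newline="") as handle:
            yield from csv.DictReader(handle)


paths = lab_shard_paths("fracture-aimi-xray/Lab")
```

Streaming rows this way avoids holding all of the lab records in memory at once.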

Images

Each patient has a series of studies which correspond to groups of chest X-rays taken together. Each study has one or more views, e.g. view1_frontal.jpg. From the CheXpert documentation:

CheXpert is a dataset consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination from Stanford University Medical Center between October 2002 and July 2017, in both inpatient and outpatient centers.
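
The patient, study, and view are all encoded in the image path, so they can be recovered without opening the file. The sketch below parses paths laid out as in the file tree (`patientNNNNN/studyN/viewN_orientation.jpg`). The function name `parse_image_path` and the returned keys are hypothetical, and the pattern is inferred only from the examples in the tree.

```python
import re
from pathlib import PurePosixPath

# Matches file names such as view1_frontal.jpg or view2_lateral.jpg.
VIEW_PATTERN = re.compile(r"view(\d+)_(\w+)\.jpg$")


def parse_image_path(path):
    """Split an image path into patient, study number, view number, orientation."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected patient/study/view in path: {path!r}")
    patient, study, filename = parts[-3], parts[-2], parts[-1]
    match = VIEW_PATTERN.match(filename)
    if not (patient.startswith("patient") and study.startswith("study") and match):
        raise ValueError(f"unexpected image path layout: {path!r}")
    return {
        "patient": patient,
        "study": int(study[len("study"):]),
        "view": int(match.group(1)),
        "orientation": match.group(2),
    }


info = parse_image_path("images/train/patient00002/study1/view2_lateral.jpg")
```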

Labels

From the CheXpert documentation:

Each report was labeled for the presence of 14 observations as positive, negative, or uncertain. We decided on the 14 observations based on the prevalence in the reports and clinical relevance, conforming to the Fleischner Society’s recommended glossary whenever applicable. We then developed an automated rule-based labeler to extract observations from the free text radiology reports to be used as structured labels for the images.

Our labeler is set up in three distinct stages: mention extraction, mention classification, and mention aggregation. In the mention extraction stage, the labeler extracts mentions from a list of observations from the Impression section of radiology reports, which summarizes the key findings in the radiographic study. In the mention classification stage, mentions of observations are classified as negative, uncertain, or positive. In the mention aggregation stage, we use the classification for each mention of observations to arrive at a final label for the 14 observations (blank for unmentioned, 0 for negative, -1 for uncertain, and 1 for positive).
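
The four label encodings described above (blank, 0, -1, 1) can be decoded with a small lookup. The sketch below works on raw CSV cell strings and accepts both integer and float spellings such as "1" and "1.0", since the exact formatting in train.csv and valid.csv is not stated here. The function name `decode_label` is hypothetical.

```python
LABEL_STATES = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def decode_label(raw):
    """Map a raw label cell to 'positive', 'negative', 'uncertain' or 'unmentioned'."""
    text = (raw or "").strip()
    if not text:
        # A blank cell means the observation was not mentioned in the report.
        return "unmentioned"
    code = float(text)
    if code not in LABEL_STATES:
        raise ValueError(f"unexpected label value: {raw!r}")
    return LABEL_STATES[code]


decoded = [decode_label(cell) for cell in ["", "1.0", "0", "-1"]]
```

How uncertain labels are treated downstream (for example mapped to positive, mapped to negative, or ignored) is a modeling choice that this dictionary does not specify.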
