Authors: Rajiv Pramanik1, Bhumil Shah1, Anna Roth1, Honga Wei1, Ted Castillo1, Katie Lin2, Sachin Shah2, Stelios Serghiou2, Nick Foster2, Josh Risley2, Katy Haynes2, Ziad Obermeyer2,3
1 Contra Costa Health Services
2 Nightingale Open Science
3 University of California, Berkeley
Lead Nightingale analyst: Nick Foster
Every year, millions of heart attacks happen around the world. But up to 78% of them are undiagnosed or “silent”. This means a large fraction of people who have had a heart attack never get the cocktail of drugs known to save lives by preventing future heart attacks and sudden death.
Today, doctors can order tests (like MRIs or ultrasounds) to diagnose patients when they suspect a prior heart attack. But so many heart attacks remain silent precisely because doctors and patients don’t even suspect a heart attack has happened.
Finding new ways to diagnose these undiagnosed heart attacks at scale could dramatically expand access to life-saving medications. And because of our close partnership with the county health system that sourced these data, algorithms developed on the platform, once validated, have a clear pathway for making it into clinical use and helping real patients.
Electrocardiograms (ECGs) are a cheap, widespread test done everywhere in the health care system: during annual checkups, ER visits, before surgical procedures, etc. Doctors have learned to diagnose some limited signs of prior heart attack on ECGs (like ‘Q waves’), but these coarse findings still miss about 80% of prior heart attacks. We know that algorithms can match human performance on ECG interpretation—but could they do better, by systematically mining ECG waveforms for signals that might identify prior heart attacks? We don’t know, because there have not historically been datasets linking ECGs to high-quality labels on prior heart attack.
This dataset will link 48,788 ECG waveforms from 13,438 patients to data on prior heart attack: the results of detailed cardiac ultrasound tests (echocardiograms), done in a one-year window around the ECGs, which can visualize scars in the wall of the heart formed by prior heart attack. Linking ECGs to cardiac ultrasound labels will allow researchers to train algorithms to identify patients with prior or future heart attack.
Each observation in the dataset corresponds to a single 12-lead ECG. We identified all inpatient and outpatient ECGs done by the Contra Costa Health Services (CCHS) county health system between January 1st, 2013 and December 31st, 2020, using the Philips TraceMasterVue ECG Management System (now known as IntelliSpace ECG), which stores ECG data from all Philips cardiographs and bedside monitors (more details). Each ECG waveform was shared with us as an XML file, which we parsed into an array of 5,500 points for each of the twelve leads. Note that the eventual v1 dataset, to be released imminently, may contain multiple ECGs per patient.
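As a rough illustration of the parsed layout described above, the sketch below checks that one ECG has twelve leads of 5,500 samples each. The lead names follow the standard 12-lead convention; the dictionary layout, the function name `check_ecg_shape`, and the lead ordering are assumptions made for illustration, not the format of the released files.

```python
# Hypothetical in-memory layout: one ECG as {lead_name: [samples]}.
# The released files may be organised differently.
STANDARD_LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
                  "V1", "V2", "V3", "V4", "V5", "V6"]
SAMPLES_PER_LEAD = 5500  # from the text: 5,500 points per lead


def check_ecg_shape(ecg):
    """Return a list of problems found; an empty list means the shape is as expected."""
    problems = []
    for lead in STANDARD_LEADS:
        if lead not in ecg:
            problems.append(f"missing lead {lead}")
        elif len(ecg[lead]) != SAMPLES_PER_LEAD:
            problems.append(f"lead {lead} has {len(ecg[lead])} samples")
    return problems


# Synthetic flat-line ECG, used only to exercise the check.
example_ecg = {lead: [0.0] * SAMPLES_PER_LEAD for lead in STANDARD_LEADS}
print(check_ecg_shape(example_ecg))  # an empty list
```

A check like this is cheap to run over every record before training, and it surfaces truncated or incomplete recordings early.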
CCHS is a public county health system that serves 190,000 people in Contra Costa County, California. It comprises a federally-qualified Health Maintenance Organization (HMO) health insurer, one regional medical center, and eight health centers and clinics.
This dataset was conceived of and created by Rajiv Pramanik, CCHS Chief Medical Informatics Officer, and Bhumil Shah, CCHS Chief Analytics Officer, with thanks to the leadership of Anna Roth, the Chief Executive Officer of CCHS. We think this dataset is unique because it comes from a kind of health system typically under-represented in machine learning: CCHS is not a well-resourced academic health center or a private health system, but rather a public county system that cares for a variety of under-served patient populations. For this reason, this dataset, and the many like it we plan to release over the coming months, holds the promise of expanding access to high-quality medical diagnostics for traditionally under-served patients.
We are deeply grateful to the Gordon and Betty Moore Foundation, who supported this work with a grant from their Diagnostic Excellence Initiative.
The v0 dataset is a subset of 5,000 ECGs from 5,000 unique patients who had a cardiac ultrasound in the year before the ECG. These patients were randomly chosen from the v1 dataset (coming soon, pending formal certification of deidentification). Each row contains the waveforms for the 12 leads of the ECG and a label indicating whether a regional wall motion abnormality (RWMA) was identified in the cardiac ultrasound (more detail in the Key Variables section below). In this dataset, 9.6% of the 5,000 ECGs have a positive label for RWMA.
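For a sense of scale, the 9.6% figure can be turned into approximate counts. The short calculation below also shows the accuracy of a trivial classifier that always predicts "no RWMA", a baseline worth keeping in mind because raw accuracy is uninformative at this prevalence. The variable names are illustrative, and the counts are approximate because the published percentage is rounded.

```python
n_ecgs = 5000
rwma_prevalence = 0.096  # 9.6% positive, as stated above (a rounded figure)

# Approximate class counts implied by the rounded prevalence.
n_positive = round(n_ecgs * rwma_prevalence)
n_negative = n_ecgs - n_positive

# A classifier that always predicts "no RWMA" is right on every negative.
always_negative_accuracy = n_negative / n_ecgs

print(n_positive, n_negative, always_negative_accuracy)
```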
What’s next for v0.1 (target release date: February 2022): We’ll add all remaining ECGs: the full set of patients who had a cardiac ultrasound in the year before their ECG, as well as all patients who had a cardiac ultrasound in the year after their ECG.
What’s next for v1 (target release date: March 2022): We’ll add metadata for each ECG (e.g. heart rate, QRS interval, QT interval, interpretation), as automatically identified by the ECG machine and provided on TraceMasterVue. We’ll also add information on patients’ diagnoses and medications, both before and after the ECG.
Finally, we’ll add ECGs from a completely new set of patients: those who had ECGs done during primary care or ER visits, but did not have ultrasounds. These are exactly the kinds of patients in whom we’d like to identify undiagnosed prior heart attacks—they are the ultimate hold-out set, because we don’t (yet) have labels for them. On the other hand, we do have some other ways to understand whether physicians have already diagnosed and treated the heart attacks: these patients’ diagnoses and medications. We can use this to start to understand whether algorithmic predictions are providing genuinely new information.
Dataset construction and key variables relevant to the medical setting are shown in the schematic below. A note on color choices: the burnt sienna (orange) indicates the node that corresponds to the observations (rows) in the dataset, and the grape (purple) indicates key patient outcomes.
Note that this section describes the v1 dataset (as opposed to the random sample of 5,000 in v0).
Cardiac ultrasound: We identified all cardiac ultrasounds done as an inpatient or outpatient between January 1st, 2013 and December 31st, 2020. We then linked these to ECGs for the same patient done within 1 year of the cardiac ultrasound (backwards and forwards in time; note the v0 dataset contains only cardiac ultrasound results from the year before the ECG).
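The linkage rule described above can be sketched as follows. This is a minimal illustration on synthetic records, not the pipeline actually used: the tuple layout, the function name `link_ecgs_to_echos`, the 365-day definition of "1 year", and the treatment of same-day ultrasounds as "before" are all assumptions.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=365)  # "within 1 year"; the exact day count is an assumption


def link_ecgs_to_echos(ecgs, echos, before_only=False):
    """Pair each ECG with same-patient echos inside the window.

    ecgs, echos: lists of (patient_id, record_id, date) tuples (hypothetical layout).
    before_only=True keeps only echos done on or before the ECG date,
    mirroring the v0 restriction described in the text.
    """
    echos_by_patient = {}
    for patient_id, echo_id, echo_date in echos:
        echos_by_patient.setdefault(patient_id, []).append((echo_id, echo_date))

    links = []
    for patient_id, ecg_id, ecg_date in ecgs:
        for echo_id, echo_date in echos_by_patient.get(patient_id, []):
            gap = ecg_date - echo_date  # positive when the echo came first
            if abs(gap) > WINDOW:
                continue
            if before_only and gap < timedelta(0):
                continue
            links.append((ecg_id, echo_id, gap.days))
    return links


# Synthetic records for illustration only.
ecgs = [("p1", "ecg1", date(2015, 6, 1)), ("p2", "ecg2", date(2016, 1, 10))]
echos = [("p1", "echoA", date(2015, 1, 15)),   # 137 days before ecg1
         ("p1", "echoB", date(2015, 9, 1)),    # 92 days after ecg1
         ("p2", "echoC", date(2013, 5, 5))]    # more than a year before ecg2

print(link_ecgs_to_echos(ecgs, echos))
print(link_ecgs_to_echos(ecgs, echos, before_only=True))
```

Keeping the signed gap in days alongside each link makes it easy to filter later, for example to reproduce a "year before" subset from the two-sided linkage.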
v1 ECGs identified in the CCHS system between 7/1/2012 and 3/1/2020
|ECGs linked to cardiac ultrasounds|48,788|
|Cardiac Ultrasound Reports|20,159|
Regional wall motion abnormalities: We extracted text-based cardiac ultrasound reports in order to identify Regional Wall Motion Abnormalities (RWMA): abnormalities in the contractile function of the left cardiac ventricle, which suggest prior injury due to myocardial infarction (i.e. a heart attack). The presence or absence of RWMA is ascertained by the cardiologist who interprets the images.
v1 Cardiac ultrasound report feature frequency
|Echocardiogram Reports (N)|24,211|
|Unique Patients (N)|15,183|
|Features|Count|% of total echo reports|
We determined the presence or absence of RWMA in cardiac ultrasound reports using regular expression matching. The regex terms for each feature were selected iteratively, with the oversight of a clinician. Reports were parsed in the following way.
Sample of the free text in a report:
Findings Technical Comments: The study quality is good. Left Ventricle: The left ventricular chamber size is normal. There is no left ventricular hypertrophy. Global left ventricular wall motion and contractility are within normal limits. There is normal left ventricular systolic function. The ejection fraction is calculated to be 63% using the Method of Disks. Age appropriate diastolic function. Left Atrium: The left atrial chamber size is normal.
Additional details on these labels:
Normal Wall Motion
The report includes an observation of normal wall motion. The word “normal” for wall motion must be mentioned.
Global Wall Motion Abnormality
A global wall motion abnormality is an observed impairment of multiple segments of heart muscle, suggesting an underlying process that affects the entire heart. Note that this is not typically the consequence of heart attack, which affects a specific section of the wall.
Regional Wall Motion Abnormality
Regional wall motion abnormality is an observed impairment of one or more particular segments of the heart wall, suggesting heart attack. This typically results from a blocked vessel, but may also occur in the absence of coronary artery disease (for example in myocarditis, sarcoidosis and takotsubo cardiomyopathy).
The finding is a prior finding.
A technically limited cardiac ultrasound may not have enough information captured to make an accurate assessment.
The scripts for this regular expression analysis can be found at this repository.
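The repository scripts are the authoritative version of this analysis. As a rough sketch of the general approach only, the example below labels a report sentence by sentence with a few illustrative patterns. The patterns, the label names, the negation rule, and the function `label_report` are all assumptions and are far cruder than clinician-reviewed expressions would be.

```python
import re

# Illustrative patterns only; the project's actual expressions are in its repository.
PATTERNS = {
    "normal_wall_motion": re.compile(r"wall motion[^.]*\bnormal\b", re.I),
    "global_wma": re.compile(r"\bglobal\b[^.]*\bhypokinesis\b", re.I),
    "regional_wma": re.compile(r"\bregional wall motion abnormalit(?:y|ies)\b", re.I),
}
NEGATION = re.compile(r"\b(?:no|without)\b", re.I)


def label_report(text):
    """Return {label: bool} for one free-text report, scanning sentence by sentence."""
    labels = {name: False for name in PATTERNS}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for name, pattern in PATTERNS.items():
            match = pattern.search(sentence)
            if match is None:
                continue
            # Crude negation handling: skip abnormality mentions preceded by "no"/"without".
            if name != "normal_wall_motion" and NEGATION.search(sentence[:match.start()]):
                continue
            labels[name] = True
    return labels


sample = ("Left Ventricle: The left ventricular chamber size is normal. "
          "There is no left ventricular hypertrophy. "
          "Global left ventricular wall motion and contractility are within normal limits.")
print(label_report(sample))
print(label_report("There is a regional wall motion abnormality of the inferior wall."))
print(label_report("No regional wall motion abnormalities are seen."))
```

The sample sentence shows why term selection needs iteration: it begins with the word "Global" yet describes normal wall motion, so a pattern keyed on "global" alone would mislabel it.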