Patrick Wagner1,2,3, Nils Strodthoff2, Ralf-Dieter Bousseljot1, Dieter Kreiseler1, Fatima I Lunze4, Wojciech Samek2, Tobias Schaeffter5,6,7.
Abstract
Electrocardiography (ECG) is a key non-invasive diagnostic tool for cardiovascular diseases that is increasingly supported by algorithms based on machine learning. Major obstacles to the development of automatic ECG interpretation algorithms are the lack of both public datasets and well-defined benchmarking procedures that allow comparisons of different algorithms. To address these issues, we put forward PTB-XL, to date the largest freely accessible clinical 12-lead ECG-waveform dataset, comprising 21837 records of 10 seconds length from 18885 patients. The ECG-waveform data was annotated by up to two cardiologists as a multi-label dataset, where diagnostic labels were further aggregated into super- and subclasses. The dataset covers a broad range of diagnostic classes including, in particular, a large fraction of healthy records. The combination with additional metadata on demographics, additional diagnostic statements, diagnosis likelihoods, manually annotated signal properties as well as suggested folds for splitting training and test sets turns the dataset into a rich resource for the development and evaluation of automatic ECG interpretation algorithms.
Year: 2020 PMID: 32451379 PMCID: PMC7248071 DOI: 10.1038/s41597-020-0495-6
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Graphical summary of the PTB-XL dataset in terms of diagnostic superclasses and subclasses; see Table 5 for definitions of the acronyms used.
SCP-ECG acronym descriptions for super- and subclasses.
| Level | Superclass | Acronym | SCP statement description |
|---|---|---|---|
| Superclass | | NORM | Normal ECG |
| Superclass | | CD | Conduction Disturbance |
| Superclass | | MI | Myocardial Infarction |
| Superclass | | HYP | Hypertrophy |
| Superclass | | STTC | ST/T change |
| Subclass | NORM | NORM | Normal ECG |
| Subclass | CD | LAFB/LPFB | left anterior/left posterior fascicular block |
| Subclass | CD | IRBBB | incomplete right bundle branch block |
| Subclass | CD | ILBBB | incomplete left bundle branch block |
| Subclass | CD | CLBBB | complete left bundle branch block |
| Subclass | CD | CRBBB | complete right bundle branch block |
| Subclass | CD | _AVB | AV block |
| Subclass | CD | IVCD | non-specific intraventricular conduction disturbance (block) |
| Subclass | CD | WPW | Wolff-Parkinson-White syndrome |
| Subclass | HYP | LVH | left ventricular hypertrophy |
| Subclass | HYP | RVH | right ventricular hypertrophy |
| Subclass | HYP | LAO/LAE | left atrial overload/enlargement |
| Subclass | HYP | RAO/RAE | right atrial overload/enlargement |
| Subclass | HYP | SEHYP | septal hypertrophy |
| Subclass | MI | AMI | anterior myocardial infarction |
| Subclass | MI | IMI | inferior myocardial infarction |
| Subclass | MI | LMI | lateral myocardial infarction |
| Subclass | MI | PMI | posterior myocardial infarction |
| Subclass | STTC | ISCA | ischemic in anterior leads |
| Subclass | STTC | ISCI | ischemic in inferior leads |
| Subclass | STTC | ISC_ | non-specific ischemic |
| Subclass | STTC | STTC | ST-T changes |
| Subclass | STTC | NST_ | non-specific ST changes |
Summary of selected ECG datasets.
| Access | Name | # ECG | # Leads | # Patients | Average length in seconds | Available labels | # Classes |
|---|---|---|---|---|---|---|---|
| restricted | CSE | 1220 | 15 | 1220 | 30 | D | 7 |
| restricted | AHA | 154 | 2 | 154 | 1800 | DFRB | 8 |
| restricted | Stanford | 64121 | 1 | 29163 | 30 | R | 14 |
| restricted | CCDD | 179130 | 12 | 179130 | 30 | D | 378 |
| restricted | THEW | 1172 | 12 | 1154 | 86400 | CB | 5 |
| restricted | Mayo CV | 649931 | 12 | 180922 | 10 | R | 2 |
| restricted | ICBEB Challenge 2018 | 6877 | 12 | 6877 | 30 | DFR | 8 |
| non-restricted | MIT-BIH Noise Stress Test | 15 | 1 | 15 | 22500 | B | 1 |
| non-restricted | MIT-BIH Arrhythmia | 48 | 2 | 47 | 1800 | B | 1 |
| non-restricted | Malignant Ventricular Arrhythmia | 22 | 2 | 22 | 1800 | R | 3 |
| non-restricted | Ventricular Tachyarrhythmia | 35 | 1 | 35 | 480 | B | 3 |
| non-restricted | European ST-T Database | 90 | 2 | 79 | 7200 | F | 2 |
| non-restricted | AF Classification Challenge 2017 | 8528 | 1 | 8528 | 32.5 | R | 4 |
| non-restricted | PTB Diagnostic ECG | 549 | 15 | 294 | 60 | D | 9 |
| non-restricted | PTB-XL | 21837 | 12 | 18885 | 10 | DFR | 71 |
Columns provided in the metadata table ptbxl_database.csv.
| Section | Variable | Data Type | Description |
|---|---|---|---|
| Identifiers | ecg_id | integer | unique ECG identifier |
| Identifiers | patient_id | integer | unique patient identifier |
| Identifiers | filename_lr | string | path to waveform data (100 Hz) |
| Identifiers | filename_hr | string | path to waveform data (500 Hz) |
| General Metadata | age | integer | age at recording in years (see Fig. |
| General Metadata | sex | categorical | sex (male 0, female 1) |
| General Metadata | height | integer | height in centimeters (see Fig. |
| General Metadata | weight | integer | weight in kilograms (see Fig. |
| General Metadata | nurse | categorical | involved nurse (pseudonymized) |
| General Metadata | site | categorical | recording site (pseudonymized) |
| General Metadata | device | categorical | recording device |
| General Metadata | recording_date | datetime | ECG recording date and time |
| ECG Statements | report | string | ECG report from diagnosing cardiologist |
| ECG Statements | scp_codes | dictionary | SCP ECG statements (see Tables |
| ECG Statements | heart_axis | categorical | heart’s electrical axis (see Table |
| ECG Statements | infarction_stadium1 | categorical | infarction stadium (see Table |
| ECG Statements | infarction_stadium2 | categorical | second infarction stadium (see Table |
| ECG Statements | validated_by | categorical | validating cardiologist (pseudonymized) |
| ECG Statements | second_opinion | boolean | flag for second (deviating) opinion |
| ECG Statements | initial_autogenerated_report | boolean | initial autogenerated report by ECG device |
| ECG Statements | validated_by_human | boolean | validated by human |
| Signal Metadata | baseline_drift | string | baseline drift or jump present |
| Signal Metadata | static_noise | string | electric hum/static noise present |
| Signal Metadata | burst_noise | string | burst noise |
| Signal Metadata | electrodes_problems | string | electrodes problems |
| Signal Metadata | extra_beats | string | extra beats |
| Signal Metadata | pacemaker | string | pacemaker |
| Cross-validation Folds | strat_fold | integer | suggested stratified folds |
Each ECG is identified by a unique ID (ecg_id) and comes with a number of ECG statements (scp_codes). These statements can be used to train a multi-label classifier, which can then be evaluated using the proposed fold assignments (strat_fold).
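As a minimal sketch of how these columns might be read, the snippet below parses a metadata table with Python's standard library. The two inline rows are invented toy values, not records from PTB-XL, and it is an assumption that scp_codes is stored as a Python-style dictionary literal mapping each statement to its likelihood.

```python
import ast
import csv
import io

# Toy stand-in for ptbxl_database.csv: the rows are invented for illustration
# and only the columns used below are included.
TOY_CSV = """ecg_id,patient_id,scp_codes,strat_fold
1,101,"{'NORM': 100.0, 'SR': 0.0}",3
2,102,"{'IMI': 50.0, 'AFIB': 0.0}",10
"""

def read_metadata(handle):
    """Return one dict per ECG with typed identifiers, labels and fold."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "ecg_id": int(row["ecg_id"]),
            "patient_id": int(row["patient_id"]),
            # Assumption: the dictionary column is a Python literal string.
            "scp_codes": ast.literal_eval(row["scp_codes"]),
            "strat_fold": int(row["strat_fold"]),
        })
    return records

records = read_metadata(io.StringIO(TOY_CSV))
for rec in records:
    print(rec["ecg_id"], sorted(rec["scp_codes"]), rec["strat_fold"])
```

To read the real table, the same function can be passed an open file handle instead of the in-memory string.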
Fig. 2 Overview of populated columns in ptbxl_database.csv. Each entry corresponds to a row in the table in temporal order from top to bottom. Black pixels indicate existing values; missing values remain white.
Overview of number of records per patient.
| # Records | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| # Patients | 16758 | 1604 | 348 | 103 | 43 | 16 | 5 | 4 | 3 | 1 |
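The table can be cross-checked against the totals quoted in the abstract. The short calculation below sums the patient counts and weights each count by its number of records; the variable names are illustrative only.

```python
# Number of patients with exactly k records, copied from the table above.
patients_by_record_count = {
    1: 16758, 2: 1604, 3: 348, 4: 103, 5: 43,
    6: 16, 7: 5, 8: 4, 9: 3, 10: 1,
}

total_patients = sum(patients_by_record_count.values())
total_records = sum(k * n for k, n in patients_by_record_count.items())

print(total_patients)  # 18885
print(total_records)   # 21837
```

Both totals agree with the 18885 patients and 21837 records stated in the abstract.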
Fig. 3 Demographic overview of patients in PTB-XL.
Fig. 4 Venn diagram illustrating the assignment of the given SCP ECG statements to the three categories diagnostic, form and rhythm.
Likelihood statements for diagnostic statements inferred from keywords in the ECG report as introduced in ECG Statements.
| Keywords | Weighting Factor (Confidence) |
|---|---|
| nicht auszuschliessen, cannot rule out, cannot be excluded | 15% |
| möglicherweise, consider, suggest, likely | 35% |
| wahrscheinlich, possible, maybe, probably, ablaufend, Verdacht auf | 50% |
| Sonst, Bild | 80% |
| Consistent with, Diagnose, Zustand nach… | 100% |
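The keyword table can be read as a lookup from report wording to a confidence value. The sketch below is one possible reading, not the authors' procedure: the case-insensitive substring matching, the rule that the lowest-confidence group wins when several keywords occur, and the `None` result when no keyword is found are all assumptions, and the function name is hypothetical.

```python
# Keyword groups and weighting factors (in percent) from the table above.
LIKELIHOOD_KEYWORDS = [
    (15, ["nicht auszuschliessen", "cannot rule out", "cannot be excluded"]),
    (35, ["möglicherweise", "consider", "suggest", "likely"]),
    (50, ["wahrscheinlich", "possible", "maybe", "probably",
          "ablaufend", "Verdacht auf"]),
    (80, ["Sonst", "Bild"]),
    (100, ["Consistent with", "Diagnose", "Zustand nach"]),
]

def likelihood_from_report(report):
    """Return the weight of the first matching keyword group, or None.

    Assumption: groups are checked from lowest to highest confidence and
    keywords are matched as case-insensitive substrings.
    """
    text = report.lower()
    for weight, keywords in LIKELIHOOD_KEYWORDS:
        if any(keyword.lower() in text for keyword in keywords):
            return weight
    return None

print(likelihood_from_report("Verdacht auf Vorderwandinfarkt"))   # 50
print(likelihood_from_report("cannot rule out inferior infarct"))  # 15
```

Plain substring matching is deliberately naive: it would also fire on longer words that merely contain a keyword, so a real pipeline would likely need word-boundary handling.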
Diagnostic Statement Overview, where the acronyms of super- and subclass are introduced in Table 5.
| Acronym | # Records | Description | Superclass | Subclass |
|---|---|---|---|---|
| LAFB | 1626 | left anterior fascicular block | CD | LAFB/LPFB |
| IRBBB | 1118 | incomplete right bundle branch block | CD | IRBBB |
| 1AVB | 797 | first degree AV block | CD | _AVB |
| IVCD | 789 | non-specific intraventricular conduction disturbance (block) | CD | IVCD |
| CRBBB | 542 | complete right bundle branch block | CD | CRBBB |
| CLBBB | 536 | complete left bundle branch block | CD | CLBBB |
| LPFB | 177 | left posterior fascicular block | CD | LAFB/LPFB |
| WPW | 80 | Wolff-Parkinson-White syndrome | CD | WPW |
| ILBBB | 77 | incomplete left bundle branch block | CD | ILBBB |
| 3AVB | 16 | third degree AV block | CD | _AVB |
| 2AVB | 14 | second degree AV block | CD | _AVB |
| LVH | 2137 | left ventricular hypertrophy | HYP | LVH |
| LAO/LAE | 427 | left atrial overload/enlargement | HYP | LAO/LAE |
| RVH | 126 | right ventricular hypertrophy | HYP | RVH |
| RAO/RAE | 99 | right atrial overload/enlargement | HYP | RAO/RAE |
| SEHYP | 30 | septal hypertrophy | HYP | SEHYP |
| IMI | 2685 | inferior myocardial infarction | MI | IMI |
| ASMI | 2363 | anteroseptal myocardial infarction | MI | AMI |
| ILMI | 479 | inferolateral myocardial infarction | MI | IMI |
| AMI | 354 | anterior myocardial infarction | MI | AMI |
| ALMI | 290 | anterolateral myocardial infarction | MI | AMI |
| INJAS | 215 | subendocardial injury in anteroseptal leads | MI | AMI |
| LMI | 201 | lateral myocardial infarction | MI | LMI |
| INJAL | 148 | subendocardial injury in anterolateral leads | MI | AMI |
| IPLMI | 51 | inferoposterolateral myocardial infarction | MI | IMI |
| IPMI | 33 | inferoposterior myocardial infarction | MI | IMI |
| INJIN | 18 | subendocardial injury in inferior leads | MI | IMI |
| PMI | 17 | posterior myocardial infarction | MI | PMI |
| INJLA | 17 | subendocardial injury in lateral leads | MI | AMI |
| INJIL | 15 | subendocardial injury in inferolateral leads | MI | IMI |
| NORM | 9528 | normal ECG | NORM | NORM |
| NDT | 1829 | non-diagnostic T abnormalities | STTC | STTC |
| NST_ | 770 | non-specific ST changes | STTC | NST_ |
| DIG | 181 | digitalis-effect | STTC | STTC |
| LNGQT | 118 | long QT-interval | STTC | STTC |
| ISC_ | 1275 | non-specific ischemic | STTC | ISC_ |
| ISCAL | 660 | ischemic in anterolateral leads | STTC | ISCA |
| ISCIN | 219 | ischemic in inferior leads | STTC | ISCI |
| ISCIL | 179 | ischemic in inferolateral leads | STTC | ISCI |
| ISCAS | 170 | ischemic in anteroseptal leads | STTC | ISCA |
| ISCLA | 142 | ischemic in lateral leads | STTC | ISCA |
| ANEUR | 104 | ST-T changes compatible with ventricular aneurysm | STTC | STTC |
| EL | 97 | electrolytic disturbance or drug (former EDIS) | STTC | STTC |
| ISCAN | 44 | ischemic in anterior leads | STTC | ISCA |
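The superclass column defines a many-to-one mapping from individual statements to five coarse labels. The sketch below illustrates that aggregation with a small subset of the rows above; the dictionary is deliberately incomplete, and the function name and example record are hypothetical.

```python
# Subset of the statement -> superclass assignments from the table above.
SUPERCLASS = {
    "NORM": "NORM",
    "LAFB": "CD", "IRBBB": "CD", "CLBBB": "CD",
    "LVH": "HYP", "RVH": "HYP",
    "IMI": "MI", "ASMI": "MI", "AMI": "MI",
    "NDT": "STTC", "ISC_": "STTC", "ISCAL": "STTC",
}

def diagnostic_superclasses(scp_codes):
    """Map a record's statements to its sorted list of diagnostic superclasses.

    Statements without an entry in SUPERCLASS (for example rhythm statements)
    are ignored.
    """
    return sorted({SUPERCLASS[code] for code in scp_codes if code in SUPERCLASS})

example = {"IMI": 100.0, "LAFB": 100.0, "SR": 0.0}  # hypothetical record
print(diagnostic_superclasses(example))  # ['CD', 'MI']
```

Because a record can carry statements from several superclasses, the result is a list rather than a single label, which matches the multi-label framing of the dataset.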
Form Statement Overview.
| Acronym | # Records | Description |
|---|---|---|
| NDT | 1829 | non-diagnostic T abnormalities |
| NST_ | 770 | non-specific ST changes |
| DIG | 181 | digitalis-effect |
| LNGQT | 118 | long QT-interval |
| ABQRS | 3327 | abnormal QRS |
| PVC | 1146 | ventricular premature complex |
| STD_ | 1009 | non-specific ST depression |
| VCLVH | 875 | voltage criteria (QRS) for left ventricular hypertrophy |
| QWAVE | 548 | Q waves present |
| LOWT | 438 | low amplitude T-waves |
| NT_ | 424 | non-specific T-wave changes |
| PAC | 398 | atrial premature complex |
| LPR | 340 | prolonged PR interval |
| INVT | 294 | inverted T-waves |
| LVOLT | 182 | low QRS voltages in the frontal and horizontal leads |
| HVOLT | 62 | high QRS voltage |
| TAB_ | 35 | T-wave abnormality |
| STE_ | 28 | non-specific ST elevation |
| PRC(S) | 10 | premature complex(es) |
Rhythm Statement Overview.
| Acronym | # Records | Description |
|---|---|---|
| SR | 16782 | sinus rhythm |
| AFIB | 1514 | atrial fibrillation |
| STACH | 826 | sinus tachycardia |
| SARRH | 772 | sinus arrhythmia |
| SBRAD | 637 | sinus bradycardia |
| PACE | 296 | normal functioning artificial pacemaker |
| SVARR | 157 | supraventricular arrhythmia |
| BIGU | 82 | bigeminal pattern (unknown origin, SV or Ventricular) |
| AFLT | 73 | atrial flutter |
| SVTAC | 27 | supraventricular tachycardia |
| PSVT | 24 | paroxysmal supraventricular tachycardia |
| TRIGU | 20 | trigeminal pattern (unknown origin, SV or Ventricular) |
Fig. 5 Distribution of diagnostic subclasses for given diagnostic superclasses.
Overview of number of statements per ECG introduced in ECG Statements.
| Level | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Diagnostic | 407 | 15019 | 4242 | 1515 | 529 | 121 | 4 | 0 | 0 | 0 |
| Diagnostic Superclass | 407 | 16272 | 4079 | 920 | 159 | 0 | 0 | 0 | 0 | 0 |
| Diagnostic Subclass | 407 | 15239 | 4171 | 1439 | 475 | 102 | 4 | 0 | 0 | 0 |
| Form | 12849 | 6693 | 1672 | 524 | 90 | 9 | 0 | 0 | 0 | 0 |
| Rhythm | 771 | 20923 | 142 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| All | 0 | 705 | 11247 | 5114 | 2597 | 1254 | 597 | 253 | 63 | 7 |
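Each row of this table partitions the full set of records, so every row should sum to 21837. The sketch below checks this for four of the levels and derives the share of records without any statement and the mean number of statements per record; variable names are illustrative.

```python
TOTAL_RECORDS = 21837

# Records with 0, 1, 2, ... 9 statements at each level, from the table above.
counts = {
    "Diagnostic": [407, 15019, 4242, 1515, 529, 121, 4, 0, 0, 0],
    "Form": [12849, 6693, 1672, 524, 90, 9, 0, 0, 0, 0],
    "Rhythm": [771, 20923, 142, 1, 0, 0, 0, 0, 0, 0],
    "All": [0, 705, 11247, 5114, 2597, 1254, 597, 253, 63, 7],
}

for level, row in counts.items():
    # Every record falls into exactly one bin of the row.
    assert sum(row) == TOTAL_RECORDS
    unlabeled = row[0] / TOTAL_RECORDS
    mean = sum(k * n for k, n in enumerate(row)) / TOTAL_RECORDS
    print(f"{level}: {unlabeled:.1%} without a statement, {mean:.2f} on average")
```

The zero in the first column of the "All" row shows that every record carries at least one statement, even though 407 records have no diagnostic statement.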
Distribution of heart_axis as introduced in ECG Statements.
| Acronym | Keywords | # Records |
|---|---|---|
| UNK | Unknown | 8505 |
| MID | Normal axis | 7687 |
| LAD | Left axis deviation | 3764 |
| ALAD | Abnormal LAD, extreme left axis deviation | 1382 |
| RAD | Right axis deviation | 221 |
| ARAD | Abnormal RAD, extreme right axis deviation | 122 |
| AXL | Horizontal axis | 102 |
| AXR | Vertical axis | 51 |
| SAG | Sagittal type (S1-S2-S3 Pattern) | 3 |
Distribution of infarction stadium across the dataset as introduced in ECG Statements.
| Stadium | Keyword | # Records |
|---|---|---|
| Stadium I | acut, early | 186 |
| Stadium I–II | acut/subacut, ablaufend | 5 |
| Stadium II | recent, subacut, bereits abgelaufen | 107 |
| Stadium II–III | subacut/chronisch | 943 |
| Stadium III | old, abgelaufen, chronisch | 1045 |
| unknown | uncertain, unknown, unbekannt | 3443 |
Counts are accumulated over infarction_stadium1 and infarction_stadium2, which are only set if at least one statement belongs to the Myocardial Infarction (MI) superclass.
Fig. 6 Distribution of ECG statements, sex and age across the ten stratified folds. The ninth and tenth folds have particularly high label quality and are intended to be used as validation and test sets.
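A fold-based split following this caption might look like the sketch below. It assumes that fold 9 serves as the validation set, fold 10 as the test set and the remaining folds as training data; the function name and the toy records are hypothetical.

```python
def split_by_fold(records, val_fold=9, test_fold=10):
    """Partition records into train/validation/test lists by strat_fold."""
    split = {"train": [], "val": [], "test": []}
    for rec in records:
        if rec["strat_fold"] == test_fold:
            split["test"].append(rec)
        elif rec["strat_fold"] == val_fold:
            split["val"].append(rec)
        else:
            split["train"].append(rec)
    return split

# Hypothetical records: one per fold, with ecg_id equal to the fold number.
toy = [{"ecg_id": fold, "strat_fold": fold} for fold in range(1, 11)]
split = split_by_fold(toy)
print({name: [r["ecg_id"] for r in part] for name, part in split.items()})
# {'train': [1, 2, 3, 4, 5, 6, 7, 8], 'val': [9], 'test': [10]}
```

Keeping the fold numbers as parameters makes it easy to rotate the held-out folds for cross-validation on the remaining eight.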
SCP-ECG statement summary.
| Column | Description |
|---|---|
| acronym | SCP statement |
| description | short statement description |
| diagnostic | flag if statement is diagnostic |
| form | flag if statement is related to form |
| rhythm | flag if statement is related to rhythm |
| diagnostic_class | superclass for diagnostic statements |
| diagnostic_subclass | subclass for diagnostic statements |
| Statement Category | official SCP statement category |
| SCP-ECG Statement Description | official SCP statement description |
| AHA code | unique ID in the AHA standard |
| aECG REFID | IEEE 11073-10102 Annotated ECG (aECG) standard |
| CDISC Code | Controlled Terminology |
| DICOM Code | DICOM Tags |
Description of annotation scheme stored in scp_statements.csv.
Fig. 7 Example Python code for loading data and labels, also using the suggested folds and aggregation of diagnostic labels.
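The following is a rough sketch of the waveform-loading step only, and it is not the authors' listing. It assumes that the waveform files are stored in WFDB format, that the third-party `wfdb` package is installed, and that the filename_lr and filename_hr columns hold record paths relative to a dataset directory; the helper names are hypothetical.

```python
import os

RECORD_SECONDS = 10  # record length stated for PTB-XL

def waveform_column(sampling_rate):
    """Pick the metadata column holding the path for the requested rate."""
    columns = {100: "filename_lr", 500: "filename_hr"}
    if sampling_rate not in columns:
        raise ValueError("metadata lists paths for 100 Hz and 500 Hz only")
    return columns[sampling_rate]

def expected_samples(sampling_rate):
    """Samples per lead in one record at the given rate."""
    return sampling_rate * RECORD_SECONDS

def load_waveform(base_dir, row, sampling_rate=100):
    """Read one record; requires the third-party wfdb package (assumption)."""
    import wfdb  # imported lazily so the helpers above work without it
    path = os.path.join(base_dir, row[waveform_column(sampling_rate)])
    signal, fields = wfdb.rdsamp(path)  # signal: (samples, leads) array
    return signal, fields

print(waveform_column(500), expected_samples(500))  # filename_hr 5000
```

For a 12-lead record at 100 Hz, the returned signal would be expected to have 1000 rows and 12 columns under these assumptions.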
| Measurement(s) | electrocardiography • cardiovascular system |
| Technology Type(s) | 12 lead electrocardiography |
| Factor Type(s) | presence of co-occurring diseases |
| Sample Characteristic - Organism | Homo sapiens |