Temporal condition pattern mining in large, sparse electronic health record data: A case study in characterizing pediatric asthma.

Elizabeth A Campbell^1,2, Ellen J Bass^1,3, Aaron J Masino^1,4.

Abstract

OBJECTIVE: This study introduces a temporal condition pattern mining methodology to address the sparse nature of coded condition concept utilization in electronic health record data. As a validation study, we applied this method to reveal condition patterns surrounding an initial diagnosis of pediatric asthma.
MATERIALS AND METHODS: The SPADE (Sequential PAttern Discovery using Equivalence classes) algorithm was used to identify common temporal condition patterns surrounding the initial diagnosis of pediatric asthma in a study population of 71 824 patients from the Children's Hospital of Philadelphia. SPADE was applied to a dataset with diagnoses coded using International Classification of Diseases (ICD) concepts and separately to a dataset with the ICD codes mapped to their corresponding expanded diagnostic clusters (EDCs). Common temporal condition patterns surrounding the initial diagnosis of pediatric asthma ascertained by SPADE from both the ICD and EDC datasets were compared.
RESULTS: SPADE identified 36 unique diagnoses in the mapped EDC dataset, whereas only 19 were recognized in the ICD dataset. Temporal trends in condition diagnoses ascertained from the EDC data were not discoverable in the ICD dataset.
DISCUSSION: Mining frequent temporal condition patterns from large electronic health record datasets may reveal previously unknown associations between diagnoses that could inform future research into causation or other relationships. Mapping sparsely coded medical concepts into homogenous groups was essential to discovering potentially useful information from our dataset.
CONCLUSIONS: We expect that the presented methodology is applicable to the study of diagnostic trajectories for other clinical conditions and can be extended to study temporal patterns of other coded medical concepts such as medications and procedures.

Keywords: asthma; data mining; data science; electronic health records

Year: 2020 PMID: 32049282 PMCID: PMC7075539 DOI: 10.1093/jamia/ocaa005

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

Data mining refers to the computational process of automated information extraction from large datasets to facilitate discovery of novel insights. Pattern mining is a fundamental data mining task. Important pattern types include subsequences of sequentially ordered items or events that occur frequently in the dataset. For example, a collection of temporally ordered event sequences may contain ordered event subsets that are frequent. Such temporal patterns may yield valuable insights on associations or causal relationships among variables in a dataset, and help to predict future events. Temporal pattern mining of electronic health record (EHR) data has the potential to uncover previously unknown relationships among comorbidities (conditions occurring together) and condition trajectories (conditions diagnosed in a temporal order), which can complement clinical knowledge and traditional medical research methods. Diagnoses found to commonly occur before the onset of a condition of interest may inform clinicians of patient risk for developing that condition. Similarly, diagnoses found to occur commonly after the onset of a condition of interest may inform care providers of future risk for other disorders. While longitudinal EHR data may inform healthcare research and policy, data mining methods must be selected carefully based on the characteristics of the EHR data and the outcome of interest. Sequential pattern mining methods can be broadly categorized into 2 classes: the Apriori-based candidate generation method and the pattern growth method. 
The Apriori approach is based on the Apriori property, which posits that a given sequence is necessarily infrequent if it contains a smaller sequence that is infrequent, and that a subsequence of a frequent sequence is also frequent. Apriori-based algorithms may use a horizontal database format (eg, the generalized sequential patterns algorithm), a vertical database format (eg, SPADE [Sequential PAttern Discovery using Equivalence classes]), or Apriori-based candidate generation and pruning using depth-first traversal (eg, the SPAM [Sequential PAttern Mining] algorithm). Pattern growth algorithms, such as PrefixSpan, perform database projection and use database scans to count item support (retaining patterns whose frequency exceeds a threshold percentage) but do not generate candidates. Although the application of temporal pattern mining techniques to EHR data remains a relatively new and unexplored approach, some important initial studies have demonstrated the methods’ effectiveness. For example, Perer et al used the SPAM algorithm to mine EHR data to identify sequential medical event patterns among hyperlipidemic patients and studied their association with outcomes. The study identified sequences that confirmed known associations and revealed novel information. For example, increased use of certain medications, such as fluoroquinolones, in hyperlipidemic patients with hypertension may raise low-density lipoprotein levels. Gotz et al also used the SPAM algorithm for temporal pattern mining of clinical events in retrospective EHR data. Chen et al used the PrefixSpan algorithm to analyze combinations of chronic diseases and specific orders of chronic disease transition from a large medical database. Wright et al utilized the SPADE algorithm to identify temporal relationships among diabetes medication prescription trajectories and to predict future medication prescriptions.
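The Apriori property described above can be illustrated with a small sketch. The Python example below is not one of the algorithms named in this section; it uses a brute-force support count over a hypothetical toy database (`TOY_DB`, with made-up events A, B, and C) only to show that a sequence can never be more frequent than any of its subsequences.

```python
from typing import List, Sequence

# Hypothetical toy database: one temporally ordered event list per patient.
TOY_DB: List[List[str]] = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "C"],
    ["A", "B"],
]


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if pattern occurs in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def support(pattern: Sequence[str], db: List[List[str]]) -> float:
    """Fraction of sequences in db that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in db) / len(db)


# Support can only shrink (or stay equal) as a pattern grows.
for pattern in (("A",), ("A", "B"), ("A", "B", "C")):
    print(pattern, support(pattern, TOY_DB))
```

Because support is non-increasing as patterns grow, a candidate whose subsequence already falls below the threshold can be discarded without counting it, which is the pruning step that Apriori-based methods rely on.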
Data heterogeneity and sparsity are major challenges to temporal pattern mining of EHR data. These challenges are particularly salient when working with diagnostic codes. For example, although a typical EHR dataset will contain thousands of distinct International Classification of Diseases (ICD) codes, a small fraction of ICD codes will account for most diagnoses. Previous research indicates that this Pareto principle phenomenon presents a significant challenge for pattern mining. Chen et al noted the complexities of mining disease transition patterns using ICD codes due to the volume of codes and the sparse use of medically similar codes. Boytcheva et al observed similar challenges and highlighted analyzing classes of similarly grouped ICD codes as an important area of future work. Diagnostic schemes that group diagnosis codes into medically homogenous clusters are one approach to reduce sparsity in an EHR dataset, which may improve pattern mining results. However, the use of such clustering methods in combination with sequential pattern mining in EHR data is limited. There is also limited comparison of mining results when utilizing diagnostic schemes compared with using standard codes (ie, ICD codes). To address this gap, our study utilizes expanded diagnostic clusters (EDCs) from the Adjusted Clinical Group (ACG) System to address sparsity and makes a direct comparison to results using the original ICD-based diagnoses. We selected EDCs because they are linked to both ICD–Ninth Revision–Clinical Modification (ICD-9-CM) and ICD–Tenth Revision–Clinical Modification (ICD-10-CM), and cover all diagnosis codes, thus providing comprehensive coverage. We present a sequential pattern mining approach that combines the SPADE pattern mining algorithm with the use of EDC groupings to identify temporal condition patterns in a large, sparse EHR dataset with a case study of pediatric asthma.
Pediatric asthma is one of the most common childhood chronic conditions, has numerous comorbidities, and is a socially significant health condition that disproportionately impacts low-income and minority children in the United States. Our objective is to identify temporal patterns surrounding an incident diagnosis of a given condition (eg, asthma). Toward this end, we sought to develop a generalizable framework to identify temporal patterns utilizing sparse EHR data and a sequential pattern mining algorithm. We also consider the impact of collapsing diagnostic codes into clinically similar groups on temporal pattern recognition. We present our data extraction, data transformation, and knowledge discovery processes. We describe the implementation of the SPADE algorithm on EHR data to ascertain condition patterns associated with pediatric asthma using both ICD and EDC coded conditions, and compare output from both datasets. We discuss the evaluation and interpretation of SPADE output, as well as the successes and limitations of our approach. Our study contributes a methodological framework for extracting and organizing EHR data into temporal sequences, and an approach to analyze temporal patterns in EHR data for associations between condition trajectories and an outcome of interest (eg, pediatric asthma).

MATERIALS AND METHODS

The institutional review board at the Children’s Hospital of Philadelphia (CHOP) approved this research study through institutional review board protocol number 16-012822 and waived the requirement for consent. The primary methodological steps in the study, from data extraction through algorithm implementation, are described in Figure 1.

Figure 1.

The primary methodological steps involved in the study. SPADE: Sequential PAttern Discovery using Equivalence classes.

Setting

Study data were obtained from the Pediatric Big Data (PBD) resource maintained at CHOP. The PBD resource includes EHR data from CHOP, its primary care network, and specialty care and surgical centers. EHR information for conditions recorded during each patient visit was derived from ICD-9-CM and ICD-10-CM diagnostic information. Data were extracted by nonstudy staff personnel and anonymized to remove all personal health information before transfer. The PBD resource contains data for visits through October 2017.

Inclusion criteria

Patient data were included for individuals with at least 2 clinical encounters on or after January 1, 2005, in which an ICD code for an asthma diagnosis or exacerbation (ICD-9-CM codes 493.* and ICD-10-CM codes J45.*, where * is any child code) (see Supplementary Table 1) was recorded and the encounters were face to face (inpatient stays, ambulatory visits, or emergency department visits). The requirement for 2 records with an asthma diagnosis was meant to exclude transient asthma diagnoses. Patients must also have had 2 clinical encounters of any type (not necessarily face to face, such as a pharmacy visit) with a recorded ICD condition concept that was not an asthma diagnosis or exacerbation code (these encounters must have occurred on or after January 1, 2000). This requirement ensures that the cohort does not include individuals with only asthma diagnoses, which enables identification of clinically informative temporal condition patterns surrounding an initial diagnosis of pediatric asthma. ICD codes may be classified as clinical or nonclinical observations, per Observational Health Data Sciences and Informatics condition domain standards. For example, an anemia screening is a nonclinical observation, whereas an anemia diagnosis is a clinical observation. In this study, we excluded encounters without any clinical observations. We extracted the available clinical diagnosis observations for all patients who met the inclusion criteria in the PBD database. This yielded 7 412 524 clinical observations for 94 343 patients. The encounters where asthma conditions were recorded corresponded to 1 135 006 clinical observations. For each individual, we identified the first visit with a recorded asthma diagnosis (the index visit), the encounter immediately before the index visit (the preindex visit, if one exists), and the encounter immediately after the index visit (the postindex visit, if one exists).
This yielded a dataset containing 578 839 clinical observations for 94 343 patients. To ensure the ability to analyze patients’ clinical state before the initial asthma diagnosis, patients without a preindex visit were excluded. The final dataset contained 473 607 clinical observations for 71 824 patients.
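The selection of preindex, index, and postindex visits described above can be sketched as follows. This is a minimal illustration, not the study's extraction code: the `Visit` tuple layout, the function names, and the example patient are hypothetical, and the sketch assumes that encounters without clinical observations have already been removed and that no two visits share a date.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

Visit = Tuple[date, List[str]]  # (visit date, ICD codes recorded at that visit)


def is_asthma_code(code: str) -> bool:
    """ICD-9-CM 493.* or ICD-10-CM J45.*, following the inclusion criteria."""
    return code.startswith("493") or code.startswith("J45")


def label_visits(visits: List[Visit]) -> Optional[Dict[str, Visit]]:
    """Label the first asthma visit and its immediate neighbors.

    Returns None when the patient has no asthma visit or no preindex visit,
    mirroring the exclusion of patients without a preindex visit.
    """
    ordered = sorted(visits, key=lambda visit: visit[0])
    index_pos = next(
        (i for i, (_, codes) in enumerate(ordered)
         if any(is_asthma_code(code) for code in codes)),
        None,
    )
    if index_pos is None or index_pos == 0:
        return None
    labeled = {"preindex": ordered[index_pos - 1], "index": ordered[index_pos]}
    if index_pos + 1 < len(ordered):  # the postindex visit is optional
        labeled["postindex"] = ordered[index_pos + 1]
    return labeled


# Hypothetical patient with four visits, deliberately given out of order.
example = [
    (date(2010, 3, 1), ["382.9"]),
    (date(2010, 6, 15), ["493.90", "477.9"]),
    (date(2010, 9, 2), ["465.9"]),
    (date(2009, 11, 20), ["786.2"]),
]
labels = label_visits(example)
```

Only the visits immediately adjacent to the index visit are kept, so the earliest visit in the example is dropped.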

Data extraction and transformation

R version 3.4.4 (R Foundation for Statistical Computing, Vienna, Austria) and Python 3 (Python Software Foundation, Wilmington, DE) were used for the data analysis. Computations were performed on a MacBook Pro running MacOS version 10.12.6 and with 8 GB of RAM (Apple Inc, Cupertino, CA). After extracting condition occurrence data from the PBD resource, we mapped ICD-9-CM and ICD-10-CM clinical condition concepts into medically homogenous classes using EDCs from the ACG System.
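The mapping step can be pictured as a lookup from ICD codes to EDC codes. The sketch below is only illustrative: `ICD_TO_EDC` is a hypothetical four-entry table, not the ACG System mapping, and the specific ICD-to-EDC pairings are assumptions made for the example (the EDC labels ALL04 and GAS03 are taken from the Figure 2 legend).

```python
from typing import Dict, List, Tuple

# Hypothetical lookup fragment; the real ACG System table is far larger
# and these pairings are assumed for illustration only.
ICD_TO_EDC: Dict[str, str] = {
    "493.90": "ALL04",   # ICD-9-CM asthma code -> asthma, without status asthmaticus
    "J45.909": "ALL04",  # ICD-10-CM asthma code -> same cluster
    "564.00": "GAS03",   # ICD-9-CM constipation code -> constipation
    "K59.00": "GAS03",   # ICD-10-CM constipation code -> same cluster
}


def map_to_edc(
    icd_codes: List[str], lookup: Dict[str, str] = ICD_TO_EDC
) -> Tuple[List[str], List[str]]:
    """Map one visit's ICD codes to unique EDC codes, preserving first-seen order.

    Returns (edc_codes, unmapped_icd_codes) so that gaps in the lookup
    are visible rather than silently dropped.
    """
    edc_codes: List[str] = []
    unmapped: List[str] = []
    for code in icd_codes:
        edc = lookup.get(code)
        if edc is None:
            unmapped.append(code)
        elif edc not in edc_codes:
            edc_codes.append(edc)
    return edc_codes, unmapped


visit_codes = ["493.90", "J45.909", "K59.00"]
print(map_to_edc(visit_codes))
```

Several distinct ICD codes collapse into one EDC code, which is how the mapping reduces sparsity before mining.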

Temporal order identification

The SPADE algorithm analyzes data in a “vertical id-list” database format, in which sequences comprise a list of objects in order of occurrence along with timestamps. To apply the SPADE algorithm, we reorganized PBD data from their normalized relational form (information for a single patient stored in different tables within the database) into row entries, in which each row contained a patient identifier, a visit timing class indicator (preindex, index, or postindex) and all clinical observations associated with that visit. Visit date information was no longer relevant, as the visit timing class variable captured the temporal order information necessary to indicate sequential order for subsequent SPADE analysis. Figure 2A presents a sample sequence database of clinical information for 3 patients, formatted to the SPADE algorithm’s specifications for pattern mining.
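The reorganization into row entries can be sketched as below. The input layout, the `TIMING_ORDER` encoding, and the function name are assumptions for illustration; the study's actual file format is not specified here.

```python
from typing import Dict, List, Tuple

# Assumed encoding: the timing class replaces the visit date as the event order.
TIMING_ORDER = {"preindex": 1, "index": 2, "postindex": 3}

Row = Tuple[str, int, Tuple[str, ...]]  # (patient id, event order, condition codes)


def to_sequence_rows(patients: Dict[str, Dict[str, List[str]]]) -> List[Row]:
    """Flatten per-patient visit data into one row per (patient, timing class)."""
    rows: List[Row] = []
    for patient_id in sorted(patients):
        visits = patients[patient_id]
        for timing in sorted(visits, key=TIMING_ORDER.__getitem__):
            # Deduplicate and sort codes so each row is a canonical itemset.
            items = tuple(sorted(set(visits[timing])))
            rows.append((patient_id, TIMING_ORDER[timing], items))
    return rows


# Hypothetical patients using EDC codes from the Figure 2 legend.
patients = {
    "P2": {"index": ["ALL04"], "preindex": ["EYE07"]},
    "P1": {
        "postindex": ["GAS03"],
        "index": ["ALL04", "ALL04"],
        "preindex": ["EYE07", "EAR08"],
    },
}
rows = to_sequence_rows(patients)
```

Each patient contributes at most three rows, one per timing class, ordered so that a sequence miner reads preindex before index before postindex.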

Figure 2.

(A) A sample sequence database of clinical information (ie, expanded diagnostic cluster groupings for clinical diagnoses) for 3 patients, formatted for the SPADE (Sequential PAttern Discovery using Equivalence classes) algorithm. (B) The output of frequent sequential patterns that SPADE uncovers from the clinical information of the 3 patients in panel A. The expanded diagnostic cluster codes correspond to the following diagnostic categories: ALL04: asthma, without status asthmaticus; EAR08: deafness, hearing loss; EYE07: conjunctivitis, keratitis; GAS03: constipation; MUS02: acute sprains and strains; MUS08: fractures and dislocations/digits only.

Sequence mining

Our primary study objective required identification of temporal condition patterns in a large, sparse EHR dataset. Prior research has demonstrated that SPADE performs well on data with such characteristics and demonstrates runtime efficiency and low memory usage. For these reasons, we selected SPADE as our mining algorithm for this study. SPADE first identifies individual items (eg, a singular diagnosis in a specific timing class) above a specified support level (eg, the proportion of patients with an identified condition pattern), and then builds more complex sequences (multiple diagnoses across different timing classes) at the given support level. An inherent but provable assumption is that a complex sequence present above a given support level is comprised of individual items that also occur above the support level. The algorithm uses efficient lattice search techniques and simple join operations to find frequent patterns among smaller subproblems obtained from dividing the original problem. The SPADE algorithm is efficient for identifying subsequence patterns in large databases of sequential data, and has been successfully used across numerous domains including internet user behaviors and food purchasing patterns. We applied the SPADE algorithm as implemented in the R arules package to discover common subsequences in clinical diagnoses among patients in the study population. Figure 2B illustrates the sequential patterns identified by the SPADE algorithm for 3 hypothetical patients with clinical history shown in Figure 2A. Figure 3 illustrates the possible combinations of temporal diagnoses that SPADE detects. It is important to note that frequent patterns may not contain diagnoses from all 3 timing classes. As illustrated in Figure 3, condition patterns may comprise diagnoses that occurred in the pre- and postindex visits (Figure 3A), index and postindex visits (Figure 3B), preindex and index visits (Figure 3C), and across all 3 timing classes (Figure 3D). 
Additionally, SPADE identifies singular temporal diagnoses that occur among patients at the specified minimum support level (eg, a diagnosis of the EDC code ALL04 [asthma] in the index visit).
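The support-based, level-wise search described above can be illustrated with a simplified sketch. It is not SPADE itself and not the R implementation used in the study: it assumes each item is a diagnosis tagged with its timing class, so that joining two items reduces to intersecting their patient id-lists, and it stops at patterns of 2 items. The toy patients and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Item = Tuple[str, str]  # (timing class, EDC code)


def build_id_lists(db: Dict[str, List[Item]]) -> Dict[Item, Set[str]]:
    """Vertical layout: for each item, the set of patients in which it occurs."""
    id_lists: Dict[Item, Set[str]] = {}
    for patient_id, items in db.items():
        for item in items:
            id_lists.setdefault(item, set()).add(patient_id)
    return id_lists


def frequent_patterns(
    db: Dict[str, List[Item]], min_support: float
) -> Dict[Tuple[Item, ...], float]:
    """Return frequent 1-item and 2-item patterns with their support."""
    n = len(db)
    # Level 1: keep single items that meet the support threshold.
    frequent_items = {
        item: ids
        for item, ids in build_id_lists(db).items()
        if len(ids) / n >= min_support
    }
    patterns = {(item,): len(ids) / n for item, ids in frequent_items.items()}
    # Level 2: join only frequent items; infrequent ones cannot form frequent pairs.
    for a, b in combinations(sorted(frequent_items), 2):
        shared = frequent_items[a] & frequent_items[b]
        if len(shared) / n >= min_support:
            patterns[(a, b)] = len(shared) / n
    return patterns


# Hypothetical toy patients; EDC codes are taken from the Figure 2 legend.
toy_db = {
    "P1": [("preindex", "EYE07"), ("index", "ALL04"), ("postindex", "GAS03")],
    "P2": [("preindex", "EYE07"), ("index", "ALL04")],
    "P3": [("preindex", "MUS02"), ("index", "ALL04"), ("postindex", "EAR08")],
}
found = frequent_patterns(toy_db, min_support=0.6)
```

With a threshold of 0.6, only the index asthma item, the preindex conjunctivitis item, and their combination survive, since every other item occurs in a single patient.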

Figure 3.

Sample temporal diagnostic sequences discoverable by SPADE (Sequential PAttern Discovery using Equivalence classes). (A) A sequence that includes a clinical diagnosis in the preindex visit and the postindex visit. (B) A sequence that includes a clinical diagnosis in the index visit and the postindex visit. (C) A sequence that includes a clinical diagnosis in the preindex visit and the index visit. (D) A sequence that includes a clinical diagnosis in all 3 timing classes. Clinical diagnoses observed in a single timing class may also be considered common sequences.

We wanted to identify as many common condition patterns as possible while ensuring that the patterns were present among a meaningful number of patients; therefore, we selected a support level of 0.01 to identify all condition patterns that were present in 1% or more of patients. To characterize the impact of grouping clinically similar conditions, we deployed SPADE on both the full ICD dataset and the mapped EDC dataset. For ease of interpretability and results comparison, the common condition patterns identified by SPADE in the ICD dataset were then mapped from ICD codes to EDC codes. While the support values would not change for the common condition patterns identified from the ICD dataset once they have been mapped to EDC codes, it is possible for 2 or more patterns to appear to have the same temporal and diagnostic information, since more than 1 ICD code is mapped to a single EDC code. We analyzed our results for clinical relevance, and compared the prevalence of condition patterns by support level. We identified sequences with the highest support level, which have the strongest potential associations with pediatric asthma. As support is invariant to timing class, we also examined trends of condition diagnoses relative to the visit timing class.
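Two consequences of these choices can be made concrete with a short sketch. The first lines compute how many patients a pattern needs at the 0.01 support level in the final cohort of 71 824 patients. The rest shows how distinct ICD-level patterns can collide once relabeled with EDC codes; the lookup table, the ICD patterns, and their support values are hypothetical and are not taken from the study's results.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

N_PATIENTS = 71_824   # final cohort size reported in the inclusion criteria
MIN_SUPPORT = 0.01    # support level selected in the study

# Smallest patient count that satisfies support >= 0.01.
min_patients = math.ceil(MIN_SUPPORT * N_PATIENTS)
print(min_patients)

Item = Tuple[str, str]  # (timing class, code)


def relabel_patterns(
    icd_patterns: Dict[Tuple[Item, ...], float], lookup: Dict[str, str]
) -> Dict[Tuple[Item, ...], List[float]]:
    """Relabel ICD patterns with EDC codes, keeping every original support value."""
    grouped: Dict[Tuple[Item, ...], List[float]] = defaultdict(list)
    for pattern, supp in icd_patterns.items():
        mapped = tuple((timing, lookup[code]) for timing, code in pattern)
        grouped[mapped].append(supp)
    return dict(grouped)


# Hypothetical lookup and ICD-level results, for illustration only.
lookup = {"493.90": "ALL04", "J45.909": "ALL04"}
icd_patterns = {
    (("index", "493.90"),): 0.5,
    (("index", "J45.909"),): 0.2,
}
relabeled = relabel_patterns(icd_patterns, lookup)
```

The colliding support values are kept as a list rather than summed, because a patient may carry both ICD codes at the same visit, so the sum could overstate the share of patients with the EDC-level pattern.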

RESULTS

Clinical term mapping

There were 7072 unique ICD codes represented in the dataset, which mapped to 267 EDC codes.

SPADE analysis

When deployed on the mapped EDC dataset, SPADE’s runtime was 0.72 seconds, and it identified 439 sequences with a support level of 0.01 or higher. Support for these sequences ranged from 0.01 to 0.94. The mean and median level of support were 0.03 and 0.019, respectively. When deployed on the ICD dataset, SPADE’s runtime was 0.49 seconds, and it identified 203 sequences with a support level of 0.01 or higher. Support for these sequences ranged from 0.01 to 0.54. The mean and median level of support were 0.03 and 0.018, respectively. Table 1 illustrates the top 20 condition patterns with the highest level of support (0.09-0.94) in the study population identified in the EDC dataset. Among the most prevalent sequences, 5 distinct conditions were present: asthma without status asthmaticus, respiratory signs and symptoms, acute upper respiratory tract infection, allergic rhinitis, and otitis media.

Table 1.

Top 20 most prevalent condition patterns surrounding initial diagnosis of pediatric asthma (EDC dataset)

Preindex visit condition(s)	Index visit condition(s)	Postindex visit condition(s)	Support
	Asthma without status asthmaticus		0.942
		Asthma without status asthmaticus	0.436
	Asthma without status asthmaticus	Asthma without status asthmaticus	0.417
Acute upper respiratory tract infection			0.186
Acute upper respiratory tract infection	Asthma without status asthmaticus		0.175
	Allergic rhinitis		0.164
	Allergic rhinitis, asthma without status asthmaticus		0.158
		Acute upper respiratory tract infection	0.145
Respiratory signs and symptoms			0.139
	Acute upper respiratory tract infection		0.139
Otitis media			0.137
	Asthma without status asthmaticus	Acute upper respiratory tract infection	0.136
Respiratory signs and symptoms	Asthma without status asthmaticus		0.134
	Asthma without status asthmaticus, Acute upper respiratory tract infection		0.132
Otitis media	Asthma without status asthmaticus		0.131
		Otitis media	0.120
	Asthma without status asthmaticus	Otitis media	0.116
		Allergic rhinitis	0.102
	Otitis media		0.101
	Asthma without status asthmaticus, Otitis media		0.098

EDC: expanded diagnostic cluster.

Table 2 illustrates the conditions present in at least 1 of 439 sequences identified by SPADE in the EDC dataset. There were 36 unique EDC codes represented among the identified sequences. The majority of conditions were present in all 3 visit timing classes. However, there were 5 EDC codes that were identified in the preindex visit only: exanthems, nausea/vomiting, dermatophytosis, nonfungal infections of skin and subcutaneous tissue, and allergic reactions. No conditions were present exclusively in the postindex visit.

Table 2.

Conditions observed in prevalent sequences identified by SPADE (EDC dataset), by timing class

All visit classes	Preindex visits	Preindex and index visits	Preindex and postindex visits	Index visits	Index and postindex visits
Acute lower respiratory tract infection	Allergic reactions	Chronic pharyngitis and tonsillitis	Abdominal pain	Seizure disorder	Asthma without status asthmaticus
Acute upper respiratory tract infection	Dermatophytosis	Musculoskeletal disorders, other	Acute sprains and strains		Asthma, with status asthmaticus
Administrative concerns and nonspecific laboratory abnormalities	Exanthems		Contusions and abrasions
Allergic rhinitis	Nausea, vomiting		Deafness, hearing loss
Attention-deficit disorder	Nonfungal infections of skin and subcutaneous tissue		Musculoskeletal signs and symptoms
Conjunctivitis, keratitis			Urinary symptoms
Constipation
Cough
Dermatitis and eczema
Developmental disorder
ENT disorders, other
Failure to thrive
Gastroenteritis
Gastroesophageal reflux
Nonspecific signs and symptoms
Obesity
Otitis media
Respiratory signs and symptoms
Sinusitis
Viral syndromes

EDC: expanded diagnostic cluster; ENT: ear, nose, and throat; SPADE: Sequential PAttern Discovery using Equivalence classes.
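A grouping like the one in Table 2 can be derived from mined patterns by collecting, for each condition, the set of timing classes in which it appears. The sketch below is an assumed post-processing step, not the study's code; the handful of example patterns were chosen to be consistent with Tables 1 and 2, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Item = Tuple[str, str]  # (timing class, condition)


def timing_profiles(patterns: Iterable[Tuple[Item, ...]]) -> Dict[str, FrozenSet[str]]:
    """For each condition, collect every timing class in which it was observed."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for pattern in patterns:
        for timing, condition in pattern:
            seen[condition].add(timing)
    return {condition: frozenset(timings) for condition, timings in seen.items()}


def group_by_profile(
    profiles: Dict[str, FrozenSet[str]]
) -> Dict[FrozenSet[str], List[str]]:
    """Group conditions that share the same set of timing classes."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for condition, timings in sorted(profiles.items()):
        groups[timings].append(condition)
    return dict(groups)


# Small illustrative subset of patterns, not the full SPADE output.
patterns = [
    (("preindex", "Otitis media"),),
    (("index", "Otitis media"),),
    (("postindex", "Otitis media"),),
    (("preindex", "Exanthems"),),
    (
        ("index", "Asthma without status asthmaticus"),
        ("postindex", "Asthma without status asthmaticus"),
    ),
]
groups = group_by_profile(timing_profiles(patterns))
```

Each key of `groups` corresponds to one column of the table, for example the set containing only "preindex" for conditions seen exclusively before the index visit.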

Table 3 presents the top 20 condition patterns with the highest level of support (0.06-0.54) identified by SPADE in the full ICD dataset (the ICD concepts in the identified sequences were mapped to EDC codes for comparison, with results in Table 1).

Table 3.

Top 20 most prevalent condition patterns surrounding initial diagnosis of pediatric asthma (ICD dataset)

Preindex visit condition(s)	Index visit condition(s)	Postindex visit condition(s)	Support
	Asthma without status asthmaticus		0.538
		Asthma without status asthmaticus	0.251
	Asthma without status asthmaticus	Asthma without status asthmaticus	0.183
	Asthma without status asthmaticus		0.149
	Asthma without status asthmaticus		0.137
	Allergic rhinitis		0.125
Respiratory signs and symptoms			0.108
Acute upper respiratory tract infection			0.096
	Acute upper respiratory tract infection		0.088
	Cough		0.081
	Asthma without status asthmaticus		0.079
		Allergic rhinitis	0.077
	Allergic rhinitis, asthma without status asthmaticus		0.075
		Acute upper respiratory tract infection	0.072
Respiratory signs and symptoms	Asthma without status asthmaticus		0.064
		Asthma without status asthmaticus	0.061
Acute upper respiratory tract infection			0.059
	Cough		0.059
Otitis media			0.057
	Asthma without status asthmaticus		0.055

ICD: International Classification of Diseases.

Among the most prevalent sequences, 6 distinct conditions were present: asthma without status asthmaticus, respiratory signs and symptoms, acute upper respiratory tract infection, allergic rhinitis, cough, and otitis media. Table 4 shows the unique conditions present in the 203 sequences identified by SPADE in the ICD dataset (again mapped to EDC codes for ease of comparison). There were 19 unique EDC codes present. The majority of conditions occurred in all 3 visit timing classes. Only gastroenteritis was diagnosed solely in the preindex visit. No conditions were present exclusively in the postindex visit.

Table 4.

Conditions observed in prevalent sequences identified by SPADE (ICD dataset), by timing class

All visit classes	Preindex visits	Preindex and postindex visits	Index visits	Index and postindex visits
Acute lower respiratory tract infection	Gastroenteritis	Deafness, hearing loss	Attention-deficit disorder	Asthma without status asthmaticus
Acute upper respiratory tract infection			Asthma, with status asthmaticus
Administrative concerns and nonspecific laboratory abnormalities			Obesity
Allergic rhinitis
Constipation
Cough
Dermatitis and eczema
Failure to thrive
Gastroesophageal reflux
Otitis media
Respiratory signs and symptoms
Sinusitis
Viral syndromes

ICD: International Classification of Diseases; SPADE: Sequential PAttern Discovery using Equivalence classes.

Among the top 20 most prevalent patterns observed in the EDC dataset, SPADE detected 3 condition patterns with diagnoses observed in the preindex and index visits and 3 condition patterns with diagnoses observed in the index and postindex visits. Three singular patterns of diagnoses in the preindex visit, 7 in the index visit, and 4 in the postindex visit were observed. Among the top 20 most prevalent patterns in the ICD dataset, SPADE detected 1 condition pattern with diagnoses observed in the preindex and index visits and 1 condition pattern with diagnoses observed in the index and postindex visits. Four singular patterns of diagnoses in the preindex visit, 10 in the index visit, and 4 in the postindex visit were identified.

DISCUSSION

Methodological contributions

We presented a novel methodology to mine EHR data for temporal condition patterns associated with pediatric asthma. Our approach included extraction of relevant information from a large EHR database, temporal data organization, selection of a sequential pattern mining algorithm, and grouping of clinically similar codes to address data sparsity. The methodology described in this study can be used to study temporal condition patterns within large volumes of EHR data regardless of the condition of interest. We found the SPADE algorithm to be a time-efficient approach to mining a large, sparse dataset for frequent longitudinal clinical condition patterns surrounding initial diagnosis of a selected condition of interest (pediatric asthma in this study). In recent years, the generation and use of large datasets derived from nontraditional data sources (eg, EHRs) in the healthcare sector has increased. The generation of such data enables the extraction of useful and potentially previously unknown insights from massive datasets, but only with the proper computational methods to convert these data into clinically meaningful information and knowledge. Our work provides a case study of a successful method for aggregating and examining condition information from EHR data that can be directly applied to similar datasets. Our study specifically demonstrated the utility of EDC codes when mining large EHR datasets to reveal additional important patterns. When deployed on the full dataset of ICD codes, SPADE identified fewer than half as many common condition patterns as when deployed on the dataset containing EDC codes. SPADE was also better able to detect more complex temporal patterns in diagnoses in the EDC dataset. By using EDC codes, we consolidated clinically similar results, detected common condition patterns at higher levels of support, and found condition patterns at low levels of support that would otherwise not have been identified.
The EDC mapping improved the recognition of potential condition associations (36 unique diagnoses were identified in the mapped EDC dataset, compared with only 19 in the ICD dataset), and was essential to analyzing temporal trends in condition trajectories. Gastroenteritis was the only diagnosis exclusively found in the preindex visit in the ICD dataset. However, in the EDC dataset, gastroenteritis diagnoses were found across all timing classes, and 5 separate conditions were identified as occurring exclusively in the preindex visit. None of these 5 diagnoses were identified by SPADE in the ICD dataset. While our approach utilized the ACG system to aggregate related but sparsely used clinical concepts, it is important to note that other methods (eg, the Clinical Classifications Software for ICD-9-CM) exist for clustering patient diagnoses into a manageable number of clinically related categories for analysis. To our knowledge, this is the first study to utilize the SPADE algorithm to study temporal condition patterns surrounding initial diagnosis of pediatric asthma. The patterns identified by SPADE can be regarded as hypotheses, for future studies to explore, about associations between pediatric asthma and specific conditions and temporal patterns. We found strong associations between pediatric asthma and allergic rhinitis, respiratory signs and symptoms, acute upper respiratory tract infection, and otitis media. Five conditions (exanthems, nausea/vomiting, dermatophytosis, nonfungal infections of skin and subcutaneous tissue, and allergic reactions) were present exclusively in the preindex visit among common sequences. These conditions may represent signals of future asthma that could alert clinicians. Future research into the reproducibility and potential causal relations of these associations is warranted.

Diagnostic pattern associations

The patterns with the highest support discovered by SPADE in this study align with clinical knowledge of pediatric asthma comorbidities. Allergic rhinitis, sinusitis, eczema, and respiratory infections have previously been shown to be associated with pediatric asthma. As these findings are consistent with prior epidemiological asthma studies, we may speculate that the other condition patterns found in this study may also be clinically relevant and are possible areas for future epidemiological research. Furthermore, this approach may reveal information about disease comorbidities. For example, although it is known that asthmatic children with multiple morbidities have increased asthma symptoms, school absences, and visits to emergency departments, knowledge of pediatric asthma comorbidities remains an understudied topic. Our methodology represents a scalable approach to study comorbidity patterns that may be employed to explore pediatric asthma and other critical health conditions. The methods described in this study differ from other epidemiological approaches, such as those described in Beck et al, which focused on the predictive power of disease trajectories uncovered from EHR data to assess relative risk of sepsis mortality. Rather than studying the predictive power of a particular sequence, our work serves to describe an approach to analyze complex datasets to uncover patterns and generate hypotheses for future research. While we focused on pediatric asthma, the methodology is not dependent on the conditions analyzed and can be utilized to uncover temporal patterns surrounding any health outcome (eg, diabetes or obesity) or medical event of interest (eg, medication usage). Such an approach may potentially uncover previously unknown associations that have clinical and public health utility for researchers studying asthma and other health outcomes.

Limitations

We utilized a large and complex dataset, which required extensive personnel time and resources to obtain and analyze. As EHRs are primarily designed for clinical care and billing purposes, there are challenges to utilizing the data for clinical and translational research. Although groups such as Observational Health Data Sciences and Informatics have recommended that EHR data be organized temporally, most EHRs are currently stored in databases in which the data points for a single patient are spread across different tables and schemas. As EHR data are typically not temporally organized in a single table, relevant data must be identified across the database and retrospectively transformed into temporal sequences before temporal patterns can be mined. It is important to note that our dataset was developed from a secondary-use research database composed of aggregated EHR data. Additional steps may be necessary for researchers seeking to implement our methods utilizing data directly collected from the EHR. Grouping related concepts in our dataset was an essential step; without it, SPADE would not have detected relevant patterns at the specified support levels. Researchers implementing a similar approach may overlook important condition associations if they apply a pattern mining algorithm to EHR data using ICD codes (or similar sparsely used concept codes). Finally, because we used a retrospective observational study design, the associations that SPADE uncovered are purely descriptive. No causation can be attributed to the comorbidities and temporal condition patterns that SPADE identified with pediatric asthma. Rather, this approach should be viewed as a hypothesis-generating method, whose results represent potential associations that future research can investigate for causality.
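As a rough illustration of the transformation step described above, the following sketch turns long-format diagnosis rows into per-patient, date-ordered sequences and computes the support of one candidate temporal pattern. The row layout, patient identifiers, dates, and concept labels are all hypothetical, and the functions `build_sequences`, `contains`, and `sequence_support` are illustrative helpers rather than part of any SPADE implementation. SPADE itself enumerates frequent sequences using a vertical id-list representation and equivalence classes; this sketch only checks a single, given pattern.

```python
from collections import defaultdict
from datetime import date

# Hypothetical long-format rows, as might be gathered from several tables:
# (patient_id, visit_date, concept). All values are made up.
rows = [
    (1, date(2015, 1, 5), "RHINITIS"),
    (1, date(2015, 3, 2), "ASTHMA"),
    (1, date(2015, 6, 9), "OTITIS"),
    (2, date(2016, 2, 1), "ASTHMA"),
    (2, date(2016, 2, 1), "RHINITIS"),
    (3, date(2014, 7, 7), "RHINITIS"),
    (3, date(2014, 9, 1), "ASTHMA"),
]

def build_sequences(rows):
    """Group rows into per-patient, date-ordered lists of concept sets."""
    visits = defaultdict(lambda: defaultdict(set))
    for patient_id, day, concept in rows:
        visits[patient_id][day].add(concept)
    return {
        patient_id: [by_day[d] for d in sorted(by_day)]
        for patient_id, by_day in visits.items()
    }

def contains(sequence, pattern):
    """True if the pattern's itemsets occur in order in distinct visits."""
    pos = 0
    for itemset in pattern:
        # Advance to the earliest remaining visit that includes this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a strictly later visit
    return True

def sequence_support(sequences, pattern):
    """Fraction of patients whose sequence contains the pattern."""
    hits = sum(contains(seq, pattern) for seq in sequences.values())
    return hits / len(sequences)

sequences = build_sequences(rows)
# Pattern: a RHINITIS visit followed by a later ASTHMA visit.
before_support = sequence_support(sequences, [{"RHINITIS"}, {"ASTHMA"}])
# Pattern: RHINITIS and ASTHMA recorded at the same visit.
same_visit_support = sequence_support(sequences, [{"RHINITIS", "ASTHMA"}])
print(before_support, same_visit_support)
```

Keeping each visit as a set of concepts preserves the distinction between diagnoses recorded together and diagnoses recorded in order, which is the distinction a temporal pattern miner relies on.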

CONCLUSION

EHR data mining has tremendous potential to improve patient care and reduce financial costs. Frequent pattern mining of EHR data can identify potential associations and correlations that researchers may not have considered or that may otherwise go unnoticed. We presented an approach to analyzing a large and complex EHR dataset for temporal condition patterns, using pediatric asthma as a case study. Our analysis revealed strong associations between asthma and several comorbidities and temporal condition patterns. These associations can be used as hypotheses to explore causality in future pediatric asthma research. The methodology presented in this study can be applied to identify temporal patterns in EHR data to investigate conditions and research objectives in numerous contexts outside pediatric asthma.

FUNDING

This work was supported by the Commonwealth Universal Research Enhancement program of the Pennsylvania Department of Health grant number 2015 Formula award SAP #4100072543.

AUTHOR CONTRIBUTIONS

AJM, EAC, and EJB conceived of and designed the study. EAC and AJM conducted the data review. EAC performed the analysis. EAC, AJM, and EJB analyzed the results. EAC wrote the manuscript. All authors contributed to the review and revisions of the manuscript. All authors have seen and approved the final version of the manuscript.

CONFLICT OF INTEREST STATEMENT

None declared.

20 in total

1. The extent and patterns of multiple chronic conditions in low-income children.

Authors: Noreen M Clark; Laurie Lachance; M Beth Benedict; Roderick Little; Harvey Leo; Daniel F Awad; Margaret K Wilkin
Journal: Clin Pediatr (Phila) Date: 2015-04 Impact factor: 1.168

2. The inevitable application of big data to health care.

Authors: Travis B Murdoch; Allan S Detsky
Journal: JAMA Date: 2013-04-03 Impact factor: 56.272

3. Mining and exploring care pathways from electronic medical records with visual analytics.

Authors: Adam Perer; Fei Wang; Jianying Hu
Journal: J Biomed Inform Date: 2015-07-02 Impact factor: 6.317

Review 4. Pediatric asthma: natural history, assessment, and treatment.

Authors: Ronit Herzog; Susanna Cunningham-Rundles
Journal: Mt Sinai J Med Date: 2011 Sep-Oct

5. A methodology for interactive mining and visual analysis of clinical event patterns using electronic health record data.

Authors: David Gotz; Fei Wang; Adam Perer
Journal: J Biomed Inform Date: 2014-01-28 Impact factor: 6.317

Review 6. Comorbidities of asthma during childhood: possibly important, yet poorly studied.

Authors: E P de Groot; E J Duiverman; P L P Brand
Journal: Eur Respir J Date: 2010-09 Impact factor: 16.671

7. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients.

Authors: Anders Boeck Jensen; Pope L Moseley; Tudor I Oprea; Sabrina Gade Ellesøe; Robert Eriksson; Henriette Schmock; Peter Bjødstrup Jensen; Lars Juhl Jensen; Søren Brunak
Journal: Nat Commun Date: 2014-06-24 Impact factor: 14.919

Review 8. Big data analytics in healthcare: promise and potential.

Authors: Wullianallur Raghupathi; Viju Raghupathi
Journal: Health Inf Sci Syst Date: 2014-02-07

9. Diagnosis trajectories of prior multi-morbidity predict sepsis mortality.

Authors: Mette K Beck; Anders Boeck Jensen; Annelaura Bach Nielsen; Anders Perner; Pope L Moseley; Søren Brunak
Journal: Sci Rep Date: 2016-11-04 Impact factor: 4.379

10. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record.

Authors: Wei-Qi Wei; Lisa A Bastarache; Robert J Carroll; Joy E Marlo; Travis J Osterman; Eric R Gamazon; Nancy J Cox; Dan M Roden; Joshua C Denny
Journal: PLoS One Date: 2017-07-07 Impact factor: 3.240
