Danlu Liu, William Baskett, David Beversdorf, Chi-Ren Shyu.
Abstract
Finding small homogeneous subgroup cohorts in large heterogeneous populations is a critical process for hypothesis development in biomedical research. Existing computational approaches still lack robust answers to the question "what hypotheses are likely to be novel and to produce clinically relevant results with well thought-out study designs?" We have developed a novel subgroup discovery method that employs a deep exploratory mining process to slice and dice thousands of potential subpopulations and to prioritize potential cohorts based on explainable contrast patterns that may provide interventionable insights. We conducted computational experiments on both synthesized data and a clinical autism data set, assessing performance quantitatively by the coverage of pre-defined cohorts and qualitatively by novel knowledge discovery, respectively. We also conducted a scaling analysis in a distributed computing environment to suggest the computational resources needed as the number of subpopulations increases. This work will provide a robust data-driven framework to automatically tailor potential interventions for precision health.
Year: 2019 PMID: 31494566 PMCID: PMC9341221 DOI: 10.1109/JBHI.2019.2939149
Source DB: PubMed Journal: IEEE J Biomed Health Inform ISSN: 2168-2194 Impact factor: 7.021
Fig. 1. The overall system architecture of the distributed exploratory mining workflow. The architecture can be divided into three parts: data mapping, deep exploratory mining, and distributed computing. An expert takes part in the data mapping step to map raw data into the mineable space; the formatted data is then fed to the deep exploratory mining process, which runs on a Big Data ecosystem. Contrast subgroups are selected and their contrast patterns are mined in the distributed environment. Finally, all selected contrast subgroups are evaluated on their effective contrast patterns using an evaluation function J.
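As a rough illustration of the data mapping step, the sketch below bins a numeric score into ordered levels such as the Low/Mid/High labels that appear in the results table. The equal-frequency cut points and the function name `map_to_levels` are assumptions made for this example; the record only states that an expert maps raw data into the mineable space and does not give the actual cut points.

```python
from statistics import quantiles


def map_to_levels(values, labels=("Low", "Mid", "High")):
    """Bin numeric scores into ordered levels.

    Assumption: equal-frequency cut points. In the described workflow an
    expert chooses the mapping, so real cut points may differ.
    """
    # quantiles() returns len(labels) - 1 cut points.
    cuts = quantiles(values, n=len(labels))

    def level(value):
        for cut, label in zip(cuts, labels):
            if value <= cut:
                return label
        return labels[-1]

    return [level(v) for v in values]


# Hypothetical full-scale IQ scores, used only to exercise the function.
iq_scores = [70, 85, 90, 100, 105, 110, 120, 130, 140]
iq_levels = map_to_levels(iq_scores)
```

Each mapped value can then serve as a categorical population variable (for example "Low SSC Full Scale IQ") when subgroups are formed.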
Floating Contrast Subgroup Selection.
[Algorithm listing, 22 steps. Legible step: 4, INCLUSION ( …]
[Algorithm listing, 14 steps. Legible steps: 4, Compose a set of contrast pairs …; 7, Divide data …; 9, Add …; 12, Select the highest …; 14, Remove the population variables of …]
[Algorithm listing, 11 steps. Legible steps: 5, Divide data …; 7, Add …; 9, Select the highest …; 11, Add the population variable of …]
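Because the step contents of these listings are not legible, the following is only a sketch of the classic sequential floating forward selection scheme that the name "floating" and the INCLUSION step suggest. It is not the authors' algorithm. The `score` callable stands in for the evaluation function J, and `toy_score` with its variable names is entirely hypothetical.

```python
def floating_selection(variables, score, max_size):
    """Generic sequential floating forward selection (a textbook scheme).

    variables: candidate population variables.
    score: maps a frozenset of variables to a number (stand-in for J).
    """
    selected = frozenset()
    best = {0: (score(selected), selected)}  # best (score, subset) per size

    while len(selected) < max_size:
        remaining = [v for v in variables if v not in selected]
        if not remaining:
            break
        # Inclusion: add the variable that gives the highest score.
        selected = max((selected | {v} for v in remaining), key=score)
        size = len(selected)
        if size not in best or score(selected) > best[size][0]:
            best[size] = (score(selected), selected)
        # Exclusion: drop variables while that beats the best smaller subset.
        while len(selected) > 2:
            candidate = max((selected - {v} for v in selected), key=score)
            if score(candidate) > best[len(candidate)][0]:
                selected = candidate
                best[len(selected)] = (score(selected), selected)
            else:
                break

    return max(best.values(), key=lambda pair: pair[0])[1]


def toy_score(subset):
    """Hypothetical integer score: per-variable gain, one interaction bonus,
    and a penalty per selected variable."""
    gain = {"age": 2, "iq": 5, "speech": 4, "sex": 1}
    total = sum(gain[v] for v in subset)
    if {"iq", "speech"} <= subset:
        total += 5
    return total - 3 * len(subset)


best_subset = floating_selection(["age", "iq", "speech", "sex"], toy_score, 3)
```

The exclusion loop is what makes the search "floating": a variable added early can be dropped later if a smaller subset turns out to score higher than any subset of that size seen so far.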
Fig. 2. The Guided Cascading Shotgun approach for the path expansion process, which explores multiple paths in each inclusion and exclusion procedure for cohort selection.
Fig. 3. Distributed pattern mining for a contrast subgroup using an Apache Spark high performance computing environment.
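To make the idea of a contrast pattern concrete, here is a single-process sketch that enumerates short item combinations and keeps those whose support differs between two subgroups. The support-difference criterion, the threshold `min_diff`, and the item names are assumptions for illustration; the record does not define the measure, and the described system runs this kind of mining on Apache Spark, not in plain Python.

```python
from collections import Counter
from itertools import combinations


def pattern_supports(records, max_len):
    """Support (fraction of records) of every item combination up to max_len."""
    counts = Counter()
    for record in records:
        items = sorted(record)
        for size in range(1, max_len + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    total = len(records)
    return {pattern: count / total for pattern, count in counts.items()}


def contrast_patterns(group_a, group_b, max_len=2, min_diff=0.5):
    """Patterns whose support differs by at least min_diff between groups.

    Assumption: absolute support difference is the contrast measure.
    """
    support_a = pattern_supports(group_a, max_len)
    support_b = pattern_supports(group_b, max_len)
    found = {}
    for pattern in support_a.keys() | support_b.keys():
        a = support_a.get(pattern, 0.0)
        b = support_b.get(pattern, 0.0)
        if abs(a - b) >= min_diff:
            found[pattern] = (a, b)
    return found


# Hypothetical measurement items per individual in two subgroups.
subgroup_1 = [{"g1", "g2"}, {"g1", "g2"}, {"g1"}, {"g3"}]
subgroup_2 = [{"g3"}, {"g3"}, {"g1", "g3"}, {"g2"}]
found = contrast_patterns(subgroup_1, subgroup_2)
```

Exhaustive enumeration grows quickly with `max_len` and the number of items, which is one reason a distributed environment becomes attractive as the number of subgroups rises.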
Fig. 4. The generation of a synthesized dataset containing subgroup pairs where contrast patterns have various overlapping factors in the measurement (M) space with varying length of patterns. There are N/k subgroup pairs for lengths from 1 to k randomly assigned to the dataset.
Fig. 5. The coverage of all artificial cohorts discovered by the algorithm on the synthesized data. Each synthesized dataset has one million records. The number of population variables ranges from 5 to 20, and the expanding percentage ranges from 5% to 20%.
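One simple way to read "coverage" is the fraction of planted (artificial) cohorts that the algorithm recovers. The sketch below assumes a cohort counts as recovered only when its defining set of population-variable values matches exactly; the record does not state the matching rule, and the cohort definitions shown are hypothetical.

```python
def cohort_coverage(planted, discovered):
    """Fraction of planted cohorts that also appear among discovered ones.

    Assumption: a cohort is identified by the exact set of
    population-variable values that defines it.
    """
    planted_set = {frozenset(cohort) for cohort in planted}
    discovered_set = {frozenset(cohort) for cohort in discovered}
    if not planted_set:
        return 0.0
    return len(planted_set & discovered_set) / len(planted_set)


# Hypothetical cohort definitions, used only to exercise the function.
planted = [{"p1=Low"}, {"p2=High"}, {"p1=Low", "p3=Mid"}, {"p4=High"}]
discovered = [{"p1=Low"}, {"p3=Mid", "p1=Low"}, {"p4=High"}, {"p5=Low"}]
coverage = cohort_coverage(planted, discovered)
```

Here three of the four planted cohorts are recovered, and the extra discovered cohort does not affect the ratio.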
Fig. 6. The running time for different numbers of population variables, with the expanding factor equal to 20%, on 6, 12, 18, 24, and 30 computational nodes.
Rated Contrast Subgroups and Ratio of Published Significant Genes
| Subgroup 1 (a): Population Variable(s) | Subgroup 1: Cohort Size | | Subgroup 2: Population Variable(s) | Subgroup 2: Cohort Size | No. of Discovered Genes | No. of Discovered Genes also in AutDB (b) | No. of PubMed Articles |
|---|---|---|---|---|---|---|---|
| Low SSC Full Scale IQ | 459 | vs | High SSC Full Scale IQ | 373 | 5 | 1 | 2242 |
| Normal to Speak Sentences | 346 | vs | Late to Speak Sentences | 304 | 16 | 3 | 5130 |
| | | | | | | | |
| Mid RBS-R Overall Score | 202 | vs | Low RBS-R Overall Score | 77 | 44 | 6 | 898 |
| Low ABC III Stereotypy Scale | 171 | vs | High ABC III Stereotypy Scale | 159 | 18 | 2 | 452 |
| | | | | | | | |
| Mid Vineland II Daily Living | 253 | vs | High Vineland II Daily Living | 54 | 22 | 4 | 0 |
| Mid CBCL6 Rule Breaking Score | 228 | vs | High CBCL6 Rule Breaking Score | 59 | 25 | 4 | 0 |
(a) SSC Full Scale IQ = Simons Simplex Collection Full Scale IQ; RBS-R = Repetitive Behaviors Scale-Revised; CBCL6 = Child Behavior Checklist for ages 6-18; ABC III = Aberrant Behavior Checklist-Stereotypy Scale; Vineland II Daily Living = Vineland Adaptive Behavior Scales-Second Edition, Daily Living domain; ADIR C Total = Autism Diagnostic Interview-Revised (ADI-R) Restricted, Repetitive, and Stereotyped Patterns of Behavior total score; SRS-P = Social Responsiveness Scale – Parent Report.
(b) Details about the significant genes are in Supplement 1.