Clarlynda R Williams-DeVane, David M Reif, Elaine Cohen Hubal, Pierre R Bushel, Edward E Hudgens, Jane E Gallagher, Stephen W Edwards.
Abstract
BACKGROUND: Complex diseases are often difficult to diagnose, treat and study due to the multi-factorial nature of the underlying etiology. Large data sets are now widely available that can be used to define novel, mechanistically distinct disease subtypes (endotypes) in a completely data-driven manner. However, significant challenges exist with regard to how to segregate individuals into suitable subtypes of the disease and understand the distinct biological mechanisms of each when the goal is to maximize the discovery potential of these data sets.
Year: 2013 PMID: 24188919 PMCID: PMC4228284 DOI: 10.1186/1752-0509-7-119
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
List of 81 covariates used in analysis
| Category | Covariate | Units |
| --- | --- | --- |
| Allergen screen | **FoodScreen (5 food allergens) | kUA/L |
| | **Phadiatop (15 aeroallergens) | kUA/L |
| | Total Serum IgE | kU/L |
| Blood Chemistry | C-reactive Protein | mg/ml |
| | Albumin | g/dL |
| | Alkaline Phosphatase | IU/L |
| | Serum Glutamic Pyruvic Transaminase | IU/L |
| | Serum Aspartate Aminotransferase (AST) / Serum Glutamic-Oxaloacetic Transaminase (SGOT) | IU/L |
| | Albumin/Globulin ratio | |
| | Serum Total Bilirubin | mg/dL |
| | Serum Blood Urea Nitrogen | mg/dL |
| | Serum Blood Urea Nitrogen Creatinine Ratio | |
| | Serum Calcium | mg/dL |
| | Serum Chloride | mmol/L |
| | Serum Creatinine | mg/dL |
| | Serum Ferritin | ng/ml |
| | Serum Fibrinogen | mg/dL |
| | Serum Gamma-Glutamyl Transpeptidase (GGT) | IU/L |
| | Serum Total Globulin | g/dL |
| | Blood Hematocrit | % |
| | Blood Hemoglobin | g/dL |
| | Serum Iron | ug/dl |
| | Serum Lactate Dehydrogenase | IU/L |
| | Plasma Leptin | ng/ml |
| | Serum Glycated Hemoglobin | % |
| | Serum Glucose | mg/dL |
| | Potassium | mmol/L |
| | Urine Creatinine | mg/dl |
| | Serum Arachidonic Acid | ug/ml |
| | Serum Osmolality | mOsmol/kg |
| | Serum Phospholipids Concentration | mg/dL |
| | Serum Phosphorus | mg/dL |
| | Serum Total Protein | g/dL |
| | Sodium | mmol/L |
| | BP Oxygen Saturation (Dissolved Oxygen) | % |
| CBC | White Blood Cell Count | K/uL |
| | Basophil percent of sum White Blood Cells | % |
| | Eosinophil percent of sum White Blood Cells | % |
| | Lymphocyte percent of sum White Blood Cells | % |
| | Monocyte percent of sum White Blood Cells | % |
| | Neutrophils percent of sum White Blood Cells | % |
| Clinic | Subject Age | Years |
| | Subject Height | cm |
| | Subject Body Mass Index Weight/height | kg/m2 |
| | Subject Weight | kg |
| | Mean of first Two Diastolic Blood Pressure Measurements | mmHg |
| | Mean of first two Systolic Blood Pressure Measurements | mmHg |
| | Blood Pressure Pulse | beats/min |
| Hematology | Red Blood Cell Count | M/uL |
| | Platelet Count | K/uL |
| | Mean Corpuscular Hemoglobin | pg |
| | Mean Corpuscular hemoglobin concentration | g/dL |
| | Mean Corpuscular Volume | fL*** |
| | Red Blood Cell Distribution Width | % |
| Inflammatory | Interleukin-4 | pg/ml |
| | Serum Total Antioxidant Status | mmol/L |
| | Serum Unbound Iron-Binding Capacity | ug/dl |
| | Serum Uric Acid | mg/dL |
| | Plasma Average of Reactive Oxygen Species Measurements minus Control | RLU**** |
| Lipids | High density Lipoprotein | mg/dL |
| | Low density Lipoprotein | mg/dL |
| | Total Cholesterol to High density Lipoprotein Ratio | |
| | Total Cholesterol | mg/dL |
| | Serum Triglycerides | mg/dL |
| | Very Low Density Lipoprotein | mg/dL |
| Lung Function | Forced Expiratory Flow Between 25% and 75% of Forced Expiratory Flow | |
| | Fractional Exhaled Nitric Oxide | ppb |
| | Forced Expiratory Volume to Forced Vital Capacity ratio | % |
| | Peak Expiratory Flow | L/min |
| Serum Allergens | *Serum Alternaria Alternata | kUA/L |
| | *Serum Aspergillus Fumigatus | kUA/L |
| | *Serum Cat Dander Epithel | kUA/L |
| | *Serum Cladosporium Herbarum | kUA/L |
| | *Serum Derm Farin Dustmite | kUA/L |
| | *Serum Derm Pter Dustmite | kUA/L |
| | *Serum Dog Dander | kUA/L |
| | *Serum German Cockroach | kUA/L |
| | *Serum Mouse Urine Protein | kUA/L |
| | *Serum Penicillium Notatum | kUA/L |
| | *Serum Rat Urine Protein | kUA/L |
*Indicates variables not included in the 67 Covariate List. **Indicates variables that were converted to categorical variables for the 67 Covariate List. ***femtoliters. ****relative luminescence unit.
Figure 1. Data pre-processing workflow for gene expression and clinical data used in each of the analysis methods.
Figure 2. Data incorporation scheme for each data analysis method considered. Shading corresponds to each data domain: Gene Expression, Clinical Covariates, and Indicators of Disease Status.
Indicators of disease status
| Confirmed asthma | When parent questionnaire response to the question “has a doctor ever diagnosed this child as having asthma” was confirmed through administrative records regarding clinic visits and the prescription of asthma medication | 146/59 |
| Current asthma | Labeled Asthmatics and Non-Asthmatics reporting an asthma attack in the last 12 months | 192/13 |
| Questionnaire-defined asthma | Positive response to “has a doctor ever diagnosed this child as having asthma” on parent questionnaire | 186/19 |
| Phadiatop – Atopy/Allergen screen | Positive serum test to a panel of at least 15 common allergens | 173/32 |
| Foodscreen – Food allergen screen | Positive serum test to a panel of 6 common allergy-provoking foods (cow's milk protein, egg white, wheat, codfish, peanut, and soybean) | 190/15 |
Column 2 provides a description of each disease indicator. Column 3 shows the number of subjects for which a given disease status was unknown.
Modk-prototypes weighting schemes
| Gene Expression | Clinical Covariates | Disease Indicators |
| --- | --- | --- |
| 33 | 33 | 33 |
| 20 | 40 | 40 |
| 40 | 20 | 40 |
| 50 | 50 | 0 |
| 30 | 60 | 10 |
| 60 | 30 | 10 |
| 40 | 40 | 20 |
| Adaptive | Adaptive | Adaptive |
The Modk algorithm was run with 7 pre-defined weighting schemes. This table shows the combinations of gene expression, clinical covariates, and disease indicator weighting for each run. For the adaptive weighting scheme, the weights were determined by the algorithm as described in the methods.
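To make a scheme such as 40/20/40 concrete, the following is a minimal sketch of how percentage weights could combine one dissimilarity per data domain into a single score. It is not the Modk implementation: the function name, the per-domain dissimilarities (squared Euclidean distance for the numeric domains and a mismatch count for the categorical indicators, as in standard k-prototypes), and the toy values are all assumptions made for illustration.

```python
def weighted_dissimilarity(subject, prototype, weights):
    """Combine per-domain dissimilarities using percentage weights.

    subject / prototype: dicts with 'expression' and 'clinical'
    (lists of floats) and 'indicators' (list of categorical labels).
    weights: (expression, clinical, indicators) percentages, e.g. (40, 20, 40).
    All names here are hypothetical, chosen only for this sketch.
    """
    w_expr, w_clin, w_ind = (w / 100.0 for w in weights)

    def sq_euclid(a, b):
        # Squared Euclidean distance for numeric domains (assumed).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mismatches(a, b):
        # Simple matching dissimilarity for categorical domains (assumed).
        return sum(1 for x, y in zip(a, b) if x != y)

    return (
        w_expr * sq_euclid(subject["expression"], prototype["expression"])
        + w_clin * sq_euclid(subject["clinical"], prototype["clinical"])
        + w_ind * mismatches(subject["indicators"], prototype["indicators"])
    )


# Toy subject and cluster prototype (made-up values).
subject = {"expression": [1.0, 2.0], "clinical": [0.5], "indicators": ["yes", "no"]}
prototype = {"expression": [1.0, 0.0], "clinical": [1.5], "indicators": ["yes", "yes"]}

# Under 50/50/0 the disease indicators do not influence the score at all.
score_no_indicators = weighted_dissimilarity(subject, prototype, (50, 50, 0))
score_with_indicators = weighted_dissimilarity(subject, prototype, (40, 20, 40))
```

In this toy case the same subject scores differently against the same prototype depending on the scheme, which is why cluster assignments can change from one weighting run to the next.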
Figure 3. Multistep decision tree method.
Top ranking gene expression clustering methods from clusterSim
| Index | Index value | Distance measure | Clustering method | Number of clusters |
| --- | --- | --- | --- | --- |
| Silhouette | 0.6325 | Manhattan | Hierarchical - Single linkage | 2 |
| Baker & Hubert | 1 | Manhattan | Hierarchical - Single linkage | 2 |
| Hubert & Levine | 0.0615 | Generalized Distance Measure | Hierarchical - Complete linkage | 50 |
| | | Generalized Distance Measure | Hierarchical - Complete linkage | 14 |
| | | Generalized Distance Measure | Hierarchical - Complete linkage | 12 |
The optimal distance measure and clustering method using three separate indices are shown along with the associated index value in each case. Where no index metric or value is given, an attempt was made to create more informative clusters rather than optimize a clustering index.
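As an illustration of what optimizing a clustering index involves, the sketch below computes the mean silhouette width for a given partition using Manhattan distance, the combination reported for the top silhouette result. It is plain Python written for this example and does not reproduce the clusterSim interface; the function names and toy points are assumptions.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def silhouette(points, labels, dist=manhattan):
    """Mean silhouette width over all points (needs at least 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Two well-separated pairs of points (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette(points, [0, 0, 1, 1])  # partition that matches the structure
bad = silhouette(points, [0, 1, 0, 1])   # partition that splits each pair
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own. A search over distance measures and clustering methods would keep the partition with the best index value.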
Figure 4. Scatterplots for each of the Gene Expression Clustering methods (Table 2); A: Silhouette, B: Baker & Hubert, C: Hubert & Levine, D-E: Additional clustering combinations. Colors are representative of individual clusters.
Top ranking 81 covariate clustering methods from clusterSim
| Index | Index value | Distance measure | Clustering method | Number of clusters |
| --- | --- | --- | --- | --- |
| Silhouette | 0.6662 | Generalized Distance Measure | Partitioning Around Medoids | 2 |
| Baker & Hubert | 0.9954 | Chebyshev | Hierarchical - Single linkage | 2 |
| Hubert & Levine | 0.0290 | Generalized Distance Measure | Hierarchical - Complete linkage | 25 |
| | | Generalized Distance Measure | Hierarchical - Average linkage | 11 |
| | | Generalized Distance Measure | Hierarchical - Average linkage | 14 |
| | | Generalized Distance Measure | Hierarchical - Average linkage | 12 |
The optimal distance measure and clustering method using three separate indices are shown along with the associated index value in each case. Where no index metric or value is given, an attempt was made to create more informative clusters rather than optimize a clustering index.
Figure 5. Scatterplots for each of the 81 Clinical Covariate Clustering Methods (Table 3); A: Silhouette, B: Baker & Hubert, C: Hubert & Levine, D: Prespecified cluster count = 11, E: Prespecified cluster count = 12, F: Prespecified cluster count = 14. Colors are representative of individual clusters.
Top ranking 67 covariate clustering methods from clusterSim
| Index | Index value | Distance measure | Clustering method | Number of clusters |
| --- | --- | --- | --- | --- |
| Silhouette | 0.6692 | Generalized Distance Measure | Partitioning Around Medoids | 2 |
| Baker & Hubert | 0.9122 | Chebyshev | Hierarchical - Single linkage | 2 |
| Hubert & Levine | 0.0279 | Generalized Distance Measure | Partitioning Around Medoids | 24 |
| | | Generalized Distance Measure | Hierarchical - Average linkage | 8 |
| | | Generalized Distance Measure | Hierarchical - Average linkage | 14 |
| | | Generalized Distance Measure | Hierarchical - Average linkage | 13 |
The optimal distance measure and clustering method using three separate indices are shown along with the associated index value in each case. Where no index metric or value is given, an attempt was made to create more informative clusters rather than optimize a clustering index.
Figure 6. Scatterplots for each of the 67 Clinical Covariate Clustering Methods (Table 4); A: Silhouette, B: Baker & Hubert, C: Hubert & Levine, D: Prespecified cluster count = 11, E: Prespecified cluster count = 12, F: Prespecified cluster count = 14. Colors are representative of individual clusters.
Accuracy per Method/Weighting scheme
| Method/Weighting scheme | Accuracy (%) | Asthmatic/Non-asthmatic groups |
| --- | --- | --- |
| 50/50/0 | 56 | 2/4 |
| 30/60/10 | 61 | 3/3 |
| 60/30/10 | 63 | 4/8 |
| Adaptive | 60 | 1/3 |
The weighting schemes are shown as Gene Expression Domain Weighting/Clinical Covariate Domain Weighting/Indicators of Disease Status Domain Weighting. Column 2 shows the percentage of subjects correctly classified by asthma status when each group is assigned the asthma status of the majority of its subjects. Column 3 shows the number of asthmatic and non-asthmatic groups, respectively, under this definition. Entries in bold represent those methods showing at least 65% accuracy. Information on the classification accuracy of the individual asthmatic clusters for these methods is shown in Table 8.
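The majority-based accuracy described in this caption can be sketched as follows. The function name, the tie-breaking rule, and the toy data are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote_accuracy(cluster_ids, is_asthmatic):
    """Label each cluster by its majority status and score the result.

    Returns (percent correctly classified, number of asthmatic clusters,
    number of non-asthmatic clusters). Ties are labeled non-asthmatic,
    which is an arbitrary choice made for this sketch.
    """
    counts = {}
    for cid, status in zip(cluster_ids, is_asthmatic):
        counts.setdefault(cid, Counter())[status] += 1

    correct = asthma_groups = other_groups = 0
    for tally in counts.values():
        if tally[True] > tally[False]:
            asthma_groups += 1
            correct += tally[True]   # asthmatics in an asthmatic cluster
        else:
            other_groups += 1
            correct += tally[False]  # non-asthmatics in a non-asthmatic cluster
    accuracy = 100.0 * correct / len(cluster_ids)
    return accuracy, asthma_groups, other_groups


# Ten made-up subjects in three clusters.
clusters = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
asthma = [True, True, False, False, False, False, True, True, False, False]
result = majority_vote_accuracy(clusters, asthma)
```

For the toy data, cluster 1 is labeled asthmatic and clusters 2 and 3 non-asthmatic, so seven of the ten subjects agree with their cluster label.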
Accuracy per asthmatic leaf for each Modk weighting scheme and the multi-step decision tree method
| Method/Weighting scheme | Cluster number | Asthmatics in group (%) |
| --- | --- | --- |
| 33/33/33 | 1 | 84 |
| | 3 | 58 |
| | 5 | 88 |
| | 8 | 67 |
| 20/40/40 | 2 | 68 |
| | 3 | 83 |
| | 4 | 67 |
| 40/20/40 | 1 | 73 |
| | 2 | 90 |
| | 4 | 55 |
| | 5 | 100 |
| | 6 | 56 |
| | 8 | 57 |
| 40/40/20 | 3 | 86 |
| | 5 | 70 |
| | 6 | 60 |
| | 12 | 100 |
| Decision Tree | 1 | 90 |
| | 2 | 71 |
| | 5 | 60 |
| | 8 | 73 |
The weighting schemes are shown as Gene Expression Domain Weighting/Clinical Covariate Domain Weighting/Indicators of Disease Status Domain Weighting. Methods with at least 65% overall accuracy were evaluated based on the accuracy of the individual asthma groups. Column 2 shows the cluster number from the original output. Missing numbers represent non-asthmatic clusters. Column 3 shows the percentage of asthmatics in each asthma group.
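A per-cluster breakdown of this kind could be produced along the following lines. This is a sketch under assumptions: the function name and the toy data are hypothetical, and only clusters with an asthmatic majority are reported, which mirrors the gaps in the cluster numbering described above.

```python
def asthmatic_percent_by_cluster(cluster_ids, is_asthmatic):
    """Percentage of asthmatics in each asthmatic-majority cluster.

    Clusters where asthmatics are not a strict majority are omitted,
    so missing cluster numbers correspond to non-asthmatic clusters.
    """
    totals, positives = {}, {}
    for cid, status in zip(cluster_ids, is_asthmatic):
        totals[cid] = totals.get(cid, 0) + 1
        positives[cid] = positives.get(cid, 0) + (1 if status else 0)
    return {
        cid: round(100.0 * positives[cid] / totals[cid])
        for cid in sorted(totals)
        if 2 * positives[cid] > totals[cid]
    }


# Made-up data: cluster 1 has 9/10 asthmatics, cluster 2 has 1/4, cluster 5 has 3/5.
clusters = [1] * 10 + [2] * 4 + [5] * 5
asthma = [True] * 9 + [False] + [True] + [False] * 3 + [True] * 3 + [False] * 2
per_cluster = asthmatic_percent_by_cluster(clusters, asthma)
```

Cluster 2 is absent from the result because asthmatics are a minority there.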