Catia M Machado, Ana T Freitas, Francisco M Couto.
Abstract
Enrichment analysis is well established in the field of transcriptomics, where it is used to identify relevant biological features that characterize a set of genes obtained in an experiment.
This article proposes the application of enrichment analysis as a first step in a disease prognosis methodology, in particular for diseases with a strong genetic component. The objective of this analysis is to identify clinical and biological features that characterize groups of patients with a common disease, and that can be used to distinguish between groups of patients associated with disease-related events. Data mining methodologies can then be used to exploit those features and assist medical doctors in evaluating patients with respect to their predisposition for a specific event.
In this work, hypertrophic cardiomyopathy (HCM) is used as a case study, as a first test of the feasibility of applying enrichment analysis to disease prognosis. To perform this assessment, two groups of patients were considered: patients who have suffered a sudden cardiac death (SCD) episode and patients who have not.
The results presented were obtained with genetic data and the Gene Ontology (GO), in two enrichment analyses: an enrichment profiling, which aims to characterize a group of patients (e.g. those who suffered a disease-related event) based on their mutations; and a differential enrichment, which aims to identify features that differentiate a sub-group of patients from all the patients with the disease. These analyses are an adaptation of the standard enrichment analysis, since multiple sets of genes are considered, one for each patient.
The preliminary results are promising, as the sets of terms obtained reflect the current knowledge about the gene functions commonly altered in HCM patients, thus allowing their characterization. Nevertheless, some factors need to be taken into consideration before the full potential of enrichment analysis in the prognosis methodology can be evaluated. One such factor is the need to test the enrichment analysis with clinical data, in addition to genetic data, since both types of data are expected to be necessary for prognosis purposes.
Year: 2013 PMID: 24103636 PMCID: PMC4126066 DOI: 10.1186/2041-1480-4-21
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Schematic representation of the prognosis methodology. The methodology is composed of two units. The first (left side) receives as input patient data mapped to biomedical ontologies or controlled vocabularies, and performs an enrichment analysis to identify a list of enriched ontology terms, which are then used to create profiles for individual patients. These profiles are passed to an evaluation step (the second unit, on the right side), which produces the prognosis evaluation for the patients. For the implementation of the second unit, both a classification and a similarity approach will be explored.
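The profile-and-compare idea in the caption can be illustrated with a small sketch. The article does not specify how profiles are encoded or which similarity measure is used, so everything here is an assumption: profiles are represented as sets of enriched GO terms, Jaccard similarity stands in for the similarity approach, and the patient annotations are hypothetical.

```python
def build_profile(patient_terms, enriched_terms):
    """Keep only the patient's terms that the enrichment analysis flagged.

    Assumption: a profile is simply the set of enriched terms a patient has.
    """
    return frozenset(patient_terms) & frozenset(enriched_terms)


def jaccard(profile_a, profile_b):
    """Jaccard similarity between two profiles (0.0 when both are empty)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


# Enriched terms taken from the SCD profiling tables; patients are hypothetical.
enriched = {"GO:0030049", "GO:0002027", "GO:0007512"}
patient_1 = build_profile({"GO:0030049", "GO:0002027", "GO:0000001"}, enriched)
patient_2 = build_profile({"GO:0002027", "GO:0007512"}, enriched)

similarity = jaccard(patient_1, patient_2)
```

A classification approach would instead feed the same profiles, encoded as binary vectors, to a supervised learner; that variant is not shown here.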
Figure 2. Representation of the population and study sets in the enrichment profiling analysis. The two sets of dots represent the genomes of two patients from the same group (e.g. with SCD). The smaller, yellow set of dots corresponds to the genes mutated in the patient; the larger, white set of dots corresponds to the entire genome of the patient, that is, both the genes not mutated (outside the yellow set) and the genes mutated. Within these sets, blue dots represent genes annotated with a term of interest (t), and gray dots represent genes not annotated with t. In the profiling analysis, the study set is the union of the genes mutated in all the patients, and the population set is the union of the genomes of all the patients. The annotation frequency is then calculated by counting the total number of genes annotated with the term in the study set (study frequency) and in the population set (population frequency).
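The counting described in this caption can be sketched as follows. The annotation frequency is taken directly from the caption; the significance test is an assumption, since this record does not name the statistic used. A one-sided hypergeometric test, a common choice in enrichment analysis, is used for illustration, and the function and parameter names are hypothetical.

```python
from math import comb


def annotation_frequency(annotated, total):
    """Fraction of genes in a set that are annotated with term t."""
    return annotated / total


def enrichment_p_value(pop_size, pop_annotated, study_size, study_annotated):
    """One-sided hypergeometric p-value (assumed test, not from the article).

    Probability of drawing at least `study_annotated` genes annotated with t
    when `study_size` genes are sampled without replacement from a population
    of `pop_size` genes, `pop_annotated` of which carry the annotation.
    """
    upper = min(study_size, pop_annotated)
    favourable = sum(
        comb(pop_annotated, k) * comb(pop_size - pop_annotated, study_size - k)
        for k in range(study_annotated, upper + 1)
    )
    return favourable / comb(pop_size, study_size)


# Toy numbers: 20 genes in the population, 5 annotated with t;
# 4 genes in the study set, 2 of them annotated with t.
study_freq = annotation_frequency(2, 4)
pop_freq = annotation_frequency(5, 20)
p_value = enrichment_p_value(20, 5, 4, 2)
```

Exact integer arithmetic is used for the binomial coefficients, so the function also works for population sizes in the hundreds of thousands, at the cost of some speed.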
Number of genes considered in the profiling and the differential enrichment analyses
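| Analysis | Patient group | Study set | Population set |
|---|---|---|---|
| Profiling | SCD | 16 | 18,759 × 14 |
| Profiling | no-SCD | 100 | 18,759 × 69 |
| Differential | SCD | 16 | 116 |
| Differential | no-SCD | 100 | 116 |
For each enrichment test performed, the number of genes in the study set and in the population set is indicated.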
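The set sizes above can be cross-checked against the per-gene patient counts listed in the table of genes used for the genetic characterization. The sketch below assumes, as an inference from the numbers rather than a statement from the article, that the study set counts one gene per patient-gene mutation pair and that the profiling population is the 18,759-gene genome repeated once per patient. Variable names are illustrative.

```python
GENOME_SIZE = 18_759          # genes per patient genome, from the table
N_SCD, N_NO_SCD = 14, 69      # patients per group, from the table captions

# Patients with at least one mutation in each of the seven genes
# (columns of the genetic characterization table).
scd_per_gene = [4, 9, 1, 2, 0, 0, 0]
no_scd_per_gene = [25, 36, 4, 20, 13, 1, 1]

# Study sets: mutated genes summed over the patients of each group.
study_scd = sum(scd_per_gene)
study_no_scd = sum(no_scd_per_gene)

# Differential analysis population: mutated genes over all patients.
population_differential = study_scd + study_no_scd

# Profiling analysis population: one full genome per patient in the group.
population_profiling_scd = GENOME_SIZE * N_SCD
population_profiling_no_scd = GENOME_SIZE * N_NO_SCD
```

Under these assumptions the sums reproduce the study set sizes of 16 and 100 and the differential population of 116 reported in the table.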
Number of enriched terms in each of the analyses performed
| Analysis | Patient group | BP noCorr | BP Bonf | MF noCorr | MF Bonf | CC noCorr | CC Bonf | Total noCorr | Total Bonf |
|---|---|---|---|---|---|---|---|---|---|
| Profiling | SCD | 30 | 19 | 13 | 11 | 10 | 10 | 53 | 40 |
| Profiling | no-SCD | 39 | 33 | 21 | 19 | 10 | 10 | 70 | 62 |
| Differential | SCD | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| Differential | no-SCD | 2 | 0 | 1 | 0 | 2 | 0 | 5 | 0 |
For each enrichment analysis, the number of terms of each GO type (biological process, BP; molecular function, MF; cellular component, CC) with a p-value below 0.1 is indicated, both without multiple-testing correction (noCorr) and with Bonferroni correction (Bonf).
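The relationship between uncorrected and Bonferroni-corrected p-values, and the counting of terms below the 0.1 threshold, can be sketched as follows. The sketch assumes the standard Bonferroni adjustment, in which each p-value is multiplied by the number of tests. The function names, the `n_tests` parameter and the optional cap at 1 are illustrative choices, not details taken from the article; the tables in this record show both capped (1) and uncapped (e.g. 1.9E+00) corrected values, hence the `cap` switch.

```python
def bonferroni(p_values, n_tests=None, cap=True):
    """Multiply each p-value by the number of tests (standard Bonferroni).

    n_tests defaults to len(p_values); cap=True limits results to 1.0.
    """
    m = len(p_values) if n_tests is None else n_tests
    adjusted = [p * m for p in p_values]
    return [min(a, 1.0) for a in adjusted] if cap else adjusted


def count_enriched(p_values, threshold=0.1):
    """Number of terms whose p-value falls below the threshold."""
    return sum(p < threshold for p in p_values)


# Hypothetical uncorrected p-values for four terms.
raw = [0.001, 0.02, 0.04, 0.5]
corrected = bonferroni(raw)

n_no_corr = count_enriched(raw)        # terms below 0.1 without correction
n_bonf = count_enriched(corrected)     # terms below 0.1 after correction
```

With `n_tests=53`, an assumed value, the uncorrected 7.7E-40 maps to about 4.1E-38, in line with the first row of the SCD biological process table.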
Top 10 enriched biological process terms in the profiling analysis of SCD patients
| Acc | Term name | p-value | p-Bonf | SFreq | PFreq |
|---|---|---|---|---|---|
| GO:0030049 | Muscle filament sliding | 7.7E-40 | 4.1E-38 | 94% | 0.21% |
| GO:0033275 | Actin-myosin filament sliding | 7.7E-40 | 4.1E-38 | 94% | 0.21% |
| GO:0055010 | Ventricular cardiac muscle tissue morphogenesis | 7.7E-40 | 4.1E-38 | 94% | 0.21% |
| GO:0003229 | Ventricular cardiac muscle tissue development | 2.4E-39 | 1.3E-37 | 94% | 0.22% |
| GO:0070252 | Actin-mediated cell contraction | 6.8E-39 | 3.6E-37 | 94% | 0.24% |
| GO:0002027 | Regulation of heart rate | 1.3E-31 | 6.9E-30 | 81% | 0.26% |
| GO:0007512 | Adult heart development | 6.8E-25 | 3.6E-23 | 56% | 0.07% |
| GO:0032781 | Positive regulation of ATPase activity | 2.9E-15 | 1.5E-13 | 38% | 0.09% |
| GO:0043462 | Regulation of ATPase activity | 2.6E-14 | 1.4E-12 | 38% | 0.12% |
| GO:0032971 | Regulation of muscle filament sliding | 1.0E-12 | 5.4E-11 | 25% | 0.02% |
The terms shown are the 10 biological process terms with the lowest p-values obtained in the profiling of SCD patients. For each term, the following are indicated: GO accession number (Acc), term name, p-value without multiple-testing correction (p-value), p-value with Bonferroni correction (p-Bonf), annotation frequency in the study set (SFreq) and annotation frequency in the population set (PFreq).
Top 10 enriched molecular function terms in the profiling analysis of SCD patients
| Acc | Term name | p-value | p-Bonf | SFreq | PFreq |
|---|---|---|---|---|---|
| GO:0008307 | Structural constituent of muscle | 2.9E-35 | 1.6E-33 | 88% | 0.25% |
| GO:0030898 | Actin-dependent ATPase activity | 1.1E-26 | 6.1E-25 | 56% | 0.05% |
| GO:0000146 | Microfilament motor activity | 1.8E-23 | 9.4E-22 | 56% | 0.11% |
| GO:0032036 | Myosin heavy chain binding | 3.3E-11 | 1.8E-09 | 25% | 0.04% |
| GO:0001671 | ATPase activator activity | 9.1E-11 | 4.8E-09 | 25% | 0.05% |
| GO:0031432 | Titin binding | 2.1E-10 | 1.1E-08 | 25% | 0.06% |
| GO:0060590 | ATPase regulator activity | 4.0E-10 | 2.1E-08 | 25% | 0.07% |
| GO:0017022 | Myosin binding | 7.6E-09 | 4.0E-07 | 25% | 0.14% |
| GO:0030172 | Troponin C binding | 5.3E-06 | 2.8E-04 | 13% | 0.02% |
| GO:0031013 | Troponin I binding | 8.4E-06 | 4.4E-04 | 13% | 0.03% |
The terms shown are the 10 molecular function terms with the lowest p-values obtained in the profiling of SCD patients. For each term, the following are indicated: GO accession number (Acc), term name, p-value without multiple-testing correction (p-value), p-value with Bonferroni correction (p-Bonf), annotation frequency in the study set (SFreq) and annotation frequency in the population set (PFreq).
Top 10 enriched cellular component terms in the profiling analysis of SCD patients
| Acc | Term name | p-value | p-Bonf | SFreq | PFreq |
|---|---|---|---|---|---|
| GO:0005859 | Muscle myosin complex | 2.4E-37 | 1.3E-35 | 81% | 0.10% |
| GO:0032982 | Myosin filament | 2.4E-37 | 1.3E-35 | 81% | 0.10% |
| GO:0016460 | Myosin II complex | 1.1E-35 | 5.8E-34 | 81% | 0.13% |
| GO:0001725 | Stress fiber | 4.1E-20 | 2.2E-18 | 56% | 0.25% |
| GO:0032432 | Actin filament bundle | 7.3E-20 | 3.8E-18 | 56% | 0.27% |
| GO:0014705 | C zone | 9.2E-15 | 4.9E-13 | 25% | 0.01% |
| GO:0005863 | Striated muscle myosin thick filament | 1.0E-12 | 5.4E-11 | 25% | 0.02% |
| GO:0031672 | A band | 4.7E-09 | 2.5E-07 | 25% | 0.13% |
| GO:0005861 | Troponin complex | 2.2E-05 | 1.1E-03 | 13% | 0.04% |
| GO:0005865 | Striated muscle thin filament | 6.6E-05 | 3.5E-03 | 13% | 0.07% |
The terms shown are the 10 cellular component terms with the lowest p-values obtained in the profiling of SCD patients. For each term, the following are indicated: GO accession number (Acc), term name, p-value without multiple-testing correction (p-value), p-value with Bonferroni correction (p-Bonf), annotation frequency in the study set (SFreq) and annotation frequency in the population set (PFreq).
Enriched terms in the profiling analysis of no-SCD patients, not identified in the SCD patients
| Acc | Term name | p-value | p-Bonf | SFreq | PFreq |
|---|---|---|---|---|---|
| Biological process | | | | | |
| GO:0001980 | Regulation of systemic arterial blood pressure by ischemic conditions | 6.0E-41 | 4.4E-39 | 13% | 0.00 |
| GO:0001976 | Neurological system process involved in regulation of systemic arterial blood pressure | 3.4E-25 | 2.5E-23 | 13% | 0.00 |
| GO:0006940 | Regulation of smooth muscle contraction | 6.6E-19 | 4.9E-17 | 13% | 0.00 |
| GO:0007522 | Visceral muscle development | 5.3E-03 | 3.9E-01 | 1% | 0.00 |
| GO:0042694 | Muscle cell fate specification | 5.3E-03 | 3.9E-01 | 1% | 0.00 |
| GO:0055009 | Atrial cardiac muscle tissue morphogenesis | 2.6E-02 | 1.9E+00 | 1% | 0.00 |
| GO:0003228 | Atrial cardiac muscle tissue development | 2.6E-02 | 1.9E+00 | 1% | 0.00 |
| GO:0048739 | Cardiac muscle fiber development | 4.2E-02 | 3.1E+00 | 1% | 0.00 |
| GO:0042693 | Muscle cell fate commitment | 5.7E-02 | 4.2E+00 | 1% | 0.00 |
| Molecular function | | | | | |
| GO:0031014 | Troponin T binding | 9.9E-33 | 7.3E-31 | 13% | 0.00 |
| GO:0019855 | Calcium channel inhibitor activity | 1.9E-31 | 1.4E-29 | 13% | 0.00 |
| GO:0008200 | Ion channel inhibitor activity | 1.4E-23 | 1.0E-21 | 13% | 0.00 |
| GO:0016248 | Channel inhibitor activity | 1.4E-23 | 1.0E-21 | 13% | 0.00 |
| GO:0005246 | Calcium channel regulator activity | 4.9E-23 | 3.6E-21 | 13% | 0.00 |
| GO:0048306 | Calcium-dependent protein binding | 1.1E-19 | 8.1E-18 | 13% | 0.00 |
| GO:0042805 | Actinin binding | 5.6E-06 | 4.2E-04 | 4% | 0.00 |
| GO:0030899 | Calcium-dependent ATPase activity | 1.6E-02 | 1 | 1% | 0.00 |
| GO:0003785 | Actin monomer binding | 7.2E-02 | 1 | 1% | 0.00 |
The terms shown are the biological process and molecular function terms identified as enriched in the profiling analysis of no-SCD patients that were not identified in the profiling of SCD patients. For each term, the following are indicated: GO accession number (Acc), term name, p-value without multiple-testing correction (p-value), p-value with Bonferroni correction (p-Bonf), annotation frequency in the study set (SFreq) and annotation frequency in the population set (PFreq).
Features used for the clinical characterization of the HCM patients
| Value | SCD, n (%) | no-SCD, n (%) |
|---|---|---|
| True | 5 (36) | 0 |
| False | 9 (64) | 69 (100) |
| True | 3 (21) | 0 |
| False | 8 (57) | 69 (100) |
| True | 9 (64) | 0 |
| False | 2 (14) | 69 (100) |
| True | 0 | 0 |
| False | 14 (100) | 69 (100) |
| True | 4 (29) | 8 (12) |
| False | 1 (7) | 17 (25) |
| True | 1 (7) | 17 (25) |
| False | 4 (29) | 8 (12) |
| True | 3 (21) | 1 (1) |
| False | 2 (14) | 25 (36) |
| Familial | 9 (64) | 32 (46) |
| Sporadic | 2 (14) | 37 (54) |
| Normal | 4 (29) | 22 (32) |
| Hypotension | 0 | 1 (1) |
| Hypertension | 0 | 5 (7) |
| Male | 6 (43) | 41 (59) |
| Female | 5 (36) | 25 (36) |
| [0,20] | 0 | 5 (7) |
| ]20,40] | 2 (14) | 11 (16) |
| ]40,60] | 3 (21) | 15 (22) |
| > 60 | 3 (21) | 10 (14) |
For each feature, its possible values are indicated, together with the number of SCD and no-SCD patients that have each value and the respective percentages. The total number of SCD patients is 14, and the total number of no-SCD patients is 69. (See Table 8 for the percentage of patients with known values for each of the features.)
Percentage of SCD and no-SCD patients that have a known value for each clinical feature
| SCD (%) | no-SCD (%) |
|---|---|
| 100 | 100 |
| 79 | 100 |
| 79 | 100 |
| 100 | 100 |
| 36 | 36 |
| 36 | 36 |
| 36 | 38 |
| 79 | 100 |
| 29 | 41 |
| 79 | 96 |
| 57 | 61 |
Genes used for the genetic characterization of the HCM patients
| SCD patients | no-SCD patients | GO annotations |
|---|---|---|
| 4 | 25 | 202 |
| 9 | 36 | 192 |
| 1 | 4 | 138 |
| 2 | 20 | 178 |
| 0 | 13 | 173 |
| 0 | 1 | 133 |
| 0 | 1 | 251 |
For each gene, the number of SCD and no-SCD patients with at least one mutation in it is indicated, as well as the number of Gene Ontology (GO) annotations. The total number of SCD patients is 14, and the total number of no-SCD patients is 69.