Kang K Yan, Hongyu Zhao, Herbert Pang.
Abstract
BACKGROUND: High-throughput sequencing data are widely collected and analyzed in the study of complex diseases, with the aim of improving human health. Well-studied algorithms mostly deal with a single data source and cannot fully exploit the potential of multi-omics data. A holistic understanding of human health and disease requires integrating multiple data sources. Several integration algorithms have been proposed; however, a comprehensive comparison of data integration algorithms for the classification of binary traits is currently lacking.
Keywords: Bayesian network; Classification; Graph-based semi-supervised learning; Multiple data sources; Relevance vector machine; Semi-definite programming (SDP)-support vector machine
Year: 2017 PMID: 29212468 PMCID: PMC6389230 DOI: 10.1186/s12859-017-1982-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Data integration algorithms compared
Data sets used for evaluating the data integration algorithms
| Data Set | Sample Size | Data Source | Platform | Number of Features |
|---|---|---|---|---|
| GAW 19 | 617 | Genotypes | Illumina Infinium BeadChips | 440,762 |
| | | Gene Expression | Illumina Sentrix Human-6 Expression BeadChips | 20,634 |
| | | Clinical Covariates | Clinical Data | 2 |
| Ovarian | 135 | Gene Expression | Agilent G4502A | 17,814 |
| | | miRNA Expression | Agilent Human miRNA 8x15K | 799 |
| | | Protein Expression | Reverse phase protein array | 176 |
| | | Methylation | HumanMethylation 27 | 24,981 |
| Breast | 453 | RNA SeqV2 | Illumina HiSeq | 20,531 |
| | | miRNA Expression | Agilent Human miRNA 8x15K | 1,046 |
| | | Protein Expression | Reverse phase protein array | 166 |
| | | Methylation | HumanMethylation 450 | 396,065 |
Fig. 2 Mean accuracy of seven integration algorithms. BRCA denotes the breast cancer data set, GAW the GAW 19 data set, and Ovarian the ovarian cancer data set. “95% LCL” and “95% UCL” abbreviate the 95% lower and upper confidence limits, respectively
Fig. 3 Mean F1 score of seven integration algorithms. BRCA denotes the breast cancer data set, GAW the GAW 19 data set, and Ovarian the ovarian cancer data set. “95% LCL” and “95% UCL” abbreviate the 95% lower and upper confidence limits, respectively
Fig. 4 Mean AUC score of seven integration algorithms. BRCA denotes the breast cancer data set, GAW the GAW 19 data set, and Ovarian the ovarian cancer data set. “95% LCL” and “95% UCL” abbreviate the 95% lower and upper confidence limits, respectively
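The three metrics summarized in Figs. 2 to 4 can be illustrated with a short sketch. The Python below is not the authors' code. It assumes binary labels coded 0/1, scores from repeated evaluation runs, and a normal-approximation 95% interval; the source does not state how its confidence limits were obtained. The function names `accuracy`, `f1_score`, `auc`, and `mean_with_ci` are hypothetical.

```python
import math
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true binary labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Rank-based AUC: chance that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_with_ci(values, z=1.96):
    """Mean with lower and upper limits under a normal approximation
    (an assumption of this sketch, not a method taken from the source)."""
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width
```

Calling `mean_with_ci` on the per-run accuracies of one algorithm on one data set returns a triple that corresponds to the mean, "95% LCL", and "95% UCL" labels used in the figures.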
Average computation time (in seconds) of different integration algorithms with different training sizes
| Integration Algorithms | Training Size 100 | Training Size 400 |
|---|---|---|
| Graph-based semi-supervised learning | 0.127 | 4.148 |
| Graph sharpening integration | 0.052 | 1.943 |
| Composite association network | 0.007 | 0.052 |
| Bayesian network | 0.002 | 0.004 |
| Semi-definite programming – SVM | 12.553 | 28.186 |
| Relevance vector machine | 10.471 | 368.455 |
| Ada-boost relevance vector machine | 23.190 | 306.172 |
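As a rough reading aid for the timings above, the sketch below computes how much each algorithm slows down when the training size grows from 100 to 400, together with a two-point estimate of a scaling exponent k under the assumption that time grows like n^k. The power-law form is an assumption made for illustration, not a claim from the source, and two measurements cannot confirm it. The helper name `growth` is hypothetical.

```python
import math

# Mean computation times in seconds at training sizes 100 and 400,
# copied from the table above.
times = {
    "Graph-based semi-supervised learning": (0.127, 4.148),
    "Graph sharpening integration": (0.052, 1.943),
    "Composite association network": (0.007, 0.052),
    "Bayesian network": (0.002, 0.004),
    "Semi-definite programming - SVM": (12.553, 28.186),
    "Relevance vector machine": (10.471, 368.455),
    "Ada-boost relevance vector machine": (23.190, 306.172),
}


def growth(t_small, t_large, n_small=100, n_large=400):
    """Return (slowdown ratio, two-point exponent k assuming time ~ n**k)."""
    ratio = t_large / t_small
    exponent = math.log(ratio) / math.log(n_large / n_small)
    return ratio, exponent


for name, (t100, t400) in times.items():
    ratio, k = growth(t100, t400)
    print(f"{name}: x{ratio:.1f} slower, k ~ {k:.2f}")
```

For example, the Bayesian network time doubles (0.002 s to 0.004 s), while the relevance vector machine time grows roughly 35-fold (10.471 s to 368.455 s).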
Comparison of different data integration algorithms
| Integration Algorithms | Computation Time | Stability | Characteristics |
|---|---|---|---|
| Graph-based semi-supervised learning | Low | Medium | Tuning parameter; performance can be poor sometimes |
| Graph sharpening integration | Low | Low | Tuning parameter; average weights frequently occur |
| Composite association network | Low | High | Average weights occur when all weights are negative |
| Bayesian network | Low | Low | Bins selection and training sample size affect performance |
| Semi-definite programming SVM | Medium | Low | Two tuning parameters |
| Relevance vector machine | High | High | Long training time; probabilistic result |
| Ada-boost relevance vector machine | High | Medium | Resampling size and iteration can be hard to determine |
Fig. 5 Data integration algorithms decision tree