Sarmistha Das, Indranil Mukhopadhyay.
Abstract
Multi-omics data integration is widely used to understand the genetic architecture of disease. In multi-omics association analysis, data collected on multiple omics for the same set of individuals are immensely important for biomarker identification. But when the sample size of such data is limited, partially missing individual-level observations pose a major challenge for data integration. Often, genotype data are available for all individuals under study, but gene expression and/or methylation information is missing for different subsets of those individuals. Here, we develop a statistical model, TiMEG, for the identification of disease-associated biomarkers in a case-control paradigm by integrating the above-mentioned data types, especially in the presence of missing omics data. Based on a likelihood approach, TiMEG exploits the inter-relationship among multiple omics data to capture weaker signals that remain unidentified in single-omic analysis or common imputation-based methods. Its application to a real tuberous sclerosis dataset identified functionally relevant genes in the disease pathway.
Year: 2021 PMID: 34911979 PMCID: PMC8674330 DOI: 10.1038/s41598-021-03034-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Structure of data availability for genotype (G), gene expression (E), methylation (M) and phenotype (P). Each letter indicates the presence of the corresponding data.
Missing data schemes.
| Sample size | Phenotype | Covariates | Genotype | Gene expression | Methylation |
|---|---|---|---|---|---|
| | ✓ | ✓ | ✓ | ✓ | ✓ |
| | ✓ | ✓ | ✓ | ✗ | ✓ |
| | ✓ | ✓ | ✓ | ✓ | ✗ |
| | ✓ | ✓ | ✓ | ✗ | ✗ |
Notation: ‘✓’ indicates ‘available’ and ‘✗’ indicates ‘missing’ data. From top to bottom, the rows correspond to individuals with no missing data, individuals missing only gene expression data, individuals missing only methylation data, and individuals missing both gene expression and methylation data.
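The four missing-data groups described above can be illustrated with a minimal sketch. The record layout, the function name `missing_pattern`, and the toy values are hypothetical and are not taken from the TiMEG software; the sketch only shows how individuals might be partitioned by which omics layers are observed, assuming genotype is always available.

```python
from collections import Counter


def missing_pattern(expression, methylation):
    """Label one individual by which omics layers are missing (None = missing)."""
    expression_missing = expression is None
    methylation_missing = methylation is None
    if expression_missing and methylation_missing:
        return "both_missing"
    if expression_missing:
        return "expression_missing"
    if methylation_missing:
        return "methylation_missing"
    return "complete"


# Hypothetical toy records: (genotype, gene expression, methylation).
records = [
    (0, 1.2, 0.40),    # complete
    (1, None, 0.55),   # only gene expression missing
    (2, 0.7, None),    # only methylation missing
    (1, None, None),   # both omics missing
    (0, 2.1, 0.31),    # complete
]

# Count how many individuals fall into each of the four groups.
group_sizes = Counter(missing_pattern(e, m) for _, e, m in records)
print(dict(group_sizes))
```

The resulting counts play the role of the group sample sizes in the table's first column.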
Type I error rate under different combinations of sample sizes and varying percentages of missing methylation and/or gene expression values, based on 10,000 simulations.
| Missing (%) | TiMEG (SS 100) | CC (SS 100) | TiMEG (SS 150) | CC (SS 150) | TiMEG (SS 200) | CC (SS 200) | Missing (%) | TiMEG (SS 100) | CC (SS 100) | TiMEG (SS 150) | CC (SS 150) | TiMEG (SS 200) | CC (SS 200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (0,0,0) | 0.0534 | 0.0534 | 0.0528 | 0.0528 | 0.0533 | 0.0533 | (10,0,10) | 0.0860 | 0.0559 | 0.0847 | 0.0544 | 0.0708 | 0.0529 |
| (0,0,10) | 0.0836 | 0.0527 | 0.0805 | 0.0549 | 0.0757 | 0.0505 | (0,10,10) | 0.0889 | 0.0536 | 0.0519 | 0.0540 | 0.0469 | 0.0519 |
| (0,0,20) | 0.0828 | 0.0590 | 0.0760 | 0.0531 | 0.0758 | 0.0519 | (10,10,0) | 0.0848 | 0.0557 | 0.0795 | 0.0533 | 0.0750 | 0.0520 |
| (0,0,40) | 0.0758 | 0.0586 | 0.0803 | 0.0527 | 0.0776 | 0.0565 | (20,0,20) | 0.0826 | 0.056 | 0.0810 | 0.0585 | 0.0793 | 0.0536 |
| (0,0,60) | 0.0826 | 0.0632 | 0.0789 | 0.0628 | 0.0741 | 0.0551 | (0,20,20) | 0.0798 | 0.0537 | 0.0812 | 0.0514 | 0.0476 | 0.0558 |
| (0,0,80) | 0.0869 | 0.0805 | 0.0848 | 0.0689 | 0.0761 | 0.0625 | (20,20,0) | 0.0871 | 0.0542 | 0.0814 | 0.0552 | 0.0834 | 0.0516 |
| (0,10,0) | 0.0552 | 0.0540 | 0.0508 | 0.0509 | 0.0497 | 0.0530 | (40,0,40) | 0.0884 | 0.0783 | 0.0773 | 0.0652 | 0.0794 | 0.0617 |
| (0,20,0) | 0.0538 | 0.0607 | 0.0497 | 0.0564 | 0.0520 | 0.0519 | (0,40,40) | 0.0546 | 0.0621 | 0.0507 | 0.0594 | 0.0455 | 0.0550 |
| (0,40,0) | 0.0552 | 0.0602 | 0.0533 | 0.0580 | 0.0516 | 0.0492 | (10,10,10) | 0.0551 | 0.0565 | 0.0551 | 0.0549 | 0.0566 | 0.0526 |
| (0,60,0) | 0.0558 | 0.0649 | 0.0537 | 0.0594 | 0.0507 | 0.0578 | (10,20,10) | 0.0792 | 0.0531 | 0.0527 | 0.0569 | 0.0530 | 0.0494 |
| (0,80,0) | 0.0641 | 0.0837 | 0.0570 | 0.0637 | 0.0507 | 0.0620 | (10,10,20) | 0.0556 | 0.0577 | 0.0512 | 0.0547 | 0.0502 | 0.0529 |
| (10,0,0) | 0.0869 | 0.0533 | 0.0755 | 0.0531 | 0.0747 | 0.0495 | (20,10,10) | 0.0536 | 0.0551 | 0.0574 | 0.0554 | 0.0495 | 0.0510 |
| (20,0,0) | 0.0866 | 0.0595 | 0.0796 | 0.0532 | 0.0760 | 0.0552 | (20,10,20) | 0.0556 | 0.0544 | 0.0511 | 0.0590 | 0.0573 | 0.0513 |
| (40,0,0) | 0.0813 | 0.0598 | 0.0783 | 0.0570 | 0.0796 | 0.0556 | (20,20,10) | 0.0523 | 0.0588 | 0.0497 | 0.0548 | 0.0501 | 0.0571 |
| (60,0,0) | 0.0836 | 0.0624 | 0.0871 | 0.0564 | 0.0766 | 0.0593 | (10,20,20) | 0.0523 | 0.0567 | 0.0515 | 0.0556 | 0.0467 | 0.0556 |
| (80,0,0) | 0.0982 | 0.0849 | 0.0854 | 0.0679 | 0.0836 | 0.0624 | (20,20,20) | 0.0538 | 0.0600 | 0.0510 | 0.0581 | 0.0518 | 0.0532 |
Missing (%): (both missing, only methylation missing, only gene expression missing); SS: sample size for cases (or controls).
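When reading empirical type I error rates such as those above, it helps to know how much variation the simulation itself introduces. The sketch below computes the Monte Carlo standard error of an estimated rejection rate. The nominal level of 0.05 is an assumption (the level is not stated in this record), and the function names are illustrative only.

```python
import math


def empirical_rate(p_values, alpha=0.05):
    """Fraction of simulated p-values that fall below the nominal level."""
    return sum(p < alpha for p in p_values) / len(p_values)


def monte_carlo_se(rate, n_sim):
    """Binomial standard error of a rejection rate estimated from n_sim runs."""
    return math.sqrt(rate * (1.0 - rate) / n_sim)


alpha = 0.05      # assumed nominal level, not stated in the source
n_sim = 10_000    # number of simulations reported in the table caption

se = monte_carlo_se(alpha, n_sim)
lower = alpha - 1.96 * se
upper = alpha + 1.96 * se


def within_band(observed_rate):
    """True if the observed rate lies inside the approximate 95% band."""
    return lower <= observed_rate <= upper


print(f"SE = {se:.5f}, band = ({lower:.4f}, {upper:.4f})")
print(within_band(0.0534), within_band(0.0836))
```

Under this assumed level, a tabulated rate can be compared against the band to judge whether its departure from the nominal level exceeds simulation noise.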
Power under different combinations of sample sizes and varying percentages of missing methylation and/or gene expression values, based on 1000 simulations.
| Missing (%) | TiMEG (SS 100) | CC (SS 100) | TiMEG (SS 150) | CC (SS 150) | TiMEG (SS 200) | CC (SS 200) | Missing (%) | TiMEG (SS 100) | CC (SS 100) | TiMEG (SS 150) | CC (SS 150) | TiMEG (SS 200) | CC (SS 200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (0,0,0) | 0.697 | 0.697 | 0.878 | 0.878 | 0.950 | 0.950 | (10,0,10) | 0.564 | 0.564 | 0.748 | 0.745 | 0.907 | 0.881 |
| (0,0,10) | 0.640 | 0.636 | 0.800 | 0.810 | 0.933 | 0.908 | (0,10,10) | 0.638 | 0.631 | 0.831 | 0.815 | 0.923 | 0.919 |
| (0,0,20) | 0.579 | 0.570 | 0.760 | 0.770 | 0.917 | 0.885 | (10,10,0) | 0.616 | 0.613 | 0.857 | 0.827 | 0.931 | 0.909 |
| (0,0,40) | 0.552 | 0.423 | 0.730 | 0.617 | 0.866 | 0.745 | (20,0,20) | 0.483 | 0.435 | 0.701 | 0.637 | 0.834 | 0.767 |
| (0,0,60) | 0.485 | 0.288 | 0.676 | 0.416 | 0.829 | 0.573 | (0,20,20) | 0.611 | 0.590 | 0.794 | 0.789 | 0.901 | 0.879 |
| (0,0,80) | 0.445 | 0.147 | 0.634 | 0.218 | 0.795 | 0.278 | (20,20,0) | 0.601 | 0.563 | 0.796 | 0.736 | 0.897 | 0.868 |
| (0,10,0) | 0.605 | 0.618 | 0.842 | 0.822 | 0.937 | 0.935 | (40,0,40) | 0.327 | 0.135 | 0.488 | 0.214 | 0.642 | 0.290 |
| (0,20,0) | 0.646 | 0.537 | 0.830 | 0.767 | 0.933 | 0.888 | (0,40,40) | 0.514 | 0.422 | 0.745 | 0.613 | 0.879 | 0.757 |
| (0,40,0) | 0.647 | 0.441 | 0.817 | 0.610 | 0.936 | 0.778 | (10,10,10) | 0.567 | 0.567 | 0.762 | 0.752 | 0.889 | 0.877 |
| (0,60,0) | 0.619 | 0.281 | 0.818 | 0.451 | 0.931 | 0.566 | (10,20,10) | 0.598 | 0.609 | 0.753 | 0.739 | 0.907 | 0.888 |
| (0,80,0) | 0.596 | 0.145 | 0.802 | 0.215 | 0.916 | 0.289 | (10,10,20) | 0.543 | 0.482 | 0.724 | 0.697 | 0.867 | 0.817 |
| (10,0,0) | 0.612 | 0.604 | 0.804 | 0.795 | 0.911 | 0.920 | (20,10,10) | 0.524 | 0.482 | 0.755 | 0.690 | 0.858 | 0.843 |
| (20,0,0) | 0.539 | 0.533 | 0.778 | 0.763 | 0.882 | 0.885 | (20,10,20) | 0.484 | 0.441 | 0.710 | 0.607 | 0.797 | 0.780 |
| (40,0,0) | 0.425 | 0.411 | 0.641 | 0.603 | 0.793 | 0.769 | (20,20,10) | 0.518 | 0.494 | 0.740 | 0.687 | 0.871 | 0.821 |
| (60,0,0) | 0.362 | 0.285 | 0.514 | 0.436 | 0.663 | 0.557 | (10,20,20) | 0.528 | 0.513 | 0.740 | 0.690 | 0.886 | 0.838 |
| (80,0,0) | 0.228 | 0.129 | 0.366 | 0.208 | 0.488 | 0.304 | (20,20,20) | 0.497 | 0.426 | 0.695 | 0.590 | 0.831 | 0.762 |
Missing (%): (both missing, only methylation missing, only gene expression missing); SS: sample size for cases (or controls).
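The drop in power for the CC column as missingness grows is easier to interpret alongside the number of individuals that remain fully observed. The sketch below assumes that CC denotes a complete-case analysis and that the three percentages in each scheme refer to disjoint subsets of the SS cases (or controls); both are assumptions, and the function name is hypothetical.

```python
def complete_case_count(sample_size, pct_both, pct_only_meth, pct_only_expr):
    """Number of individuals with no missing omics data under one scheme.

    Assumes the three percentages describe disjoint subsets of the sample.
    """
    total_missing = pct_both + pct_only_meth + pct_only_expr
    if not 0 <= total_missing <= 100:
        raise ValueError("missing percentages must sum to between 0 and 100")
    return round(sample_size * (100 - total_missing) / 100)


# Scheme (0,0,80) with 200 cases: only 20% of cases are fully observed.
print(complete_case_count(200, 0, 0, 80))

# Scheme (20,20,20) with 100 cases: 40% of cases are fully observed.
print(complete_case_count(100, 20, 20, 20))
```

Under these assumptions, a complete-case analysis of scheme (0,0,80) with SS = 200 would use 40 cases and 40 controls, whereas a method that retains partially observed individuals can still draw on all 200.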
Figure 2. QQ-plots for sample size 200 based on simulated data. (A) No missing data; (B) 10% of individuals with both gene expression and methylation missing, 10% with only methylation missing, and 20% with only gene expression missing.
Expected average computation time (in seconds) per run based on 100 simulations.
| Missing (%) | TiMEG (SS 100) | CC (SS 100) | KNN (SS 100) | MI (SS 100) | TiMEG (SS 150) | CC (SS 150) | KNN (SS 150) | MI (SS 150) | TiMEG (SS 200) | CC (SS 200) | KNN (SS 200) | MI (SS 200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (0,0,10) | 2.348 | 1.510 | 1.643 | 1.540 | 2.572 | 2.221 | 2.437 | 2.378 | 2.625 | 3.053 | 3.312 | 3.176 |
| (0,0,80) | 2.177 | 0.355 | 1.546 | 1.864 | 2.195 | 1.270 | 2.384 | 2.589 | 2.495 | 1.946 | 3.103 | 3.253 |
| (0,10,0) | 0.185 | 1.437 | 1.668 | 1.659 | 0.222 | 2.300 | 2.580 | 2.448 | 0.260 | 3.063 | 3.334 | 3.132 |
| (0,80,0) | 0.189 | 0.370 | 1.674 | 1.833 | 0.226 | 1.155 | 2.363 | 2.494 | 0.257 | 1.946 | 3.183 | 3.260 |
| (10,0,0) | 2.496 | 1.506 | 1.737 | 1.683 | 2.581 | 2.235 | 2.613 | 2.455 | 2.762 | 3.100 | 3.433 | 3.295 |
| (80,0,0) | 2.430 | 0.388 | 3.237 | 1.909 | 2.451 | 1.237 | 2.433 | 2.681 | 2.645 | 1.911 | 3.223 | 3.298 |
Missing (%): (both missing, only methylation missing, only gene expression missing); SS: sample size for cases (or controls).
Figure 3. ROC curves for situations with (A) only gene expression missing for some individuals, (B) only methylation missing for some individuals, (C) both omics missing for some individuals, and (D) a mixture in which some individuals are missing both omics, some only gene expression, and some only methylation.
Figure 4. Boxplots of prediction accuracy from tenfold CV based on 100 datasets, each with 200 cases and 200 controls, under different missing omics data structures. The black horizontal line indicates the median prediction accuracy for the datasets with no missing information. From left to right, the boxplots correspond to: no missing information; five percentages of only gene expression missing; five percentages of only methylation missing; five percentages of both gene expression and methylation missing; both missing along with only gene expression missing; only gene expression missing along with only methylation missing; both missing along with only methylation missing; both missing along with only gene expression missing; both missing along with only methylation missing; and both missing along with only gene expression missing and only methylation missing.
Figure 5. Misclassification rate versus false positive rate (1 - specificity) when only gene expression is missing. (A) depicts the scenario with no missing data, while (B–F) depict five scenarios with different percentages of only gene expression data missing.
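The quantities plotted in the ROC and misclassification figures can be computed from predicted scores and true case-control labels. The following is a minimal sketch with hypothetical toy scores and thresholds; it is not the evaluation code used in the study and only shows how the false positive rate, true positive rate, and misclassification rate relate to one another at a given threshold.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false positive rate, true positive rate, misclassification rate).

    An individual is predicted to be a case (1) when score >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    misclassification = (fp + fn) / len(labels)
    return fpr, tpr, misclassification


# Hypothetical predicted scores and true labels (1 = case, 0 = control).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

# Sweep a few thresholds to trace out points on both kinds of curve.
curve = [rates_at_threshold(scores, labels, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
for fpr, tpr, mis in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}  misclassification={mis:.2f}")
```

Plotting TPR against FPR gives an ROC curve, while plotting the misclassification rate against FPR gives the kind of curve shown in Figure 5.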