Arnav Kapur, Kshitij Marwah, Gil Alterovitz.
Abstract
BACKGROUND: High-throughput biological data have grown exponentially in the past decade, supported by technologies such as microarrays and RNA-Seq. Most data generated using such methods are used to encode large amounts of rich information and to determine diagnostic and prognostic biomarkers. Although data storage costs have fallen, the process of capturing data using these technologies is still expensive. Moreover, the time required for the assay, from sample preparation to raw value measurement, is excessive (on the order of days). There is an opportunity to reduce both the cost and time for generating such expression datasets.
Keywords: Gene expression; Machine learning; Prediction
Year: 2016 PMID: 27317252 PMCID: PMC4912738 DOI: 10.1186/s12859-016-1106-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Variation of performance with τ and δ. This example shows the variation in the relative error in predicting two synthetic datasets of dimensions 150 × 150 and 20000 × 150. For each dataset, 50 % of the values were known prior to prediction, and the prediction was run for 100 iterations
Prediction results with additive noise
| Ratio | Observability (%) | Relative error |
|---|---|---|
| 0.003 | 50 | 4.22 × 10⁻⁴ |
| 0.03 | 50 | 4.21 × 10⁻³ |
| 0.3 | 50 | 1.78 × 10⁻² |
| 0.003 | 10 | 1.21 × 10⁻² |
| 0.03 | 10 | 1.57 × 10⁻² |
| 0.3 | 10 | 1.91 × 10⁻¹ |
Analysis of the effect of additive noise on low-rank prediction of a synthetic 2000 × 2000 data matrix of rank 10 after 100 iterations
Abbreviations: Ratio, noise deviation ratio
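The additive-noise experiment summarized in the table above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the completion step below is a simple iterative rank-truncated SVD ("hard impute") swapped in for illustration, the matrix is 200 × 200 rather than 2000 × 2000 to keep the run short, and the noise deviation ratio is assumed to mean the noise standard deviation divided by the standard deviation of the data. The names `hard_impute` and `relative_error` are hypothetical.

```python
import numpy as np

def hard_impute(observed, mask, rank, n_iter=100):
    """Fill unobserved entries by repeated rank-`rank` SVD projection.

    Illustrative stand-in for a low-rank prediction step; not the paper's algorithm.
    """
    x = np.where(mask, observed, 0.0)  # start with zeros in the unknown positions
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # keep the measured values, use the low-rank estimate everywhere else
        x = np.where(mask, observed, low_rank)
    return low_rank

def relative_error(estimate, truth):
    """Frobenius-norm relative error."""
    return np.linalg.norm(estimate - truth) / np.linalg.norm(truth)

rng = np.random.default_rng(0)
n, rank = 200, 10                      # smaller than the 2000 x 2000 case, for speed
truth = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
mask = rng.random((n, n)) < 0.5        # about 50 % observability

errors = {}
for ratio in (0.003, 0.03, 0.3):
    # assumed definition: noise std = ratio * std of the clean data
    noise = ratio * truth.std() * rng.standard_normal((n, n))
    estimate = hard_impute(truth + noise, mask, rank)
    errors[ratio] = relative_error(estimate, truth)
    print(f"ratio={ratio}: relative error={errors[ratio]:.2e}")
```

The error is measured against the noise-free matrix, so it reflects both how well the unknown entries are filled in and how much of the noise the rank constraint removes.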
Fig. 2 The results of low-rank prediction in 119 datasets containing a combined total of 10,024 microarray slides at 750 iterations. Boxplots show the Frobenius relative error (top left) and spectral relative error (bottom left) in the prediction of converged datasets as the fraction of values known prior to prediction was varied. Box edges represent 25 % and 75 % coverage, and the whiskers extend to 99.73 % coverage; outliers represent matrices generated using 10 datasets. Variation of omega relative error with the observability of three example datasets with a low Frobenius error (top right) and a high Frobenius error (bottom right). Datasets with a high relative error in prediction (bottom right) have a correspondingly high omega relative error
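The three error measures named in the caption can be written down compactly. The definitions below are assumptions based on common usage in the matrix-completion literature, since the caption does not state them: the Frobenius and spectral errors compare the full predicted matrix with the original, and the omega error restricts the comparison to the entries known before prediction. All function names are hypothetical.

```python
import numpy as np

def frobenius_relative_error(estimate, truth):
    # ||X - M||_F / ||M||_F over all entries
    return np.linalg.norm(estimate - truth, "fro") / np.linalg.norm(truth, "fro")

def spectral_relative_error(estimate, truth):
    # ratio of largest singular values: ||X - M||_2 / ||M||_2
    return np.linalg.norm(estimate - truth, 2) / np.linalg.norm(truth, 2)

def omega_relative_error(estimate, truth, mask):
    # assumed definition: Frobenius error restricted to the known entries (mask == True)
    diff = np.where(mask, estimate - truth, 0.0)
    known = np.where(mask, truth, 0.0)
    return np.linalg.norm(diff, "fro") / np.linalg.norm(known, "fro")

# Tiny worked example
truth = np.array([[3.0, 0.0], [0.0, 4.0]])
estimate = np.array([[3.0, 0.0], [0.0, 2.0]])
mask = np.array([[True, False], [False, True]])

print(frobenius_relative_error(estimate, truth))    # 2 / 5
print(spectral_relative_error(estimate, truth))     # 2 / 4
print(omega_relative_error(estimate, truth, mask))  # 2 / 5
```

Under these definitions the omega error can be computed without access to the unknown values, which is why it is useful as an indicator of prediction quality.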
Differential analysis on predicted expression datasets. Top unique differentially expressed genes upregulated in lesional skin compared with those in non-lesional skin when ranked according to log2-fold-change in (a) original dataset, (b) predicted dataset with 60 % observability and (c) sparse known-value (checkpoint) dataset without prediction at 60 % observability
| | Original dataset | | | | Recovered dataset (60 %) | | | | Checkpoint dataset (60 %) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene ranking | Probe ID | Symbol | log FC | Adj. P-Value | Probe ID | Symbol | log FC | Adj. P-Value | Probe ID | Symbol | log FC | Adj. P-Value |
| 1 | 205863_at | S100A12 | 9.79929 | < 1 | 205863_at | S100A12 | 8.99648 | < 1 | 211906_s_at | SERPINB4 | 6.21118 | 3.3 × 10⁻¹⁰ |
| 2 | 211906_s_at | SERPINB4 | 9.60376 | < 1 | 211906_s_at | SERPINB4 | 8.67119 | < 1 | 205863_at | S100A12 | 5.48282 | 3.3 × 10⁻⁹ |
| 3 | 205513_at | TCN1 | 8.65788 | < 1 | 205513_at | TCN1 | 8.12271 | < 1 | 205513_at | TCN1 | 5.07988 | 4.8 × 10⁻⁹ |
| 4 | 232220_at | S100A7A | 8.21988 | < 1 | 232220_at | S100A7A | 7.92112 | < 1 | 204385_at | KYNU | 5.06729 | 3.3 × 10⁻¹⁰ |
| 5 | 205660_at | OASL | 7.94647 | < 1 | 205660_at | OASL | 7.4045 | < 1 | 1569555_at | GDA | 4.75835 | 4.8 × 10⁻⁹ |
| 6 | 220664_at | SPRR2C | 7.87929 | < 1 | 220664_at | SPRR2C | 7.3366 | < 1 | 205844_at | VNN1 | 4.70129 | 3.3 × 10⁻¹⁰ |
| 7 | 207602_at | TMPRSS11D | 7.64471 | < 1 | 1569555_at | GDA | 7.11896 | < 1 | 209719_at | SERPINB3 | 4.67529 | 1.6 × 10⁻⁴ |
| 8 | 1569555_at | GDA | 7.39506 | < 1 | 207602_at | TMPRSS11D | 7.10503 | < 1 | 234699_at | RNASE7 | 4.57012 | 2.9 × 10⁻⁷ |
Significance is demonstrated by adjusted P-values for fold change in every gene by using eBayes with Benjamini–Hochberg correction
Abbreviations: logFC, log2-fold-change; Ave Expr, average log2-expression of the probe over all arrays; Adj. P-Value, P-value adjusted from the raw P-value
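The two quantities reported in the table, log2-fold-change and the Benjamini–Hochberg adjusted P-value, can be illustrated with a short sketch. It assumes expression values are already on a log2 scale with genes in rows and samples in columns, and it does not reproduce the moderated statistics of eBayes; only the fold-change arithmetic and the multiple-testing correction are shown. The function names are hypothetical.

```python
import numpy as np

def log2_fold_change(log2_expr, is_case):
    """Mean log2 expression in case samples minus mean in control samples, per gene."""
    is_case = np.asarray(is_case, dtype=bool)
    return log2_expr[:, is_case].mean(axis=1) - log2_expr[:, ~is_case].mean(axis=1)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # enforce monotonicity, working down from the largest P-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# One gene, two lesional (case) and two non-lesional (control) samples
expr = np.array([[8.0, 9.0, 1.0, 2.0]])
print(log2_fold_change(expr, [True, True, False, False]))   # 8.5 - 1.5
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Ranking genes by the first quantity and filtering on the second corresponds to the ordering used in the table, although the raw P-values in the source come from a different test statistic.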
Fig. 3 Comparison of differential analysis on original and predicted datasets. Volcano plots represent differentially expressed genes at logFC > 2 and FDR P < 0.05 in the original psoriasis vulgaris dataset (leftmost) and in predicted datasets with 10 %, 40 %, and 70 % (rightmost) of values unknown
Comparison of the results of classification obtained using Bayesian networks learnt on low observability predicted datasets with those in which networks were learnt on original datasets
| Study | Dataset | True positive rate | False positive rate | Precision | Recall | F-measure | AUROC |
|---|---|---|---|---|---|---|---|
| Lung adenocarcinoma | Original | 0.944 | 0.057 | 0.944 | 0.944 | 0.944 | 0.988 |
| | Low-rank prediction | 0.944 | 0.057 | 0.944 | 0.944 | 0.944 | 0.996 |
| | Sampled uniform distribution | 0.757 | 0.256 | 0.758 | 0.757 | 0.755 | 0.777 |
| Myelodysplastic syndrome | Original | 0.865 | 0.866 | 0.844 | 0.865 | 0.854 | 0.673 |
| | Low-rank prediction | 0.865 | 0.92 | 0.833 | 0.865 | 0.849 | 0.675 |
| | Sampled uniform distribution | 0.85 | 0.868 | 0.842 | 0.85 | 0.846 | 0.425 |
| Pulmonary hypertension | Original | 0.638 | 0.121 | 0.633 | 0.638 | 0.635 | 0.854 |
| | Low-rank prediction | 0.681 | 0.118 | 0.645 | 0.681 | 0.659 | 0.897 |
| | Sampled uniform distribution | 0.267 | 0.372 | 0.213 | 0.267 | 0.218 | 0.424 |
| Pancreatic ductal adenocarcinoma | Original | 0.782 | 0.218 | 0.784 | 0.782 | 0.782 | 0.886 |
| | Low-rank prediction | 0.821 | 0.179 | 0.821 | 0.821 | 0.82 | 0.905 |
| | Sampled uniform distribution | 0.397 | 0.603 | 0.389 | 0.397 | 0.385 | 0.417 |
| Psoriasis | Original | 0.912 | 0.088 | 0.913 | 0.912 | 0.912 | 0.96 |
| | Low-rank prediction | 0.912 | 0.088 | 0.912 | 0.912 | 0.912 | 0.956 |
| | Sampled uniform distribution | 0.641 | 0.359 | 0.641 | 0.641 | 0.641 | 0.648 |
Datasets were condensed to 100 randomly selected gene attributes. Bayesian networks were learned using a bottom-up search method known as the K2 algorithm and evaluated by 10-fold cross-validation. The predicted datasets were evaluated by comparing their classification results with those obtained using datasets in which the unknown values were sampled from a set uniform distribution instead of being filled by low-rank recovery; the fraction of known values was the same in both cases. Notably, the performance of the low-rank recovered datasets closely matched that of the original datasets
Abbreviations: O, observability; AUROC, area under the receiver operating characteristic curve
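The uniform-sampling baseline described in the table note can be sketched as follows. The source does not state the bounds of the "set uniform distribution", so this sketch assumes they are the minimum and maximum of the known values; the function name `uniform_baseline` is hypothetical.

```python
import numpy as np

def uniform_baseline(observed, mask, rng):
    """Keep known entries and fill unknown ones with uniform random draws.

    Assumption: the uniform range spans the min and max of the known values.
    """
    low, high = observed[mask].min(), observed[mask].max()
    filler = rng.uniform(low, high, size=observed.shape)
    return np.where(mask, observed, filler)

rng = np.random.default_rng(1)
data = rng.normal(loc=7.0, scale=2.0, size=(100, 20))   # toy expression matrix
mask = rng.random(data.shape) < 0.3                      # 30 % of values known
baseline = uniform_baseline(data, mask, rng)

# Known entries are untouched; the rest carry no information about the data
print(np.array_equal(baseline[mask], data[mask]))
```

Because the baseline has the same fraction of known values as the low-rank prediction, any gap in classification performance between the two can be attributed to the recovery step rather than to the known entries alone.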
Top unique differentially expressed genes upregulated in lesional skin compared with those in non-lesional skin when ranked according to log2-fold-change in (a) original dataset, (b) predicted dataset with 30 % observability, and (c) sparse known-value (checkpoint) dataset without prediction at 30 % observability
| | Original dataset | | | | Recovered dataset (30 %) | | | | Checkpoint dataset (30 %) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene ranking | Probe ID | Symbol | log FC | Adj. P-Value | Probe ID | Symbol | log FC | Adj. P-Value | Probe ID | Symbol | log FC | Adj. P-Value |
| 1 | 205863_at | S100A12 | 9.79929 | < 1 | 205863_at | S100A12 | 8.48947 | < 1 | 207367_at | ATP12A | 3.17871 | 0.02 |
| 2 | 211906_s_at | SERPINB4 | 9.60376 | < 1 | 211906_s_at | SERPINB4 | 7.98211 | < 1 | 201086_x_at | SON | 3.12259 | 0.17 |
| 3 | 205513_at | TCN1 | 8.65788 | < 1 | 220664_at | SPRR2C | 7.17109 | < 1 | 213356_x_at | NA | 3.06212 | 0.29 |
| 4 | 232220_at | S100A7A | 8.21988 | < 1 | 232220_at | S100A7A | 6.77508 | < 1 | 209719_x_at | SERPINB3 | 2.98365 | 0.15 |
| 5 | 205660_at | OASL | 7.94647 | < 1 | 204385_at | KYNU | 6.4279 | < 1 | 33322_i_at | SFN | 2.89353 | 0.36 |
| 6 | 220664_at | SPRR2C | 7.87929 | < 1 | 207602_at | TMPRSS11D | 6.41765 | < 1 | 213523_at | KIAA0368 | 2.88306 | 0.29 |
| 7 | 207602_at | TMPRSS11D | 7.64471 | < 1 | 207367_at | ATP12A | 6.40415 | < 1 | 210413_x_at | CCNE1 | 2.83059 | 0.06 |
| 8 | 1569555_at | GDA | 7.39506 | < 1 | 210413_x_at | NA | 6.39934 | < 1 | 217388_s_at | NA | 2.82118 | 0.19 |
Note that the analysis performed solely on known expression values (c) gives incorrect conclusions, whereas the results of the analysis after low-rank prediction matched those obtained using the original dataset
Abbreviations: logFC, log2-fold-change; Ave Expr, average log2-expression of the probe over all arrays; Adj. P-Value, P-value adjusted from the raw P-value