Li Liu, Yung Chang, Tao Yang, David P Noren, Byron Long, Steven Kornblau, Amina Qutub, Jieping Ye.
Abstract
Despite wide applications of high-throughput biotechnologies in cancer research, many biomarkers discovered by exploring large-scale omics data do not provide satisfactory performance when used to predict cancer treatment outcomes. This problem arises partly because the functional implications of molecular markers are overlooked. Here, we present a novel computational method that uses evolutionary conservation as prior knowledge to discover bona fide biomarkers. Evolutionary selection at the molecular level is nature's test of the functional consequences of genetic elements. By prioritizing genes that show significant statistical association and high functional impact, our new method reduces the chances of including spurious markers in the predictive model. When applied to predicting therapeutic responses for patients with acute myeloid leukemia and to predicting metastasis for patients with prostate cancers, the new method gave rise to evolution-informed models that had low complexity and high accuracy. The identified genetic markers also have significant implications in tumor progression and include potential drug targets. Because evolutionary conservation can be estimated as a gene-specific, position-specific, or allele-specific parameter on the nucleotide level and on the protein level, this new method can be extended to miscellaneous "omics" data to accelerate biomarker discoveries.
Keywords: evolutionary medicine; genomics/proteomics; molecular evolution; transcriptomics
Year: 2016 PMID: 28035236 PMCID: PMC5192825 DOI: 10.1111/eva.12417
Source DB: PubMed Journal: Evol Appl ISSN: 1752-4571 Impact factor: 5.183
Figure 1. TimeTree of the 46 species used in computing evolutionary parameters. Branch length is proportional to species divergence times obtained from the TimeTree database (Hedges et al., 2006).
Figure 2. Graphical representation of the workflow of evolution‐informed modeling. (A) Input matrix. Each row represents a sample, with positive samples (i.e., with poor clinical outcomes) labeled as “1” and negative samples (i.e., with good clinical outcomes) labeled as “0.” Each column represents a feature, as indicated by different symbols. (B) Feature selection. Subsets of the input data are generated using under‐sampling that randomly chooses equal numbers of positive and negative samples. For each subset, feature values are transformed with composite weights. Feature selection is then applied on the weighted features. Using stability selection and sparse logistic regression, informative features are selected. Open symbols represent un‐weighted features. Solid symbols represent weighted features. (C) Classification model. For each subset, un‐weighted values of selected features are used to build a random forest classifier (a submodel). Collectively, these submodels comprise the ensemble model. (D) Prediction. For an unknown sample, each submodel produces a predicted label. The majority rule is used for the final prediction. The percentage of submodels that predict the sample as the positive class label is used as the confidence score of the final prediction.
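The under‐sampling step in (B) and the voting step in (D) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `balanced_subset` and `ensemble_predict` are hypothetical, each submodel is stood in for by a plain callable rather than a random forest, the feature weighting and stability selection steps are omitted, and the tie rule (ties go to the negative class) is an assumption not stated in the caption.

```python
import random


def balanced_subset(labels, rng):
    """Pick equal numbers of positive (1) and negative (0) sample indices.

    The subset size per class is the size of the smaller class
    (an assumption; the caption only says the numbers are equal).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)


def ensemble_predict(submodels, sample):
    """Majority vote over submodels.

    Each submodel is a callable returning 0 or 1 (a stand-in for a
    trained classifier). The confidence score is the fraction of
    submodels that vote for the positive class, as in panel (D).
    Ties are assigned to the negative class (assumption).
    """
    votes = [model(sample) for model in submodels]
    confidence = sum(votes) / len(votes)
    label = 1 if confidence > 0.5 else 0
    return label, confidence


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0, 0, 0, 0]
    subset = balanced_subset(labels, random.Random(0))
    print(sorted(subset))  # two positive and two negative indices

    # Three hypothetical submodels that threshold a single feature value.
    submodels = [lambda x: int(x > 0.2), lambda x: int(x > 0.5), lambda x: int(x > 0.8)]
    print(ensemble_predict(submodels, 0.6))
```

In the workflow described by the caption, one submodel would be trained per balanced subset, so the list of submodels has as many entries as there are subsets.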
Figure 3. Evolution‐informed modeling to predict treatment outcomes for AML patients. Distributions of evolutionary weights (A) and statistical weights (B). Balanced accuracy (C) and AUROC (D) values of models that use composite weight, only evolutionary weight, only statistical weight, and no weight. (E) Distribution of the number of features in each submodel when composite weight (solid line) or no weight (broken line) is used. The number of features is an indicator of the complexity of a model. (F) Number of submodels in which a clinical feature (black bars) or a proteomic feature (gray bars) is included. The plot shows the 85 features that were included in at least one submodel when composite weight is used.
Figure 4. Evolution‐informed modeling to predict metastasis for prostate cancers. Balanced accuracy (A) and AUROC values (B) for evolution‐informed models (solid lines) and for un‐weighted models (broken lines) that include various numbers of features. Average values with standard errors are plotted. * and ** indicate significant difference with t test p value <.05 or <.01, respectively. (C) Venn diagram of proteins included in the top‐performing evolution‐informed model and in the top‐performing uninformed model. Box plots to compare the distributions of evolutionary rate (D) and statistical significance (E) between all proteins, proteins included in the top‐performing evolution‐informed model, proteins included in the top‐performing uninformed models, and proteins unique to the top‐performing uninformed model. ** indicates significant difference with t test p value <.01.
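Figures 3 and 4 report balanced accuracy. As a brief illustration, and assuming the standard definition (the mean of sensitivity and specificity), which this record does not itself spell out, the metric can be computed as in the sketch below. The function name `balanced_accuracy` is chosen here for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (0 or 1).

    Assumes both classes occur in y_true; otherwise one of the
    per-class rates is undefined and a ZeroDivisionError is raised.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    return 0.5 * (sensitivity + specificity)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1]
    print(balanced_accuracy(y_true, y_pred))
```

Unlike plain accuracy, this metric weights the positive and negative classes equally, which matters when one clinical outcome is much rarer than the other.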