Fan Lin, Jue Fan, Seung Y Rhee.
Abstract
Linkage mapping is one of the most commonly used methods to identify genetic loci that determine a trait. However, the loci identified by linkage mapping may contain hundreds of candidate genes and require a time-consuming and labor-intensive fine-mapping process to find the causal gene controlling the trait. With the availability of a rich assortment of genomic and functional genomic data, it is possible to develop a computational method to facilitate faster identification of causal genes. We developed QTG-Finder, a machine-learning-based algorithm that prioritizes causal genes by ranking genes within a quantitative trait locus (QTL). Two predictive models were trained separately based on known causal genes in Arabidopsis and rice. An independent validation analysis showed that the models could recall about 64% of Arabidopsis and 79% of rice causal genes when the top 20% ranked genes were considered. The top 20% ranked genes can range from 10 to 100 genes, depending on the size of a QTL. The models can prioritize different types of traits, though with different efficiency. We also identified several important features of causal genes, including paralog copy number, being a transporter, being a transcription factor, and containing SNPs that cause premature stop codons. This work lays the foundation for systematically understanding characteristics of causal genes and establishes a pipeline to predict causal genes based on public data.
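The "top 20%" criterion in the abstract can be illustrated with a minimal sketch. It is not the QTG-Finder code: the function names, gene IDs, and scores below are hypothetical, and the model scores are assumed to be already computed for every gene in a QTL. The sketch only shows how a rank cutoff turns scores into a shortlist, and how recall could be counted across several QTL with known causal genes. Under this reading, a shortlist of 10 to 100 genes corresponds to QTL of roughly 50 to 500 genes.

```python
def cutoff_size(n_genes, percent=20):
    """Number of genes in the top `percent` of a QTL, rounded up (integer math)."""
    return (n_genes * percent + 99) // 100


def top_ranked(scores, percent=20):
    """Return gene IDs in the top `percent` of a QTL, highest score first.

    `scores` maps gene ID -> model score (hypothetical values here).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:cutoff_size(len(ranked), percent)]


def recall_at_percent(qtl_list, percent=20):
    """Fraction of QTL whose known causal gene lands in the top-ranked set.

    `qtl_list` holds (scores, causal_gene_id) pairs, one per QTL.
    """
    hits = sum(causal in top_ranked(scores, percent) for scores, causal in qtl_list)
    return hits / len(qtl_list)


# Hypothetical toy data, not taken from the paper.
qtl_a = ({"g1": 0.91, "g2": 0.15, "g3": 0.40, "g4": 0.05, "g5": 0.33}, "g1")
qtl_b = ({"g6": 0.20, "g7": 0.85, "g8": 0.10, "g9": 0.60, "g10": 0.30}, "g9")

print(cutoff_size(50), cutoff_size(500))      # 10 100
print(top_ranked(qtl_a[0]))                   # ['g1']
print(recall_at_percent([qtl_a, qtl_b]))      # 0.5
```

Integer arithmetic is used for the cutoff so that the rounding does not depend on floating-point representation of the percentage.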
Keywords: Arabidopsis; causal gene; machine learning; quantitative trait loci; rice
Year: 2019 PMID: 31358562 PMCID: PMC6778793 DOI: 10.1534/g3.119.400319
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Model training and optimization based on cross-validation. (A) Model training and cross-validation framework. We randomly selected negatives from the genome and iterated to maximize the combinations of training and testing data. (B) The ROC curves of the Arabidopsis and rice models after parameter optimization. True and false positive rates were based on the average of all iterations. The gray diagonal line indicates the expected performance based on random guessing. The number in parentheses indicates the Area Under the ROC Curve (AUC-ROC).
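The evaluation idea in Figure 1 (random negatives drawn from the genome, metrics averaged over iterations) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it skips model training entirely and assumes every gene already has a score, the function names are hypothetical, and AUC-ROC is computed through its pairwise interpretation (the probability that a positive outscores a negative, with ties counted as one half).

```python
import random


def auc_roc(pos_scores, neg_scores):
    """AUC-ROC as P(positive score > negative score); ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mean_auc_over_resamples(pos_scores, genome_scores, n_neg, n_iter, seed=0):
    """Average AUC-ROC over repeated random draws of negatives from the genome."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    aucs = []
    for _ in range(n_iter):
        negatives = rng.sample(genome_scores, n_neg)  # draw without replacement
        aucs.append(auc_roc(pos_scores, negatives))
    return sum(aucs) / len(aucs)


# Hypothetical scores, not taken from the paper.
causal_scores = [0.9, 0.7, 0.6]
genome_background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.65, 0.8, 0.05, 0.15, 0.25]

print(auc_roc([0.9, 0.3], [0.5, 0.1]))  # 0.75
print(mean_auc_over_resamples(causal_scores, genome_background, n_neg=5, n_iter=20))
```

A value of 0.5 corresponds to the gray diagonal in panel B (random guessing), and 1.0 to a perfect separation of positives from negatives.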
Figure 2. Important features of causal genes and their enrichment or depletion relative to the genome background. (A) Feature importance as indicated by the change of AUC-ROC (ΔAUC-ROC) when excluding each feature. The ΔAUC-ROC is the average value over all iterations. Error bars indicate standard deviation. Features with a name that starts with “is_” are binary variables. (B) The enrichment or depletion of the top 6 features in the Arabidopsis and rice models. Enrichment or depletion is indicated by the ratio of causal genes to genome background. ns, not shown because the feature is not one of the top 6 features in that species.
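One plausible reading of the enrichment ratio in panel B, for a binary feature, is the frequency of the feature among causal genes divided by its frequency in the genome background. The sketch below assumes that reading; the caption does not spell out the exact formula, and the function name and the 0/1 flag lists are hypothetical.

```python
def enrichment_ratio(causal_flags, genome_flags):
    """Ratio of a binary feature's frequency in causal genes to the genome background.

    Values above 1 suggest enrichment, values below 1 suggest depletion.
    Both arguments are lists of 0/1 flags (hypothetical data in this sketch).
    """
    causal_freq = sum(causal_flags) / len(causal_flags)
    genome_freq = sum(genome_flags) / len(genome_flags)
    if genome_freq == 0:
        raise ValueError("feature is absent from the genome background")
    return causal_freq / genome_freq


# Hypothetical example: 2 of 8 causal genes carry the feature,
# against 5 of 40 genes in the background.
causal = [1, 1, 0, 0, 0, 0, 0, 0]
genome = [1] * 5 + [0] * 35

print(enrichment_ratio(causal, genome))  # 2.0
```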
Figure 3. Model performance at different thresholds. (A) Percentage of recalled causal genes of a single QTL at different rank thresholds. Dashed lines indicate the background of random selections. (B-C) The probability of causal gene recall when analyzing multiple QTL simultaneously.
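To make the multi-QTL idea in panels B-C concrete, here is a sketch under a simplifying assumption that is not stated in the caption: each QTL is treated as an independent trial with the same single-QTL recall probability, so the number of recalled causal genes follows a binomial distribution. The function name is hypothetical, and the 0.64 used in the example is the Arabidopsis recall at the top 20% threshold quoted in the abstract.

```python
from math import comb


def prob_at_least(k, n_qtl, p_recall):
    """P(at least k causal genes recalled among n_qtl QTL), binomial model.

    Assumes independent QTL with equal per-QTL recall probability `p_recall`.
    """
    return sum(
        comb(n_qtl, i) * p_recall**i * (1 - p_recall) ** (n_qtl - i)
        for i in range(k, n_qtl + 1)
    )


# Chance of recalling at least one causal gene when three QTL are analyzed
# together, using the abstract's 64% single-QTL recall as the input.
print(round(prob_at_least(1, 3, 0.64), 4))  # 0.9533
```

Under this model the chance of recovering at least one causal gene rises quickly with the number of QTL analyzed, since it equals one minus the probability of missing in every QTL.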
Figure 4. Performance comparison across trait categories. (A) Trait categories of known causal genes from the training set. (B) The rank percentile of causal genes of different trait categories. Each causal gene and 200 neighboring genes were used as a testing set only once. All other known causal genes were used for training. Each dot indicates a known causal gene. The gray dashed line indicates the 20% rank percentile. The trait categories of causal genes are defined in Tables S1 and S2.