Ryan C Sartor, Jaclyn Noshay, Nathan M Springer, Steven P Briggs.
Abstract
Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.
Keywords: epigenomics; genome annotation; machine learning; maize; proteomics
Year: 2019 PMID: 31420517 PMCID: PMC6731682 DOI: 10.1073/pnas.1813645116
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Fig. 1. Overview of model features and training set definitions. (A) The various genomic regions where DNA methylation levels were quantified and used as features for classification. Features with gray labels were discarded after initial testing. Each gene was also split into 5 equivalent regions, called bins, and features were quantified separately in each bin. (B) The distribution of detected mRNA abundance is bimodal. The 2 mRNA populations can be roughly separated using an FPKM of 1. Here the nondetected mRNA (No mRNA) is represented as a separate population and given an artificial value of −12. Each population can be further refined into observed vs. nonobserved protein (No Protein) to yield 6 different groups of genes indicated by the different colors. LR_OP refers to all annotated genes that were observed to express low levels of mRNAs and detectable levels of proteins. (C) Three separate random forest models were built. Colored blocks correspond to the gene sets (from B) used for each training class. Blocks on the left indicate the positive (true) training instances vs. blocks on the right that indicate the negative (false) training instances. Numbers in parentheses indicate the number of genes in each training class.
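The binning in (A) and the six-way grouping in (B) can be illustrated with a minimal sketch. It is not the authors' code. Several details are assumptions made for illustration: only the label LR_OP appears in the caption, so the other five labels (NR_NP, NR_OP, LR_NP, HR_NP, HR_OP) are hypothetical names built on the same pattern; an FPKM of 0 is treated as "No mRNA"; the artificial value of −12 is assumed to sit on a log2 scale; and each gene is assumed to be represented by a list of per-position methylation values.

```python
import math

# Artificial value for nondetected mRNA (from the caption); log2 scale is an assumption.
NO_MRNA_LOG2 = -12.0


def log2_fpkm(fpkm):
    """Return log2(FPKM), or the artificial placeholder when no mRNA was detected."""
    return math.log2(fpkm) if fpkm > 0 else NO_MRNA_LOG2


def expression_group(fpkm, protein_observed, cutoff=1.0):
    """Assign a gene to one of 6 groups: 3 mRNA levels x protein observed or not.

    Labels other than LR_OP are hypothetical names, not taken from the source.
    """
    if fpkm <= 0:
        rna = "NR"  # no mRNA detected
    elif fpkm < cutoff:
        rna = "LR"  # low mRNA (below the FPKM cutoff of 1)
    else:
        rna = "HR"  # high mRNA
    return rna + ("_OP" if protein_observed else "_NP")


def bin_means(values, n_bins=5):
    """Split per-position values along a gene into n_bins near-equal bins
    and return the mean value in each bin."""
    n = len(values)
    if n < n_bins:
        raise ValueError("need at least one value per bin")
    means = []
    for i in range(n_bins):
        start = i * n // n_bins
        end = (i + 1) * n // n_bins
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means
```

For example, `expression_group(0.4, True)` returns `"LR_OP"`, matching the caption's description of genes with low mRNA and detectable protein.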
Fig. 2. Results for random forest models. (A) Receiver operating characteristic (ROC) curves showing classification accuracy of the EPC, ERC, and PFI models. (B and C) Binned scatterplot showing prediction accuracy for quantitative abundance models considering only genes with observed expression for mRNA abundance (B) and protein abundance (C). (D–F) Signed feature importance measures for 3 different models. The values reflect the random forest “mean decrease in accuracy” measure of feature importance. The sign is based on the relationship of the feature values to the training class assignments. Positive values indicate a positive correlation between the feature and either protein observation (EPC and PFI) or high mRNA (ERC).
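For readers unfamiliar with ROC curves such as those in (A), the following is a generic sketch of how a curve and its area can be computed from classifier scores and true class labels. It is a textbook construction, not the authors' pipeline; the function names `roc_points` and `auc` are illustrative, and the inputs are assumed to be a list of prediction scores and a parallel list of 0/1 labels.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate),
    sweeping the decision threshold from the highest score downward.
    Tied scores are processed together so they yield a single point."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every instance sharing this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An area of 1.0 corresponds to perfect separation of the two training classes, and 0.5 to random guessing.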
Fig. 3. A new version of the ERC was generated (called ERC-2) using the same training classes defined for ERC but with WGBS data from B73 third leaf tissue that was summarized in 100 bp windows along the genome. This ERC-2 was then used to classify 2 test data sets of similar WGBS data. The first set (A, C, and E) was sampled from the third leaf of 4 diverse maize inbred lines (CML322, Mo17, Oh43, and Tx303). The second set (B, D, and F) was sampled from 3 additional B73 tissues (anther, ear shoot, and shoot apical meristem). In addition, each of these test samples has corresponding transcript profiling via RNA-seq available. (A) Receiver operating characteristic (ROC) curve showing prediction accuracies achieved by the ERC-2 model on the B73 training genotype, using cross validation and when the ERC-2 model is tested with new methylation data from different maize inbreds. (B) Receiver operating characteristic (ROC) curve showing prediction accuracies achieved by the ERC-2 model on 3 additional B73 tissues. (C and D) Scatterplots showing the prediction scores between pairwise comparisons of all 4 test inbreds (C) or all 3 test tissues of B73 (D). Each point represents one gene for one test inbred–to–test inbred or test tissue–to–test tissue comparison. Points in the upper left (blue) represent genes that are classified differently in one sample compared to another. (E and F) Scatterplots showing comparison of mRNA abundance in test sample pairs for differentially classified genes (blue dots in C and D). The numbers in the corners represent gene counts in each quadrant (quadrants are defined using cutoffs at log2[FPM] = 0). Quadrant 1 is further split into 2 via a diagonal gray line, with black numbers representing corresponding gene counts.
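The quadrant counts described for (E and F) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: quadrant 1 is taken to be the upper right (both samples above the cutoff), the diagonal is assumed to be the line y = x, values exactly at the cutoff are counted as not above it, and the quadrant labels are hypothetical names. Inputs are assumed to be (x, y) pairs of log2(FPM) values for one gene in two samples.

```python
from collections import Counter


def quadrant(x, y, cutoff=0.0):
    """Classify one gene by its log2(FPM) in two samples (x, y).

    Quadrant numbering (1 = upper right, counterclockwise) and the y = x
    diagonal are assumptions made for this illustration.
    """
    if x > cutoff and y > cutoff:
        # Quadrant 1 is split by the diagonal into two regions.
        return "Q1_above_diagonal" if y > x else "Q1_on_or_below_diagonal"
    if x <= cutoff and y > cutoff:
        return "Q2"
    if x <= cutoff and y <= cutoff:
        return "Q3"
    return "Q4"


def quadrant_counts(pairs, cutoff=0.0):
    """Count genes per quadrant for a list of (x, y) log2(FPM) pairs."""
    return Counter(quadrant(x, y, cutoff) for x, y in pairs)
```

Genes falling in Q2 or Q4 under these assumptions are those expressed above the cutoff in only one of the two samples being compared.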