Caitlin M A Simopoulos, Elizabeth A Weretilnyk, G Brian Golding.
Abstract
Long non-coding RNAs (lncRNAs) represent a diverse class of regulatory loci with roles in development and stress responses throughout all kingdoms of life. LncRNAs, however, remain under-studied in plants compared to animal systems. To address this deficiency, we applied a machine learning prediction tool, Classifying RNA by Ensemble Machine learning Algorithm (CREMA), to analyze RNAseq data from 11 plant species chosen to represent a wide range of evolutionary histories. Transcript sequences of all expressed and/or annotated loci from plants grown in unstressed (control) conditions were assembled and input into CREMA for comparative analyses. On average, 6.4% of the plant transcripts were identified by CREMA as encoding lncRNAs. Gene annotation associated with the transcripts showed that up to 99% of all predicted lncRNAs for Solanum tuberosum and Amborella trichopoda were missing from their reference annotations whereas the reference annotation for the genetic model plant Arabidopsis thaliana contains 96% of all predicted lncRNAs for this species. Thus a reliance on reference annotations for use in lncRNA research in less well-studied plants can be impeded by the near absence of annotations associated with these regulatory transcripts. Moreover, our work using phylogenetic signal analyses suggests that molecular traits of plant lncRNAs display different evolutionary patterns than all other transcripts in plants and have molecular traits that do not follow a classic evolutionary pattern. Specifically, GC content was the only tested trait of lncRNAs with consistently significant and high phylogenetic signal, contrary to high signal in all tested molecular traits for the other transcripts in our tested plant species.
Keywords: CREMA; RNASeq; evolution; lncRNA; phylogenetic signal
Year: 2019 PMID: 31235560 PMCID: PMC6686929 DOI: 10.1534/g3.119.400201
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Information on the data sources of RNASeq libraries used in this study
| Species | # high quality reads | # mapped reads | BioProject | SRA | Source of RNASeq | Genome Source |
|---|---|---|---|---|---|---|
| | 12,469,853 | 11,106,056 | PRJNA311702 | SRR3162008 | | |
| | 18,624,814 | 18,277,415 | PRJNA307656 | SRR3095793 | | |
| | 49,522,792 | 46,130,371 | PRJNA494564 | SRR7962298 | This manuscript | |
| | 23,490,825 | 23,111,430 | PRJNA186843 | SRR2079778 | | |
| | 15,141,539 | 14,481,792 | PRJNA269060 | SRR1688291 | | |
| | 23,501,682 | 22,145,297 | PRJNA301554 | SRR2931278 | | |
| | 17,913,230 | 17,355,462 | PRJNA212863 | SRR5293262 | | |
| | 108,008,790 | 92,873,912 | PRJNA351923 | SRR4762345 | | |
| | 10,520,395 | 8,243,406 | PRJNA265205 | SRR1553300 | | |
| | 22,002,690 | 21,222,625 | PRJNA264777 | SRR1622084 | | |
| | 16,972,867 | 15,594,598 | PRJNA210992 | SRR929426 | | |
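The read counts in this table can be turned into a per-library mapping rate. The sketch below is illustrative only: the source does not report mapping rates, and it assumes that the "# mapped reads" are a subset of the "# high quality reads". Libraries are keyed by SRA run accession; the names `READS` and `mapping_rate` are introduced here for illustration.

```python
# Read counts copied from the table above, keyed by SRA run accession.
# Each value is (high-quality reads, mapped reads).
READS = {
    "SRR3162008": (12_469_853, 11_106_056),
    "SRR3095793": (18_624_814, 18_277_415),
    "SRR7962298": (49_522_792, 46_130_371),
    "SRR2079778": (23_490_825, 23_111_430),
    "SRR1688291": (15_141_539, 14_481_792),
    "SRR2931278": (23_501_682, 22_145_297),
    "SRR5293262": (17_913_230, 17_355_462),
    "SRR4762345": (108_008_790, 92_873_912),
    "SRR1553300": (10_520_395, 8_243_406),
    "SRR1622084": (22_002_690, 21_222_625),
    "SRR929426": (16_972_867, 15_594_598),
}


def mapping_rate(high_quality: int, mapped: int) -> float:
    """Percentage of high-quality reads that mapped (assumed definition)."""
    return 100.0 * mapped / high_quality


rates = {run: mapping_rate(*counts) for run, counts in READS.items()}

# Print libraries from lowest to highest mapping rate.
for run, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{run}\t{rate:.1f}%")
```

Sorting by rate makes it easy to spot libraries where a comparatively large share of reads did not map.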
Figure 1. Total predicted lncRNAs from 10 plant species. The counts of putative lncRNAs are categorized as transcripts that appear in the reference annotation of each species (purple) and novel transcripts, i.e., those that did not appear in the transcriptome annotation (coral). The percentages of novel transcripts (coral) predicted as lncRNAs appear above each bar.
Number of predicted lncRNAs in each species ordered by phylogenetic relationship
| Species | # of assembled transcripts | # predicted lncRNAs | % lncRNAs |
|---|---|---|---|
| | 73,656 | 3,783 | 5.1% |
| | 43,936 | 2,721 | 6.2% |
| | 34,862 | 1,040 | 3.0% |
| | 61,480 | 2,918 | 4.8% |
| | 95,713 | 7,225 | 7.6% |
| | 66,562 | 3,753 | 5.6% |
| | 42,118 | 6,972 | 16.6% |
| | 33,266 | 2,269 | 6.8% |
| | 88,649 | 4,648 | 5.2% |
| | 21,467 | 1,383 | 6.4% |
| | 58,531 | 1,796 | 3.0% |
We did not complete phylogenetic analysis on B. hygrometrica due to limited gene annotation availability, and thus it is placed at the end of the table.
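The 6.4% average quoted in the abstract can be checked against the counts in this table. The following is a minimal sketch, assuming the reported average is the unweighted mean of the per-species percentages (the source does not say how it was computed). Rows are kept in table order because this record does not include species names; `COUNTS` and the other variable names are illustrative.

```python
# (assembled transcripts, predicted lncRNAs) for each row of the table above,
# in table order.
COUNTS = [
    (73_656, 3_783),
    (43_936, 2_721),
    (34_862, 1_040),
    (61_480, 2_918),
    (95_713, 7_225),
    (66_562, 3_753),
    (42_118, 6_972),
    (33_266, 2_269),
    (88_649, 4_648),
    (21_467, 1_383),
    (58_531, 1_796),
]

# Per-species percentage of assembled transcripts predicted as lncRNAs.
percentages = [100.0 * lnc / total for total, lnc in COUNTS]

# Unweighted mean across species (every species counts equally).
mean_percentage = sum(percentages) / len(percentages)

# Pooled percentage (every transcript counts equally), shown for contrast.
pooled_percentage = (
    100.0 * sum(lnc for _, lnc in COUNTS) / sum(total for total, _ in COUNTS)
)

print(f"mean of per-species percentages: {mean_percentage:.1f}%")
print(f"pooled across all transcripts:   {pooled_percentage:.1f}%")
```

The unweighted mean and the pooled value differ because species with larger assemblies contribute more transcripts to the pooled figure.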
Figure 2. Mean trait values of transcripts predicted as lncRNAs (coral, circle) and all other assembled transcripts (purple, triangle). Species are ordered as per phylogenetic relationships.
Phylogenetic signal estimates in transcript features
| Feature | lncRNA | | | All other transcripts | | |
|---|---|---|---|---|---|---|
| I | K | I | K | |||
| ORF | 0.040 | 0.975 | 0.621 | 0.010 | 0.974 | 1.746 |
| GC% | 0.032 | 1.027 | 1.614 | 0.048 | 1.020 | 1.038 |
| Exons | −0.053 | 0.620 | 0.336 | 0.010 | 0.922 | 1.068 |
| Length | −0.020 | 1.007 | 0.642 | 0.038 | 0.953 | 1.436 |
P < 0.05.
I = Moran’s I, K = Blomberg’s K, λ = Pagel’s λ.
“ORF” refers to ORF length, “GC%” refers to GC content, “Exons” refers to number of exons, and “Length” refers to transcript length.
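Three of the four features can be computed directly from a transcript sequence. The sketch below is illustrative only and is not the authors' pipeline: it assumes an ORF runs from an ATG to the first in-frame stop codon on the forward strand, and that ORF length is counted in nucleotides including the stop codon. The number of exons requires a genome alignment or annotation and is not covered. The function names and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ATG-to-stop ORF.

    Assumption: only the three forward reading frames are scanned, and the
    stop codon is included in the reported length.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


# Toy sequence, invented for illustration.
toy = "GGATGAAACCCTAGTT"
print("Length:", len(toy))
print("GC%:", gc_content(toy))
print("ORF:", longest_orf_length(toy))
```

Per-species means of these values are the kind of trait summary that a phylogenetic signal test takes as input.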
Figure 3. Moran’s I local correlogram of mean trait values in lncRNAs and all other transcripts. Coral points indicate significant phylogenetic signal at a particular phylogenetic distance. The horizontal line marks the value expected under the null hypothesis that no phylogenetic signal is detected. This null value is −0.111, or −1/(n − 1), where n = 10 is the number of tested species. The 95% confidence intervals, computed using bootstrapping, are also plotted and were used to identify significant values.
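The null value in this caption follows from the expected value of Moran's I when there is no autocorrelation, −1/(n − 1). The sketch below computes that expectation and, for context, a generic Moran's I from a weight matrix. It is a minimal illustration, not the authors' analysis: in a phylogenetic correlogram the weights would be derived from phylogenetic distances, whereas the toy weights and trait values here are invented.

```python
def expected_morans_i(n: int) -> float:
    """Expected Moran's I under the null hypothesis of no autocorrelation."""
    return -1.0 / (n - 1)


def morans_i(values, weights):
    """Moran's I for trait values and a square weight matrix.

    I = (n / W) * sum_ij w_ij * z_i * z_j / sum_i z_i**2,
    where z_i is the deviation from the mean and W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / sum(d * d for d in z)


# Null expectation for the 10 species tested in the figure.
print(round(expected_morans_i(10), 3))

# Toy example: four taxa on a chain, neighbours weighted 1 (hypothetical).
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i(toy_values, toy_weights))
```

Values of I above the null expectation indicate that closely related taxa have more similar trait values than expected by chance.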