Xiaoyong Pan, Kai Xiong, Christian Anthon, Poul Hyttel, Kristine K Freude, Lars Juhl Jensen, Jan Gorodkin.
Abstract
Circular RNAs (circRNAs) are increasingly recognized to play crucial roles in post-transcriptional gene regulation, including functioning as microRNA (miRNA) sponges or as widespread regulators, for example in stem cell differentiation. It is therefore highly relevant to determine whether a transcript of interest can also function as a circRNA. Here, we present a user-friendly web server that predicts whether coding and noncoding RNAs have circRNA isoforms and whether circRNAs are expressed in stem cells. The predictions are made by random forest models using sequence-derived features as input. The output scores are converted to fractiles, which are used to assess the circRNA and stem cell potential. The performances of the three models, reported as the area under the receiver operating characteristic (ROC) curve, are 0.82 for coding genes, 0.89 for long noncoding RNAs (lncRNAs) and 0.72 for stem cell expression. We present WebCircRNA for quick evaluation of human genes and transcripts for their circRNA potential, which can be essential in several contexts.
Keywords: Circular RNA; noncoding RNA; random forest
Year: 2018 PMID: 30404245 PMCID: PMC6266491 DOI: 10.3390/genes9110536
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Flowchart of the WebCircRNA framework. BED: browser extensible data; ORF: open reading frame; ALU: transposable element; SNP: single nucleotide polymorphism; CP: circular RNA potential; PCG: protein coding gene; lncRNA: long non-coding RNA; SP: stem cell potential; circRNA: circular RNA; UCSC: University of California, Santa Cruz.
Table 1. Details of the training and independent test sets. The table summarizes which sequences were used as positive and negative examples for the respective random forest (RF) models.
| Model | Positive Data | Negative Data |
|---|---|---|
| circRNA vs. PCG | Total: 14,084 circRNAs | Total: 9533 PCGs not overlapping with circRNAs |
| | Training: 10,000 | Training: 8000 |
| | Independent testing: 4084 | Independent testing: 1533 |
| circRNA vs. lncRNA | Total: 14,084 circRNAs | Total: 19,722 lncRNAs not overlapping with circRNAs |
| | Training: 10,000 | Training: 10,000 |
| | Independent testing: 4084 | Independent testing: 9722 |
| Stem cell vs. not | Total: 2082 circRNAs | Total: 2082 circRNAs |
| | Training: 1800 | Training: 1800 |
| | Independent testing: 282 | Independent testing: 282 |
Table 2. The 178 extracted features, divided into four groups.
| Feature Group | Feature Names |
|---|---|
| Basic sequence features | Length; AG, GT, GTAG, AGGT, GC content; 64 trinucleotide frequencies |
| Graph features | Top 101 graph features from GraphProt 1.0.1 |
| Conservation features | Mean, standard deviation of conservation score |
| Other features | ALU, tandem, ORF length, ORF prop, SNP density |
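
The basic sequence features in the first row can be illustrated with a minimal Python sketch. It is not the WebCircRNA implementation: the function name, the `motif_` and `freq_` key names, the use of overlapping windows, and the normalization by the number of windows are all assumptions made for illustration. Under these assumptions it yields 70 values (length, GC content, four motif frequencies and 64 trinucleotide frequencies).

```python
from itertools import product

BASES = "ACGT"
MOTIFS = ("AG", "GT", "GTAG", "AGGT")  # motif list taken from the feature table


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, allowing overlapping matches."""
    return sum(
        1 for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)
    )


def basic_sequence_features(seq: str) -> dict:
    """Return length, GC content, motif and trinucleotide frequencies.

    Assumption: frequencies are counts over overlapping windows divided by
    the number of windows; the original normalization is not stated here.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    if n < 3:
        raise ValueError("sequence must contain at least three nucleotides")

    features = {
        "length": n,
        "GC_content": (seq.count("G") + seq.count("C")) / n,
    }

    # Frequencies of the four motifs (hypothetical key names).
    for motif in MOTIFS:
        windows = n - len(motif) + 1
        count = count_overlapping(seq, motif)
        features[f"motif_{motif}"] = count / windows if windows > 0 else 0.0

    # Frequencies of all 64 trinucleotides; windows containing other
    # symbols (for example N) are skipped.
    counts = {"".join(k): 0 for k in product(BASES, repeat=3)}
    windows = n - 2
    for i in range(windows):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
    for tri, count in counts.items():
        features[f"freq_{tri}"] = count / windows

    return features
```

For example, `basic_sequence_features("AGGTAG")` returns a dictionary with 70 entries in which `GC_content` is 0.5. The graph, conservation and other feature groups depend on external tools and annotation and are not covered by this sketch.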
Figure 2. Flowchart illustrating how the final fractile score of an input sequence is obtained. Each model predicts a score, which is then converted into a fractile. Novel sequences not in the validation sets are scored relative to the fractiles of each model, and the resulting values are averaged over all five models.
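
The score-to-fractile conversion described in this caption can be sketched as follows. This is an illustrative reading, not the server's code: it assumes that a fractile is the fraction of a model's validation-set scores that are less than or equal to the new score, and that each of the five models has its own list of validation scores. The function names and toy numbers are hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def fractile(score: float, reference_scores: list) -> float:
    """Fraction of reference scores that are <= score (assumed definition)."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    ordered = sorted(reference_scores)
    return bisect_right(ordered, score) / len(ordered)


def averaged_fractile(model_scores: list, validation_scores: list) -> float:
    """Average the per-model fractiles of one novel sequence.

    model_scores[i] is the score that model i assigns to the sequence;
    validation_scores[i] holds the scores of model i on its validation set.
    """
    if len(model_scores) != len(validation_scores):
        raise ValueError("one validation score list is needed per model")
    return mean(
        fractile(score, reference)
        for score, reference in zip(model_scores, validation_scores)
    )


# Toy example with two models instead of five.
toy_validation = [[0.1, 0.4, 0.6, 0.9], [0.2, 0.3, 0.8, 0.9]]
print(averaged_fractile([0.5, 0.95], toy_validation))  # 0.75
```

Sorting each validation list once and reusing it would be preferable when many sequences are scored; the sketch re-sorts on every call for brevity.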
Figure 3. ROC curves for: (A) the PCG circRNA model (CP-PCG); (B) the lncRNA circRNA model (CP-lncRNA); and (C) the stem cell circRNA model (SP-circRNA). In each panel, the ROC curve of the model using all 178 features listed in Table 2 is compared with those of models using “only the GC content and sequence length” and “only the sequence features”.
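
The area under a ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes the area in this pairwise way; it is a generic illustration with toy labels and scores, not the evaluation code behind the reported values, and the function name is hypothetical.

```python
def roc_auc(labels: list, scores: list) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels contains 1 for positive and 0 for negative examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of examples, so a rank-based computation would be a better choice for test sets with thousands of sequences.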
Figure 4. ROC curves for the mouse test data for the CP-PCG and CP-lncRNA models, both of which are trained on human data using “only the sequence features”.
Figure 5. The top 10 features for: (A) the CP-PCG model; (B) the CP-lncRNA model; and (C) the SP-circRNA model. The prefix “graph” refers to the 101 graph features, the prefix “cons” to the conservation features and the prefix “freq” to the codon frequency features.
Figure 6. WebCircRNA. Example output for the lncRNA CDKN2B-AS and the two PCGs DHDDS and OCT4 is shown. Because CDKN2B-AS is an lncRNA and DHDDS and OCT4 are PCGs, we ignore the “PCG circRNA” score for CDKN2B-AS and the “lncRNAs circRNA” scores for DHDDS and OCT4. Because “stem cell circRNA” scores only apply to circRNAs, this score should also be disregarded for OCT4, which is not predicted to have a circRNA isoform. The respective “FPR” shows the estimated false positive rate of the corresponding method. When a BED file is submitted, the genomic context of any prediction can be viewed in the UCSC browser via the link in the “Position” column.