Chengyu Liu1, Yu-Chen Liu2,3,4, Hsien-Da Huang5,6,7, Wei Wang1. 1. Department of Chemistry and Biochemistry. 2. Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, USA. 3. Institute of Bioinformatics and Systems Biology, National Chiao Tung University, HsinChu, Taiwan. 4. Institute of Pharmacology, National Yang-Ming University, Taipei, Taiwan. 5. School of Life and Health Sciences. 6. Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province, China. 7. Shenzhen Bay Laboratory, Nanshan District, Shenzhen, Guangdong Province, China.
Abstract
MOTIVATION: In recent years, multiple circular RNAs (circRNA) biogenesis mechanisms have been discovered. Although each reported mechanism has been experimentally verified in different circRNAs, no single biogenesis mechanism has been proposed that can universally explain the biogenesis of all tens of thousands of discovered circRNAs. Under the hypothesis that human circRNAs can be categorized according to different biogenesis mechanisms, we designed a contextual regression model trained to predict the formation of circular RNA from a random genomic locus on human genome, with potential biogenesis factors of circular RNA as the features of the training data. RESULTS: After achieving high prediction accuracy, we found through the feature extraction technique that the examined human circRNAs can be categorized into seven subgroups, according to the presence of the following sequence features: RNA editing sites, simple repeat sequences, self-chains, RNA binding protein binding sites and CpG islands within the flanking regions of the circular RNA back-spliced junction sites. These results support all of the previously reported biogenesis mechanisms of circRNA and solidify the idea that multiple biogenesis mechanisms co-exist for different subset of human circRNAs. Furthermore, we uncover a potential new links between circRNA biogenesis and flanking CpG island. We have also identified RNA binding proteins putatively correlated with circRNA biogenesis. AVAILABILITY AND IMPLEMENTATION: Scripts and tutorial are available at http://wanglab.ucsd.edu/star/circRNA. This program is under GNU General Public License v3.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: In recent years, multiple circular RNAs (circRNA) biogenesis mechanisms have been discovered. Although each reported mechanism has been experimentally verified in different circRNAs, no single biogenesis mechanism has been proposed that can universally explain the biogenesis of all tens of thousands of discovered circRNAs. Under the hypothesis that human circRNAs can be categorized according to different biogenesis mechanisms, we designed a contextual regression model trained to predict the formation of circular RNA from a random genomic locus on human genome, with potential biogenesis factors of circular RNA as the features of the training data. RESULTS: After achieving high prediction accuracy, we found through the feature extraction technique that the examined human circRNAs can be categorized into seven subgroups, according to the presence of the following sequence features: RNA editing sites, simple repeat sequences, self-chains, RNA binding protein binding sites and CpG islands within the flanking regions of the circular RNA back-spliced junction sites. These results support all of the previously reported biogenesis mechanisms of circRNA and solidify the idea that multiple biogenesis mechanisms co-exist for different subset of human circRNAs. Furthermore, we uncover a potential new links between circRNA biogenesis and flanking CpG island. We have also identified RNA binding proteins putatively correlated with circRNA biogenesis. AVAILABILITY AND IMPLEMENTATION: Scripts and tutorial are available at http://wanglab.ucsd.edu/star/circRNA. This program is under GNU General Public License v3.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Circular RNAs (circRNAs) represent an emerging type of regulatory RNA with 3’ and 5’ ends covalently join together as ‘back-spliced junctions’. circ RNAs have been reported to have multiple functions. Beside serving as miRNA sponges, many circRNAs are important for brain function, synaptic plasticity (Westholm ; Rybak-Wolf ; You ) and fetal development (Szabo ). Cell free circRNAs are found stable in saliva (Bahn ) as well as exosomes (Li ), which makes circRNA a promising diagnosis biomarker. Most circRNAs are originated from circularization of coding gene exons, which leads to the hypothesis that circRNA biogenesis competes with pre-mRNA splicing (Ashwal-Fluss ). Literature evidences suggest that the biogenesis of each circRNA subset may likely be regulated by different mechanisms (Chen and Yang, 2015; Conn ; Ivanov ; Jeck ; Li ; Liang and Wilusz, 2014; Zhang , 2014, 2016), which supports the existence of multiple subclasses of circRNAs and each with specific roles.In this study, we aim to decipher the relationship between the sequence features and circRNA subclasses. To this end, we have developed a computational model to predict whether a genomic locus would generate circRNA based on the presence of the following sequence features within the flanking regions of the circular RNA back-spliced junction sites: CpG islands, enhancers, RNA binding protein (RBP) binding sites, simple repeats, RNA editing sites and DNA self-chains. These features were selected based on the hypothesized circRNA biogenesis mechanisms (Conn ; Ivanov ; Jeck ; Li ; Liang and Wilusz, 2014; Zhang , 2014, 2016) as well as sequence features that can potentially participate in biogenesis of non-coding RNAs including CpG islands (Lai and Shiekhattar, 2014) and enhancers regions (Chen ; Lam ). We have designed a contextual regression model (Liu and Wang, 2017) that successfully distinguish the circRNA back-spliced junction sites defined by transcriptome sequencing from randomly selected human genome loci in an averagely 72.6% accuracy. Using the feature extraction technique in the contextual regression model (Khalid ), we found that the examined 21 427 circRNAs can be categorized into 7 groups based on the biogenesis contributing factors. Our analysis supports that multiple biogenesis mechanisms co-exist for different subset of human circRNAs. In particular, we found 79 RBPs were identified to be significantly correlated to circRNA biogenesis. Interestingly, we uncovered a potential new link between circRNA biogenesis and flanking CpG islands, which suggests the potential correlation between DNA methylation and circRNA biogenesis.
2 Materials and methods
The data analysis process of this research is summarized in Supplementary Figure S1. First, 55 689 human circRNAs back-spliced junction sites were collected from the database CircNet (Liu ) as positive training data, and equal amount of randomly selected locus on HG19 human genome as negative training data. These junction sites and randomly selected locus were then divided into training and testing sets (in a ratio of 7:3) for the contextual regression network model designed to predict whether a randomly selected locus from human genome would generate circRNA. The features of the training set include whether CpG islands, enhancer regions, RBP binding sites, simple repeats, A-to-I RNA editing sites (RNA editing sites for short in the later text) and DNA self-chains present in the upstream and downstream region of the selected locus. After reaching optimum average accuracy, through the application of feature extraction techniques (Khalid ), we successfully found that the examined 21 427 circRNAs can be categorized into 7 groups based on the presumed biogenesis contributing factors.The back-spliced junction sites in the positive training set were selected from the database CircNet (Liu ). The data include reported human back-spliced junction sites were collected from 22 recent studies (Alhasan ; Bachmayr-Heyda ; Bahn ; Boeckel ; Cheng ; Conn ; Dang ; Gao ; Guo ; Jeck ; Kelly ; Memczak ; Rybak-Wolf ; Salzman , 2013; Song ; Zhang , 2014, 2016; Zheng ) and 465 human transcriptome sequencing datasets were collected from NCBI Sequence Read Archive (Leinonen ). The back-spliced junction sites in each RNA-seq sample were identified using a circRNA discovery pipeline referred as find_circ (Glažar ; Hansen ; Memczak ). The criteria defined in the pipeline hence the detected junction sites met same standards as those in the previous reports, as described in the Memczak study was applied. Adhering to the suggestion of recent year comparison study (Hansen ), we selected the back-spliced junction sites based on the number of previous peer review reports in which the back-spliced junction sites were reported and the number of samples among the 465 collected samples in which the back-spliced junction sites were found meeting the criteria defined in find_circ (Glažar ; Hansen ; Memczak ). Only the circRNAs with the sum of these two numbers > 3 were considered as positive training data in this study. Locus of the CpG islands, enhancer regions, simple repeats, RNA editing sites and DNA self-chains on HG19 human genome was collected from UCSC Genome Browser (Casper ). Although the locus of the RBP binding sites was collected from the Ray study.
3 Results
Using the feature extraction technique in the contextual regression model (Liu and Wang, 2017), we found that the examined 21 427 circRNAs can be categorized into seven groups based on the biogenesis contributing factors. Our analysis supports that multiple biogenesis mechanisms co-exist for different subset of human circRNAs. In particular, we found 79 RBPs were identified to be significantly correlated to circRNA biogenesis. Interestingly, we uncovered a potential new link between circRNA biogenesis and flanking CpG islands, which suggests the potential correlation between DNA methylation and circRNA biogenesis.
3.1 An interpretable neural network model to predict circRNAs
We implemented a contextual regression model to predict circRNA using these features. Instead of letting the neural network learn a function that maps features to target values, our method let the neural network learn a function that maps features to local linear models that best predict the target value from the features, thus generating a model that can both achieve state-of-the-art accuracy like a deep neural network while giving human interpretable quantification of feature contribution (Fig. 1A). As summarized in Supplementary Figure S1C, in this contextual regression model, we used a residual neural network (He ) that is composed of three layers of FNN (feed forward neural network, Fig. 1B) as the embedding function to generate the linear models. Batch normalization (Ioffe and Szegedy, 2015) is applied in the input layer to reduce the variance between input batches and make the training process more stable. The output of the embedding function is then dot-producted with the features and fed into a logistic function to output the prediction result. The FNN model is implemented as an operation that maps a vector x to s(Ax+b) where A is an matrix, b is an vector and s is the activation function. Both A and b are first initialized and then trained with tensorflow AdamOptimizer (Kingma and Ba, 2014). As for the parameter setting, all three layers of the FNN have 10 hidden units and tanh as their activation function, batch size is set to 50, max gradient norm to 10, learning rate to 0.0001. The weight matrix of each neural network is initialized with the tensorflow tf.truncated_normal function with standard deviation of 0.05 to prevent vanishing gradient problem. The bias term in each layer is initialized all to value 0. We used cross-entropy as the loss function during training. To make the feature contribution easily interpretable, we applied a Lasso penalty in the form of L1 regularization on the context weight with penalty coefficient 0.0001.
Fig. 1.
Illustration of contextual regression. (A) Graphic illustration of the mechanism of contextual regression: the features are inputted into a neural network that generates a contextual weight for each feature which represents the importance of the features. Then, the features are then weighted by the corresponding weights to makes an easier separation of samples. In classification or regression tasks, the weighted features are then summed to yield the prediction. (B) A graphic demonstration of the FNN parts in the contextual regression model
Illustration of contextual regression. (A) Graphic illustration of the mechanism of contextual regression: the features are inputted into a neural network that generates a contextual weight for each feature which represents the importance of the features. Then, the features are then weighted by the corresponding weights to makes an easier separation of samples. In classification or regression tasks, the weighted features are then summed to yield the prediction. (B) A graphic demonstration of the FNN parts in the contextual regression model
3.2 Prediction and the feature selection results
As summarized in Supplementary Table S1, the designed contextual regression model successfully distinguish circRNA back-spliced junction sites defined by transcriptome sequencing data from randomly selected human genome loci, in an average 72.6% accuracy and the area under curve of ROC curve 0.801 (Supplementary Fig. S2).To extract the informative features, we selected the run with the highest accuracy (Run no. 5) from 10 runs. The difference between the accuracy on the training and testing sets were negligible (72.5 versus 72.8%), which suggested no overfitting. Therefore, we pooled together the training and testing sets in the feature analysis. We selected the most confidently predicted 21 427 circRNAs that have confident scores > 0.7 to evaluate the contribution of each feature. The weighted feature contribution (WFC) was then obtained from the model output. The features for each circRNA genesis mechanism (except ‘flanking short ALU repeats’ (both_have_Alu) and ‘binding sites of single RBP’ (if_same_RBP) since they only have 1 value for the whole region) were pooled into short (within 1000 bp radius from the circRNA) and long range (1000–2000 bp from the circRNA) by summing the WFC in each category for better data visualization. This resulted in 14 total feature contributions for each circular RNA data point (CpG short range, CpG long range, enhancer short range, enhancer long range, RBP short range, RBP long range, repeats short range, repeats long range, RNA editing short range, RNA editing long range, self-chain short range, self-chain long range, if head and tail both have Alu, if head and tail have the same RBP) that could be treated as a vector with 14 elements. Then, each of these vectors was normalized to the unit length and PCA was applied to the whole data point collection. The top 10 principal components were extracted which explained 98.9% of the total variance. Then a K-mean clustering was applied to separate the processed data into subclasses. Multiple values of k were experimented and k = 7 was chosen for being the largest k that ensured all cosine similarities between cluster centers are < 0.5. The significantly contributing features were plotted in the heat map format and confirmed in the original feature data.As a result, we found that these circRNAs could be clustered into seven different categories according to their biogenesis factors (Fig. 2A). To validate that the features with high contribution scores are enriched in each cluster, we calculated the percent enrichment of each feature in each cluster (Fig. 2B). The enrichment plot supported our clustering result.
Fig. 2.
Prediction and feature collection result. The result of the feature collection is summarized in this figure. (A) Through the result we found that these circRNAs can be into seven different categories according to their biogenesis factors. The long range features are marked with ‘_l’ and the short range ones are marked with ‘_s’. (B) Percentage of members in each cluster that contains each of the features
Prediction and feature collection result. The result of the feature collection is summarized in this figure. (A) Through the result we found that these circRNAs can be into seven different categories according to their biogenesis factors. The long range features are marked with ‘_l’ and the short range ones are marked with ‘_s’. (B) Percentage of members in each cluster that contains each of the featuresIn Cluster 0, RNA editing sites occurrence within 1000 nucleotide upstream or downstream of the circRNA locus was considered to be the most important factor for the biogenesis of 4576 circRNAs. For the other 4585 circRNAs in Cluster 1, appearance of RNA editing site within range of 1000–2000 nucleotides upstream or downstream of circRNA was considered as the most important biogenesis factor. Similarly, in Cluster 2, existence of RBP binding sites within short, which means within 1000 nucleotides upstream or downstream of the circRNA locus, was considered as the most important biogenesis factor for the 7503 circRNAs. For the 2342 circRNAs in the Cluster 4, occurrence in the long range, which means within range of 1000–2000 nucleotides upstream or downstream of the circRNA locus, of RBP binding sites was suggested as the main biogenesis factor. Short repeat sequence flanking circRNA locus was clustered as the main factor for the 1344 circRNAs in Cluster 3. Biogenesis of 620 circRNAs was linked to flanking CpG islands, whereas 457 circRNAs’ biogenesis mechanism appeared to relate to the flanking DNA self-chains. The amount of these seven clusters is summarized in Supplementary Figure S3, while a complete list of the circRNAs is available in Supplementary Additional File S1. Through examining the input data that forms Clusters 2 and 4, which were associated with RBP, we identified 79 RBPs presumably participate in the biogenesis of these 9845 circRNAs. A complete list of the circRNA locus and associated RBP is available in Supplementary Additional File S2. These 79 RBPs (available in Supplementary Additional File S4) appear both in the short range (within 1000 nucleotides) in Cluster 2 and long range (from 1000 nucleotides to 2000 nucleotides) in Cluster 4 consistently. Among these RBPs, binding motifs of SRSF1, PUM1, SF3B4, ELAVL1, HNRNPA1, CSDA, SNRPA, RBFOX1 and PABPC1 were found appear in upstream or downstream of thousands of circRNAs in both clusters, as summarized in Supplementary Table S2.Hence based on the result of this study, circRNAs can be categorized into multiple different subclasses, each with specific different function and biogenesis mechanism. Result of this research support all of previous discoveries and solidify the idea that multiple biogenesis mechanisms co-exist for different subset of human circRNAs. Result of this study can also contribute to the advance of circRNA detection algorithm. Current mainstream research methods adopted to discover circRNAs are based on detection of back-spliced junction sites spanning reads within transcriptome sequencing data. However, this kind of approaches tends to have high false positive rate. Sensitivity of the junction site detection is also limited by sequencing depth. Had a clear concept of circRNA biogenesis mechanism is available, improved algorithm ruling out sequence base false positive might be able to be developed.Click here for additional data file.
Authors: Andranik Ivanov; Sebastian Memczak; Emanuel Wyler; Francesca Torti; Hagit T Porath; Marta R Orejuela; Michael Piechotta; Erez Y Levanon; Markus Landthaler; Christoph Dieterich; Nikolaus Rajewsky Journal: Cell Rep Date: 2014-12-31 Impact factor: 9.423
Authors: Xiaofeng Song; Naibo Zhang; Ping Han; Byoung-San Moon; Rose K Lai; Kai Wang; Wange Lu Journal: Nucleic Acids Res Date: 2016-02-11 Impact factor: 16.971