CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model.

Liguo Wang, Hyun Jung Park, Surendra Dasari, Shengqin Wang, Jean-Pierre Kocher, Wei Li.

Abstract

Thousands of novel transcripts have been identified using deep transcriptome sequencing. The discovery of this large, 'hidden' transcriptome has renewed the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding-Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT (sensitivity: 0.96, specificity: 0.97) outperformed state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylogenetic Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to its high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylogenetic Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.

Substances:
RNA, Untranslated

Year: 2013 PMID: 23335781 PMCID: PMC3616698 DOI: 10.1093/nar/gkt006

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Although the human genome sequence was released a decade ago, the roles of functional noncoding RNAs (ncRNAs) are much less well understood than those of their coding counterparts. Several previous studies have demonstrated that the human genome is pervasively transcribed (1–4), but thoroughly cataloging all RNA species (especially ncRNAs) is challenging. Undiscovered ncRNAs might be rare, transient or beyond the detection limits of conventional approaches. Furthermore, ncRNAs tend to be idiosyncratic to species and tissues (5,6). Nevertheless, advances in RNA-Seq have provided a new method of surveying the whole transcriptome to an unprecedented degree. Recent genome-wide studies revealed tens of thousands of novel transcripts, the majority of which were long noncoding RNAs (lncRNAs, >200 nt) (4–9). Although a few dozen lncRNAs have been characterized to some extent and are reported to have critical roles in diverse cellular and disease development processes (6,10–14), the biogenesis and function of most lncRNAs remain unclear.

Accurate and quantitative assessment of coding potential is the first step toward comprehensive annotation of newly discovered transcripts. Until now, prediction of coding potential has relied heavily on sequence alignment: either a pairwise homology search for protein evidence, as used in the Coding-Potential Calculator (CPC) and PORTRAIT methods (15,16), or multiple alignments to calculate a phylogenetic conservation score, as used in the Phylogenetic Codon Substitution Frequencies (PhyloCSF) and RNAcode methods (17,18). Alignment-based approaches are particularly useful for highly conserved protein-coding genes and, to a lesser extent, short genes encoding housekeeping or regulatory RNAs (e.g. snRNAs, snoRNAs and transfer RNAs). However, these approaches cannot be applied directly to all novel transcripts because of several intrinsic limitations.

First, most newly discovered transcripts are lncRNAs, which tend to be lineage specific and less conserved (5,6). This greatly limits the discriminatory power of alignment-based methods. For example, only 29 of 550 lncRNAs identified from zebrafish had detectable sequence similarity with putative mammalian orthologs (6), and only 993 of 8195 human lncRNAs have orthologous transcripts in other species (5). Second, a considerable fraction of lncRNAs overlap protein-coding genes on either the sense or antisense strand. These lncRNAs cannot be correctly classified by homology searching because they would have significant matches to protein-coding genes (3,8,19). Third, the reliability of alignment-based approaches largely depends on the quality of the alignments (20). This is problematic because most widely used multiple-sequence alignment tools use heuristics and do not guarantee optimal alignments. Finally, alignment-based methods are extremely time-consuming. For instance, CPC and PhyloCSF took 2 days to evaluate the coding potential of 14 353 lncRNAs identified by Cabili et al. (5). This problem is receiving more attention as massive-scale RNA sequencing becomes increasingly common. Consequently, a more accurate, robust and faster method that does not rely on sequence alignment is needed to distinguish ncRNAs, especially lncRNAs, from protein-coding genes.

Here, we present Coding-Potential Assessment Tool (CPAT), an alignment-free program, which uses logistic regression to distinguish between coding and noncoding transcripts on the basis of four sequence features. CPAT is highly accurate (accuracy: 0.967) and extremely efficient (10 000 times faster than CPC and PhyloCSF, and 50 times faster than PORTRAIT). CPAT needs only a sequence or coordinate file as input, and it is straightforward to use. We expanded the availability of CPAT to a larger scientific audience via a web interface, which allows users to submit sequences and receive the prediction results back almost instantaneously (http://lilab.research.bcm.edu/cpat/index.php).

MATERIALS AND METHODS

Coding-potential prediction is essentially a binary decision problem, which makes logistic regression a suitable approach. Because CPAT is alignment-free, all selected features (predictor variables) were calculated directly from the sequence. The first feature was the maximum length of the open reading frame (ORF). ORF length is one of the most fundamental features used to distinguish ncRNA from messenger RNA because a long putative ORF is unlikely to be observed by random chance in noncoding sequences. Despite its simplicity, ORF length has high concordance with more sophisticated discrimination methods and remains the primary criterion in almost all coding-potential prediction methods (21). The second feature was ORF coverage, defined as the ratio of ORF length to transcript length. This feature also has good classification power, and it is highly complementary to, and independent of, the ORF length (Supplementary Figures S1 and S3). Some large bona fide ncRNAs may contain putative long ORFs by random chance (5), and thus cannot be classified correctly by ORF length alone. Fortunately, those large ncRNAs usually have much lower ORF coverage than protein-coding RNAs (Figure 1B).
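The two ORF features can be illustrated with a minimal sketch. It is not CPAT's implementation: it assumes that an ORF starts at ATG, ends at the first in-frame stop codon, is searched on the given strand only, and that its length includes the stop codon. The function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (orf_length, orf_coverage) for the longest ATG...stop ORF.

    Assumptions (not taken from the paper): forward strand only, ATG start,
    an in-frame stop codon is required, and the stop codon counts toward length.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None                   # look for the next ORF
    coverage = best / len(seq) if seq else 0.0
    return best, coverage

if __name__ == "__main__":
    length, coverage = longest_orf("CCATGAAATTTGGGTAACC")
    print(length, round(coverage, 3))
```

A long transcript with a short chance ORF yields a low coverage value even when the ORF length itself is moderately large, which is the complementarity the text describes.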

Figure 1.

Score distribution between coding (red) and noncoding (blue) transcripts for the four linguistic features selected to build the logistic regression model; a training data set containing 10 000 coding and 10 000 noncoding transcripts was used. (A) ORF size. (B) ORF coverage. (C) Fickett score (TESTCODE statistic). (D) Hexamer usage bias measured by log-likelihood ratio.

The third feature we used was the Fickett TESTCODE score (termed ‘Fickett score’ hereafter), which is a simple linguistic feature that distinguishes protein-coding RNA from ncRNA according to the combinational effect of nucleotide composition and codon usage bias (22). Briefly, the Fickett score is obtained by computing four position values and four composition values (nucleotide content) from the DNA sequence. The position value reflects the degree to which each base is favored in one codon position versus another. For example, the position value of A ($A_{\mathrm{position}}$) is calculated as follows:

$$A_1 = \text{number of As in positions } 0, 3, 6, \ldots$$
$$A_2 = \text{number of As in positions } 1, 4, 7, \ldots$$
$$A_3 = \text{number of As in positions } 2, 5, 8, \ldots$$
$$A_{\mathrm{position}} = \frac{\max(A_1, A_2, A_3)}{\min(A_1, A_2, A_3) + 1}$$

The position values of C, G and T are determined in the same way. The percentage composition of each base is also determined. These eight values are then converted into probabilities ($p$) of coding using a lookup table provided in the original article. Each probability is multiplied by a weight ($w$) for the respective base, where the value of $w$ reflects the percentage of time each parameter alone successfully predicts coding or noncoding function for sequences of known function. Finally, the Fickett score is calculated as follows:

$$\text{Fickett score} = \sum_{i=1}^{8} p_i w_i$$

The Fickett score is independent of the ORF, and when the test region is ≥200 nt in length (which includes most lncRNAs), this feature alone can achieve 94% sensitivity and 97% specificity, with ‘no opinion’ on 18% of the sequences (22).

The fourth feature we used was hexamer usage bias (termed ‘hexamer score’ hereafter). This may be the most discriminating feature because of the dependence between adjacent amino acids in proteins (23). 
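As a rough illustration of the position and composition values that feed the Fickett score described above, the following sketch assumes the position value takes the form max/(min + 1) over the counts of a base at the three codon positions, as in Fickett's original formulation (22). The lookup table and weights that turn these eight values into the final score are not reproduced here, and the function name is hypothetical.

```python
def fickett_position_and_composition(seq):
    """Return (position_values, composition_values) keyed by base.

    Only the eight raw values are computed; converting them to probabilities
    and weighting them requires the lookup table from the original article.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    position, composition = {}, {}
    for base in "ACGT":
        # Counts of this base at codon positions 0, 1 and 2,
        # counted from the first nucleotide of the sequence.
        counts = [seq[k::3].count(base) for k in range(3)]
        position[base] = max(counts) / (min(counts) + 1)
        composition[base] = seq.count(base) / n if n else 0.0
    return position, composition

if __name__ == "__main__":
    pos, comp = fickett_position_and_composition("ATGATGATG")
    print(pos)
    print(comp)
```

A base that is strongly favored at one codon position gives a large position value, whereas an even spread across the three positions gives a value close to 1.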
The hexamer score can be computed in numerous ways; here, we used a log-likelihood ratio to measure differential hexamer usage between coding and noncoding sequences. For a given DNA sequence, we calculated the probability of the sequence under the model of coding DNA and under the model of noncoding DNA, and then we took the logarithm of the ratio of these probabilities as the score of coding potential. We used $F(h_i)$ ($i = 0, 1, \ldots, 4095$) and $F'(h_i)$ ($i = 0, 1, \ldots, 4095$) to represent the in-frame hexamer frequencies calculated from the coding and noncoding training data sets (described below), respectively. For a given hexamer sequence $S = H_1, H_2, \ldots, H_m$,

$$\text{Hexamer score} = \frac{1}{m} \sum_{i=1}^{m} \log \frac{F(H_i)}{F'(H_i)}$$

The hexamer score determines the relative degree of hexamer usage bias in a particular sequence. Positive values indicate a coding sequence, whereas negative values indicate a noncoding sequence.

We built a logistic regression model using these four linguistic features as predictor variables. A χ2 test was used to evaluate whether our logit model with predictors fits the training data significantly better than the null model, which had only an intercept.

We built a high-confidence training data set to measure the prediction performance of our logit model. This data set contained 10 000 protein-coding transcripts selected from the RefSeq database; all transcripts had high-quality protein sequences annotated by the Consensus Coding Sequence project. We also added 10 000 randomly selected noncoding transcripts from the GENCODE database. We evaluated the model with 10-fold cross-validation and measured its sensitivity, specificity, accuracy, precision and area under the curve (AUC) characteristics. The receiver operating characteristic (ROC) curve and precision–recall (PR) curve were generated using the ROCR package (24). We also built a nonparametric two-graph ROC curve for selecting the optimal CPAT score threshold that maximizes the sensitivity and specificity of CPAT while minimizing misclassifications. 
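The hexamer log-likelihood score described above can be sketched as follows. This is an illustration under stated assumptions rather than CPAT's code: hexamers are read in steps of three from the first nucleotide, the score is averaged over the hexamers, and hexamers missing from either frequency table are simply skipped. The function name and the toy frequency tables are hypothetical.

```python
import math

def hexamer_score(seq, coding_freq, noncoding_freq):
    """Mean log-likelihood ratio of in-frame hexamer usage.

    coding_freq / noncoding_freq map hexamer strings to frequencies estimated
    from coding and noncoding training sequences. Hexamers absent from either
    table are skipped (an assumption made for this sketch).
    """
    seq = seq.upper()
    ratios = []
    for i in range(0, len(seq) - 5, 3):        # in-frame: step by one codon
        hexamer = seq[i:i + 6]
        f_coding = coding_freq.get(hexamer, 0.0)
        f_noncoding = noncoding_freq.get(hexamer, 0.0)
        if f_coding > 0 and f_noncoding > 0:
            ratios.append(math.log(f_coding / f_noncoding))
    return sum(ratios) / len(ratios) if ratios else 0.0

if __name__ == "__main__":
    # Toy tables for illustration only; real tables cover all 4096 hexamers.
    coding = {"ATGGCC": 0.02, "GCCAAA": 0.01}
    noncoding = {"ATGGCC": 0.01, "GCCAAA": 0.01}
    print(hexamer_score("ATGGCCAAA", coding, noncoding))
```

A positive result means the hexamers in the sequence are, on average, more frequent in the coding table than in the noncoding table.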

We built an independent test data set to compare the performance of CPAT with that of CPC, PhyloCSF and PORTRAIT. This test set was composed of 4000 high-quality protein-coding genes (RefSeq annotated) and 4000 lncRNAs from a human lncRNA catalog (5). None of these 8000 genes was included in the training data set for CPAT. Assuming that all 4000 lncRNAs are truly noncoding sequences, we could compute the sensitivity, specificity, accuracy and precision of the algorithms to measure their performance. PhyloCSF could not determine the coding status of 528 (13.2%) noncoding genes. Those 528 genes were equally assigned to the true-negative and false-positive categories. The abbreviations in the equations below are as follows: FN, false negative; FP, false positive; TN, true negative; TP, true positive.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
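The four performance measures can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions and also shows the equal split of undetermined calls between true negatives and false positives described for PhyloCSF. The function names and the example counts are hypothetical and are not the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard sensitivity, specificity, accuracy and precision."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
    }

def split_undetermined(tn, fp, undetermined):
    """Assign undetermined negatives half to TN and half to FP."""
    half = undetermined / 2
    return tn + half, fp + half

if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(classification_metrics(tp=90, fp=20, tn=80, fn=10))
    # 528 undetermined noncoding genes, as in the text; tn and fp are made up.
    print(split_undetermined(tn=3000, fp=472, undetermined=528))
```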

RESULTS

All four selected features were concordantly higher in coding transcripts and lower in noncoding transcripts (Figure 1). We plotted three major features (ORF size, Fickett score and hexamer score) in a three-dimensional space to evaluate their combinatorial effect (Figure 2). Coding and noncoding transcripts in our training data set were grouped into two distinct clusters, indicating good concordance between features. The χ2 test P value was <0.001 (χ2 = 23 548.44; degrees of freedom = 4), indicating that the logit model as a whole fits significantly better than the null model. Ten-fold cross-validation showed that CPAT could achieve very high accuracy, with an AUC of 0.9927 (Figure 3A). We also provide the PR curve because the ROC curve can be misleading when the test data are largely skewed (Figure 3B). We used nonparametric two-graph ROC curves to determine an optimal CPAT score threshold that maximizes the discriminatory power (Figure 3C and D). According to Figure 3D, a score threshold of 0.364 gave the highest sensitivity and specificity (0.966 for both) for human data.
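One way to read a cutoff from a two-graph ROC plot is to take the threshold at which the sensitivity and specificity curves cross. The sketch below scans candidate cutoffs on a grid and returns the first one that minimizes the gap between the two measures. It assumes scores lie between 0 and 1 and that a score at or above the cutoff is called coding; the function names and toy scores are hypothetical.

```python
def sens_spec_at(threshold, coding_scores, noncoding_scores):
    """Sensitivity and specificity when score >= threshold is called coding."""
    sens = sum(s >= threshold for s in coding_scores) / len(coding_scores)
    spec = sum(s < threshold for s in noncoding_scores) / len(noncoding_scores)
    return sens, spec

def balanced_cutoff(coding_scores, noncoding_scores, steps=1000):
    """Return the first grid cutoff minimizing |sensitivity - specificity|."""
    best_t, best_gap = 0.0, float("inf")
    for k in range(steps + 1):
        t = k / steps
        sens, spec = sens_spec_at(t, coding_scores, noncoding_scores)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

if __name__ == "__main__":
    # Toy scores for illustration only.
    coding = [0.9, 0.8, 0.7, 0.2]
    noncoding = [0.1, 0.2, 0.3, 0.6]
    cutoff = balanced_cutoff(coding, noncoding)
    print(cutoff, sens_spec_at(cutoff, coding, noncoding))
```

With real cross-validation scores the same scan would be repeated per fold or applied to the pooled predictions; which of these the authors did is not stated here.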

Figure 2.

Three-dimensional plot shows combinatorial effects of Fickett score, hexamer score and ORF size on 10 000 coding genes (red dots) and 10 000 noncoding genes (blue dots).

Figure 3.

Performance evaluation using 10-fold cross-validation. Dashed curves represent the 10-fold cross-validation; solid curves represent the averaged curve from 10 validation runs. (A) ROC curve. (B) PR curve. PPV = positive predictive value, TPR = true positive rate. (C) Accuracy versus cutoff value. (D) Two-graph ROC curve is used to determine the optimum cutoff value.

We compared the performance of CPAT with that of CPC, PhyloCSF and PORTRAIT (protein-independent support vector machine model) using an independent test data set composed of 4000 coding genes and 4000 noncoding genes. A multiple alignment of 45 vertebrate genomes, including that of human, was downloaded from the UCSC (University of California, Santa Cruz) Genome Browser and was used as the input alignment for PhyloCSF. In general, CPAT (sensitivity: 0.96, specificity: 0.97) had greater classification power than all other programs (Figure 4; Supplementary Tables S1 and S2). Although CPC had the highest sensitivity (0.99), it suffered from poor specificity (0.74). One possible explanation is that a significant proportion of ncRNAs have a certain degree of sequence similarity to protein-coding genes. PhyloCSF had the lowest sensitivity (0.90) and the lowest specificity (0.63). Part of the reason for these outcomes is that nonconserved transcripts cannot be processed by PhyloCSF. If we consider those 528 nonconserved transcripts as noncoding, the specificity increases from 0.63 to 0.69, and the sensitivity remains unchanged. PORTRAIT had relatively balanced sensitivity (0.96) and specificity (0.87). CPAT achieved the highest overall accuracy (0.97) when compared with CPC (0.87), PhyloCSF (0.76) and PORTRAIT (0.92). CPAT’s excellent discriminatory power was further demonstrated by the greatest separation between the score distributions of coding and noncoding sequences (Figure 5). In contrast to CPC, PhyloCSF and PORTRAIT, choosing a smaller CPAT score threshold to increase the sensitivity does not sacrifice much specificity.
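The reported accuracies can be cross-checked against the reported sensitivities and specificities. Because the test set holds equal numbers of coding and noncoding genes (4000 each), accuracy equals the mean of sensitivity and specificity. The sketch below performs that check, and also shows how much specificity rises if the 264 undetermined PhyloCSF transcripts that had been counted as false positives are instead counted as true negatives. The function names are hypothetical; the numbers are those quoted in the text.

```python
# (sensitivity, specificity, reported accuracy) as quoted in the text.
REPORTED = {
    "CPAT": (0.96, 0.97, 0.97),
    "CPC": (0.99, 0.74, 0.87),
    "PhyloCSF": (0.90, 0.63, 0.76),
    "PORTRAIT": (0.96, 0.87, 0.92),
}

def accuracy_balanced(sensitivity, specificity):
    """Accuracy when positives and negatives are equally numerous."""
    return (sensitivity + specificity) / 2

def specificity_gain(reassigned, negatives=4000):
    """Rise in specificity when `reassigned` false positives become true negatives."""
    return reassigned / negatives

if __name__ == "__main__":
    for tool, (sens, spec, acc) in REPORTED.items():
        print(tool, round(accuracy_balanced(sens, spec), 3), acc)
    # Half of the 528 undetermined transcripts had been counted as false positives.
    print(round(0.63 + specificity_gain(528 // 2), 3))
```

The recomputed values agree with the reported ones to within rounding of the two-decimal inputs, and the specificity shift of about 0.07 is in line with the reported change from 0.63 to 0.69.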

Figure 4.

Performance comparison between CPAT, CPC, PhyloCSF and PORTRAIT using ROC curves.

Figure 5.

Cumulative curves of coding-potential assessment score for (A) CPAT, (B) PORTRAIT, (C) CPC and (D) PhyloCSF.

One could argue that PhyloCSF underperformed in this study because we used whole transcripts for testing rather than consecutive protein-coding exons and intergenic regions as used in its original article (17). To address this issue, we compiled another single-exon test data set consisting of 184 protein-coding and 278 noncoding transcripts. The test results with this data set indicated that CPAT (sensitivity: 0.962, specificity: 0.842) still outperformed PhyloCSF (sensitivity: 0.832, specificity: 0.588, Supplementary Figure S2). However, when tested on PhyloCSF’s original data set in Lin et al. (25), PhyloCSF (sensitivity 0.91, specificity 0.99) performed better than CPAT (sensitivity 0.50, specificity 0.98). This is reasonable because lncRNAs in our test data set are poorly conserved, whereas lncRNAs in the Lin et al. test data set are highly conserved because they are taken from multiple-sequence alignments of three closely related Drosophila species. Hence, we argue that PhyloCSF works better if the transcripts are highly conserved, which is rare among lncRNAs (5,6). This also highlights the Achilles’ heel of alignment-based methods for detecting lncRNAs. In contrast, the dramatic decrease in CPAT’s sensitivity is due to the lack of ORF information in the Lin et al. test data set, which is largely composed of individual exons rather than complete, full-length transcripts. This, however, will not limit the application scope of CPAT because most full-length transcripts can be constructed at the current sequencing depth (8).

We measured the computational speed of CPAT, CPC and PhyloCSF on a sample of 200 sequences randomly selected from the test data set. CPAT took 0.67 s to process the data, and it was four orders of magnitude faster than both CPC [11 945 s (3.3 h)] and PhyloCSF [11 737 s (3.3 h)]. Furthermore, the computational time for PhyloCSF did not include the time spent preparing multiple-alignment files for analysis. PORTRAIT was significantly faster than CPC and PhyloCSF, and therefore all 8000 test genes were used to evaluate its speed: CPAT took 23.83 s to process the test set, and it was 48 times faster than PORTRAIT [1146.30 s (19 min)].
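The speed-up factors follow directly from the quoted run times. A short check of the arithmetic is sketched below; the function names are hypothetical, and the timings are those given in the text.

```python
import math

def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast tool is than the slow one."""
    return slow_seconds / fast_seconds

def orders_of_magnitude(ratio):
    """Whole powers of ten contained in a ratio."""
    return math.floor(math.log10(ratio))

if __name__ == "__main__":
    cpat_200 = 0.67                      # CPAT on the 200-sequence sample
    for tool, seconds in {"CPC": 11945.0, "PhyloCSF": 11737.0}.items():
        ratio = speedup(seconds, cpat_200)
        print(tool, round(seconds / 3600, 1), "h",
              round(ratio), orders_of_magnitude(ratio))
    # PORTRAIT versus CPAT on all 8000 test genes.
    print("PORTRAIT", round(speedup(1146.30, 23.83)))
```

Both alignment-based tools come out more than 10 000 times slower than CPAT on the 200-sequence sample, and the PORTRAIT ratio rounds to 48, matching the figures in the text.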

DISCUSSION

A number of linguistic features characterizing coding RNA sequences have been developed over the past 30 years. These include maximum ORF size, dinucleotide usage, codon usage bias, hexamer usage bias, nucleotide composition bias between codon positions and imperfect periodicity in base occurrences (23,26). Among these features, we selected ORF features (size and coverage) because of their discriminatory power and ease of calculation (21). The in-frame hexamer score was selected because it has the highest prediction accuracy (average of sensitivity and specificity) as evaluated by Fickett and Tung in 1992 (23). The Fickett score was selected because it simultaneously captures compositional bias and position asymmetry, which are orthogonal to the ORF features. Supplementary Figure S3 shows the performance of these individual features as well as the combined feature set. The combined feature set has very high sensitivity and specificity (>0.966), leaving very little room for further improvement.

Annotation of genomes has always been a challenging task for biologists, and these efforts have been accelerated by deep transcriptome sequencing. Distinguishing between protein-coding and noncoding sequences is the first and arguably the most crucial step in genome annotation. Most novel transcripts are poorly conserved, species-specific ncRNAs. Detecting the coding potential of these transcripts via alignment-based software is intractable. We developed CPAT, a highly accurate alignment-free method, which uses a logistic regression model to discriminate between coding and noncoding transcripts using pure linguistic features. Compared with other tools, CPAT is more robust, markedly faster and more convenient to use. Taken together, CPAT is able to accurately assess the coding potential of tens of thousands of transcripts in real time, and will be a valuable tool for the rapidly growing RNA-seq community.

AVAILABILITY AND IMPLEMENTATION

Source code was implemented in C and Python and is freely available at: http://code.google.com/p/cpat/. The web server was implemented in PHP, MySQL and Apache, with support for all major browsers: http://lilab.research.bcm.edu/cpat/index.php.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1 and 2 and Supplementary Figures 1–3.

FUNDING

Department of Defense Prostate Cancer Program [PC094421 to W.L.]; the Cancer Prevention and Research Institute of Texas [RP110471-C3 to W.L.]; the Center for Individualized Medicine (CIM) at Mayo Clinic (to J.P.K.). Funding for open access charge: Cancer Prevention and Research Institute of Texas [RP110471-C3 to W.L.].

Conflict of interest statement. None declared.

26 in total

1. Discrimination of non-protein-coding transcripts from protein-coding mRNA.

Authors: Martin C Frith; Timothy L Bailey; Takeya Kasukawa; Flavio Mignone; Sarah K Kummerfeld; Martin Madera; Sirisha Sunkara; Masaaki Furuno; Carol J Bult; John Quackenbush; Chikatoshi Kai; Jun Kawai; Piero Carninci; Yoshihide Hayashizaki; Graziano Pesole; John S Mattick
Journal: RNA Biol Date: 2006-04-03 Impact factor: 4.652

2. ROCR: visualizing classifier performance in R.

Authors: Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal: Bioinformatics Date: 2005-08-11 Impact factor: 6.937

3. RNA maps reveal new RNA classes and a possible function for pervasive transcription.

Authors: Philipp Kapranov; Jill Cheng; Sujit Dike; David A Nix; Radharani Duttagupta; Aarron T Willingham; Peter F Stadler; Jana Hertel; Jörg Hackermüller; Ivo L Hofacker; Ian Bell; Evelyn Cheung; Jorg Drenkow; Erica Dumais; Sandeep Patel; Gregg Helt; Madhavan Ganesh; Srinka Ghosh; Antonio Piccolboni; Victor Sementchenko; Hari Tammana; Thomas R Gingeras
Journal: Science Date: 2007-05-17 Impact factor: 47.728

4. Recognition of protein coding regions in DNA sequences.

Authors: J W Fickett
Journal: Nucleic Acids Res Date: 1982-09-11 Impact factor: 16.971

5. Targeted RNA sequencing reveals the deep complexity of the human transcriptome.

Authors: Tim R Mercer; Daniel J Gerhardt; Marcel E Dinger; Joanna Crawford; Cole Trapnell; Jeffrey A Jeddeloh; John S Mattick; John L Rinn
Journal: Nat Biotechnol Date: 2011-11-13 Impact factor: 54.908

6. Global identification of human transcribed sequences with genome tiling arrays.

Authors: Paul Bertone; Viktor Stolc; Thomas E Royce; Joel S Rozowsky; Alexander E Urban; Xiaowei Zhu; John L Rinn; Waraporn Tongprasit; Manoj Samanta; Sherman Weissman; Mark Gerstein; Michael Snyder
Journal: Science Date: 2004-11-11 Impact factor: 47.728

7. The transcriptional landscape of the mammalian genome.

Authors: P Carninci; T Kasukawa; S Katayama; J Gough; M C Frith; N Maeda; R Oyama; T Ravasi; B Lenhard; C Wells; R Kodzius; K Shimokawa; V B Bajic; S E Brenner; S Batalov; A R R Forrest; M Zavolan; M J Davis; L G Wilming; V Aidinis; J E Allen; A Ambesi-Impiombato; R Apweiler; R N Aturaliya; T L Bailey; M Bansal; L Baxter; K W Beisel; T Bersano; H Bono; A M Chalk; K P Chiu; V Choudhary; A Christoffels; D R Clutterbuck; M L Crowe; E Dalla; B P Dalrymple; B de Bono; G Della Gatta; D di Bernardo; T Down; P Engstrom; M Fagiolini; G Faulkner; C F Fletcher; T Fukushima; M Furuno; S Futaki; M Gariboldi; P Georgii-Hemming; T R Gingeras; T Gojobori; R E Green; S Gustincich; M Harbers; Y Hayashi; T K Hensch; N Hirokawa; D Hill; L Huminiecki; M Iacono; K Ikeo; A Iwama; T Ishikawa; M Jakt; A Kanapin; M Katoh; Y Kawasawa; J Kelso; H Kitamura; H Kitano; G Kollias; S P T Krishnan; A Kruger; S K Kummerfeld; I V Kurochkin; L F Lareau; D Lazarevic; L Lipovich; J Liu; S Liuni; S McWilliam; M Madan Babu; M Madera; L Marchionni; H Matsuda; S Matsuzawa; H Miki; F Mignone; S Miyake; K Morris; S Mottagui-Tabar; N Mulder; N Nakano; H Nakauchi; P Ng; R Nilsson; S Nishiguchi; S Nishikawa; F Nori; O Ohara; Y Okazaki; V Orlando; K C Pang; W J Pavan; G Pavesi; G Pesole; N Petrovsky; S Piazza; J Reed; J F Reid; B Z Ring; M Ringwald; B Rost; Y Ruan; S L Salzberg; A Sandelin; C Schneider; C Schönbach; K Sekiguchi; C A M Semple; S Seno; L Sessa; Y Sheng; Y Shibata; H Shimada; K Shimada; D Silva; B Sinclair; S Sperling; E Stupka; K Sugiura; R Sultana; Y Takenaka; K Taki; K Tammoja; S L Tan; S Tang; M S Taylor; J Tegner; S A Teichmann; H R Ueda; E van Nimwegen; R Verardo; C L Wei; K Yagi; H Yamanishi; E Zabarovsky; S Zhu; A Zimmer; W Hide; C Bult; S M Grimmond; R D Teasdale; E T Liu; V Brusic; J Quackenbush; C Wahlestedt; J S Mattick; D A Hume; C Kai; D Sasaki; Y Tomaru; S Fukuda; M Kanamori-Katayama; M Suzuki; J Aoki; T Arakawa; J Iida; K Imamura; M Itoh; T Kato; H Kawaji; N Kawagashira; T Kawashima; 
M Kojima; S Kondo; H Konno; K Nakano; N Ninomiya; T Nishio; M Okada; C Plessy; K Shibata; T Shiraki; S Suzuki; M Tagami; K Waki; A Watahiki; Y Okamura-Oho; H Suzuki; J Kawai; Y Hayashizaki
Journal: Science Date: 2005-09-02 Impact factor: 47.728

Review 8. Differentiating protein-coding and noncoding RNA: challenges and ambiguities.

Authors: Marcel E Dinger; Ken C Pang; Tim R Mercer; John S Mattick
Journal: PLoS Comput Biol Date: 2008-11-28 Impact factor: 4.475

9. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes.

Authors: Michael F Lin; Ameya N Deoras; Matthew D Rasmussen; Manolis Kellis
Journal: PLoS Comput Biol Date: 2008-04-18 Impact factor: 4.475

10. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine.

Authors: Lei Kong; Yong Zhang; Zhi-Qiang Ye; Xiao-Qiao Liu; Shu-Qi Zhao; Liping Wei; Ge Gao
Journal: Nucleic Acids Res Date: 2007-07 Impact factor: 16.971
