NONCODE 2016: an informative and valuable data source of long non-coding RNAs.

Yi Zhao¹, Hui Li², Shuangsang Fang², Yue Kang³, Wei Wu³, Yajing Hao³, Ziyang Li⁴, Dechao Bu⁴, Ninghui Sun⁴, Michael Q Zhang⁵, Runsheng Chen⁶.

Abstract

NONCODE (http://www.bioinfo.org/noncode/) is an interactive database that aims to present the most complete collection and annotation of non-coding RNAs, especially long non-coding RNAs (lncRNAs). The recently reduced cost of RNA sequencing has produced an explosion of newly identified data. Revolutionary third-generation sequencing methods have also contributed to more accurate annotations, and accumulating experimental data provide more comprehensive knowledge of lncRNA functions. In this update, NONCODE has added six new species, bringing the total to 16 species. The number of lncRNAs in NONCODE has increased from 210,831 to 527,336. For human and mouse, the lncRNA numbers are 167,150 and 130,558, respectively. NONCODE 2016 has also introduced three important new features: (i) conservation annotation; (ii) the relationships between lncRNAs and diseases; and (iii) an interface to choose high-quality datasets through predicted scores, literature support and long-read sequencing method support. NONCODE is also accessible through http://www.noncode.org/.

Substances：
RNA, Long Noncoding

Year: 2015 PMID: 26586799 PMCID: PMC4702886 DOI: 10.1093/nar/gkv1252

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Recent whole-transcriptome studies have revealed that about three quarters of the human genome is capable of being transcribed, while protein-coding regions account for just 2% of the genome (1–3). The vast majority of transcribed sequences therefore do not encode proteins, and are called non-coding RNAs (ncRNAs). Accumulating evidence shows that ncRNAs play key roles in various biological processes, such as imprinting control, the circuitry controlling pluripotency and differentiation, immune responses, and chromosome dynamics (4). ncRNAs are as important as protein-coding genes to cellular functions (5,6). Notably, a growing number of long ncRNAs (lncRNAs), which are considered to be >200 nt in length and are often multiexonic (7), have been implicated in disease etiology (8–10). It is therefore of great importance to collect lncRNA information and store it in a one-stop knowledge gateway for lncRNAs, the NONCODE database.
The development of high-throughput sequencing methodologies has reduced the cost of RNA sequencing, and as a result there has been an explosive rise in the number of newly identified lncRNAs. For example, in 2015, Chinnaiyan et al. established a consensus set of 384,066 predicted transcripts from 7256 RNA-seq libraries, which was designated the MiTranscriptome assembly (11). Since then, advances in sequencing methods, such as single-molecule long-read techniques, have brought us closer to the real lncRNA transcriptome. Given sufficient material, amplification-free sequencing of full-length cDNA molecules provides a more direct view of RNA molecules (12). NONCODE has collected data from literature published since the last update and includes the latest versions of several public databases (Ensembl (13), RefSeq (14), lncRNAdb (15) and GENCODE (16)). After the removal of false and redundant lncRNAs, NONCODE contains a total of 527,336 transcripts.
In addition to the identification of new lncRNAs, data on the genetics and biochemical properties of lncRNAs have accumulated rapidly. Of the papers retrieved from PubMed for lncRNAs, we found that the vast majority studied lncRNA function, especially the relationship between lncRNAs and disease (8–10). In large-scale searches for single-base differences between diseased and healthy individuals, about 40% of the disease-related differences show up in genomic regions outside of protein-coding genes. This implicates non-coding regions as vital for genetic risk factors of disease (2). To enable a systematic compilation and integration of this information, we added the relationships between lncRNAs and diseases to the annotations of NONCODE. These annotations were derived from literature mining, differential lncRNA analysis of public RNA-seq and microarray data, and mutation analysis of public genome-wide association study (GWAS) data. Along with the ever-increasing number of lncRNAs and the growing amount of functional study data, genome-wide conservation information is required for biologists to study the mechanisms of lncRNA action. To explore the conservation of lncRNAs, NONCODE collected data for six new mammalian species (chimpanzee, gorilla, orangutan, rhesus macaque, opossum and platypus) (17). Conservation annotation is available on the information page of each NONCODE lncRNA gene. Users can browse the conserved counterparts of any human lncRNA gene in other species through a phylogenetic tree layout. This conservation information should make it considerably more convenient to study lncRNA functions.

DATA COLLECTION AND PROCESSING

Similar to the former iterations of NONCODE (18–21), the sources of NONCODE 2016 include the previous versions of NONCODE, the collated literature and other public databases. We searched PubMed using the key words ‘ncrna’, ‘noncoding’, ‘non-coding’, ‘no code’, ‘non-code’, ‘lncrna’ and ‘lincrna’, and found 6532 new articles published since 1 June 2013 (the last collection date for NONCODE). We retrieved the newly identified lncRNAs and their annotations from the supplementary material or web sites of these articles. Together with the newest data from Ensembl, RefSeq, lncRNAdb and GENCODE and the old versions of NONCODE data, the literature data were processed through a standard pipeline for each species. The pipeline included the following six steps:
(i) Format normalization. All of the input data were processed into BED or GTF format based on one assembly version, for example, hg38 for human and mm10 for mouse.
(ii) Combination. All of the normalized data files were combined using the Cuffcompare program in the Cufflinks suite (22). After eliminating redundancy, every new transcript ID and the accompanying resources were extracted.
(iii) Filtering protein-coding RNA. We filtered out protein-coding RNA using two methods. Firstly, the RNA was compared with the coding RNA in RefSeq and Ensembl, and the ‘=’ and ‘c’ transcripts were excluded. Secondly, the RNA was filtered through the Coding-Non-Coding Index (CNCI) (23) program and only the RNAs considered non-coding by CNCI were kept.
(iv) Information retrieval. We assigned each transcript a name according to the criterion of NONCODE v4 and extracted basic information such as location (24), exons, length, assembly sequence, source, etc.
(v) Advanced annotation. Advanced annotations included expression profiles, predicted functions, conservation, disease information, etc. Human expression profiles were collected from 16 tissues of the Human BodyMap 2.0 data (ENA archive: ERP000546) and eight cell lines (GEO accession no. GSE30554), while mouse data were collected from six different tissues (ENA archive: ERP000591). Functions for the lncRNA genes were predicted by lnc-GFP (25), a global function predictor based on a coding–non-coding co-expression network (26,27).
(vi) Web presence. The new NONCODE provides completely new web pages. More annotation information has been added and a more user-friendly interface has been introduced.
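The two-stage coding filter in step (iii) can be illustrated with a minimal Python sketch. It assumes that each candidate transcript has already been given a Cuffcompare class code (from the comparison against RefSeq and Ensembl coding RNAs) and a CNCI call; the `Candidate` record, its field names and the `"coding"`/`"noncoding"` labels are hypothetical and do not reflect the actual output formats of either tool.

```python
from dataclasses import dataclass

# Class codes named in the text as indicating a match to a known coding RNA:
# '=' (complete intron-chain match) and 'c' (contained in a reference transcript).
CODING_CLASS_CODES = {"=", "c"}


@dataclass(frozen=True)
class Candidate:
    transcript_id: str  # hypothetical identifier
    class_code: str     # Cuffcompare class code versus coding references
    cnci_label: str     # hypothetical normalized CNCI call


def keep_noncoding(candidates):
    """Drop '=' and 'c' matches, then keep only CNCI non-coding calls."""
    return [
        c for c in candidates
        if c.class_code not in CODING_CLASS_CODES and c.cnci_label == "noncoding"
    ]


candidates = [
    Candidate("tx1", "=", "coding"),     # removed by the class-code filter
    Candidate("tx2", "u", "noncoding"),  # passes both filters
    Candidate("tx3", "c", "noncoding"),  # removed by the class-code filter
    Candidate("tx4", "u", "coding"),     # removed by the CNCI filter
]
kept = [c.transcript_id for c in keep_noncoding(candidates)]
print(kept)
```

Requiring both conditions mirrors the text: a transcript is retained only if it neither matches a known coding RNA nor is predicted to be coding.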

STATISTICS OF NONCODE

NONCODE contains 527,336 lncRNA transcripts from 16 species (human, mouse, cow, rat, chimpanzee, gorilla, orangutan, rhesus macaque, opossum, platypus, chicken, zebrafish, fruitfly, Caenorhabditis elegans, yeast and Arabidopsis). According to the definition of lncRNA genes (18), NONCODE contains 337,880 genes altogether. A total of 101,700 and 86,935 genes were generated from the 167,150 human and 130,558 mouse lncRNAs, respectively (Table 1). Following the nomenclature of NONCODE v4 (18), both lncRNA transcripts and genes are designated systematically: NON + three characters (representing a species) + T (transcript) or G (gene) + six sequential digits. NONCODE has annotated expression profiles for all human and mouse transcripts and genes, and a large number of these genes are annotated with predicted functions.
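As an illustration of the naming scheme, the following Python sketch splits an identifier such as NONHSAG200087 (the gene shown in Figure 2) into its parts. It assumes that the three species characters are uppercase letters and that the identifier carries no additional suffix; the function name `parse_noncode_id` is hypothetical and not part of any NONCODE tool.

```python
import re

# NON + three species characters + T (transcript) or G (gene) + six digits.
# Assumption: species characters are uppercase letters, with no suffix.
NONCODE_ID = re.compile(r"^NON([A-Z]{3})([TG])(\d{6})$")


def parse_noncode_id(identifier):
    """Return (species_code, kind, number), or None if the pattern does not match."""
    match = NONCODE_ID.match(identifier)
    if match is None:
        return None
    species_code, kind_letter, number = match.groups()
    kind = "transcript" if kind_letter == "T" else "gene"
    return species_code, kind, int(number)


parsed = parse_noncode_id("NONHSAG200087")
print(parsed)
```

Identifiers that deviate from the pattern described in the text return `None` rather than raising, so the function can be used to screen mixed ID lists.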

Table 1.

Transcript and gene statistics for NONCODE

Species	Scientific name	Number of lncRNA transcripts	Number of lncRNA genes
human	Homo sapiens	167,150	101,700
mouse	Mus musculus	130,558	86,935
cow	Bos taurus	23,599	18,189
rat	Rattus norvegicus	29,070	25,114
chimpanzee	Pan troglodytes	18,604	13,224
gorilla	Gorilla gorilla	20,785	17,140
orangutan	Pongo pygmaeus	15,601	13,432
rhesus macaque	Macaca mulatta	9325	6125
opossum	Monodelphis domestica	21,014	14,135
platypus	Ornithorhynchus anatinus	11,518	9394
chicken	Gallus gallus	13,085	9688
zebrafish	Danio rerio	5000	3635
fruitfly	Drosophila melanogaster	54,818	13,890
C. elegans	Caenorhabditis elegans	3269	2746
yeast	Saccharomyces cerevisiae	60	56
Arabidopsis	Arabidopsis thaliana	3853	2477
Total		527,336	337,880

LNCRNAS AND DISEASES

Strong evidence shows that transcription of the non-coding genome produces functional RNAs (1). In particular, lncRNAs have been implicated in biological, developmental and pathological processes, and act through mechanisms such as chromatin reprogramming, cis regulation at enhancers, and post-transcriptional regulation of mRNA processing (28). lncRNAs are therefore considered to be important regulators of tissue physiology and disease processes, including cancer (11). Although we have collected functional interactions between ncRNAs and biomolecules in NPInter (29–31), we think it is also necessary to include disease information in NONCODE. The data retrieval pipeline is shown in Figure 1. Recently published papers were explored, and the proven associations between NONCODE transcripts and diseases have been integrated into the latest version. We also assessed the overlap of NONCODE transcripts with the unique disease-associated single-nucleotide polymorphisms (SNPs) from a catalog of GWASs (32) and the SNP database (dbSNP) (33). Many further relationships between lncRNAs and diseases were derived from the analysis of RNA-seq and microarray data. After collecting the basic data, we compared them with the lncRNAs in NONCODE and retained the data that overlapped with NONCODE lncRNAs. NONCODE 2016 contains 1110 lncRNAs that are related to 284 diseases. Among these associations, 153, 440, 101 and 429 lncRNAs were collected from ‘literature’, ‘RNA-seq’, ‘microarray’ and ‘GWAS’, respectively.

Figure 1.

Disease related data acquisition pipeline

In the lncRNA gene description pages, users can retrieve the related diseases of the entry, and also get the source of the information, such as the PMID(s) of the reference paper(s). There is also mutational information retrieved from the literature, GWASs and the dbSNP database.

LNCRNA CONSERVATION

Several reports have suggested that, compared with protein-coding genes and small RNAs (e.g. miRNAs and snoRNAs), lncRNAs are only modestly conserved (11). Most lncRNAs are less conserved in sequence (34), but there are still many lncRNAs that are conserved in their genomic loci, exonic sequences and promoter regions (35). These are preserved across multiple species, attesting to their important functional potential (36). Benefiting from next-generation sequencing technologies, ncRNAs are now more easily identified via transcript sequencing. NONCODE has added six new species, mainly from multi-species RNA-seq data (37,38). An evolutionary tree of 12 commonly studied species (human, mouse, cow, rat, chicken, zebrafish, chimpanzee, gorilla, orangutan, rhesus macaque, opossum and platypus) was constructed using methods introduced in phyloNONCODE (39). The counterpart of each human lncRNA gene in the other listed species can be retrieved by browsing the evolutionary tree (shown in Figure 2). The counterpart of each lncRNA was computed using the UCSC LiftOver tool (40). In brief, LiftOver uses BLASTZ (41), an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences, as its core algorithm to detect homologous regions in other genomes. After mapping to the second species, the counterpart region was intersected with the transcripts of the second species. Users can browse the transcript information by clicking the ‘T’ following the counterpart region (shown in Figure 2).

Figure 2.

Conservation annotation for NONHSAG200087.
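The final intersection step described above can be sketched in Python as follows. The sketch assumes that LiftOver has already produced a counterpart region in the second species and that coordinates are 0-based and half-open, as in BED; the function names, tuple layout and example identifiers are hypothetical, and strand is ignored for brevity.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def counterpart_transcripts(region, transcripts):
    """Return IDs of second-species transcripts that overlap the lifted region.

    region: (chrom, start, end) of the lifted-over counterpart region.
    transcripts: iterable of (transcript_id, chrom, start, end).
    """
    chrom, start, end = region
    return [
        tid for tid, t_chrom, t_start, t_end in transcripts
        if t_chrom == chrom and overlap_length(start, end, t_start, t_end) > 0
    ]


# Hypothetical lifted region and second-species transcripts.
lifted_region = ("chr5", 1000, 2000)
second_species = [
    ("txA", "chr5", 500, 1200),   # overlaps the region by 200 bases
    ("txB", "chr5", 2000, 2500),  # adjacent only, no shared base
    ("txC", "chr7", 1000, 2000),  # same coordinates, different chromosome
]
hits = counterpart_transcripts(lifted_region, second_species)
print(hits)
```

A production pipeline would more likely rely on an interval tree or an existing tool for this step; the linear scan here only makes the overlap criterion explicit.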

QUALITY OF LNCRNAS

Short-read sequencing technology allows high-throughput identification of lncRNAs. However, accurate in silico transcript reconstruction remains an ongoing challenge. According to an assessment by the Paul Bertone group, <40% of known transcripts were well assembled from Homo sapiens RNA-seq data. The complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data (42). Furthermore, multiple amplification steps during library preparation complicate the quantification of expression levels. To some extent, third-generation sequencing technologies reduce this noise and provide a more comprehensive assessment of the true complexity of the transcriptome. Given sufficient material, amplification-free and fragmentation-free sequencing of full-length cDNA molecules provides a more direct view of RNA molecules (12). We contacted the authors of the third-generation single-molecule long-read survey of the human transcriptome (12). After our analysis, the single-molecule lncRNA transcripts were included in NONCODE. To meet the quality demands of researchers, NONCODE provides a subset searching interface with which users can choose a subset that they consider to be of high quality. The quality controls include the source of the data, literature support, support from other databases and long-read sequencing method support. The controls also include the number of exons, the length of the transcripts and support from prediction tools. The web interface returns the subset matching the conditions chosen by the user and allows the data to be downloaded.
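The kind of subset selection offered by the interface can be approximated offline with a short Python sketch. It covers only a few of the listed controls (exon number, transcript length, literature support and long-read support); the `Transcript` record, its field names, the parameter names and the example identifiers are all hypothetical and do not describe the NONCODE web interface or its download format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str        # hypothetical identifier
    exon_count: int
    length: int               # transcript length in nt
    literature_support: bool
    long_read_support: bool


def select_subset(transcripts, min_exons=1, min_length=0,
                  require_literature=False, require_long_read=False):
    """Keep transcripts that satisfy every requested quality control."""
    selected = []
    for t in transcripts:
        if t.exon_count < min_exons or t.length < min_length:
            continue
        if require_literature and not t.literature_support:
            continue
        if require_long_read and not t.long_read_support:
            continue
        selected.append(t)
    return selected


records = [
    Transcript("tx1", exon_count=1, length=250, literature_support=False, long_read_support=False),
    Transcript("tx2", exon_count=3, length=1800, literature_support=True, long_read_support=True),
    Transcript("tx3", exon_count=2, length=900, literature_support=False, long_read_support=True),
]
multi_exon_long_read = [
    t.transcript_id
    for t in select_subset(records, min_exons=2, require_long_read=True)
]
print(multi_exon_long_read)
```

Each control is optional and the conditions combine with logical AND, so adding a further control narrows the returned subset.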

DISCUSSION

NONCODE 2016 contains 527,336 lncRNAs from 16 different species, which compares favorably with other lncRNA databases. For example, LNCipedia (human only) contains 111,685 transcripts (43), lncRNAtor (human, mouse, fly, zebrafish, worm and yeast) contains 34,605 transcripts (44), and lncRNAWiki (human only) contains 105,255 transcripts (45). As mentioned above, technical limitations imposed by short-read sequencing lead to a number of computational challenges in transcript reconstruction and quantification. For many transcripts, automated methods failed to identify all of the constituent exons, and in cases in which all exons were reported, the protocols tested often failed to assemble the exons into complete isoforms (42). For this reason, NONCODE filtered out some datasets. For example, although we obtained all the data from MiTranscriptome (11), which contains 384,066 predicted human transcripts from 7256 RNA-seq libraries, MiTranscriptome detected precise RefSeq splicing patterns in only 31% of cases, and annotated genes made up only 46% of the entire MiTranscriptome. Although it is reasonable to assume that unannotated transcription is unique to specific lineages, the low RefSeq detection rate was unusual. We therefore decided not to include MiTranscriptome data in the current version of NONCODE. In the future, we will attempt to clarify the underlying reason(s). Perhaps a more comprehensive construction tool is required to answer this question.

45 in total

Review 1. Molecular mechanisms of long noncoding RNAs.

Authors: Kevin C Wang; Howard Y Chang
Journal: Mol Cell Date: 2011-09-16 Impact factor: 17.970

2. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.

Authors: Moran N Cabili; Cole Trapnell; Loyal Goff; Magdalena Koziol; Barbara Tazon-Vega; Aviv Regev; John L Rinn
Journal: Genes Dev Date: 2011-09-02 Impact factor: 11.361

Review 3. lincRNAs: genomics, evolution, and mechanisms.

Authors: Igor Ulitsky; David P Bartel
Journal: Cell Date: 2013-07-03 Impact factor: 41.582

Review 4. Expression and function of a large non-coding RNA gene XIST in human cancer.

Authors: Sarah M Weakley; Hao Wang; Qizhi Yao; Changyi Chen
Journal: World J Surg Date: 2011-08 Impact factor: 3.352

5. Human-mouse alignments with BLASTZ.

Authors: Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

6. Large-scale prediction of long non-coding RNA functions in a coding-non-coding gene co-expression network.

Authors: Qi Liao; Changning Liu; Xiongying Yuan; Shuli Kang; Ruoyu Miao; Hui Xiao; Guoguang Zhao; Haitao Luo; Dechao Bu; Haitao Zhao; Geir Skogerbø; Zhongdao Wu; Yi Zhao
Journal: Nucleic Acids Res Date: 2011-01-18 Impact factor: 16.971

7. ncFANs: a web server for functional annotation of long non-coding RNAs.

Authors: Qi Liao; Hui Xiao; Dechao Bu; Chaoyong Xie; Ruoyu Miao; Haitao Luo; Guoguang Zhao; Kuntao Yu; Haitao Zhao; Geir Skogerbø; Runsheng Chen; Zhongdao Wu; Changning Liu; Yi Zhao
Journal: Nucleic Acids Res Date: 2011-07 Impact factor: 16.971

8. Clustered microRNAs' coordination in regulating protein-protein interaction network.

Authors: Xiongying Yuan; Changning Liu; Pengcheng Yang; Shunmin He; Qi Liao; Shuli Kang; Yi Zhao
Journal: BMC Syst Biol Date: 2009-06-26

9. Long non-coding RNAs function annotation: a global prediction method based on bi-colored networks.

Authors: Xingli Guo; Lin Gao; Qi Liao; Hui Xiao; Xiaoke Ma; Xiaofei Yang; Haitao Luo; Guoguang Zhao; Dechao Bu; Fei Jiao; Qixiang Shao; RunSheng Chen; Yi Zhao
Journal: Nucleic Acids Res Date: 2012-11-05 Impact factor: 16.971

10. Assessment of transcript reconstruction methods for RNA-seq.

Authors: Josep F Abril; Pär G Engström; Felix Kokocinski; Tamara Steijger; Tim J Hubbard; Roderic Guigó; Jennifer Harrow; Paul Bertone
Journal: Nat Methods Date: 2013-11-03 Impact factor: 28.547
