
GreeNC 2.0: a comprehensive database of plant long non-coding RNAs.

Marco Di Marsico, Andreu Paytuvi Gallart, Walter Sanseverino, Riccardo Aiese Cigliano.

Abstract

The Green Non-Coding Database (GreeNC) is one of the reference databases for the study of plant long non-coding RNAs (lncRNAs). Here we present our most recent update where 16 species have been updated, while 78 species have been added, resulting in the annotation of more than 495 000 lncRNAs. Moreover, sequence clustering was applied providing information about sequence conservation and gene families. The current version of the database is available at: http://greenc.sequentiabiotech.com/wiki2/Main_Page.
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2022        PMID: 34723326      PMCID: PMC8728176          DOI: 10.1093/nar/gkab1014

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Long non-coding RNAs (lncRNAs) were long regarded as transcriptional noise, but this class of molecules has recently gained increasing attention in epigenetic research and is now recognized to play an important role in mediating the transmission and expression of genetic information (1). LncRNAs are RNA molecules longer than 200 nucleotides with no protein-coding ability (2). Despite this, they are involved in fundamental biological processes, and their activities are complex and diverse. For example, lncRNAs can contribute to the regulation of protein modification, chromatin remodelling, RNA metabolism, transcription, DNA methylation and many other processes (1). Because they act in a wide range of pathways, lncRNAs have been extensively studied in humans and in clinical applications (3). In plants, many lncRNAs have been characterized in model organisms such as Arabidopsis thaliana, Zea mays and Solanum lycopersicum. For instance, the Arabidopsis lncRNA APOLO (AUXIN-REGULATED PROMOTER LOOP) regulates the expression of unrelated, distant auxin-responsive genes during lateral root development by modulating local chromatin conformation (4). Zea mays PILNCR1 is involved in plant adaptation to phosphate deficiency (5). LncRNA1459, detected in Solanum lycopersicum, has been shown to be involved in fruit ripening (6). To help the scientific community study plant lncRNA sequences and functions, we developed and published the GreeNC database in 2015 (7). Since then, the database has been accessed >250 000 times and has become a reference for the plant scientific community working on lncRNAs. In recent years, tens of new plant species have been sequenced, and new or updated reference genomes have been published for many of the species already in GreeNC. For this reason, we present GreeNC 2.0, an update in which lncRNAs from 78 new species were added and 16 species were updated.
In addition, we performed extensive sequence clustering to detect orthologous groups of lncRNAs both between and within species. With this additional information, researchers will be able to determine whether candidate lncRNAs belong to gene families and whether they are conserved across species.

MATERIALS AND METHODS

Genome and annotations

FASTA sequences of transcripts were downloaded from Phytozome v13 and Ensembl Plants release 51. The assembly version of each species is reported in Supplementary Table S1. Only unrestricted genomic data were used (8–85). For Oryza and for Brassica rapa, transcripts were downloaded from Ensembl Plants release 51 and Phytozome v13, respectively.

Identification of lncRNAs

As in the previous version of GreeNC (7), two bash scripts were used to identify lncRNAs among the downloaded transcript sequences (Figure 1). The first script assesses coding potential: only transcripts with a minimum length of 200 nucleotides and an ORF shorter than 120 amino acids, as determined with Ugene (38.1), were retained. These sequences were searched with BLASTX (v2.9.0) against SwissProt (2021/04), and CPC (0.9-r2) was used to assess their protein-coding potential. A second script was used to discriminate lncRNAs from other non-coding transcripts and to identify possible miRNA precursors. Transcripts were analyzed with cmscan (Infernal 1.1rc4) against the RFAM database (release 14.6), and BLASTn (2.6.0) was run against a database of mature plant miRNA sequences from miRBase (release 22.1). The final list of lncRNAs was divided into high- and low-confidence sets. Transcripts without BLASTX hits that were described as non-coding by CPC and were not considered miRNA precursors were classified as high-confidence lncRNAs. Those without BLASTX hits but considered coding by CPC, those with BLASTX hits but considered non-coding by CPC, and those considered miRNA precursors were marked as low-confidence lncRNAs. To exclude putative transposons, RepeatMasker was used to identify transcripts containing predicted repetitive regions; these transcripts were also classified as low-confidence. RepeatMasker (4.1.0) was executed with a custom library obtained from RepBase (86) and the following parameters: -no_is, -gff, -nolow.
Figure 1.

Overview of the in-house developed computational pipeline for lncRNA annotation, which consists of script 1 (A) and script 2 (B).
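
The decision rules described in the text can be summarized in a short Python sketch. This is an illustration, not the authors' bash pipeline: the `Evidence` record and its field names are hypothetical, the Rfam (cmscan) screening step is omitted, and the handling of transcripts that have both a BLASTX hit and a coding call from CPC is an assumption, since the text does not state it.

```python
from dataclasses import dataclass

MIN_LENGTH_NT = 200  # minimum transcript length stated in the text
MAX_ORF_AA = 120     # the ORF must be shorter than this (stated in the text)


@dataclass
class Evidence:
    """Per-transcript evidence; all field names are hypothetical."""
    length_nt: int
    longest_orf_aa: int
    blastx_hit: bool       # hit against SwissProt
    cpc_coding: bool       # CPC calls the transcript coding
    mirna_precursor: bool  # flagged as a possible miRNA precursor
    has_repeat: bool       # RepeatMasker predicts a repetitive region


def passes_prefilter(ev: Evidence) -> bool:
    """Length and ORF criteria applied before any other evidence."""
    return ev.length_nt >= MIN_LENGTH_NT and ev.longest_orf_aa < MAX_ORF_AA


def classify(ev: Evidence) -> str:
    """Return 'high-confidence', 'low-confidence' or 'discarded'."""
    if not passes_prefilter(ev):
        return "discarded"
    if ev.blastx_hit and ev.cpc_coding:
        # ASSUMPTION: the text does not describe this case; this sketch
        # treats such transcripts as protein-coding and drops them.
        return "discarded"
    if ev.blastx_hit or ev.cpc_coding or ev.mirna_precursor or ev.has_repeat:
        return "low-confidence"
    return "high-confidence"


if __name__ == "__main__":
    clean = Evidence(850, 40, False, False, False, False)
    repeat = Evidence(850, 40, False, False, False, True)
    print(classify(clean), classify(repeat))
```

Keeping the prefilter separate from the confidence call mirrors the two-script layout in Figure 1, although the exact split of responsibilities between the scripts is not reproduced here.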


Relational database

Data was imported into a MySQL-based relational database stored on an Ubuntu server (Ubuntu 18.04.4 LTS). This database was then integrated into a MediaWiki by mapping relational data fields against predefined templates via Semantic MediaWiki. Transcript sequences in a FASTA file were formatted using makeblastdb. Sequence retrieval is based on blastdbcmd. An Express Node.js API web service was created to expose both sequence retrieval and BLAST searches via client JavaScript from the MediaWiki interface.
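
As a rough illustration of the sequence-retrieval layer, the following Python sketch assembles the two BLAST+ command lines involved. It is not the authors' Node.js service: the file name, database name and transcript identifier are hypothetical, and the use of `-parse_seqids` (which lets `blastdbcmd` look sequences up by identifier) is an assumption about how the database was built.

```python
import shlex


def makeblastdb_cmd(fasta_path: str, db_name: str) -> list[str]:
    """Command that formats a nucleotide FASTA file as a BLAST database."""
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "nucl",
        "-parse_seqids",  # assumption: index IDs so they can be retrieved
        "-out", db_name,
    ]


def blastdbcmd_cmd(db_name: str, transcript_id: str) -> list[str]:
    """Command that retrieves one sequence from the database by its ID."""
    return ["blastdbcmd", "-db", db_name, "-entry", transcript_id]


if __name__ == "__main__":
    # Hypothetical names; the commands are printed here, not executed.
    print(shlex.join(makeblastdb_cmd("lncrnas.fa", "greenc_lncrnas")))
    print(shlex.join(blastdbcmd_cmd("greenc_lncrnas", "example_transcript_1")))
```

Building the commands as argument lists rather than shell strings means they can be passed directly to `subprocess.run` without quoting issues if a web service forwards user-supplied identifiers.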

OrthoFinder

To evaluate sequence similarity and cluster lncRNAs into orthogroups, an OrthoFinder (87) analysis was executed with the following parameters: -d, -f, -S diamond_ultra_sens. lncRNA sequences from all the species were used as input files.
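
A minimal Python sketch of how such a run could be prepared is given below. OrthoFinder reads one FASTA file per species from a single directory, so the sketch first splits a combined set of records by species and then assembles the command with the parameters listed above. The record layout, file naming and the `orthofinder` executable name are assumptions, not details taken from the GreeNC pipeline.

```python
from collections import defaultdict
from pathlib import Path


def write_species_fastas(records, out_dir):
    """Write one FASTA file per species.

    records: iterable of (species, transcript_id, sequence) tuples
    (a hypothetical layout chosen for this sketch).
    Returns the written paths, sorted by species name.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    by_species = defaultdict(list)
    for species, transcript_id, sequence in records:
        by_species[species].append((transcript_id, sequence))
    paths = []
    for species, entries in sorted(by_species.items()):
        path = out / f"{species.replace(' ', '_')}.fa"
        with path.open("w") as handle:
            for transcript_id, sequence in entries:
                handle.write(f">{transcript_id}\n{sequence}\n")
        paths.append(path)
    return paths


def orthofinder_cmd(fasta_dir):
    """Command line with the parameters reported in the text."""
    return ["orthofinder", "-f", str(fasta_dir), "-d", "-S", "diamond_ultra_sens"]


if __name__ == "__main__":
    toy = [("Cucumis melo", "t1", "ACGT"), ("Zea mays", "t2", "GGCC")]
    write_species_fastas(toy, "orthofinder_input")
    print(" ".join(orthofinder_cmd("orthofinder_input")))
```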

RESULTS

The previous version of GreeNC (7) included 43 species, for a total of 120 000 annotated lncRNAs. After this update, GreeNC 2.0 includes information on >495 000 transcripts from 94 species of plants and algae (Figure 2). More than 327 000 transcripts were annotated as high-confidence lncRNAs. With this update, the highest percentages of lncRNAs were annotated in Triticum dicoccoides (7.7%), Aegilops tauschii (6.9%) and Hordeum vulgare (4.8%), and the lowest in Juglans regia (0.13%), Chara braunii (0.12%) and Cyanidioschyzon merolae (0.02%).
Figure 2.

A snapshot of a Cucumis melo entry from the GreeNC database. (A) Header, used to navigate the website and to access the tools and the species pages; (B) table of gene information reporting genomic coordinates, genome version, the source of the genome assembly and whether the gene encodes at least one coding transcript; (C) table of transcript features reporting the kind of lncRNA (low-/high-confidence), whether it is a precursor of miRNAs, length, orthologous group, sequence and links to the Open Reading Frame (ORF), the Coding Potential, the folding energy and the GC content; (D) an optional table that provides links to other databases, when applicable, with the version of the database and the e-value of the match; (E) table of transcripts belonging to the same orthogroup reporting the kind of lncRNA, length, folding energies (AMFE, MFEI) and GC content; (F) a schematic representation of the gene and transcript models.

Although lncRNAs are known not to show high conservation at the nucleotide level (88), we performed sequence clustering based on the OrthoFinder algorithm in order to provide information about highly conserved lncRNAs. About 39% of the 542 656 identified transcripts were assigned to orthogroups. In total, 65 191 orthogroups were identified; however, as expected, no orthogroup was present in all the species. Despite this, shared orthogroups were identified between species of the same genus, suggesting the presence of genus-specific lineages of lncRNAs (e.g. Triticum, Arabidopsis, Oryza, Gossypium, Brassica). Moreover, the presence of species-specific orthogroups highlights that long non-coding transcripts may be organized in gene families. A total of 24 743 orthogroups were identified as species-specific, with a mean of 242 orthogroups per species.
The highest number of species-specific orthogroups was recorded in Triticum dicoccoides (3487 orthogroups), while the lowest was detected in Cyanidioschyzon merolae (2). A total of 81 446 transcripts (15% of the total) were classified as species-specific, with a mean of 798 transcripts per species. Also in this case, the highest and lowest values were recorded in Triticum dicoccoides (17 234 transcripts) and Cyanidioschyzon merolae (4), respectively.
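
The species-specific counts reported above can be derived from an orthogroup table along the lines of the following Python sketch. It assumes that an orthogroup is species-specific when all of its member transcripts come from a single species, which the text implies but does not define explicitly; the input layout and the toy identifiers are hypothetical.

```python
from collections import Counter


def species_specific_orthogroups(orthogroups):
    """Map each single-species orthogroup to its species.

    orthogroups: dict of orthogroup ID -> list of (species, transcript_id).
    """
    specific = {}
    for og_id, members in orthogroups.items():
        species = {sp for sp, _ in members}
        if len(species) == 1:
            specific[og_id] = next(iter(species))
    return specific


def summarize(orthogroups):
    """Count species-specific orthogroups and transcripts per species."""
    specific = species_specific_orthogroups(orthogroups)
    og_per_species = Counter(specific.values())
    tx_per_species = Counter()
    for og_id, sp in specific.items():
        tx_per_species[sp] += len(orthogroups[og_id])
    return og_per_species, tx_per_species


if __name__ == "__main__":
    # Toy data, not taken from GreeNC.
    toy = {
        "OG1": [("Triticum dicoccoides", "a"), ("Triticum dicoccoides", "b")],
        "OG2": [("Triticum dicoccoides", "c"), ("Aegilops tauschii", "d")],
        "OG3": [("Oryza sativa", "e"), ("Oryza sativa", "f"), ("Oryza sativa", "g")],
    }
    og_counts, tx_counts = summarize(toy)
    print(dict(og_counts))
    print(dict(tx_counts))
```

Dividing the totals from `summarize` by the number of species gives per-species means of the kind quoted in the text.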

DATA AVAILABILITY

The GreeNC database is a MySQL relational database and it is freely accessible at: http://greenc.sequentiabiotech.com/wiki2/Main_Page. The pipeline for lncRNA prediction is available at: https://github.com/sequentiabiotech/GreeNC.
  87 in total

1.  The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla.

Authors:  Olivier Jaillon; Jean-Marc Aury; Benjamin Noel; Alberto Policriti; Christian Clepet; Alberto Casagrande; Nathalie Choisne; Sébastien Aubourg; Nicola Vitulo; Claire Jubin; Alessandro Vezzi; Fabrice Legeai; Philippe Hugueney; Corinne Dasilva; David Horner; Erica Mica; Delphine Jublot; Julie Poulain; Clémence Bruyère; Alain Billault; Béatrice Segurens; Michel Gouyvenoux; Edgardo Ugarte; Federica Cattonaro; Véronique Anthouard; Virginie Vico; Cristian Del Fabbro; Michaël Alaux; Gabriele Di Gaspero; Vincent Dumas; Nicoletta Felice; Sophie Paillard; Irena Juman; Marco Moroldo; Simone Scalabrin; Aurélie Canaguier; Isabelle Le Clainche; Giorgio Malacrida; Eléonore Durand; Graziano Pesole; Valérie Laucou; Philippe Chatelet; Didier Merdinoglu; Massimo Delledonne; Mario Pezzotti; Alain Lecharny; Claude Scarpelli; François Artiguenave; M Enrico Pè; Giorgio Valle; Michele Morgante; Michel Caboche; Anne-Françoise Adam-Blondon; Jean Weissenbach; Francis Quétier; Patrick Wincker
Journal:  Nature       Date:  2007-08-26       Impact factor: 49.962

2.  Genome expansion of Arabis alpina linked with retrotransposition and reduced symmetric DNA methylation.

Authors:  Eva-Maria Willing; Vimal Rawat; Terezie Mandáková; Florian Maumus; Geo Velikkakam James; Karl J V Nordström; Claude Becker; Norman Warthmann; Claudia Chica; Bogna Szarzynska; Matthias Zytnicki; Maria C Albani; Christiane Kiefer; Sara Bergonzi; Loren Castaings; Julieta L Mateos; Markus C Berns; Nora Bujdoso; Thomas Piofczyk; Laura de Lorenzo; Cristina Barrero-Sicilia; Isabel Mateos; Mathieu Piednoël; Jörg Hagmann; Romy Chen-Min-Tao; Raquel Iglesias-Fernández; Stephan C Schuster; Carlos Alonso-Blanco; François Roudier; Pilar Carbonero; Javier Paz-Ares; Seth J Davis; Ales Pecinka; Hadi Quesneville; Vincent Colot; Martin A Lysak; Detlef Weigel; George Coupland; Korbinian Schneeberger
Journal:  Nat Plants       Date:  2015-02-02       Impact factor: 15.793

3.  Genome assembly and annotation of Arabidopsis halleri, a model for heavy metal hyperaccumulation and evolutionary ecology.

Authors:  Roman V Briskine; Timothy Paape; Rie Shimizu-Inatsugi; Tomoaki Nishiyama; Satoru Akama; Jun Sese; Kentaro K Shimizu
Journal:  Mol Ecol Resour       Date:  2016-10-26       Impact factor: 7.090

4.  The tomato genome sequence provides insights into fleshy fruit evolution.

Authors: 
Journal:  Nature       Date:  2012-05-30       Impact factor: 49.962

5.  The TIGR Rice Genome Annotation Resource: improvements and new features.

Authors:  Shu Ouyang; Wei Zhu; John Hamilton; Haining Lin; Matthew Campbell; Kevin Childs; Françoise Thibaud-Nissen; Renae L Malek; Yuandan Lee; Li Zheng; Joshua Orvis; Brian Haas; Jennifer Wortman; C Robin Buell
Journal:  Nucleic Acids Res       Date:  2006-12-01       Impact factor: 16.971

6.  The genomic landscape of molecular responses to natural drought stress in Panicum hallii.

Authors:  John T Lovell; Jerry Jenkins; David B Lowry; Sujan Mamidi; Avinash Sreedasyam; Xiaoyu Weng; Kerrie Barry; Jason Bonnette; Brandon Campitelli; Chris Daum; Sean P Gordon; Billie A Gould; Albina Khasanova; Anna Lipzen; Alice MacQueen; Juan Diego Palacio-Mejía; Christopher Plott; Eugene V Shakirov; Shengqiang Shu; Yuko Yoshinaga; Matt Zane; Dave Kudrna; Jason D Talag; Daniel Rokhsar; Jane Grimwood; Jeremy Schmutz; Thomas E Juenger
Journal:  Nat Commun       Date:  2018-12-06       Impact factor: 14.919

7.  De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes.

Authors:  Matthew B Hufford; Arun S Seetharam; Margaret R Woodhouse; Kapeel M Chougule; Shujun Ou; Jianing Liu; William A Ricci; Tingting Guo; Andrew Olson; Yinjie Qiu; Rafael Della Coletta; Silas Tittes; Asher I Hudson; Alexandre P Marand; Sharon Wei; Zhenyuan Lu; Bo Wang; Marcela K Tello-Ruiz; Rebecca D Piri; Na Wang; Dong Won Kim; Yibing Zeng; Christine H O'Connor; Xianran Li; Amanda M Gilbert; Erin Baggs; Ksenia V Krasileva; John L Portwood; Ethalinda K S Cannon; Carson M Andorf; Nancy Manchanda; Samantha J Snodgrass; David E Hufnagel; Qiuhan Jiang; Sarah Pedersen; Michael L Syring; David A Kudrna; Victor Llaca; Kevin Fengler; Robert J Schmitz; Jeffrey Ross-Ibarra; Jianming Yu; Jonathan I Gent; Candice N Hirsch; Doreen Ware; R Kelly Dawe
Journal:  Science       Date:  2021-08-06       Impact factor: 47.728

8.  Evolution of red algal plastid genomes: ancient architectures, introns, horizontal gene transfer, and taxonomic utility of plastid markers.

Authors:  Jan Janouškovec; Shao-Lun Liu; Patrick T Martone; Wilfrid Carré; Catherine Leblanc; Jonas Collén; Patrick J Keeling
Journal:  PLoS One       Date:  2013-03-25       Impact factor: 3.240

9.  Transposons played a major role in the diversification between the closely related almond and peach genomes: results from the almond genome sequence.

Authors:  Tyler Alioto; Konstantinos G Alexiou; Amélie Bardil; Fabio Barteri; Raúl Castanera; Fernando Cruz; Amit Dhingra; Henri Duval; Ángel Fernández I Martí; Leonor Frias; Beatriz Galán; José L García; Werner Howad; Jèssica Gómez-Garrido; Marta Gut; Irene Julca; Jordi Morata; Pere Puigdomènech; Paolo Ribeca; María J Rubio Cabetas; Anna Vlasova; Michelle Wirthensohn; Jordi Garcia-Mas; Toni Gabaldón; Josep M Casacuberta; Pere Arús
Journal:  Plant J       Date:  2019-10-22       Impact factor: 6.417

10.  A genome resource for green millet Setaria viridis enables discovery of agronomically valuable loci.

Authors:  Sujan Mamidi; Adam Healey; Pu Huang; Jane Grimwood; Jerry Jenkins; Kerrie Barry; Avinash Sreedasyam; Shengqiang Shu; John T Lovell; Maximilian Feldman; Jinxia Wu; Yunqing Yu; Cindy Chen; Jenifer Johnson; Hitoshi Sakakibara; Takatoshi Kiba; Tetsuya Sakurai; Rachel Tavares; Dmitri A Nusinow; Ivan Baxter; Jeremy Schmutz; Thomas P Brutnell; Elizabeth A Kellogg
Journal:  Nat Biotechnol       Date:  2020-10-05       Impact factor: 54.908
