Damien Ulveling, Marcel E Dinger, Claire Francastel, Florent Hubé.
Abstract
To date, the main criterion for discriminating long ncRNAs (lncRNAs) from mRNAs is the capacity of the transcript to encode a protein. However, it is becoming important to identify non-ORF-based sequence characteristics that can be used to distinguish ncRNAs from mRNAs. In this study, we first established an extremely selective workflow to define a highly refined database of lncRNAs, which was used for comparison with mRNAs. Using this highly selective collection of lncRNAs, we found that CG dinucleotide frequencies were clearly distinct between lncRNAs and mRNAs. In addition, we showed that the bias in CG dinucleotide frequency was conserved in the human and mouse genomes. We propose that this sequence feature will serve as a useful classifier in transcript classification pipelines. We also suggest that our refined database of "bona fide" lncRNAs will be valuable for the discovery of other sequence characteristics distinct to lncRNAs.
Keywords: CG dinucleotide; database; exon; intron; mRNA; ncRNA; pseudogene; sequence bias
Year: 2014 PMID: 25250049 PMCID: PMC4158813 DOI: 10.3389/fgene.2014.00316
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Characterization of specific features of the “bona fide” lncRNA database. (A) Frequencies of occurrence of dinucleotides amongst the “bona fide” lncRNAs compared to those in mRNAs and pseudogenic RNAs (pseudoRNA), and compared to published dinucleotide frequencies in intronic and exonic sequences (Bulmer, 1987) (gray text). Frequencies of underrepresented dinucleotides are framed in gray where no difference is observed, or in yellow where differences between mRNA, pseudoRNA and lncRNA are observed. (B) The CG dinucleotide signature for mRNAs, pseudoRNAs and lncRNAs is expressed as a % enrichment over the frequency of the CG dinucleotide in the whole human genome. Histograms represent mean values ± s.e.m. ***p-value < 0.005 (Student's t-test, two-sided). (C) Raw data obtained from CPC (Coding Potential Calculator; http://cpc.cbi.pku.edu.cn) using the three databases (mRNA, pseudoRNA and lncRNA) were plotted according to the number of sequences presenting negative (non-coding prediction) or positive (coding capacity) scores. (D) Using data extracted from the EMBOSS CUSP tool (http://emboss.sourceforge.net), which creates a codon usage table from a nucleotide sequence, the number of stop codons per 1000 bases is represented for the three databases and for a set of random sequences generated using the Random DNA Sequence Generator software (http://users-birc.au.dk/biopv/php/fabox).
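The dinucleotide frequencies in panel (A) and the CG signature in panel (B) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, the use of overlapping dinucleotide windows is an assumption, and the enrichment is assumed to be the relative difference from a background frequency that the caller must supply (the caption does not give the genome-wide CG frequency).

```python
from collections import Counter


def dinucleotide_frequencies(seq):
    """Return the frequency of each overlapping dinucleotide in a sequence.

    Assumption: overlapping windows (positions i, i+1) are counted, and
    windows containing ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alike
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]
    total = len(pairs)
    if total == 0:
        return {}
    return {pair: count / total for pair, count in Counter(pairs).items()}


def cg_enrichment_percent(seq, background_cg_freq):
    """Percent enrichment of CG over a background CG frequency.

    Assumption: enrichment = 100 * (f_seq - f_background) / f_background.
    background_cg_freq is a caller-supplied value, not taken from the paper.
    """
    cg_freq = dinucleotide_frequencies(seq).get("CG", 0.0)
    return 100.0 * (cg_freq - background_cg_freq) / background_cg_freq
```

Applied to every transcript in a dataset, the second function would give one score per sequence, from which a mean and s.e.m. could be computed as in panel (B).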
Figure 2. Use of the CG dinucleotide frequency to categorize whole genome transcripts. Distribution of transcript signature scores (CG) obtained from ncRNA, mRNA and all grouped transcripts in human and murine sequences. Human and mouse transcripts were downloaded from NCBI (human.rna.fna and mouse.rna.fna, respectively) and filtered to specifically select lncRNA and mRNA sequences. Briefly, lncRNAs were selected using NR_ as the RefSeq accession number filter, and mRNAs were depleted using “partial,” “predicted,” “RIKEN,” “transcript variant” (with a number >1 to only keep the first one) and “NR_” as keywords. The number of RNA sequences used for the distribution plots is indicated, together with the mean, median, and standard error for each dataset. ***p-value < 0.005 (Student's t-test, two-sided).
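The filtering described in this caption could be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' pipeline. The helper names are hypothetical, keyword matching is assumed to be case-insensitive, and the accession is assumed to sit in the first whitespace-delimited token of each FASTA header.

```python
import re

# Keywords named in the caption; case-insensitive matching is an assumption.
EXCLUDE_KEYWORDS = ("partial", "predicted", "riken")
VARIANT_RE = re.compile(r"transcript variant (\d+)", re.IGNORECASE)


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def classify(header):
    """Return 'lncRNA', 'mRNA', or None (discarded) for a FASTA header."""
    first_token = header.split()[0] if header.split() else ""
    if "NR_" in first_token:
        return "lncRNA"  # NR_ accession filter from the caption
    lowered = header.lower()
    if any(keyword in lowered for keyword in EXCLUDE_KEYWORDS):
        return None
    match = VARIANT_RE.search(header)
    if match and int(match.group(1)) > 1:
        return None  # keep only the first transcript variant
    return "mRNA"
```

Reading human.rna.fna or mouse.rna.fna line by line through `parse_fasta` and grouping records by the result of `classify` would give the two sets whose CG scores are compared in the distribution plots.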