Teresa M R Noviello1,2, Antonella Di Liddo3, Giovanna M Ventola4, Antonietta Spagnuolo5, Salvatore D'Aniello5, Michele Ceccarelli1,2, Luigi Cerulo6,7.
Abstract
BACKGROUND: Long non-coding RNAs (lncRNAs) represent a novel class of non-coding RNAs that play a crucial role in many biological processes. Identifying long non-coding homologs across species is essential for investigating such roles in model organisms, as homologous genes tend to retain similar molecular and biological functions. Alignment-based metrics effectively capture the conservation of transcribed coding sequences and hence the homology of protein-coding genes. However, unlike protein-coding genes, long non-coding genes show poor sequence conservation, which makes the identification of their homologs a challenging task.
Keywords: Homology; Long ncRNA; String similarity
Year: 2018 PMID: 30400819 PMCID: PMC6220562 DOI: 10.1186/s12859-018-2441-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Definition of the adopted homology metrics (Alignment–based)
| Metric | Definition | Description |
|---|---|---|
| Smith–Waterman similarity | | The Smith–Waterman similarity |
| Damerau–Levenshtein distance | | The Damerau–Levenshtein distance |
X and Y are two candidate long non-coding genes; seq(X) and seq(Y) are the sets of representative sequences of X and Y, respectively (promoter or transcript); len(x) and len(y) are the lengths of sequences x and y, respectively. Where applicable, a metric is normalized with respect to the sum of the sequence lengths [42] and is minimized (for distance metrics) or maximized (for similarity metrics) over all pairs of transcript sequences x∈seq(X), y∈seq(Y)
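The gene-level aggregation described in the table note can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the optimal string alignment variant of the Damerau–Levenshtein distance, and it assumes that the normalization is a plain division by len(x) + len(y). The function names `osa_distance` and `gene_distance` are hypothetical.

```python
from itertools import product


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent characters, each with unit cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def gene_distance(seqs_x, seqs_y):
    """Minimum length-normalized distance over all sequence pairs.

    Assumption: normalization divides by len(x) + len(y).
    """
    return min(
        osa_distance(x, y) / (len(x) + len(y))
        for x, y in product(seqs_x, seqs_y)
    )


if __name__ == "__main__":
    print(gene_distance(["ACGT", "TTTT"], ["AGCT"]))
```

For a similarity metric such as Smith–Waterman, the same pattern would apply with `max` in place of `min`.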
Definition of the adopted homology metrics (Alignment–free)
| Metric | Definition | Description |
|---|---|---|
| n-gram distance | | A |
| Cosine similarity | | The cosine similarity is the cosine of the angle between the two |
| Jaccard similarity | | The Jaccard coefficient measures the similarity between two finite sets, and is defined as the size of the intersection divided by the size of the union of the sample sets [ |
| Base–base correlation distance | | The Base–base correlation measures the sequence similarity by computing the euclidean distance between two 16-dimensional feature vectors, |
| Average common substring distance | | The average common substring is the average lengths of maximum common substrings for constructing phylogenetic trees [ |
| Lempel–Ziv complexity distance | | The Lempel–Ziv complexity distance is defined by considering the minimum number of components over all production histories of |
| Jensen–Shannon distance | | The Jensen–Shannon distance is computed by averaging the Kullback–Leibler Divergence ( |
| Hamming distance | | The Hamming distance is defined between two strings of the same length as the number of positions in which corresponding values are different. We adopt two bit strings of length |
X and Y are two candidate long non-coding genes; seq(X) and seq(Y) are the sets of representative sequences of X and Y, respectively (promoter or transcript); len(x) and len(y) are the lengths of sequences x and y, respectively. Where applicable, a metric is normalized with respect to the sum of the sequence lengths [42] and is minimized (for distance metrics) or maximized (for similarity metrics) over all pairs of transcript sequences x∈seq(X), y∈seq(Y)
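Two of the alignment-free metrics in the table, Jaccard and cosine similarity, can be illustrated on n-gram profiles. The sketch below rests on assumptions that the truncated table does not confirm: sequences are represented as the set (Jaccard) or the count vector (cosine) of their overlapping n-grams. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq: str, n: int):
    """All overlapping substrings of length n."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def jaccard_similarity(x: str, y: str, n: int) -> float:
    """Size of the intersection over size of the union of the n-gram sets."""
    a, b = set(ngrams(x, n)), set(ngrams(y, n))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(x: str, y: str, n: int) -> float:
    """Cosine of the angle between the n-gram count vectors (assumed form)."""
    cx, cy = Counter(ngrams(x, n)), Counter(ngrams(y, n))
    dot = sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())
    norm = math.sqrt(sum(v * v for v in cx.values())) * math.sqrt(
        sum(v * v for v in cy.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(jaccard_similarity("ACGT", "ACGA", 2))
    print(cosine_similarity("ACGT", "ACGA", 2))
```

Because both functions only compare n-gram content, they need no alignment and run in time roughly linear in the sequence lengths.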
Fig. 1 P-value barplot for permutation test in Human-Mouse. -log10(p-values) estimated by permutation test over a null distribution of random non-homologous pairs in Human-Mouse on promoter (blue bars) and transcript sequences (red bars) for each considered metric. Homologous lncRNA pairs are ranked according to the best prediction computed on promoter sequences among metrics. The x-axis reports true homologous pairs for the two species
Fig. 2 P-value barplot for permutation test in Mouse-Zebrafish. -log10(p-values) estimated by permutation test over a null distribution of random non-homologous pairs in Mouse-Zebrafish on promoter (blue bars) and transcript sequences (red bars) for each considered metric. Homologous lncRNA pairs are ranked according to the best prediction computed on promoter sequences among metrics. The x-axis reports true homologous pairs for the two species
Fig. 3 P-value barplot for permutation test in Human-Zebrafish. -log10(p-values) estimated by permutation test over a null distribution of random non-homologous pairs in Human-Zebrafish on promoter (blue bars) and transcript sequences (red bars) for each considered metric. Homologous lncRNA pairs are ranked according to the best prediction computed on promoter sequences among metrics. The x-axis reports true homologous pairs for the two species
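The p-values plotted in Figs. 1 to 3 come from a permutation test against random non-homologous pairs. A minimal sketch of such an empirical p-value is given below. The add-one correction and the function name `empirical_p_value` are assumptions; the captions do not state the exact estimator used.

```python
import math


def empirical_p_value(observed, null_scores, higher_is_better=True):
    """Fraction of null scores at least as extreme as the observed one.

    Assumption: an add-one correction keeps the estimate strictly positive,
    so that -log10(p) is always finite.
    """
    if higher_is_better:   # similarity metrics
        extreme = sum(s >= observed for s in null_scores)
    else:                  # distance metrics
        extreme = sum(s <= observed for s in null_scores)
    return (extreme + 1) / (len(null_scores) + 1)


if __name__ == "__main__":
    # Hypothetical similarity scores of random non-homologous pairs.
    null = [0.1, 0.2, 0.3, 0.4]
    p = empirical_p_value(0.35, null)
    print(p, -math.log10(p))
```

With this estimator the smallest attainable p-value is 1 / (N + 1) for N null pairs, which bounds the height of the bars in a -log10 plot.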
Fig. 4 NONCODE AUPR plots. Metric prediction performance computed on promoter and transcript sequences for NONCODE lncRNA homologs (AUPR on y-axis and n, the number of consecutive nucleotides in n-gram metrics, on x-axis)
Fig. 5 ZFLNC AUPR plots. Metric prediction performance computed on promoter and transcript sequences for ZFLNC lncRNA homologs (AUPR on y-axis and n, the number of consecutive nucleotides in n-gram metrics, on x-axis)
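The AUPR reported in Figs. 4 and 5 is the area under the precision-recall curve. One common way to estimate it is average precision, sketched below. This is an assumption about the estimator, not a statement of how the plotted values were computed; the function name and the toy data are hypothetical, and tied scores are not treated specially.

```python
def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision-recall curve.

    scores: predicted similarity for each candidate pair (higher = more
            likely homologous; negate a distance to use it here).
    labels: 1 for a true homologous pair, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / total_pos


if __name__ == "__main__":
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

Repeating this computation for each value of n would give one point per n-gram length, matching the axes described in the captions.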
Fig. 6 Functional concordance plots. Enrichment of GO Biological Process (BP) terms among the flanking protein-coding genes of lncRNAs that overlap the conserved elements in Zebrafish (green bars) and that are predicted to be homologs according to Jaccard similarity with n=12 (red bars) in Human and Mouse. Blue bars indicate the percentages of the BP terms in the entire transcriptome of the specific species
Fig. 7 Distribution of conserved and non-conserved flanking genes
Annotated homologous genes between species in the manually curated gold standard
| Gene class (species 1) | Gene class (species 2) | Human–Mouse | Human–Zebrafish | Mouse–Zebrafish |
|---|---|---|---|---|
| Antisense | Antisense | 12 | 2 | 1 |
| Antisense | lincRNA | 8 | 2 | 0 |
| lincRNA | Antisense | 1 | 1 | 2 |
| lincRNA | lincRNA | 20 | 2 | 2 |
| Overlapping | Overlapping | 1 | 1 | 1 |
| Total lncRNAs | | 42 | 8 | 6 |
| Protein coding | Protein coding | 12998 | 10209 | 10126 |