Miguel Pignatelli, Albert J Vilella, Matthieu Muffato, Leo Gordon, Simon White, Paul Flicek, Javier Herrero.
Abstract
Annotation of orthologous and paralogous genes is necessary for many aspects of evolutionary analysis. Methods to infer these homology relationships have traditionally focused on protein-coding genes, and the evolutionary models used by these methods normally assume that the positions in the protein evolve independently. However, as our appreciation for the roles of non-coding RNA genes has increased, consistently annotated sets of orthologous and paralogous ncRNA genes are increasingly needed. At the same time, methods such as PHASE or RAxML have implemented substitution models that consider pairs of sites to enable proper modelling of the loops and other features of RNA secondary structure. Here, we present a comprehensive analysis pipeline for the automatic detection of orthologues and paralogues for ncRNA genes. We focus on gene families that are represented in Rfam and for which a specific covariance model is provided. For each family, ncRNA genes found in all Ensembl species are aligned using Infernal, and several trees are built using different substitution models. In parallel, a genomic alignment that includes the ncRNA genes and their flanking sequence regions is built with PRANK. This alignment is used to create two additional phylogenetic trees using the neighbour-joining (NJ) and maximum-likelihood (ML) methods. The trees arising from both the ncRNA and genomic alignments are merged using TreeBeST, which reconciles them with the species tree in order to identify speciation and duplication events. The final tree is used to infer the orthologues and paralogues following Fitch's definition. We also determine gene gain and loss events for each family using CAFE. All data are accessible through the Ensembl Comparative Genomics ('Compara') API and on our FTP site, and are fully integrated in the Ensembl genome browser, where they can be accessed in a user-friendly manner. Database URL: http://www.ensembl.org
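The last inference step described in the abstract can be illustrated with a minimal sketch. It assumes a reconciled gene tree whose internal nodes are already labelled as speciation or duplication events, and applies Fitch's definition: two genes are orthologues if their last common ancestor is a speciation node, and paralogues if it is a duplication node. The `Node` class, the event labels and the gene names are hypothetical and are not taken from TreeBeST or the Compara API.

```python
from itertools import combinations


class Node:
    """Hypothetical reconciled gene-tree node (not the Compara data model)."""

    def __init__(self, event=None, name=None, children=()):
        self.event = event            # "speciation" or "duplication" (internal nodes)
        self.name = name              # gene identifier (leaves only)
        self.children = list(children)


def leaves(node):
    """Return the gene names found below a node."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def classify_pairs(node, result=None):
    """Label each gene pair by the event at its last common ancestor."""
    if result is None:
        result = {}
    if not node.children:
        return result
    label = "orthologue" if node.event == "speciation" else "paralogue"
    # Genes in different child subtrees have this node as last common ancestor.
    for left, right in combinations([leaves(c) for c in node.children], 2):
        for a in left:
            for b in right:
                result[frozenset((a, b))] = label
    for child in node.children:
        classify_pairs(child, result)
    return result


# Toy tree: an ancestral duplication followed by a speciation in each copy.
tree = Node("duplication", children=[
    Node("speciation", children=[Node(name="human_A"), Node(name="mouse_A")]),
    Node("speciation", children=[Node(name="human_B"), Node(name="mouse_B")]),
])
pairs = classify_pairs(tree)
```

In this toy tree, `human_A` and `mouse_A` are orthologues, whereas `human_A` and `mouse_B` are paralogues because they diverged at the duplication node.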
Year: 2016 PMID: 26980512 PMCID: PMC4792531 DOI: 10.1093/database/bav127
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Distribution of Ensembl ncRNA genes in the Rfam database. (A) Distribution of Ensembl ncRNA gene families present in Rfam by family type. (B) Distribution of Ensembl ncRNA genes present in Rfam by family type. (C) Distribution of ncRNA genes by species.
Figure 2. Schematic representation of the main steps in the ncRNA tree analysis pipeline.
Figure 3. Distribution of number of species in the different sub-trees after splitting the super-trees.
Figure 4. Summary of the PRANK alignment for the mir-652 gene family (17 genes) using either PRANK (default internal tree) or MAFFT + RAxML to build the guide tree. For each position in the alignment (x axis), we represent the fraction of gaps in flanking regions (dark green), aligned flanking sequence (light green), gaps in the ncRNA regions (light red) and aligned ncRNA regions (dark red). When MAFFT + RAxML is used to produce the guide tree, the resulting alignment keeps the ncRNA and the flanking regions well segregated.
Figure 5. Analysis of tree reconciliation. (A) Intermediate tree support for each branch in the final tree. For each branch in the final gene trees, the number of times a given intermediate tree supports a branch is calculated and divided by the total number of times that tree appears. The dark regions of each bar indicate the fraction of times the branch is supported only by that tree. (B) Heatmap representing the overlap between model support. The support for each model in all final branches in the final trees is divided by the union of models supporting them, i.e. when two models support the same final branches, this ratio is 1 and when no overlap is found, this ratio is 0. (C) Venn diagram showing the overlap between branches supported by trees based on secondary structure or genomic sequences. Fast trees are included in the corresponding category.
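One plausible reading of the ratio in panel (B) is an intersection-over-union (Jaccard) score between the sets of final branches supported by two models, which equals 1 when both models support the same branches and 0 when they share none. The sketch below illustrates that reading only. The model names and branch identifiers are hypothetical, and the exact computation used for the figure may differ.

```python
from itertools import combinations


def support_overlap(branches_a, branches_b):
    """Shared supported branches divided by branches supported by either model."""
    a, b = set(branches_a), set(branches_b)
    union = a | b
    if not union:
        return 0.0  # neither model supports any branch
    return len(a & b) / len(union)


# Hypothetical data: final branches supported by each intermediate tree model.
support = {
    "ss_model": {"b1", "b2", "b3"},
    "genomic_ml": {"b2", "b3", "b4"},
    "genomic_nj": {"b5"},
}

# Pairwise overlap, i.e. the values that would fill the heatmap cells.
overlap = {
    (m1, m2): support_overlap(support[m1], support[m2])
    for m1, m2 in combinations(sorted(support), 2)
}
```

Here `ss_model` and `genomic_ml` share two of the four branches supported by either model, giving an overlap of 0.5.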
Figure 6. Simplified species tree showing the support of all the internal duplications (coloured pie charts) and their numbers (black and white pie charts). ‘Mixed’ signifies that the duplication is supported by multiple kinds of intermediate trees, as opposed to the other labels such as ‘Secondary-structure trees’, which indicate that a duplication has been identified by a single kind of intermediate tree.
Figure 7. Ranking frequency of the different intermediate trees compared with the merged final tree based on their K tree scores.
Figure 8. Analysis of duplication confidence scores in the resulting trees. (A) Distribution of confidence scores for non-species-specific duplications determined by the ncRNA analysis pipeline including secondary structure trees, genomic-based trees and fast trees in Ensembl release 82. (B) Improvement of confidence scores for all duplications when genomic-based intermediate trees are added to secondary structure-based trees in the merging step. Each data point in the heat map represents the average scores for a family.
Number of one-to-one, one-to-many and many-to-many orthologues determined by the ncRNA pipeline for all human ncRNAs
| Human vs. | 1-to-1 | 1-to-many | many-to-many |
|---|---|---|---|
| Chimp | 5497 | 288 | 132 |
| Marmoset | 3293 | 821 | 235 |
| Mouse | 881 | 914 | 278 |
| Zebra finch | 202 | 468 | 141 |
| Zebrafish | 133 | 355 | 341 |
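The three orthology categories in the table can be illustrated with a short sketch. It assumes that each orthologue pair is labelled according to how many partners each gene has in the other species. The gene identifiers are hypothetical, and whether the table counts pairs or genes is not stated here, so the tallies below are per pair.

```python
from collections import Counter, defaultdict


def classify_orthology(pairs):
    """Label (gene_a, gene_b) orthologue pairs as 1-to-1, 1-to-many or many-to-many."""
    partners_of_a = defaultdict(set)
    partners_of_b = defaultdict(set)
    for a, b in pairs:
        partners_of_a[a].add(b)
        partners_of_b[b].add(a)

    labels = {}
    for a, b in pairs:
        n_a = len(partners_of_a[a])   # orthologues of a in the other species
        n_b = len(partners_of_b[b])   # orthologues of b in the first species
        if n_a == 1 and n_b == 1:
            labels[(a, b)] = "1-to-1"
        elif n_a == 1 or n_b == 1:
            labels[(a, b)] = "1-to-many"
        else:
            labels[(a, b)] = "many-to-many"
    return labels


# Hypothetical human (h*) and mouse (m*) orthologue pairs.
example_pairs = [
    ("h1", "m1"),                                            # single partner each
    ("h2", "m2"), ("h2", "m3"),                              # one human, two mouse
    ("h3", "m4"), ("h3", "m5"), ("h4", "m4"), ("h4", "m5"),  # two by two
]
labels = classify_orthology(example_pairs)
counts = Counter(labels.values())
```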
Number of ncRNA orthologue pairs located in or near protein orthologues with the same orthology relationship, for the selected pairs of species
| Species pair | % Syntenic protein orthologues (intronic) | % Syntenic protein orthologues (5 kb) |
|---|---|---|
| Human–Chimp | 1870/1948 (96.0%) | 387/430 (90.0%) |
| Human–Marmoset | 956/1256 (76.1%) | 191/313 (61.0%) |
| Human–Mouse | 205/682 (30.1%) | 83/219 (37.9%) |
| Human–Zebra finch | 121/233 (51.9%) | 30/89 (33.7%) |
| Human–Chicken | 175/302 (58.0%) | 46/112 (41.1%) |
| Human–Zebrafish | 114/434 (26.3%) | 15/85 (17.6%) |
Figure 9. Gene family expansions and contractions. The tree on the left shows the species used in the gene family evolution of ncRNA trees. The pie charts show the number of gene families expanded (red) and contracted (blue) in each node of the tree. The size of the pie chart is proportional to the number of families that have expanded or contracted. The table on the right shows the families expanded in the mammal lineage. The numbers indicate the number of genes in each extant species.
Primate-specific microRNAs
| Gene | Copies in Human | Target Genes (miRNAmap) | Description of target genes (miRNAmap) | Target Genes (microRNA) | Description of target genes (microRNA) | Location of miRNA |
|---|---|---|---|---|---|---|
| mir-550 | 3 | PRDM2 | PR domain zinc finger protein 2 | LGI1 | leucine-rich, glioma inactivated 1 (LGI1) | + Inside intron of gene ZNRF2 (zinc and ring finger 2) |
| | | | | | | + Inside intron of AVL9 homolog from S. cerevisiae |
| FUSIP1 | FUS-interacting serine-arginine-rich protein 1 | SHISA2 | Shisa homolog 2 (Xenopus laevis) | |||
| POU2F1 | POU class 2 homeobox 1 | |||||
| mir-556 | 1 | KIF1B | Kinesin-like protein KIF1B (Klp) | + Inside intron of gene NOS1AP (nitric oxide synthase 1 (neuronal) adaptor protein) | ||
| TARDBP | TAR DNA-binding protein 43 (TDP-43) | |||||
| LYPLA2 | Acyl-protein thioesterase 2 | |||||
| mir-573 | 1 | CPSF3L | Integrator complex subunit 11 | LARP1B | La ribonucleoprotein domain family, member 1B | +Intergenic |
| C10orf118 | Chromosome 10 open reading frame 118 | |||||
| SLC25A26 | Solute carrier family 25, member 26 | |||||
| CCDC62 | coiled-coil domain containing 62 | |||||
| ST6GALNAC3 | ST6 (alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1, 3)-N-acetylgalactosaminide alpha-2,6 sialyltransferase 3 | |||||
| mir-580 | 1 | PARK7 | Protein DJ-1 (Oncogene DJ1) (Parkinson disease protein 7) | ZBTB1 | Zinc finger and BTB domain containing 1 | +Inside intron of gene LMBR1D2 (LMBR1 domain containing 2) |
| ALPL | Alkaline phosphatase, tissue-nonspecific isozyme precursor | EPB41L2 | Erythrocyte membrane protein band 4.1-like 2 | |||
| MRPL42 | Mitochondrial ribosomal protein L42 | |||||
| PYROXD1 | Pyridine nucleotide-disulphide oxidoreductase domain 1 | |||||
| mir-581 | 1 | FRAP1 | FKBP12-rapamycin complex-associated protein (FK506-binding protein 12- rapamycin complex-associated protein 1) (Rapamycin target protein) (RAPT1) (Mammalian target of rapamycin) (mTOR) | THAP1 | THAP domain containing, apoptosis associated protein 1 | +Inside intron of gene ARL15 (ADP-ribosylation factor-like 15) |
| RBM6 | RNA binding motif protein 6 | |||||
| MYO10 | myosin X | |||||
| CCDC66 | Coiled-coil domain containing 66 | |||||
| mir-583 | 1 | SPSB1 | SPRY domain-containing SOCS box protein 1 (SSB-1) | CCDC141 | Coiled-coil domain containing 141 | +Intergenic |
| SPEN | Msx2-interacting protein (SPEN homolog) (SMART/HDAC1-associated repressor protein). | ZNF512 | zinc finger protein 512 | |||
| CDC42 | Cell division control protein 42 homolog precursor (G25K GTP-binding protein). | WNK3 | WNK lysine deficient protein kinase 3 | |||
| mir-586 | 1 | ZUBR1 | retinoblastoma-associated factor 600 | FAM164A/ZC2HC1A | zinc finger, C2HC-type containing 1A | +Inside intron of gene SUPT3H (suppressor of Ty 3 homolog (S. cerevisiae)) |
| FAM76A | Protein FAM76A | MMP13 | Matrix metallopeptidase 13 (collagenase 3) | |||
| STX12 | Syntaxin-12 | ITPR1 | Inositol 1,4,5-trisphosphate receptor, type 1 | |||
| mir-597 | 1 | CCNL2 | Cyclin-L2 (Paneth cell-enhanced expression protein) | SMARCE1 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily e, member 1 | +Inside intron of gene TNKS1 (tankyrase, TRF1-interacting ankyrin-related ADP-ribose polymerase) |
| UBE4B | Ubiquitin conjugation factor E4 B (Ubiquitin fusion degradation protein 2) (Homozygously deleted in neuroblastoma 1) | |||||
| mir-601 | 1 | PPP1R8 | Nuclear inhibitor of protein phosphatase 1 (NIPP-1) (Protein phosphatase 1 regulatory inhibitor subunit 8) | PMCHL2 | pro-melanin-concentrating hormone-like 2, pseudogene (ncRNA) | +Inside intron of gene DENND1A (DENN/MADD domain containing 1A) |
| ZBTB8 | Zinc finger and BTB domain-containing protein 8 | |||||
| FHL3 | Four and a half LIM domains protein 3 (FHL-3) (Skeletal muscle LIM- protein 2) (SLIM 2) | |||||
| mir-605 | 1 | TARDBP | TAR DNA-binding protein 43 (TDP-43) | MTRR | 5-methyltetrahydrofolate-homocysteine methyltransferase reductase | +Inside intron of gene PRKG1 (protein kinase, cGMP-dependent, type I) |
| RPA2 | Replication protein A 32 kDa subunit (RP-A) (RF-A) (Replication factor-A protein 2) (p32) (p34) | SERTAD4 | SERTA domain containing 4 | |||
| TAF12 | Transcription initiation factor TFIID subunit 12 (Transcription initiation factor TFIID 20/15 kDa subunits) (TAFII-20/TAFII-15) (TAFII20/TAFII15) | LUC7L3 | LUC7-like 3 (S. cerevisiae) | |||
| mir-624 | 1 | PANK4 | Pantothenate kinase 4 | NBEA | Neurobeachin | +Inside intron of gene STRN3 (striatin, calmodulin binding protein 3) |
| TARDBP | TAR DNA-binding protein 43 (TDP-43) | SS18 | Synovial sarcoma translocation, chromosome 18 | |||
| ARID1A | AT-rich interactive domain-containing protein 1A (ARID domain-containing protein 1A) (SWI/SNF-related, matrix-associated, actin-dependent regulator of chromatin subfamily F member 1) (SWI-SNF complex protein p270) (B120) (SWI-like protein) | PPP6R3 | Protein phosphatase 6, regulatory subunit 3 | |||
| mir-640 | 1 | KIF1B | Kinesin-like protein KIF1B (Klp) | ZDHHC17 | Zinc finger, DHHC-type containing 17 | +Inside intron of gene GATAD2A (GATA zinc finger domain containing 2A) |
| VPS13D | Vacuolar protein sorting-associated protein 13D | SLC30A5 | Solute carrier family 30 (zinc transporter), member 5 | |||
| KCTD18 | Potassium channel tetramerisation domain containing 18 | |||||
| mir-648 | 1 | RERE | Arginine-glutamic acid dipeptide repeats protein (Atrophin-1-like protein) (Atrophin-1-related protein) | HBP1 | HMG-box transcription factor 1 | +Intergenic |
| KIF1B | Kinesin-like protein KIF1B (Klp) | MOG | Myelin oligodendrocyte glycoprotein | |||
| mir-651 | 1 | RER1 | Protein RER1 | ITGB1 | integrin, beta 1 (fibronectin receptor, beta polypeptide, antigen CD29 includes MDF2, MSK12) | +Intergenic |
| SPDYE8P | Speedy homolog E8 (Xenopus laevis) | |||||
| BMP2K | BMP2 inducible kinase | |||||
| mir-887 | 1 | P2RY12 | Purinergic receptor P2Y, G-protein coupled, 12 | +Inside intron of gene FBXL7 (F-box and leucine-rich repeat protein 7) |
The target genes predicted by miRNAmap 2.0 and microRNA are shown together with their descriptions.
Figure 10. Example gene tree displayed in the Ensembl genome browser.