RCPedia: a database of retrocopied genes.

Fábio C P Navarro, Pedro A F Galante

Abstract

MOTIVATION: Retrocopies are copies of mature RNAs that are usually devoid of regulatory sequences and introns. They have routinely been classified as processed pseudo-genes with little or no biological relevance. However, recent findings have revealed functional roles for retrocopies, as well as their high frequency in some organisms, such as primates. Despite their increasing importance, there is no user-friendly and publicly available resource for the study of retrocopies.
RESULTS: Here, we present RCPedia, an integrative and user-friendly database designed for the study of retrocopied genes. RCPedia contains a complete catalogue of the retrocopies that are known to be present in human and five other primate genomes, their genomic context, inter-species conservation and gene expression data. RCPedia also offers a streamlined data representation and an efficient query system.
AVAILABILITY AND IMPLEMENTATION: RCPedia is available at http://www.bioinfo.mochsl.org.br/rcpedia.

Year:  2013        PMID: 23457042      PMCID: PMC3634192          DOI: 10.1093/bioinformatics/btt104

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Retrocopies are gene copies that are generated by reverse transcription and genomic integration of transcribed mRNAs. Although retrocopies have been described since the early 1980s (Vanin, 1985), their functional roles have only recently been revealed (Ciomborowska et al.; McEntee et al.; Poliseno et al.). Retrocopies occur frequently in many genomes, including those of primates (Marques et al.), and some retrocopies are transcribed and have putative functions [see Kaessmann et al. for a review]. Interestingly, retrocopies have idiosyncrasies that simplify their identification. The four main characteristics are as follows: (i) an original multi-exonic parental gene copy in the genome; (ii) a mono-exonic region, without intronic regions; (iii) a poly-A stretch located in the 3′-most region; and (iv) direct repeats of 8–12 nucleotides (nt) flanking them [see Kaessmann et al. for a review]. These characteristics make retrocopy identification through computational pipelines reasonably straightforward, especially for species for which well-assembled genomes and transcriptomes are available. Despite this, there is still a lack of publicly available and easy-to-use resources dedicated to the study of retrocopies (Kaessmann et al.), making it necessary either to use manual and multi-step approaches to explore retrocopies or to use non-specialized databases, such as the pseudogene databases (e.g. http://www.pseudogene.org/), that contain only basic and/or restricted information. Here, we describe RCPedia, a publicly available database that was developed for the study of retrocopies. RCPedia contains a wide range of information on retrocopied genes from six primate genomes (human, chimpanzee, gorilla, orangutan, rhesus and marmoset), as well as a streamlined graphical data representation and an efficient information query system.

2 DATA RETRIEVAL AND CURATION

2.1 Data sources

The detection of retrocopies in eukaryotic genomes relies on two fundamental datasets: (i) a reference genome sequence and (ii) a set of known transcripts from each organism. The current version of RCPedia is based on genomic data from the UCSC Genome Browser (http://genome.ucsc.edu): human (hg19), chimpanzee (panTro3), gorilla (gorGor3), orangutan (ponAbe2), rhesus (rheMac2) and marmoset (calJac3). We used RefSeq sequences (http://www.ncbi.nih.gov/RefSeq) as the source of known transcripts, except for gorilla, for which there are no RefSeq data. For gorilla, we used Ensembl transcripts (http://www.ensembl.org/). To evaluate retrocopy expression, we re-analysed the publicly available RNA-seq data from six tissues (brain, cerebellum, heart, liver, kidney and testis) of five primates (human, chimpanzee, gorilla, orangutan and rhesus) (Brawand et al.).
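The data sources above can be summarised as a small lookup table. This is only an illustrative sketch: the dictionary layout and the helper `has_expression_data` are assumptions made for the example and are not part of RCPedia; the assembly names, transcript sources, tissues and species are taken from the text.

```python
# Genome assemblies and transcript sources as listed in Section 2.1.
# The data structure itself is a hypothetical layout, not RCPedia's schema.
DATA_SOURCES = {
    "human": {"assembly": "hg19", "transcripts": "RefSeq"},
    "chimpanzee": {"assembly": "panTro3", "transcripts": "RefSeq"},
    "gorilla": {"assembly": "gorGor3", "transcripts": "Ensembl"},
    "orangutan": {"assembly": "ponAbe2", "transcripts": "RefSeq"},
    "rhesus": {"assembly": "rheMac2", "transcripts": "RefSeq"},
    "marmoset": {"assembly": "calJac3", "transcripts": "RefSeq"},
}

# The re-analysed RNA-seq data cover five of the six species (no marmoset).
RNASEQ_SPECIES = {"human", "chimpanzee", "gorilla", "orangutan", "rhesus"}
RNASEQ_TISSUES = ("brain", "cerebellum", "heart", "liver", "kidney", "testis")


def has_expression_data(species: str) -> bool:
    """Return True if the species is covered by the RNA-seq dataset."""
    return species in RNASEQ_SPECIES
```

For example, `has_expression_data("marmoset")` is false, which reflects the fact that expression evidence is available for only five of the six genomes.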

2.2 Identifying orthologous retrocopies

The next step was to determine retrocopy conservation among the six primates. To avoid misidentification, we defined orthologous retroposition events based on conservation of both the retrocopy and its flanking genomic regions. All retrocopies and their flanking regions (3 kb up- and downstream, without repetitive sequences) were aligned against the other primate genomes using BLAT [(Kent, 2002) with the following parameters: -mask=lower; -tileSize=12; -minScore=50; -minIdentity=0]. Only loci that matched the retrocopy and its flanking regions were considered as orthologous and, therefore, conserved.
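The orthology rule can be sketched as a check that the retrocopy and both flanks hit the same neighbourhood in the other genome. The text does not say how "matched" was decided, so the `Hit` type, the `max_span` distance and the decision to ignore strand are all assumptions of this sketch rather than the authors' criteria.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hit:
    """A BLAT hit in the target genome (hypothetical minimal record)."""
    chrom: str
    start: int
    end: int


def _near(a: Hit, b: Hit, max_span: int) -> bool:
    """True if both hits lie on one chromosome within max_span nt."""
    if a.chrom != b.chrom:
        return False
    gap = max(a.start, b.start) - min(a.end, b.end)  # negative if overlapping
    return gap <= max_span


def is_orthologous(retro_hits: Sequence[Hit],
                   upstream_hits: Sequence[Hit],
                   downstream_hits: Sequence[Hit],
                   max_span: int = 10_000) -> bool:
    """Call a locus conserved only if the retrocopy AND both flanks match.

    max_span is an assumed tolerance; the paper does not give one.
    """
    for r in retro_hits:
        if (any(_near(r, u, max_span) for u in upstream_hits)
                and any(_near(r, d, max_span) for d in downstream_hits)):
            return True
    return False
```

Requiring the flanks as well as the retrocopy is what separates a shared insertion event from an independent retrocopy of the same parental gene elsewhere in the genome.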

2.3 Expression data

To detect expressed retrocopies, we developed a stringent multi-step pipeline. First, we searched for chimeric transcripts by analysing all intragenic retrocopies. We used GSNAP (parameters: -t 30; -B 4; --nofails; -A sam; -m 2; -n 1) to align all RNA-seq reads against genomic loci containing intragenic retrocopies (Wu and Nacu, 2010). Then, we selected only the alignments (alignment score >20) that showed two separated blocks (distance between blocks: >42 nt), where one block overlapped the retrocopy and the other aligned with the host gene. Alignments that were not defined by a canonical splicing site (GT-AG) were also filtered out. Intragenic retrocopies that were supported by at least five reads showing this alignment pattern were considered to be expressed. Second, we searched for retrocopy expression per se by aligning all the reads against their respective genomes and transcriptomes. The alignment against the transcriptome data was important for removing false positive alignments derived from exon–exon junctions. Only unique genome matches (alignment score: >40) that passed this transcriptome-based filter were used for gene expression analysis. At least five supporting reads were required for a retrocopy to be considered as expressed.
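The two read-level filters described above can be written as simple predicates. This is a minimal sketch using the thresholds quoted in the text; the record types and field names are hypothetical, and treating a transcriptome match as grounds for discarding a read is one interpretation of the filtering step, not a confirmed detail of the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_READS = 5  # supporting reads required in both analyses


@dataclass(frozen=True)
class SplitAlignment:
    """A read aligned as two blocks (hypothetical record)."""
    score: int
    block_distance: int    # nt between the two aligned blocks
    donor: str             # first two bases of the implied intron
    acceptor: str          # last two bases of the implied intron
    hits_retrocopy: bool   # one block overlaps the retrocopy
    hits_host_gene: bool   # the other block aligns with the host gene


@dataclass(frozen=True)
class GenomeAlignment:
    """A read aligned to the genome (hypothetical record)."""
    score: int
    n_genome_hits: int
    matches_transcriptome: bool  # assumed flag from the transcriptome run


def supports_chimera(a: SplitAlignment) -> bool:
    """Score >20, blocks >42 nt apart, canonical GT-AG, both loci hit."""
    return (a.score > 20 and a.block_distance > 42
            and a.donor.upper() == "GT" and a.acceptor.upper() == "AG"
            and a.hits_retrocopy and a.hits_host_gene)


def supports_expression(a: GenomeAlignment) -> bool:
    """Unique genome match with score >40, not explained by a transcript."""
    return (a.score > 40 and a.n_genome_hits == 1
            and not a.matches_transcriptome)


def chimera_expressed(alns: Iterable[SplitAlignment]) -> bool:
    return sum(supports_chimera(a) for a in alns) >= MIN_READS


def retrocopy_expressed(alns: Iterable[GenomeAlignment]) -> bool:
    return sum(supports_expression(a) for a in alns) >= MIN_READS
```

Keeping the per-read test separate from the five-read threshold makes it easy to vary either one independently when exploring stringency.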

3 DATABASE IMPLEMENTATION

RCPedia consists of a database and a front-end interface. The database was built on MySQL (http://www.mysql.com). The website was developed mainly in PHP (http://www.php.net), using CakePHP (http://cakephp.org) as the framework for an efficient Model-View-Controller front-end. All genomic annotation and gene expression data were processed using Perl (http://www.perl.org) scripts developed in-house.

Briefly, all coding transcripts from RefSeq (and Ensembl for gorilla) were downloaded and aligned against their respective reference genomes using BLAT [(Kent, 2002) with the following parameters: -mask=lower; -tileSize=12; -minIdentity=75; -minScore=100]. All alignments were processed, and sequences with >75% identity and either a sequence alignment length >50% or at least 120 matched nucleotides were selected. Based on the expected genomic characteristics of retrocopies, we designed a four-step strategy to identify them. First, any alignment containing gaps >15 kb in length was eliminated. This step eliminated transcripts with large introns but kept retroelements, such as Long Interspersed Elements (LINEs) (∼6 kb) and Short Interspersed Elements (SINEs) (<1 kb), that are frequently inserted inside retrocopied loci. Second, we retrieved the exon–exon boundary positions from the parental genes, mapped these boundary positions onto the retrocopies and searched for gaps between them. Putative retrocopy alignments that contained one or more such gaps were excluded because they are unlikely to have been derived from retroduplications. Third, only gene copies that contained >50 nt from two or more exons of the parental genes were selected. Finally, we defined the retrocopy set by selecting all remaining alignments and, if necessary, grouping any alignments that were mapped onto the same genomic locus (Supplementary Fig. S1).
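The selection cutoffs and the four-step strategy can be sketched as follows. The authors' pipeline was written in Perl and is not shown in the text, so this Python version is an illustration only: the `TranscriptAlignment` record and its fields are hypothetical, the third step is read here as "at least two parental exons each contributing >50 nt" (the text is ambiguous on this point), and grouping is approximated by merging overlapping intervals.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_GAP_NT = 15_000    # step 1: alignments with longer gaps are dropped
MIN_NT_PER_EXON = 50   # step 3: minimum contribution per parental exon


@dataclass(frozen=True)
class TranscriptAlignment:
    """Summary of one transcript-to-genome alignment (hypothetical)."""
    identity_pct: float
    matched_nt: int
    transcript_len: int
    gap_lengths: Tuple[int, ...]           # genomic gaps between blocks
    gaps_at_exon_boundaries: int           # gaps at mapped exon-exon joins
    nt_per_parental_exon: Tuple[int, ...]  # aligned nt from each exon


def passes_alignment_cutoffs(a: TranscriptAlignment) -> bool:
    """>75% identity and (>50% of the transcript aligned or >=120 nt)."""
    coverage = a.matched_nt / a.transcript_len
    return a.identity_pct > 75 and (coverage > 0.5 or a.matched_nt >= 120)


def is_retrocopy_candidate(a: TranscriptAlignment) -> bool:
    """Steps 1 to 3 of the identification strategy."""
    if any(g > MAX_GAP_NT for g in a.gap_lengths):
        return False  # step 1: a large gap suggests an intron
    if a.gaps_at_exon_boundaries > 0:
        return False  # step 2: a gap at an exon-exon boundary
    exons = sum(1 for n in a.nt_per_parental_exon if n > MIN_NT_PER_EXON)
    return exons >= 2  # step 3: sequence from two or more exons


def group_by_locus(loci: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Step 4: merge (chrom, start, end) intervals that overlap."""
    merged: List[Tuple[str, int, int]] = []
    for chrom, start, end in sorted(loci):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Note that a 6 kb gap, such as a LINE inserted inside a retrocopy, passes the first step, whereas the parental gene itself is rejected by the second step because its introns sit exactly at the exon–exon boundaries.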

4 DATABASE QUERY INTERFACE AND OUTPUT VISUALIZATION

4.1 The query system

The RCPedia query system is easy to use, complete and fast. It supports searches by gene (e.g. GAPDH), chromosome (e.g. chr17), genomic position (e.g. chr17:28 102 500–29 112 200), gene alias (e.g. RAS) and gene annotation keyword (e.g. kinase or oncogene), making it easy for the user to explore the genes and genomic locations associated with retrocopy events.
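As an illustration of how such mixed queries might be dispatched, the sketch below separates chromosome and genomic-position queries from free-text ones. It is not RCPedia's implementation (the site is written in PHP); the function name, the regular expressions and the returned tuples are assumptions. Gene names, aliases and keywords all fall into the "text" category here because telling them apart would require a database lookup.

```python
import re
from typing import Tuple, Union

# "chr" must be lowercase so that gene symbols such as CHRNA1 stay text.
_CHROM = r"chr(?:\d{1,2}[AB]?|[XYM])"
CHROM_RE = re.compile(rf"^{_CHROM}$")
# Coordinates may carry comma or space thousands separators;
# the range separator may be a hyphen or an en dash.
POSITION_RE = re.compile(rf"^({_CHROM}):(\d[\d,\s]*)[-–]\s*(\d[\d,\s]*)$")

Region = Tuple[str, int, int]


def _to_int(text: str) -> int:
    """Strip thousands separators and convert to an integer."""
    return int(re.sub(r"[,\s]", "", text))


def parse_query(query: str) -> Tuple[str, Union[str, Region]]:
    """Classify a query as 'position', 'chromosome' or 'text'."""
    query = query.strip()
    match = POSITION_RE.match(query)
    if match:
        chrom, start, end = match.groups()
        return ("position", (chrom, _to_int(start), _to_int(end)))
    if CHROM_RE.match(query):
        return ("chromosome", query)
    return ("text", query)
```

With this sketch, `parse_query("chr17")` is classified as a chromosome query, while `parse_query("GAPDH")` and `parse_query("kinase")` are passed on as text.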

4.2 Results

Because there are many unnamed retrocopies, the search results in RCPedia are based on parental gene names. The results of a query can be presented from two data visualization perspectives: (i) the parental gene perspective, which helps the user to visualize all retrocopy events of a given parental gene, together with, for example, their genomic loci and sequence identity (for the full dataset, see the website), and (ii) the retrocopy perspective, which displays information such as the genomic context of a retrocopy, its identity to the parental gene, its conservation in other species and its expression (see Supplementary Fig. S2 for a schematic view).

5 USING RCPedia

To show how RCPedia can be used, we selected the human gene DHFR as a sample query. RCPedia reported five retrocopies for DHFR in the human genome (Supplementary Fig. S2). Interestingly, one of the retrocopies was present only in the human genome. Another retrocopy was expressed in four human tissues (Supplementary Fig. S2), and it was reported previously that this locus is expressed and has a putative function (McEntee et al.).

6 CONCLUSION

RCPedia is a well-organized and user-friendly resource, with a streamlined graphical representation, dedicated to the study of retrocopies in primate genomes. To the best of our knowledge, RCPedia is the most comprehensive publicly available database in this field, although some resources provide similar information (Karro et al.; Khelifi et al.; Ortutay and Vihinen, 2008). We strongly believe that RCPedia will significantly improve the annotation and functional characterization of retrocopies present in primate genomes.

REFERENCES

1.  BLAT--the BLAST-like alignment tool.

Authors:  W James Kent
Journal:  Genome Res       Date:  2002-04       Impact factor: 9.043

Review 2.  Processed pseudogenes: characteristics and evolution.

Authors:  E F Vanin
Journal:  Annu Rev Genet       Date:  1985       Impact factor: 16.830

3.  The former annotated human pseudogene dihydrofolate reductase-like 1 (DHFRL1) is expressed and functional.

Authors:  Gráinne McEntee; Stefano Minguzzi; Kirsty O'Brien; Nadia Ben Larbi; Christine Loscher; Ciarán O'Fágáin; Anne Parle-McDermott
Journal:  Proc Natl Acad Sci U S A       Date:  2011-08-26       Impact factor: 11.205

4.  Fast and SNP-tolerant detection of complex variants and splicing in short reads.

Authors:  Thomas D Wu; Serban Nacu
Journal:  Bioinformatics       Date:  2010-02-10       Impact factor: 6.937

Review 5.  RNA-based gene duplication: mechanistic and evolutionary insights.

Authors:  Henrik Kaessmann; Nicolas Vinckenbosch; Manyuan Long
Journal:  Nat Rev Genet       Date:  2009-01       Impact factor: 53.242

6.  A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.

Authors:  Laura Poliseno; Leonardo Salmena; Jiangwen Zhang; Brett Carver; William J Haveman; Pier Paolo Pandolfi
Journal:  Nature       Date:  2010-06-24       Impact factor: 49.962

7.  Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation.

Authors:  John E Karro; Yangpan Yan; Deyou Zheng; Zhaolei Zhang; Nicholas Carriero; Philip Cayting; Paul Harrison; Mark Gerstein
Journal:  Nucleic Acids Res       Date:  2006-11-11       Impact factor: 16.971

8.  Emergence of young human genes after a burst of retroposition in primates.

Authors:  Ana Claudia Marques; Isabelle Dupanloup; Nicolas Vinckenbosch; Alexandre Reymond; Henrik Kaessmann
Journal:  PLoS Biol       Date:  2005-10-11       Impact factor: 8.029

9.  HOPPSIGEN: a database of human and mouse processed pseudogenes.

Authors:  Adel Khelifi; Laurent Duret; Dominique Mouchiroud
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

10.  "Orphan" retrogenes in the human genome.

Authors:  Joanna Ciomborowska; Wojciech Rosikiewicz; Damian Szklarczyk; Wojciech Makałowski; Izabela Makałowska
Journal:  Mol Biol Evol       Date:  2012-10-12       Impact factor: 16.240
