
A gene mapping bottleneck in the translational route from zebrafish to human.

Niek de Klein¹, Mark Ibberson², Isaac Crespo², Sophie Rodius³, Francisco Azuaje³.

Abstract

Among a diversity of animal models of disease, the zebrafish is a promising model organism for enabling novel translational biomedical research. To fully achieve the latter, a key requirement is to match molecular readouts measured in zebrafish with information relevant to health and disease in humans. A fundamental step in this direction is to accurately map gene sequences from zebrafish to humans. Despite significant progress in genome annotation, this remains an intricate and time-consuming challenge. Here we discuss major obstacles that we had to overcome to systematically map genes from zebrafish to human. We identified important disparities, as well as partial agreements, between five public zebrafish-to-human homology resources. There is still a need for standardized, comprehensive genomic mappings between zebrafish and humans. Without this, efforts to use zebrafish as a powerful translational research tool will be stalled.


Keywords: genome annotation; orthology inference; translational research; zebrafish; zebrafish-to-human gene mapping

Year: 2015 PMID: 25628646 PMCID: PMC4290677 DOI: 10.3389/fgene.2014.00470

Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599

LAYING OUT THE ROAD

The zebrafish (Danio rerio) is a powerful and relatively low-cost tool for fundamental and translational biomedical research. It offers in vivo models with potential clinical relevance, which are valuable for elucidating disease mechanisms and identifying novel therapeutic targets and candidate therapeutics (Mione and Trede, 2010; Gemberling et al., 2013; White et al., 2013; Kikuchi, 2014). Substantial efforts to enable zebrafish research are reflected in ongoing initiatives in the USA, Europe, and elsewhere[1]. Examples of recent significant outcomes include the Zebrafish Model Organism Database (ZFIN[2]; Bradford et al., 2011) and the publication of the zebrafish reference genome sequence (Howe et al., 2013). The latter estimated that about 70% of human genes have at least one unambiguous zebrafish ortholog (Howe et al., 2013). Notwithstanding the quality and applicability of these advances, key challenges remain in translating findings from zebrafish to humans on the basis of genome-wide sequence mapping. In this translational route, we face a heavy bottleneck early on.

THE NEED FOR ACCURATE AND COMPREHENSIVE GENE MAPPING

A crucial task in translational research applications is linking molecular readouts measured in zebrafish to information relevant to human health and disease. To accomplish this, a key requirement is the matching of gene sequences from zebrafish to humans at a genome-wide scale. This step goes beyond the automated conversion of gene symbols, and often involves associating multiple homologous sequences that are included in gene expression microarrays or RNA sequencing experiments. A realistic scenario begins, for example, with the identification of a set of genes that are differentially expressed between pathological and control states, e.g., disease vs. healthy phenotypes. The resulting gene list may be mapped to homologous genes in humans using HomoloGene[3] (Wheeler et al., 2001; Acland et al., 2014) and ZFIN (Bradford et al., 2011). Additionally, researchers may need to map sequences between specific microarray platforms, for example from Affymetrix’s GeneChip Zebrafish Genome Array to GeneChip Human Genome U133A (goo.gl/d3lCPL). This is an essential prerequisite for searching and matching “omics” profiles related to disease and drug responses, which are stored in different databases. Prominent examples of the latter are the Gene Expression Omnibus (GEO[4]; Barrett et al., 2013), ArrayExpress[5] (Rustici et al., 2013) and the Connectivity Map (cMAP[6]; Lamb et al., 2006; Lamb, 2007).
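As a minimal sketch of the gene-level lookup described above, the following Python snippet groups the rows of a HomoloGene-style table by homology group and returns the human members for each zebrafish gene. The six-column tab-separated layout (group ID, taxonomy ID, gene ID, symbol, protein GI, protein accession) is an assumption about the input file, the rows are made-up placeholders rather than real records, and `human_homologs` is a hypothetical helper name. The NCBI taxonomy IDs 7955 (Danio rerio) and 9606 (Homo sapiens) are the only real identifiers used.

```python
import csv
import io
from collections import defaultdict

ZEBRAFISH_TAXID = "7955"  # NCBI taxonomy ID for Danio rerio
HUMAN_TAXID = "9606"      # NCBI taxonomy ID for Homo sapiens

# Placeholder rows (NOT real records), assumed column order:
# group ID, taxonomy ID, gene ID, symbol, protein GI, protein accession
TOY_TABLE = (
    "1\t7955\t1001\tgenea\t0\tNP_A\n"
    "1\t9606\t2001\tGENEA\t0\tNP_B\n"
    "2\t7955\t1002\tgeneb\t0\tNP_C\n"   # group without a human member
    "3\t7955\t1003\tgenec\t0\tNP_D\n"
    "3\t9606\t2003\tGENEC1\t0\tNP_E\n"
    "3\t9606\t2004\tGENEC2\t0\tNP_F\n"  # one zebrafish gene, two human genes
)


def human_homologs(table_text, zebrafish_gene_ids):
    """Map each zebrafish gene ID to the sorted human gene IDs in its group."""
    group_of = {}                 # zebrafish gene ID -> homology group ID
    humans_in = defaultdict(set)  # homology group ID -> human gene IDs
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        group, taxid, gene_id = row[0], row[1], row[2]
        if taxid == ZEBRAFISH_TAXID:
            group_of[gene_id] = group
        elif taxid == HUMAN_TAXID:
            humans_in[group].add(gene_id)
    # Genes absent from the table, or in groups without human members, map to [].
    return {
        gene: sorted(humans_in.get(group_of.get(gene), ()))
        for gene in zebrafish_gene_ids
    }


mapping = human_homologs(TOY_TABLE, ["1001", "1002", "1003", "9999"])
unmapped = [gene for gene, hits in mapping.items() if not hits]
```

Keeping unmapped genes in the result with an empty list, rather than dropping them, makes the coverage gap of a single resource explicit.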

PRACTICAL CHALLENGES REMAIN

We recently implemented this process as part of a drug repositioning project that uses the zebrafish as an in vivo model of heart regeneration. In this particular case, we aimed to match sequence probes from zebrafish to human using the microarray chips indicated above. Initially, we expected that a single mapping resource, such as HomoloGene (Wheeler et al., 2001; Acland et al., 2014), would allow us to go from zebrafish gene symbols to human homolog identifiers (e.g., Entrez database IDs) in a relatively straightforward way. After testing the available options, we found that no single resource provides up-to-date, comprehensive genome-scale mappings. To overcome this obstacle, we implemented a pipeline that incorporated five gene mapping resources: (1) HomoloGene (Wheeler et al., 2001; Acland et al., 2014), (2) Biomart[7] (Kasprzyk, 2011), (3) a conversion file provided by Affymetrix, (4) ZFIN (Bradford et al., 2011), and (5) BLAST homology searches performed at our laboratory (Altschul et al., 1990, 1997; Ye et al., 2006; Figure 1). Most of these resources (HomoloGene, ZFIN, and Biomart) apply a combination of expert curation of orthology relationships found in the literature, manual orthology analysis, and purely computational prediction of orthology. The other two resources are based on associations provided by the microarray manufacturer and on our own computational predictions, without deep expert curation. The human homolog Entrez IDs resulting from each procedure were compared, and overlaps among them were identified (Figure 2).
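The comparison step at the end of this pipeline can be illustrated with plain set operations. The sketch below is not the authors' implementation: it assumes each resource has already been reduced to a set of human Entrez IDs, the IDs shown are placeholders, and `provenance` and `pairwise_overlaps` are hypothetical helper names.

```python
from itertools import combinations

# Placeholder human Entrez IDs reported by each resource (NOT real data).
RESOURCES = {
    "HomoloGene": {"101", "102"},
    "Biomart": {"101", "103"},
    "Affymetrix": {"101", "102", "105"},
    "ZFIN": {"101", "102", "103"},
    "BLAST": {"101", "102", "104"},
}


def provenance(resources):
    """Human gene ID -> sorted names of the resources that report it."""
    support = {}
    for name, genes in resources.items():
        for gene in genes:
            support.setdefault(gene, []).append(name)
    return {gene: sorted(names) for gene, names in support.items()}


def pairwise_overlaps(resources):
    """Intersection size for every pair of resources (names in sorted order)."""
    return {
        (a, b): len(resources[a] & resources[b])
        for a, b in combinations(sorted(resources), 2)
    }


support = provenance(RESOURCES)
union = set(support)                                  # mapped by any resource
shared_by_all = set.intersection(*RESOURCES.values())  # mapped by all five
only_in_blast = {g for g, names in support.items() if names == ["BLAST"]}
overlaps = pairwise_overlaps(RESOURCES)
```

Recording which resources support each gene, instead of only the merged set, keeps the evidence available for later confidence weighting.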

FIGURE 1

Mapping zebrafish to human sequences via five annotated resources. Sequence probes from Affymetrix’s GeneChip Zebrafish Genome Array were mapped to probes in GeneChip Human Genome U133A. Each mapping pipeline is based on a single resource independently: (1) HomoloGene (Wheeler et al., 2001; Acland et al., 2014), (2) Biomart (Kasprzyk, 2011), (3) conversion file provided by Affymetrix, (4) ZFIN (Bradford et al., 2011), and (5) BLAST homology searches performed at our laboratory (Altschul et al., 1990, 1997; Ye et al., 2006). In the latter method we focused on the most statistically significant BLAST match per query.

FIGURE 2

Gene mapping agreement between the different homology annotation resources. The number of mapped genes is shown for each resource and for each between-resource intersection.

Using the five methods, a total of 12,593 human genes (Entrez IDs) were mapped from the 13,287 zebrafish sequence probes used as inputs (94.8%). Among the public resources, ZFIN (Bradford et al., 2011) reported the largest number of mapped genes (8533), followed by the Affymetrix conversion file (7580), HomoloGene (Wheeler et al., 2001; Acland et al., 2014) (7001), and Biomart (Kasprzyk, 2011) (6221). Our BLAST-based analysis (Altschul et al., 1990, 1997; Ye et al., 2006) generated 11,605 matches, including 1991 mappings that were not found in any of the other resources. Together, the five resources jointly agreed on only 3074 mappings (24.4%). There were also relatively narrow overlaps between HomoloGene (Wheeler et al., 2001; Acland et al., 2014), a high-quality expert-annotated database, and the other resources (Figure 2). When two or more resources produced a mapping for any given zebrafish gene there was always an agreement in their mappings. Considering these disparities and incompleteness, we decided to incorporate all available mapping evidence into the subsequent stages of our project. Although we assigned more confidence to mappings originating from expert-curated databases or to those supported by multiple resources (Figure 2), we also had to consider on a case-by-case basis those instances mapped only by a single source (e.g., Affymetrix file) or available only in our BLAST search (Altschul et al., 1990, 1997; Ye et al., 2006). This semi-automatic, time-consuming conversion process was required for every zebrafish-derived candidate signature obtained in our project.

The major take-home message from Figure 2 is that, whilst a large number of orthology predictions overlap between the five resources, this amounts to just over 24% of the total number of annotated genes. A straightforward voting approach that assigns only those orthologs reported by all resources would therefore ignore about 75% of genes, which is unacceptable for any genome-wide experiment. Even when taking just two resources such as HomoloGene (Wheeler et al., 2001; Acland et al., 2014) and ZFIN (Bradford et al., 2011), 1648 (24%) of the HomoloGene mappings are not found in ZFIN. The situation is even more striking for ZFIN, where 3186 (37%) of mappings are not found in HomoloGene. Thus, the interpretation of whole genome experiments from zebrafish in a human context will be strongly affected simply by the choice of resource used for the orthology mapping.
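The percentages quoted above follow directly from the reported counts. The short Python check below recomputes them; the variable names and the `pct` helper are illustrative, and only the counts themselves come from the text.

```python
# Counts quoted in the text
total_probes = 13287            # zebrafish sequence probes used as input
total_mapped = 12593            # human genes (Entrez IDs) mapped by any method
agreed_by_all_five = 3074       # mappings shared by all five resources
homologene_total = 7001
zfin_total = 8533
homologene_not_in_zfin = 1648
zfin_not_in_homologene = 3186


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


coverage = pct(total_mapped, total_probes)         # share of probes mapped
consensus = pct(agreed_by_all_five, total_mapped)  # share agreed by all five
lost_by_strict_voting = round(100.0 - consensus, 1)
homologene_missing = pct(homologene_not_in_zfin, homologene_total)
zfin_missing = pct(zfin_not_in_homologene, zfin_total)

for label, value in [
    ("mapped by any resource", coverage),
    ("agreed by all five", consensus),
    ("dropped by strict consensus", lost_by_strict_voting),
    ("HomoloGene mappings absent from ZFIN", homologene_missing),
    ("ZFIN mappings absent from HomoloGene", zfin_missing),
]:
    print(f"{label}: {value}%")
```

The two pairwise figures round to the 24% and 37% given in the text, and the strict-consensus loss corresponds to the "75% of genes" that a unanimous voting scheme would discard.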

OVERCOMING OBSTACLES TO ENABLING RESEARCH

Our experience illustrates that performing zebrafish-to-human gene mapping remains a major challenge. This is a critical requirement for enabling research in different application domains. As zebrafish becomes a widely adopted genetic and systems biology model of disease, comprehensive and accurate zebrafish-to-human gene mapping represents a fundamental need. The complexity of this endeavor is magnified by intertwined evolutionary and genomics factors, including the considerable levels of gene similarity at the genome and gene family levels. Finding maximal, high-quality sets of orthology relationships is currently constrained by the incompleteness of the zebrafish genome assembly, and in general by the evolutionary separation between species. Although our analysis considered (zebrafish) microarray probes that are linked to previously annotated zebrafish genes, this is an important factor to consider regardless of gene selection scheme or the lack of agreement between databases. The de novo implementation of this process may be too time-consuming or even impractical for many laboratories, in particular those with limited bioinformatics resources. Moreover, even if bioinformatics capacity is available, the required information will continue maturing and evolving. This is very likely as the annotation of the zebrafish genome goes deeper and new evidence about gene function emerges. Furthermore, progress will be accompanied by a fast-growing interest in non-coding RNA sequences.

On one level, the agreement between gene homology resources highlights the strength of confidence in such zebrafish-to-human mappings. On a gene-by-gene basis, databases that make major efforts to incorporate expert curation, ZFIN in particular, are likely to offer the highest quality relationships when those mappings are available. On another level, the considerable complementarity among these resources underlines the need for further annotation efforts, as well as their integration and standardization. Future comparisons could include other resources of orthology inference that were not considered in our analysis, such as the PANTHER classification system[8]. The incorporation of curation “evidence codes” (e.g., literature-extracted vs. manual orthology analysis) may also benefit the usage and integration of available resources. Future work could also benefit from incorporating phylogenetic evidence using multiple animal models and species.

Until a more standardized solution exists, researchers should not rely on a single resource for zebrafish ortholog mapping. Rather, we recommend using a combination of several resources and performing focused manual annotation on subsets of genes falling between annotation categories. To keep such numbers manageable, this annotation could be restricted to small subsets of genes showing key biological relevance for the experiment in question. Ideally, such annotation would be fed back into manually curated resources such as ZFIN (Bradford et al., 2011), thus making the annotations available to other researchers in the field. Researchers are welcome to request from us the multi-source mapping data discussed in this article.

The zebrafish can provide us with significant biological insights and novel directions for therapeutic interventions in a wide range of disease domains, including cardiovascular disease and cancers. To accomplish this vision, comprehensive and accurate zebrafish-to-human gene mapping is still necessary. Further public standardized efforts are needed. This will greatly depend on stronger support from research funders, researchers and other stakeholders.
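The recommendation above, combining several resources and reserving manual annotation for a small set of biologically relevant genes, could be sketched as a simple triage step. The tier names, the `min_support` threshold, and the rule itself are assumptions made for illustration, not a scheme defined in this article; the set of curated resources follows the text's description of HomoloGene, ZFIN, and Biomart, and the gene names are placeholders.

```python
# Resources described in the text as incorporating expert curation.
EXPERT_CURATED = {"HomoloGene", "ZFIN", "Biomart"}


def triage(supporting_resources, min_support=2):
    """Assign a hypothetical review tier to one zebrafish-to-human mapping.

    high:          several resources agree and at least one is expert-curated
    medium:        several resources agree, or a single curated resource
    manual_review: a single non-curated resource (e.g., BLAST only)
    unmapped:      no resource provides a mapping
    """
    support = set(supporting_resources)
    if not support:
        return "unmapped"
    curated = bool(support & EXPERT_CURATED)
    multiple = len(support) >= min_support
    if multiple and curated:
        return "high"
    if multiple or curated:
        return "medium"
    return "manual_review"


def manual_review_queue(mappings, genes_of_interest):
    """Genes needing manual annotation, restricted to those relevant to the experiment."""
    return sorted(
        gene
        for gene, resources in mappings.items()
        if gene in genes_of_interest and triage(resources) == "manual_review"
    )


# Placeholder example: zebrafish gene -> resources supporting its human mapping.
example = {
    "gene_a": ["HomoloGene", "ZFIN", "BLAST"],
    "gene_b": ["BLAST"],
    "gene_c": ["Affymetrix"],
    "gene_d": ["ZFIN"],
    "gene_e": [],
}
queue = manual_review_queue(example, genes_of_interest={"gene_a", "gene_b", "gene_e"})
```

Restricting the queue to `genes_of_interest` reflects the suggestion to keep manual annotation manageable; a real project would likely refine the tiers with curation evidence codes where those are available.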

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

2. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease.

Authors: Justin Lamb; Emily D Crawford; David Peck; Joshua W Modell; Irene C Blat; Matthew J Wrobel; Jim Lerner; Jean-Philippe Brunet; Aravind Subramanian; Kenneth N Ross; Michael Reich; Haley Hieronymus; Guo Wei; Scott A Armstrong; Stephen J Haggarty; Paul A Clemons; Ru Wei; Steven A Carr; Eric S Lander; Todd R Golub
Journal: Science Date: 2006-09-29 Impact factor: 47.728

3. The zebrafish as a model for cancer.

Authors: Marina C Mione; Nikolaus S Trede
Journal: Dis Model Mech Date: 2010-03-30 Impact factor: 5.758

4. Advances in understanding the mechanism of zebrafish heart regeneration. (Review)

Authors: Kazu Kikuchi
Journal: Stem Cell Res Date: 2014-07-19 Impact factor: 2.020

5. Database resources of the National Center for Biotechnology Information.

Authors: D L Wheeler; D M Church; A E Lash; D D Leipe; T L Madden; J U Pontius; G D Schuler; L M Schriml; T A Tatusova; L Wagner; B A Rapp
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

6. The zebrafish reference genome sequence and its relationship to the human genome.

Authors: Kerstin Howe; Matthew D Clark; Carlos F Torroja; James Torrance; Camille Berthelot; Matthieu Muffato; John E Collins; Sean Humphray; Karen McLaren; Lucy Matthews; Stuart McLaren; Ian Sealy; Mario Caccamo; Carol Churcher; Carol Scott; Jeffrey C Barrett; Romke Koch; Gerd-Jörg Rauch; Simon White; William Chow; Britt Kilian; Leonor T Quintais; José A Guerra-Assunção; Yi Zhou; Yong Gu; Jennifer Yen; Jan-Hinnerk Vogel; Tina Eyre; Seth Redmond; Ruby Banerjee; Jianxiang Chi; Beiyuan Fu; Elizabeth Langley; Sean F Maguire; Gavin K Laird; David Lloyd; Emma Kenyon; Sarah Donaldson; Harminder Sehra; Jeff Almeida-King; Jane Loveland; Stephen Trevanion; Matt Jones; Mike Quail; Dave Willey; Adrienne Hunt; John Burton; Sarah Sims; Kirsten McLay; Bob Plumb; Joy Davis; Chris Clee; Karen Oliver; Richard Clark; Clare Riddle; David Elliot; David Eliott; Glen Threadgold; Glenn Harden; Darren Ware; Sharmin Begum; Beverley Mortimore; Beverly Mortimer; Giselle Kerry; Paul Heath; Benjamin Phillimore; Alan Tracey; Nicole Corby; Matthew Dunn; Christopher Johnson; Jonathan Wood; Susan Clark; Sarah Pelan; Guy Griffiths; Michelle Smith; Rebecca Glithero; Philip Howden; Nicholas Barker; Christine Lloyd; Christopher Stevens; Joanna Harley; Karen Holt; Georgios Panagiotidis; Jamieson Lovell; Helen Beasley; Carl Henderson; Daria Gordon; Katherine Auger; Deborah Wright; Joanna Collins; Claire Raisen; Lauren Dyer; Kenric Leung; Lauren Robertson; Kirsty Ambridge; Daniel Leongamornlert; Sarah McGuire; Ruth Gilderthorp; Coline Griffiths; Deepa Manthravadi; Sarah Nichol; Gary Barker; Siobhan Whitehead; Michael Kay; Jacqueline Brown; Clare Murnane; Emma Gray; Matthew Humphries; Neil Sycamore; Darren Barker; David Saunders; Justene Wallis; Anne Babbage; Sian Hammond; Maryam Mashreghi-Mohammadi; Lucy Barr; Sancha Martin; Paul Wray; Andrew Ellington; Nicholas Matthews; Matthew Ellwood; Rebecca Woodmansey; Graham Clark; James D Cooper; James Cooper; Anthony Tromans; Darren Grafham; Carl Skuce; Richard 
Pandian; Robert Andrews; Elliot Harrison; Andrew Kimberley; Jane Garnett; Nigel Fosker; Rebekah Hall; Patrick Garner; Daniel Kelly; Christine Bird; Sophie Palmer; Ines Gehring; Andrea Berger; Christopher M Dooley; Zübeyde Ersan-Ürün; Cigdem Eser; Horst Geiger; Maria Geisler; Lena Karotki; Anette Kirn; Judith Konantz; Martina Konantz; Martina Oberländer; Silke Rudolph-Geiger; Mathias Teucke; Christa Lanz; Günter Raddatz; Kazutoyo Osoegawa; Baoli Zhu; Amanda Rapp; Sara Widaa; Cordelia Langford; Fengtang Yang; Stephan C Schuster; Nigel P Carter; Jennifer Harrow; Zemin Ning; Javier Herrero; Steve M J Searle; Anton Enright; Robert Geisler; Ronald H A Plasterk; Charles Lee; Monte Westerfield; Pieter J de Jong; Leonard I Zon; John H Postlethwait; Christiane Nüsslein-Volhard; Tim J P Hubbard; Hugues Roest Crollius; Jane Rogers; Derek L Stemple
Journal: Nature Date: 2013-04-17 Impact factor: 49.962

7. BioMart: driving a paradigm change in biological data management.

Authors: Arek Kasprzyk
Journal: Database (Oxford) Date: 2011-11-13 Impact factor: 3.451

8. BLAST: improvements for better sequence analysis.

Authors: Jian Ye; Scott McGinnis; Thomas L Madden
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

9. Database resources of the National Center for Biotechnology Information.

Authors:
Journal: Nucleic Acids Res Date: 2013-11-19 Impact factor: 16.971

10. ArrayExpress update--trends in database growth and links to data analysis tools.

Authors: Gabriella Rustici; Nikolay Kolesnikov; Marco Brandizi; Tony Burdett; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Emma Hastings; Jon Ison; Maria Keays; Natalja Kurbatova; James Malone; Roby Mani; Annalisa Mupo; Rui Pedro Pereira; Ekaterina Pilicheva; Johan Rung; Anjan Sharma; Y Amy Tang; Tobias Ternent; Andrew Tikhonov; Danielle Welter; Eleanor Williams; Alvis Brazma; Helen Parkinson; Ugis Sarkans
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
