AnnotQTL: a new tool to gather functional and comparative information on a genomic region.

F Lecerf, A Bretaudeau, O Sallou, C Desert, Y Blum, S Lagarrigue, O Demeure.

Abstract

AnnotQTL is a web tool designed to aggregate functional annotations from different prominent web sites by minimizing the redundancy of information. Although thousands of QTL regions have been identified in livestock species, most of them are large and contain many genes. This tool was therefore designed to assist the characterization of genes in a QTL interval region as a step towards selecting the best candidate genes. It localizes the gene to a specific region (using NCBI and Ensembl data) and adds the functional annotations available from other databases (Gene Ontology, Mammalian Phenotype, HGNC and Pubmed). Both human genome and mouse genome can be aligned with the studied region to detect synteny and segment conservation, which is useful for running inter-species comparisons of QTL locations. Finally, custom marker lists can be included in the results display to select the genes that are closest to your most significant markers. We use examples to demonstrate that in just a couple of hours, AnnotQTL is able to identify all the genes located in regions identified by a full genome scan, with some highlighted based on both location and function, thus considerably increasing the chances of finding good candidate genes. AnnotQTL is available at http://annotqtl.genouest.org.

Year: 2011 PMID: 21596783 PMCID: PMC3125768 DOI: 10.1093/nar/gkr361

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The final steps of genetic mapping research programs require close analysis of several QTL regions to select candidate genes for further studies. Despite several websites (NCBI genome browser, Ensembl Browser, UCSC Genome Browser) or web tools (Biomart, Galaxy) developed to achieve this task, the selection of candidate genes remains a laborious process. The information made available on the more prominent web sites differs slightly in terms of gene prediction and functional annotation, while other websites provide extra information that researchers may want to use (HGNC approved gene symbols, Gene Ontology (GO) Annotation or functional data, conservation of synteny with other species, etc.). It is possible to manually merge and compare this information for one QTL containing few genes, but not for many different QTL regions containing dozens of genes. Here, we propose a web tool that, for a given region of interest, merges the list of genes available in NCBI and Ensembl, removes redundancy, adds functional annotations from different prominent web sites, and highlights the genes for which functional annotation fits the biological function or diseases of interest. The tool is dedicated to sequenced species of livestock including cattle, pig, chicken and horse as well as dog, i.e. species that have been extensively studied (with over 8000 QTLs detected; see http://www.animalgenome.org/cgi-bin/QTLdb/index). Nevertheless, because of the family designs and the low number of animals used in these species, most of the studies use linkage analysis, and the QTL regions identified remain large (containing dozens of genes). Conversely, in human and model species, most analyses now draw heavily on association studies involving large cohorts, thus providing more power and accuracy, and the web tools already available focus on these species through functional annotation of SNPs in association with the trait (1–8). 
Most of these tools focus on the SNP annotation itself, describing whether the SNP is located in a gene, or even in a coding sequence, and defining if it could have a functional effect. While these web tools are highly efficient in providing a good annotation for specific SNPs, they clearly cannot be used to collect information on the large regions obtained in livestock species.

METHODS

The main objective of AnnotQTL is to minimize redundancy so as to display the maximum amount of information from several sources on the genes in the region of interest. The main AnnotQTL program is implemented in PERL. The data are downloaded from several FTPs or websites (see Figure 1 for details on the data and fields used) and stored on our server for further computation. Location and annotation data from Ensembl are downloaded via BioMart (9) using MartService. AnnotQTL gives several sets of information from comparative mapping of selected species against the human and mouse genomes using a local dump of the data provided by the Narcisse web site (10) and orthologous gene information from the Ensembl comparative database. PERL scripts import the downloaded files into our SQL databases. All PERL scripts and official GO database updates are inserted into a BioMaj (11) workflow to automate the updating process. Updates are performed monthly.
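As an illustration of the BioMart step, the sketch below builds the kind of XML query that a MartService endpoint accepts for a chromosomal interval. It is only a sketch under stated assumptions: the endpoint URL, the dataset name (`btaurus_gene_ensembl`) and the filter and attribute names are taken from the public Ensembl BioMart and are not given in the article, and the function names are hypothetical. AnnotQTL itself is written in Perl; Python is used here purely for readability. No network request is made.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed public Ensembl BioMart endpoint (not stated in the article).
MARTSERVICE_URL = "http://www.ensembl.org/biomart/martservice"


def build_region_query(dataset, chromosome, start, end, attributes):
    """Return a BioMart XML query selecting genes in a chromosomal interval."""
    # Filter names are assumptions based on the Ensembl gene datasets.
    filters = {"chromosome_name": chromosome, "start": start, "end": end}
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE Query>",
        '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">',
        f'  <Dataset name={quoteattr(dataset)} interface="default">',
    ]
    for name, value in filters.items():
        lines.append(f"    <Filter name={quoteattr(name)} value={quoteattr(str(value))}/>")
    for name in attributes:
        lines.append(f"    <Attribute name={quoteattr(name)}/>")
    lines += ["  </Dataset>", "</Query>"]
    return "\n".join(lines)


def build_request_url(query_xml):
    """Encode the XML query as the 'query' parameter of a MartService URL."""
    return MARTSERVICE_URL + "?" + urlencode({"query": query_xml})


# Example: bovine chromosome 2, 0-8 Mb (the interval used in the APPLICATION section).
query_xml = build_region_query(
    "btaurus_gene_ensembl", "2", 1, 8_000_000,
    ["ensembl_gene_id", "external_gene_name", "start_position", "end_position"],
)
request_url = build_request_url(query_xml)
print(query_xml)
```

The resulting URL could be fetched with any HTTP client and the tab-separated response loaded into a local table, which matches the download-then-import pattern described above.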

Figure 1.

Schematic diagram of the database and source data files. The table map_seq is filled using file xxx_seq_gene.md.gz, where xxx is the species name, located in the NCBI FTP directory: /genomes/xxx/mapview. The tables gene_info, gene2accession, gene2go, gene2ensembl and gene2pubmed are filled using data files stored in the NCBI FTP directory: /gene/DATA. The table biomart_xxx is filled using the BioMart service for the Ensembl databases. For each species, a SQL table is created to store SNP data (here, only one is detailed). The data files are downloaded from NCBI FTP directory: /snp/organisms/xxx/chr_rpts (where xxx is species name). The tables mp2mp, mp2term and mp2assoc are filled using files HMD_HumanPhenotype.rpt and MPheno_OBO.ontology available from the MGI FTP site (ftp.informatics.jax.org/pub/reports). The hgnc table is filled using the data stored at the GeneName website together with its LWP agent (http://www.genenames.org/cgi-bin/hgnc_downloads.cgi). The table omim_genemap is filled using the data file located in the NCBI FTP directory /repository/OMIM. The table narcisse_synt is filled using the comparative data provided by the Narcisse website (http://narcisse.toulouse.inra.fr).

The principle and workflow of AnnotQTL are depicted in Figure 2. Starting with the genome coordinates entered by the user, the program extracts the NCBI GeneID of genes contained in the region, the corresponding annotations (name, description, symbol and so on), plus the associated cross-references (RNA accession number, protein accession number and Ensembl identifier) and the Pubmed identifier. This Pubmed identifier is specific to the requested species and does not list the publications related to this gene in other species. Using the same genome coordinates, the program then extracts Ensembl ID, gene annotation and human and mouse ortholog gene identifiers from the Ensembl database.
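The coordinate-based extraction can be pictured as a range query joining a location table to an annotation table. The following is a minimal sketch, not the authors' code: the table names map_seq and gene_info come from Figure 1, but every column name, the use of SQLite, the overlap criterion (a gene is kept if it overlaps the interval at all) and the toy rows are assumptions made for illustration.

```python
import sqlite3


def genes_in_region(conn, chromosome, start, end):
    """Return (gene_id, symbol, start, stop) for genes overlapping the interval."""
    sql = """
        SELECT m.gene_id, g.symbol, m.chr_start, m.chr_stop
        FROM map_seq AS m
        JOIN gene_info AS g ON g.gene_id = m.gene_id
        WHERE m.chromosome = ?
          AND m.chr_stop >= ?   -- gene ends after the interval starts
          AND m.chr_start <= ?  -- gene starts before the interval ends
        ORDER BY m.chr_start
    """
    return conn.execute(sql, (chromosome, start, end)).fetchall()


# Toy in-memory database; column names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE map_seq (gene_id INTEGER, chromosome TEXT,
                          chr_start INTEGER, chr_stop INTEGER);
    CREATE TABLE gene_info (gene_id INTEGER PRIMARY KEY, symbol TEXT);
""")
conn.executemany("INSERT INTO map_seq VALUES (?, ?, ?, ?)", [
    (1, "2", 1_000_000, 1_050_000),   # inside the interval
    (2, "2", 7_990_000, 8_010_000),   # straddles the right boundary
    (3, "2", 9_000_000, 9_100_000),   # outside the interval
    (4, "3", 2_000_000, 2_100_000),   # other chromosome
])
conn.executemany("INSERT INTO gene_info VALUES (?, ?)", [
    (1, "GENE_A"), (2, "GENE_B"), (3, "GENE_C"), (4, "GENE_D"),
])

region_genes = genes_in_region(conn, "2", 0, 8_000_000)
print(region_genes)
```

The same pattern would apply to the Ensembl-derived table, with the two result sets then passed to the redundancy-removal step.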

Figure 2.

AnnotQTL: principle and workflow. Boxes shaded in gray represent user input or database (i.e. Narcisse) input. Boxes shaded in yellow show the main processes in the AnnotQTL workflow. Boxes shaded in orange represent intermediate results. MP: Mammalian Phenotype.

The main step now is to remove the redundancy between NCBI and Ensembl data while keeping the specific annotation of both databases. As there is a slight difference in gene location between the two web sites, the filtering process cannot be based on gene location, which leaves two possible approaches. The first is to use the Ensembl cross-reference provided by NCBI. However, this approach is not exhaustive, since a few cross-references are missing even for genes annotated in both databases. A second strategy has therefore been developed based on a textual query search in the annotation fields provided by the two sites. Values for the symbol, synonyms, RNA accession and protein accession fields from NCBI are compared against the values in the Ensembl gene name field for the gene of the species of interest. When one or more of these fields match, all the information is combined under one record, thus removing duplicates and enriching the annotation (without losing the annotation specific to both sites). If available, the gene annotations of human and mouse orthologs are also included in this comparison. Each record is also filtered for potential intra-redundancy of annotation between the gene and its orthologs (i.e. the same gene description is found between the requested species and human or mouse orthologs). This set of genes combining NCBI- and Ensembl-specific information is then compared to the HGNC database. The goal of this procedure is to retrieve the HGNC approved symbol by searching for correspondences between annotation fields and the symbol or aliases fields of the HGNC database.
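The text-matching strategy can be sketched as follows. This is an illustrative reading of the description, not the published implementation: the record layout, the case-insensitive comparison, the first-match rule and the toy records are all assumptions, and the function names are hypothetical.

```python
def normalize(value):
    """Compare identifiers case-insensitively and ignore surrounding spaces."""
    return value.strip().upper()


def ncbi_keys(record):
    """Collect the NCBI fields compared against the Ensembl gene name."""
    values = {
        record["symbol"],
        *record.get("synonyms", []),
        *record.get("rna_accessions", []),
        *record.get("protein_accessions", []),
    }
    return {normalize(v) for v in values if v}


def merge_gene_lists(ncbi_genes, ensembl_genes):
    """Combine matching NCBI and Ensembl records; keep unmatched ones as-is."""
    merged, used = [], set()
    for ncbi in ncbi_genes:
        keys = ncbi_keys(ncbi)
        match = None
        for i, ens in enumerate(ensembl_genes):
            name = ens.get("gene_name")
            if i not in used and name and normalize(name) in keys:
                match = i
                break
        if match is not None:
            used.add(match)
        merged.append({"ncbi": ncbi,
                       "ensembl": ensembl_genes[match] if match is not None else None})
    # Ensembl-only genes are appended so that no annotation is lost.
    for i, ens in enumerate(ensembl_genes):
        if i not in used:
            merged.append({"ncbi": None, "ensembl": ens})
    return merged


# Toy records (identifiers are placeholders, not real database entries).
ncbi_genes = [
    {"gene_id": 101, "symbol": "MSTN", "synonyms": ["GDF8"]},
    {"gene_id": 102, "symbol": "LOC102"},
]
ensembl_genes = [
    {"ensembl_id": "ENS_TOY_1", "gene_name": "gdf8"},
    {"ensembl_id": "ENS_TOY_2", "gene_name": "GENE_X"},
]
merged = merge_gene_lists(ncbi_genes, ensembl_genes)
print(len(merged))
```

In this toy case the two source lists of two genes each collapse into three records: one combined record, one NCBI-only record and one Ensembl-only record.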
Then, the values found in the HGNC database (symbol and OMIM identifier, if any) are included in the final results display. If the OMIM identifier is still undefined, a search through the OMIM symbol fields is performed using the HGNC symbol or aliases. Where relevant, the OMIM identifier, title and related disorders are retrieved from the OMIM database. Finally, the user-specified genome region is cross-compared against the Narcisse database to fetch the human or mouse orthologous genomic region. To clarify the output and adapt it to the scientist's query, certain information is only available through menu options. Human and mouse orthology information from Ensembl can be used to more accurately define certain genomic regions left undefined in Narcisse data. Users can also select the level of synteny (synt order, see (10) for more details) between the studied species and the target species (human or mouse). Another option is to upload a set of genetic markers (which can be of any type provided a physical location is given) to be inserted in the final results display. Users can choose to keep their own marker locations or re-map markers to NCBI genome coordinates (only available with approved marker identifiers). A fourth non-processed column is available for displaying user-defined information. Adding the markers to the results display should ease the identification of the genes that most closely match the most significant markers. Finally, AnnotQTL can highlight genes based on functional annotations provided by GO, Mammalian Phenotype (MP), or OMIM disorders. For GO or MP terms, which are organized in a hierarchical ‘parent–children’ structure, user-input keywords are used to select the corresponding terms and their associated children. For OMIM, a query is performed against the OMIM disorders data retrieved in the previous step using the user-input keywords: if the keywords match, the OMIM disorders are highlighted in the display.
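The keyword-driven selection of ontology terms and their children can be sketched as a substring match followed by a traversal of the hierarchy. The code below is an assumption-laden illustration: the term identifiers, term names, gene names and the substring matching rule are invented for the example, and the function names are hypothetical.

```python
from collections import deque


def select_terms(term_names, children, keyword):
    """Select terms whose name contains the keyword, plus all their descendants."""
    keyword = keyword.lower()
    selected = {term for term, name in term_names.items() if keyword in name.lower()}
    queue = deque(selected)
    while queue:  # breadth-first walk down the parent-children hierarchy
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


def highlight_genes(gene_to_terms, selected_terms):
    """Return the genes annotated with at least one selected term."""
    return sorted(g for g, terms in gene_to_terms.items() if selected_terms & set(terms))


# Toy ontology: identifiers and names are placeholders, not real GO or MP terms.
term_names = {
    "T:1": "muscle organ development",
    "T:2": "myotube differentiation",   # child of T:1, name lacks the keyword
    "T:3": "lipid metabolic process",
}
children = {"T:1": ["T:2"]}
gene_to_terms = {
    "GENE_A": ["T:2"],
    "GENE_B": ["T:3"],
    "GENE_C": ["T:1", "T:3"],
}

selected = select_terms(term_names, children, "muscle")
highlighted = highlight_genes(gene_to_terms, selected)
print(sorted(selected), highlighted)
```

Including descendants is what lets GENE_A be highlighted here even though the name of its only term does not contain the keyword.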
For GO, the genes are highlighted if their GeneID matches the GO association provided by NCBI. Because genes specific to the Ensembl database do not have a GeneID, their match-up with GO annotations is based on their HGNC name, where available. Users can improve this ‘GOA highlight function’ by adding the GOA of human and/or mouse orthologs to the current genes. For MP, genes are highlighted if their approved symbols match the HGNC approved symbols stored in the MP database. The aim here is to provide functional information and facilitate the identification of genes linked to the trait of interest (i.e. functional candidate genes).

APPLICATION

To demonstrate the utility of AnnotQTL and test the efficiency of this web tool, we present different examples using real data aimed at identifying functional and positional candidate genes. The first example focuses on a bovine mutation controlling muscular hypertrophy. In 1995, the mutation was mapped to the extremity of BTA2 in a 12 cM interval (12). Using AnnotQTL on this region of the bovine chromosome (0–8 Mb) retrieved the location and functional annotation of 95 genes. We then applied the ‘GO highlight function’ on this region in two separate queries, using ‘muscle’ and ‘growth’ as the keywords best describing the observed phenotype. These two terms highlighted two and three genes, respectively, from these 95 genes. Both lists highlighted the MSTN (GDF8) gene, which encodes myostatin and has been validated as the causal gene (13). A second step analyzed a more extensive set of 21 QTL regions shaping abdominal fatness in chickens (14,15). The average length of these regions was 4.8 Mb. After running AnnotQTL, all the regions were enriched with genes by comparing NCBI and Ensembl information against information provided by either NCBI or Ensembl only (Table 1). For all the genomic regions, working from an initial set of 1734 genes from the NCBI database and 1902 genes from the Ensembl database, AnnotQTL retrieved a non-redundant set of 2220 genes. On this large dataset, we applied the ‘highlight function’ on each region to underline genes whose functional annotation was related to the studied phenotype. Among the 2220 genes located in these 21 QTL regions, 127 were highlighted using the GO term ‘lipid’ and the MP term ‘adipose’ as keywords, with an average 5.4 genes highlighted per region.

Table 1.

Statistics of the QTL/eQTL regions analyzed using AnnotQTL

Region type	Number of regions	Mean region size (Mb)	NCBI genes	Ensembl genes	AnnotQTL genes (NCBI and Ensembl merged)	Genes found by GO and MP term screening	Average genes found per region
QTL	21	4.8	1734	1902	2220	127	5.8
eQTL	25	3.4	1198	1283	1506	93	3.7
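The gene counts in Table 1 can be turned into a relative gain of the merged AnnotQTL set over each single source. The short calculation below uses only the figures from the table; the dictionary layout and function name are illustrative.

```python
# Gene counts copied from Table 1.
table_1 = {
    "QTL": {"ncbi": 1734, "ensembl": 1902, "merged": 2220},
    "eQTL": {"ncbi": 1198, "ensembl": 1283, "merged": 1506},
}


def gain_percent(merged, single_source):
    """Percentage of additional genes in the merged set relative to one source."""
    return 100 * (merged - single_source) / single_source


gains = {
    region_type: {
        "vs_ncbi": round(gain_percent(c["merged"], c["ncbi"]), 1),
        "vs_ensembl": round(gain_percent(c["merged"], c["ensembl"]), 1),
    }
    for region_type, c in table_1.items()
}
for region_type, g in gains.items():
    print(f"{region_type}: +{g['vs_ncbi']}% vs NCBI, +{g['vs_ensembl']}% vs Ensembl")
```

In both datasets the merged list contains roughly a quarter more genes than NCBI alone and about a sixth more than Ensembl alone.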

Finally, AnnotQTL can also be exploited to look at eQTL regions. Strategies combining transcriptomics and genotyping data have recently been developed to better characterize QTL regions for traits of interest by identifying co-localized eQTLs and QTLs (16–21). Whatever the context, this strategy identifies a much higher number of eQTL regions than in QTL studies, thus creating a need for tools that can efficiently find positional and functional candidate genes. Here, we focus on 25 chicken eQTL regions affecting 70 genes involved in lipid metabolism (i.e. sharing the GO term GO:0006629 ‘lipid metabolic process’). Average length of these regions is 3.4 Mb. Running AnnotQTL found similar results to those obtained for the QTL regions. All the regions were enriched with genes by comparing NCBI and Ensembl information against information provided by either NCBI or Ensembl only (Table 1): working from an initial set of 1198 genes from the NCBI database and 1283 genes from the Ensembl database, AnnotQTL retrieved a non-redundant set of 1506 genes. Again, in order to select possible candidate genes, we used the ‘highlight function’ to pinpoint the genes related to the studied phenotype. Among these 1506 genes, and using the same GO term ‘lipid’ and MP term ‘adipose’ as keywords, a total of 93 genes were identified, with an average 3.7 genes highlighted per region. These examples corresponding to two different contexts (QTL and eQTL analyses) clearly demonstrate how in just a couple of hours, AnnotQTL can accurately analyze the gene content of numerous regions identified by a full genome scan and go on to highlight some of these genes based on both their location and function, whereas in the same time period, a manually run procedure would only have been able to analyze one single region.

CONCLUSION

AnnotQTL is a web tool designed to gather the functional annotation of different prominent web sites while minimizing redundant information. Using all known information substantially accelerates the gene analysis of QTL regions for livestock species traits and improves the selection of candidate genes.

FUNDING

INRA, Agrocampus Ouest, the Regional Council of Brittany; French Ministry in charge of Agriculture (DGER). Funding for open access charge: INRA.

Conflict of interest statement. None declared.

21 in total

1. SNPeffect v2.0: a new step in investigating the molecular phenotypic effects of human non-synonymous SNPs.

Authors: Joke Reumers; Sebastian Maurer-Stroh; Joost Schymkowitz; Frederic Rousseau
Journal: Bioinformatics Date: 2006-06-29 Impact factor: 6.937

2. SNP Function Portal: a web database for exploring the function implication of SNP alleles.

Authors: Pinglang Wang; Manhong Dai; Weijian Xuan; Richard C McEachin; Anne U Jackson; Laura J Scott; Brian Athey; Stanley J Watson; Fan Meng
Journal: Bioinformatics Date: 2006-07-15 Impact factor: 6.937

3. Genetic validation of whole-transcriptome sequencing for mapping expression affected by cis-regulatory variation.

Authors: Tomas Babak; Philip Garrett-Engele; Christopher D Armour; Christopher K Raymond; Mark P Keller; Ronghua Chen; Carol A Rohl; Jason M Johnson; Alan D Attie; Hunter B Fraser; Eric E Schadt
Journal: BMC Genomics Date: 2010-08-13 Impact factor: 3.969

4. A deletion in the bovine myostatin gene causes the double-muscled phenotype in cattle.

Authors: L Grobet; L J Martin; D Poncelet; D Pirottin; B Brouwers; J Riquet; A Schoeberlein; S Dunner; F Ménissier; J Massabanda; R Fries; R Hanset; M Georges
Journal: Nat Genet Date: 1997-09 Impact factor: 38.330

5. Mapping quantitative trait loci affecting fatness and breast muscle weight in meat-type chicken lines divergently selected on abdominal fatness.

Authors: Sandrine Lagarrigue; Frédérique Pitel; Wilfrid Carré; Behnam Abasht; Pascale Le Roy; André Neau; Yves Amigues; Michel Sourdioux; Jean Simon; Larry Cogburn; Sammy Aggrey; Bernard Leclercq; Alain Vignal; Madeleine Douaire
Journal: Genet Sel Evol Date: 2006 Jan-Feb Impact factor: 4.297

6. Snap: an integrated SNP annotation platform.

Authors: Shengting Li; Lijia Ma; Heng Li; Søren Vang; Yafeng Hu; Lars Bolund; Jun Wang
Journal: Nucleic Acids Res Date: 2006-11-29 Impact factor: 16.971

7. BioMAJ: a flexible framework for databanks synchronization and processing.

Authors: Olivier Filangi; Yoann Beausse; Anthony Assi; Ludovic Legrand; Jean-Marc Larré; Véronique Martin; Olivier Collin; Christophe Caron; Hugues Leroy; David Allouche
Journal: Bioinformatics Date: 2008-06-30 Impact factor: 6.937

8. SNPs3D: candidate gene and SNP selection for association studies.

Authors: Peng Yue; Eugene Melamud; John Moult
Journal: BMC Bioinformatics Date: 2006-03-22 Impact factor: 3.169

9. Narcisse: a mirror view of conserved syntenies.

Authors: Emmanuel Courcelle; Yoann Beausse; Sébastien Letort; Olivier Stahl; Romain Fremez; Catherine Ngom-Bru; Jérôme Gouzy; Thomas Faraut
Journal: Nucleic Acids Res Date: 2007-11-02 Impact factor: 16.971

10. A SNP-centric database for the investigation of the human genome.

Authors: Alberto Riva; Isaac S Kohane
Journal: BMC Bioinformatics Date: 2004-03-26 Impact factor: 3.169
