IMAD: flexible annotation of microarray sequences.

Abstract

BACKGROUND: Accurate and current functional annotation of microarray probes is essential for the analysis and interpretation of the biological processes involved. As gene structures and functional annotation are updated in genome databases, the annotation attached to microarray probes must be updated so that scientists have access to the latest information with which to analyse their data.
RESULTS: We have designed a pipeline and database for the annotation of microarray probes using publicly available databases. The pipeline is based on NCBI BLAST, Perl and MySQL. The pipeline was used to annotate a subset of 791 differentially expressed ArkGenomics chicken probes from an experiment involving chickens infected with the protozoan parasite Eimeria. Using our pipeline, 770 of the probes were assigned at least one entry in the Ensembl, UniGene or DFCI gene indices databases.
CONCLUSION: The pipeline described here provides a simple and robust way of maintaining up-to-date and accurate annotation for microarray probes. The pipeline is designed in such a way as to be flexible and easy to update with new information.

Year: 2009 PMID: 19615115 PMCID: PMC2712745 DOI: 10.1186/1753-6561-3-S4-S2

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

Microarrays play an important role in biomedical research, producing large quantities of data on genes that are differentially expressed under various conditions. Although annotation provided with the microarray may be current at the time of manufacture, regular reannotation of the microarray is essential to keep the annotation current. Additionally, probes may be designed from annotation based on incomplete genomes and incorrect or incomplete annotation. This may result in incomplete coverage of the genome, non-specific probes, incorrect annotation, and orphan probes. ProbeLynx [1] is a published software system for linking microarray sequences to annotation data. However, ProbeLynx uses certain tables directly from the Ensembl database and is therefore sensitive to schema changes. At the time of writing, ProbeLynx uses Ensembl version 47, whereas the current Ensembl release is version 52. Our objective is to design a flexible, up-to-date annotation pipeline that can be used to regularly update the annotation of microarray probes using publicly accessible databases which provide coverage of the genome. This paper is part of a workshop to compare different annotation pipelines, the results of which have been published in conjunction with this paper [2].

Results

The pipeline has default filters such that only hits that match at greater than 80% identity across at least 20% of the length of the query sequence are counted. These values can be changed depending on requirements; for example, users would choose different values for a cDNA array compared to an oligo array. Applying these selection criteria to the data presented here resulted in 770 probes having at least one matching hit in at least one of the Ensembl [3], UniGene [4] or DFCI gene indices [5] databases. The results from this study and the other studies on this dataset can be found on the EADGENE website [6].
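The default filter can be illustrated with a short sketch. This is a Python illustration only, not the IMAD Perl code: the `Hit` record, its field names and the `passes_filter` function are hypothetical, and only the two default thresholds (80% identity, 20% of the query length) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Summary of one query/target pair (hypothetical record layout)."""
    query_length: int    # length of the probe sequence in bases
    aligned_length: int  # non-overlapping query bases covered by the hit
    identities: int      # identical positions within the aligned region


def passes_filter(hit: Hit, min_identity: float = 80.0,
                  min_coverage: float = 20.0) -> bool:
    """Keep hits above both thresholds (defaults taken from the text)."""
    if hit.aligned_length == 0 or hit.query_length == 0:
        return False
    identity = 100.0 * hit.identities / hit.aligned_length
    coverage = 100.0 * hit.aligned_length / hit.query_length
    # "greater than 80% identity across at least 20% of the query"
    return identity > min_identity and coverage >= min_coverage


# A 70-mer probe with a 35-base alignment containing 33 identities
oligo_hit = Hit(query_length=70, aligned_length=35, identities=33)
print(passes_filter(oligo_hit))                     # default thresholds
print(passes_filter(oligo_hit, min_coverage=90.0))  # stricter coverage
```

Passing the thresholds as parameters mirrors the point made above that different array types would call for different values.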

Ensembl

Using the Ensembl database, annotation could be provided for 472 probes (60%). Of those, 438 matched a single Ensembl gene id and 34 probes matched multiple genes. A total of 426 probes had perfect matches. Of these, 396 were unique hits. Gene descriptions were provided for 405 probes and 198 probes were matched to at least one Gene Ontology [7] term.

DFCI gene indices

Using the DFCI gene indices, annotation was provided for 683 probes (86%). Of these, 249 matched a single gene index and 434 matched multiple indices. A total of 548 probes had perfect matches, 195 of which were unique hits. Using the DFCI gene indices annotation, a gene description was provided for 466 probes and 66 probes were matched to at least one GO term [7].

UniGene

Of the 791 probes, 715 (90%) could be assigned to at least one UniGene cluster, of which 593 were assigned uniquely (and therefore 122 were assigned to multiple clusters). Perfect matches were seen in 560 cases, of which 478 were unique. All 715 of the annotated probes had a cluster title (gene description).

Discussion

When linking microarray probes to genome databases, we are attempting to do two things. Firstly, we are attempting to define just how many genes might be hybridising to each spot and contributing to the signal intensity. Secondly, we are attempting to inform scientists about gene function. Ideally there should be a one-to-one relationship between probe and gene. However, this is clearly not the case. Using the selection criteria, the best results come from UniGene, where 75% of probes have a single, contributory gene; the worst results are from DFCI gene indices, where the figure is 31%. Probes with more than one hit may be due to shared domains, overlapping genes, misannotation, misassembly, low complexity regions, and/or repeat regions.

There are several reasons why probes may have no hits. The microarray used in this study was designed in 2005 using the first draft of the chicken genome, Ensembl version 30, and annotated with Ensembl version 42. Since then there have been 20 subsequent versions of the Ensembl database and a second draft of the chicken genome. Regular reannotation of the probes using the information provided with new genebuilds and Ensembl releases allows us to maintain up-to-date information. In addition, only the core Ensembl gene set was searched; had we searched against the genome itself, or against the EST gene set, the number of unannotated probes would be reduced. It is not surprising that the number of unannotated probes is lower in the two EST databases. However, even with UniGene, the best in terms of probe coverage, one in ten probes did not have a hit above the threshold. This may mean that the sequence that the probe was originally designed to is no longer publicly available (or never was) or that it did not meet the quality criteria applied before the database was built.

In terms of functional annotation, all three databases provided a functional description for over half of the probes. UniGene again performed the best, although no attempt has been made to judge the quality of the description. Disappointingly, a maximum of 25% of probes were assigned GO terms. Future improvements in the assembly of the chicken genome and annotation should help to increase the level of annotation. The IMAD pipeline could be improved by allowing searches against the genome assembly, and against further databases such as the Ensembl EST genes, KEGG [8], and RefSeq [9]. This study is part of a workshop to compare different annotation strategies and the results of this have been published in conjunction with this study [2].
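The percentages quoted in this Discussion follow directly from the probe counts given in the Results. The short Python check below uses only figures reported in this paper; the variable and function names are illustrative.

```python
TOTAL_PROBES = 791

# Probes with at least one hit above threshold (from the Results)
annotated = {"Ensembl": 472, "DFCI": 683, "UniGene": 715}
# Probes matching exactly one gene, gene index or cluster (from the Results)
single_hit = {"Ensembl": 438, "DFCI": 249, "UniGene": 593}


def percent(count: int, total: int = TOTAL_PROBES) -> int:
    """Percentage of the 791-probe subset, rounded to a whole number."""
    return round(100 * count / total)


for db in annotated:
    print(f"{db}: {percent(annotated[db])}% annotated, "
          f"{percent(single_hit[db])}% with a single hit")

# Ensembl GO coverage, the highest of the databases reported
print(f"GO terms via Ensembl: {percent(198)}%")
```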

Conclusion

We have created a pipeline that can be used to maintain the annotation of microarray probes using publicly available databases. The analysis of a set of differentially expressed probes revealed problems with annotation that may be due to a probe design based on incomplete annotation of the chicken genome. As improvements in the annotation of the chicken genome are made, improvements in the design of chicken microarrays are sure to follow.

Materials and methods

Software organisation

IMAD consists of a flexible relational database in MySQL, designed to store the hits of any set of sequences against any number of BLAST [10] databases, and any annotation associated with those databases; Perl scripts for downloading, updating and inserting Ensembl, UniGene and DFCI gene indices databases; a Perl API for querying the database programmatically; and a Perl CGI script for web-based querying.
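The IMAD schema itself is not reproduced in this paper. Purely as an illustration of the idea of storing hits against any number of BLAST databases alongside their annotation, the sketch below shows a hypothetical minimal layout. It is written in Python with SQLite standing in for MySQL so that it runs without a server; all table names, column names and example rows are assumptions, not the real IMAD schema or data.

```python
import sqlite3

# Hypothetical three-table layout: searched databases, hits, annotation
SCHEMA = """
CREATE TABLE blast_db (
    db_id   INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    version TEXT
);
CREATE TABLE hit (
    hit_id       INTEGER PRIMARY KEY,
    probe_id     TEXT NOT NULL,
    db_id        INTEGER NOT NULL REFERENCES blast_db(db_id),
    target_id    TEXT NOT NULL,
    pct_identity REAL,
    bit_score    REAL
);
CREATE TABLE annotation (
    db_id       INTEGER NOT NULL REFERENCES blast_db(db_id),
    target_id   TEXT NOT NULL,
    description TEXT,
    PRIMARY KEY (db_id, target_id)
);
"""


def top_hits(conn: sqlite3.Connection):
    """Best-scoring hit per probe per database, with its description."""
    query = """
        SELECT h.probe_id, d.name, h.target_id, a.description
        FROM hit h
        JOIN blast_db d ON d.db_id = h.db_id
        LEFT JOIN annotation a
               ON a.db_id = h.db_id AND a.target_id = h.target_id
        WHERE h.bit_score = (SELECT MAX(bit_score) FROM hit
                             WHERE probe_id = h.probe_id
                               AND db_id = h.db_id)
        ORDER BY h.probe_id, d.name
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO blast_db VALUES (1, 'Ensembl', '50')")
conn.executemany(
    "INSERT INTO hit (probe_id, db_id, target_id, pct_identity, bit_score) "
    "VALUES (?, ?, ?, ?, ?)",
    [("probe_1", 1, "gene_A", 100.0, 139.0),   # placeholder rows
     ("probe_1", 1, "gene_B", 85.0, 60.0)],
)
conn.execute("INSERT INTO annotation VALUES (1, 'gene_A', 'example description')")
print(top_hits(conn))
```

Keeping the searched databases in their own table is one way to let new annotation sources be added as rows rather than as schema changes.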

Workflow

The probe set was searched against multiple databases using NCBI BLAST, followed by parsing of the BLAST results. Where a single HSP exists between the query and hit, filters are applied and statistics are calculated and stored in the database. Where there are multiple HSPs, any overlap with respect to the query and the hit is removed. Statistics are then calculated across all HSPs, filters are applied, and the results are stored in the database. Results (top hit for each probe for each database) are extracted in spreadsheet format using the API.
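One common way to remove overlap between multiple HSPs before calculating statistics is to merge their coordinate ranges so that no base is counted twice. The Python sketch below shows this for query coordinates only; the same function could be applied to hit coordinates. It assumes 1-based inclusive coordinates with start not greater than end, the function names are hypothetical, and it is not necessarily how IMAD implements this step.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def merge_intervals(hsps: List[Interval]) -> List[Interval]:
    """Collapse overlapping HSP ranges so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it if this one reaches further
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(hsps: List[Interval]) -> int:
    """Number of distinct bases covered by at least one HSP."""
    return sum(end - start + 1 for start, end in merge_intervals(hsps))


# Three HSPs on a 70-mer probe; the first two overlap on bases 30-40
query_hsps = [(1, 40), (30, 60), (65, 70)]
print(merge_intervals(query_hsps))
print(covered_bases(query_hsps))
```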

Microarray dataset

The microarray used in this study was the ArkGenomics chicken 20 K oligo microarray, consisting of 20,460 probes, primarily 70-mer oligos, designed against a unique set of chicken transcripts in 2005 [11]. A subset of 791 probes was selected for analysis in conjunction with the EADGENE post-analysis workshop of microarray data [6], with the aim of evaluating several annotation pipelines for the quality of improved annotation. This subset represents a set of differentially expressed probes from an experiment on Eimeria-infected chickens [12].

Dataset sources for annotation

Ensembl chicken version 50, UniGene chicken build 39 and DFCI chicken gene indices version 11 were used. Gene Ontology terms were obtained through Ensembl BioMart [13]. These three databases provide

Selection criteria

Cutoff values for positive hits were any target with a contiguous matching stretch greater than 20 bases and an overall percentage identity greater than 80%. A perfect match is defined as a 100% match over the entire length of the oligo with the target sequence.
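These criteria can be expressed as a small classification routine. The Python sketch below is an illustration under stated assumptions: the alignment is represented by a BLAST-style midline string in which `|` marks an identical base and any other character marks a mismatch or gap, and the function names and the three labels are hypothetical.

```python
def longest_run(midline: str, match_char: str = "|") -> int:
    """Length of the longest contiguous stretch of identical bases."""
    best = current = 0
    for ch in midline:
        current = current + 1 if ch == match_char else 0
        best = max(best, current)
    return best


def classify(probe_length: int, midline: str) -> str:
    """Label a hit as 'perfect', 'positive' or 'rejected'."""
    aligned = len(midline)
    if aligned == 0:
        return "rejected"
    matches = midline.count("|")
    identity = 100.0 * matches / aligned
    # Perfect: every base of the oligo aligned and identical
    if matches == aligned == probe_length:
        return "perfect"
    # Positive: contiguous stretch > 20 bases and identity > 80%
    if longest_run(midline) > 20 and identity > 80.0:
        return "positive"
    return "rejected"


print(classify(70, "|" * 70))                   # full-length identical match
print(classify(70, "|" * 30 + " " + "|" * 20))  # one mismatch, long stretch
print(classify(70, ("|" * 10 + " ") * 5))       # no stretch above 20 bases
```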

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MW and DP co-wrote IMAD.

10 in total

1. ProbeLynx: a tool for updating the association of microarray probes to genes.

Authors: Fiona M Roche; Karsten Hokamp; Michael Acab; Lorne A Babiuk; Robert E W Hancock; Fiona S L Brinkman
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

2. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

3. Database resources of the National Center for Biotechnology.

Authors: David L Wheeler; Deanna M Church; Scott Federhen; Alex E Lash; Thomas L Madden; Joan U Pontius; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Tatiana A Tatusova; Lukas Wagner
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes.

Authors: Y Lee; J Tsai; S Sunkara; S Karamycheva; G Pertea; R Sultana; V Antonescu; A Chan; F Cheung; J Quackenbush
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

5. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

6. KEGG Atlas mapping for global analysis of metabolic pathways.

Authors: Shujiro Okuda; Takuji Yamada; Masami Hamajima; Masumi Itoh; Toshiaki Katayama; Peer Bork; Susumu Goto; Minoru Kanehisa
Journal: Nucleic Acids Res Date: 2008-05-13 Impact factor: 16.971

7. Comparison of three microarray probe annotation pipelines: differences in strategies and their effect on downstream analysis.

Authors: Pieter Bt Neerincx; Pierrot Casel; Dennis Prickett; Haisheng Nie; Michael Watson; Christophe Klopp; Jack Am Leunissen; Martien Am Groenen
Journal: BMC Proc Date: 2009-07-16

8. BioMart--biological queries made easy.

Authors: Damian Smedley; Syed Haider; Benoit Ballester; Richard Holland; Darin London; Gudmundur Thorisson; Arek Kasprzyk
Journal: BMC Genomics Date: 2009-01-14 Impact factor: 3.969

9. Ensembl 2009.

Authors: T J P Hubbard; B L Aken; S Ayling; B Ballester; K Beal; E Bragin; S Brent; Y Chen; P Clapham; L Clarke; G Coates; S Fairley; S Fitzgerald; J Fernandez-Banet; L Gordon; S Graf; S Haider; M Hammond; R Holland; K Howe; A Jenkinson; N Johnson; A Kahari; D Keefe; S Keenan; R Kinsella; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; D Rios; M Schuster; G Slater; D Smedley; W Spooner; G Spudich; S Trevanion; A Vilella; J Vogel; S White; S Wilder; A Zadissa; E Birney; F Cunningham; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; A Kasprzyk; G Proctor; J Smith; S Searle; P Flicek
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971

10. The Gene Ontology project in 2008.

Authors: Gene Ontology Consortium
Journal: Nucleic Acids Res Date: 2007-11-04 Impact factor: 16.971
