PLANdbAffy: probe-level annotation database for Affymetrix expression microarrays.

Ramil N Nurtdinov¹, Mikhail O Vasiliev, Anna S Ershova, Ilia S Lossev, Anna S Karyagina.

Abstract

Standard Affymetrix technology evaluates gene expression by measuring the intensity of mRNA hybridization with a panel of 25-mer oligonucleotide probes and summarizing the probe signal intensities by a robust averaging method. However, in many cases the signal intensity of a probe does not correlate with gene expression. This could be due to hybridization of the probe to a transcript of another gene, mapping of the probe to an intron, alternative splicing, single nucleotide polymorphisms (SNPs) and other reasons. We have developed a database, PLANdbAffy (available at http://affymetrix2.bioinf.fbb.msu.ru), that contains the results of the alignment of probe sequences from five Affymetrix expression microarrays to the human genome. We have determined the probes matching the transcript-coding regions in the correct orientation. For each such probe alignment region, we determined the mRNA and EST sequences that contain the probe sequence. In the textual part of the database interface we summarize the data on the sequences that cover the probe alignment region and on the SNPs located inside it. The graphical part of the database interface is implemented as custom tracks for the UCSC genome browser, which allows one to use all the data offered by the UCSC browser.

Year: 2009 PMID: 19906711 PMCID: PMC2808952 DOI: 10.1093/nar/gkp969

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Affymetrix 3′ Gene as well as Exon and Gene level microarrays are widely used in gene expression studies. HG-U133A, HG-U133B and HG-U133 Plus 2.0 arrays consist of probe sets developed for each annotated human gene. A probe set is typically a set of eleven 25-mer oligonucleotide probes, with a small number of probe sets consisting of more or fewer than 11 probes. The majority of Human Exon 1.0 probe sets consist of four probes; these probes were developed to target all known and predicted human exons. The Human Gene 1.0 chip is based on Human Exon 1.0 data, combining all highly expressed probes that are confirmed by the transcriptome data for a particular gene. Thus the number of probes in a probe set depends on the transcript length. The Affymetrix probes and probe sets have remained unchanged for the past several years, but our knowledge of their genome and transcriptome context has improved with every paper in this field. The first annotation of Affymetrix probes was provided by Affymetrix staff in the NetAffx database (1). This database contains information about the transcripts that are recognized by the corresponding probe sets and fixes the problem of the absence of representative sequences in new versions of UniGene (2). A careful analysis of HG-U133A probes was done by Gautier and colleagues (3). The authors aligned probe sequences with RefSeq (4) mRNAs and found discrepancies for 64% of the HG-U133A probes. Using a similar approach, Harbig and colleagues (5) showed that ∼37% of the probes of the HG-U133 Plus 2.0 array should be redefined and that more than 5000 probe sets detect multiple transcripts. Analyses of other expression arrays (6–10) yielded similar results. Non-specific hybridization is another major problem of microarray experiments. Several papers showed that the rule that a perfect match probe has a high signal level and a mismatch probe has a low signal level does not hold in many cases (11–13). 
In a subsequent paper (14), Zhang and colleagues developed a model of molecular interaction on short oligonucleotide arrays and applied it in their next work (15). It was shown that a significant number of probes could give a high signal level through non-specific hybridization with short 10–16-nucleotide fragments. Alternative splicing is another source of inconsistency in microarray experiments. Recent articles showed that up to 93% of human intron-containing genes undergo alternative splicing (16,17) and that up to 90% of the genome sequence is transcribed (18). An additional source of inconsistency is the presence of single nucleotide polymorphisms (SNPs) within probe alignment positions. There are several publicly available databases that contain annotation of the Affymetrix data: the official NetAffx (1), GeneAnnot (19), ADAPT (20) and X:Map (21) databases. GeneAnnot and ADAPT align probe sequences to the RefSeq and Ensembl mRNAs; NetAffx additionally considers GenBank (22) and UniGene (2) mRNAs. The main problem of the common approach used by these three databases arises when a particular probe, in addition to the original position, recognizes another transcribed region that is absent from the mRNA sequences considered. This results in an incomplete annotation of the probe set (probe). X:Map and the PLANdbAffy database presented here address this shortcoming. The authors of X:Map aligned probe sequences with the genome and also took the ESTs into account. Unfortunately, X:Map contains data only for exon-level arrays, leaving other widely used arrays (HG-U133A&B and Human Gene 1.0) uncovered. The interface of X:Map is based on the Google Maps API covering the whole chromosome. To obtain the EST transcription state of a particular probe one has to count the ESTs manually. This is rather difficult, and becomes much more laborious for exon-junction probes and probes that are close to splice sites. 
In addition, the X:Map database uses only the Ensembl genome annotation and Ensembl EST accessions, which is inconvenient for NCBI-oriented users. Our PLANdbAffy database considers five widely used Affymetrix human microarrays: HG-U133A, HG-U133B, HG-U133 Plus 2.0, Human Exon 1.0 and Human Gene 1.0. The database provides the user with information on all alignment positions of the individual Affymetrix probes in the genome, considering alignments with up to two mismatches, and also supports each probe alignment region with all transcriptome data known to date. Unlike the above databases (except NetAffx), PLANdbAffy also contains data on SNPs. Graphical information about each probe alignment region and gene is implemented as custom tracks for the UCSC genome browser. After moving to the UCSC site, it becomes possible to use the whole set of data and tools provided by the UCSC browser.

DATABASE CONSTRUCTION AND STRUCTURE

Data source

The files containing information about Affymetrix microarrays were downloaded from the official Affymetrix site (http://www.affymetrix.com/products_services/index.affx). For this analysis we selected three 3′ Gene arrays, Affymetrix HG-U133A, HG-U133B and HG-U133 Plus 2.0, and two Exon&Gene level arrays, Human Gene 1.0 and Human Exon 1.0. The NCBI36 (hg18) genome assembly was downloaded from the UCSC FTP site. We also downloaded the EST and mRNA exon–intron structures (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/, all_mrna.txt.gz and all_est.txt.gz files) that were obtained by Blat (23) alignment of the corresponding sequences with the genome. We used the NCBI annotation of the genome sequences. RefSeq (4) and UniGene (2) were used to assign mRNA and EST sequences to the genes. We used dbSNP (24) build 130 as a source of SNPs; the human-readable text files were downloaded from the FTP site (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat/) and parsed.
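To illustrate how exon–intron structures can be read from such alignment tables, the following minimal sketch parses one row into genomic blocks. It assumes the usual UCSC table layout of a leading `bin` column followed by the 21 PSL columns; the `Alignment` type, the function name and any example rows are hypothetical and are not taken from the PLANdbAffy code.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Exon-intron structure of one aligned mRNA or EST (hypothetical type)."""
    name: str                      # query (mRNA/EST) accession
    chrom: str                     # target chromosome
    strand: str                    # '+' or '-'
    exons: List[Tuple[int, int]]   # 0-based, half-open genomic blocks


def parse_psl_row(line: str) -> Alignment:
    """Parse one tab-separated row: bin column + 21 PSL columns (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 22:
        raise ValueError(f"expected 22 columns, got {len(fields)}")
    strand, name, chrom = fields[9], fields[10], fields[14]
    # blockSizes (column 19) and tStarts (column 21) are comma-separated
    # lists that end with a trailing comma, so empty items are skipped.
    sizes = [int(x) for x in fields[19].split(",") if x]
    starts = [int(x) for x in fields[21].split(",") if x]
    exons = [(start, start + size) for start, size in zip(starts, sizes)]
    return Alignment(name, chrom, strand, exons)
```

Gaps between consecutive blocks correspond to introns or alignment gaps; distinguishing the two would require an additional length threshold.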

Database development

Each probe and probe set within the five chips under consideration was assigned a unique ID. This was done because some probe sets in different chips have the same identification numbers. The Affymetrix numbers were also stored and can be used to search the database. Probe sequences were mapped to the genome using Blat (23). We allowed alignments with no more than two mismatches and required introns of at least 40 nucleotides for potential exon-junction probes. The hits found (‘probe alignment regions’) were stored and subjected to further analysis. We assigned a probe to a particular gene (‘the probe matches the gene’) if the probe alignment region intersected with the annotated gene region and was in the correct orientation. We also took into account possible mistakes in the gene annotation by extending the 3′-end of each gene by 1000 nucleotides. We annotated each probe alignment region using the mRNA and EST alignments provided by UCSC, considering only the sequences that were present in UniGene (build 219) for the corresponding genes. For each probe alignment region, we calculated the number of mRNAs and ESTs that either support (mrna_in, spliced_est_in, unspliced_est_in fields) or do not support (mrna_out, spliced_est_out fields) the occurrence of the probe alignment region in an exon (see the database web site for further explanation). To present the quality of a probe we divided all probes into four classes and assigned a color to each class (Figure 1). Green probes (the best ones) are the probes meeting three conditions. First, the probe is aligned to the target gene without mismatches. Second, there are no matches of the probe to other genes. Third, there are no perfect alignments of the probe to non-coding regions. Unlike the green probes, a yellow probe has a perfect match to a non-coding region. The yellow probes still have a perfect match to the target gene and no matches with other genes. 
The red probes are the probes that have a perfect match to the target gene and at least one alignment to other genes with no more than one mismatch. Finally, a black probe is aligned to the target gene with at least one mismatch.
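The four-class scheme can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' implementation: the `Hit` record and the gene names used when calling it are hypothetical, a probe counts as black when its best alignment to the target gene has at least one mismatch, and hits to other genes with exactly two mismatches are ignored because the text does not say how they are treated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One probe alignment region (hypothetical record, not the database schema)."""
    gene: Optional[str]   # gene matched in the correct orientation, or None
    mismatches: int       # 0, 1 or 2


def classify_probe(target_gene: str, hits: List[Hit]) -> str:
    """Return 'green', 'yellow', 'red' or 'black' following the rules in the text."""
    target_hits = [h for h in hits if h.gene == target_gene]
    if not target_hits:
        raise ValueError("probe has no alignment to its target gene")
    # Black: the probe aligns to the target gene only with mismatches.
    if min(h.mismatches for h in target_hits) > 0:
        return "black"
    # Red: perfect target match plus a hit to another gene with <= 1 mismatch.
    # ASSUMPTION: hits to other genes with 2 mismatches are ignored.
    if any(h.gene not in (None, target_gene) and h.mismatches <= 1 for h in hits):
        return "red"
    # Yellow: perfect target match plus a perfect match outside any gene.
    if any(h.gene is None and h.mismatches == 0 for h in hits):
        return "yellow"
    # Green: perfect and unique match to the target gene.
    return "green"
```

The order of the checks matters: black is tested first because the other three classes all require a perfect match to the target gene.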

Figure 1.

Examples of the textual (A) and graphical (B) interface of the PLANdbAffy database. The textual interface of the database consists of three sections. The first section (left five columns) contains the original information about the probes from Affymetrix and the probe status (highlighted in green), the second section (6–8th columns) describes the probe alignment and the last section (rightmost five columns) gives the numbers of ESTs and mRNAs either supporting or not supporting the occurrence of the probe in an exon (see ‘Database development’ section). The example of the graphical interface was manually processed to reduce the image size.

Database interface

The database is available at http://affymetrix2.bioinf.fbb.msu.ru. Text files containing the information about the mapping and annotation of the good (green) probes can be downloaded from the web site. The title page contains several search boxes. One may either search the database with an Affymetrix probe set identifier, or get all probes for a particular gene using the gene search boxes. The EntrezGene, HUGO and Ensembl identifiers, the gene symbol, or a word or phrase in the gene name can be used. It must be noted, however, that since our database is based on the RefSeq annotations, some of the Ensembl and HUGO identifiers may be missing. On querying a probe set or a gene, one sees the textual part of the database interface (Figure 1A). The textual part of the interface consists of a probe information section, a probe alignment section and a transcription state section. The probe information section has four fields presenting the probe name (‘text’), the probe position on the chip (‘X’, ‘Y’, ‘inter pos’) and the color representation of the probe’s quality status (‘sts’); see the ‘Database development’ section. The probe alignment section contains information about the probe alignment and its mismatches. Positions of SNPs within the probe alignment region are marked and linked to their descriptions in dbSNP. For each probe, the information about the EST and mRNA sequences that cover the probe alignment region is available in the transcription state section. The explanation of the corresponding fields for the exon and exon-junction probes is given in the ‘Database development’ section. Each gene and each probe alignment region is supported by the graphical part of the database interface. It is organized as custom tracks for the UCSC genome browser (Figure 1B), which allows one to use all the information offered by the UCSC browser for the corresponding region.
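As an illustration of how probe alignment regions could be shown as a UCSC custom track, the sketch below writes a BED9 track in which each probe is colored by its quality class. The track name, the RGB values, the probe tuple layout and the function name are assumptions made for this example; the actual track files generated by PLANdbAffy may differ.

```python
from typing import Iterable, Tuple

# Hypothetical RGB values for the four probe classes.
CLASS_RGB = {
    "green": "0,160,0",
    "yellow": "230,200,0",
    "red": "220,0,0",
    "black": "0,0,0",
}

# chrom, start (0-based), end, probe name, strand, quality class
Probe = Tuple[str, int, int, str, str, str]


def bed_custom_track(probes: Iterable[Probe], track_name: str = "probes") -> str:
    """Build the text of a BED9 custom track with per-item colors."""
    lines = [
        f'track name="{track_name}" '
        'description="Probe alignment regions" itemRgb="On"'
    ]
    for chrom, start, end, name, strand, quality in probes:
        # BED9 columns: chrom, chromStart, chromEnd, name, score, strand,
        # thickStart, thickEnd, itemRgb. Score is fixed at 0 here.
        lines.append("\t".join([
            chrom, str(start), str(end), name, "0", strand,
            str(start), str(end), CLASS_RGB[quality],
        ]))
    return "\n".join(lines) + "\n"
```

The returned text can be saved to a file and loaded through the custom track upload page of the genome browser.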

Data analysis

In Figure 2, we present the frequencies of each type of probe for all five arrays. Among the 3′ gene arrays, HG-U133A has the highest frequency (70%) of good (green) probes. The HG-U133B array has ∼53% good probes, and the HG-U133 Plus 2.0 array, which was designed essentially by combining the HG-U133A and HG-U133B arrays, lies in between with 59% good probes.

Figure 2.

Frequencies of probes with different alignment and cross-hybridization state for all five considered Affymetrix arrays. See colours’ definitions in the text.

Table 1 contains summary information about the transcriptome annotation for good (green) probes. A probe is marked as ‘exon’ if it is confirmed by more than 90% of the mRNA and EST sequences that cover this region. It is marked as ‘intron’ if it is confirmed by <10% of the sequences, whereas probes that fall in between are marked as ‘exon/intron’.
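The thresholding rule can be written down as a short function. The sketch assumes that the supporting fraction is computed from the five count fields named in the ‘Database development’ section (the three `_in` fields as supporting, the two `_out` fields as not supporting); how the database combines these counts internally is not stated in the text, and the function name is hypothetical.

```python
from typing import Optional


def transcription_label(
    mrna_in: int,
    spliced_est_in: int,
    unspliced_est_in: int,
    mrna_out: int,
    spliced_est_out: int,
) -> Optional[str]:
    """Label a probe alignment region as 'exon', 'intron' or 'exon/intron'."""
    supporting = mrna_in + spliced_est_in + unspliced_est_in
    total = supporting + mrna_out + spliced_est_out
    if total == 0:
        # No covering sequences: the rule in the text does not apply.
        return None
    fraction = supporting / total
    if fraction > 0.9:
        return "exon"
    if fraction < 0.1:
        return "intron"
    return "exon/intron"
```

For example, a region supported by 10 of 20 covering sequences would be labeled ‘exon/intron’ under these assumptions.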

Table 1.

Genome and transcriptome annotation for good (green) probes

	HG-U133A	HG-U133B	HG-U133_Plus2	HuGene	HuEx
Exon	142 427	73 148	236 427	436 548	915 886
Exon/intron	19 567	20 198	50 154	81 189	288 217
Intron	12 374	38 680	70 605	67 619	1 271 110
SNP	19 249	11 380	34 724	70 525	274 208
Total genes	16 480	11 697	23 255	24 010	30 915

The HG-U133A array contains the lowest proportion of intron and exon/intron probes (18.3%). A considerably greater proportion of such probes was observed for the HG-U133B (44.6%) and HG-U133 Plus 2.0 (33.8%) arrays. As the Human Exon 1.0 chip was designed to recognize all potentially transcribed segments, it contains the greatest proportion of intron and exon/intron probes (63.0%). The Human Gene 1.0 array has a level of intron and exon/intron probes (25.4%) similar to that of the HG-U133A array. All five arrays have an almost equal proportion of SNPs (8–12%) in the probe alignment regions of good (green) probes. Similar results were described in different research and database papers. Zhang and colleagues (7) have shown that the HG-U133A array contains 12.1 and 8.0% of non-specific and mistargeted probes, respectively. The GeneAnnot database summary (19) reports that ∼16% of HG-U133A array probe sets recognize multiple genes. The ADAPT database summary (20) reports that ∼23.1% of HG-U133 Plus 2.0 array probe sets match more than one RefSeq transcript. The X:Map database publication (21) contains detailed statistics for the Human Exon 1.0 chip. The authors observed 9% of multitarget probe sets and 45% of intergenic probe sets. Very similar values were observed in the PLANdbAffy database: 9.1% of multitarget (red and yellow) probes and 45.2% of intergenic (black) probes. X:Map annotates 21 and 23% of all studied probe sets as exon and intron ones, respectively, and similar values are observed in PLANdbAffy (Table 1).
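The percentages of intron and exon/intron probes quoted in this section can be reproduced from the counts in Table 1. The sketch below assumes that the denominator for each array is the sum of its Exon, Exon/intron and Intron rows, which matches the reported figures; the dictionary and function names are illustrative only.

```python
# Counts of good (green) probes copied from Table 1.
TABLE1 = {
    "HG-U133A":      {"exon": 142427, "exon/intron": 19567,  "intron": 12374,   "snp": 19249},
    "HG-U133B":      {"exon": 73148,  "exon/intron": 20198,  "intron": 38680,   "snp": 11380},
    "HG-U133_Plus2": {"exon": 236427, "exon/intron": 50154,  "intron": 70605,   "snp": 34724},
    "HuGene":        {"exon": 436548, "exon/intron": 81189,  "intron": 67619,   "snp": 70525},
    "HuEx":          {"exon": 915886, "exon/intron": 288217, "intron": 1271110, "snp": 274208},
}


def total_probes(row: dict) -> int:
    """Assumed denominator: all green probes with a transcriptome label."""
    return row["exon"] + row["exon/intron"] + row["intron"]


def non_exon_percent(row: dict) -> float:
    """Percentage of intron plus exon/intron probes."""
    return 100.0 * (row["exon/intron"] + row["intron"]) / total_probes(row)


def snp_percent(row: dict) -> float:
    """Percentage of probes with a SNP in the probe alignment region."""
    return 100.0 * row["snp"] / total_probes(row)


if __name__ == "__main__":
    for array, row in TABLE1.items():
        print(f"{array}: {non_exon_percent(row):.1f}% non-exon, "
              f"{snp_percent(row):.1f}% with SNPs")
```

For HG-U133A, for instance, (19 567 + 12 374) / (142 427 + 19 567 + 12 374) = 31 941 / 174 368, or about 18.3%.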

DATABASE USAGE

The database can be used to interpret the results of gene expression experiments and to perform a detailed analysis of expression in particular regions of the genome. For example, different probe sets of one gene commonly show quite different expression values, and it is not clear which value is correct. Careful analysis of the genomic probe alignment regions can help to explain the difference. It may arise from discrepancies in the microarray design: a probe may be aligned to an alternatively spliced region of a gene, and SNPs within probe alignment regions may decrease the probe signal intensity. In contrast, cross-hybridization of a probe, which is observed much more often, will increase the probe signal. The PLANdbAffy textual summary page of a particular probe set or gene contains information on the transcription, cross-hybridization and SNP status of each probe. From this page one can move to the UCSC Genome Browser and see the Affymetrix probes under consideration as a custom track. This browser contains different annotations for the corresponding genome regions, e.g. mapping and sequencing annotation, phenotype and disease annotation, gene, protein, mRNA and EST annotation, etc. This information allows one to perform a qualitative analysis of microarray results and may serve as a good starting point for additional molecular studies.

FUTURE PLANS

We are planning to move our data from the hg18 to the hg19 version of the human genome and to update it twice a year with new mRNA and EST alignments. We are also planning to perform this analysis for the mouse and rat exon-level arrays.

FUNDING

Open Access charges were waived by Oxford University Press.

Conflict of interest statement. None declared.

24 in total

Review 1. Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments.

Authors: Jason M Johnson; Stephen Edwards; Daniel Shoemaker; Eric E Schadt
Journal: Trends Genet Date: 2005-02 Impact factor: 11.639

2. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing.

Authors: Qun Pan; Ofer Shai; Leo J Lee; Brendan J Frey; Benjamin J Blencowe
Journal: Nat Genet Date: 2008-11-02 Impact factor: 38.330

3. Quality assessment of the Affymetrix U133A&B probesets by target sequence mapping and expression data analysis.

Authors: Yuriy L Orlov; Jiangtao Zhou; Leonard Lipovich; Atif Shahab; Vladimir A Kuznetsov
Journal: In Silico Biol Date: 2007

4. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data.

Authors: Manhong Dai; Pinglang Wang; Andrew D Boyd; Georgi Kostov; Brian Athey; Edward G Jones; William E Bunney; Richard M Myers; Terry P Speed; Huda Akil; Stanley J Watson; Fan Meng
Journal: Nucleic Acids Res Date: 2005-11-10 Impact factor: 16.971

5. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations.

Authors: Michał J Okoniewski; Crispin J Miller
Journal: BMC Bioinformatics Date: 2006-06-02 Impact factor: 3.169

6. NCBI Reference Sequences: current status, policy and new initiatives.

Authors: Kim D Pruitt; Tatiana Tatusova; William Klimke; Donna R Maglott
Journal: Nucleic Acids Res Date: 2008-10-16 Impact factor: 16.971

7. Alternative isoform regulation in human tissue transcriptomes.

Authors: Eric T Wang; Rickard Sandberg; Shujun Luo; Irina Khrebtukova; Lu Zhang; Christine Mayr; Stephen F Kingsmore; Gary P Schroth; Christopher B Burge
Journal: Nature Date: 2008-11-27 Impact factor: 49.962

8. X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis.

Authors: Tim Yates; Michał J Okoniewski; Crispin J Miller
Journal: Nucleic Acids Res Date: 2007-10-11 Impact factor: 16.971

9. Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data.

Authors: Hui Yu; Feng Wang; Kang Tu; Lu Xie; Yuan-Yuan Li; Yi-Xue Li
Journal: BMC Bioinformatics Date: 2007-06-11 Impact factor: 3.169

10. Alternative mapping of probes to genes for Affymetrix chips.

Authors: Laurent Gautier; Morten Møller; Lennart Friis-Hansen; Steen Knudsen
Journal: BMC Bioinformatics Date: 2004-08-14 Impact factor: 3.169
