The Overlay Tool has been developed to combine high throughput data derived from various microarray platforms. This tool analyzes high-resolution correlations between gene expression changes and either copy number abnormalities (CNAs) or loss of heterozygosity events detected using array comparative genomic hybridization (aCGH). Using an overlay analysis which is designed to be performed using data from multiple microarray platforms on a single biological sample, the Overlay Tool identifies potentially important genes whose expression profiles are changed as a result of losses, gains and amplifications in the cancer genome. In addition, the Overlay Tool will incorporate loss of heterozygosity (LOH) probability data into this overlay procedure. To facilitate this analysis, we developed an application which computationally combines two or more high throughput datasets (e.g. aCGH/expression) into a single categorized dataset for visualization and interrogation using a gene-centric approach. As such, data from virtually any microarray platform can be incorporated without the need to remap entire datasets individually. The resultant categorized (overlay) data set can be conveniently viewed using our in-house visualization tool, aCGHViewer (Shankar et al. 2006), which serves as a conduit to public databases such as UCSC and NCBI, to rapidly investigate genes of interest.
Table 1.
Summary of different array identifiers for the EGFR gene.

| Experiment Type | aCGH | Expression Profile | SNP Studies |
|---|---|---|---|
| Array Platform | RPCI 6K BAC Array | U133 Plus 2 Array | Mapping 100K Array |
| EGFR | RP11-81B20 | 1565483_at | SNP_A-1656548 |
| | RP11-89E8 | 1565484_x_at | SNP_A-1656756 |
| | | 201983_s_at | SNP_A-1663463 |
| | | 201984_s_at | SNP_A-1665473 |
| | | 211550_at | SNP_A-1703938 |
| | | 211551_at≠ | SNP_A-1705242≠ |

≠Since there were 15 probe sets for EGFR on the Affymetrix U133 Plus 2 array, many of these have been omitted to abbreviate the table.
≠Since there were 29 SNPs for EGFR on the Affymetrix Mapping 100K array, the majority have been omitted to abbreviate the table.
Table 2.
Values obtained from two high throughput platforms for the EGFR gene.

| Gene | aCGH Studies: RPCI 6K BAC Array | Log2Ratio | Expression Studies: U133 Plus 2 Array | SLR |
|---|---|---|---|---|
| EGFR | RP11-81B20 | 4.41 | 1565483_at | 4.10 |
| | RP11-89E8 | 4.425 | 1565484_x_at | 2.93 |
| | | | 201983_s_at | 4.49 |
| | | | 201984_s_at | 4.67 |
| | | | 211550_at | −0.04 |
| | | | 211551_at | 0.40 |
| | | | 211607_x_at | 4.74 |
| | | | 224999_at | 4.66 |
| | | | 232120_at | 3.24 |
| | | | 232541_at | 4.42 |
| | | | 232925_at | 4.58 |
| | | | 233044_at | 2.90 |
| | | | 237938_at | 1.62 |
| | | | 243327_at | 0.45 |
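The gene-centric idea behind Table 2 can be sketched as follows. This is a minimal illustration, not the Overlay Tool's own code: the `ANNOTATION` dictionary and the function names `group_by_gene` and `merge_platforms` are hypothetical, and only a few of the Table 2 values are used.

```python
from collections import defaultdict

# Hypothetical annotation: platform-specific identifier -> gene symbol.
ANNOTATION = {
    "RP11-81B20": "EGFR",
    "RP11-89E8": "EGFR",
    "1565483_at": "EGFR",
    "201983_s_at": "EGFR",
}


def group_by_gene(platform, measurements, annotation=ANNOTATION):
    """Return {gene: [(platform, identifier, value), ...]} for one platform."""
    grouped = defaultdict(list)
    for identifier, value in measurements.items():
        gene = annotation.get(identifier)
        if gene is not None:  # identifiers without a gene assignment are skipped
            grouped[gene].append((platform, identifier, value))
    return grouped


def merge_platforms(*per_platform):
    """Combine per-platform groupings into one gene-keyed dictionary."""
    merged = defaultdict(list)
    for grouped in per_platform:
        for gene, rows in grouped.items():
            merged[gene].extend(rows)
    return dict(merged)


# Values taken from Table 2 (Log2Ratio for the BACs, SLR for the probe sets).
acgh = group_by_gene("aCGH", {"RP11-81B20": 4.41, "RP11-89E8": 4.425})
expr = group_by_gene("expression", {"1565483_at": 4.10, "201983_s_at": 4.49})
overlay_input = merge_platforms(acgh, expr)
```

Keying every measurement on the gene symbol is what allows data from different platforms to be combined without remapping one platform's probes onto another's coordinates.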
Annotation files for the BAC arrays are generated by using the FISH mapping and BAC end sequence information (available at http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/) and comparing them to the NCBI published positional information of the known genes determined by RefSeq alignment, protein evidence and/or transcript evidence. In our specific application, a BAC must cover a minimum of 30% of the gene sequence to be associated with that gene, which avoids erroneous assignment of a gene to a BAC with minimal overlap.

The gene symbols associated with the Affymetrix probe set IDs were used to generate the annotation files for the Affymetrix platforms after converting them to the HUGO-approved unique identifiers. This process facilitated identifying those Affymetrix probe sets that used either a previous HUGO symbol or an alias as their main identifier. Whenever a particular probe set points to a gene symbol that is not unique (that is, the symbol is both a previous symbol of another gene and a current approved symbol in the HUGO database), the mapping information provided by Affymetrix is cross-compared to the HUGO database to manually resolve the accepted identifier.

With the Affymetrix mapping SNP arrays, only SNPs that are located within 10,000 bp of the genomic location of the start or the end of a gene are used. The mapping SNP arrays also provide an estimate of copy number at each locus, and their high density provides more extensive coverage than the BAC arrays (see below). However, due to the noise levels associated with the SNP arrays, we felt it was not appropriate to apply the inference-type evaluations that are used for the BAC arrays.
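The two positional rules above (the 30% coverage requirement for BACs and the 10,000 bp window for SNPs) can be sketched as simple interval tests. This is an illustrative sketch only: the function names and the coordinates are made up, and treating SNPs inside the gene body as accepted is an assumption about how the window is applied.

```python
def coverage_fraction(gene_start, gene_end, bac_start, bac_end):
    """Fraction of the gene interval [gene_start, gene_end) that the BAC overlaps."""
    gene_length = gene_end - gene_start
    if gene_length <= 0:
        raise ValueError("gene_end must be greater than gene_start")
    overlap = min(gene_end, bac_end) - max(gene_start, bac_start)
    return max(overlap, 0) / gene_length


def bac_maps_to_gene(gene, bac, min_fraction=0.30):
    """True when the BAC covers at least min_fraction of the gene."""
    return coverage_fraction(gene[0], gene[1], bac[0], bac[1]) >= min_fraction


def snp_is_used(snp_pos, gene, window=10_000):
    """True when the SNP lies between (gene start - window) and (gene end + window).

    Assumption: SNPs inside the gene body are also accepted.
    """
    return gene[0] - window <= snp_pos <= gene[1] + window


# Made-up coordinates, for illustration only.
gene = (100_000, 200_000)
bac_half = (150_000, 400_000)   # overlaps 50% of the gene
bac_edge = (190_000, 400_000)   # overlaps 10% of the gene

accepted = bac_maps_to_gene(gene, bac_half)
rejected = bac_maps_to_gene(gene, bac_edge)
```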
(A) Genomic location and cytogenetic information of the EGFR gene, a category ‘I’ gene in the RPCI 6K BAC array. Notice that EGFR is covered by two BACs (RP11-81B20 and RP11-89E8). (B) Genomic location and cytogenetic information of the tumor suppressor PTEN, a category ‘G’ gene in the RPCI 6K BAC array. Notice that PTEN is flanked by two specific BACs on the RPCI 6K BAC array: RP11-79A15 and RP11-129G17.
‘G’ Class
Unlike the ‘I’ class genes, ‘G’ class genes are not interrogated by a specific BAC on the array; they lie in regions of the chromosome that are not covered by any BAC. Because the signal-to-noise ratio of the BAC arrays is consistently high, these genes are instead evaluated using data ‘inferred’ from the neighboring BACs: if two adjacent BACs on the array show a CNA, it is inferred that the region between them follows the same trend (gain or loss). For example, the locus containing the tumor suppressor gene, phosphatase and tensin homolog (PTEN), is deleted in a small proportion of malignant glioblastomas (Fig. 6). The PTEN gene sequence is not specifically represented on the RPCI 6K BAC array but is flanked by BACs RP11-79A15 and RP11-129G17. The values of the two neighboring BACs are first evaluated, either by comparison to user-defined thresholds (non-categorized data) or by using categories previously generated through statistical analysis (categorized data), to give each a data point-specific category. The platform-specific summary category is then evaluated by one of two currently supported approaches: the One-Agree approach or the Both-Agree approach. When the Both-Agree approach is selected, the platform-specific summary category deviates from ‘no change’ only when both of the neighboring BACs have concordant data point-specific categories of gain or loss. In contrast, under the One-Agree approach, either one of the two BACs showing a trend will generate a platform-specific summary category of that trend. In our opinion, the One-Agree method can be dangerously liberal, and while we recognize that it may not be appropriate in all circumstances, we have included this option to be used solely in highly specific scenarios.
We have found that the Both-Agree approach may lead to the erroneous exclusion of genes residing in gaps adjacent to large amplifications. In such cases, the use of the One-Agree approach is valid, with the caveat that it is applied in the appropriate manner.
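The difference between the two approaches can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the ±0.3 log2 ratio thresholds are hypothetical, and returning ‘no change’ when the two flanking BACs conflict (one gain, one loss) under One-Agree is an assumption, since the text does not describe that case.

```python
NO_CHANGE = "no change"


def categorize(log2_ratio, gain_threshold=0.3, loss_threshold=-0.3):
    """Data point-specific category from user-defined thresholds (values are illustrative)."""
    if log2_ratio >= gain_threshold:
        return "gain"
    if log2_ratio <= loss_threshold:
        return "loss"
    return NO_CHANGE


def infer_g_class(left, right, approach="both"):
    """Platform-specific summary category for a gene flanked by two BACs."""
    if approach == "both":
        # Deviate from 'no change' only when both neighbors agree.
        return left if left == right else NO_CHANGE
    if approach == "one":
        changed = {c for c in (left, right) if c != NO_CHANGE}
        # Exactly one distinct trend among the neighbors: report it.
        # Assumption: conflicting gain/loss is reported as 'no change'.
        return changed.pop() if len(changed) == 1 else NO_CHANGE
    raise ValueError("approach must be 'both' or 'one'")


# A gene in a gap next to an amplified BAC (made-up log2 ratios).
left, right = categorize(2.1), categorize(0.05)
both_call = infer_g_class(left, right, approach="both")
one_call = infer_g_class(left, right, approach="one")
```

In this made-up case the Both-Agree call is ‘no change’ while the One-Agree call is ‘gain’, which is the situation described for genes in gaps adjacent to large amplifications.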
A brief description of the highlighting convention used for displaying correlations in aCGHViewer using output from the Overlay Tool. Depending on the choice of overlay type, the colors used to highlight concordance may signify different types of agreement. This is especially evident for the overlay of aCGH with LOH.

Data points highlighted in aCGHViewer:

| Overlay Type | Red | Green | Black |
|---|---|---|---|
| aCGH/Expression | Gain/Upregulation | Loss/Downregulation | No agreement, not covered by all platforms, no change |
| SNP CN/Expression | Gain/Upregulation | Loss/Downregulation | No agreement, not covered by all platforms, no change |
| aCGH/SNP LOH/Expression | Gain/Upregulation/LOH | Loss/Downregulation/LOH | No agreement, not covered by all platforms, no change |
| SNP LOH/Expression | LOH/Upregulation | LOH/Downregulation | No agreement, not covered by all platforms, no change |
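The highlighting convention in this table amounts to a lookup from the platform-specific categories to a color. The sketch below is only an illustration of that mapping: the `RULES` dictionary, the category strings and the function name `highlight_color` are hypothetical, and the categories are assumed to be passed in the order in which the platforms appear in the overlay type name.

```python
# Concordant combinations per overlay type; anything else is shown in black.
RULES = {
    "aCGH/Expression": {("gain", "up"): "red", ("loss", "down"): "green"},
    "SNP CN/Expression": {("gain", "up"): "red", ("loss", "down"): "green"},
    "aCGH/SNP LOH/Expression": {
        ("gain", "LOH", "up"): "red",
        ("loss", "LOH", "down"): "green",
    },
    "SNP LOH/Expression": {("LOH", "up"): "red", ("LOH", "down"): "green"},
}


def highlight_color(overlay_type, *categories):
    """Return 'red', 'green' or 'black' for one gene.

    A category of None stands for a gene not covered by that platform;
    it never matches a rule, so the gene falls through to 'black'.
    """
    return RULES[overlay_type].get(tuple(categories), "black")


amplified_and_up = highlight_color("aCGH/Expression", "gain", "up")
up_without_gain = highlight_color("aCGH/Expression", "no change", "up")
```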
For genes with single data points, all of the evaluation methods will generate the same result. However, when a gene is represented by multiple probes (probe sets), the results could differ slightly depending on which evaluation method is selected. To show the subtle nuances of the various evaluation methods, we have undertaken a preliminary analysis of expression data for one gene for user reference. As an example, analysis of the EGFR gene demonstrates the changes that might be encountered using the different calculation options. EGFR is represented by multiple probe sets on the Affymetrix U133 Plus 2 platform. Using SAM, a two-class unpaired analysis (using a t-test as the test statistic with 5,000 permutations) was performed among samples with and without evidence of copy number amplification of the EGFR locus using the BAC array data (Rossi et al. 2005). All probe sets were weighted equally and a threshold value of 1.5 (SLR) was chosen for this analysis (see Fig. 7). All of the methods used for this analysis, regardless of stringency, gave the same result for samples that showed a physical amplification, with the exception of one tumor sample (#57). Analysis of tumors with no amplification, however, showed a distinct difference between the more liberal evaluation methods (‘majority’ and ‘max-min’) and the more stringent evaluation methods (‘weighted means’ and ‘median’). The results from the ‘majority’ and ‘max-min’ evaluation methods indicate that there is, in general, an increase in expression, even though there is no amplification. Using the ‘weighted means’ and ‘median’ evaluation methods, an increase in expression is not detected in any of the samples without physical amplification. However, weighting the different probe sets based on their suffixes may introduce additional changes when generating the platform-specific category (see Fig. 7).
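To make the comparison concrete, the four evaluation methods can be sketched as follows. The exact definitions used by the Overlay Tool are not given in this section, so the ‘majority’ rule (most frequent up/down call among the probe sets that change) and the ‘max-min’ rule (call from the most extreme probe set) are assumptions, as are all function names. The EGFR values are the SLRs from Table 2; the second sample is made up.

```python
from statistics import median

# SLR values for the EGFR probe sets, taken from Table 2.
EGFR_SLR = [4.10, 2.93, 4.49, 4.67, -0.04, 0.40, 4.74,
            4.66, 3.24, 4.42, 4.58, 2.90, 1.62, 0.45]


def call(value, threshold=1.5):
    """Category for a single value against a symmetric SLR threshold."""
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return "no change"


def by_median(values, threshold=1.5):
    return call(median(values), threshold)


def by_weighted_mean(values, weights=None, threshold=1.5):
    weights = weights or [1.0] * len(values)  # equal weighting by default
    mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    return call(mean, threshold)


def by_majority(values, threshold=1.5):
    """Assumed rule: the more frequent of the up/down calls; ties give 'no change'."""
    calls = [call(v, threshold) for v in values]
    ups, downs = calls.count("up"), calls.count("down")
    if ups > downs:
        return "up"
    if downs > ups:
        return "down"
    return "no change"


def by_max_min(values, threshold=1.5):
    """Assumed rule: 'up' if the maximum passes, 'down' if the minimum passes."""
    up = max(values) >= threshold
    down = min(values) <= -threshold
    if up != down:
        return "up" if up else "down"
    return "no change"


methods = (by_majority, by_max_min, by_weighted_mean, by_median)
amplified = [m(EGFR_SLR) for m in methods]
# Made-up sample in which a single probe set exceeds the threshold.
single_outlier = [m([2.0, 0.1, 0.2, 0.0]) for m in methods]
```

Under these assumed definitions, all four methods call the Table 2 values ‘up’, whereas for the made-up sample only the two liberal methods do, which mirrors the contrast described above.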
Figure 7.
A preliminary evaluation of the differences in calculation methods and their subsequent determination of the platform-specific category for the EGFR gene. The expression of EGFR is interrogated by multiple probe sets on the Affymetrix GeneChip U133 Plus 2 array. Samples were segregated by the presence or absence of EGFR amplification as seen on the BAC arrays (data not shown). Notice that in tumor samples with EGFR amplification, all of the evaluation methods show upregulation of EGFR expression, regardless of the choice of calculation method (with the exception of the ‘median’ approach on tumor #57). Where there is no EGFR amplification, the choice of calculation method has a large impact on whether the platform-specific category is called up, down or no change. However, in the overlay of the aCGH platform with expression, those tumor samples without amplification would yield a cross-platform category of “no agreement” and thus would not be highlighted in the overlay.
Authors: Michael R Rossi; Jeffrey La Duca; Sei-Ichi Matsui; Norma J Nowak; Lesleyann Hawthorn; John K Cowell Journal: Genes Chromosomes Cancer Date: 2005-12 Impact factor: 5.006
Authors: Raoul-Sam Daruwala; Archisman Rudra; Harry Ostrer; Robert Lucito; Michael Wigler; Bud Mishra Journal: Proc Natl Acad Sci U S A Date: 2004-11-08 Impact factor: 11.205
Authors: Ganesh Shankar; Michael R Rossi; Devin E McQuaid; Jeffrey M Conroy; Daniel G Gaile; John K Cowell; Norma J Nowak; Ping Liang Journal: Cancer Inform Date: 2006
Authors: Wessel N van Wieringen; Kristian Unger; Gwenaël G R Leday; Oscar Krijgsman; Renée X de Menezes; Bauke Ylstra; Mark A van de Wiel Journal: BMC Bioinformatics Date: 2012-05-04 Impact factor: 3.169