Raj Chari, Bradley P Coe, Craig Wedseltoft, Marie Benetti, Ian M Wilson, Emily A Vucic, Calum MacAulay, Raymond T Ng, Wan L Lam.
Abstract
BACKGROUND: High-throughput microarray technologies have afforded the investigation of genomes, epigenomes, and transcriptomes at unprecedented resolution. However, software packages to handle, analyze, and visualize data from these multiple 'omics disciplines have not been adequately developed.
Year: 2008 PMID: 18840289 PMCID: PMC2571113 DOI: 10.1186/1471-2105-9-422
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features required for integrative analysis
| Built-in segmentation for array CGH | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Consensus calling using multiple segmentation algorithms | ✓ | |||||||
| Array platform-independent combined CGH analysis | ✓ | ✓ | ✓ | |||||
| Custom microarray data handling | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Basic copy number and expression integration | ✓ | ✓ | ✓ | ✓ | ||||
| Alignment and analysis of genetic and epigenetic data | ✓ | ✓ | ✓ | |||||
| Multi-dimensional visualization of genetic, epigenetic and gene expression data | ✓ | |||||||
| Two group statistical comparison | ✓ | ✓ | ✓ | ✓ | ||||
| Two group combinatorial gene dosage and gene expression comparison | ✓ | |||||||
| Linking to external biological databases | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Linking to external gene expression (GEOProfiles) | ✓ | |||||||
| Context-based visualization of genome features | ✓ | ✓ | ✓ | ✓ | ✓ | |||
| Conversion of data between different genome assemblies | ✓ | ✓ | ✓ | ✓ | ||||
| Free for academic/research use | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Figure 1. Main structural components of SIGMA2. Data and genome mapping information are stored in the MySQL database. Segmentation analysis (using DNACopy and GLAD) and statistical analysis are performed in R, with results stored in the database. The application is written in Java, specifically the user interface and the different types of visualization. Base-pair positions and gene annotations are linked to other biological databases to facilitate further interrogation by the user.
Figure 2. (A) Data hierarchy describing the relationship between platforms, assays and 'omics disciplines. (B) Functionality map of SIGMA2, listing the functions available, and the output of each, for a given number of samples or sample groups and dimensions. Multiple-sample analyses (single group and two group) are microarray platform independent. Functions listed in a box are available in addition to those listed in the box preceding the arrows.
Summary of Input, analysis, output for each dimension
| Genomics | Copy number | Array CGH | Segmentation | Regions of gain and loss |
| Genomics | LOH | SNPs* | LOH based on consecutive altered markers | Regions of LOH |
| Genomics | LOH | Microsatellite markers | Same as above | Same as above |
| Genomics | Copy number, LOH | Identify regions of uniparental disomy (UPD): LOH with no copy number change | ||
| Epigenomics | DNA methylation | MeDIP + array CGH | Direct thresholding | Regions of enrichment and lack of methylation |
| Epigenomics | DNA methylation | Bisulphite-based | Visualization against genome position | |
| Epigenomics | Histone modification states | ChIP-on-chip | Direct thresholding | Regions of enrichment and lack of enrichment |
| Epigenomics | DNA methylation, Histone modification states | Epigenetic interplay | Regions of mutually exclusive change between chromatin state and DNA methylation | |
| Transcriptomics | Gene expression** | Microarrays | Heatmap visualization, clustering | Expression of genes of interest based on DNA analysis |
| Transcriptomics | Gene expression** | SAGE | Heatmap visualization, clustering | Expression of genes of interest based on DNA analysis |
| Genomics, Transcriptomics | Copy number, Gene expression | Correlation analysis of copy number and expression | Genes whose expression is strongly regulated by copy number | |
| Genomics, Epigenomics | Copy number, DNA methylation | Identify regions of concerted change in BOTH copy number and methylation ("two-hit") | ||
| Genomics, Epigenomics | LOH, DNA methylation | Identify allele-specific methylation events | Regions of allele specific aberrant methylation | |
| Genomics, Epigenomics, Transcriptomics | Copy number, LOH, DNA methylation, Histone modification, Gene expression | Identify co-ordinate genetic, epigenetic and gene expression changes | Genes altered at multiple levels |
* Affymetrix and Illumina data must be pre-processed prior to import; ** functionality invoked in the context of genetic and epigenetic data analyses; *** aligned to genome features (Database of Genomic Variants, CpG islands, microRNAs, etc.)
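The copy number and gene expression correlation listed in the table can be illustrated with a minimal sketch. The table does not say which correlation measure is used, so the Pearson coefficient, the function name `pearson`, and the per-sample values below are assumptions chosen for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-sample values for one gene (not taken from the paper).
copy_number_log2 = [0.0, 0.1, 0.8, 1.2, -0.5]
expression_log2 = [7.9, 8.1, 9.6, 10.3, 7.0]
r = pearson(copy_number_log2, expression_log2)
```

A gene with a coefficient close to 1 across samples would be a candidate for expression driven by gene dosage; the cut-off for "strongly regulated" is left to the analyst here.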
Figure 3. Algorithm for integrating data from different array platforms. Data for every platform are matched to genomic position. Subsequently, an interval-based approach is used to systematically query the data for each interval. In this figure, the interval, k, is 10 kb in size. Because everything is converted to genomic position, sample sets of the same disease type but from different array platforms can be aggregated, affording the user additional statistical power.
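The interval-based approach can be sketched as follows. This is an illustration under stated assumptions, not the SIGMA2 implementation: the function name `bin_by_interval`, the choice to average probe values within an interval, and the probe data are all hypothetical. Only the interval size k = 10 kb comes from the caption.

```python
from collections import defaultdict

def bin_by_interval(probes, k=10_000):
    """Average probe values within fixed genomic intervals of k base pairs.

    probes: iterable of (chromosome, position_bp, value) tuples.
    Returns {(chromosome, interval_index): mean value}.
    Averaging is an assumption; the caption only says each interval is queried.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, value in probes:
        key = (chrom, pos // k)  # interval index along the chromosome
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical probes from two platforms with different probe positions.
platform_a = [("chr7", 55_001_200, 0.9), ("chr7", 55_008_700, 1.1)]
platform_b = [("chr7", 55_004_000, 0.8), ("chr7", 55_015_500, 0.0)]

binned_a = bin_by_interval(platform_a)
binned_b = bin_by_interval(platform_b)

# Intervals covered by both platforms can be compared or aggregated directly.
shared = sorted(set(binned_a) & set(binned_b))
```

Because both platforms are reduced to the same interval keys, samples profiled on either one can be pooled without a probe-to-probe mapping.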
Figure 4. Description of the SIGMA2 interface. (A) Customizable toolbar with shortcut buttons, (B) Project/Analysis tree to track work within and between sessions, (C) Main display area using tab-based navigation, (D) Information console and (E) Genome feature tracks. Here, a copy number change is displayed in the context of CpG islands (red), microRNAs (orange) and regions annotated in the Database of Genomic Variants (blue).
Figure 5. (A) Consensus calling using multiple algorithms. Multiple algorithms (and different parameters) can be selected to analyze a given array CGH sample, and this can be defined for each array platform independently, as each platform may exhibit different noise and ratio response characteristics. (B) Heterogeneous array analysis using data from multiple array CGH platforms. Samples from the Agilent 244K, Affymetrix SNP 500K and whole genome BAC arrays were segmented to define areas of gain and loss. Subsequently, the results were aggregated into a frequency histogram plot showing the common areas of gain and loss across the three samples.
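A minimal sketch of consensus calling across segmentation algorithms is given here. The caption does not state how agreement is decided, so the voting rule (an alteration is kept only when at least `min_agree` algorithms report it), the function name, and the per-interval calls are assumptions. DNACopy and GLAD are used only as labels for the two hypothetical call lists.

```python
from collections import Counter

def consensus_calls(calls_by_algorithm, min_agree):
    """Combine per-interval calls ("gain", "loss", "neutral") from several algorithms.

    calls_by_algorithm: {algorithm_name: [call per interval]}, lists of equal length.
    An interval keeps an altered call only when at least min_agree algorithms
    report the same alteration; otherwise it is called "neutral".
    """
    call_lists = list(calls_by_algorithm.values())
    if len({len(calls) for calls in call_lists}) != 1:
        raise ValueError("all algorithms must report the same number of intervals")
    consensus = []
    for interval_calls in zip(*call_lists):
        # Count only altered calls, then take the most frequent alteration.
        votes = Counter(c for c in interval_calls if c != "neutral")
        best = votes.most_common(1)
        if best and best[0][1] >= min_agree:
            consensus.append(best[0][0])
        else:
            consensus.append("neutral")
    return consensus

# Hypothetical calls for four intervals of one sample.
calls = {
    "DNACopy": ["gain", "gain", "neutral", "loss"],
    "GLAD": ["gain", "neutral", "neutral", "loss"],
}
result = consensus_calls(calls, min_agree=2)
```

Requiring agreement trades sensitivity for specificity: the second interval, called as a gain by only one algorithm, is reported as neutral.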
Figure 6. Parallel visualization and analysis of the copy number and genotype profiles of the breast cancer cell line HCC2218. The genotype profile of the matching normal blood lymphoblast line (HCC2218BL) is also provided to define regions of LOH. The DNA copy number profile was generated on the BCCA whole genome tiling path BAC array, and the genotype profiles are from the Affymetrix SNP 10K array [28]. This region of chromosome arm 3q has a defined segmental copy number loss, and the boundary of the change is evident from the LOH profile. In the genotype profile, the horizontal blue lines indicate a SNP transition from heterozygous in the normal sample to homozygous in the tumor, indicating LOH.
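The LOH definition in this caption (heterozygous in normal, homozygous in tumor) can be sketched as below. The genotype encoding ("AA", "AB", "BB"), the function names, the `min_markers` threshold, and the example data are assumptions for illustration; the source only states that LOH regions are based on consecutive altered markers.

```python
def classify_snp(normal, tumor):
    """Return True for LOH, False for retained heterozygosity, None if uninformative."""
    if normal != "AB":
        return None  # homozygous in normal: cannot reveal LOH
    return tumor in ("AA", "BB")

def loh_regions(positions, normal_calls, tumor_calls, min_markers=2):
    """Report (start, end) spans of at least min_markers consecutive LOH SNPs.

    Uninformative SNPs are skipped, so they neither extend nor break a run.
    """
    regions, run = [], []
    for pos, n, t in zip(positions, normal_calls, tumor_calls):
        state = classify_snp(n, t)
        if state is None:
            continue
        if state:
            run.append(pos)
            continue
        if len(run) >= min_markers:
            regions.append((run[0], run[-1]))
        run = []
    if len(run) >= min_markers:
        regions.append((run[0], run[-1]))
    return regions

# Hypothetical SNP positions and genotype calls (not from the paper).
positions = [100, 200, 300, 400, 500, 600]
normal = ["AB", "AB", "AA", "AB", "AB", "AB"]
tumor = ["AA", "BB", "AA", "AA", "AB", "BB"]
regions = loh_regions(positions, normal, tumor)
```

Overlaying such regions on the segmented copy number profile is what allows the boundary of a loss to be cross-checked against the genotype data.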
Figure 7. A two-group, two-dimensional comparison of 37 NSCLC and 16 SCLC cancer cell lines. First, segmentation analysis is performed to delineate gains and losses in each sample. Next, a statistical comparison of the distribution of gains and losses between the two groups is performed using Fisher's exact test. (A) Using the interactive search, one of the regions of difference identified is on chromosome 7, with an NSCLC and an SCLC sample aligned next to each other. The NSCLC sample has a clear segmental gain of that region, whereas the SCLC sample does not. The right-most graph is a frequency plot summary of the two sample sets (NSCLC and SCLC). NSCLC is color-coded in red and SCLC in green, and the overlap appears in yellow. The frequency of chromosome arm 7p gain is higher in the red group. (B) A heatmap representing 15 NSCLC and 15 SCLC gene expression profiles for the genes in the region highlighted in yellow. (C) Gene expression data for EGFR, a gene in this region, show that expression is drastically higher in NSCLC than in SCLC, as predicted by the higher frequency of gain of that region in NSCLC. Gene expression data are represented as log2 of the normalized intensities.
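The group comparison described in this caption can be illustrated with a small, self-contained two-sided Fisher's exact test. The group sizes (37 and 16) come from the caption; the counts of samples with a gain are hypothetical, and the function is a sketch rather than the routine SIGMA2 runs in R.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-7))
    return min(p, 1.0)

# Rows: NSCLC (37 lines), SCLC (16 lines). Columns: gain, no gain.
# The gain counts (20 and 1) are hypothetical, chosen only for illustration.
p_value = fisher_exact_two_sided(20, 17, 1, 15)
```

In a genome-wide scan the same test would be repeated per region, so the resulting p-values would normally need a multiple-testing correction before regions are reported.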
Figure 8. Multi-dimensional perspective of chromosome 17 of the HCC2218 breast cancer cell line. Copy number, LOH, and DNA methylation profiling identifies an amplification of ERBB2 coinciding with allelic imbalance and loss of methylation. When gene expression is examined, ERBB2 expression in HCC2218 is significantly higher than in a panel of normal luminal and myoepithelial cell lines [29].