Ajanthah Sangaralingam, Abu Z Dayem Ullah, Jacek Marzec, Emanuela Gadaleta, Ai Nagano, Helen Ross-Adams, Jun Wang, Nicholas R Lemoine, Claude Chelala.
Abstract
Innovations in -omics technologies have driven advances in biomedical research. However, integrating and analysing the large volumes of data generated by different high-throughput -omics technologies remains a significant challenge for basic and clinical scientists without bioinformatics skills or access to bioinformatics support. To address this demand, we have significantly updated our previous O-miner analytical suite, incorporating several new features and data types to provide an efficient and easy-to-use Web tool for the automated analysis of data from -omics technologies. Created from a biologist's perspective, this tool allows for the automated analysis of large and complex transcriptomic, genomic and methylomic data sets, together with biological/clinical information, to identify significantly altered pathways and prioritize novel biomarkers/targets for biological validation. Our resource can be used to analyse both in-house data and the large amount of publicly available data from array and sequencing platforms. Multiple data sets can be easily combined, allowing for meta-analyses. Here, we describe the analytical pipelines currently available in O-miner and present examples of use to demonstrate its utility and relevance in maximizing research output. The O-miner Web server is free to use and is available at http://www.o-miner.org.
Year: 2019 PMID: 28981577 PMCID: PMC6357557 DOI: 10.1093/bib/bbx080
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Comparison of workflows and features between O-miner version 1.0 and version 2.0
| Pipeline stage | Workflow | Feature | O-miner v1.0 | O-miner v2.0 |
|---|---|---|---|---|
| Supported platforms | Transcriptomics | Affymetrix Expression Array (Human Genome) | ✓ | Features added to this pipeline (see below) |
| | | Affymetrix Expression Array (Mouse Genome) | ✗ | ✓ |
| | | Illumina Expression Array (Human Genome) | ✗ | ✓ |
| | | Illumina Expression Array (Mouse Genome) | ✗ | ✓ |
| | | Affymetrix microRNA Array (Human Genome) | ✗ | ✓ |
| | | Affymetrix Exon Array (Human Genome) | ✗ | ✓ |
| | | RNA-Seq (post-processing) | ✗ | ✓ |
| | Genomics | Affymetrix SNP Array (Human Genome) | ✓ | Simplified |
| | Methylation | Illumina Methylation Array (Human Genome) | ✗ | ✓ |
| Input parameters | Transcriptomics | Data type | Raw CEL file, normalized/filtered | |
| | | Data source | User-defined, GEO repository | Automated suggestion for phenotype annotation from GEO data set |
| | | Analysis type | Paired, unpaired | |
| | | Provision for technical replicate | ✓ | |
| | | Provision for batch effect correction | ✗ | ✓ |
| | | Provision for survival data | ✗ | ✓ |
| | | Provision for estimating tumour purity | ✗ | ✓ |
| | | Provision for uploading target matrix | ✗ | ✓ |
| | Genomics | Analytical pipeline | CBS | ASCAT, genome sequencing (post-processing) |
| | | Data type | Raw CEL file, normalized, segmented, binary coded | |
| | | Data source | User-defined, GEO repository | |
| | | Analysis type | Paired, unpaired | |
| | | Baseline | User-defined, HapMap | |
| | | Provision for uploading target matrix | ✗ | ✓ |
| | Methylomics | Data type | ✗ | Raw IDAT file, normalized |
| | | Data source | ✗ | User-defined, GEO repository |
| | | Analysis type | ✗ | Paired, unpaired |
| | | Provision for technical replicate | ✗ | ✓ |
| | | Provision for batch effect correction | ✗ | ✓ |
| | | Provision for uploading target matrix | ✗ | ✓ |
| | QC | | ArrayMvout, ArrayQualityMetrics | LUMI (Illumina array) |
| Analysis parameters | Transcriptomics | Normalization | RMA, GCRMA, TRMA | RSN, SSN, VSN, Quantile (Illumina array) |
| | | Filter method | IQR, SD, intensity | |
| | | Differential expression method | LIMMA | edgeR (RNA-Seq) |
| | | Adjustment method | BH, FDR, BY, Holm | |
| | | Provision for | Yes | |
| | | Provision for fold-change threshold | Yes | |
| | | Gene annotation system | RefSeq, Ensembl, UCSC, Vega | |
| | Genomics | Miscellaneous | miRNA, Cytoband, conserved TFBS | |
| | | Minimal common region finder algorithm | CGHregions | |
| | | Provision for defining CNA region | Yes | |
| | QC | | ✗ | ChAMP |
| | Methylomics | Normalization | ✗ | BMIQ, SWAN, PBC |
| | | Filter method | ✗ | IQR, SD, intensity |
| | | Differential methylation method | ✗ | LIMMA |
| | | Adjustment method | ✗ | BH, FDR, BY, Holm |
| | | Provision for | ✗ | Yes |
| | | Provision for fold-change threshold | ✗ | Yes |
| | QC | | ArrayMvout report, ArrayQualityMetrics report, Cluster plot | LUMI report (Illumina array), tumour purity report |
| Output | Transcriptomics | Differential expression | Gene level | Transcript, exon, splice level (Affymetrix Exon array) |
| | | Miscellaneous | GO, Venn diagram, Expression plot | Survival plot, correlation tables |
| | QC | | Density plot, cluster plot | |
| | Genomics | Copy number alteration | Gain, Loss | Copy neutral LOH (ASCAT), copy number from genome-sequencing data |
| | | Visualization | CNA regions (sample and group level), MCR (group level) | |
| | QC | | ArrayMvout report, ArrayQualityMetrics report, Cluster plot | Output from ASCAT algorithm |
| | Methylomics | Differential methylation | ✗ | CpG island level |
| | | Miscellaneous | ✗ | GO, Venn diagram, methylation plot, correlation table |
Note: Workflows and features available in O-miner version 1.0 are compared with O-miner version 2.0
Platforms and data types supported by O-miner
| Workflow | Data | Manufacturer | Platform |
|---|---|---|---|
| Transcriptomics | R, N | Affymetrix | miRNA 2.0 |
| | | | miRNA 3.0 |
| | | | GeneChip Human Exon 1.0ST |
| | | | GeneChip Human Gene 1.1ST & 2.0ST |
| | | | GeneChip Human Genome Array U133 Plus 2.0 |
| | | | GeneChip Human Genome Array U133 set |
| | | | GeneChip Human Genome Array U95 set |
| | | | GeneChip Mouse Genome 430 2.0 |
| | N, U | Illumina | HumanHT-12 v3 |
| | | | HumanHT-12 v4 |
| | | | MouseRef-8 v2.0 |
| | | Multiple | RNA-Seq (post-processing only) |
| Genomics: CBS | R, N, S, B | Affymetrix | 10K |
| | | | 50K Xba |
| | | | 50K Hind |
| | | | 100K |
| | | | 250K Sty |
| | | | 250K Nsp |
| | | | 500K |
| | | | Genome-Wide Human 5.0 SNP array |
| | | | Genome-Wide Human 6.0 SNP array |
| Genomics: ASCAT (cancer-specific) | R | Affymetrix | 50K Xba |
| | | | 50K Hind |
| | | | 100K |
| | | | 250K Sty |
| | | | 250K Nsp |
| | | | 500K |
| | | | Genome-Wide Human 5.0 SNP array |
| | | | Genome-Wide Human 6.0 array |
| Genomics: Sequencing | P | Multiple | Genome-sequencing (post-processing only) |
| Methylation | R, N | Illumina | Infinium HumanMethylation 27K BeadChip |
| | | | Infinium HumanMethylation 450K BeadChip |
Code: R: raw; N: normalized; U: unnormalized; S: segmented; B: binary coded; P: processed.
Note: O-miner supports the analysis of pre-processed data from RNA-Seq experiments and genomic sequencing data; raw/processed data files generated using Affymetrix and Illumina transcriptomic and genomic arrays; and raw/processed data files from the Illumina Infinium methylation platform.
Figure 1. Transcriptomics workflow. O-miner takes as input raw array data (CEL files) from Affymetrix array-based platforms and either normalized or unnormalized data from Illumina expression arrays. QC is performed on data from raw CEL files. Data are then normalized and filtered to remove redundant probes. Users performing a meta-analysis have the option to apply the ComBat algorithm to correct for batch effects when combining data from different studies. Tumour purity can be estimated for Affymetrix data using the ESTIMATE algorithm. Survival analysis can be run for data from all of the array-based platforms. The normalized expression matrix is then subjected to differential expression analysis using LIMMA to identify significantly differentially expressed genes (DEGs) between biological groups. Optionally, GO terms that are statistically over- or under-represented are identified using GOstats, and Venn diagrams may be generated. Results are displayed online in expandable tabs and can be downloaded as text and Excel files. (A) Heatmaps of the statistically significant DEGs identified for each of the comparisons are available to download. (B) A boxplot displaying the expression profiles across the biological conditions can be viewed. (C) A Venn diagram showing common and unique genes that are differentially expressed across the biological groups is displayed, if selected from the output options.
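The Venn diagram in panel C amounts to set operations on the DEG lists from two comparisons. A minimal Python sketch, assuming each comparison yields a list of gene symbols; the function name `venn_regions` and the example gene lists are hypothetical illustrations, not part of O-miner or its results:

```python
def venn_regions(deg_a, deg_b):
    """Split two DEG lists into the three regions of a two-set Venn diagram."""
    a, b = set(deg_a), set(deg_b)
    return {
        "only_a": a - b,  # DEGs unique to comparison A
        "only_b": b - a,  # DEGs unique to comparison B
        "common": a & b,  # DEGs shared by both comparisons
    }


# Made-up gene lists, used only to show the call.
regions = venn_regions(["GABRP", "KRT5", "FOXA1"], ["GABRP", "ESR1"])
```

The sizes of the three sets are the numbers that would be printed inside the diagram.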
Figure 2. RNA-Seq post-processing workflow. O-miner provides a workflow for the post-processing of data from RNA-Seq experiments. After the pre-processing stage, comprising QC and alignment steps, a matrix of either raw read counts or RPKM values for each sample is submitted to O-miner. A choice of differential expression analysis methods is available: LIMMA for raw read counts and RPKM values, and edgeR for raw read counts. As in the transcriptomics workflow, users can then select the output options they wish to generate, including GO analysis and Venn diagrams. All results can be downloaded as text and Excel files. The result options and presentation are identical to those of the transcriptomics workflow. (A) Unsupervised hierarchical clustering plot from raw read counts data, displaying similarity between gene expression profiles. (B) Venn diagram showing the number of unique and common DEGs between the biological groups.
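For readers preparing the input matrix, the relationship between the two accepted value types can be sketched as follows. This is a minimal Python illustration of the standard RPKM definition (reads per kilobase of transcript per million mapped reads). It assumes, for simplicity, that the total number of mapped reads equals the sum of the counts in the sample; the function name, gene names and lengths are hypothetical:

```python
def rpkm(counts, gene_lengths_bp):
    """Convert raw read counts for one sample into RPKM values.

    counts: dict mapping gene -> raw read count
    gene_lengths_bp: dict mapping gene -> gene length in base pairs
    """
    # Assumption: library size is the sum of counts in this matrix column.
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    # RPKM = count * 10^9 / (gene length in bp * total mapped reads)
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }


# Made-up counts and lengths, used only to show the call.
sample_counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 3000}
values = rpkm(sample_counts, lengths)
```

In this toy sample, geneB has three times the reads of geneA but is also three times longer, so both genes receive the same RPKM value.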
Figure 3. Workflow for CBS analysis. The CBS pipeline generates information about regions of gain and loss. The CBS workflow comprises several steps, and the steps run depend on the input type. Raw image CEL files, log2ratios, segmented or binary coded data from a number of Affymetrix SNP arrays are used as input for the workflow. Aroma.affymetrix is applied to the raw CEL files for copy number estimation, data normalization and QC. Segmentation is applied using the CBS model. The quartile regression framework is applied to calculate the threshold used to call gains and losses. Regions of gain and loss are annotated from multiple sources. Minimal common regions can be generated using the CGHregions algorithm. (A) The results from each sample are displayed in expandable tabs. These tabs can be expanded further to obtain information about regions of loss and gain, with all findings available to download as an Excel file by clicking on the ‘xls’ link. Log2ratio plots based on filtered and unfiltered data are displayed and can be downloaded as PDFs by clicking on the ‘PDF’ icon. (B) For each of the biological groups, frequency plots from both filtered and unfiltered data can be viewed either across all chromosomes or for individual chromosomes. All the filtered frequency plots are available for download as a zipped file by clicking on the arrow on the right-hand side of the window displaying the chromosome number. Unfiltered frequency plots can be downloaded as PDFs by clicking on the ‘PDF’ icon. Results shown are from the analysis of data set GSE42525.
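The step from segment-level log2ratios to the group frequency plots in panel B can be illustrated with a short Python sketch. It assumes fixed cutoffs of ±0.2, which are hypothetical placeholders: the caption states that O-miner calculates its calling threshold from the data, and that calculation is not reproduced here. The function names are also illustrative only:

```python
def call_state(log2ratio, gain_cutoff=0.2, loss_cutoff=-0.2):
    """Label one segment mean as gain, loss or neutral (cutoffs are hypothetical)."""
    if log2ratio >= gain_cutoff:
        return "gain"
    if log2ratio <= loss_cutoff:
        return "loss"
    return "neutral"


def alteration_frequency(segment_means):
    """Fraction of samples in a group showing gain or loss at one region.

    segment_means: per-sample segment mean log2ratios for the same region.
    """
    if not segment_means:
        raise ValueError("at least one sample is required")
    states = [call_state(value) for value in segment_means]
    n = len(states)
    return {"gain": states.count("gain") / n, "loss": states.count("loss") / n}


# Made-up segment means for four samples at one region.
freq = alteration_frequency([0.5, 0.1, -0.4, 0.3])
```

Repeating this for every region along a chromosome gives the values a frequency plot would display for one biological group.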
Figure 4. Workflow for ASCAT analysis. Raw data files are accepted as input. Log2ratios (LRR) and B-allele frequencies (BAFs) are calculated using the R package CalMaTe. These are fitted to an ASPCF model. The ASCAT algorithm is used to estimate aberrant cell fraction, tumour ploidy and absolute allele-specific copy number calls. The results presented are from the analysis of the GSE7130 data set. (A) Raw LRR and BAF plots generated from ASCAT are shown for each sample. (B) Frequency plots of CNAs are also displayed for each biological group, with all frequency plots available for download as a zipped file. Frequency plots are shown across all the chromosomes and also for each individual chromosome. (C) Aberration plots are generated, showing regions of gain (red) and loss (blue) across each of the samples in the data set.
Figure 5. Methylation workflow. Raw (IDAT) files and normalized data from Illumina methylation array platforms are accepted as input to the methylation workflow. QC analysis is performed using the ChAMP R package. One of the following normalization methods can be chosen to normalize the data: BMIQ, SWAN or PBC. After filtering of the normalized data, differentially methylated probes are identified using LIMMA, with user-defined thresholds for the delta beta value and adjusted P-values applied. Differentially methylated regions are annotated, and users can choose to identify statistically significant GO terms from the list of differentially methylated probes. Results shown are from the analysis of data set GSE69118. (A) Sample quality, QC plots and cluster diagrams are presented. Sample quality displays a table showing the sample name and the percentage of failed probes for each sample. QC plots consist of four plots that are available for display and download: raw density plot, normalized density plot, raw MDS plot and normalized MDS plot. The cluster diagram displays an unsupervised hierarchical cluster based on normalized methylation data. (B) Each comparison is displayed within an expandable tab alongside information about probeset ID, chromosomal location, HGNC symbol, gene description, whether the region is differentially methylated, location of CpG island, delta beta value and adjusted P-values. A boxplot showing the difference in methylation values across biological groups can also be viewed for each probeset ID. (C) Individual comparisons are displayed as separate tabs. Each of the probes reported as differentially methylated is mapped to GO terms, with those found to be statistically over- or under-represented listed in tabular format.
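The two user-defined thresholds mentioned in this caption, delta beta and adjusted P-value, can be illustrated with a short Python sketch. It assumes that per-probe P-values have already been computed (in O-miner they come from LIMMA, which is not reproduced here) and applies the Benjamini-Hochberg (BH) adjustment, one of the adjustment methods listed for the tool. The function names, cutoff defaults and probe IDs are hypothetical:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select_probes(probes, delta_beta_cutoff=0.2, alpha=0.05):
    """Keep probes passing both thresholds (cutoff defaults are hypothetical).

    probes: list of (probe_id, mean_beta_group1, mean_beta_group2, p_value)
    """
    adjusted = bh_adjust([p[3] for p in probes])
    return [
        (probe_id, beta1 - beta2, q)
        for (probe_id, beta1, beta2, _), q in zip(probes, adjusted)
        if abs(beta1 - beta2) >= delta_beta_cutoff and q <= alpha
    ]


# Made-up probes: only the first has a large delta beta and a small adjusted P-value.
example = [
    ("probe_a", 0.80, 0.30, 0.001),
    ("probe_b", 0.55, 0.50, 0.002),
    ("probe_c", 0.20, 0.60, 0.2),
]
selected = select_probes(example)
```

Here `probe_b` fails the delta beta cutoff and `probe_c` fails the adjusted P-value cutoff, so only `probe_a` is retained.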
Figure 6. Application of the transcriptomics workflow to the multi-cohort analysis of breast cancer (BC) data. Data collection: A meta-analysis was conducted using O-miner to investigate the effect of basality on triple-negative (TN) BCs. Two Affymetrix data sets, GSE48390 and GSE21653, were downloaded by selecting GEO data set as the data source option. The subset of samples defined as triple negative was selected from the File Organiser window. Analysis parameters: Once all the sample characteristics and survival covariates were provided, the raw data were normalized using RMA and filtered using SD (top 10%). Samples belonging to each of the data sets were specified and the ComBat algorithm was applied to adjust for batch effects. The resulting normalized matrix was subjected to differential expression and survival analyses. All results can be downloaded as text and Excel files. Results: (A) Unsupervised hierarchical clustering of the gene expression profiles suggests that TNBL BCs are more similar to each other than to TNnonBL BCs. The cluster is annotated with the sample names and biological groups, and each biological group has its own colour. (B) The GABRP gene was reported as differentially expressed between the two biological groups. Its expression in the TNBL and TNnonBL groups can be displayed as boxplots. (C) The 5-year KM survival plot suggests that the TNBL group has poorer overall survival than the TNnonBL group, but this difference is not significant (P > 0.05). (D) Statistically significant GO terms between the TNBL and TNnonBL groups are displayed, with hyperlinks to external resources provided.
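The "SD (top 10%)" filter used in this example keeps the most variable genes across samples. A minimal Python sketch of that idea, assuming a normalized expression matrix held as a dictionary of per-gene value lists; the function name, the use of the sample standard deviation and the toy data are assumptions rather than O-miner's implementation:

```python
import math
from statistics import stdev


def top_sd_fraction(expression, fraction=0.10):
    """Return the genes with the highest standard deviation across samples.

    expression: dict mapping gene -> list of normalized values (two or more samples)
    fraction: proportion of genes to keep, e.g. 0.10 for the top 10%
    """
    ranked = sorted(expression, key=lambda gene: stdev(expression[gene]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Toy matrix of ten genes whose spread grows with the gene index.
toy = {f"gene{i}": [5.0, 5.0 + 0.1 * i, 5.0 - 0.1 * i] for i in range(1, 11)}
kept = top_sd_fraction(toy)
```

With ten genes and the default fraction, a single gene is kept: the one with the widest spread of values.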
Figure 7. Application of O-miner to the analysis of prostate cancer (PCa) sequencing data. Data collection: Sequencing data from the TCGA PRAD project were downloaded and subjected to the O-miner RNA-Seq post-processing workflow. Analysis parameters: Following pre-processing of data (QC and alignment steps), a matrix of raw read counts was generated. The matrix of normalized read counts was submitted to O-miner. LIMMA was used to identify DEGs, and statistically significant GO terms were identified. Users can choose to generate Venn diagrams. All results can be downloaded as text and Excel files. Results: (A) Significant DEGs are displayed together with Ensembl gene ID, chromosomal location, fold-change and adjusted P-values. (B) Results of the GO analysis of DEGs are displayed in tabular format. Over- and under-represented GO terms are listed together with GO IDs, P-values and GO term annotations.