Anjana Anilkumar Sithara1, Devi Priyanka Maripuri1, Keerthika Moorthy1, Sai Sruthi Amirtha Ganesh1, Philge Philip2, Shayantan Banerjee1, Malvika Sudhakar1, Karthik Raman1.
Abstract
Despite the tremendous increase in omics data generated by modern sequencing technologies, analyzing these data can be tricky and often requires substantial expertise in bioinformatics. To address this concern, we have developed a user-friendly pipeline for analyzing (cancer) genomic data that takes raw sequencing data (FASTQ format) as input and outputs insightful statistics. Our iCOMIC toolkit pipeline, which features many independent workflows, is embedded in the popular Snakemake workflow management system. It can analyze whole-genome and transcriptome data and has a user-friendly GUI that offers several advantages, including minimal execution steps and no need for complex command-line arguments. Notably, we have integrated algorithms developed in-house to predict pathogenicity among cancer-causing mutations and to differentiate between tumor suppressor genes and oncogenes from somatic mutation data. We benchmarked our tool against the Genome In A Bottle benchmark dataset (NA12878) and obtained the highest F1 scores of 0.971 and 0.988 for indels and SNPs, respectively, using the BWA MEM-GATK HC DNA-Seq pipeline. Similarly, we achieved a correlation coefficient of r = 0.85 using the HISAT2-StringTie-ballgown and STAR-StringTie-ballgown RNA-Seq pipelines on the human monocyte dataset (SRP082682). Overall, our tool enables easy analysis of omics datasets, considerably simplifying complex data analysis pipelines.
Year: 2022 PMID: 35899080 PMCID: PMC9310080 DOI: 10.1093/nargab/lqac053
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Figure 1. Schema for the iCOMIC pipeline. Multiple workflows are embedded in iCOMIC, giving users complete freedom to choose from the integrated tools. Both the DNA-Seq and RNA-Seq pipelines take raw FASTQ files as input, and quality control and alignment are steps common to both. FastQC and Cutadapt are the quality control tools, and MultiQC generates a consolidated report of the quality statistics. RNA-Seq analysis comprises mapping the sequencing reads to a reference genome with an aligner, quantifying expression levels with an expression modeller, and differential expression analysis. DNA-Seq analysis comprises alignment followed by variant identification and annotation. The tools incorporated in iCOMIC are listed in Table 1.
Table 1. List of tools incorporated in iCOMIC along with their corresponding functions
| Function | DNA-Seq tools | RNA-Seq tools |
|---|---|---|
| Quality control | FastQC, MultiQC, Cutadapt | FastQC, MultiQC, Cutadapt |
| Alignment | GEM-Mapper v3, BWA-MEM, Bowtie2 | STAR, HISAT2 |
| Variant calling | GATK HC, samtools mpileup, FreeBayes, GATK Mutect2 | - |
| Annotation | Annovar, SnpEff | - |
| Quantification of expression levels | - | StringTie, HTSeq |
| Differential expression | - | DESeq2, ballgown |
Figure 2. Schematic diagram of the DNA-Seq pipeline. The figure shows the input, followed by quality control, alignment to the reference genome, variant calling, filtering and annotation.
Figure 3. Schematic diagram of the RNA-Seq pipeline. The figure details the input, followed by quality control, alignment to the reference genome, counting of the mapped reads, normalization, and differential expression analysis, which ultimately generates the TXT/PDF output.
Figure 4. Snakemake workflow management system. Input and output files shown in blue correspond to DNA-Seq analysis and those in green to RNA-Seq analysis. Files common to DNA-Seq and RNA-Seq analysis are shown in red. ‘Rule’ files, which specify the input, the output and the shell/wrapper script, form the basic units of Snakemake. Each rule corresponds to an individual tool. Additional parameters for the tools are given in the ‘config’ file. According to the user's choice of tools, the rules are integrated into the Snakefile and the workflow is executed.
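The assembly step described in the Figure 4 caption, where the rules for the chosen tools are combined into a Snakefile, can be sketched roughly as follows. This is a minimal illustration and not iCOMIC's actual code: the `RULE_FILES` mapping, the `rules/*.smk` file names, the `config.yaml` name and the `build_snakefile` helper are all hypothetical. Only the `configfile:` and `include:` directives are standard Snakemake syntax.

```python
# Hypothetical mapping from tool names (as in Table 1) to rule files.
# The paths are illustrative assumptions, not iCOMIC's real layout.
RULE_FILES = {
    "FastQC": "rules/fastqc.smk",
    "BWA-MEM": "rules/bwa_mem.smk",
    "Bowtie2": "rules/bowtie2.smk",
    "GATK HC": "rules/gatk_hc.smk",
    "FreeBayes": "rules/freebayes.smk",
    "SnpEff": "rules/snpeff.smk",
    "Annovar": "rules/annovar.smk",
}


def build_snakefile(chosen_tools, config_path="config.yaml"):
    """Return Snakefile text: one config line plus one include per chosen tool."""
    unknown = [tool for tool in chosen_tools if tool not in RULE_FILES]
    if unknown:
        raise ValueError(f"No rule file registered for: {', '.join(unknown)}")
    # 'configfile:' points Snakemake at the file holding extra tool parameters.
    lines = [f'configfile: "{config_path}"', ""]
    # 'include:' pulls each selected tool's rule file into the workflow.
    lines += [f'include: "{RULE_FILES[tool]}"' for tool in chosen_tools]
    return "\n".join(lines) + "\n"


snakefile_text = build_snakefile(["FastQC", "BWA-MEM", "GATK HC", "SnpEff"])
print(snakefile_text)
```

A complete workflow would also need a target rule naming the final outputs. The sketch covers only how a selection of tools might map onto included rule files.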
Summary of germline variant benchmarking with NA12878/HG001 dataset
| Pipeline | Variant type | Recall | Precision | F1 score |
|---|---|---|---|---|
| BWA MEM-GATK HC-SnpEff | INDEL | 0.967 | 0.976 | 0.971 |
| | SNP | 0.978 | 0.998 | 0.988 |
| BWA MEM-FreeBayes-SnpEff | INDEL | 0.931 | 0.917 | 0.924 |
| | SNP | 0.979 | 0.997 | 0.988 |
| BWA MEM-GATK HC-Annovar | INDEL | 0.967 | 0.976 | 0.971 |
| | SNP | 0.978 | 0.998 | 0.988 |
| BWA MEM-Bcftools-Annovar | INDEL | 0.741 | 0.838 | 0.789 |
| | SNP | 0.976 | 0.996 | 0.986 |
| Gem3-GATK HC-SnpEff | INDEL | 0.964 | 0.978 | 0.971 |
| | SNP | 0.977 | 0.999 | 0.988 |
| Gem3-FreeBayes-SnpEff | INDEL | 0.934 | 0.92 | 0.927 |
| | SNP | 0.978 | 0.998 | 0.988 |
| BWA MEM-GATK HC-Annovar | INDEL | 0.967 | 0.976 | 0.971 |
| | SNP | 0.978 | 0.998 | 0.988 |
| BWA MEM-FreeBayes-Annovar | INDEL | 0.931 | 0.917 | 0.924 |
| | SNP | 0.979 | 0.997 | 0.988 |
| Gem3-GATK HC-Annovar | INDEL | 0.964 | 0.978 | 0.971 |
| | SNP | 0.977 | 0.999 | 0.988 |
| Gem3-FreeBayes-Annovar | INDEL | 0.934 | 0.92 | 0.927 |
| | SNP | 0.978 | 0.998 | 0.988 |
| BWA MEM-Bcftools-SnpEff | INDEL | 0.741 | 0.838 | 0.789 |
| | SNP | 0.976 | 0.996 | 0.986 |
| Gem3-Bcftools-SnpEff | INDEL | 0.781 | 0.353 | 0.486 |
| | SNP | 0.975 | 0.997 | 0.986 |
| Bowtie2-GATK HC-SnpEff | INDEL | 0.847 | 0.978 | 0.908 |
| | SNP | 0.953 | 0.998 | 0.975 |
| Bowtie2-GATK HC-Annovar | INDEL | 0.847 | 0.978 | 0.908 |
| | SNP | 0.953 | 0.998 | 0.975 |
| Bowtie2-FreeBayes-SnpEff | INDEL | 0.717 | 0.909 | 0.802 |
| | SNP | 0.945 | 0.996 | 0.97 |
| Bowtie2-FreeBayes-Annovar | INDEL | 0.717 | 0.908 | 0.802 |
| | SNP | 0.945 | 0.996 | 0.97 |
| Bowtie2-Bcftools-SnpEff | INDEL | 0.648 | 0.891 | 0.75 |
| | SNP | 0.944 | 0.985 | 0.964 |
| Bowtie2-Bcftools-Annovar | INDEL | 0.648 | 0.891 | 0.75 |
| | SNP | 0.944 | 0.985 | 0.964 |
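In each row of the benchmarking table, the last numeric column is consistent with the F1 score, that is, the harmonic mean of the two preceding columns (precision and recall). The short check below recomputes it for three rows copied from the table. The `f1_score` helper is written for illustration only. Because the formula is symmetric in precision and recall, the check does not depend on the order of those two columns, and a small tolerance allows for the inputs being rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (first metric, second metric, reported F1), copied from the table above.
rows = {
    "BWA MEM-GATK HC-SnpEff / INDEL": (0.967, 0.976, 0.971),
    "BWA MEM-GATK HC-SnpEff / SNP": (0.978, 0.998, 0.988),
    "Gem3-Bcftools-SnpEff / INDEL": (0.781, 0.353, 0.486),
}

for label, (metric_a, metric_b, reported) in rows.items():
    computed = f1_score(metric_a, metric_b)
    # Inputs are rounded to three decimals, so compare with a small tolerance.
    agrees = abs(computed - reported) < 1e-3
    print(f"{label}: computed {computed:.3f}, reported {reported:.3f}, agrees={agrees}")
```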
Figure 5. Fold change correlation between iCOMIC and the reference dataset for the four workflows. The Pearson correlation coefficient was used to compare the fold changes.
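A comparison of this kind can be sketched as a Pearson correlation between per-gene log2 fold changes from a pipeline and from a reference. The code below is a minimal illustration using only the standard library. The `pearson_r` helper is written for this sketch, and the fold-change values are made-up toy numbers, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log2 fold changes for the same five genes (toy values only).
pipeline_lfc = [2.1, -1.3, 0.4, 3.0, -0.8]
reference_lfc = [1.9, -1.0, 0.1, 2.7, -1.2]

r = pearson_r(pipeline_lfc, reference_lfc)
print(f"Pearson r = {r:.2f}")
```

In practice the two lists would first have to be matched gene by gene, so that each position refers to the same gene in both result sets.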
Comparison of iCOMIC with existing tools for genomic data analysis. ‘Yes’ indicates that a feature is present, ‘No’ that it is absent, ‘Partial’ that some aspects of it are present, and ‘Not specified’ that the information is not available. The features compared are: 1) accessibility of the tool through a graphical user interface, 2) commercial availability of the tool, 3) ability of the tool to run on the cloud, 4) automated analysis by the pipeline, 5) ability of users to customize their own pipeline, 6) programming skills required to perform the analysis, 7) availability of a DNA-Seq analysis pipeline, 8) availability of an RNA-Seq analysis pipeline, and 9) compatibility of the tool with cancer data. We also list the platforms supported by each tool, the programming language used to build it and the workflow language used to write its pipelines.
| Features/tools | iCOMIC | snakePipes | Sequanix | Omics Pipe | GenPipes | CANEapp | Armor | Galaxy | VIPER | systemPipeR | CLC genomics workbench | nf-core |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GUI | Yes | No | Yes | No | No | Yes | No | Yes | No | No | Yes | No |
| Open source | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Cloud support | No | Yes | No | Yes | Yes | Yes | No | Yes | No | Yes | No | Yes |
| Automated analysis | Yes | Yes | Yes | Yes | No | Yes | Partial | No | Not specified | Yes | Yes | Yes |
| Custom pipeline | Yes | Yes | Yes | Yes | Yes | Yes | Partial | Yes | No | Yes | No | Yes |
| Programming Skills not necessary | Yes | Partial | Yes | Partial | Partial | Yes | Partial | Yes | Partial | Partial | Yes | Partial |
| DNA-Seq analysis | Yes | Partial | Yes | Yes | Yes | No | No | Yes | No | No | Yes | Yes |
| RNA-Seq analysis | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Compatible for cancer data | Yes | Not specified | No | Yes | No | No | No | Yes | Yes | No | Yes | Yes |
| Platform supported | Linux, macOS (v10.15.5 and above), Windows OS* | Not specified | Linux | Not specified | Unix | Windows and Mac (GUI), Linux (server-side pipeline) | Linux, iOS | Linux, iOS | Unix, iOS | Not specified | Windows, iOS, Linux | Linux, iOS |
| Programming language | Python | Python | Python | Python | Python | Python, Java | R | Web based | Multiple | R | Java | Python |
| Workflow language | Snakemake | Snakemake | Snakemake | Ruffus | Not specified | Not specified | Snakemake | - | Snakemake | SYSargs | Not specified | Nextflow |
*The steps required to successfully install iCOMIC on Windows are discussed in the documentation (Section 2.4).
Summary of germline variant benchmarking with NA12878/HG001 dataset using Galaxy
| Pipeline | Variant type | Recall | Precision | F1 score |
|---|---|---|---|---|
| BWA MEM-FreeBayes-SnpEff | INDEL | 0.887 | 0.948 | 0.917 |
| | SNP | 0.976 | 0.984 | 0.980 |
Figure 6. Fold change correlation between Galaxy and the reference dataset for the STAR-HTSeq-DESeq2 workflow. The Pearson correlation coefficient was used to compare the fold changes.