Ram Vinay Pandey1,2, Stephan Pabinger1, Albert Kriegner1, Andreas Weinhäusel1.
Abstract
Traditional Sanger sequencing as well as Next-Generation Sequencing (NGS) have been used for the identification of disease-causing mutations in human molecular research. The majority of currently available tools are developed for research and explorative purposes and often do not provide a complete, efficient, one-stop solution. As the focus of currently developed tools is mainly on NGS data analysis, no integrative solution for the analysis of Sanger data is provided, and consequently a one-stop solution to analyze reads from both sequencing platforms is not available. We have therefore developed a new pipeline called MutAid to analyze and interpret raw sequencing data produced by Sanger or several NGS sequencing platforms. It performs format conversion, base calling, quality trimming, filtering, read mapping, variant calling, variant annotation and analysis of Sanger and NGS data under a single platform. It is capable of analyzing reads from multiple patients in a single run to create a list of potential disease-causing base substitutions as well as insertions and deletions. MutAid has been developed for expert and non-expert users and supports four sequencing platforms: Sanger, Illumina, 454 and Ion Torrent. Furthermore, for NGS data analysis, five read mappers (BWA, TMAP, Bowtie, Bowtie2 and GSNAP) and four variant callers (GATK-HaplotypeCaller, SAMTOOLS, Freebayes and VarScan2) are supported. MutAid is freely available at https://sourceforge.net/projects/mutaid.
Year: 2016 PMID: 26840129 PMCID: PMC4739551 DOI: 10.1371/journal.pone.0147697
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The workflow of MutAid.
The MutAid pipeline can be run with a single command. Sanger sequencing data analysis has one starting point, and the analysis flows from top to bottom, as illustrated by a black arrow. NGS analysis has three starting points: 1) raw reads (red); 2) a high-quality FASTQ file (blue), in which case the first step is skipped; and 3) mapped reads in BAM or SAM format (green), in which case steps 1 and 2 are skipped. * Step is optional.
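The three NGS starting points described in this caption can be sketched as a choice of how many leading steps to skip. This is a minimal illustration only: the step names and the `Entry` and `steps_to_run` identifiers are hypothetical and are not taken from MutAid itself.

```python
from enum import Enum
from typing import List


class Entry(Enum):
    RAW_READS = 0        # raw reads: nothing is skipped
    CLEAN_FASTQ = 1      # high-quality FASTQ: step 1 is skipped
    MAPPED_READS = 2     # BAM or SAM: steps 1 and 2 are skipped


# Hypothetical step names, chosen for illustration only.
NGS_STEPS = ["quality_control", "read_mapping", "variant_calling", "variant_annotation"]


def steps_to_run(entry: Entry) -> List[str]:
    """Return the steps left after skipping entry.value leading steps."""
    return NGS_STEPS[entry.value:]


for entry in Entry:
    print(entry.name, "->", steps_to_run(entry))
```

Encoding the starting point as the number of skipped steps keeps the three cases in one code path, which matches the caption's description of a single top-to-bottom flow.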
MutAid variant cross-referencing.
MutAid constructs direct links to more than 30 publicly available databases for each variant in the output summary table. These links are created based on the genomic coordinates and the Entrez gene ID. The table lists which links are created for known variants, novel variants and variants in intergenic regions.
| Database | Known variants | Novel variants | Intergenic region variants |
|---|---|---|---|
| UCSC genome browser | X | X | X |
| Ensembl genome browser | X | X | X |
| Decipher | X | X | X |
| GWAS Central | X | X | X |
| PolyPhen_2 | X | | |
| NCBI dbSNP | X | X | |
| NCBI variation viewer | X | X | |
| Entrez Gene | X | X | |
| ClinVar | X | | |
| dbVar | X | | |
| Genetic Testing Registry (GTR) | X | | |
| WikiGenes | X | X | |
| BioGPS | X | X | |
| Cosmic database | X | | |
| GeneTests | X | | |
| GENATLAS | X | | |
| GeneCards | X | | |
| GOPubmed | X | | |
| H_InvDB | X | | |
| RefSeq mRNA database | X | | |
| HomoloGene | X | | |
| GEO Profiles | X | | |
| UniGene | X | | |
| Pubmed | X | | |
| RefSeq Protein database | X | | |
| UniProt | X | | |
| QuickGO | X | | |
| Reactome pathway database | X | | |
| Kegg pathway database | X | | |
| OMIM database | X | | |
| HGNC database | X | | |
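As a rough sketch of how such cross-reference links could be assembled, the function below builds a coordinate-based link for every variant and adds identifier-based links only when an Entrez gene ID or dbSNP identifier is available. The function name `build_links` and the URL templates are assumptions based on the public UCSC and NCBI address schemes; the templates MutAid actually uses are not given in the source.

```python
from typing import Dict, Optional


def build_links(chrom: str, start: int, end: int,
                gene_id: Optional[int] = None,
                rs_id: Optional[str] = None,
                assembly: str = "hg19") -> Dict[str, str]:
    """Build database links for one variant (URL templates are assumptions)."""
    # Coordinate-based link: available for every variant, including intergenic ones.
    links = {
        "UCSC": ("https://genome.ucsc.edu/cgi-bin/hgTracks"
                 f"?db={assembly}&position={chrom}:{start}-{end}"),
    }
    # Gene-based link: only possible when the variant maps to a gene.
    if gene_id is not None:
        links["Entrez_Gene"] = f"https://www.ncbi.nlm.nih.gov/gene/{gene_id}"
    # dbSNP link: only possible when an rs identifier is known.
    if rs_id is not None:
        links["dbSNP"] = f"https://www.ncbi.nlm.nih.gov/snp/{rs_id}"
    return links


for name, url in build_links("chr8", 18258370, 18258370,
                             gene_id=10, rs_id="rs1799931").items():
    print(name, url)
```

Calling `build_links("chr8", 18258370, 18258370)` without identifiers yields only the genome browser link, mirroring the table's last column, where intergenic variants receive coordinate-based links alone.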
MutAid variant summary output table description.
MutAid produces a final variant summary with one line per variant including experimental information, patient information, variant information, variant effects and database cross-references.
| Output Info | Output Info Example |
|---|---|
| Patient_Id | P000002 |
| Family_Id | F01 |
| Lab_Analysis_Date | 2013-10-30_18-42-59 |
| Seq_Platform | Illumina |
| Seq_System | HiSeq2000 |
| Assay_Id | BRCA_Panel1 |
| Var_Id | chr13.GRCh37:g.18258370G>A |
| Var_Type | SNV |
| Var_Cov | 229 |
| Total_Cov | 426 |
| A | 197 |
| C | 0 |
| G | 229 |
| T | 0 |
| Var_Chr | chr13 |
| Var_Start | 18258369 |
| Var_End | 18258370 |
| Var_Strand | + |
| Var_Gene | NAT2 |
| Var_RefGene | NM_000015 |
| Var_Feature | exon_2;CDS_2 |
| Var_DNA | A>G |
| Var_Codon | AAA>AAG |
| Var_AA | Lys>Lys |
| Frameshift | |
| FASTQC_Report | p000002_1.fq_fastqc_qc_report.html;p000002_2.fq_fastqc_qc_report.html |
| Patient_Fastq | p000002_1.fq;p000002_2.fq |
| Patient_Bam | p000002.bam |
| Patient_Vcf | p000002.vcf |
| Entrez_Gene | URL to Entrez Gene database by Entrez gene ID (10) |
| RefSeq_mRNA | URL to RefSeq mRNA nucleotide database by RefSeq Transcript ID (NM_000015) |
| RefSeq_Protein | URL to RefSeq Protein database by RefSeq Protein ID (NP_000006) |
| HomoloGene | URL to HomoloGene database by RefSeq Transcript ID (NM_000015) |
| GEO_Profiles | URL to GEO_Profiles database by RefSeq Transcript ID (NM_000015) |
| UniGene | URL to UniGene database by RefSeq Transcript ID (NM_000015) |
| Pubmed | URL to PubMed database by RefSeq Transcript ID (NM_000015) |
| dbSNP | URL to dbSNP database by SNP identifier (rs1799931) |
| ClinVar | URL to ClinVar by Entrez gene ID (10) |
| dbVar | URL to dbVar by Entrez gene ID (10) |
| NCBI variation viewer | URL to dbSNP database by SNP identifier (rs1799931) |
| Cosmic | URL to Cosmic database by Entrez gene symbol (NAT2) |
| Gen_Test_Reg | URL to Genetic Testing Registry by Entrez gene ID (10) |
| Omim | URL to OMIM by OMIM ID (612182) |
| Hgnc | URL to HGNC by HGNC ID (7646) |
| PolyPhen_2 | URL to PolyPhen2 database by SNP identifier (rs1799931) |
| Decipher | URL to Decipher genome browser by genomic coordinate (chr8:18258370..18258370) |
| Kegg | URL to KEGG pathway by KEGG Pathway ID (hsa03440) |
| Kegg_Locus | URL to KEGG pathway Locus by KEGG Pathway ID and Entrez Gene ID (hsa03440 and 675) |
| Reactome | URL to Reactome database by UniProt Protein identifier (P51587) |
| WikiGenes | URL to WikiGenes by Entrez gene ID (10) |
| GeneTes | URL to GeneTests by Entrez gene symbol (NAT2) |
| BioGPS | URL to BioGPS by Entrez gene ID (10) |
| GENATLAS | URL to GENATLAS by Entrez gene symbol (NAT2) |
| GeneCards | URL to GeneCards by Entrez gene symbol (NAT2) |
| GOPubmed | URL to GOPubmed by Entrez gene symbol (NAT2) |
| H_InvDB | URL to H_InvDB by Entrez gene symbol (NAT2) |
| UniProt | URL to UniProt database by UniProt Protein identifier (P51587) |
| QuickGO | URL to QuickGO database by UniProt Protein identifier (P51587) |
| UCSC | URL to UCSC genome browser by genomic coordinate (chr8:18258370-18258370) |
| Ensembl | URL to Ensembl genome browser by genomic coordinate (chr8:18258370-18258370) |
| GWAS_Central | URL to GWAS Central genome browser by genomic coordinate (chr8:18258370..18258370) |
| Fishers_Exact_Test_pvalue | 2.2e-16 |
| Fishers_Exact_Test_pvalue_FDR_corrected | 0.076420371 |
| dbSnp_Id | rs1801406 |
| dbSnp_Common | X |
| dbSnp_Coding | X |
| Gwas_Catalogue | X |
| dbSnp_Flagged | X |
| dbSnp_Mult | X |
| dbSnp_HapMap | X |
| dbSnp_Cpg_Island | X |
| Mapper_Name | Bwa;Bowtie2;GSNAP |
| Variant_Caller | gatk;freebayes;samtools;varscan;hotspot |
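The coverage fields in the example row can be combined into a variant allele fraction. The sketch below assumes that `Var_Cov` counts the reads supporting the variant allele and `Total_Cov` counts all reads at the position, which is an interpretation of the field names rather than something the table states. The helper name `allele_fraction` is hypothetical.

```python
from typing import Dict


def allele_fraction(row: Dict[str, str]) -> float:
    """Fraction of reads supporting the variant (assumes Var_Cov / Total_Cov)."""
    var_cov = int(row["Var_Cov"])
    total_cov = int(row["Total_Cov"])
    if total_cov == 0:
        raise ValueError("Total_Cov is zero; allele fraction is undefined")
    return var_cov / total_cov


# Values copied from the example column of the table above.
example = {"Var_Cov": "229", "Total_Cov": "426",
           "A": "197", "C": "0", "G": "229", "T": "0"}

# Sanity check: the per-base counts add up to the total coverage.
base_total = sum(int(example[base]) for base in "ACGT")
print(base_total == int(example["Total_Cov"]))
print(round(allele_fraction(example), 3))
```

For the example row the per-base counts sum to 426, and the fraction 229/426 is roughly 0.54, which is in the range expected for a heterozygous call.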
Fig 2. Venn diagrams of SNVs called in MutAid by four variant callers with BWA, Bowtie2 and GSNAP mapping. (A) Freebayes. (B) GATK-HaplotypeCaller. (C) SAMTOOLS. (D) VarScan2.
GATK shows 93.29% overlap between at least two mappers, whereas VarScan2 shows the least overlap of the four variant callers at 78%. SAMTOOLS and Freebayes show 92.23% and 88% agreement, respectively, between at least two mappers.
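The "overlap between at least two mappers" statistic can be illustrated with plain set arithmetic. The sketch below assumes the denominator is the union of all calls made by one variant caller across the mappers; the source does not state the exact definition. The variant identifiers are toy data and `fraction_in_at_least` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Set


def fraction_in_at_least(call_sets: Dict[str, Set[str]], k: int = 2) -> float:
    """Fraction of all distinct variants that appear in at least k call sets."""
    # Count in how many call sets each distinct variant occurs.
    counts = Counter(v for calls in call_sets.values() for v in calls)
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= k)
    return shared / len(counts)


# Toy call sets for one variant caller run on three mappings (not real data).
calls = {
    "BWA": {"var_a", "var_b", "var_c"},
    "Bowtie2": {"var_a", "var_b"},
    "GSNAP": {"var_a", "var_d"},
}
print(fraction_in_at_least(calls, k=2))
```

In the toy data two of the four distinct variants are found with at least two mappers, so the function returns 0.5. The same helper applies unchanged when the sets are keyed by variant caller instead of mapper.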
Fig 3. Venn diagrams of INDELs called in MutAid by four variant callers using BWA, Bowtie2 and GSNAP mapping results. (A) Freebayes. (B) GATK-HaplotypeCaller. (C) SAMTOOLS. (D) VarScan2.
GATK shows 90.78% overlap between at least two mappers, and SAMTOOLS shows the least overlap of the four variant callers at 74.34%. VarScan2 and Freebayes show 76.56% and 83.70% agreement, respectively, between at least two mappers.
Fig 4. Venn diagrams of SNVs called by four variant callers on the same mapping result. (A) BWA. (B) Bowtie2. (C) GSNAP.
The results show that 75% to 84% of SNVs are shared by at least two of the four variant callers. With all three mappers, VarScan2 identified novel SNVs at a rate of 16% to 24%.
Fig 5. Venn diagrams of INDELs called by four variant callers on the same mapping result. (A) BWA. (B) Bowtie2. (C) GSNAP.
Consistent with the SNV results, more than 78% of INDELs are identified by at least two variant callers.
Fig 6. Visualization in IGV of SNVs called by the MutAid pipeline from Illumina and Sanger sequencing data.
MutAid produces BAM files for both NGS and Sanger data, which can be loaded into IGV to view and confirm the identified variants. The SNV (T>C), shown in blue, was identified by NGS (top panel) and confirmed by Sanger sequencing (middle panel).
Fig 7. Visualization of the conservation track in the UCSC genome browser for novel variants.
MutAid constructs a direct link to the UCSC genome browser for all variants, including novel variants. The reference nucleotides are displayed at the top, and the conservation track across several species is displayed in the bottom panel (highlighted in green). To confirm novel variants, conservation analysis can be performed for each mutation position. A novel mutation might be ignored if its position is poorly conserved among the species (indicated by the red arrow). A novel mutation may be analyzed further if its position is highly conserved (indicated by the blue arrow).
SNVs identified by the MutAid pipeline.
SNVs were called using four variant callers (for each mapping result) with a minimum read coverage of 20, minimum variant allele coverage of 4 and a base quality of at least 20. The percentage given in brackets is the fraction of SNVs having an entry in the Single Nucleotide Polymorphism Database (dbSNP) version 137.
| Mapper | Freebayes (%) | GATK-HaplotypeCaller (%) | SAMTOOLS (%) | Varscan2 (%) |
|---|---|---|---|---|
| Bowtie2 | 4654 (90.09) | 4452 (94.77) | 4149 (97.78) | 10731 (55.65) |
| BWA | 5079 (89.33) | 4802 (93.77) | 4447 (97.55) | 11920 (55.99) |
| GSNAP | 4606 (94.03) | 4402 (96.93) | 4368 (98.10) | 8513 (68.80) |
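The calling thresholds stated for this table (read coverage of at least 20, variant allele coverage of at least 4, base quality of at least 20) can be written as a simple filter. This is a minimal sketch under the assumption that each candidate carries these three values; the `Candidate` class and `passes_filters` function are hypothetical and do not reflect how the individual variant callers apply their own settings.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    total_cov: int    # reads covering the position
    var_cov: int      # reads supporting the variant allele
    base_qual: float  # base quality at the variant position


def passes_filters(c: Candidate, min_cov: int = 20,
                   min_var_cov: int = 4, min_base_qual: float = 20) -> bool:
    """Apply the three minimum thresholds quoted in the table description."""
    return (c.total_cov >= min_cov
            and c.var_cov >= min_var_cov
            and c.base_qual >= min_base_qual)


candidates = [
    Candidate(total_cov=426, var_cov=229, base_qual=30),  # passes all three
    Candidate(total_cov=19, var_cov=10, base_qual=30),    # coverage too low
    Candidate(total_cov=50, var_cov=3, base_qual=30),     # too few variant reads
]
print([passes_filters(c) for c in candidates])
```

Keeping the thresholds as keyword arguments makes it easy to reuse the same function with different cut-offs.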
INDELs identified by the MutAid pipeline.
INDELs were called using four variant callers (for each mapping result) using the same settings as for SNV calling. The percentage given in brackets is the fraction of INDELs having an entry in dbSNP version 137.
| Mapper | Freebayes (%) | GATK-HaplotypeCaller (%) | SAMTOOLS (%) | Varscan2 (%) |
|---|---|---|---|---|
| Bowtie2 | 853 (39.51) | 823 (43.74) | 464 (45.91) | 1953 (31.29) |
| BWA | 845 (38.46) | 819 (41.64) | 406 (45.07) | 1835 (30.19) |
| GSNAP | 849 (40.40) | 847 (42.27) | 466 (45.92) | 1900 (30.53) |
Comparison of various features of MutAid and other tools for NGS and Sanger data analysis.
| Features | MutAid v1.0 | ngs_backbone v1.4 | bcbio-nextgen 0.9.0 | SIMPLEX v2.0 |
|---|---|---|---|---|
| Variant annotation | yes | no | no | no |
| Co-analysis of hotspot mutations | yes | no | no | no |
| Sanger data analysis | yes | yes | no | no |
| Short read mappers | BWA, Bowtie, Bowtie2, TMAP, GSNAP | BWA | BWA, Bowtie2 | BWA |
| Variant callers | GATK-HaplotypeCaller, SAMTOOLS, Freebayes, Varscan2 | GATK | GATK, muTect, Freebayes | GATK |
| Multiple variant callers in one run | yes | no | yes | no |
| Quality control | yes | yes | no | yes |
| Sequencing data supported | targeted sequencing, exome sequencing and whole genome sequencing | transcriptome sequencing | exome sequencing, genome sequencing and transcriptome sequencing | exome sequencing |
| Several data analysis in single run | yes | no | yes | no |
| Virtual Machine | yes | no | yes | yes |
| Installation required | no | yes | no | no |
| Supported sequencing platforms | Sanger, Illumina, 454, Ion Torrent | 454, Illumina, ABI SOLiD | Illumina, 454 | Illumina, ABI SOLiD |
| Parallel processing | yes | no | yes | yes |
| Multiple dataset parallel analysis | yes | no | no | yes |
| Dependency for installation | no | yes | no | yes |
| Graphical QC report | yes | no | no | yes |