Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, Zlatko Trajanoski.
Abstract
Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and has led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provide a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.
Keywords: Mendelian disorders; bioinformatics tools; cancer; next-generation sequencing; variants
Year: 2013 PMID: 23341494 PMCID: PMC3956068 DOI: 10.1093/bib/bbs086
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Basic workflow for whole-exome and whole-genome sequencing projects. After library preparation, samples are sequenced on a certain platform. The next steps are quality assessment and read alignment against a reference genome, followed by variant identification. Detected mutations are then annotated to infer the biological relevance and results can be displayed using dedicated tools. The found mutations can further be prioritized and filtered, followed by validation of the generated results in the lab.
Variant identification
| Name | OS | BAM/SAM input | Other inputs | Output | Identifies | Data set | Result |
|---|---|---|---|---|---|---|---|
| Germline callers | |||||||
| CRISP | Lin | Yes | – | VCF | SNP, INDEL | KTS | 24 034 SNPs, 259 INDELs |
| GATK (UnifiedGenotyper) | Lin | Yes | – | VCF | SNP, INDEL | KTS | 49 476 SNPs, 1959 INDELs |
| SAMtools | Lin | Yes | FASTA | VCF | SNP, INDEL | KTS | 21 852 SNPs, 332 INDELs |
| SNVer | Lin, Mac, Win | Yes | – | VCF | SNP, INDEL | KTS | 22 105 SNPs, 234 INDELs |
| VarScan 2 | Lin, Mac, Win | No | pileup/mpileup | VCF, VarScan CSV | SNP, INDEL | KTS | 34 984 SNPs, 1896 INDELs |
| Somatic callers | |||||||
| GATK (SomaticIndelDetector) | Lin | Yes | – | VCF | INDEL | WES | 151 INDELs |
| SAMtools | Lin | Yes | FASTA | BCF | SNP, INDEL | WES | Canceled (b) |
| SomaticSniper | Lin | Yes | – | VCF, somatic sniper output | SNP, INDEL | WES | 6926 SNPs |
| VarScan 2 | Lin, Mac, Win | No | pileup/mpileup | VCF, VarScan CSV | SNP, INDEL, CNV | WES | 1685 SNPs, 324 INDELs |
| CNV identification tools | |||||||
| CNVnator | Lin | Yes | FASTA | CSV | CNV | cnv_sim | 39 CNVs |
| RDXplorer | Lin, Mac | Yes | FASTA | CSV | CNV | cnv_sim | 4 CNVs (c) |
| CONTRA | Lin, Mac | Yes | FASTA | VCF, CSV | CNV | WES | 3 CNVs |
| ExomeCNV | Lin, Mac, Win | Yes | pileup + BED + FASTA | CSV | CNV, LOH | WES | 137 CNVs |
| SV identification tools | |||||||
| BreakDancer | Lin, Mac | Yes | config file | CSV, BED | INDEL, INV, TRANS, CNV | WGS (tumor + normal) | 6219 DELs, 0 INSs, 7 INVs, 17 303 ITX, 5037 CTX |
| Breakpointer | Lin | Yes | – | GFF | INDEL | WGS (tumor) | (d) |
| CLEVER | Lin | Yes | FASTA | CLEVER format | INDEL | WGS (tumor) | (d) |
| GASVPro (GASVPro-HQ) | Lin, Mac | Yes | – | clusters file | INDEL, INV, TRANS | WGS (tumor) | 2529 DELs, 207 INVs |
| SVMerge | Lin | Yes | FASTA | BED | INDEL, INV, CNV | – | Aborted (e) |
Four different types of tools for variant identification can be distinguished: germline callers, somatic callers, CNV identification and SV identification tools. Listed are the results of the tested applications (4, 2, 3 and 5, respectively). All surveyed applications are listed in Supplementary Tables S3–S6.

(a) SNVs are counted based on their position but in a sequence-independent manner.

(b) Somatic mutation calling with SAMtools was canceled due to unclear definition of tumor and normal files. Furthermore, we were not able to find the CLR field in the resulting VCF file, which should hold the Phred-log ratio between the likelihood by treating the two samples independently, and the likelihood by requiring the genotype to be identical.

(c) For RDXplorer the filtered result data set was used.

(d) CLEVER and Breakpointer created result files with >2.6 million lines, which need to be further processed.

(e) Installation was aborted due to unreasonable dependencies.

OS, operating system; Lin, Linux; Mac, Mac OS X; Win, Windows; BAM, Binary SAM; BED, Browser Extensible Data, a text-based file format; CSV, comma separated values; FASTA, text-based format for representing nucleotide sequences; GFF, general feature format; mpileup, multisample pileup; pileup, text-based format representing base-pair information at each chromosomal position; SAM, Sequence Alignment/Map; VCF, Variant Call Format; CNV, copy number variation; CTX, inter-chromosomal translocation; DEL, deletion; INDEL, insertion/deletion; INS, insertion; INV, inversions; ITX, intra-chromosomal translocation; LOH, loss of heterozygosity; SNP, single-nucleotide polymorphism; SNV, simple nucleotide variant; SV, structural variant; TRANS, translocations.
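The SNP and INDEL totals reported in the table are site counts derived from each caller's VCF output, and the table footnote notes that SNVs are counted by position regardless of the alternative allele. As a rough illustration of that kind of counting, the following Python sketch tallies distinct sites from VCF text. The function names `classify_allele` and `count_variant_sites` are hypothetical, and the classification rule (single-base substitution versus length change) is a simplifying assumption, not the procedure used in the evaluation.

```python
def classify_allele(ref, alt):
    """Return 'SNP', 'INDEL', or None for one REF/ALT pair (simplified rule)."""
    if alt in (".", "*") or alt.startswith("<"):
        return None  # missing, spanning-deletion, or symbolic allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return None  # same-length multi-base substitutions are ignored here


def count_variant_sites(vcf_lines):
    """Count distinct (chrom, pos) sites per variant type in VCF text lines."""
    sites = {"SNP": set(), "INDEL": set()}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            kind = classify_allele(ref, alt)
            if kind is not None:
                # Keying on position only makes the count sequence independent.
                sites[kind].add((chrom, pos))
    return {kind: len(found) for kind, found in sites.items()}


# Toy records, invented for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t100\t.\tA\tT\t50\tPASS\t.",    # same site, different allele
    "1\t200\t.\tAT\tA\t50\tPASS\t.",   # deletion
    "2\t300\t.\tC\tG,T\t50\tPASS\t.",  # multi-allelic SNP site
]
print(count_variant_sites(example))  # {'SNP': 2, 'INDEL': 1}
```

Because sites are stored in sets keyed by chromosome and position, the two records at position 100 contribute a single SNP to the total.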
Variant annotation
| Name | OS | Input | Output | SNP | INDEL | CNV | GUI | CLI | Web | Number of function/location parameters | DB IDs | Scores |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ANNOVAR | Lin, Mac, Win, web interface | VCF, pileup, CompleteGenomics, GFF3-SOLiD, SOAPsnp, MAQ, CASAVA | TXT | Yes | Yes | Yes | No | Yes | No | 9 (func) + 11(exonic-func) | Yes | GERP++ conservation, LRT, MutationTaster, PhyloP conservation, PolyPhen, SIFT |
| AnnTools | Lin, Mac | VCF, pileup, TXT | VCF | Yes | Yes | Yes | No | Yes | No | 5 (position) + 4 (functional class) | Yes | – |
| NGS-SNP | Lin, Mac | VCF, pileup, MAQ, diBayes, TXT | TXT | Yes | No | No | No | Yes | No | 17 | Yes | Condel, PolyPhen, SIFT |
| SeattleSeq | web interface | VCF, MAQ, CASAVA, GATK BED, custom | VCF, SeattleSeq | Yes | Yes | No | No | No | Yes | 11(dbSNP) + 5 (GVS) | Yes | GERP, Grantham, phastCons, PolyPhen |
| snpEff | Lin, Mac, Win | VCF, pileup/TXT (deprecated) | VCF, TXT, HTML overview | Yes | Yes | No | No | Yes | No | 34 | Yes | – |
| SVA | Lin | VCF, SV.events file, BCO | CSV | Yes | Yes | Yes | Yes | Yes | No | 17 (SNP), 17 (INDEL), 10 (CNV) | Yes | – |
| VARIANT | web interface | VCF, GFF2, BED | web report, TXT | Yes | Yes | No | No | Yes | Yes | 26 | Yes | – |
| VEP | Lin, web interface | VCF, pileup, HGVS, TXT, variant identifiers | TXT | Yes | Yes | No | No | Yes | Limited | 28 | Yes | Condel, PolyPhen, SIFT |
Tools for annotation of different variants are displayed. Some of the listed applications are available via web, whereas others have to be installed locally and can be accessed via a command-line interface. These reviewed applications calculate different scores and use public databases for annotation. Each mutation will end up with several annotations from each single tool. Annotation tools that were not tested are listed in Supplementary Table S7.

OS, operating system; Lin, Linux; Mac, Mac OS X; Win, Windows; CLI, command line interface; CNV, copy number variation; GUI, graphical user interface; INDEL, insertion/deletion; SNP, single-nucleotide polymorphism; ASM.tsv, Complete Genomics’ text-based genotyping-calling format; BCO, binary format coverage and quality score file including: consensus quality, SNP quality, RMS mapping quality, read depth; BED, Browser Extensible Data, a text-based file format; CASAVA, genotype-calling output format of Illumina’s CASAVA (Consensus Assessment of Sequence and Variation) software; CSV, Comma Separated Values; GFF2, Generic Feature Format version 2; GFF3, Generic Feature Format version 3; HGVS, nomenclature for the description of sequence variants by HGVS (Human Genome Variation Society); MAQ, genotype-calling output format of Maq (Mapping and Assembly with Qualities); variant identifiers, e.g. dbSNP rsIDs or any synonym for a variant present in the Ensembl Variation database; VCF, variant call format; pileup, text-based format representing base-pair information at each chromosomal position; SOAPsnp, genotype-calling format from the SOAPsnp component of the Short Oligonucleotide Analysis Package (SOAP); SV.events file, ERDS (Estimation by Read Depth with SNVs) output, each row in the .events file corresponds to a CNV.
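Because every tool reports its own set of annotations, a single variant accumulates entries from several sources. A minimal Python sketch of collecting them per variant follows. The record layout (tool name, variant key, annotation string) and the example values are hypothetical and do not correspond to any tool's actual output format.

```python
from collections import defaultdict


def merge_annotations(records):
    """Group annotation strings by variant, then by the tool that produced them."""
    merged = defaultdict(lambda: defaultdict(list))
    for tool, variant, annotation in records:
        merged[variant][tool].append(annotation)
    # Convert nested defaultdicts to plain dicts for predictable downstream use.
    return {variant: dict(by_tool) for variant, by_tool in merged.items()}


# Hypothetical records: (tool name, (chrom, pos, ref, alt), annotation).
records = [
    ("ANNOVAR", ("1", 100, "A", "G"), "exonic"),
    ("snpEff", ("1", 100, "A", "G"), "missense"),
    ("VEP", ("1", 100, "A", "G"), "missense"),
    ("VEP", ("2", 300, "C", "T"), "intronic"),
]
merged = merge_annotations(records)
for variant, by_tool in merged.items():
    print(variant, by_tool)
```

Keying on the full (chromosome, position, reference, alternative) tuple keeps annotations for different alleles at the same site apart, which matters when tools disagree on the predicted effect.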
Figure 2: Venn diagrams showing the number of identified variants for tested germline (A), somatic (B), CNV (C) and exome CNV (D) tools. The depicted numbers in (A) and (B) report identified SNPs and INDELs. Venn diagram (C) shows the overlap between known (cnv_sim) and predicted CNVs. Figure (D) illustrates the overlap between CONTRA and ExomeCNV. The intersection numbers were adjusted to reflect that 10 CNVs detected by CONTRA are located within 3 CNVs reported by ExomeCNV.
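The overlaps in panels (A) and (B) amount to set intersections of call sets, while panel (D) depends on whether CNVs from one tool lie within CNVs from another. A minimal Python sketch of both operations follows. The function names, the use of (chromosome, position) keys, and the strict containment rule are assumptions for illustration, not the method used to produce the figure.

```python
def venn_counts(calls_a, calls_b):
    """Return the three region sizes of a two-set Venn diagram."""
    a, b = set(calls_a), set(calls_b)
    return {"only_a": len(a - b), "shared": len(a & b), "only_b": len(b - a)}


def contained_cnvs(small, large):
    """Map each interval in `large` to the intervals in `small` fully inside it.

    Intervals are (chrom, start, end) tuples with inclusive coordinates.
    """
    nested = {}
    for chrom, start, end in large:
        inside = [
            (c, s, e) for (c, s, e) in small
            if c == chrom and start <= s and e <= end
        ]
        if inside:
            nested[(chrom, start, end)] = inside
    return nested


# Toy call sets, invented for illustration only.
caller_a = {("1", 100), ("1", 200), ("2", 300)}
caller_b = {("1", 200), ("2", 300), ("3", 400)}
print(venn_counts(caller_a, caller_b))

fine = [("1", 1000, 1500), ("1", 1600, 1900), ("2", 500, 900)]
broad = [("1", 900, 2000), ("2", 600, 800)]
print(contained_cnvs(fine, broad))
```

In the toy CNV data, two fine-grained calls on chromosome 1 fall inside one broad call, so a naive one-to-one intersection count would need the same kind of adjustment the caption describes.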
Visualization
| Name | OS | BAM/SAM | VCF | Other formats | Annotation |
|---|---|---|---|---|---|
| Web-based genome browsers | |||||
| Ensembl Genome Browser | web interface | Yes | Yes | BED, bedGraph, GFF, GTF, PSL, WIG, BAM, bigWig | Yes |
| UCSC Genome Browser | web interface | Yes | Yes | BED, bigBed, bedGraph, GFF, GTF, WIG, bigWig, MAF, SNP, PSL | Yes |
| VEGA Genome Browser | web interface | Yes | Yes | BED, bedGraph, bigBed, bigWig, GBrowse, GFF, GTF, PSL, WIG | Yes |
| Stand-alone genome browsers | |||||
| Artemis | Lin, Mac, Win | Yes | Yes | BCF, FASTA | Yes |
| Integrative Genomics Viewer (IGV) | Lin, Mac, Win | Yes | Yes | SNP, GFF, BED, IGV, TAB, WIG, (>30 formats) | Yes |
| Savant | Lin, Mac, Win | Yes | Yes | FASTA, BED, GFF, WIG, TAB | Yes |
| CNV and SV visualization | |||||
| Circos | Lin, Mac, Win, web interface | No | No | GFF, CSV | Yes |
This table holds genome browsers as well as tools producing Circos plots, whereby genome browsers were split into web-based applications, accessible using a web browser, and stand-alone tools with a graphical user interface. All genome browsers use tracks to display different features like reference genome, annotations or experimental data. Further visualization tools can be found in Supplementary Tables S8 and S9.

OS, operating system; Lin, Linux; Mac, Mac OS X; Win, Windows; BAM, Binary SAM; BCF, Binary VCF; BED, Browser Extensible Data, a text-based file format; bedGraph, file format allowing the display of continuous-valued data in track format; bigBed, compressed, binary-indexed BED file; bigWig, compressed, binary-indexed WIG file; FASTA, text-based format for representing nucleotide sequences; GBrowse, GBrowse proprietary format; GFF, General Feature Format; GTF, Gene Transfer Format; IGV, Integrative Genomics Viewer format; MAF, Multiple Alignment Format; PSL, pattern space layout; SAM, Sequence Alignment/Map; SNP, Personal Genome SNP format; TAB, tab-delimited file; VCF, Variant Call Format; WIG, Wiggle Track Format.
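Most of the genome browsers listed in the table accept BED files, so one way to display CNV calls as a track is to convert them to BED. A minimal Python sketch follows. The function name `cnvs_to_bed`, the input tuple layout, and the assumption that input coordinates are 1-based and inclusive are illustrative choices. BED itself uses 0-based, half-open coordinates, hence the shift of the start position.

```python
def cnvs_to_bed(cnvs, track_name="cnv_calls"):
    """Render (chrom, start, end, label) CNV calls as BED text with a track line.

    Input coordinates are assumed to be 1-based and inclusive.
    """
    lines = [f'track name="{track_name}" description="CNV calls"']
    # Sorting is lexicographic on chromosome name, then numeric on start.
    for chrom, start, end, label in sorted(cnvs):
        # BED start is 0-based; BED end is exclusive, so it equals the 1-based end.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Toy CNV calls, invented for illustration only.
calls = [("chr2", 501, 900, "loss"), ("chr1", 1001, 2000, "gain")]
bed_text = cnvs_to_bed(calls)
print(bed_text, end="")
```

The resulting text can be saved to a file and loaded as a custom track. Whether a given browser honors the `track` header line should be checked against that browser's documentation.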