Maria Fischer, Rene Snajder, Stephan Pabinger, Andreas Dander, Anna Schossig, Johannes Zschocke, Zlatko Trajanoski, Gernot Stocker.
Abstract
In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome.
Year: 2012 PMID: 22870267 PMCID: PMC3411592 DOI: 10.1371/journal.pone.0041948
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic overview.
The SIMPLEX analysis pipeline contains five major steps (blue boxes), which are further divided into several components. Mandatory components are depicted in black, optional ones in gray. The first step of the pipeline calculates quality statistics on raw and processed reads, and applies filters and trimmers to the sequenced reads (quality report). Afterwards, the pipeline aligns the processed reads to a reference genome (sequence alignment), computes alignment statistics and applies region filtering (alignment statistics), and detects variants, resulting in a list of potential disease driver candidates (variant detection). Output files can be visualized using standard genome viewers. At the end, the pipeline automatically annotates variants, generates a detailed summary report, and combines calculated results including key figures in a structured way (annotation & summary).
Mandatory pipeline parameters.
| Parameter | Name | Description |
| -c | command | which pipeline should be run ( |
| -od | output directory | directory where result files are stored |
| -genP | genome prefix | prefix of the reference genome (e.g. hg18, hg19) |
| -I | input files | list of files containing raw sequence reads (see note) |
| -sfeb | SAM exome option | BED file defining the exon regions |
| -dsP | dip splitter option | percentage threshold used to distinguish homozygous from heterozygous DIPs |
| -k | cluster profile | configuration file to access the cluster service |
Listed are all parameters that need to be specified when starting the pipeline.
Note: if paired-end (PE) data is given, the file names need to end with _R1 or _R2.
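As a rough illustration of how the mandatory flags fit together, the following sketch assembles them into an argument list and checks the PE naming rule. The function name `build_args`, the space-separated form of the `-I` file list, and the choice to test the `_R1`/`_R2` suffix on the file name up to its first dot are all assumptions made for this example; the record does not specify them, and the value passed to `-c` is left to the caller.

```python
from pathlib import Path


def build_args(command, out_dir, genome_prefix, inputs, exome_bed,
               dip_percentage, cluster_profile, paired_end=False):
    """Assemble the mandatory flags from the table into an argument list.

    Hypothetical helper: only the flag names come from the table above.
    """
    if paired_end:
        for name in inputs:
            # Assumption: the suffix is checked on the name up to the first dot.
            stem = Path(name).name.split(".")[0]
            if not stem.endswith(("_R1", "_R2")):
                raise ValueError(
                    f"PE file name must end with _R1 or _R2: {name}")
    args = ["-c", command, "-od", str(out_dir), "-genP", genome_prefix, "-I"]
    # Assumption: input files are passed as separate, space-delimited values.
    args += [str(p) for p in inputs]
    args += ["-sfeb", str(exome_bed), "-dsP", str(dip_percentage),
             "-k", str(cluster_profile)]
    return args
```

The returned list could be appended to whatever launcher the VirtualBox or Cloud image provides; that launcher name is deliberately not guessed here.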
Description of output files.
| Name | Format | Description |
| read qualities | | read quality statistics report available for raw and refined reads |
| read alignment | bam, bai | result files of alignment and alignment filtering steps |
| insert size distribution | png | insert size histogram (provided for PE data only) |
| exon counts | tsv | number of covering reads and fold coverage per exon |
| mutations | vcf, tsv | list of detected mutations |
| summary report | tsv, xlsx | detailed report of the analysis including several key figures |
Listed are key intermediate and final results that are created by the pipeline.
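Because the detected mutations are written in VCF, a quick tally of SNPs versus DIPs can be made with a few lines of plain Python. This is a minimal sketch, not part of the pipeline: it assumes standard tab-separated VCF columns (REF in column 4, ALT in column 5), counts each alternative allele separately, and treats any allele pair that is not a single-base substitution as a DIP. Symbolic alleles are not handled.

```python
def count_variants(vcf_lines):
    """Count SNPs and DIPs (indels) among VCF data lines."""
    counts = {"SNP": 0, "DIP": 0}
    for line in vcf_lines:
        # Skip blank lines and header lines ("##meta" and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            # Single base replaced by a single base: SNP; otherwise DIP.
            kind = "SNP" if len(ref) == 1 and len(alt) == 1 else "DIP"
            counts[kind] += 1
    return counts
```

In practice the lines would come from an open file handle, for example `count_variants(open(path))`, where `path` points to one of the pipeline's VCF outputs.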
Detailed results of SIMPLEX evaluation.
| | SE samples | PE samples |
| reads passed preprocessing | 98% | 100% |
| reads mappable | 63% | 54% |
| reads used for variant detection | 23% | 16% |
| number of raw SNPs | 14,926 | 17,858 |
| number of filtered SNPs | 6,357 | 7,875 |
| number of DIPs | 473 | 402 |
| raw SNPs in dbSNP | 92% | 76% |
| filtered SNPs in dbSNP | 99% | 98% |
| DIPs with RefSeq association | 99% | 99% |
| raw loss-of-function SNPs | 1,181 | 4,098 |
| filtered loss-of-function SNPs | 593 | 1,329 |
| loss-of-function DIPs | 47 | 96 |
| missense/nonsense raw SNPs | 737 | 1,562 |
| missense/nonsense filtered SNPs | 497 | 1,098 |
Listed are key figures (averages) for single-end (SE) and paired-end (PE) samples.
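To put the raw and filtered SNP counts in relation, the share of raw SNPs that survive filtering can be computed directly from the table. The snippet below is only a worked calculation on the averages listed above; the variable and function names are illustrative.

```python
# Average SNP counts taken from the evaluation table.
raw_snps = {"SE": 14926, "PE": 17858}
filtered_snps = {"SE": 6357, "PE": 7875}


def retained_fraction(raw, filtered):
    """Fraction of raw SNPs that remain after filtering."""
    return filtered / raw


for kind in ("SE", "PE"):
    share = retained_fraction(raw_snps[kind], filtered_snps[kind])
    print(f"{kind}: {share:.1%} of raw SNPs pass filtering")
# SE: 42.6% of raw SNPs pass filtering
# PE: 44.1% of raw SNPs pass filtering
```

On average, then, somewhat fewer than half of the raw SNP calls are retained for both sample types.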
Runtime summary for Kabuki syndrome study.
| Statistic | SE samples | PE samples |
| mean pipeline runtime | 12:39 | 21:19 |
| median pipeline runtime | 12:54 | 24:59 |
| longest pipeline runtime | 15:14 | 26:37 |
| shortest pipeline runtime | 07:39 | 05:13 |
| mean longest step (local realignment) | 02:45 | 06:59 |
| mean alignment duration | 02:50 | 04:52 |
| mean SNP calling duration | 00:05 | 00:06 |
Listed are overall and key runtime statistics (in hours:minutes).
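The runtime figures can be compared more easily once they are converted to minutes. The sketch below assumes the table entries are in hours:minutes, which is how the "in hours" caption is read here, and computes the share of the mean pipeline runtime spent on alignment. The helper name `to_minutes` is illustrative.

```python
def to_minutes(hhmm):
    """Convert an 'hh:mm' string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Mean values taken from the runtime table.
mean_runtime = {"SE": "12:39", "PE": "21:19"}
mean_alignment = {"SE": "02:50", "PE": "04:52"}

for kind in ("SE", "PE"):
    share = to_minutes(mean_alignment[kind]) / to_minutes(mean_runtime[kind])
    print(f"{kind}: alignment takes {share:.1%} of the mean runtime")
# SE: alignment takes 22.4% of the mean runtime
# PE: alignment takes 22.8% of the mean runtime
```

Under this reading, alignment accounts for a little over a fifth of the mean runtime for both SE and PE samples.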
Comparison of exome analysis tools.
| Criteria | SIMPLEX | ngs_backbone | GATK | inGAP | SeqGene | GAMES | TREAT | Atlas2 |
| Free of charge | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SE/PE data handling | ✓/✓ | ✓/- | ✓/✓ | ✓/✓ | n.m. | ✓/✓ | ✓/✓ | ✓/✓ |
| NS/CS data handling | ✓/✓ | ✓/✓ | ✓/✓ | ✓/- | n.m. | ✓/✓ | ✓/- | ✓/✓ |
| Alignment | ✓ | ✓ | - | ✓ | ✓ | - | ✓ | - |
| Variant annotation | ✓ | - | ✓ | - | ✓ | ✓ | ✓ | - |
| Highly customizable | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ | - |
| PCR duplicate handling | ✓ | - | - | - | - | ✓ | - | - |
| Homo-/heterozygosity | ✓/✓ | -/- | ✓/✓ | -/- | ✓/✓ | -/- | ✓/✓ | ✓/✓ |
| Quality reports | ✓ | ✓ | ✓ | - | ✓ | - | ✓ | - |
| Summary reports | ✓ | - | - | ✓ | - | ✓ | - | - |
| HPC support | ✓ | ✓ | ✓ | - | - | - | ✓ | - |
| Cloud support | ✓ | - | - | - | - | - | ✓ | ✓ |
| Graphical user interface | - | - | - | ✓ | - | - | - | ✓ |
| Multi user support | ✓ | - | - | - | - | - | - | - |
| Standalone | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Web service | ✓ | - | - | - | - | - | - | - |
Compared are several key features of currently available non-commercial exome sequencing analysis pipelines.
n.m.: not mentioned.
a) [14].
b) [15].
c) [16].
d) [17].
e) [18].
f) [50].
g) [19].