Federica Torri, Ivo D Dinov, Alen Zamanyan, Sam Hobel, Alex Genco, Petros Petrosyan, Andrew P Clark, Zhizhong Liu, Paul Eggert, Jonathan Pierce, James A Knowles, Joseph Ames, Carl Kesselman, Arthur W Toga, Steven G Potkin, Marquis P Vawter, Fabio Macciardi.
Abstract
Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders. These methods can be applied to complex disorders as well, and have been adopted as one of the current mainstream approaches in population genetics. These achievements have been made possible by next generation sequencing (NGS) technologies, which require substantial bioinformatics resources to analyze the dense and complex sequence data. The huge analytical burden of data from genome sequencing might be seen as a bottleneck slowing the publication of NGS papers at this time, especially in psychiatric genetics. We review the existing methods for processing NGS data, to place into context the rationale for the design of a computational resource. We describe our method, the Graphical Pipeline for Computational Genomics (GPCG), to perform the computational steps required to analyze NGS data. The GPCG implements flexible workflows for basic sequence alignment, sequence data quality control, single nucleotide polymorphism analysis, copy number variant identification, annotation, and visualization of results. These workflows cover all the analytical steps required for NGS data, from processing the raw reads to variant calling and annotation. The current version of the pipeline is freely available at http://pipeline.loni.ucla.edu. These applications of NGS analysis may gain clinical utility in the near future (e.g., identifying miRNA signatures in diseases) when the bioinformatics approach is made feasible. Taken together, the annotation tools and strategies that have been developed to retrieve information and test hypotheses about the functional role of variants present in the human genome will help to pinpoint the genetic risk factors for psychiatric disorders.
Year: 2012 PMID: 23139896 PMCID: PMC3490498 DOI: 10.3390/genes3030545
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Review of the most used software in next-generation sequencing (NGS) data analysis, which includes two major computational macro-processes: (1) a primary step related to mapping and assembling, with alignment quality control, quality score recalibration and realignment in “difficult” regions of the genome; and (2) secondary, advanced steps focused on calling and annotating variants (single nucleotide polymorphisms (SNPs), insertions-deletions (Indels) and copy number variations (CNVs)). These macro-processes are briefly reviewed to provide a background for the software algorithms embedded in DNA-Seq analysis.
| Process | Software & Algorithms | Website |
|---|---|---|
| | homemade script | (N/A) |
| | MAQ, BWA, BWA-SW (SE only), PERM, BOWTIE, SOAPv2, MOSAIK, NOVOALIGN | |
| | VELVET, SOAPdenovo, ABYSS | |
| | SAMTOOLS, PICARD | |
| | GATK, PICARD, SAMTOOLS, IGVtools | |
| | SVA, SAMTOOLS, ERDS | |
| | SAMTOOLS, ANNOVAR | |
| | GATK, ANNOVAR | |
| CNVseq | CNVseq, R | |
| | SVA, SAMTOOLS, ERDS | |
| | CNVer, BOWTIE, SAVANT | |
| | dwgsim | |
Comparison of several graphical workflow environments for managing pipelines. Most workflow environments provide graphical infrastructures for the interactive handling of data, with several advantages over managing the same processes via command-line or scripting interfaces. When adding new software tools, some of these architectures require software recompilation and some do not. The status reports generated during or after workflow execution also vary significantly. Data storage (internal or external), operating system and local hardware dependencies, and the use of available grid managers likewise differ between workflow environments. The Pipeline shares many features with the alternative environments for software tool integration and interoperability, but there are also some valuable differences. The Laboratory of Neuro Imaging (LONI) pipeline infrastructure provides computational workflow execution capability with or without the use of local hardware or administrative support. Adding new software tools to the pipeline library is efficient: it does not require recompiling the programs, only a brief description of the tool invocation syntax entered in the client “module description” dialog. Thus, the LONI pipeline offers a degree of flexibility and simplicity in workflow design that is not available in the other two most used systems for NGS data analysis, Taverna and Galaxy. The LONI pipeline also allows workflows to be paused and resumed, and provides explicit controls ensuring that a process is instantiated only after all upstream activities have completed successfully. Additionally, the available Taverna and Galaxy services, when deployed on the Amazon Web Services cloud, have restrictive upper limits on storage (100 GB) and per-process RAM (64 GB), creating bottlenecks in data staging to and from the servers and in computational runs. The Pipeline service provides a pair of dedicated open-access servers (http://genomics.loni.ucla.edu), each with 40 cores and 1.4 TB of shared RAM.
| Workflow Management System | Module concatenation and interoperability | Asynchronous Task Management | Requires Tool Recompiling | Data Storage | Platform Independent | Client-Server Model | Grid Enabled |
|---|---|---|---|---|---|---|---|
| LONI Pipeline | Y | Y | N | External | Y | Y | Y |
| Taverna | Y | N | Y | Internal (MIR) | Y | N | Y |
| Kepler | Y | N | Y | Internal (actors) | Y | N | Y |
| Triana | Y | N | Y | Internal data structure | Y | N | Y |
| Workflow Navigation System | N | N | N/A | External | Y | N | N |
| Galaxy | N | N | Y | External | N | Y | N |
| VisTrails | Y | N | Y | Internal | N | N | N |
Workflow Management System: list of the compared graphical workflow environments. Module concatenation and interoperability: Asynchronous Task Management: ability to submit new workflows and report the status of executing or completed workflows asynchronously, e.g., despite repeated interruptions of network connectivity. Requires Tool Recompiling: requirement to recompile new computational libraries or software tools against the graphical environment libraries, and to restart the environment when provisioning these new services. Data Storage: ability of the environment to store data (raw, processed and derived) internally (RAM/DB) or externally (NFS/Services). Platform Independent: dependency of the environment on the local hardware and operating system. Platform independence refers to the workflow environment itself, not to the computational library of tools accessible via that environment. For environments with a Client-Server architecture this is irrelevant, as the platform-independent clients can always connect to (possibly platform-dependent) back-end pipeline servers to submit data, process protocols and monitor the status of executing workflows; specific operating systems (most commonly Linux) may be required on those servers by many informatics and genomics computing libraries. Client-Server Model: independent server and clients that can be broadly interconnected. Grid Enabled: use of a Grid Engine/Grid Job Manager. Legend: Y = yes, N = no.
Figure 1. A snapshot of the general organization of a workflow within the LONI pipeline environment. This is an example of modules embedded in an alignment workflow based on the BWA software. The user can simply set the location of the input files in the data sources, manage the programs involved in the core modules, and indicate the location of the output files in the output data sink section. Every section can be interactively edited or modified through a menu of options accessed by right-clicking on the respective portion of the workflow.
Figure 2. An example of the workflow approach to analyze DNA-Seq data in GPCG. Several alternative workflows can be run independently or connected in a logical flow. Once the reads have been pre-processed, they can be aligned (1.1), undergo (1.3) Basic and (1.4) Advanced QC, (2.1a) SNP/Indels and (2.1b) CNVs calling and annotation. The reads can also undergo (1.2) de novo assembly, and if a reference genome is available the contigs can be realigned back to the reference genome and then undergo the following computational processes.
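The logical flow in Figure 2 can be pictured as a dependency graph in which a step starts only once its upstream steps have finished. The following is a minimal sketch in Python, not the LONI pipeline's own scheduler: the step names are taken from the caption, but the linear chain of dependencies and the `run_in_order` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of upstream steps it waits for.
# This particular chain is an illustrative assumption, not the GPCG definition.
dependencies = {
    "1.1 alignment": {"pre-processing"},
    "1.3 basic QC": {"1.1 alignment"},
    "1.4 advanced QC": {"1.3 basic QC"},
    "2.1a SNP/Indel calling": {"1.4 advanced QC"},
    "2.1b CNV calling": {"1.4 advanced QC"},
}


def run_in_order(dependencies, run_step):
    """Call run_step on each step only after all of its upstream steps are done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    executed = []
    while sorter.is_active():
        for step in sorter.get_ready():
            run_step(step)
            executed.append(step)
            sorter.done(step)  # unlocks the steps that depend on this one
    return executed


if __name__ == "__main__":
    run_in_order(dependencies, lambda step: print("running", step))
```

The two variant-calling steps become ready at the same time, so their relative order is not fixed; a real scheduler could dispatch them in parallel.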
Review of the processes and related workflows currently implemented in the NGS Pipeline. All processes and workflows have been tested and validated and are available for use by interested scientists. A single pipeline can be run independently from others, or can be connected as illustrated by the analytical workflow protocols described in this Table.
| Process | Process Description | Software & Algorithms | Input * | Output (Files) | Upstream Module Dependencies | Downstream Module Dependencies |
|---|---|---|---|---|---|---|
| | | homemade script | reads (original solexa format) | subset of reads (fastq format) | | |
| | | MAQ | reads (binary fastq format) | SAM | | |
| | | BWA | reads (fastq format) | SAM | | |
| | | BWA-SW (SE only) | reads (solexa format) | SAM | | |
| | | PERM | reads (fastq format) | SAM | | |
| | | BOWTIE | reads (solexa format) | SAM | | |
| | | SOAPv2 | reads (fastq format) | SAM | | |
| | | MOSAIK | reads (solexa format) | SAM | | |
| | | NOVOALIGN | reads (solexa format) | SAM | | |
| | | VELVET | reads (fastq format) | contigs file | | |
| | | SOAPdenovo | reads (fastq format) | contigs file | | |
| | | ABYSS | reads (fastq format) | contigs file | | |
| | | PICARD, SAMTOOLS | SAM | BAM | | |
| | | PICARD, SAMTOOLS, GATK | BAM | BAM clean | | |
| | | Sequence Variant Analyzer v1.0 | BAM clean | csv files with variants and annotation | | |
| | | SAMTOOLS and ANNOVAR for annotation | BAM clean | txt files with variants | | |
| | | Unified genotyper and ANNOVAR for annotation | BAM clean | txt files with variants | | |
| | | BOWTIE, CNVer, SAVANT | reads (solexa format) | txt file with the CNVs calls | | |
| | | CNVseq | SAM | txt file with the CNVs calls | | |
| | | SAMTOOLS ERDS Sequence variant analyzer ERDS v1.0 | BAM clean | csv file with the CNVs calls | | |
| | Generate simulated reads according to the needs of the user | dwgsim | - | SE or PE .fastq files | - | |
|
* By solexa format we refer to the Phred quality score encoding used by Illumina Pipeline versions prior to 1.8 (Phred+64). Newer versions of the Illumina Pipeline produce read files directly in Sanger format (Phred+33). To guarantee backward compatibility with data produced by Illumina Pipeline versions earlier than 1.8, we have embedded a conversion step from Solexa FASTQ to Sanger FASTQ for the alignment software that does not support the solexa format. The user can remove this step if the conversion is not needed; # External software such as PLINK/SEQ for statistical analysis or IGV for visualization is not embedded in the workflow; $ If a reference genome is not available, the contigs can be used as they are for further analysis. If a reference genome is available, the contigs can be aligned back to the reference genome with BWA-SW.
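The conversion described in the first footnote amounts to shifting the ASCII offset of each quality character from 64 to 33. The sketch below illustrates that re-encoding in Python under the footnote's definition of solexa format (Phred+64). It is not the pipeline's own sol2sanger module, and the function names and example record are hypothetical.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode using the +64 offset
        if score < 0:
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(score + 33))  # re-encode using the +33 offset
    return "".join(converted)


def convert_record(record):
    """Convert one 4-line FASTQ record; only the quality line changes."""
    header, sequence, separator, quality = record
    return [header, sequence, separator, phred64_to_phred33(quality)]


if __name__ == "__main__":
    example = ["@read1", "ACGT", "+", "hhdB"]  # made-up record
    print(convert_record(example))
```

Rejecting characters below the +64 offset makes the sketch fail loudly on input that is already in Sanger format, rather than silently corrupting the scores.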
Figure 3. Schematic representation of alignment modules available for both single and paired-end data (2.1).
Figure 4. Schematic representation of the de novo assembly workflows available (2.2).
Figure 5. A snapshot of the general organization of the Basic QC workflow (2.1.3). After an initial file cleaning that performs various fix-ups, the alignment file in Sequence Alignment/Map (SAM) format is converted to a Binary Sequence Alignment/Map (BAM) file and sorted. The workflow then handles duplicated reads by either removing or marking potential PCR duplicates: if multiple read pairs have identical external coordinates, only the pair with the highest mapping quality is retained. This step is particularly suited to paired-end data, and the user can switch between the two options by changing the REMOVE_DUPLICATES argument in the GUI of this module. The removal step can be excluded from a workflow run, depending on the interest in studying repetitive elements. For paired-end reads, the pipeline then ensures that all mate-pair information is in sync between each read and its mate, fixing any inconsistent information. The BAM file then undergoes MD tagging, which adds a string labeling the mismatching positions. Finally, the BAM file is indexed using the index of the reference genome.
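The duplicate rule stated in the Figure 5 caption (among read pairs with identical external coordinates, keep the one with the highest mapping quality) can be sketched as follows. This is a simplified illustration in Python and not the code of PICARD or SAMTOOLS, which consider further criteria; the `ReadPair` type and `split_duplicates` function are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    start: int  # leftmost external coordinate of the pair
    end: int    # rightmost external coordinate of the pair
    mapq: int   # mapping quality used to rank candidate duplicates


def split_duplicates(pairs):
    """Return (kept, duplicates): the best pair per coordinate group, and the rest."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair.chrom, pair.start, pair.end)].append(pair)
    kept, duplicates = [], []
    for group in groups.values():
        best = max(group, key=lambda p: p.mapq)  # first one wins on ties
        kept.append(best)
        duplicates.extend(p for p in group if p is not best)
    return kept, duplicates


if __name__ == "__main__":
    pairs = [
        ReadPair("a", "chr1", 100, 300, 20),
        ReadPair("b", "chr1", 100, 300, 60),
        ReadPair("c", "chr1", 500, 700, 30),
    ]
    kept, duplicates = split_duplicates(pairs)
    print([p.name for p in kept], [p.name for p in duplicates])
```

Returning the duplicates separately mirrors the two options in the caption: a caller can either drop that list (removal) or keep the pairs and flag them (marking).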
Figure 6. A snapshot of the general organization of the Advanced QC workflow (2.1.4). (A) After the basic QC, the reads that map within Indels in the individual’s genome compared to the reference genome are locally realigned, as they may lead to alignment artifacts that can easily be misinterpreted as SNPs. The next step is base quality score recalibration, which recalibrates the base quality scores of reads by analyzing the covariation among several features of a base (e.g., reported quality scores, the position within the read). The workflow produces plots and tables with the most important metrics for a DNA-Seq experiment (i.e., mean quality by cycle, insert size metrics, quality score distribution, GC-bias metrics, main alignment metrics) with the PICARD software; (B) The users can then produce useful tracks for the visualization of the data in the Integrative Genomics Viewer (IGV). Examples are (a) the callability track (computed with GATK, it evaluates how far a region can be trusted in terms of coverage, accuracy and quality, and can be visualized as a bar chart in IGV); and (b) the sliding window coverage (i.e., a computation of average alignment coverage over a specified window size across the genome with igvtools). The main outputs of this step are: (1) a cleaned BAM file ready to be used for variant calling, (2) a set of plots and text files that give the user an overall picture of the quality of the experiment and (3) a set of track files to visualize the dataset and its features. The user can load the indexed BAM files and these tracks in IGV to visualize and annotate the reads across the whole genome with user-produced or online tracks (RefSeq, RepeatMasker, Database of Genomic Variants).
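The windowed coverage track in panel (B) is, in essence, an average of per-base depth over windows of a chosen size. The sketch below shows that computation in Python on a made-up depth list. It assumes non-overlapping windows and depths already held in memory, and it does not reproduce igvtools; `window_mean_coverage` is a hypothetical helper.

```python
def window_mean_coverage(depths, window_size):
    """Average per-base depth over consecutive, non-overlapping windows.

    Returns a list of (window_start, mean_depth) tuples; the last window
    may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    means = []
    for start in range(0, len(depths), window_size):
        window = depths[start:start + window_size]
        means.append((start, sum(window) / len(window)))
    return means


if __name__ == "__main__":
    depths = [0, 2, 4, 6, 10, 10, 10, 10, 3]  # made-up per-base depths
    print(window_mean_coverage(depths, 4))
```

Larger windows give a smoother track at the cost of hiding short low-coverage regions, which is why the window size is left as a parameter.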
Figure 7. A snapshot of the three independent workflows available for variant calling and annotation. Sequence Variant Analyzer (SVA) provides a graphical user interface (GUI) to visualize, annotate, filter and analyze the called variants.
Figure 8. A snapshot of the general organization of the CNVs modules. ERDS, CNVer and CNVseq have been implemented as a first wave of tools to call CNVs in DNA-Seq data.
Run times on simulated data for the modules common to the Graphical Pipeline for Computational Genomics (GPCG) and Galaxy. GPCG run times were shorter than Galaxy's for all tested modules.
| Analytical category | Input file (file size) | Job description | GPCG workflow name | GPCG time | Galaxy module name | Galaxy time |
|---|---|---|---|---|---|---|
| | 2.4 Gb × 2 (PE) | Upload of the data into the webserver | (N/A) | (N/A) | Upload of the data | 180 min |
| | 2.4 Gb fastq file | Conversion of solexa into sanger format | Preprocessing pipeline: sol2sanger | 6 min | FASTQ Groomer | 45 min |
| | 2.4 Gb × 2 fastq files (PE) | BWA paired end alignment with default parameters | BWA PE (1.1) | 132 min | Map with BWA for Illumina | 240 min |
| | 2.4 Gb × 2 fastq files (PE) | Bowtie paired end alignment with default parameters | BOWTIE PE (1.1) | 205 min | Map with Bowtie for Illumina | 270 min |
| | 1.6 Gb SAM file | Synchronization of mate-pair information | Fix Mate Information (Basic QC, 1.3) | 6 min | Paired Read Mate Fixer for paired data | 30 min |
| | 1.6 Gb SAM file | Marks duplicate reads | Mark Duplicates (Basic QC, 1.3) | 2 min | Marks duplicate reads | 20 min |
| | 1.6 Gb SAM file | Reports the alignment metric of a SAM/BAM file | Collect Alignment Summary Metrics (Advanced QC, 1.4) | 2 min | SAM/BAM Alignment Summary Metrics | 6 min |
| | 1.6 Gb SAM file | Reports the SAM/BAM GCbias metrics | Collect GC Bias Metrics (Advanced QC, 1.4) | 3 min | SAM/BAM GC Bias Metrics | 7 min |
| | 1.6 Gb SAM file | Reports the insert size metrics | Collect Insert Size Metrics (Advanced QC, 1.4) | 2 min | Insertion size metrics for PAIRED data | 6 min |
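The run times in the table above can be turned into per-module ratios to make the comparison easier to read. The short Python sketch below does only that arithmetic, using the minutes reported in the table; the job labels are abbreviated here for readability, and the data upload row is left out because no GPCG time is reported for it.

```python
# (GPCG minutes, Galaxy minutes) as reported in the run-time table.
timings_min = {
    "sol2sanger / FASTQ Groomer": (6, 45),
    "BWA PE alignment": (132, 240),
    "Bowtie PE alignment": (205, 270),
    "Fix mate information": (6, 30),
    "Mark duplicates": (2, 20),
    "Alignment summary metrics": (2, 6),
    "GC bias metrics": (3, 7),
    "Insert size metrics": (2, 6),
}


def speedups(timings):
    """Return Galaxy time divided by GPCG time for each job."""
    return {job: galaxy / gpcg for job, (gpcg, galaxy) in timings.items()}


if __name__ == "__main__":
    for job, ratio in speedups(timings_min).items():
        print(f"{job}: {ratio:.1f}x")
```

The ratios differ widely between modules, so a single overall figure would hide most of the variation; reporting them per job keeps the comparison faithful to the table.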
Figure 9. Snapshot from the module we used to run the alignment of an entire flowcell with BWA-PE. This workflow includes the indexing of the reference genome (BWA: Index), the alignment of the two reads separately (BWA-aln) and their final combination (BWA: samse/sampe). The sixteen input files (i.e., one forward and one reverse read for each one of the eight lanes of the flowcell) are shown in the data source panel magnified on the right. The pipeline allows managing all the options of the BWA alignment software through the module’s GUI without worrying about complex command lines.
Figure 10. The LONI pipeline computational library Navigator allows the interactive traversal, inspection, downloading and utilization of specific NGS analyses. Nested insert images illustrate the most common steps of search, selection, comparison, modification and execution of available end-to-end computational genomics workflows.
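For readers who want to see what the Figure 9 module hides behind its GUI, the sketch below builds the corresponding BWA command lines in Python without executing them. It follows the three stages named in the caption (index, aln on each read file, sampe). The file-naming scheme and the `bwa_pe_commands` helper are assumptions for illustration, and the `-f` output option should be checked against the usage message of the installed BWA version.

```python
def bwa_pe_commands(reference, lanes):
    """Build argv lists for bwa index, per-read bwa aln, and bwa sampe.

    Nothing is executed; the lists can be inspected or passed to subprocess.run.
    File names follow a hypothetical <lane>_1.fastq / <lane>_2.fastq scheme.
    """
    commands = [["bwa", "index", reference]]
    for lane in lanes:
        fq1, fq2 = f"{lane}_1.fastq", f"{lane}_2.fastq"
        sai1, sai2 = f"{lane}_1.sai", f"{lane}_2.sai"
        # Align forward and reverse reads separately.
        commands.append(["bwa", "aln", "-f", sai1, reference, fq1])
        commands.append(["bwa", "aln", "-f", sai2, reference, fq2])
        # Combine the two alignments into one paired-end SAM file.
        commands.append(
            ["bwa", "sampe", "-f", f"{lane}.sam", reference, sai1, sai2, fq1, fq2]
        )
    return commands


if __name__ == "__main__":
    lanes = [f"lane{i}" for i in range(1, 9)]  # eight lanes of one flowcell
    for argv in bwa_pe_commands("reference.fa", lanes):
        print(" ".join(argv))
```

Building argument lists rather than shell strings avoids quoting problems with unusual file names and keeps the sketch free of side effects.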