Malachi Griffith1, Obi L Griffith2, Scott M Smith3, Avinash Ramu3, Matthew B Callaway3, Anthony M Brummett3, Michael J Kiwala3, Adam C Coffman3, Allison A Regier3, Ben J Oberkfell3, Gabriel E Sanderson3, Thomas P Mooney3, Nathaniel G Nutter3, Edward A Belter3, Feiyu Du3, Robert L Long3, Travis E Abbott3, Ian T Ferguson3, David L Morton3, Mark M Burnett3, James V Weible3, Joshua B Peck3, Adam Dukes3, Joshua F McMichael3, Justin T Lolofie3, Brian R Derickson3, Jasreet Hundal3, Zachary L Skidmore3, Benjamin J Ainscough3, Nathan D Dees3, William S Schierding3, Cyriac Kandoth3, Kyung H Kim3, Charles Lu3, Christopher C Harris3, Nicole Maher4, Christopher A Maher5, Vincent J Magrini1, Benjamin S Abbott3, Ken Chen3, Eric Clark3, Indraniel Das3, Xian Fan3, Amy E Hawkins3, Todd G Hepler3, Todd N Wylie3, Shawn M Leonard3, William E Schroeder3, Xiaoqi Shi3, Lynn K Carmichael3, Matthew R Weil3, Richard W Wohlstadter3, Gary Stiehr3, Michael D McLellan3, Craig S Pohl3, Christopher A Miller3, Daniel C Koboldt3, Jason R Walker3, James M Eldred3, David E Larson1, David J Dooling3, Li Ding6, Elaine R Mardis7, Richard K Wilson7.
Abstract
In this work, we present the Genome Modeling System (GMS), an analysis information management system capable of executing automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. The GMS also serves as a platform for bioinformatics development, allowing a large team to collaborate on data analysis, or an individual researcher to leverage the work of others effectively within its data management system. Rather than separating ad-hoc analysis from rigorous, reproducible pipelines, the GMS promotes systematic integration between the two. As a demonstration of the GMS, we performed an integrated analysis of whole genome, exome and transcriptome sequencing data from a breast cancer cell line (HCC1395) and matched lymphoblastoid line (HCC1395BL). These data are available for users to test the software, complete tutorials and develop novel GMS pipeline configurations. The GMS is available at https://github.com/genome/gms.
Year: 2015 PMID: 26158448 PMCID: PMC4497734 DOI: 10.1371/journal.pcbi.1004274
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Overview of the GMS.
The genome modeling system (GMS) is implemented to use a federated disk SAN, with meta-data stored in a PostgreSQL relational database. Sample management tools allow the import of new samples and instrument data. Data are then processed through various analysis pipelines (e.g., reference alignment, somatic variation detection, etc.) that in turn are managed and monitored by a workflow system (Box 1). Stand-alone GMS tools, not part of automated pipelines, are available through a common tool tree. Most components of the system can be accessed through an Ubuntu Linux command-line interface or Ruby-on-Rails web interface.
Major GMS pipelines.
A brief description of each analysis pipeline tested for initial release of the GMS.
| Pipeline | Description | Products |
|---|---|---|
| Genotype Microarray | Performs genotype calling on SNP array data against a reference sequence. | SNVs BED file. |
| Reference Alignment | Performs alignment and variant detection for reads from a single sample. Works with WGS data and capture data. | BAM file of aligned reads, VCF files and BED files for germline SNVs, Indels, SVs, and CNVs. Reports on coverage. |
| Somatic Variation | Performs tumor/normal variant detection. Extends reference alignment with somatic evaluation, LOH analysis, annotation and prioritization. Works with WGS data and capture data. | VCF files and BED files for somatic SNVs, Indels, SVs, and CNVs. |
| RNA-seq | Uses Bowtie/TopHat/Cufflinks to assemble transcripts and estimate abundance, alternative splicing, alternative promoter usage, etc. Also uses various tools to perform comprehensive quality and coverage analysis of RNA-seq libraries. | Spliced alignment BAM, FPKM expression, digital expression, fusion detection, etc. |
| Differential Expression | Combines results from a pair of RNA-seq builds and performs differential expression analysis. | CuffDiff and CummeRbund output. |
| Med Seq (aka Clin Seq) | Integrates data from WGS, exome and transcriptome sequencing of a single patient’s tumor. Visualization and annotation of somatic events. Prioritization of somatic events by relevance to cancer biology and therapeutic decision making. | Approximately 2,000 files, including: spreadsheets of ranked and annotated variants, drug-gene interactions, Circos plots, copy number images, mutation diagrams, etc. |
Data processed by the GMS.
A brief summary of data processed with the GMS at The Genome Institute of Washington University School of Medicine in St. Louis (as of October 2014).
| Metric | Human | Non-human | Total |
|---|---|---|---|
| WGS cases (samples) | 2,517 (4,349) | 355 (534) | 2,872 (4,883) |
| Exome/targeted cases (samples) | 30,343 (35,366) | 6,027 (8,270) | 36,370 (43,636) |
| RNA/cDNA cases (samples) | 375 (555) | 711 (855) | 1,086 (1,410) |
| Bp of Illumina NGS reads | 622 terabases | 82 terabases | 704 terabases |
Fig 2. Key concepts of the GMS.
The genome modeling system is architected around the idea of a ‘genome model’. The following vignettes illustrate key concepts integral to these models: (A) A subject can be modeled multiple times, possibly each with distinct ‘processing profiles’. For example, two different models can be defined for the HCC1395 genome using the ‘reference alignment’ pipeline. In Model 1, the processing profile specifies the use of BWA for alignment and Samtools for variant detection. In Model 2, Bowtie2 and GATK are used for these steps instead. (B) A given processing profile can be used across a group of models, ensuring, for instance, that all subjects in a cohort are processed in similar ways. In this example, two different cell line genomes (HCC1395 and XY2123) have models defined of the exact same type, using the processing profile with BWA/Samtools specified. (C) A model has no results until a build is generated. If the model is updated to have new inputs, a new build is required. Builds are immutable snapshots of modeling pipeline results. In this example, the HCC1395 genome has a reference alignment model again making use of the BWA/Samtools profile. However, as new instrument data becomes available, new builds are constructed to reflect the most complete data. (D) When models are used as inputs for other models, the last complete build for the input model is used as an input for the downstream build. In this example, both tumor and normal genomes are available for an individual (in this case HCC1395). Reference alignment models are built for each sample and then both are used as inputs for a third ‘somatic variation’ model. In reality, it is the underlying data in the reference alignment builds that are used to create a somatic variation build, identifying all variants that are thought to be tumor specific.
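The relationships in vignettes (A) through (C) can be sketched as a small data model. This is a minimal illustration, not the GMS implementation: the class names, field names, and the `needs_build` check are assumptions introduced here, and only the concepts (subject, model, processing profile, immutable build, last complete build) come from the caption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ProcessingProfile:
    """Reusable parameter set for one pipeline type (hypothetical fields)."""
    pipeline: str
    aligner: str
    variant_caller: str


@dataclass(frozen=True)
class Build:
    """Immutable snapshot of results for one model's inputs at build time."""
    model_name: str
    instrument_data: tuple
    status: str  # "complete" or "failed" in this sketch


@dataclass
class Model:
    """One subject analyzed with one processing profile."""
    name: str
    subject: str
    profile: ProcessingProfile
    instrument_data: list = field(default_factory=list)
    builds: list = field(default_factory=list)

    def add_instrument_data(self, data_id: str) -> None:
        # Updating inputs never alters existing builds; a new build is required.
        self.instrument_data.append(data_id)

    def start_build(self, status: str = "complete") -> Build:
        # Freeze the current inputs into the build as a tuple.
        build = Build(self.name, tuple(self.instrument_data), status)
        self.builds.append(build)
        return build

    def last_complete_build(self) -> Optional[Build]:
        # Downstream models would consume this build, as in vignette (D).
        for build in reversed(self.builds):
            if build.status == "complete":
                return build
        return None

    def needs_build(self) -> bool:
        last = self.last_complete_build()
        return last is None or last.instrument_data != tuple(self.instrument_data)
```

Under these assumptions, two models that share a `ProcessingProfile` instance are guaranteed to use the same tools, and adding a lane of instrument data makes `needs_build()` return `True` until a new build completes.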
Fig 3. Somatic variation processing profile and workflow.
To illustrate key GMS concepts, the processing profiles and workflow for the somatic variation pipeline are shown. Abbreviations: copy number variant (CNV), copy number amplification (CNA), genome analysis tool kit (GATK), insertion/deletion (Indel), loss of heterozygosity (LOH), mapping quality (MQ), single nucleotide variant (SNV), structural variant (SV), variant allele frequency (VAF).
Fig 4. HCC1395 (“TST1”) example input, models, and outputs.
A test dataset for the HCC1395 cell line is provided with the GMS software to allow testing of software installation, and facilitate further development. It is also used to illustrate much of the current functionality of the GMS. HCC1395 tumor and the corresponding HCC1395BL ‘normal’ cell line DNA and RNA samples were sequenced by whole genome, exome, and RNA-seq methods producing six sets of instrument data for input to various GMS pipelines. Additional required inputs for the pipelines include a reference genome (e.g., GRCh37), gene annotations (e.g., Ensembl 67_37l), and variant databases (e.g., dbSNP37). Different versions (processing profiles) of the reference alignment were used to align WGS and exome DNA reads to the reference genome. A separate RNA-seq pipeline similarly aligns RNA reads. Alternate versions of the somatic variation pipeline are used to call various types of variants from exome and WGS data by comparing tumor and normal reference alignments. A differential expression pipeline identifies significantly altered transcript expression levels by comparing the tumor and normal RNA-seq alignments. Finally, the MedSeq pipeline summarizes all upstream pipelines into a single convenient result set. This includes a multitude of reports and visualizations for single nucleotide variants (SNVs), Indels (insertions and deletions), SVs (structural variants), CNVs (copy number variations), transcript fusions, differentially expressed genes, alternatively expressed isoforms, and much more. Data types are further integrated to, for example, identify which variants at the DNA level are expressed at the RNA level and which events affect known cancer driver genes or druggable targets.
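The dependencies among the models described in this caption form a directed acyclic graph, so a valid build order can be derived by topological sorting. The sketch below uses Python's standard `graphlib` module. The model names and the exact edges are assumptions based on a reading of the caption, not identifiers or logic taken from the GMS.

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models whose last complete build it consumes.
# Names and edges are illustrative assumptions, not GMS identifiers.
MODEL_INPUTS = {
    "refalign_wgs_tumor": set(),
    "refalign_wgs_normal": set(),
    "refalign_exome_tumor": set(),
    "refalign_exome_normal": set(),
    "rnaseq_tumor": set(),
    "rnaseq_normal": set(),
    "somatic_wgs": {"refalign_wgs_tumor", "refalign_wgs_normal"},
    "somatic_exome": {"refalign_exome_tumor", "refalign_exome_normal"},
    "differential_expression": {"rnaseq_tumor", "rnaseq_normal"},
    "medseq": {"somatic_wgs", "somatic_exome", "differential_expression"},
}


def build_waves(model_inputs):
    """Group models into waves; every model in a wave has all inputs built."""
    sorter = TopologicalSorter(model_inputs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(build_waves(MODEL_INPUTS), start=1):
        print(number, wave)
```

With these edges, the six alignment models fall in the first wave, the two somatic variation models and the differential expression model in the second, and the MedSeq model in the third. Models within a wave do not depend on each other and could in principle be built in parallel.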
Fig 5. Circos plot of HCC1395 tumor/normal comparison.
Circos is a popular tool for summarizing genomic events in a tumor genome. This is just one of many automatically generated visualizations made possible by the GMS. In this example, the WGS, exome and RNA-seq data for HCC1395 are displayed in several tracks along with additional visualizations illustrating individual events. Moving inwards, SNVs and Indels are plotted on the outermost track, then highly expressed genes, CNVs, and finally chromosomal translocations at the center. For events predicted to affect protein coding genes, additional plots are auto-generated to display the mutation position relative to protein domains and previously reported mutations from the COSMIC database, as illustrated in the topmost plot. Moving clockwise, a screenshot of IGV demonstrates one of the somatic deletions identified. IGV XML sessions are automatically generated to allow rapid manual review of all predicted events. Next, a histogram illustrates the expression of a single highly expressed gene relative to the distribution of expression for all genes. Then, a CNV plot is shown for an amplified portion of one chromosome. Finally, the coverage and supporting reads for a chromosomal translocation are depicted.
Terminology for the Genome Modeling System.
Brief descriptions of critical objects in the Genome Modeling System.
| Term | Definition |
|---|---|
| Subject | The entities around which analysis occurs. Exist at multiple levels of granularity. For example, an individual, a cohort, a sample from an individual, or even a species. Anything that can be described abstractly as “having a genome”. When the subject is a human patient, use of the GMS will normally require appropriate ethics review and informed consent of the patient. Related documentation will be linked to analyses via an anonymized unique patient number (UPN) stored in the GMS subject database table along with additional metadata. |
| Model | The basic unit of analysis. Each model represents one state of belief about the sequence and features of a given subject. Multiple models can be made of the same subject, with different processing profiles, and/or different input data used as evidence. |
| Pipeline | Each type of model defines a distinct analysis pipeline. The definition includes a specification for inputs and parameters to each model, as well as logic to construct a workflow to build results given specific values for those inputs and parameters. |
| Processing Profile | A reusable collection of parameters describing how to build a model of a particular type/pipeline. Each is a complete computational method specification, including exact tool names and versions, as well as sufficient logic to determine the precise workflow. All models with the same processing profile have been processed the same way, though input data may vary. |
| Build | One attempt to execute the required workflow for a model, given its inputs. The last complete build for a model represents the current “state” of the model. While models can be updated, the information content in each build is a static snapshot of results. |
| Instrument Data | A unit of data from a sequencer, microarray instrument, or other device, used as primary input to the GMS. Illumina data, for instance, produces one unit of instrument data per flow cell, lane, and index sequence. It is typically associated with a file of reads, and a collection of metrics. |
| Software Result | A reusable intermediate result made by the build process. When the exact same process is to occur a second time on the same inputs with the same parameters, the software result produced the first time is detected. The GMS uses these to prevent redundant work, and expedite processing after minor analysis protocol changes. |
| Disk Allocation | A record of a slice of disk being allocated to a given owner. Builds, software results, and instrument data are owners of disk allocations. |
| Workflow | A graph of steps, and the data flow between those steps. A workflow is generated for each attempt to build a model. Individual steps may also define subordinate workflows, leading to a nested graph of tasks to accomplish the analysis goal. |
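
The "Software Result" entry describes reuse of an intermediate result when the same process runs again on the same inputs with the same parameters. One common way to get this behavior is to key each result on a hash of the tool name, version, parameters, and inputs. The sketch below shows that idea only. The class, its methods, and the choice to treat input order as irrelevant are all assumptions, and the source does not describe how the GMS performs this lookup.

```python
import hashlib
import json


class SoftwareResultCache:
    """In-memory stand-in for a store of reusable intermediate results."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how often real work was performed

    @staticmethod
    def key(tool, version, params, inputs):
        # Serialize deterministically so equal requests hash identically.
        # Sorting inputs assumes their order does not affect the result.
        payload = json.dumps(
            {
                "tool": tool,
                "version": version,
                "params": params,
                "inputs": sorted(inputs),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_create(self, tool, version, params, inputs, run):
        """Return the stored result, calling run() only on a cache miss."""
        result_key = self.key(tool, version, params, inputs)
        if result_key not in self._results:
            self.executions += 1
            self._results[result_key] = run()
        return self._results[result_key]
```

In this sketch, changing any part of the key, such as the tool version in a processing profile, triggers a fresh execution for that step, while steps whose keys are unchanged are reused. That matches the stated goal of expediting processing after minor analysis protocol changes.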