Development of Bioinformatics Pipeline for Analyzing Clinical Pediatric NGS Data.

Erin L Crowgey1, Anders Kolb2, Cathy H Wu1.   

Abstract

Using an Illumina exome sequencing dataset generated from pediatric acute myeloid leukemia (AML) patients positive for the FLT3/ITD, a comprehensive bioinformatics pipeline was developed to support a better clinical understanding of the genetic data associated with the clinical phenotype. The pipeline starts with raw next-generation sequencing (NGS) reads and, using both publicly available resources and custom scripts, analyzes the genomic data for variants associated with pediatric AML. By incorporating functional information such as Gene Ontology annotation and protein-protein interactions, the methodology prioritizes genomic variants and returns disease-specific results and knowledge maps. Furthermore, it compares the somatic mutations at diagnosis with the somatic mutations at relapse and outputs variants and functional annotations that are specific to the relapse state.

Year:  2015        PMID: 26306272      PMCID: PMC4525226     

Source DB:  PubMed          Journal:  AMIA Jt Summits Transl Sci Proc


Introduction

Acute myeloid leukemia (AML) is a complex disease characterized by dysregulation of signal transduction pathways in hematopoietic progenitors that ultimately results in increased proliferation and survival of leukemic cells (1). AML is considered a disease of the genome, as many genetic alterations are required for onset. Genomic variants for AML are often described as either Type I mutations, which alter cell proliferation, or Type II mutations, which alter cell survival pathways (2). Pediatric AML is a rare disease, with only ~500 children diagnosed per year (stjude.org), and prognosis has improved over the decades (3). However, relapse is a major concern and accounts for more than half of the deaths in pediatric leukemia cases (1)(3). Common mutations associated with AML are found in several genes, including FLT3, NPM1, CEBPA, RAS, c-KIT, and WT1. Furthermore, co-occurring mutations, such as an internal tandem duplication (ITD) in the FLT3 gene accompanied by mutations in WT1, have been associated with poor outcome (4). The FLT3/ITD is an in-frame insertion in exon 14 or 15 that changes the amino acid sequence in the juxtamembrane domain, leading to ligand-independent FLT3 activation (5). In the clinical setting, the FLT3/ITD is detected through a PCR-based assay, and additional testing is required to analyze the sample for other potential genomic mutations. Recent advancements in DNA sequencing technology have improved our ability to detect numerous genetic alterations from a single genomic sample. These advancements can aid personalized medicine by revealing the genomic architecture of a specific patient. However, applying next-generation sequencing (NGS) in the medical field requires knowledgeable personnel, significant computer infrastructure, and algorithms designed for handling large datasets.
This study retrospectively analyzes FLT3/ITD-positive samples collected at diagnosis, remission, and relapse, with the goal of developing a bioinformatics pipeline capable of detecting the FLT3/ITD along with other genetic alterations, which collectively can aid in a better understanding of the biological processes dysregulated in the relapse state of pediatric AML. The pipeline is intended to provide an enhanced output that allows a clinician to better understand the pathways and biological processes affected by the detected genetic alterations. Starting with raw NGS reads, the pipeline was built for analyzing exon-captured Illumina data and combines publicly available algorithms with custom scripts to detect and prioritize genomic variants. Samples from six FLT3/ITD-positive pediatric AML patients, with varying FLT3/ITD allelic ratios, were analyzed using the developed methodologies. For each patient, the diagnosis and relapse samples were compared in detail, revealing several relapse-specific mutations. The pipeline detected different types of genetic alterations, such as large insertions and single nucleotide polymorphisms (SNPs), helping to establish NGS as a feasible methodology in the clinical setting. The pipeline is being designed with the flexibility to integrate other genomic detection algorithms in the future, such as copy number variation detection.

Methods

Illumina paired-end exon-sequencing data generated from bone marrow samples were received from the Children’s Oncology Group (COG). The quality of the sequence reads was examined using FastQC (Babraham Institute), and cutadapt (https://code.google.com/p/cutadapt/) was used to trim low-quality bases. The trimmed NGS reads were aligned to the human reference genome (hg19) using bwa-mem (version bwa-0.7.4) (6). Average depth of coverage per exon (vertical coverage) and average exon coverage (horizontal coverage) were calculated using a custom script and Ensembl annotation files. Following the best practices described by the Genome Analysis Tool Kit (GATK) developers (7), alignment files were processed using Picard Tools Version 1.67 (http://picard.sourceforge.net/). Mutect (Version 1.1.4) and Shimmer (Version 5.8.8) were executed for SNP detection, and to aid in the validation of the pipeline the results were compared to verified variants provided by the Children’s Oncology Group. Variant call files (VCF version 4.1) were annotated with SnpEff Version 3.3a using the GRCh37.75 annotation package recommended by SnpEff (4). Using the SnpEff-annotated transcript IDs, variants in the VCFs were mapped to UniProt Accession Numbers and Gene Ontology information using a custom script. Protein-protein interactions for the protein-coding genes were determined using the STRING API (9), a database of known and predicted protein interactions derived from genomic context, high-throughput experiments, co-expression, and previous knowledge. The Pindel algorithm was executed on the alignment files for detection of the FLT3/ITD (10). The output files were converted to VCF files using the pindel2vcf script provided with the Pindel package. Only insertions located in exon 14 or 15 of the FLT3 gene were analyzed as potential ITDs. All computational work was performed at the University of Delaware on the BioHen high performance computing cluster.
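
The two coverage measures mentioned above can be illustrated with a minimal sketch. It assumes that per-base read depths for each exon have already been extracted from the alignment files; the function name, the `min_depth` threshold, and the example depths are hypothetical and are not taken from the custom script used in the study.

```python
from typing import Dict, List, Tuple


def exon_coverage(depths: List[int], min_depth: int = 1) -> Tuple[float, float]:
    """Return (mean depth, fraction of bases with depth >= min_depth) for one exon.

    Mean depth corresponds to "vertical" coverage; the covered fraction
    corresponds to "horizontal" coverage.
    """
    if not depths:
        return 0.0, 0.0
    mean_depth = sum(depths) / len(depths)
    breadth = sum(1 for d in depths if d >= min_depth) / len(depths)
    return mean_depth, breadth


# Hypothetical per-base depths for two exons (not data from the study).
per_base: Dict[str, List[int]] = {
    "exon_A": [10, 12, 0, 8],
    "exon_B": [0, 0, 30, 30, 40],
}

summary = {name: exon_coverage(depths) for name, depths in per_base.items()}
for name, (mean_depth, breadth) in summary.items():
    print(f"{name}\tmean_depth={mean_depth:.1f}\tbreadth={breadth:.2f}")
```

Reporting both values per exon helps separate exons that are deeply but only partially covered from exons that are fully but shallowly covered.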

Results and Discussion

Samples from six FLT3/ITD-positive patients, with varying allelic ratios and cytogenetic markers, were analyzed with a custom pipeline (Figure 1). The pipeline consisted of publicly available algorithms, such as bwa and GATK, plus custom scripts. A key aspect of the established methodologies is the modularization of algorithms and scripts, which allows updated algorithms and databases to be integrated dynamically.

Figure 1. Bioinformatics workflow

Genomic Detection FLT3/ITD

Detecting large insertions, deletions, and tandem duplications from NGS data is a challenging task, with only a few high-quality algorithms publicly available. Spencer et al. (11) compared several algorithms for the detection of the FLT3/ITD and reported that Pindel (10), a pattern growth approach, successfully identified it. The Pindel algorithm was therefore incorporated into the pipeline to detect inserts in exon 14 or 15 of the FLT3 gene that were consistent with the clinical FLT3/ITD (Table 1). For the 6 patients analyzed, with 3 samples per patient, Pindel detected an insert in the diagnosis sample of 5 of the 6 patients (83%). Spencer et al. reported 100% detection of the FLT3/ITD in their samples using a targeted NGS approach (27 genes). The present study used whole-exome sequencing, so coverage in the region of interest was much lower, which may have reduced the ability to detect the ITD. Pindel also detected an insert in 4 of the relapse samples and 1 of the remission samples. A benefit of using Pindel is that it provides better resolution of the genomic abnormality by reporting the genomic position, sequence, and length of the insert, which are not all available from the PCR electrophoresis assay. Future work for this portion of the pipeline will include an allelic ratio calculation for the FLT3/ITD. This is a difficult task because the purity of the sequenced cell population is difficult to determine.

Table 1. Summary of Pindel Results

| ID | Sample | Position | Sequence | Length |
|---|---|---|---|---|
| Patient 1 | Diagnosis | None Detected | | |
| | Relapse | None Detected | | |
| | Remission | None Detected | | |
| Patient 2 | Diagnosis | 28,608,235 | TCTTGGAAACTCCCATTTGAGATCATATTCA | 31 |
| | Relapse | None Detected | | |
| | Remission | None Detected | | |
| Patient 3 | Diagnosis | 28,608,249 | ATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGCC | 40 |
| | Relapse | 28,608,265 | ATATTCTCTGAAATCTCCACGGGGG | 25 |
| | Remission | None Detected | | |
| Patient 4 | Diagnosis | 28,608,214 | CTTACCAAACTCTAAATTTTCTCTTGGAAACTCCC | 37 |
| | Relapse | 28,608,214 | ATCTTACCAAACTCTAAATTTTCTCTTGGAAACTCCCAT | 37 |
| | Remission | None Detected | | |
| Patient 5 | Diagnosis | 28,608,223 | CTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTA | 76 |
| | Relapse | 28,608,223 | CTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTA | 76 |
| | Remission | 28,608,223 | CTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTA | 76 |
| Patient 6 | Diagnosis | 28,608,243 | ACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCAC | 73 |
| | Relapse | 28,608,243 | ACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCAC | 73 |
| | Remission | None Detected | | |
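
The detection rates quoted in the text can be recomputed from Table 1. The following is a minimal sketch: the dictionary records only whether Pindel reported an insert for each sample, as transcribed from Table 1, and the function and variable names are illustrative rather than part of the pipeline.

```python
from collections import Counter
from typing import Dict

# True where Table 1 lists a position, False where it lists "None Detected".
table1: Dict[str, Dict[str, bool]] = {
    "Patient 1": {"Diagnosis": False, "Relapse": False, "Remission": False},
    "Patient 2": {"Diagnosis": True, "Relapse": False, "Remission": False},
    "Patient 3": {"Diagnosis": True, "Relapse": True, "Remission": False},
    "Patient 4": {"Diagnosis": True, "Relapse": True, "Remission": False},
    "Patient 5": {"Diagnosis": True, "Relapse": True, "Remission": True},
    "Patient 6": {"Diagnosis": True, "Relapse": True, "Remission": False},
}


def detection_counts(calls: Dict[str, Dict[str, bool]]) -> Counter:
    """Count, per sample type, how many patients had an insert detected."""
    counts: Counter = Counter()
    for samples in calls.values():
        for sample_type, detected in samples.items():
            counts[sample_type] += int(detected)
    return counts


counts = detection_counts(table1)
n_patients = len(table1)
for sample_type in ("Diagnosis", "Relapse", "Remission"):
    rate = 100 * counts[sample_type] / n_patients
    print(f"{sample_type}: {counts[sample_type]}/{n_patients} ({rate:.0f}%)")
```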

Genomic Detection Somatic SNPs and InDels

The pipeline is composed of three genomic variant detection algorithms, Pindel, Mutect, and Shimmer, that collectively report single nucleotide polymorphisms (SNPs), small insertions and deletions (InDels), and large InDels. When analyzing cancer samples, it is important to distinguish somatic SNPs from germline SNPs and to prioritize the somatic ones. To aid in validating the pipeline, the somatic SNPs detected were first compared to the list of verified variants provided by COG (Figure 2). Mutect detected 100% of the verified variants in eight of the twelve samples (diagnosis and relapse), with an average of 93% detection of verified variants. Shimmer detected 100% of the verified variants in five of the twelve samples, with an average of 78% detection of verified variants.

Figure 2. Summary somatic SNP detection
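
The comparison against verified variants amounts to computing, per sample, the fraction of verified variants recovered by a caller and then averaging across samples. A minimal sketch follows; the variant keys (chromosome, position, reference allele, alternate allele) and the example call sets are hypothetical and are not data from the study.

```python
from statistics import mean
from typing import Iterable, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def fraction_detected(called: Iterable[VariantKey], verified: Iterable[VariantKey]) -> float:
    """Fraction of verified variants that also appear in the caller's output."""
    verified_set = set(verified)
    if not verified_set:
        return float("nan")
    return len(verified_set & set(called)) / len(verified_set)


# Hypothetical verified variants for one sample.
verified = {
    ("13", 100, "A", "G"),
    ("2", 200, "C", "T"),
    ("7", 300, "G", "A"),
    ("1", 400, "T", "C"),
}
# Hypothetical outputs from two callers for the same sample.
caller_a = verified | {("5", 500, "A", "T")}  # recovers all four, plus one extra call
caller_b = {("13", 100, "A", "G"), ("2", 200, "C", "T"), ("7", 300, "G", "A")}

per_caller = {
    "caller_a": fraction_detected(caller_a, verified),
    "caller_b": fraction_detected(caller_b, verified),
}
# Averaging per-sample fractions gives the kind of summary reported in the text.
average_fraction = mean(per_caller.values())
print(per_caller, average_fraction)
```

Extra calls that are absent from the verified list do not lower this measure; they would need a separate assessment.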

The pipeline also detected high-quality somatic SNPs that were not reported to COG. Table 2 summarizes the number of somatic SNPs detected in each sample. For the majority of the patients, the relapse sample had more somatic mutations than the matched diagnosis sample. Three of the patients had an extremely high number of somatic mutations in their relapse sample and are undergoing further analysis to determine the potential driver of these mutations.
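
Table 2. Summary ranked somatic SNPs

| ID | Sample | Somatic SNPs | Ranked Variants |
|---|---|---|---|
| Patient 1 | diagnosis | 201 | 50 |
| | relapse | 16782 | 9706 |
| Patient 2 | diagnosis | 160 | 47 |
| | relapse | 7311 | 1617 |
| Patient 3 | diagnosis | 177 | 47 |
| | relapse | 111 | 42 |
| Patient 4 | diagnosis | 194 | 71 |
| | relapse | 25 | 0 |
| Patient 5 | diagnosis | 155 | 52 |
| | relapse | 200 | 79 |
| Patient 6 | diagnosis | 153 | 29 |
| | relapse | 7177 | 4306 |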

Genomic Variant Prioritization

A custom prioritization module was developed to rank the somatic variants located in protein-coding regions of the genome at the diagnosis state and the relapse state, using a method similar to that described by Hu et al. (12). Five major criteria were used for prioritizing the detected variants, including protein-protein interactions, Gene Ontology (GO) annotation, functional consequence, and variant quality. Using the 27 genes published by Spencer et al., a GO enrichment analysis was performed with BiNGO, a Cytoscape plug-in. These 27 genes were used because they are cited as genes with known genetic alterations associated with pediatric AML. GO terms with a significant p-value (<0.05) were extracted, and variants located in a gene annotated with one of the enriched GO terms were given a positive score. Protein-protein interactions were scored with a similar strategy: a positive score was given to variants located in a gene whose product interacts with a protein known to be associated with pediatric AML. The goal is to use characteristics of known pediatric AML cancer genes to identify new genes of interest.
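
The scoring idea described above can be sketched as an additive score over the listed criteria. This is an illustration only: the weights, the quality threshold, the consequence categories, and the example variants are all assumptions, and the actual scoring scheme of the prioritization module is not specified in the text. The known-gene set uses four of the AML-associated genes named in the Introduction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Variant:
    gene: str
    consequence: str               # e.g. "missense", "synonymous"
    quality: float                 # caller-reported quality (scale is an assumption)
    go_terms: Set[str] = field(default_factory=set)
    interactors: Set[str] = field(default_factory=set)


# Hypothetical weights for functional consequence.
CONSEQUENCE_SCORE: Dict[str, int] = {"nonsense": 2, "missense": 1, "synonymous": 0}


def score(variant: Variant, enriched_go: Set[str], known_genes: Set[str],
          min_quality: float = 30.0) -> int:
    """Additive score: one point per satisfied criterion plus a consequence weight."""
    total = 0
    if variant.go_terms & enriched_go:       # gene carries an enriched GO term
        total += 1
    if variant.interactors & known_genes:    # product interacts with a known AML protein
        total += 1
    total += CONSEQUENCE_SCORE.get(variant.consequence, 0)
    if variant.quality >= min_quality:       # passes the (assumed) quality threshold
        total += 1
    return total


enriched_go = {"GO:example_enriched"}          # placeholder term, not a real GO ID
known_genes = {"FLT3", "NPM1", "CEBPA", "WT1"}

# Hypothetical variants in placeholder genes.
variants: List[Variant] = [
    Variant("GENE_A", "missense", 50.0, {"GO:example_enriched"}, {"FLT3"}),
    Variant("GENE_B", "synonymous", 50.0),
    Variant("GENE_C", "missense", 10.0, set(), {"NPM1"}),
]

ranked = sorted(variants, key=lambda v: score(v, enriched_go, known_genes), reverse=True)
for v in ranked:
    print(v.gene, score(v, enriched_go, known_genes))
```

An additive scheme keeps each criterion independently inspectable, which makes it straightforward to add or reweight criteria as new annotation sources are integrated.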

Cytoscape Knowledge Maps

A comparison between diagnosis and relapse samples was performed to better understand shifts in the biological processes influenced by genetic alterations between the two time points. A special feature of the pipeline is the automatic generation of a knowledge map, consisting of protein-protein interactions and GO term associations for the genes of interest, that can be easily displayed in Cytoscape. The Cytoscape map displays genes with a highly ranked mutation as red nodes connected to their GO terms (green nodes) and to other interacting proteins or proteins with a genetic alteration that does not change the amino acid sequence (blue nodes). Figure 2 shows an example map generated for a patient’s relapse state, highlighting mutations specific to the relapse state, except for the FLT3/ITD. The pipeline detected and prioritized variants in PAK2, PRAMEF1, PRAMEF13, and RHPN2. Two variants were detected in PAK2 (rs76714248, MAF 0.019 and rs67093638) that are predicted to alter the amino acid sequence of the translated protein. PAK2 is a protein kinase involved in several signaling pathways, such as apoptosis and proliferation. Two other genes, PRAMEF1 and PRAMEF13, which mapped to the GO term negative regulation of apoptosis, were also prioritized for this sample. These genes are also annotated with negative regulation of cell differentiation, negative regulation of retinoic acid receptor signaling, positive regulation of cell proliferation, and negative regulation of transcription. The goal of this type of functional output is to help researchers and clinicians form hypotheses regarding the changes from the diagnosis state to the relapse state. For example, this patient gained several mutations in genes involved in apoptosis, cellular differentiation, and retinoic acid signaling that may alter their susceptibility to treatment. Cancer is a disease of the genome, making it necessary to analyze multiple types of genetic alterations at once.
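
As a rough illustration of this step, the sketch below selects genes mutated only at relapse and writes their GO and interaction edges in Cytoscape's simple interaction format (SIF), one tab-separated `source relation target` triple per line. The gene lists, the edge labels `annotated_with` and `interacts_with`, and the node `PARTNER_X` are hypothetical; only the PRAMEF1 annotation is taken from the text. Node coloring would be handled separately in Cytoscape and is not shown.

```python
import io
from typing import Dict, Iterable, List, Set, TextIO, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def relapse_specific_genes(diagnosis: Iterable[str], relapse: Iterable[str]) -> Set[str]:
    """Genes mutated at relapse but not at diagnosis."""
    return set(relapse) - set(diagnosis)


def build_edges(genes: Iterable[str],
                go_annotations: Dict[str, Set[str]],
                interactions: Dict[str, Set[str]]) -> List[Edge]:
    """Connect each gene to its GO terms and its interaction partners."""
    edges: List[Edge] = []
    for gene in sorted(genes):
        for term in sorted(go_annotations.get(gene, ())):
            edges.append((gene, "annotated_with", term))
        for partner in sorted(interactions.get(gene, ())):
            edges.append((gene, "interacts_with", partner))
    return edges


def write_sif(edges: Iterable[Edge], handle: TextIO) -> None:
    """Write edges as tab-delimited SIF lines."""
    for source, relation, target in edges:
        handle.write(f"{source}\t{relation}\t{target}\n")


# Hypothetical gene-level input for one patient.
diagnosis_genes = {"FLT3"}
relapse_genes = {"FLT3", "PAK2", "PRAMEF1"}
go_annotations = {"PRAMEF1": {"negative_regulation_of_apoptosis"}}
interactions = {"PAK2": {"PARTNER_X"}}  # placeholder partner, not a real protein

specific = relapse_specific_genes(diagnosis_genes, relapse_genes)
buffer = io.StringIO()
write_sif(build_edges(specific, go_annotations, interactions), buffer)
sif_text = buffer.getvalue()
print(sif_text, end="")
```

Writing to a file handle instead of the in-memory buffer would produce a `.sif` file that can be imported into Cytoscape as a network.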
The application of next-generation sequencing has the potential to aid in the diagnosis and treatment of cancer as sequencing costs decline and the volume of data increases. A primary limiting factor for clinical applications of genomic NGS is downstream bioinformatics analysis. This paper highlights core algorithms required for analyzing clinical NGS samples and reports new algorithms under development for the prioritization and visualization of somatic mutations detected in clinical NGS samples. Currently, the pipeline is available for in-house use only, but in the future it will be made publicly available. Furthermore, as additional samples are analyzed, the pipeline will be broadened to rank and distinguish between driver mutations and clinically actionable mutations.

References

A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
Authors: Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal: Fly (Austin); Date: 2012 Apr-Jun

Computational prediction of cancer-gene function.
Authors: Pingzhao Hu; Gary Bader; Dennis A Wigle; Andrew Emili
Journal: Nat Rev Cancer; Date: 2006-12-14

Prognostic factors and risk-based therapy in pediatric acute myeloid leukemia.
Authors: Soheil Meshinchi; Robert J Arceci
Journal: Oncologist; Date: 2007-03

Children's Oncology Group's 2013 blueprint for research: acute myeloid leukemia.
Authors: Alan S Gamis; Todd A Alonzo; John P Perentesis; Soheil Meshinchi
Journal: Pediatr Blood Cancer; Date: 2012-12-19

Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.
Authors: Kai Ye; Marcel H Schulz; Quan Long; Rolf Apweiler; Zemin Ning
Journal: Bioinformatics; Date: 2009-06-26

Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data.
Authors: David H Spencer; Haley J Abel; Christina M Lockwood; Jacqueline E Payton; Philippe Szankasi; Todd W Kelley; Shashikant Kulkarni; John D Pfeifer; Eric J Duncavage
Journal: J Mol Diagn; Date: 2012-11-14

Structural and functional alterations of FLT3 in acute myeloid leukemia.
Authors: Soheil Meshinchi; Frederick R Appelbaum
Journal: Clin Cancer Res; Date: 2009-06-23

Wilms' tumour 1 mutations are associated with FLT3-ITD and failure of standard induction chemotherapy in patients with normal karyotype AML.
Authors: K Summers; J Stevens; I Kakkas; M Smith; L L Smith; F Macdougall; J Cavenagh; D Bonnet; B D Young; T A Lister; J Fitzgibbon
Journal: Leukemia; Date: 2007-01-04

A framework for variation discovery and genotyping using next-generation DNA sequencing data.
Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet; Date: 2011-04-10

STRING 8--a global view on proteins and their functional interactions in 630 organisms.
Authors: Lars J Jensen; Michael Kuhn; Manuel Stark; Samuel Chaffron; Chris Creevey; Jean Muller; Tobias Doerks; Philippe Julien; Alexander Roth; Milan Simonovic; Peer Bork; Christian von Mering
Journal: Nucleic Acids Res; Date: 2008-10-21