AutoCSA, an algorithm for high throughput DNA sequence variant detection in cancer genomes.

E Dicks, J W Teague, P Stephens, K Raine, A Yates, C Mattocks, P Tarpey, A Butler, A Menzies, D Richardson, A Jenkinson, H Davies, S Edkins, S Forbes, K Gray, C Greenman, R Shepherd, M R Stratton, P A Futreal, R Wooster.

Abstract

The undertaking of large-scale DNA sequencing screens for somatic variants in human cancers requires accurate and rapid processing of traces for variants. Because primary cancers are often aneuploid and admixed with normal tissue, the heterozygous variants found in them are often subtle and difficult to detect. To address these issues, we have developed a mutation detection algorithm, AutoCSA, specifically optimized for the high throughput screening of cancer samples. AVAILABILITY: http://www.sanger.ac.uk/genetics/CGP/Software/AutoCSA.

Substances: DNA, Neoplasm

Year: 2007 PMID: 17485433 PMCID: PMC5947781 DOI: 10.1093/bioinformatics/btm152

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Introduction

Cancers arise from the accumulation of mutations in critical target genes, conferring a growth or survival advantage on a clone of cells that eventually manifests as clinical disease. Whilst a proportion of these mutations can be inherited in the germline, giving rise to cancer susceptibility syndromes, the majority are accumulated somatically. There has been considerable effort to identify the variants, and hence the genes, that cause cancer. Indeed, since the completion of the Human Genome Project it has become possible to systematically screen megabases of sequence for these somatic variants.

A number of software programs and protocols have been developed to identify sequence variants with high sensitivity: PolyPhred (Nickerson ) has been available for some time, while comparative sequence analysis (CSA) (Mattocks ), Mutation Surveyor (SoftGenetics), novoSNP (Weckx ), InSNP (Manaster ) and SNPdetector (Zhang ) are more recent developments. In addition, PolyPhred has been enhanced to detect SNPs in PCR-amplified diploid samples (Stephens ).

We have extended some of the concepts of the CSA variant detection protocol developed by Mattocks into a fully functional computer application, AutoCSA. CSA was initially developed to simplify and aid the detection of variants in DNA sequence traces. Briefly, CSA involves comparing raw trace profiles from each of the four channels (bases) between the sample under investigation and a reference sample by overlaying the traces using ABI Genescan software. Each channel is then manually inspected for a reduced peak height in the sample trace relative to the reference, and for the presence of a novel peak indicating a possible variant. This key concept has been used in the development of AutoCSA which, unlike CSA, is capable of automatically analysing large numbers of sequence traces with minimal intervention.
In particular, AutoCSA has also been optimized to detect heterozygous substitutions present at less than 50% of wildtype signal that are frequently present in PCR-amplified templates from primary tumour samples. The software has been further developed to efficiently detect other classes of variants, notably small homozygous and heterozygous insertions and deletions.

Algorithm and Results

AutoCSA is split into three main components: pre-processing of the trace file, variant detection and a post-processing stage to remove false positives. Detailed information on the algorithm can be found on our website (http://www.sanger.ac.uk/genetics/CGP/Software/AutoCSA/detailed_algorithm.shtml). A training set of 161 somatic variants, composed of 96 substitutions (84 heterozygous, 12 homozygous), 36 heterozygous insertions/deletions and 29 homozygous insertions/deletions, was used to optimize the software.

Pre-processing

One of the important concepts of AutoCSA is that it uses raw data channels from the sequence trace file, which contains the absolute peak heights generated by the sequencing reaction. These data are likely to be more quantitative than the processed data generated by the software onboard the ABI sequencer, which equalizes peak heights across the trace (Mattocks ). However, the raw data require some manipulation to render them suitable for analysis with AutoCSA: a pre-processing step produces a trace with approximately uniform base spacing (mobility correction) and uniform baseline intensity (baselining). The pre-processing stage also identifies the position (scan index) and height (intensity) of the peaks in each of the four channels (bases) of the trace file. AutoCSA uses the known amplimer DNA sequence to identify the correct consecutive peaks in the sequence trace. A quality value is assigned to each base with a matched peak, defined as the ratio of signal (matched peak) to noise (unmatched peak) intensities.
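The pre-processing ideas above (baselining, peak identification and a signal-to-noise quality value) can be illustrated with a minimal Python sketch. It is not the AutoCSA implementation: the rolling-minimum baseline, the local-maximum peak picker, the choice of the largest unmatched peak as "noise", and all function names and parameters (`baseline_correct`, `find_peaks`, `quality_value`, `window`, `floor`) are assumptions made for illustration. Mobility correction and amplimer matching are omitted.

```python
def baseline_correct(channel, window=5):
    """Subtract a rolling-minimum baseline from one raw channel.

    Assumption: the baseline at each scan is the minimum intensity
    within +/- `window` scans. AutoCSA's actual method may differ.
    """
    n = len(channel)
    corrected = []
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        corrected.append(channel[i] - min(channel[lo:hi]))
    return corrected


def find_peaks(channel, min_height=0.0):
    """Return (scan_index, intensity) for each local maximum above min_height."""
    peaks = []
    for i in range(1, len(channel) - 1):
        rising = channel[i] > channel[i - 1]
        not_rising_after = channel[i] >= channel[i + 1]
        if rising and not_rising_after and channel[i] > min_height:
            peaks.append((i, channel[i]))
    return peaks


def quality_value(matched_intensity, unmatched_intensities, floor=1.0):
    """Ratio of the matched (signal) peak to the largest unmatched (noise) peak.

    `floor` is a hypothetical guard against division by zero when no
    unmatched peak is present under the base.
    """
    noise = max(unmatched_intensities, default=0.0)
    return matched_intensity / max(noise, floor)
```

For example, a channel with two peaks riding on a constant offset of 10 yields peaks at scan indices 2 and 6 once the offset has been removed, and a matched peak of height 20 over unmatched peaks of 2 and 4 receives a quality value of 5.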

Heterozygous substitutions

The primary discriminator for indicating the presence of a heterozygous substitution is a peak height drop of ≥20% between the trace under investigation and a reference trace. In addition, the algorithm requires the presence of an additional mutant peak, which must satisfy a local peak height ratio test that compares its intensity with those of adjacent bases. Using these parameters, AutoCSA detected 81/84 (96.4%) heterozygous substitutions in the training set. Three substitutions were missed because of poor local trace quality.

Homozygous substitutions are identified by the absence of the wildtype base during the amplimer matching procedure. Each missing position is interrogated for the presence of a viable novel peak. Using these criteria, AutoCSA detected 12/12 (100%) homozygous somatic substitutions in the training set.
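The two heterozygous-substitution criteria can be sketched as follows. Only the 20% drop threshold comes from the text. The form of the local ratio test (mutant peak height divided by the mean height of neighbouring peaks), its 0.25 threshold, and the names `peak_drop` and `is_het_substitution` are assumptions. The sketch also assumes that reference and sample peak heights have already been scaled to be comparable.

```python
def peak_drop(reference_height, sample_height):
    """Fractional drop of the sample wildtype peak relative to the reference."""
    if reference_height <= 0:
        raise ValueError("reference peak height must be positive")
    return (reference_height - sample_height) / reference_height


def is_het_substitution(reference_height, sample_height, mutant_height,
                        neighbour_heights, min_drop=0.20, min_local_ratio=0.25):
    """Apply both criteria: wildtype peak drop and a local mutant-peak ratio.

    min_drop=0.20 follows the text; min_local_ratio is a hypothetical value.
    neighbour_heights holds the peak heights of adjacent bases in the sample.
    """
    drop_ok = peak_drop(reference_height, sample_height) >= min_drop
    local_mean = sum(neighbour_heights) / len(neighbour_heights)
    ratio_ok = local_mean > 0 and mutant_height / local_mean >= min_local_ratio
    return drop_ok and ratio_ok
```

Under these assumptions, a wildtype peak that falls from 1000 to 700 with a novel peak of 300 is called, whereas the same novel peak with only a 10% wildtype drop, or a 30% drop with a novel peak of 50, is not.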

Homozygous insertions and deletions

Homozygous insertions are identified by interrogating the base spacing between neighbouring nucleotides. For the trace under investigation, a scan index gap is calculated between neighbouring bases that have been aligned to the amplimer sequence. A homozygous insertion produces a larger than expected scan gap. Homozygous deletions can be determined by failure to identify the expected peaks during the amplimer matching procedure. Using these criteria, AutoCSA detected 28/29 (97%) homozygous insertions/deletions in the training set.
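A minimal sketch of the scan-gap test for homozygous insertions is given below. It assumes that the expected spacing can be estimated as the median gap between matched bases and that a gap is suspicious when it exceeds 1.5 times that value. Both choices, and the name `find_insertion_gaps`, are illustrative and not taken from AutoCSA.

```python
from statistics import median


def find_insertion_gaps(scan_indices, tolerance=1.5):
    """Return (index_of_left_base, gap) for every unusually wide scan gap.

    scan_indices: scan positions of consecutive bases matched to the amplimer.
    tolerance: hypothetical multiple of the median gap above which a gap is
    reported as a candidate homozygous insertion.
    """
    gaps = [b - a for a, b in zip(scan_indices, scan_indices[1:])]
    if not gaps:
        return []
    expected = median(gaps)
    return [(i, gap) for i, gap in enumerate(gaps) if gap > tolerance * expected]
```

With matched peaks at scans 100, 112, 124, 136, 172, 184 and 196, the median gap is 12 and the 36-scan gap after the fourth base is reported, consistent with two extra bases lying between the matched peaks.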

Heterozygous insertions and deletions

To detect heterozygous insertions/deletions, AutoCSA first identifies an abrupt drop, or step, in the quality of the sequence trace. The second criterion is a critical concentration of individual, closely spaced heterozygous substitutions from the start of the reduced quality step to the end of the trace. Using these criteria, AutoCSA detected 36/36 (100%) heterozygous insertions/deletions in the training set.
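The two criteria can be combined in a short sketch. The windowed comparison of mean quality before and after each position, the 0.5 ratio used to define a step, the 0.5 density of heterozygous calls required downstream, and the names `find_quality_step` and `is_het_indel` are all assumptions chosen for illustration. The text does not give the thresholds AutoCSA uses.

```python
def find_quality_step(qualities, window=5, min_drop_ratio=0.5):
    """Return the index where per-base quality drops most sharply, or None.

    A step is accepted only if the mean quality of the `window` bases from
    the index onward is at most min_drop_ratio times the mean of the
    `window` bases before it (hypothetical rule).
    """
    best_index, best_drop = None, 0.0
    for i in range(window, len(qualities) - window + 1):
        before = sum(qualities[i - window:i]) / window
        after = sum(qualities[i:i + window]) / window
        drop = before - after
        if after <= min_drop_ratio * before and drop > best_drop:
            best_index, best_drop = i, drop
    return best_index


def is_het_indel(qualities, het_sub_positions, window=5,
                 min_drop_ratio=0.5, min_density=0.5):
    """Require a quality step followed by densely spaced heterozygous calls."""
    step = find_quality_step(qualities, window, min_drop_ratio)
    if step is None:
        return False
    tail_length = len(qualities) - step
    calls_in_tail = sum(1 for p in het_sub_positions if p >= step)
    return calls_in_tail / tail_length >= min_density
```

The rationale is that a heterozygous insertion or deletion shifts one allele out of phase with the other, so every base downstream of the event looks like a mixture: quality falls abruptly and apparent heterozygous substitutions accumulate to the end of the trace.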

Post-processing (variant flagging) and visualization

AutoCSA reduces the number of false calls displayed to users by applying a series of novel filters that examine the global and local quality of a trace and the concentration of variants. Variants that pass the filters are ‘flagged’ for manual review; those that fail are automatically rejected by the system. If bi-directional sequencing is used, AutoCSA can apply a further set of rules that uses information from both strands to reduce false positive calls further. This second set of rules examines the corresponding base on the opposite strand to determine whether an equivalent variant has been called, and assesses the noise level under that base to help rule out noise and sequencing artefacts. AutoCSA generates a series of web pages summarizing the resulting variants, with images of each variant and the associated protein annotation (Fig. 1).
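One plausible reading of the bi-directional rules is sketched below. The text does not state the decision logic, so the three outcomes ("confirmed", "rejected", "unresolved"), the noise threshold, and the names `Call` and `confirm_bidirectional` are hypothetical. The sketch assumes that calls from the reverse read have already been converted to forward-strand coordinates and bases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    position: int   # amplimer coordinate, forward-strand orientation
    ref_base: str
    alt_base: str


def confirm_bidirectional(forward_calls, reverse_calls, reverse_noise,
                          max_noise=0.1):
    """Classify each forward-strand call using the opposite strand.

    reverse_noise maps position -> noise level under that base on the
    reverse read (hypothetical scale; a missing entry means no coverage).
    """
    reverse_set = set(reverse_calls)
    result = {}
    for call in forward_calls:
        noise = reverse_noise.get(call.position)
        if call in reverse_set:
            result[call] = "confirmed"    # equivalent variant on both strands
        elif noise is not None and noise <= max_noise:
            result[call] = "rejected"     # clean reverse read shows no variant
        else:
            result[call] = "unresolved"   # reverse read too noisy or absent
    return result
```

In this reading, a call is rejected only when the opposite strand is clean enough to contradict it. A noisy or missing reverse read leaves the call for manual review.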

Fig. 1. AutoCSA displays. (A) Lists all sequences screened with a summary of variants found. (B) Lists traces screened with coverage information and the number of variants on each trace. (C) Main substitution display: four traces are displayed with 20 bases either side of the potential substitution. The top two traces are those used to call the variant (reference first and trace under investigation second). The third and fourth traces are the reverse sequenced traces. The DNA and protein annotation of the variant are displayed above and to the right of the traces.

Testing of AutoCSA

To evaluate the performance of AutoCSA, it was compared with Mutation Surveyor version 2.0 in an analysis of 43 Mb of DNA. These data were obtained by resequencing the 518 protein kinase genes in the human genome in a series of 30 primary colorectal tumours and one colorectal cell line. A total of 105 somatic substitutions and 22 somatic heterozygous insertion/deletion mutations were identified in this set using a combination of AutoCSA and Mutation Surveyor. AutoCSA on its own identified 97 (92%) substitutions and 22 (100%) heterozygous insertions/deletions, whereas Mutation Surveyor on its own identified 82 (78%) substitutions and 4 (18%) heterozygous insertions/deletions. AutoCSA generated 0.21 false positives per sequence trace, compared with 0.52 for Mutation Surveyor.
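The percentages quoted above follow directly from the counts, taking the combined call set (105 substitutions and 22 heterozygous insertions/deletions) as the denominator. The short check below reproduces them. The variable and function names are illustrative, and the false-positive ratio is derived here for convenience and is not stated in the text.

```python
def percent(detected, total):
    """Detection rate as a percentage of the combined call set."""
    return 100.0 * detected / total


TOTAL_SUBSTITUTIONS = 105
TOTAL_HET_INDELS = 22

rates = {
    ("AutoCSA", "substitutions"): percent(97, TOTAL_SUBSTITUTIONS),
    ("AutoCSA", "het_indels"): percent(22, TOTAL_HET_INDELS),
    ("Mutation Surveyor", "substitutions"): percent(82, TOTAL_SUBSTITUTIONS),
    ("Mutation Surveyor", "het_indels"): percent(4, TOTAL_HET_INDELS),
}

# False positives per trace: Mutation Surveyor relative to AutoCSA.
false_positive_ratio = 0.52 / 0.21

for (tool, variant_class), rate in rates.items():
    print(f"{tool:18s} {variant_class:14s} {rate:5.1f}%")
print(f"false-positive ratio: {false_positive_ratio:.2f}")
```

On these figures, Mutation Surveyor produced roughly 2.5 times as many false positives per trace as AutoCSA.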

Summary

In conclusion, we have developed a variant detection system which has been optimized to detect the often subtle heterozygous variants which are common in primary cancer samples. The software has been developed so it can automatically run over large numbers of trace files with minimal human intervention and can therefore be easily integrated into high throughput resequencing projects.

References

1. novoSNP, a novel computational tool for sequence variation discovery.

Authors: Stefan Weckx; Jurgen Del-Favero; Rosa Rademakers; Lieve Claes; Marc Cruts; Peter De Jonghe; Christine Van Broeckhoven; Peter De Rijk
Journal: Genome Res Date: 2005-03 Impact factor: 9.043

2. InSNP: a tool for automated detection and visualization of SNPs and InDels.

Authors: Carl Manaster; Weiyue Zheng; Markus Teuber; Stefan Wächter; Frank Döring; Stefan Schreiber; Jochen Hampe
Journal: Hum Mutat Date: 2005-07 Impact factor: 4.878

3. Automating sequence-based detection and genotyping of SNPs from diploid samples.

Authors: Matthew Stephens; James S Sloan; P D Robertson; Paul Scheet; Deborah A Nickerson
Journal: Nat Genet Date: 2006-02-19 Impact factor: 38.330

4. PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing.

Authors: D A Nickerson; V O Tobe; S L Taylor
Journal: Nucleic Acids Res Date: 1997-07-15 Impact factor: 16.971

5. Comparative sequence analysis (CSA): a new sequence-based method for the identification and characterization of mutations in DNA.

Authors: C Mattocks; P Tarpey; M Bobrow; J Whittaker
Journal: Hum Mutat Date: 2000-11 Impact factor: 4.878

6. SNPdetector: a software tool for sensitive and accurate SNP detection.

Authors: Jinghui Zhang; David A Wheeler; Imtiaz Yakub; Sharon Wei; Raman Sood; William Rowe; Paul P Liu; Richard A Gibbs; Kenneth H Buetow
Journal: PLoS Comput Biol Date: 2005-10-28 Impact factor: 4.475
