Joachim M De Schrijver, Kim De Leeneer, Steve Lefever, Nick Sabbe, Filip Pattyn, Filip Van Nieuwerburgh, Paul Coucke, Dieter Deforce, Jo Vandesompele, Sofie Bekaert, Jan Hellemans, Wim Van Criekinge.
Abstract
BACKGROUND: Next-generation amplicon sequencing enables high-throughput genetic diagnostics, sequencing multiple genes in several patients together in one sequencing run. Currently, no open-source out-of-the-box software solution exists that reliably reports detected genetic variations and that can be used to improve future sequencing effectiveness by analyzing the PCR reactions.
Year: 2010 PMID: 20487544 PMCID: PMC2880033 DOI: 10.1186/1471-2105-11-269
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the VIP pipeline. Overview of the Variant Identification Pipeline (black arrows and white text boxes) and the VIP Validator (grey arrows and grey text boxes). The analysis pipeline consists of 4 modules. 1) Raw sequences are extracted from the FASTA files generated by the GS-FLX sequencer and processed into sequenced amplicons and additional information. 2a) Reference amplicons are generated using a list of reference sequences and the list of primers. 2b) Mapping is carried out with BLAT using the reference amplicons and the sequenced amplicons. Variations are detected and stored in the database. 3) The requested reports are generated. The VIP Validator introduces additional variation in the sequence reads and reanalyses those sequences to validate the pipeline for that specific variation.
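Step 2a can be illustrated with a minimal sketch, assuming exact primer matches and a reverse primer given 5'→3' on the opposite strand. The function names `reverse_complement` and `reference_amplicon` are hypothetical and are not taken from VIP; the actual pipeline may locate primers differently.

```python
from typing import Optional


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def reference_amplicon(reference: str,
                       forward_primer: str,
                       reverse_primer: str) -> Optional[str]:
    """Cut the region bounded by a primer pair out of a reference sequence.

    Assumption: both primers match the reference exactly. Returns None
    when either primer cannot be found.
    """
    start = reference.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    rc = reverse_complement(reverse_primer)
    end = reference.find(rc, start + len(forward_primer))
    if end == -1:
        return None
    return reference[start:end + len(rc)]
```

The returned string includes both primer sequences, so it can serve directly as a mapping target for reads that still carry their primers.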
Figure 2. Influence of quality score (Q) on homopolymer accuracy. Distribution of homopolymer-related quality scores (Q scores). The normal homopolymer Q score distribution is the distribution of the Q score of the homopolymer base; the mismatch homopolymer Q score distribution is the distribution of the Q score of the base preceding a homopolymer-related deletion or of the Q score of a homopolymer-related inserted base. Distributions are shown for homopolymers of length 3, 5, 6 and 7 bp. The grey vertical lines are drawn at Q scores of 15 and 30. Distributions are based on data from the two BRCA runs.
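To give a feel for the two marked thresholds, the sketch below converts a Q score into an error probability. It assumes the Q scores follow the standard Phred convention, Q = -10 log10(P); the caption does not state this, and the function name is hypothetical.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality score to a base-call error probability.

    Assumption: Q = -10 * log10(P), the standard Phred definition.
    """
    return 10 ** (-q / 10)


for q in (15, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
```

Under that assumption, Q 15 corresponds to roughly a 3% chance of a wrong base call and Q 30 to 0.1%.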
Figure 3. Variant data before and after filtering. Plot of the coverage (the number of times a single sequence is read by the sequencing equipment) against the frequency of an observed variation. In reality, genomic variation occurs at a frequency of either 50% (heterozygous) or 100% (homozygous) of total reads. The top panel shows the distribution when no filters are applied to discriminate between sequencing errors and real variants; the bottom panel shows the distribution after several filters (frequency filter, Q score filter, coverage filter and homopolymer filter) are applied to separate the real variations from the sequencing errors. The vertical grey line indicates 40× coverage; the horizontal grey lines indicate 33%, 67% and 95% variation frequency, respectively.
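The four filters named in the caption could be combined along the following lines. This is only a sketch: the `VariantCall` record and `passes_filters` function are hypothetical, and the default thresholds are borrowed from the grey guide lines in Figures 2 and 3 (40× coverage, 33% frequency, Q 30), which are not necessarily the cut-offs VIP applies. The homopolymer filter is reduced to a simple flag.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    coverage: int          # reads covering the position
    variant_reads: int     # reads supporting the variant
    q_score: float         # quality score of the variant base
    in_homopolymer: bool   # variant is a homopolymer-related indel

    @property
    def frequency(self) -> float:
        """Fraction of covering reads that carry the variant."""
        return self.variant_reads / self.coverage if self.coverage else 0.0


def passes_filters(call: VariantCall,
                   min_coverage: int = 40,       # assumed threshold
                   min_frequency: float = 0.33,  # assumed threshold
                   min_q: float = 30.0) -> bool:  # assumed threshold
    """Return True if the call survives all four (illustrative) filters."""
    if call.coverage < min_coverage:       # coverage filter
        return False
    if call.frequency < min_frequency:     # frequency filter
        return False
    if call.q_score < min_q:               # Q score filter
        return False
    if call.in_homopolymer:                # homopolymer filter (simplified)
        return False
    return True
```

A call with 120× coverage, 58 supporting reads and Q 34 outside a homopolymer would pass, whereas the same call at 25× coverage would be rejected by the coverage filter.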
Comparison of AVA and VIP performance (1 sample, 67 amplicons, 12 known variants)
| | AVA software | VIP pipeline |
|---|---|---|
| Total variants (unfiltered) | 235 | 50 |
| 'Pass filter' variants | 33 | 14 |
| True variants called | 6/12 (50%) | 12/12 (100%) |
| False positives | 27/33 (81.8%) | 2/14 (14.3%) |
| False negatives | 6/12 (50%) | 0/12 (0%) |
Figure 4. Improving future sequencing efficiency using prior sequencing data. Example of the reporting possibilities. Run 1 had many unmappable sequences and short mapped sequences. The length distribution showed that these were mainly 60-120 bp sequences. In Run 2, optimized PCR reactions and an additional length separation were carried out prior to sequencing, which greatly reduced the share of unmapped and short sequences (8% vs. 24%) and thus improved cost-effectiveness.
Figure 5. The VIP Validator. Detection ratios of 1000 known BRCA1/2 variants, random SNVs, random 3 bp and 10 bp deletions, and random 3 bp and 10 bp insertions. The grey horizontal lines indicate a 67% detection ratio. The grey vertical lines indicate a 99% cumulative frequency.
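The idea behind the Validator, introducing a known change into reads and checking whether it is found again, can be sketched as follows. All function names are hypothetical, and `detector` is a caller-supplied stand-in for a full reanalysis by the pipeline. Reading the detection ratio as the fraction of mutated reads in which the change is recovered is an assumption, not a definition taken from the source.

```python
import random
from typing import Callable, Sequence

BASES = "ACGT"


def introduce_snv(seq: str, pos: int, rng: random.Random) -> str:
    """Replace the base at `pos` with a different, randomly chosen base."""
    alt = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + alt + seq[pos + 1:]


def introduce_deletion(seq: str, pos: int, length: int) -> str:
    """Delete `length` bases starting at `pos` (e.g. 3 or 10 bp)."""
    return seq[:pos] + seq[pos + length:]


def introduce_insertion(seq: str, pos: int, length: int,
                        rng: random.Random) -> str:
    """Insert `length` random bases before position `pos`."""
    insert = "".join(rng.choice(BASES) for _ in range(length))
    return seq[:pos] + insert + seq[pos:]


def detection_ratio(reads: Sequence[str],
                    mutate: Callable[[str], str],
                    detector: Callable[[str, str], bool]) -> float:
    """Fraction of reads whose introduced change `detector` reports.

    `detector(original, mutated)` must return True when the change is found.
    """
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if detector(read, mutate(read)))
    return hits / len(reads)
```

Passing a seeded `random.Random` instance keeps the introduced variants reproducible between validation runs.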