| Literature DB >> 23349847 |
Lőrinc S Pongor1, Ferenc Pintér, István Peták.
Abstract
Next generation sequencing (NGS) of PCR amplicons is a standard approach to detect genetic variations in personalized medicine such as cancer diagnostics. Computer programs used in the NGS community often miss insertions and deletions (indels) that constitute a large part of known human mutations. We have developed HeurAA, an open source, heuristic amplicon aligner program. We tested the program on simulated datasets as well as experimental data from multiplex sequencing of 40 amplicons in 12 oncogenes collected on a 454 Genome Sequencer from lung cancer cell lines. We found that HeurAA can accurately detect all indels, and is more than an order of magnitude faster than previous programs. HeurAA can compare reads and reference sequences up to several thousand base pairs in length, and it can evaluate data from complex mixtures containing reads of different gene-segments from different samples. HeurAA is written in C and Perl for Linux operating systems, the code and the documentation are available for research applications at http://sourceforge.net/projects/heuraa/Entities:
Mesh:
Year: 2013 PMID: 23349847 PMCID: PMC3548894 DOI: 10.1371/journal.pone.0054294
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1Mapping replacements, deletions and insertions with words tiling the reference sequence.
Words of length w are positioned at delta bps from each other. The non-matching words (red) allow for the identification of the polymorphic region, i.e. the segment in which the mutation is found.
Figure 2Analysis of the polymorphic region (schematic flowchart).
Each polymorphic region within a read is analyzed according to this general procedure. Dynamic programming refers to the use of the Needleman Wunsch or the Smith Waterman algorithm Mutation calling refers to converting the results of the previous steps into standard mutation description format (http://www.hgvs.org/mutnomen/recs-DNA.html) [20].
Typical run times for non-repetitive reads*.
| Operation | Time | |
| 1 | Index construction | 1∼ msecs |
| 2 | Sub-word-identification (strstr calls): | 6.2 sec |
| 3 or 4 | Identification of one simple mutations, | 1.7 sec |
| Identification of one complex mutation(Needleman-Wunsch) | 16.3 sec |
Calculated for 1,000,000 copies of a 225 bp segment taken from exon 20 of the human EGFR gene (see Exon 20 reference sequence in Supplementary Materials S1). Simple mutations are randomly placed single substitutions or deletions of arbitrary size. Complex mutations were generated by placing two random substitutions plus a randomly placed 4 bp deletion within a window of 15 nucleotides.
Figure 3Dependance of runtime on various program parameters.
A) Dependence of the running time on the value of w word size. B) Dependence of the running time on the value of delta tiling shift. Run times given for 500,000 copies of the Exon_20 reference sequence (see Section 4.1 as well as Supplementary Materials S1). Error bars indicate +/−3 standard deviations calculated from 10 repetitions. w = 15 and delta = 5 were used unless otherwise indicated. We note that irrespective of the choice of parameters shown in the figure, HeurAA found the mutations in all cases. C)“No of random substitutions” indicate the number of substitutions randomly introduced into the same read.
Accuracy and speed of mutation mapping.
| Simulated datasetsa) | Experimental datasets | |||||||||||
| SNP(1) | SNP (2) | Deletion (3 bp) | Deletion (15 bp) | Cell line H1975 | Cell line HCC827 | |||||||
| Algorithm: | Time (sec) | % | Time (sec) | % | Time(sec) | % | Time(sec) | % | Time (sec) | No of hits(% correct) | Time (sec) | No of hits(% correct) |
| Needleman-Wunsch | 132600 | 100 | 135600 | 100 | 136600 | 100 | 129800 | 100 | 204000 | 1000000 (100) | 99000 | 1000000 (100) |
| BWA | 1264 | 100 | 1306 | 100 | 1234 | 100 | 1276 | 100 | 1641 | 998142 (99.8) | 913 | 997564 (97.88) |
| Bowtie | 88 | 100 | 92 | 100 | 154 | 0 | 152 | 0 | 144 | 325812 (32.49) | 95 | 29700 (0) |
| Bowtie2 (beta) | 146 | 100 | 148 | 100 | 144 | 100 | 146 | 100 | 379 | 988372 (98.84) | 212 | 985512 (97.14) |
| mrsFAST | 14 | 100 | 14 | 100 | 16 | 0 | 14 | 0 | 15 | 640727 (64) | 16 | 68280 (0) |
| Mosaik | 3860 | 100 | 4086 | 100 | 3824 | 0 | 3870 | 0 | 4926 | 254718 (25.4) | 1426 | 83784 (0) |
| HeurAA(this work) | 7,1 | 100 | 7,5 | 100 | 7,6 | 100 | 6,9 | 100 | 23 | 1000000 (100) | 25 | 1000000 (100) |
1,000,000 sequences of 225 bp, containing the indicated number of different, randomly placed substitutions, insertions or deletions generated by the MSBAR program of th EMBOSS package [25]. The reference sequence was a part of exon 20 of the Human Epidermal Growth Factor Receptor gene (NT_033968/ENSG00000146648) exon 20, 4838312 - 4838535 bp, sequence in Supplementary Materials S1).
Data collection was carried out by PCR amplicon sequencing on a Roche 454 sequencer as recommended for the 454 platform ( http://454.com/applications/targeted-resequencing/index.asp ). Reads containing complete sequence identifiers and longer than 80 bp were analyzed further.
The samples were taken from a culture of H1975 cells carrying a a 128C>T mutation in exon 20 of the EGFR gene [26] (COSMIC [24] ID: 6240). The reference sequence was the same as in (a). 1,000,000 reads containing complete sequence identifier and longer than 80 bp were analyzed.
The samples, taken from a culture of HCC827 cells carrying a 137delGGAATTAAGAGAAGCA mutation in exon 19 of the EGFR gene [27] (COSMIC ID: 6223, sequence in Supplementary Materials S1). The reference sequence was a segment of the EGFR gene(NT_033968/ENSG00000146648) exon 19, 4831698 - 4831757 bp segment, sequence in Supplementary Materials S1). 1,000,000 reads containing complete sequence identifier and longer than 80 bp were analyzed. Part of the dataset is deposited with the Supplementary Materials S1.
http://bioinformatics.bc.edu/marthlab/Mosaik.
Comparison of typical run-times(a) and hit accuracy measured on the prion octapeptide repeat region.
| Random SNP(1) | Deletion of 24 bp | Insertion of 48 bp | ||||
| Time (sec) | % found | Time (sec) | found | Time (sec) | found | |
|
| 153228 | 100 | 153228 | + | 153228 | + |
|
| 1270 | 100 | 1298 | + | 1304 | + |
|
| 190 | 100 | 24 | − | 22 | − |
|
| 346 | 100 | 428 | + | 446 | + |
|
| 16.4 | 100 | 14.8 | − | 14.8 | − |
|
| 5430 | 100 | 5912 | − | 6490 | − |
|
| 8.12 | 100 | 9.24 | + | 9.42 | + |
The times were measured for 1,000,000 amplicons harboring different randomly placed substitutions.
SNP(1) sequences were produced by placing 1 random substitution in to the repeat region using the MSBAR program of the EMBOSS package [25]. The times were measured for 1,000,000 randomly generated read copies.
The times were measured for 1,000,000 copies of the same sequence.
http://bioinformatics.bc.edu/marthlab/Mosaik, version 1.1.0014.
Figure 4HeurAA output.
Sample: Name of the sample; Reference: name of the reference sequence; Mutation: Mutation found (mutations with >5 nucleotides are replaced with the length of the mutation); Percent: Percentage of mutation in all sequences identified with specified sample an reference; Forw: Number of mutations found in forward sequences; Rev: Mutations found in reverse complement sequences; Mut no.: Total number of mutations found; Reads: Total reads found for specified sample/reference pair; Annotation: For mutations longer than 5 nucleotides), HeurAA prints out the entire mutation here.
Figure 5Repeats in the reference sequence.
The repeats are indicated by boxes. Note that the strstr function detects only the first occurrences of the words so a mutation occurring within a region covered only by the subsequent occurrences (grey) will not be noticed unless the algorithm specifically looks for them.
Figure 6Logical structure of Index file (example).
In this example the sub-word found in position 6 is also found in at least one other location of the reference sequence (see example in Figure 5).