| Literature DB >> 21672185 |
Francisco Fernandes1, Paulo G S da Fonseca, Luis M S Russo, Arlindo L Oliveira, Ana T Freitas.
Abstract
BACKGROUND: Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects.Entities:
Mesh:
Year: 2011 PMID: 21672185 PMCID: PMC3118166 DOI: 10.1186/1471-2105-12-163
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1Indexes for the text . (a) Suffix tree for the text abbbab, with the leaves numbered according to lexicographic order of the suffixes they represent. (b) The corresponding suffix array indicating the starting position of the sorted suffixes. (c) The equivalent BWT obtained by rotating and then sorting the text. Notice that the BWT is actually composed of only the last letter of the sorted rotations. Interestingly, this representation permits reconstructing the original text, and is also much more amenable to compression because it typically contains sequences of repeated characters. Notice, in addition, that every subtree of the suffix tree corresponds to an interval of the suffix array and, equivalently, of the BWT. In this example, the dashed boxes indicate the subtree/intervals corresponding to the suffixes that start with the common prefix b.
Figure 2Sketch and example of the TAPyR procedure. (a) Sketch of the seed strategy employed by TAPyR. In this schema, three seeds are chosen, and seven matches of these seeds are found on the reference genome, three for the first seed, two for the second, and two for the third. These occurrences are ordered in the genome and scanned from left to right. Multiple seed matches are formed by extending current partial matches with the next occurrence if the coherence criteria are met. Otherwise, the current multiple match is stored as a potential candidate and a new one is started. In this example, we finish with five potential candidates for extension indicated by the dashed boxes. The largest candidate(s), i.e. the multiple seed occurrence that span most bases, are chosen for extension. In this case, that should be . (b) A more concrete example, in which we have a sequence of length 15, which was originally read from position 101 of the genome, with one insertion at position 5 and one substitution at position 9. The algorithm starts searching from the beginning of the read in the index, but cannot continue beyond the fourth a character. At this point, we have the first seed s1 = aaaa, which occurs at position 101 in the genome. The next character of the read is skipped, and the search continues from position 6, which is the beginning of the second seed. Seed s2 = ccct happens to have an accidental occurrence at the position 201, which is not related to the actual read position in the genome. Again, we skip the next (mismatched) character of the read and restart at position 11. This time the search reaches the end of the read, and yields the last seed s3 = gggtt, occurring at position 110. These three occurrences are now sorted according to their position in the genome, and it turns out that the occurrences of s1 and s3 form a coherent multiple seed occurrence of combined length 9. The other candidate would be composed of the occurrence s2 alone which is not chosen for expansion since it is smaller. The space between the two seeds is then filled using dynamic programming, and the correct mapped position (101) is returned along with the final alignment.
Real data sets
| Reference genome Source | Reference genome size | SRA accession and species | Total reads | Average read length |
|---|---|---|---|---|
| ≈2.2 Mbp | ○ SRR001327 | 646,724 | 253 | |
| ≈4.96 Mbp | ○ SRR000868 | 588,397 | 263 | |
| ≈23.3 Mbp | ○ SRR006911 | 203,196 | 223 | |
| ≈103 Mbp | ○ SRR022943 | 3,214,353 | 103 | |
| ≈150 Mbp | ○ SRR003807 | 834,659 | 239 | |
| ≈100 Mbp | ○ SRR014420 Human individual NA15510 | 3,204 | 212 | |
Biological data sets used for the evaluation of the algorithms. The read data sets were downloaded from the Sequence Read Archive (SRA) public repository.
Experimental results with real data
| % | % | % | |||||||
|---|---|---|---|---|---|---|---|---|---|
| BWA-SW | 19m | 92.86 | 40 | 16m | 65.4 | 22 | 8m | 95.98 | 56 |
| SSAHA2 | 6m | 92.86 | 41 | 5m | 65.4 | 23 | 68m | 98.45 | 57 |
| Newbler | 11m | 92.95 | 47 | 11m | 65.55 | 24 | 12m | 97.72 | 60 |
| Segemehl | 181m | 90.77 | 61 | 110m | 62.95 | 32 | 90m | 98.14 | 62 |
| GASSST | 6m | 89.96 | 63 | 4m | 62.71 | 32 | 3m | 62.80 | 72 |
| TAPyR 85 | 2m | 88.37 | 70 | 13m | 59.98 | 35 | 48s | 95.30 | 66 |
| TAPyR 50 | 3m | 93.33 | 40 | 20m | 66.80 | 20 | 51s | 98.91 | 53 |
| BWA-SW | 51m | 61.05 | 19 | 58m | 95.80 | 19 | 6s | 98.25 | 58 |
| SSAHA2 | 513m | 70.40 | 18 | 92m | 97.30 | 18 | 16s | 99.44 | 54 |
| Newbler | 45m | 69.07 | 19 | 119m | 96.86 | 23 | 1m | 98.72 | 78 |
| Segemehl | 249m | 68.10 | 23 | 339m | 90.98 | 28 | 52s | 96.41 | 89 |
| GASSST | 9m | 58.50 | 27 | 22m | 82.80 | 30 | 1m | 83.61 | 93 |
| TAPyR 85 | 31m | 55.59 | 35 | 7m | 85.91 | 30 | 1s | 95.13 | 96 |
| TAPyR 50 | 31m | 73.30 | 15 | 7m | 97.05 | 19 | 2s | 99.63 | 60 |
Results of the experiments performed with real biological data. These tests were run on a Linux server with 16 Gb of RAM. TAPyR was tested under two configurations, requiring alignments with at least 50% and 85% identity, designated by TAPyR 50 and TAPyR 85, respectively. The other tools were run with their default options for 454 data, except for the following modifications. SSAHA2 and Segemehl were set to report only the best alignment for each read. For GASSST, we set the minimum identity to 85% (option -p 85) to match Segemehl. Newbler was set not to generate large files (option -nobig) and to load the index into the main memory (option -m). Reported times refer to the total number of CPU-seconds that the process used directly, as given by the Linux command time -f "%U". The average base-pairs-per-error rates ("bp/err" columns) are computed based on the best reported alignment of each read only.
Experimental results with synthetic data
| HS1 | HS2 | HS3 | ||||
|---|---|---|---|---|---|---|
| % | % | % | ||||
| BWA-SW | 1358s | 99.32 | 1269s | 98.81 | 1212s | 94.10 |
| SSAHA2 | 5044s | 99.86 | 4328s | 99.86 | 4121s | 99.18 |
| Newbler | 1192s | 99.99 | 6066s | 95.53 | 7260s | 83.25 |
| Segemehl | 3824s | 99.99 | 6823s | 99.95 | 6299s | 98.81 |
| GASSST | 855s | 99.47 | 794s | 99.27 | 693s | 97.35 |
| TAPyR 50 | 63s | 99.99 | 113s | 99.78 | 239s | 98.20 |
Results of the experiments performed with synthetic data. The data sets were generated as described in the text. The tests were run under the same conditions as those with real data. Shown is the percentage of reads of each data set that were successfully aligned to their original positions in the reference sequence.
Memory requirements for real data
| BWA-SW | 3 | 27 | 7 | 29 | 34 | 59 |
| SSAHA2 | 517 | 788 | 521 | 773 | 552 | 717 |
| Newbler | n/a | 600 | n/a | 900 | n/a | 500 |
| Segemehl | 31 | 302 | 68 | 328 | 315 | 447 |
| GASSST | n/a | 2381 | n/a | 2479 | n/a | 2539 |
| TAPyR 50 | 4 | 7 | 8 | 18 | 39 | 64 |
| BWA-SW | 144 | 149 | 210 | 208 | 118 | 103 |
| SSAHA2 | 679 | 1559 | 755 | 1209 | 648 | 680 |
| Newbler | n/a | 5100 | n/a | 2300 | n/a | 500 |
| Segemehl | 1388 | 2439 | 2025 | 2643 | 1134 | 1299 |
| GASSST | n/a | 5657 | n/a | 6967 | n/a | 4752 |
| TAPyR 50 | 168 | 270 | 244 | 394 | 137 | 220 |
Memory requirements for the real biological data sets of Table 1. Shown are the index file size (when applicable) and peak main memory usage as measured by the tool tstime
http://bitbucket.org/gsauthof/tstime, except for Newbler, whose overall memory demand was estimated through the Linux htop tool.