Xiaoqing Peng, Jianxin Wang, Zhen Zhang, Qianghua Xiao, Min Li, Yi Pan.
Abstract
MOTIVATION: Next-generation genome sequencing technologies have enabled a variety of biological applications, and alignment is the first step once the sequencing reads are obtained. In recent years, many software tools have been developed to align short reads to the reference genome efficiently and accurately. However, many reads still cannot be mapped to the reference genome because they exceed the allowable number of mismatches. Moreover, besides the unmapped reads, reads with low mapping quality are also excluded from downstream analyses such as variant calling. If the confident segments of these reads can be exploited, the alignment rate can be improved and more information can be provided for downstream analysis.
Year: 2015 PMID: 25860434 PMCID: PMC4402702 DOI: 10.1186/1471-2105-16-S5-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The quality score distribution of sequencing errors. Ten million 50-bp Illumina Solexa reads were simulated with ART, and each base in a read was assigned a quality score by a Phred-like algorithm. The x-axis shows quality scores from 0 to 40, and the y-axis shows the number of sequencing errors at each quality score.
Figure 2. Percentage of sequencing errors in bases with quality scores below 20. (a) The percentage of bases with quality scores below 20 that are sequencing errors; (b) the percentage of all sequencing errors that have quality scores below 20.
Figure 3. An example of trimming. The consecutive squares represent the bases of a 45-bp read; black squares denote bases with low quality scores, and white squares denote bases with high quality scores. The read contains eight low-quality bases. When K = 4, the read is trimmed to the longest segment that contains four low-quality bases; when K = 3, it is trimmed to the longest segment that contains three low-quality bases.
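As a rough illustration of the trimming described for Figure 3, the sketch below uses a sliding window to find the longest contiguous segment of a read that contains at most K low-quality bases. The function name `longest_segment`, the low-quality threshold of 20 (borrowed from Figure 2), and the "at most K" reading of the caption are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import Sequence, Tuple


def longest_segment(quals: Sequence[int], k: int, threshold: int = 20) -> Tuple[int, int]:
    """Return (start, end) so that quals[start:end] is the longest slice
    containing at most k bases with quality below `threshold`.

    Hypothetical helper: the threshold and the tie-breaking rule
    (the leftmost longest segment wins) are assumptions.
    """
    best_start, best_end = 0, 0
    left = 0   # left edge of the current window
    low = 0    # number of low-quality bases inside the window
    for right, q in enumerate(quals):
        if q < threshold:
            low += 1
        # Shrink the window from the left until it holds at most k low bases.
        while low > k:
            if quals[left] < threshold:
                low -= 1
            left += 1
        # Keep the window only if it is strictly longer than the best so far.
        if right + 1 - left > best_end - best_start:
            best_start, best_end = left, right + 1
    return best_start, best_end


# Toy read of 10 bases with three low-quality positions (indices 2, 6, 8).
quals = [30, 30, 10, 30, 30, 30, 10, 30, 10, 30]
start, end = longest_segment(quals, k=1)
print(start, end)
```

With `k=1` the toy read is trimmed to its first six bases, which contain a single low-quality base; lowering K shortens the retained segment, which mirrors the K = 4 versus K = 3 contrast in the caption.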
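The alignment rate and precision of each alignment method on single-end simulated data with different read lengths.
| Method | 50-bp alignment rate (%) | 50-bp precision (%) | 75-bp alignment rate (%) | 75-bp precision (%) | 100-bp alignment rate (%) | 100-bp precision (%) |
|---|---|---|---|---|---|---|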
| BWA | 79.4737 | 99.7359 | 73.7762 | 99.7975 | 30.3573 | 99.7208 |
| BWA(mem) | - | - | 82.8545 | 99.8912 | 83.0971 | 99.8004 |
| RAUR(BWA) | 83.5165 | 99.3132 | 86.5413 | 99.1875 | 87.8022 | 99.1834 |
| Bowtie2 | 74.8779 | 99.6313 | 77.8351 | 99.7501 | 71.4820 | 99.8918 |
| Bowtie2(local) | - | - | 85.206 | 95.8658 | 82.3958 | 95.5368 |
| RAUR(Bowtie2) | 83.0495 | 98.2984 | 85.3716 | 98.3442 | 86.8258 | 98.2009 |
There are 7,740,912 simulated single-end reads with length 50-bp, 5,156,962 with length 75-bp, and 3,868,843 with length 100-bp.
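The alignment rate and precision of each alignment method on paired-end simulated data with different read lengths.
| Method | 50-bp alignment rate (%) | 50-bp precision (%) | 75-bp alignment rate (%) | 75-bp precision (%) | 100-bp alignment rate (%) | 100-bp precision (%) |
|---|---|---|---|---|---|---|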
| BWA | 89.0737 | 99.6436 | 91.6370 | 99.8411 | 34.7815 | 99.6929 |
| BWA(mem) | - | - | 97.3837 | 99.8505 | 96.6372 | 99.6355 |
| RAUR(BWA) | 94.8130 | 99.2667 | 96.8181 | 99.7171 | 97.0432 | 98.9618 |
| Bowtie2 | 84.2039 | 99.8432 | 88.0185 | 99.9409 | 77.5385 | 99.9537 |
| Bowtie2(local) | - | - | 95.0565 | 98.0066 | 90.5642 | 96.8716 |
| RAUR(Bowtie2) | 96.6203 | 98.2447 | 96.9858 | 99.1592 | 96.8685 | 98.7567 |
There are 996,739 pairs of simulated paired-end reads with length 50-bp, 1,541,980 pairs with length 75-bp, and 1,156,184 pairs with length 100-bp.
The numbers of true positives (TP) and false positives (FP) among the re-aligned reads from the single-end simulated datasets.
| Method | 50-bp #RA | 50-bp #TP | 50-bp #FP | 75-bp #RA | 75-bp #TP | 75-bp #FP | 100-bp #RA | 100-bp #TP | 100-bp #FP |
|---|---|---|---|---|---|---|---|---|---|
| RAUR(BWA) | 312,949 | 284,796 | 28,153 | 658,289 | 629,733 | 28,556 | 2,222,453 | 2,197,995 | 24,458 |
| RAUR(Bowtie2) | 632,552 | 519,586 | 112,966 | 388,654 | 293,101 | 95,553 | 593,629 | 532,641 | 60,988 |
#RA is the number of confidently re-aligned reads with mapping quality not less than 10, #TP is the number of confidently and correctly re-aligned reads, and #FP is the number of confidently but incorrectly re-aligned reads.
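The three counts in the table above can be checked against each other. The sketch below assumes that #RA = #TP + #FP and takes the precision of the re-aligned reads as #TP / #RA, which is the usual definition but is not spelled out in the text; the helper name `realign_precision` is hypothetical.

```python
def realign_precision(tp: int, fp: int) -> float:
    """Fraction of confidently re-aligned reads that are correct: TP / (TP + FP)."""
    total = tp + fp
    if total == 0:
        raise ValueError("no re-aligned reads")
    return tp / total


# 50-bp single-end RAUR(BWA) row from the table: #RA, #TP, #FP.
ra, tp, fp = 312_949, 284_796, 28_153
assert tp + fp == ra  # the three columns are mutually consistent
print(f"{realign_precision(tp, fp):.4f}")
```

For this row the value comes out at about 0.91. It describes only the re-aligned subset, so it is not directly comparable to the overall precision reported for all aligned reads.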
The numbers of true positives (TP) and false positives (FP) among the re-aligned reads from the paired-end simulated datasets.
| Method | 50-bp #RA | 50-bp #TP | 50-bp #FP | 75-bp #RA | 75-bp #TP | 75-bp #FP | 100-bp #RA | 100-bp #TP | 100-bp #FP |
|---|---|---|---|---|---|---|---|---|---|
| RAUR(BWA) | 57,206 | 53,440 | 3,766 | 79,892 | 77,913 | 1,979 | 719,859 | 709,445 | 10,414 |
| RAUR(Bowtie2) | 123,759 | 108,171 | 15,588 | 138,273 | 126,500 | 11,773 | 223,491 | 209,981 | 13,510 |
#RA is the number of confidently re-aligned reads with mapping quality not less than 10, #TP is the number of confidently and correctly re-aligned reads, and #FP is the number of confidently but incorrectly re-aligned reads.
The alignment rate and precision of Bowtie2 on single-end simulated data with different initial values of K.
| K | 50-bp alignment rate | 50-bp precision | 75-bp alignment rate | 75-bp precision | 100-bp alignment rate | 100-bp precision |
|---|---|---|---|---|---|---|
| 10 | 0.870362 | 0.981524 | 0.858727 | 0.981852 | 0.834774 | 0.981529 |
| 9 | 0.869464 | 0.981703 | 0.856311 | 0.982638 | 0.832754 | 0.982217 |
| 8 | 0.868262 | 0.982009 | 0.853695 | 0.983442 | 0.830495 | 0.982984 |
| 7 | 0.866564 | 0.982608 | 0.85074 | 0.984281 | 0.827873 | 0.983787 |
| 6 | 0.864261 | 0.983586 | 0.847368 | 0.985199 | 0.824662 | 0.984629 |
| 5 | 0.861039 | 0.985146 | 0.84329 | 0.98618 | 0.820193 | 0.985421 |
| 4 | 0.856596 | 0.987344 | 0.837615 | 0.987297 | 0.812949 | 0.986246 |
| 3 | 0.850384 | 0.990075 | 0.828366 | 0.988768 | 0.800774 | 0.987542 |
| 2 | 0.841539 | 0.993 | 0.813271 | 0.991168 | 0.782962 | 0.990233 |
| 1 | 0.825283 | 0.995516 | 0.794526 | 0.994829 | 0.763764 | 0.993865 |
There are 7,740,912 simulated single-end reads with length 50-bp, 5,156,962 with length 75-bp, and 3,868,843 with length 100-bp.
The alignment rate and precision of each alignment method on single-end real data with different read lengths.
| Method | SRR006273 (76 bp) | ERR008838 (76 bp) | ERR008843 (83 bp) |
|---|---|---|---|
| BWA | 69.0456 | 77.1065 | 81.1342 |
| BWA(mem) | 75.3120 | 80.3020 | 83.2176 |
| RAUR(BWA) | 83.0732 | 83.8950 | 86.3092 |
| Bowtie2 | 70.6717 | 78.7660 | 82.3776 |
| Bowtie2(local) | 80.6902 | 85.9375 | 88.2963 |
| RAUR(Bowtie2) | 81.0039 | 83.8124 | 86.0461 |
There are 10,685,743 single-end reads with length 76-bp in SRR006273, 12,564,212 with length 76-bp in ERR008838, and 15,929,373 with length 83-bp in ERR008843.
The alignment rate and precision of each alignment method on paired-end real data with different read lengths.
| Method | ERR007641 (51 bp) | SRR019044 (76 bp) | ERR050728 (90 bp) |
|---|---|---|---|
| BWA | 82.7438 | 80.9078 | 95.7637 |
| BWA(mem) | - | 82.3725 | 96.1319 |
| RAUR(BWA) | 84.9410 | 85.8255 | 96.0608 |
| Bowtie2 | 80.1309 | 79.8907 | 94.0414 |
| Bowtie2(local) | - | 87.4895 | 96.3645 |
| RAUR(Bowtie2) | 82.7563 | 86.2557 | 94.4253 |
There are 2,977,726 pairs of reads with length 51-bp in ERR007641, 9,661,679 pairs with length 76-bp in SRR019044, and 676,633 pairs with length 90-bp in ERR050728.