Shingo Suzuki, Naoaki Ono, Chikara Furusawa, Bei-Wen Ying, Tetsuya Yomo.
Abstract
Next-generation sequencing technologies enable the rapid, cost-effective production of sequence data. To evaluate the performance of these sequencing technologies, it is important to investigate the quality of the sequence reads they produce. In this study, we analyzed the quality of sequence reads and SNP detection performance using three commercially available next-generation sequencers: the Roche Genome Sequencer FLX System (FLX), the Illumina Genome Analyzer (GA), and the Applied Biosystems SOLiD system (SOLiD). A common genomic DNA sample obtained from Escherichia coli strain DH1 was applied to these sequencers. The obtained sequence reads were aligned to the complete genome sequence of E. coli DH1 to evaluate the accuracy and sequence bias of these sequencing methods. We found that the fraction of "junk" data, which could not be aligned to the reference genome, was largest in the SOLiD data set, in which about half of the reads could not be aligned. Among the data sets after alignment to the reference, sequence accuracy was poorest in the GA data sets, suggesting relatively low fidelity of the elongation reaction in the GA method. Furthermore, by aligning the sequence reads to E. coli strain W3110, we screened for sequence differences between the two E. coli strains using the data sets from the three next-generation platforms. The results revealed that the detected sequence differences were similar among the three methods, while the sequence coverage required for detection was significantly lower for the FLX data set. These results provide valuable information on the quality of short sequence reads and the performance of SNP detection on three next-generation sequencing platforms.
Year: 2011 PMID: 21611185 PMCID: PMC3096631 DOI: 10.1371/journal.pone.0019534
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Volume of data from three next-generation sequencing technologies.
| Method | Read length (bases) | Number of reads | Total bases | Redundancy (fold) |
|---|---|---|---|---|
| FLX | 260.7 | 475,819 | 124,042,803 | 26.8 |
| GA | 36 | 9,624,599 | 346,485,564 | 75 |
| SOLiD M25 | 25 | 125,399,243 | 3,134,981,075 | 678.4 |
| SOLiD M50 | 50 | 226,945,098 | 11,347,254,900 | 2455.4 |
| SOLiD F50 | 50 | 100,015,475 | 5,000,773,750 | 1082.1 |
The read length for FLX is the mean read length, including adapters.
SOLiD M25 is the data set from the 25-base mate-pair library.
The SOLiD M50 and F50 data sets are each the sum of two replicates, from the 50-base mate-pair and 50-base fragment libraries, respectively.
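The arithmetic behind the table can be checked with a short sketch. It assumes, since this excerpt does not define the column, that Redundancy means total bases divided by the reference genome length; the variable names are illustrative only. Under that assumption, every row implies a genome length of roughly 4.62 Mb, and the total bases equal read length times number of reads.

```python
# Rows copied from the table above:
# (method, read length in bases, number of reads, total bases, redundancy)
rows = [
    ("FLX", 260.7, 475_819, 124_042_803, 26.8),
    ("GA", 36, 9_624_599, 346_485_564, 75),
    ("SOLiD M25", 25, 125_399_243, 3_134_981_075, 678.4),
    ("SOLiD M50", 50, 226_945_098, 11_347_254_900, 2455.4),
    ("SOLiD F50", 50, 100_015_475, 5_000_773_750, 1082.1),
]

implied_genome_length = {}
mean_read_length = {}
for method, read_len, n_reads, total_bases, redundancy in rows:
    # Assumption: redundancy = total bases / genome length,
    # so genome length = total bases / redundancy.
    implied_genome_length[method] = total_bases / redundancy
    # Mean read length recomputed from the two count columns.
    mean_read_length[method] = total_bases / n_reads

for method in implied_genome_length:
    print(f"{method:10s} mean read length {mean_read_length[method]:7.1f}  "
          f"implied genome length {implied_genome_length[method]:,.0f}")
```

The implied lengths agree across platforms to within about 0.2%, which is consistent with the redundancy values being rounded to one decimal place.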
Comparison of mapping.
| Method | Mapped reads (%) | Accuracy per base (%) |
|---|---|---|
| FLX | 89.0 | 99.9 |
| GA | 63.7 | 96.7 |
| SOLiD | 47.3 | 99.8 |
For GA, the filtered data set is shown.
Figure 1. Error ratio in GA reads depending on the base position in the read.
The ratio of mismatches between mapped reads and the reference sequence to the total number of mapped reads is plotted against base position in the read. The mismatch ratio increases with base position, indicating a decrease in base-call accuracy.
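The quantity plotted in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes ungapped alignments given as (read, reference segment) string pairs of equal length, and both the function name and the toy reads are made up for the example.

```python
def mismatch_ratio_by_position(pairs):
    """Return, for each read position, mismatches / reads covering that position.

    pairs: iterable of (read, reference_segment) strings of equal length.
    Assumption: alignments are ungapped, so position i of the read is
    compared directly with position i of the reference segment.
    """
    mismatches = []
    totals = []
    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segment must have equal length")
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            if i == len(totals):
                # First read that reaches this position: extend the counters.
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_base != ref_base:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]


# Toy data (hypothetical): errors cluster toward the 3' end of the reads.
toy_pairs = [
    ("ACGTAC", "ACGTAC"),
    ("ACGTAA", "ACGTAC"),
    ("ACGAAT", "ACGTAC"),
    ("ACGTCC", "ACGTAC"),
]
ratios = mismatch_ratio_by_position(toy_pairs)
print(ratios)
```

With real data, the pairs would come from the aligner's output, and reads mapped to the reverse strand would need to be oriented so that position 0 is always the first sequenced base.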
Comparison of uncovered regions.
| Method | Uncovered bases (excluding common) | GC content (%) |
|---|---|---|
| FLX | 4,799 | 51.3 |
| GA | 58,367 | 56.1 |
| SOLiD | 27,986 | 50.4 |
The 71,334 bases that were not covered by reads from any of the three methods were excluded. Most of these were uncovered because of duplicated sequences, such as ribosomal RNA, insertion sequences, and highly conserved homologs.
Detection of single-base substitutions.
| Method | Coverage (fold) | False positive | True positive | False negative (Uncovered) |
|---|---|---|---|---|
| FLX | 23.5 | 8 | 239 | 20 (14) |
| GA | 16.8 | 46 | 223 | 36 (17) |
| SOLiD | 609.3 | 18 | 243 | 16 (12) |
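For comparison across platforms, the counts above can be turned into sensitivity and precision. These derived figures are not reported in this excerpt; the sketch simply applies the standard definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), to the table rows.

```python
# Counts copied from the table above: (method, false positive, true positive, false negative)
counts = [
    ("FLX", 8, 239, 20),
    ("GA", 46, 223, 36),
    ("SOLiD", 18, 243, 16),
]

metrics = {}
for method, fp, tp, fn in counts:
    sensitivity = tp / (tp + fn)  # fraction of real substitutions that were detected
    precision = tp / (tp + fp)    # fraction of reported substitutions that were real
    metrics[method] = (sensitivity, precision)
    print(f"{method:6s} sensitivity {sensitivity:.3f}  precision {precision:.3f}")
```

In every row TP + FN equals 259, so the three platforms are scored against the same set of substitutions.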
Figure 2. Numbers of true and false detections as a function of the mean coverage.
(A) Magnification of the low-coverage range. (B) Whole range. The circle, triangle, and cross symbols indicate the numbers of true positives (TP), false negatives (FN), and false positives (FP), respectively. Black, red, and green represent FLX, GA, and SOLiD, respectively. Extrapolated lines for the saturation of TP with FLX and GA are added in (B).