Lin Liu, Yinhu Li, Siliang Li, Ni Hu, Yimin He, Ray Pong, Danni Lin, Lihua Lu, Maggie Law.
Abstract
With the rapid development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to help decode life's mysteries, breed better crops, detect pathogens, and improve quality of life. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world's biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, the technologies of these systems are reviewed, and first-hand data from this experience are summarized and analyzed to discuss the advantages and specifics of each sequencing system. Finally, applications of NGS are summarized.
Year: 2012 PMID: 22829749 PMCID: PMC3398667 DOI: 10.1155/2012/251364
Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243
Comparison in alignment between Ion Torrent and HiSeq 2000.
| | Ion Torrent (a) | HiSeq 2000 (b) |
|---|---|---|
| Total reads num | 165518 | 205683 |
| Total bases num | 18574086 | 18511470 |
| Max read length | 201 | 90 |
| Min read length | 15 | 90 |
| Map reads num | 157258 | 157511 |
| Map rate | 95% | 76.57% |
| Covered rate | 96.50% | 93.11% |
| Total map length | 15800258 | 14176420 |
| Total mismatch base | 53475 | 142425 |
| Total insertion base | 109550 | 1397 |
| Total insertion num | 95740 | 1332 |
| Total deletion base | 152495 | 431 |
| Total deletion num | 139264 | 238 |
| Ave mismatch rate | 0.338% | 1.004% |
| Ave insertion rate | 0.693% | 0.009% |
| Ave deletion rate | 0.965% | 0.003% |
a: use TMAP to align; b: use SOAP2 to align.
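
The average rates in the table appear consistent with dividing each error count by the total mapped length, and the map rate with dividing mapped reads by total reads. The following is a minimal sketch of that arithmetic, not the authors' pipeline; the dictionary keys and the helper names `percent` and `summarize` are hypothetical, and the counts are copied from the table.

```python
def percent(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Counts copied from the alignment comparison table (key names are illustrative).
alignments = {
    "Ion Torrent": {
        "reads": 165518, "mapped_reads": 157258, "mapped_bases": 15800258,
        "mismatch": 53475, "insertion": 109550, "deletion": 152495,
    },
    "HiSeq 2000": {
        "reads": 205683, "mapped_reads": 157511, "mapped_bases": 14176420,
        "mismatch": 142425, "insertion": 1397, "deletion": 431,
    },
}


def summarize(counts):
    """Derive map rate and per-base error rates (all in percent)."""
    return {
        "map_rate": percent(counts["mapped_reads"], counts["reads"]),
        "mismatch_rate": percent(counts["mismatch"], counts["mapped_bases"]),
        "insertion_rate": percent(counts["insertion"], counts["mapped_bases"]),
        "deletion_rate": percent(counts["deletion"], counts["mapped_bases"]),
    }


for platform, counts in alignments.items():
    rates = summarize(counts)
    print(platform, {name: round(value, 3) for name, value in rates.items()})
```

Under this assumption the Ion Torrent indel rates come out roughly two orders of magnitude above those of HiSeq 2000, while its mismatch rate is about a third as large, which matches the pattern the table reports.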
Figure 1. Ion Torrent sequencing quality. E. coli K12 DH10B (NC_010473.1), with a GC content of 50.78%, was used for this experiment. (a) shows the 314–200 bp data from Ion Torrent. The left panel shows quality values: the pink range spans the minimum and maximum quality values at each position, the green area represents the top and bottom quarter (1/4) of reads by quality, and the red line is the average quality value at each position. The right panel is the read length analysis: the colored histogram represents the real read length and the black line represents the mapped length, which differs from the real read length because 3′ soft clipping is allowed. (b) is the accuracy analysis. For each position, the error types (mismatch, insertion, and deletion) are shown on the left y-axis and the average accuracy is shown on the right y-axis. The accuracy of 200 bp sequencing can reach 99%. (c) shows base composition along reads (left) and the GC distribution analysis (right). The left panel is the base composition at each read position; the base lines separate after about 95 cycles, indicating inaccurate sequencing. The right panel uses a 500 bp window, and the GC distribution is quite even. Data from high-GC samples also indicate good performance on Ion Torrent (data not shown).
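
The windowed GC analysis mentioned in the caption can be illustrated with a short sketch. It assumes non-overlapping windows and drops a trailing partial window; the caption does not say whether the original analysis used sliding or fixed windows, and the function name `gc_content_by_window` is hypothetical.

```python
def gc_content_by_window(sequence, window=500):
    """Return the GC fraction of each consecutive, non-overlapping window.

    A trailing stretch shorter than `window` is ignored (an assumption
    made here for simplicity).
    """
    sequence = sequence.upper()
    fractions = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window]
        gc_bases = chunk.count("G") + chunk.count("C")
        fractions.append(gc_bases / window)
    return fractions


# Toy example with a 4 bp window instead of 500 bp.
print(gc_content_by_window("GGCCAATTGCAT", window=4))
```

An even GC distribution, as described for the E. coli sample, would show up as window fractions clustered tightly around the genome-wide GC content.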
MiSeq 150PE data.
| Sample | GC | Q20 | Q30 |
|---|---|---|---|
| Human HPV | 33.57; 33.62 | 98.26; 95.52 | 93.64; 88.52 |
| Bacteria | 61.33; 61.43 | 90.84; 83.86 | 78.46; 69.04 |
(1) The data in the table includes both read 1 and read 2 from paired-end sequencing.
(2) GC represents the GC content of libraries.
(3) The Q20 value is the average Q20 of all bases in a read, that is, the proportion of bases whose quality score corresponds to an error probability of no more than 1 in 100. The Q30 value is the average Q30 of all bases in a read, that is, the proportion of bases whose error probability is no more than 1 in 1,000.
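
Q20 and Q30 follow from the standard Phred relation Q = -10 · log10(p), where p is the probability that a base call is wrong. The sketch below shows how such percentages could be computed from a FASTQ quality string; it assumes Phred+33 encoding, and the helper names are hypothetical rather than taken from the paper.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def fraction_at_or_above(qualities, threshold):
    """Fraction of bases whose quality is >= threshold (e.g. 20 or 30)."""
    if not qualities:
        return 0.0
    return sum(q >= threshold for q in qualities) / len(qualities)


# Toy read: four bases at Q40 ('I'), four at Q20 ('5'), two at Q10 ('+').
qualities = decode_phred33("IIII5555++")
print("Q20 fraction:", fraction_at_or_above(qualities, 20))
print("Q30 fraction:", fraction_at_or_above(qualities, 30))
print("Error probability at Q20:", phred_to_error_probability(20))
```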
The comparison between PGM and MiSeq.
| | PGM | MiSeq |
|---|---|---|
| Output | 10 MB–100 MB | 120 MB–1.5 GB |
| Read length | ~200 bp | Up to 2 × 150 bp |
| Sequencing time | 2 hours for 1 × 200 bp | 3 hours for 1 × 36 bp single read; 27 hours for 2 × 150 bp paired-end read |
| Sample preparation time | 8 samples in parallel, less than 6 hrs | As fast as 2 hrs, with 15 minutes hands-on time |
| Sequencing method | semiconductor technology with a simple | Sequencing by synthesis (SBS) |
| Potential for development | Various parameters | Limited factors, mainly flow cell surface size, insert size, and how tightly clusters can be packed |
| Input amount | | ng (Nextera) |
| Data analysis | Off instrument | On instrument |
Figure 2. Sequencing of a fosmid DNA using the Pacific Biosciences sequencer. With increased coverage, the accuracy can exceed 97%. The figure was constructed from BGI's own data.
(a)
| Sequencer | 454 GS FLX | HiSeq 2000 | SOLiDv4 | Sanger 3730xl |
|---|---|---|---|---|
| Sequencing mechanism | Pyrosequencing | Sequencing by synthesis | Ligation and two-base coding | Dideoxy chain termination |
| Read length | 700 bp | 50SE, 50PE, 101PE | 50 + 35 bp or | 400 ~ 900 bp |
| Accuracy | 99.9%* | 98% (100PE) | 99.94% *raw data | 99.999% |
| Reads | 1 M | 3 G | 1200~1400 M | — |
| Output data/run | 0.7 Gb | 600 Gb | 120 Gb | 1.9~84 Kb |
| Time/run | 24 Hours | 3~10 Days | 7 Days for SE | 20 Mins~3 Hours |
| Advantage | Read length, fast | High throughput | Accuracy | High quality, long read length |
| Disadvantage | Errors in homopolymer runs longer than 6 bases, high cost, low throughput | Short read assembly | Short read assembly | High cost, low throughput |
(b)
| Sequencers | 454 GS FLX | HiSeq 2000 | SOLiDv4 | 3730xl |
|---|---|---|---|---|
| Instrument price | Instrument $500,000, $7000 per run | Instrument $690,000, $6000/(30x) human genome | Instrument $495,000, $15,000/100 Gb | Instrument $95,000, about $4 per 800 bp reaction |
| CPU | 2* Intel Xeon X5675 | 2* Intel Xeon X5560 | 8* processor 2.0 GHz | Pentium IV 3.0 GHz |
| Memory | 48 GB | 48 GB | 16 GB | 1 GB |
| Hard disk | 1.1 TB | 3 TB | 10 TB | 280 GB |
| Automation in library preparation | Yes | Yes | Yes | No |
| Other required device | REM e system | cBot system | EZ beads system | No |
| Cost/million bases | $10 | $0.07 | $0.13 | $2400 |
(c)
| Sequencers | 454 GS FLX | HiSeq 2000 | SOLiDv4 | 3730xl |
|---|---|---|---|---|
| Resequencing | Yes | Yes | | |
| | Yes | Yes | Yes | |
| Cancer | Yes | Yes | Yes | |
| Array | Yes | Yes | Yes | Yes |
| High GC sample | Yes | Yes | Yes | |
| Bacterial | Yes | Yes | Yes | |
| Large genome | Yes | Yes | | |
| Mutation detection | Yes | Yes | Yes | Yes |
(1) All data are taken from daily average performance runs at BGI. The average daily sequence data output at BGI is about 8 Tb when about 80% of the sequencers (mainly HiSeq 2000) are running.
(2) The reagent cost of 454 GS FLX Titanium is calculated based on the sequencing of 400 bp; the reagent cost of HiSeq 2000 is calculated based on the sequencing of 200 bp; the reagent cost of SOLiDv4 is calculated based on the sequencing of 85 bp.
(3) HiSeq 2000 is more flexible in sequencing types like 50SE, 50PE, or 101PE.
(4) SOLiD has high accuracy especially when coverage is more than 30x, so it is widely used in detecting variations in resequencing, targeted resequencing, and transcriptome sequencing. Lanes can be independently run to reduce cost.
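
For two of the platforms, the cost per million bases in part (b) can be approximated by dividing the quoted run cost by the sequence output. This is a rough sketch, not the authors' calculation: it assumes a human genome size of about 3 Gb for the HiSeq 2000 figure, and the function name `cost_per_megabase` is hypothetical. The remaining entries depend on the reagent-cost assumptions in note (2) and are not reproduced here.

```python
def cost_per_megabase(run_cost_usd, output_gigabases):
    """Approximate cost in USD per million bases (1 Gb = 1,000 Mb)."""
    return run_cost_usd / (output_gigabases * 1000)


# 454 GS FLX: $7000 per run and 0.7 Gb per run, both from the tables above.
cost_454 = cost_per_megabase(7000, 0.7)

# HiSeq 2000: $6000 for a 30x human genome.
# ASSUMPTION: a human genome is taken as roughly 3 Gb.
HUMAN_GENOME_GB = 3.0
cost_hiseq = cost_per_megabase(6000, 30 * HUMAN_GENOME_GB)

print(f"454 GS FLX: ${cost_454:.2f} per Mb")
print(f"HiSeq 2000: ${cost_hiseq:.3f} per Mb")
```

Both values land close to the listed $10 and $0.07 per million bases.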