James K Bonfield, Andrew Whitwham.
Abstract
MOTIVATION: Existing sequence assembly editors struggle with the volumes of data now readily available from the latest generation of DNA sequencing instruments.
Year: 2010 PMID: 20513662 PMCID: PMC2894512 DOI: 10.1093/bioinformatics/btq268
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Binning tree containing sequences from two libraries (represented by solid and dashed lines). Information about the sequence positions and pairings is stored in the bin records, while the sequence names, DNA and qualities are held in the sequence records.
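The split described in the caption, with lightweight placement data in the bins and bulky per-read data in separate sequence records, can be illustrated with a small sketch. This is not Gap5's actual implementation: the class names (`SeqRecord`, `RangeEntry`, `Bin`), the halving rule, and the `min_size` cutoff are all assumptions made for illustration. Each read is stored in the smallest bin that fully contains it, so a region query only visits bins that overlap the region and never touches names, bases or qualities until a read has to be drawn.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SeqRecord:
    """Bulky per-read data (hypothetical layout), fetched only when needed."""
    name: str
    bases: str
    quals: List[int]


@dataclass
class RangeEntry:
    """Lightweight placement data kept inside a bin (hypothetical layout)."""
    start: int                      # inclusive contig coordinate
    end: int                        # inclusive contig coordinate
    seq_id: int                     # key into the sequence-record store
    mate_id: Optional[int] = None   # pairing information, if any


@dataclass
class Bin:
    start: int
    end: int
    entries: List[RangeEntry] = field(default_factory=list)
    left: Optional["Bin"] = None
    right: Optional["Bin"] = None

    def insert(self, e: RangeEntry, min_size: int = 1024) -> None:
        """Store the entry in the smallest bin that fully contains it."""
        mid = (self.start + self.end) // 2
        if self.end - self.start + 1 > min_size:
            if e.end <= mid:
                if self.left is None:
                    self.left = Bin(self.start, mid)
                self.left.insert(e, min_size)
                return
            if e.start > mid:
                if self.right is None:
                    self.right = Bin(mid + 1, self.end)
                self.right.insert(e, min_size)
                return
        # Either the bin is already small, or the read straddles the midpoint.
        self.entries.append(e)

    def query(self, qs: int, qe: int) -> List[RangeEntry]:
        """Return entries overlapping [qs, qe], pruning non-overlapping bins."""
        if qe < self.start or qs > self.end:
            return []
        hits = [e for e in self.entries if e.start <= qe and e.end >= qs]
        for child in (self.left, self.right):
            if child is not None:
                hits.extend(child.query(qs, qe))
        return hits


# Sequence records live in a separate store, keyed by seq_id.
seqs: Dict[int, SeqRecord] = {
    1: SeqRecord("read1", "ACGT" * 11, [30] * 44),
    2: SeqRecord("read2", "TTGA" * 11, [25] * 44),
}
root = Bin(0, 65535)
root.insert(RangeEntry(100, 143, seq_id=1, mate_id=2))
root.insert(RangeEntry(32760, 32803, seq_id=2, mate_id=1))

# Positions and pairings come from the bins; names are looked up afterwards.
visible = root.query(0, 200)
names = [seqs[e.seq_id].name for e in visible]
```

In this sketch a template-style overview would need only the `RangeEntry` data, while an editor view would additionally dereference `seq_id` for the handful of reads on screen.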
Fig. 2. Contig editor, showing quality values by gray scales and mismatches to the consensus by base color.
Fig. 3. Template display showing a mapped assembly with a short insert Illumina library and a long insert capillary library. The Y-axis here shows insert size, while the X-axis is the position within the contig. A genomic insertion is visible at around 5 kb, identified by the jump in average insert size for the Illumina library. Also visible is the filter subwindow. The template colors used are red: inconsistent read-pair orientation; blue: single-ended template; orange: template spanning two contigs; otherwise gray-scale: the mapping quality of the DNA fragments.
Efficiency of opening and viewing an assembly
| Program | Dataset | CPU (s) | Memory (MiB) | File size (bytes) |
|---|---|---|---|---|
| Gap4 | A | 149 | 6784 | 4 823 620 112 |
| Consed | A | 363 | 6270 | 3 838 652 583 |
| EagleView | A | 385 | 10 044 | 2 461 728 347 |
| NGSView | A | 0.2 | 36 | 4 197 720 064 |
| MapView | A | 3.0 | 32 | 558 031 038 |
| IGV | A | 5.2 | 118 | 186 611 223 |
| SAMtools | A | 1.2 | 34 | 186 611 223 |
| Gap5 | A | 0.2 | 15 | 139 030 256 |
| IGV | B | 5.0 | 110 | 43 832 012 709 |
| SAMtools | B | 1.1 | 49 | 43 832 012 709 |
| Gap5 | B | 0.4 | 22 | 32 153 736 504 |
a: Tested on a 32-bit Linux system due to lack of a Mono environment on the main 64-bit test system.
One ‘MiB’ (mebibyte) is 1 048 576 bytes. Dataset A is 6.6 million 44 bp reads mapped to a single 44 Mb contig. Dataset B is 1.1 billion reads (mostly 36 bp) mapped to all human chromosomes. Program versions: EagleView 2.2, Gap5 1.2.7, SAMtools 0.1.7a, MapView 3.4.1, Consed 19.0, Gap4 4.11 and IGV 1.4.
I/O efficiency on dataset A
| Program | Operation | I/O calls | Bytes r/w (KiB) |
|---|---|---|---|
| gap4 | Open + view | 138 928 263 | 3 418 452 |
| gap5 | Open + view | 81 | 116 |
| samtools | Open + view | 9 | 140 |
| gap4 | Move to 20 Mb | 312 | 4 |
| gap5 | Move to 20 Mb | 58 | 101 |
| samtools | Move to 20 Mb | 33 | 221 |
| gap4 | Scroll to 21 Mb | 310 266 | 6288 |
| gap5 | Scroll to 21 Mb | 476 | 4616 |
| samtools | Scroll to 21 Mb | 10 850 | 47 050 |
| gap4 | Break contig | 31 208 502 | 689 953 |
| gap5 | Break contig | 1794 | 823 |
| gap4 | Join contig | 79 387 624 | 1 653 908 |
| gap5 | Join contig | 187 | 2 |
I/O operations showing the number of I/O calls (lseek, read, write, pread, pwrite) for opening the database and displaying the first contig, moving to position 20 Mb in the contig, scrolling to 21 Mb in 1 kb increments, breaking the contig in two at 20 Mb and joining it together again. For a more complete breakdown of the I/O calls used, see the Supplementary Material.
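To make the two measured quantities concrete (number of I/O calls and KiB read or written), here is a minimal sketch of a counting wrapper around a file object. How the figures in the table were actually collected is not stated here; the `CountingFile` class below is hypothetical and tallies Python-level `seek`, `read` and `write` calls rather than operating-system calls such as `lseek` or `pread`.

```python
import io


class CountingFile:
    """Wrap a binary file object and tally seek/read/write calls and bytes."""

    def __init__(self, raw):
        self.raw = raw
        self.calls = {"seek": 0, "read": 0, "write": 0}
        self.bytes_moved = 0  # bytes read plus bytes written

    def seek(self, offset, whence=io.SEEK_SET):
        self.calls["seek"] += 1
        return self.raw.seek(offset, whence)

    def read(self, size=-1):
        self.calls["read"] += 1
        data = self.raw.read(size)
        self.bytes_moved += len(data)
        return data

    def write(self, data):
        self.calls["write"] += 1
        written = self.raw.write(data)
        self.bytes_moved += written
        return written

    @property
    def total_calls(self):
        return sum(self.calls.values())

    @property
    def kib_moved(self):
        return self.bytes_moved / 1024


# A 1 MiB in-memory "database" stands in for a real file.
f = CountingFile(io.BytesIO(bytes(1 << 20)))
f.seek(512 * 1024)          # jump to the middle, like "move to 20 Mb"
block = f.read(4096)        # fetch one block around the new position
f.seek(0)
f.write(b"\x01" * 1024)     # a small in-place update
```

A format that supports random access should show call counts that depend on the size of the viewed region rather than on the size of the whole file, which is the pattern the gap5 and samtools rows display.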
I/O efficiency on dataset B
| Program | Operation | I/O calls | Bytes r/w (KiB) |
|---|---|---|---|
| gap5 | Open + view | 339 | 774 |
| samtools | Open + view | 146 | 8516 |
| gap5 | Move to 100 Mb | 76 | 179 |
| samtools | Move to 100 Mb | 15 | 138 |
| gap5 | Scroll to 101 Mb | 645 | 10 373 |
| samtools | Scroll to 101 Mb | 12 192 | 81 560 |
| gap5 | Break contig | 2859 | 805 |
| gap5 | Join contig | 228 | 135 |
| gap5 | Substitution | 145 | 52 |
| gap5 | Insertion | 1047 | 104 |
I/O operations showing the number of I/O calls (lseek, read, write, pread, pwrite) with dataset B. The contig viewed was Chromosome 1. Breaking and joining contig measurements were averaged over 10 contigs, for Chr4 to Chr13. The substitution and insertion tests were averaged from single base edits at 10 locations spread over ChrX.
Data compression of dataset A
| File format | Compression tool | File size (bytes) |
|---|---|---|
| sam | – | 885 524 410 |
| bam | (bgzf) | 186 486 871 |
| sam | gzip | 179 625 250 |
| sam | 7zip | 144 426 218 |
| gap5 | (zlib) | 137 783 736 |
| gap5 | (lzma2) | 115 331 272 |
| sam | paq8o9 | 86 875 700 |
Compression tools listed in parentheses denote algorithms internal to either SAMtools or Gap5. All others are external command-line tools.
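The file sizes above are easier to compare as ratios against the uncompressed SAM file. The short sketch below only restates the table's figures; the dictionary keys and the `ratio` helper are names chosen for this example.

```python
# File sizes copied from the table above.
UNCOMPRESSED_SAM = 885_524_410

sizes = {
    "bam (bgzf)": 186_486_871,
    "sam + gzip": 179_625_250,
    "sam + 7zip": 144_426_218,
    "gap5 (zlib)": 137_783_736,
    "gap5 (lzma2)": 115_331_272,
    "sam + paq8o9": 86_875_700,
}


def ratio(size: int) -> float:
    """Compression ratio relative to the uncompressed SAM file."""
    return UNCOMPRESSED_SAM / size


ratios = {name: ratio(size) for name, size in sizes.items()}

# Size of the zlib-compressed gap5 database relative to the BAM file.
gap5_vs_bam = sizes["gap5 (zlib)"] / sizes["bam (bgzf)"]

for name, r in ratios.items():
    print(f"{name:14s} {r:5.2f}x")
```

On these figures the zlib-compressed Gap5 database is roughly a quarter smaller than the BAM file, and only the external paq8o9 tool exceeds a tenfold reduction.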
File size by content type, dataset A
| Data type | File size (%) | Bits per seq. | Bits per base |
|---|---|---|---|
| bin/range | 4.7 | 7.75 | 0.18 |
| Seq bases | 23.5 | 38.83 | 0.88 |
| Seq quality | 42.6 | 70.36 | 1.60 |
| Seq name | 25.6 | 42.28 | 0.96 |
| Seq other | 3.7 | 6.08 | 0.14 |
File sizes from tg_index -z 16384 -d data_type. ‘Seq other’ here is a general per-sequence overhead. The ‘bin/range’ type includes everything needed to draw the Template Display window: sequence positions, mapping quality and read pairings.
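The three numeric columns of this table are linked by simple arithmetic, which the sketch below reproduces. It assumes that every read in dataset A is 44 bp long, as stated in the dataset description, so that bits per base is bits per sequence divided by 44, and that the percentage column is each type's share of the summed bits per sequence.

```python
READ_LENGTH = 44  # bp per read in dataset A

# "Bits per seq." column copied from the table above.
bits_per_seq = {
    "bin/range": 7.75,
    "Seq bases": 38.83,
    "Seq quality": 70.36,
    "Seq name": 42.28,
    "Seq other": 6.08,
}

total_bits = sum(bits_per_seq.values())

# Bits per base: spread each per-sequence cost over the 44 bases of a read.
bits_per_base = {k: v / READ_LENGTH for k, v in bits_per_seq.items()}

# File size share: each data type as a percentage of the total.
share_percent = {k: 100 * v / total_bits for k, v in bits_per_seq.items()}

for k in bits_per_seq:
    print(f"{k:12s} {share_percent[k]:5.1f}%  {bits_per_base[k]:.2f} bits/base")
```

Summing the column gives about 165 bits, or just under 21 bytes, per read; quality values take the largest share at about 1.6 bits per base.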