Hsueh-Ting Chu, William W L Hsiao, Theresa T H Tsao, D Frank Hsu, Chaur-Chin Chen, Sheng-An Lee, Cheng-Yan Kao.
Abstract
BACKGROUND: Recent studies on genome assembly from short-read sequencing data have reported that this technology cannot reconstruct an entire genome even at very high depth of coverage. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.
Year: 2013 PMID: 23544073 PMCID: PMC3609794 DOI: 10.1371/journal.pone.0059484
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Model of typical short-read sequencing.
(a) The target sequence is randomly broken into fragments, which are filtered by length to form a sequencing library. (b) One or both ends of the DNA fragments are sequenced in parallel to generate a massive set of short reads. We assume that sequencing is random, so that every position is covered by roughly the same number of fixed-length reads.
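The random-sequencing assumption in the Figure 1 caption can be illustrated with a small simulation. This is only a sketch: the function name `simulate_read_coverage`, the toy sequence length, the number of reads, and the fixed seed are illustrative assumptions, not values from the source.

```python
import random


def simulate_read_coverage(seq_len, read_len, n_reads, seed=0):
    """Sample read start positions uniformly and return the per-position depth."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    depth = [0] * seq_len
    last_start = seq_len - read_len      # last position where a full read fits
    for _ in range(n_reads):
        start = rng.randint(0, last_start)   # randint is inclusive on both ends
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


# Hypothetical toy values: 20,000 reads of 30 bp over a 10,000 bp target.
depth = simulate_read_coverage(seq_len=10_000, read_len=30, n_reads=20_000)
mean_depth = sum(depth) / len(depth)
print(f"mean depth: {mean_depth:.1f}")   # mean depth: 60.0 (20000 * 30 / 10000)
```

The mean depth is fixed by the number of reads and the read length, while the depth at individual positions fluctuates around it. Positions within one read length of either end receive fewer reads, which is one reason the coverage is only "roughly" equal.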
Comparison of coverage and entropy loss of BAC sequences.
| Clone (BAC no.) | Sequence length (bp) | Sequence coverage (%) | Coverage deficiency (%) | H30 | I30 | ΔH30 | % of repeat |
| AC011809 | 108,767 | 99.98 | 0.02 | 5.335862 | 5.337411 | 0.03% | 0.46% |
| AC002328 | 109,171 | 99.61 | 0.39 | 5.335947 | 5.339022 | 0.06% | 0.99% |
| AC064879 | 109,180 | 99.68 | 0.32 | 5.335971 | 5.339058 | 0.06% | 0.97% |
| AC023673 | 109,367 | 99.91 | 0.09 | 5.338616 | 5.339801 | 0.02% | 0.38% |
| AC011713 | 109,694 | 98.85 | 1.15 | 5.332009 | 5.341098 | 0.17% | 2.91% |
| AC009243 | 110,565 | 100.00 | 0.00 | 5.344373 | 5.344534 | 0.00% | 0.02% |
| AC022520 | 110,611 | 99.63 | 0.37 | 5.341812 | 5.344714 | 0.05% | 0.92% |
| AC007764 | 111,222 | 99.84 | 0.16 | 5.346664 | 5.347107 | 0.01% | 0.13% |
| AC000348 | 111,566 | 98.78 | 1.22 | 5.339937 | 5.348449 | 0.16% | 2.81% |
| AC092191 | 80,919 | 99.52 | 0.48 | 5.205896 | 5.208925 | 0.06% | 1.01% |
| AC185533 | 95,808 | 98.46 | 1.54 | 5.271617 | 5.2823 | 0.20% | 3.41% |
| AC018478 | 103,809 | 99.90 | 0.10 | 5.315052 | 5.317144 | 0.04% | 0.56% |
| AC092242 | 111,023 | 100.00 | 0.00 | 5.346276 | 5.346329 | 0.00% | 0.01% |
| AC185534 | 119,461 | 99.42 | 0.58 | 5.37266 | 5.378151 | 0.10% | 1.75% |
| AC092399 | 122,013 | 99.92 | 0.08 | 5.386815 | 5.387333 | 0.01% | 0.17% |
| AC007837 | 123,647 | 99.90 | 0.10 | 5.391958 | 5.393112 | 0.02% | 0.28% |
| AC007329 | 126,140 | 99.99 | 0.01 | 5.401638 | 5.401783 | 0.00% | 0.05% |
Sequence coverage percentages are as listed in Dohm et al. [10].
The programs for the computation are available at: http://sourceforge.net/projects/seqentropy/files/SeqEntropy-demo-20130203.zip.
The columns H30, I30, and ΔH30 are computed by our program “SeqReadEntropy” using a read length of 30 bp.
The column “% of repeat” is computed by our program “SeqReadRepeat” using a read length of 30 bp.
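The table columns can be approximated with a short sketch. It is not the authors' SeqReadEntropy or SeqReadRepeat program, and the definitions are assumptions, since this section does not spell them out: H_k is taken as the base-10 Shannon entropy of all length-k reads on both strands, I_k as log10 of the total number of such reads, ΔH_k as (I_k − H_k)/I_k, and "% of repeat" as the fraction of reads whose sequence occurs more than once. The function names are hypothetical.

```python
import math
from collections import Counter

# Assumes uppercase A/C/G/T only; other symbols pass through unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse-complement strand of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_counts(seq, k):
    """Count every error-free read of length k on both strands."""
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts


def entropy_stats(seq, k):
    """Return (H_k, I_k, relative loss, repeat fraction) under the assumed definitions."""
    counts = read_counts(seq, k)
    total = sum(counts.values())                 # 2 * (L - k + 1) reads
    i_k = math.log10(total)                      # entropy if every read were unique
    h_k = -sum(c / total * math.log10(c / total) for c in counts.values())
    delta = (i_k - h_k) / i_k                    # relative entropy loss
    repeat = sum(c for c in counts.values() if c > 1) / total
    return h_k, i_k, delta, repeat


# Toy input only; real BAC sequences would be read from FASTA files.
h, i, delta, repeat = entropy_stats("ACGGTCAATGCCGGTCAAAT", 5)
print(f"H={h:.4f} I={i:.4f} loss={delta:.2%} repeat={repeat:.2%}")
```

The assumed form of I_k is at least consistent with the table: for AC011809 (108,767 bp), log10(2 × (108,767 − 30 + 1)) ≈ 5.337411, which matches the listed I30.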
Five animal genomes for entropy measurement.
| Organism | Genome size (bp) | Version | Computation time |
| Yeast (S. cerevisiae) | 1.2×10^7 | sacCer3 | 1.3 minutes |
| Nematode (C. elegans) | 1.0×10^8 | ce10 | 33 minutes |
| Fruit fly (D. melanogaster) | 1.3×10^8 | dmel_r5.42 | 42 minutes |
| Zebrafish (D. rerio) | 1.4×10^9 | danRer7 | 66 hours |
| Human (H. sapiens) | 3.2×10^9 | hg19, GRCh37.p5 | 295 hours |
The whole-genome sequences were downloaded from http://hgdownload.cse.ucsc.edu/ for S. cerevisiae, C. elegans, D. rerio, and H. sapiens, and from ftp://ftp.flybase.net/ for D. melanogaster.
The computation time of the entropy measurement was recorded for a read length of 100 bp on a PC with an Intel i7-3820 CPU and 8 GB of RAM.
Figure 2. Entropy losses at different read lengths for different organisms.
Among the five organisms, the genomes of zebrafish (D. rerio) and fruit fly (D. melanogaster) lose more entropy than the others at every read length. In particular, the fruit fly genome still loses >2% of its entropy at a read length of 120 bp; the loss falls below 1% only at a read length of 330 bp. In contrast, the genomes of yeast (S. cerevisiae) and nematode (C. elegans) show only minor entropy loss even with very short reads. The detailed results of the entropy measurements are listed in the table of relative entropy losses at different read lengths.
The relative entropy losses of five animal genomes at different read lengths.
| Read length k (bp) | ΔHk of Yeast | ΔHk of Nematode | ΔHk of Fruit fly | ΔHk of Zebrafish | ΔHk of Human |
| 20 | | 1.397796% | 5.627856% | 7.44469% | 5.406936% |
| 30 | 0.736196% | | 4.852753% | 4.461654% | 2.858026% |
| 40 | 0.683882% | 0.642178% | 4.337883% | 3.11601% | 1.684438% |
| 50 | 0.644228% | 0.522634% | 3.932514% | 2.363446% | 1.072922% |
| 60 | 0.611417% | 0.440682% | 3.599801% | 1.886473% | |
| 70 | 0.582789% | 0.380003% | 3.319438% | 1.556083% | 0.482622% |
| 80 | 0.557695% | 0.332944% | 3.078671% | 1.315029% | 0.379953% |
| 90 | 0.536264% | 0.295774% | 2.868949% | 1.134327% | 0.313174% |
| 100 | 0.517069% | 0.265546% | 2.684404% | | 0.2661318% |
| 110 | 0.499602% | 0.240705% | 2.520669% | 0.887384% | 0.2306789% |
| 120 | 0.483514% | 0.219947% | 2.374214% | 0.799316% | 0.2028979% |
| 320 | _ | _ | 1.007846% | _ | _ |
| 330 | _ | _ | | _ | _ |
The bold numbers show the relative entropy loss values and the corresponding minimal read lengths at which the relative entropy losses are below 1% for different animal genomes.
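Finding the minimal read length at which the loss drops below a threshold is a simple scan over a column of the table. The sketch below uses the fruit fly values listed above; the helper name `minimal_read_length` and the 3% threshold are illustrative assumptions (the source uses 1%, which none of these tabulated fruit fly values reach).

```python
def minimal_read_length(losses, threshold):
    """Return the smallest tabulated read length whose loss is below threshold.

    `losses` maps read length (bp) to relative entropy loss (%).
    Returns None when no tabulated value falls below the threshold.
    """
    for read_len in sorted(losses):
        if losses[read_len] < threshold:
            return read_len
    return None


# Fruit fly column of the table (relative entropy loss in percent).
fruit_fly_loss = {
    20: 5.627856, 30: 4.852753, 40: 4.337883, 50: 3.932514,
    60: 3.599801, 70: 3.319438, 80: 3.078671, 90: 2.868949,
    100: 2.684404, 110: 2.520669, 120: 2.374214, 320: 1.007846,
}

print(minimal_read_length(fruit_fly_loss, threshold=3.0))  # 90
print(minimal_read_length(fruit_fly_loss, threshold=1.0))  # None
```

The scan only considers read lengths that were actually tabulated, so the result is an upper bound on the true minimal length whenever the table skips intermediate values.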
Entropy loss of selected prokaryotic whole genomes with reads of lengths 36, 500 and 1000 bp.
| Seq. no. | Organism | Sequence length (bp) | ΔH36 | ΔH500 | ΔH1000 |
| NC_000913 | E. | 4,639,675 | 0.22% | 0.09% | 0.04% |
| NC_004663 | B. | 6,260,361 | 0.15% | 0.09% | 0.05% |
| NC_008525 | P. | 1,832,387 | 0.16% | 0.11% | 0.08% |
| NC_000908 | M. | 580,076 | 0.11% | 0.00% | 0.00% |
Figure 3. Histograms and quartile box plot of relative entropy losses in 2725 prokaryotic replicons.
The x-axis shows the number of replicons in each bin, while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the median (50%), the first (25%) and third (75%) quartiles (the boxes), and the entire range (the whiskers). The vast majority of the replicons lose <1% entropy regardless of the read length.
Figure 4. Histograms and quartile box plot of entropy losses in 2725 prokaryotic replicons, truncated at 1% entropy loss to show the finer breakdown.
The x-axis shows the number of replicons in each bin, while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the median (50%), the first (25%) and third (75%) quartiles (the boxes), and the entire range (the whiskers). As the read length increases, the entropy loss decreases, so a larger number of replicons have ΔH <1.0%.
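The quantities shown in the quartile box plots (mean, median, quartiles, range) can be computed as sketched below. The input is only the ten ΔH125 values from the table of prokaryotic chromosomes with the largest entropy losses, used here as a stand-in sample; it is not the full set of 2725 replicons. The helper name `box_plot_summary` and the choice of the "inclusive" quantile method are assumptions, not the authors' procedure.

```python
import statistics


def box_plot_summary(values):
    """Return the statistics a quartile box plot displays."""
    # n=4 yields the three cut points: first quartile, median, third quartile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


# The ten largest ΔH125 values (%) listed for prokaryotic chromosomes.
delta_h125 = [1.79136, 1.82280, 1.98404, 2.04042, 2.09366,
              2.40508, 2.64465, 2.67584, 2.75548, 4.62902]

summary = box_plot_summary(delta_h125)
for name, value in summary.items():
    print(f"{name}: {value:.5f}")
```

In this sample the mean lies above the median because a single large value pulls it upward, which is why a box plot marks the mean and the median separately.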
Prokaryotic chromosomes with largest entropy losses at read lengths of 125, 500 and 3000 bp.
| Genome ID | ΔH125 | Genome ID | ΔH500 | Genome ID | ΔH3000 |
| Bordetella pertussis Tohama I | 1.79136% | Bordetella pertussis Tohama I | 0.898868% | Mycoplasma agalactiae | 0.367427% |
| Xanthomonas oryzae pv. oryzae PXO99A | 1.82280% | Mycoplasma fermentans M64 chromosome | 0.934389% | Dehalococcoides ethenogenes 195 | 0.394745% |
| Wolbachia sp. wRi | 1.98404% | Acinetobacter baumannii SDF | 0.934389% | Mycoplasma fermentans M64 | 0.415779% |
| Aliivibrio salmonicida LFI1238 chrom 1 | 2.04042% | Mycoplasma mycoides subsp. mycoides SC str. PG1 | 0.940714% | Orientia tsutsugamushi Boryong | 0.416469% |
| Shigella boydii CDC 3083-94 | 2.09366% | Wolbachia endosymbiont of Culex quinquefasciatus Pel | 1.028366% | Methylobacillus flagellatus KT | 0.42687% |
| Shigella dysenteriae Sd197 | 2.40508% | Aliivibrio salmonicida LFI1238 chrom 1 | 1.153796% | Wolbachia sp. wRi | 0.43966% |
| Acinetobacter baumannii SDF | 2.64465% | Wolbachia sp. wRi | 1.285340% | Alteromonas macleodii | 0.465815% |
| Orientia tsutsugamushi str. Ikeda | 2.67584% | Aliivibrio salmonicida LFI1238 chrom 2 | 1.334486% | Bartonella tribocorum CIP105476 | 0.484289% |
| Mycoplasma mycoides subsp. mycoides SC str. PG1 | 2.75548% | Orientia tsutsugamushi str. Ikeda | 1.570541% | Streptococcus agalactiae NEM316 | 0.49117% |
| Orientia tsutsugamushi Boryong | 4.62902% | Orientia tsutsugamushi Boryong | 2.753110% | Candidatus Phytoplasma mali | 0.693043% |
The complete entropy computations of 2725 prokaryotic replicons are listed in .