David Williams, William L Trimble, Meghan Shilts, Folker Meyer, Howard Ochman.
Abstract
BACKGROUND: The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies. We demonstrate a simple and rapid means to ascertain the repeat structure and total size of a bacterial or archaeal genome without the need for assembly by directly analyzing the abundances of distinct k-mers among reads.
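The idea of analyzing k-mer abundances directly from reads can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `kmer_spectrum` and the toy reads are hypothetical, and the sketch ignores reverse complements and sequencing errors, which a real analysis would need to handle.

```python
from collections import Counter

def kmer_spectrum(reads, k=21):
    """Return {abundance: number of distinct k-mers observed that many times}."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    # Histogram of the per-k-mer counts: this is the abundance spectrum.
    return Counter(counts.values())

# Toy input with k=4 so the result is easy to check by hand.
toy_reads = ["ACGTACGTAC", "CGTACGTACG"]
toy_spectrum = dict(kmer_spectrum(toy_reads, k=4))
```

With real short-read data at sufficient depth, single-copy sequence would be expected to form a peak near the mean k-mer coverage and repeats to form peaks near its multiples.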
Year: 2013 PMID: 23924250 PMCID: PMC3751351 DOI: 10.1186/1471-2164-14-537
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Abundance histograms of icosihenamers (21-mers) for five strains (A–E) of E. coli. Black lines represent the total number of distinct 21-mers at each abundance value (as present in the Illumina short-read dataset for a strain), and red lines are the best-fit model for each of the empirical 21-mer spectra. To enlarge the area of the plot occupied by peaks, the number of distinct 21-mers at each abundance is multiplied by that abundance. This transformation does not affect the model fitting, so the estimates of repeat structure and genome size are unchanged. Panel labels are as follows, A: strain A_03_34; B: strain B_04_28; C: strain C_04_22; D: strain D_04_27; E: strain E_01_37.
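The caption mentions a best-fit model and an abundance-weighted display. One plausible way to picture this is a mixture of Poisson components, one per copy number, with the component for copy number c centered at c times the single-copy coverage. The sketch below is an assumption made for illustration, not the authors' fitted model: `mixture_spectrum`, `weight_by_abundance`, the coverage value 30.5, and the two copy-number classes are all hypothetical.

```python
import math

def poisson_pmf(x, mean):
    # Evaluated in log space so large x does not overflow.
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def mixture_spectrum(copy_amounts, depth, max_abundance):
    """Expected number of distinct k-mers at abundances 1..max_abundance.

    copy_amounts: {copy_number: number of distinct k-mers at that copy number}
    depth: assumed mean k-mer coverage of single-copy sequence
    """
    return [
        sum(n * poisson_pmf(x, c * depth) for c, n in copy_amounts.items())
        for x in range(1, max_abundance + 1)
    ]

def weight_by_abundance(spectrum):
    """Display transform from the caption: count times abundance (1-based)."""
    return [x * y for x, y in enumerate(spectrum, start=1)]

# Illustrative values only: one single-copy class and one two-copy class.
example_amounts = {1: 4_650_095, 2: 34_059}
model = mixture_spectrum(example_amounts, depth=30.5, max_abundance=150)
weighted_model = weight_by_abundance(model)
peak_abundance = model.index(max(model)) + 1
```

Because the same weighting would be applied to both the observed and the modeled spectrum, it changes how the curves look without changing which parameters describe the data.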
Size and repeat structure of genomes estimated by icosihenamer (21-mer) analysis
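| Copy number | A_03_34 (nt) | A_03_34 (total) | B_04_28 (nt) | B_04_28 (total) | C_04_22 (nt) | C_04_22 (total) | D_04_27 (nt) | D_04_27 (total) | E_01_37 (nt) | E_01_37 (total) | DH1 (nt) | DH1 (total) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|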
| 1× | 4,650,095 | 4,650,095 | 4,834,774 | 4,834,774 | 4,590,007 | 4,590,007 | 5,002,844 | 5,002,844 | 4,836,194 | 4,836,194 | 4,494,886 | 4,494,886 |
| 2× | 34,059 | 68,119 | 5,364 | 10,728 | 177,520 | 355,041 | 45,770 | 91,541 | 111,882 | 223,764 | 14,578 | 29,196 |
| 3× | 3,550 | 10,649 | 8,511 | 25,532 | 25,052 | 75,156 | 9,630 | 28,890 | 34,198 | 102,595 | 6,959 | 20,877 |
| 4× | 1,158 | 4,632 | 6,296 | 25,183 | 8,270 | 33,079 | 2,235 | 8,939 | 30,549 | 122,197 | 2,072 | 8,288 |
| 5× | 845 | 4,223 | 24 | 119 | 5,855 | 29,277 | 447 | 2,236 | 8,777 | 43,887 | 1,874 | 9,370 |
| 6× | 0 | 0 | 286 | 1,714 | 2,271 | 13,627 | 196 | 1,175 | 4,611 | 27,665 | 1,415 | 8,490 |
| 7× | 2,208 | 15,455 | 1,283 | 8,982 | 2,786 | 19,505 | 5,489 | 38,420 | 8,167 | 57,168 | 5,132 | 35,924 |
| 8× | 2,566 | 20,530 | 3,311 | 26,486 | 4,028 | 32,222 | 3,170 | 25,360 | 3,301 | 26,405 | 213 | 1,704 |
| 9× | 0 | 0 | 41 | 365 | 1 | 13 | 99 | 887 | 662 | 5,961 | 23 | 207 |
| 10× | 0 | 0 | 6 | 64 | 424 | 4,242 | 6 | 56 | 989 | 9,888 | 26 | 260 |
| 11×–20× | 107 | 1,669 | 24 | 329 | 679 | 10,796 | 1,331 | 17,155 | 3,080 | 45,197 | 1,240 | 18,757 |
| 21×–79× | 14 | 333 | 32 | 835 | 709 | 17,629 | 70 | 1,884 | 68 | 1,590 | 62 | 2,728 |
| Cumulative totals: | | 4,775,705 | | 4,935,111 | | 5,180,594 | | 5,219,387 | | 5,502,511 | | 4,630,687 |
(a) Each row gives the number of nucleotides present at that copy number and the total amount of genome sequence they account for, inferred from the mixed Poisson model fit to each peak of the 21-mer spectrum of short reads for each of the novel E. coli strains (Figure 1A–E), and from direct counts of 21-mers in the E. coli DH1 genome sequence.
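As a worked check of how the table fits together, the cumulative total for a genome is the sum of the total-sequence values over all copy-number classes. The sketch below uses the values from the first pair of data columns; the variable names are hypothetical, and the repeat fraction is a derived illustration rather than a figure reported in the source.

```python
# Total genome sequence per copy-number class, copied from the first
# genome's second (total) column of the table above.
total_by_copy_class = {
    "1x": 4_650_095,
    "2x": 68_119,
    "3x": 10_649,
    "4x": 4_632,
    "5x": 4_223,
    "6x": 0,
    "7x": 15_455,
    "8x": 20_530,
    "9x": 0,
    "10x": 0,
    "11x-20x": 1_669,
    "21x-79x": 333,
}

# Estimated genome size: sum over all copy-number classes.
genome_size = sum(total_by_copy_class.values())

# Share of the genome made up of sequence present in two or more copies.
repeat_fraction = 1 - total_by_copy_class["1x"] / genome_size

print(f"genome size: {genome_size:,} nt")
print(f"repeat fraction: {repeat_fraction:.2%}")
```

The sum reproduces the first entry of the cumulative totals row, which suggests the totals are taken over the total-sequence column of each genome.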
Total genome sizes of five strains estimated by PFGE and icosihenamer analysis
| Method (sizes in Mb) | A_03_34 | B_04_28 | C_04_22 | D_04_27 | E_01_37 |
|---|---|---|---|---|---|
| PFGE (a) | 4.869 | 5.047 | 5.149 | 5.423 | 5.268 |
| 21-mer analysis of sequencing reads | 4.776 | 4.935 | 5.181 | 5.219 | 5.502 |
(a) Restriction digests of genomic DNAs with I-CeuI endonuclease yielded seven restriction fragments for each genome, the sizes of which are listed in Additional file 3: Table S3.
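To make the agreement between the two methods concrete, the relative difference of each 21-mer estimate from the corresponding PFGE estimate can be computed from the table values. This is a small arithmetic sketch; the variable names are hypothetical and the percentages are derived here, not quoted from the source.

```python
# Genome sizes in Mb, copied from the table above, in column order.
pfge_mb = [4.869, 5.047, 5.149, 5.423, 5.268]
kmer_mb = [4.776, 4.935, 5.181, 5.219, 5.502]

# Signed difference of the 21-mer estimate relative to PFGE, in percent.
percent_diff = [
    100.0 * (kmer - pfge) / pfge
    for pfge, kmer in zip(pfge_mb, kmer_mb)
]

for column, diff in enumerate(percent_diff, start=1):
    print(f"genome {column}: {diff:+.2f}%")

largest_gap = max(abs(d) for d in percent_diff)
```

On these values the two methods differ by less than 5% for every genome, with the 21-mer estimate falling below PFGE in three cases and above it in two.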
Figure 2. Accuracy of genome sizes inferred from icosihenamer (21-mer) abundances. Blue circles indicate genome sizes based on 21-mer abundance analysis of short-read datasets compared to published lengths of the same genomes. Red circles indicate genome sizes based on 21-mer abundance analysis compared to genome sizes for the identical strains estimated by PFGE. (See Additional file 4: Table S4 for the list of genomes used in this analysis.)