Lauris Kaplinski, Maarja Lepamets, Maido Remm.
Abstract
BACKGROUND: K-mer-based methods of genome analysis have attracted great interest because they do not require genome assembly and can be performed directly on sequencing reads. Many analysis tasks require one to compare k-mer lists from different sequences to find words that are either unique to a specific sequence or common to many sequences. However, no stand-alone k-mer analysis tool currently allows one to perform these algebraic set operations.
Keywords: K-mers; Next-generation sequencing; Sequence analysis
Year: 2015 PMID: 26640690 PMCID: PMC4669650 DOI: 10.1186/s13742-015-0097-y
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Fig. 1. A schematic overview of the workflow of the programs in the GenomeTester4 package. GListMaker takes FASTA or FASTQ as input and builds a binary list of k-mer counts. GListCompare performs set operations on two k-mer lists and generates a new list as output. GListQuery can be used to look up the counts from a list using a text file, a FASTA/FASTQ file or another k-mer list as input.
Fig. 2. The basic set operations implemented by GListCompare. The default k-mer counts stored in the derived list are the following: for union, the sum of the k-mer counts from both lists; for intersection, the smaller of the k-mer counts in the two lists; and for complement, the k-mer count in the first set. The GListCompare program argument for each set operation is shown below each calculated set.
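The count rules described in Fig. 2 can be illustrated with a minimal Python sketch. This is not GListCompare's implementation: plain dictionaries stand in for the binary k-mer lists, and the function names `count_kmers`, `union`, `intersection` and `complement` are hypothetical, chosen only for this example.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every k-length substring of seq (forward strand only; an assumption)."""
    return dict(Counter(seq[i:i + k] for i in range(len(seq) - k + 1)))


def union(a, b):
    # Union: every k-mer present in either list; count is the sum of both counts.
    return {kmer: a.get(kmer, 0) + b.get(kmer, 0) for kmer in a.keys() | b.keys()}


def intersection(a, b):
    # Intersection: k-mers present in both lists; count is the smaller of the two.
    return {kmer: min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys()}


def complement(a, b):
    # Complement: k-mers present in the first list only; count is taken from the first list.
    return {kmer: n for kmer, n in a.items() if kmer not in b}


# Toy 3-mer lists built from two short sequences.
t1 = count_kmers("ACGTACG", 3)  # ACG occurs twice; CGT, GTA and TAC once each
t2 = count_kmers("TACGG", 3)    # TAC, ACG and CGG once each

assert union(t1, t2) == {"ACG": 3, "CGT": 1, "GTA": 1, "TAC": 2, "CGG": 1}
assert intersection(t1, t2) == {"ACG": 1, "TAC": 1}
assert complement(t1, t2) == {"CGT": 1, "GTA": 1}
```

Dictionaries keep the sketch short; they would not scale to the list sizes reported for real genomes.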
Fig. 3. The complement from union function within GListCompare. First, a union of two or more lists is calculated that includes the sample-specific list T1. In this example, this list is T1 ∪ T2. Typically, this union will include k-mers from many species or strains. Next, we find the intersection between T1 and this composite union. We define this as the intersection that only includes k-mers that have the same count in both lists. The result is a list of k-mers that are unique to T1. Note that in this case the resulting list is the same as that calculated using the complement function in the example given in Fig. 2.
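The two steps described in Fig. 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than GListCompare's code: k-mer lists are modelled as dictionaries of positive counts, and `union` and `complement_from_union` are hypothetical names.

```python
from functools import reduce


def union(a, b):
    # Union with summed counts, following the default described for Fig. 2.
    return {kmer: a.get(kmer, 0) + b.get(kmer, 0) for kmer in a.keys() | b.keys()}


def complement_from_union(t1, others):
    """Return the k-mers of t1 whose count is unchanged in the composite union.

    Step 1: build the union of t1 and all other lists (counts are summed).
    Step 2: keep a k-mer of t1 only if its count in the union equals its
            count in t1, which means no other list contributed to it.
    """
    composite = reduce(union, others, dict(t1))
    return {kmer: n for kmer, n in t1.items() if composite[kmer] == n}


# Toy 3-mer lists (hypothetical data).
t1 = {"ACG": 2, "CGT": 1, "GTA": 1, "TAC": 1}
t2 = {"TAC": 1, "ACG": 1, "CGG": 1}

unique_to_t1 = complement_from_union(t1, [t2])
assert unique_to_t1 == {"CGT": 1, "GTA": 1}
```

In this toy case the result matches a direct complement of T1 against T2. One reason to go through the union is that a single composite list covering many species or strains can be reused for each sample-specific list.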
Comparison of the 32-mer counting speeds of GenomeTester4, Jellyfish 2.2.0, KMC 2.2 and DSK 2.0.7 with a single thread and 24 threads
| Threads | Program | 1 file, 4.7 Mbp | 24 files, 3.0 Gbp | 2 files, 86 Mbp | 2 files, 2.55 Gbp |
|---|---|---|---|---|---|
| 1 thread | GListMaker | 0.9 | 978.47 | 16.61 | 764.17 |
| | Jellyfish | 3.64 | 1923.99 | 43.1 | 1222.48 |
| | KMC | 0.76 | 354.32 | 11.22 | 338.52 |
| | DSK | 2.52 | 102.41 | 14.41 | 361.50 |
| 24 threads | GListMaker | 1.52 | 243.44 | 12.15 | 261.7 |
| | Jellyfish | 0.54 | 112.52 | 5.65 | 100.84 |
| | KMC | 0.25 | 41.47 | 2.48 | 57.19 |
| | DSK | 2.63 | 60.76 | 3.51 | 59.16 |
All measurements are means of five runs and are given in seconds. For KMC, the RAM-only version was used to speed up the counting; for DSK, the RAM limit was 200 GiB.
Comparison of the peak memory consumption of GenomeTester4, Jellyfish 2.2.0, KMC 2.2 and DSK 2.0.7 while counting 32-mers with 24 threads
| Program | Source: 24 files, 3.0 Gbp | Source: 2 files, 2.55 Gbp |
|---|---|---|
| GListMaker | 64 GiB | 24 GiB |
| Jellyfish | 23 GiB | 9 GiB |
| KMC | 11 GiB | 11 GiB |
| DSK | 32 GiB | 2.7 GiB |
For KMC, the RAM-only version was used to speed up the counting
GListCompare running times for the generation of different bacterial datasets
| List | List size | Number of unique k-mers | Generation time |
|---|---|---|---|
| 32-mer lists of all bacteria | | | 2 h 8 m |
| Union list of all bacteria | 64 GiB | 5,676,675,273 | 2 h 30 m |
| | 1.4 KiB | 115 | 75.2 s |
| | 4 MiB | 348,970 | 79.2 s |
| | 922 KiB | 78,667 | 77.3 s |