Abstract
BACKGROUND: Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is considerable room for improvement in cluster quality. Motivated by this opportunity, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning, and its strong theoretical foundation guarantees convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward. In this work, we scale up the algorithm by adapting an out-of-core strategy while utilizing alignment-free identity scores in a new tool: MeShClust v3.0.
Keywords: Alignment-free; Clustering; DNA; Machine learning; Mean shift; Sequence analysis; Software; Unsupervised learning
Year: 2022 PMID: 35668366 PMCID: PMC9171953 DOI: 10.1186/s12864-022-08619-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Fig. 1 Overview of the first data pass. MeShClust v3.0 is based on the mean shift algorithm, which is an instance of unsupervised learning. The scaled-up MeShClust v3.0 is also an instance of out-of-core learning [35], in which the learning algorithm is trained on separate batches of the training data consecutively. The algorithm requires multiple passes through the input data. In the first data pass, the tool reads a batch of input sequences. Then the mean shift algorithm (all four steps) is run on the batch until convergence. Sequences that cannot be assigned to any center are kept in the reservoir. Next, a new batch is read. The main mean shift is run on this batch but without the initialization step and for one iteration only, i.e., already found centers are shifted and merged on the new batch and no new centers are discovered. Sequences that cannot be assigned to any of the centers are added to the reservoir. When the reservoir has enough sequences (more than the batch size), the sequences in it are shuffled and a batch of them is clustered using an independent instance of the mean shift algorithm. This instance is run until convergence. The resulting centers (if any) are merged with the centers accumulated by the main mean shift. This procedure is repeated until all sequences are read and the reservoir is empty. In subsequent passes, the algorithm rereads the input sequences batch by batch. The main mean shift algorithm is run for one iteration on each batch. If the number of clusters does not change during a pass, the algorithm has converged. In the final data pass, all sequences are reread batch by batch, and each sequence is assigned to the cluster whose center is closest to it
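The control flow in the caption can be illustrated with a toy sketch. It is not the MeShClust v3.0 implementation: it clusters 1-D numbers instead of sequences, uses the absolute difference and a fixed `radius` in place of alignment-free identity scores and a threshold, and uses a plain flat-kernel mean as the shift step. All function names (`shift`, `merge`, `absorb`, `cluster_out_of_core`) are hypothetical and chosen only for illustration.

```python
import random

def shift(centers, batch, radius):
    """One mean shift iteration: move each center to the mean of nearby points."""
    shifted = []
    for c in centers:
        near = [x for x in batch if abs(x - c) <= radius]
        shifted.append(sum(near) / len(near) if near else c)
    return shifted

def merge(centers, radius):
    """Keep one representative of any run of centers closer than `radius`."""
    kept = []
    for c in sorted(centers):
        if not kept or abs(c - kept[-1]) > radius:
            kept.append(c)
    return kept

def run_to_convergence(batch, radius, max_iter=100):
    """Initialize centers from the batch, then shift and merge until stable."""
    centers = merge(batch, radius)
    for _ in range(max_iter):
        updated = merge(shift(centers, batch, radius), radius)
        if updated == centers:
            break
        centers = updated
    return centers

def unassigned(centers, points, radius):
    """Points farther than `radius` from every center."""
    return [x for x in points if all(abs(x - c) > radius for c in centers)]

def absorb(centers, reservoir, radius, batch_size, rng):
    """Cluster one shuffled reservoir batch independently and merge its centers."""
    rng.shuffle(reservoir)
    new_centers = run_to_convergence(reservoir[:batch_size], radius)
    centers = merge(centers + new_centers, radius)
    return centers, unassigned(centers, reservoir, radius)

def cluster_out_of_core(batches, radius, batch_size, max_passes=10, seed=0):
    rng = random.Random(seed)
    # First pass: full mean shift on the first batch only.
    centers = run_to_convergence(batches[0], radius)
    reservoir = unassigned(centers, batches[0], radius)
    for batch in batches[1:]:
        # Later batches: one shift-and-merge iteration, no new centers.
        centers = merge(shift(centers, batch, radius), radius)
        reservoir += unassigned(centers, batch, radius)
        if len(reservoir) > batch_size:
            centers, reservoir = absorb(centers, reservoir, radius, batch_size, rng)
    while reservoir:  # drain what is left once all batches are read
        before = len(reservoir)
        centers, reservoir = absorb(centers, reservoir, radius, batch_size, rng)
        if len(reservoir) == before:
            break  # guard against a stall in this toy version
    # Subsequent passes: stop when a pass leaves the center count unchanged.
    for _ in range(max_passes):
        before = len(centers)
        for batch in batches:
            centers = merge(shift(centers, batch, radius), radius)
        if len(centers) == before:
            break
    # Final pass: assign every point to its closest center.
    labels = [[min(range(len(centers)), key=lambda i: abs(x - centers[i]))
               for x in batch] for batch in batches]
    return centers, labels
```

In this sketch the batches are held in a list for brevity; an out-of-core tool would instead reread them from disk on every pass so that only one batch is in memory at a time.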
Statistics of the real data sets. The 14-bacterial-species data set includes 14 clusters. The viral set includes 9 clusters. Cluster counts in the bacterial, maize LTRs, and human microbiome sets are unknown
| Data set | Sequence count | Total length | Maximum length | Minimum length | Mean length | Median length |
|---|---|---|---|---|---|---|
| 14 bacterial species | 1,328 | 4,256,374,969 | 9,270,175 | 801,203 | 3,205,102 | 2,874,351 |
| Bacterial | 10,562 | 38,577,794,947 | 16,040,666 | 112,031 | 3,652,509 | 3,647,501 |
| LTRs | 253,224 | 346,337,915 | 5,999 | 100 | 1,368 | 1,187 |
| Microbiome | 1,071,335 | 269,374,512 | 372 | 171 | 251 | 256 |
| Viral | 96 | 635,979 | 13,246 | 2,605 | 6,625 | 7,458 |
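The columns of this table are simple length summaries. A minimal sketch of how such a row could be computed from a list of sequences follows; the function name `length_stats` is hypothetical, and rounding the mean to the nearest integer is an assumption that happens to agree with the rows above (for example, 635,979 / 96 rounds to 6,625 for the viral set).

```python
from statistics import median

def length_stats(sequences):
    """Summarize sequence lengths with the same columns as the table."""
    lengths = [len(s) for s in sequences]
    total = sum(lengths)
    return {
        "count": len(lengths),
        "total": total,
        "max": max(lengths),
        "min": min(lengths),
        "mean": round(total / len(lengths)),  # assumed rounding rule
        "median": median(lengths),
    }
```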
Statistics of the synthetic training data sets. To construct a synthetic data set, a specific number of random template sequences is synthesized. The length of each template is chosen at random between the minimum and maximum lengths. A random number (between the minimum and maximum numbers) of mutated copies is generated from each template. All clusters in the same data set have the same minimum identity score. For example, members of the clusters of the Short-97 data set are 97.00–99.99% identical to the templates from which they were generated. Identity scores among templates in the same data set are at most 10% less than the provided minimum identity score. Length is measured in base pairs (bp)
| Data set | Template avg. length (bp) | Template min. length (bp) | Template max. length (bp) | Cluster avg. size | Cluster min. size | Cluster max. size | Cluster count | Sequence count |
|---|---|---|---|---|---|---|---|---|
| Short-97 | 288 | 202 | 396 | 202 | 12 | 400 | 100 | 20,195 |
| Short-95 | 307 | 200 | 400 | 177 | 5 | 400 | 100 | 17,734 |
| Short-90 | 298 | 204 | 399 | 199 | 9 | 400 | 100 | 19,877 |
| Short-80 | 302 | 200 | 400 | 204 | 6 | 392 | 100 | 20,423 |
| Short-70 | 299 | 205 | 400 | 202 | 7 | 395 | 100 | 20,230 |
| Short-60 | 304 | 200 | 399 | 195 | 9 | 395 | 100 | 19,539 |
| Medium-97 | 1,394 | 752 | 1,998 | 192 | 13 | 390 | 100 | 19,215 |
| Medium-95 | 1,358 | 750 | 1,968 | 203 | 7 | 396 | 100 | 20,315 |
| Medium-90 | 1,405 | 759 | 1,977 | 194 | 5 | 400 | 100 | 19,393 |
| Medium-80 | 1,434 | 760 | 2,000 | 222 | 14 | 398 | 100 | 22,208 |
| Medium-70 | 1,345 | 768 | 1,999 | 212 | 8 | 398 | 100 | 21,184 |
| Medium-60 | 1,387 | 771 | 1,993 | 202 | 13 | 398 | 100 | 20,211 |
| Long-97 | 2,677 | 1,520 | 3,983 | 210 | 5 | 398 | 100 | 20,994 |
| Long-95 | 2,611 | 1,508 | 3,959 | 206 | 10 | 400 | 100 | 20,565 |
| Long-90 | 2,677 | 1,530 | 3,969 | 196 | 5 | 400 | 100 | 19,622 |
| Long-80 | 2,859 | 1,528 | 3,990 | 194 | 5 | 398 | 100 | 19,424 |
| Long-70 | 2,830 | 1,512 | 3,993 | 224 | 19 | 399 | 100 | 22,396 |
| Long-60 | 2,630 | 1,519 | 3,977 | 207 | 7 | 398 | 100 | 20,699 |
| Numerous-97 | 272 | 171 | 372 | 203 | 5 | 400 | 5,000 | 1,012,543 |
| Numerous-95 | 272 | 171 | 372 | 203 | 5 | 400 | 5,000 | 1,012,528 |
| Numerous-90 | 271 | 171 | 372 | 204 | 5 | 400 | 5,000 | 1,018,681 |
| Numerous-80 | 271 | 171 | 372 | 203 | 5 | 400 | 5,000 | 1,016,997 |
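The construction described in the caption can be sketched as follows. This is only an illustration under stated assumptions: mutations are substitutions only, identity is measured position by position on equal-length sequences, and the constraint on identity among templates is not enforced. The names `random_dna`, `mutate`, `identity`, and `make_dataset` are hypothetical and do not come from the paper.

```python
import random

def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def mutate(template, target_identity, rng):
    """Substitute bases so the copy is at least `target_identity` identical."""
    seq = list(template)
    n_mut = int(len(seq) * (1.0 - target_identity))  # floor keeps identity >= target
    for i in rng.sample(range(len(seq)), n_mut):
        seq[i] = rng.choice([b for b in "ACGT" if b != seq[i]])
    return "".join(seq)

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def make_dataset(n_templates, min_len, max_len, min_copies, max_copies,
                 min_identity, seed=0):
    """Return a list of (template, mutated_copies) pairs, one per cluster."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_templates):
        template = random_dna(rng.randint(min_len, max_len), rng)
        n_copies = rng.randint(min_copies, max_copies)
        copies = [mutate(template, rng.uniform(min_identity, 0.9999), rng)
                  for _ in range(n_copies)]
        clusters.append((template, copies))
    return clusters
```

For instance, `make_dataset(100, 200, 400, 5, 400, 0.97)` would produce a set shaped roughly like Short-97, with each template acting as the true center of its cluster.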
Evaluations on the synthetic testing data sets. The first set of evaluations was conducted on 12 sets, each of which includes clusters whose members are 80%, 90%, 95%, or 97% identical to a template sequence, i.e., a true center. The second set of evaluations was conducted on six data sets representing clusters of degenerate sequences (e.g., members are 60% or 70% identical to true centers). Each data set in the first and second sets of evaluations includes fewer than 25k sequences. The third set of evaluations was conducted on four data sets, each of which includes more than one million sequences (80%, 90%, 95%, or 97% identical to true centers). All clusters in the same data set have the same minimum identity score. For example, cluster members of the Short-97 data set are 97.00–99.99% identical to the true centers. The direction of the arrow next to each criterion indicates whether a high or a low value is better. We mark MeShClust v3.0 with “auto” when the threshold is estimated automatically; otherwise, a specific threshold is provided to the tool
| Tool | Purity (↑) | Jaccard (↑) | G-Measure (↑) | Cluster quality (↑) | Coverage (↑) | Centers (↑) | Time (↓) | Memory (GB) (↓) |
|---|---|---|---|---|---|---|---|---|
| Short, Medium, and Long: 80–97% | | | | | | | | |
| | 0.92 | 0.19 | 0.33 | 0.29 | 0.92 | 0.01 | 00:71:00 | 0.36 |
| | 0.99 | 0.92 | 0.93 | 0.94 | 0.99 | 0.35 | 00:00:26 | 0.20 |
| | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.78 | 00:05:18 | 6.55 |
| | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 00:12:08 | 6.59 |
| | 0.70 | 0.08 | 0.20 | 0.15 | 0.70 | 0.00 | 00:00:16 | 0.12 |
| Short, Medium, and Long: 60–70% | | | | | | | | |
| | 1.00 | 0.14 | 0.28 | 0.24 | 0.90 | 0.01 | 01:23:46 | 0.35 |
| | 0.98 | 0.93 | 0.96 | 0.96 | 1.00 | 0.53 | 00:00:25 | 0.19 |
| | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.76 | 00:11:44 | 5.79 |
| | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 0.65 | 00:14:32 | 5.83 |
| | 1.00 | 0.22 | 0.34 | 0.34 | 0.83 | 0.01 | 00:00:28 | 0.08 |
| Numerous: 80–97% | | | | | | | | |
| | 1.00 | 0.48 | 0.59 | 0.62 | 1.00 | 0.00 | 00:39:31 | 0.91 |
| | 1.00 | 0.81 | 0.83 | 0.87 | 0.99 | 0.01 | 00:19:04 | 2.58 |
| | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 | 0.15 | 02:58:15 | 12.76 |
| | 1.00 | 0.98 | 0.99 | 0.99 | 0.99 | 0.15 | 02:50:40 | 13.06 |
| | 1.00 | 0.07 | 0.20 | 0.15 | 0.89 | 0.00 | 00:06:41 | 0.72 |
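Purity, Jaccard, and G-Measure compare predicted clusters with the known true clusters. The sketch below uses the standard textbook definitions, which are assumed here rather than taken from this section: purity is the fraction of sequences that belong to the majority true class of their cluster, and the Jaccard index and G-Measure (taken as the geometric mean of pairwise precision and recall) are computed from counts of sequence pairs. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def purity(pred, truth):
    """Fraction of items that fall in the majority true class of their cluster."""
    by_cluster = {}
    for p, t in zip(pred, truth):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(pred)

def pair_counts(pred, truth):
    """Count pairs that are together in both, only in pred, or only in truth."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    return tp, fp, fn

def jaccard(pred, truth):
    tp, fp, fn = pair_counts(pred, truth)
    return tp / (tp + fp + fn)

def g_measure(pred, truth):
    # Geometric mean of pairwise precision and recall (assumed definition).
    tp, fp, fn = pair_counts(pred, truth)
    return tp / sqrt((tp + fp) * (tp + fn))
```

The pair loop is quadratic in the number of sequences, so it suits only small examples; for sets with a million sequences the same counts would normally be derived from a contingency table of cluster and class sizes.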
Evaluations on real data sets. The microbiome set was clustered with an identity score of 97%. The LTRs set was clustered with an identity score of 70%. The direction of the arrow next to each criterion indicates whether a high or a low value is better
| Tool | Dunn (↑) | Davies-Bouldin (↓) | Silhouette (↑) | Intra (↑) | Inter (↓) | Cluster quality (↑) | Coverage (↑) | Time (↓) | Memory (GB) (↓) |
|---|---|---|---|---|---|---|---|---|---|
| Microbiome | | | | | | | | | |
| | 0.01 | 3.39 | 0.24 | 0.94 | 0.94 | 0.16 | 0.99 | 00:01:45 | 0.93 |
| | 0.02 | 1.50 | 0.43 | 0.96 | 0.90 | 0.25 | 0.99 | 00:05:01 | 2.56 |
| | 0.27 | 0.77 | 0.74 | 0.97 | 0.90 | 0.50 | 0.96 | 01:41:06 | 15.08 |
| | 0.01 | 5.50 | -0.14 | 0.94 | 0.96 | 0.12 | 0.98 | 00:00:31 | 0.42 |
| LTRs | | | | | | | | | |
| | 0.02 | 1.81 | 0.05 | 0.78 | 0.52 | 0.29 | 1.00 | 03:47:05 | 0.75 |
| | 0.14 | 1.13 | 0.33 | 0.86 | 0.52 | 0.51 | 1.00 | 00:02:57 | 1.75 |
| | 0.98 | 0.86 | 0.47 | 0.88 | 0.58 | 0.79 | 0.94 | 02:01:22 | 16.24 |
| | 0.02 | 2.02 | 0.11 | 0.75 | 0.55 | 0.27 | 1.00 | 00:09:01 | 0.50 |
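When true clusters are unknown, as for these two sets, the criteria are internal ones that depend only on pairwise distances and the assigned labels. As one example, the sketch below computes the standard mean silhouette coefficient from a precomputed distance matrix. How the distances are obtained is an assumption: one plausible choice would be 1 minus an identity score, but this section does not say how the paper derives them. The function name `silhouette` is hypothetical, and the code assumes at least two clusters.

```python
def silhouette(dist, labels):
    """Mean silhouette coefficient from a square distance matrix and labels."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, and negative values, such as the -0.14 in the microbiome block, indicate that many items sit closer to another cluster than to their own.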
Evaluations on the viral and the 14-bacterial-species data sets. The viral set was clustered with an identity score of 50%; it includes nine clusters representing nine viruses. The 14-bacterial-species set was clustered with multiple identity scores; it includes 14 clusters representing 14 bacterial species. We mark MeShClust v3.0 with “auto” when the threshold is estimated automatically; otherwise, a specific threshold is provided to the tool
| Tool | Purity (↑) | Jaccard (↑) | G-Measure (↑) | Cluster quality (↑) | Coverage (↑) | Time (↓) | Memory (GB) (↓) |
|---|---|---|---|---|---|---|---|
| Viral data set | | | | | | | |
| | 0.97 | 0.67 | 0.77 | 0.78 | 0.95 | 00:00:02 | 0.18 |
| | 0.91 | 0.72 | 0.83 | 0.81 | 0.98 | 00:00:28 | 0.08 |
| | 0.96 | 0.56 | 0.72 | 0.71 | 0.72 | 00:00:07 | 0.12 |
| | 1.00 | 0.26 | 0.46 | 0.43 | 0.64 | 00:00:17 | 0.08 |
| 14-bacterial-species set | | | | | | | |
| | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 00:29:36 | 14.09 |
| | 1.00 | 0.93 | 0.97 | 0.97 | 0.91 | 00:46:01 | 14.11 |
| | 1.00 | 0.64 | 0.76 | 0.78 | 0.96 | 00:42:54 | 14.21 |
| | 1.00 | 0.62 | 0.73 | 0.77 | 0.93 | 02:48:41 | 14.21 |
Fig. 2 MeShClust v3.0 in action on the Numerous-97 training data set. The top plot shows the number of centers as the algorithm runs. The middle plot shows the number of sequences accumulated in the reservoir; this number changes in the first data pass (Pass 1) and is zero in the second and third data passes (Pass 2 and Pass 3). The bottom plot shows the number of sequences read during the three data passes
Fig. 3 The effects of the size of the all-vs-all block on cluster quality, percentage of true centers, time, and memory. Figures a–d are produced by evaluating MeShClust v3.0 using different block sizes (1k, 2k, 5k, 10k, 15k, 20k, and 25k) on three small data sets: Short 60, Medium 70, and Long 80; each of these sets consists of fewer than 25k sequences and includes 100 clusters. Figures e–h are produced by evaluating MeShClust v3.0 using different block sizes (1k, 2k, 5k, 10k, 15k, 20k, 25k, and 46k) on one large data set (the Numerous 97 set), which includes more than one million sequences and 5,000 clusters