Mohammadreza Ghodsi, Bo Liu, Mihai Pop.
Abstract
BACKGROUND: Clustering is a fundamental operation in the analysis of biological sequence data. New DNA sequencing technologies have dramatically increased the rate at which we can generate data, resulting in datasets that cannot be efficiently analyzed by traditional clustering methods. This is particularly true in the context of taxonomic profiling of microbial communities through direct sequencing of phylogenetic markers (e.g., 16S rRNA), the domain that motivated the work described in this paper. Many analysis approaches rely on an initial clustering step aimed at identifying sequences that belong to the same operational taxonomic unit (OTU). When defining OTUs (which have no universally accepted definition), scientists must balance a trade-off between computational efficiency and biological accuracy, as accurately estimating an environment's phylogenetic composition requires computationally intensive analyses. We propose that efficient and mathematically well-defined clustering methods can benefit existing taxonomic profiling approaches in two ways: (i) the resulting clusters can be substituted for OTUs in certain applications; and (ii) the clustering effectively reduces the size of the datasets that need to be analyzed by complex phylogenetic pipelines (e.g., only one sequence per cluster needs to be provided to downstream analyses).
Year: 2011 PMID: 21718538 PMCID: PMC3213679 DOI: 10.1186/1471-2105-12-271
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Dynamic programming table example. Partially filled dynamic programming table. The query sequence is represented on the horizontal axis. At this point the algorithm has computed the alignment costs for a prefix of length 4 of the data sequence, which is shown on the vertical axis. Since we are calculating a semi-global alignment, the first row is initialized to all zeroes, i.e., the alignment of the shorter data sequence can start at any position of the longer query sequence without any penalty. In this figure, the distance threshold is 2, and any values larger than this threshold are set to the maximum value, represented by ∞. To optimize the running time, since there are only three valid values on the last finished row, only the values for the three gray cells need to be computed on the row right above it.
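The table-filling scheme in Figure 1 can be illustrated with a minimal Python sketch. This is not the DNACLUST implementation: the function name `semi_global_distance`, the unit costs for mismatches and gaps, and the choice to take the minimum over the final row (so the alignment may also end anywhere in the query) are assumptions made for illustration. Cells whose cost exceeds the threshold are stored as infinity, and each new row is computed only where a valid cell of the previous row can still reach it.

```python
import math

def semi_global_distance(query, data, threshold):
    """Cost of aligning all of `data` to some substring of `query`.

    Illustrative sketch with unit costs (assumption). Returns math.inf
    when no alignment has cost <= threshold.
    """
    INF = math.inf
    n = len(query)
    # First row is all zeroes: the alignment may start anywhere in the query.
    prev = [0] * (n + 1)
    for ch in data:
        valid = [j for j, v in enumerate(prev) if v <= threshold]
        if not valid:
            return INF  # every cell already exceeds the threshold
        cur = [INF] * (n + 1)
        lo, hi = valid[0], min(n, valid[-1] + 1)
        j = lo
        while j <= n:
            # Right of `hi`, only a horizontal move can reach the cell.
            if j > hi and cur[j - 1] == INF:
                break
            best = prev[j] + 1  # data character aligned to a gap
            if j > 0:
                best = min(best,
                           prev[j - 1] + (query[j - 1] != ch),  # match or mismatch
                           cur[j - 1] + 1)                      # query character aligned to a gap
            cur[j] = best if best <= threshold else INF
            j += 1
        prev = cur
    # Assumption: the alignment may end at any query position.
    return min(prev)
```

Because costs never decrease along an alignment path, discarding cells above the threshold does not change any result that is at or below the threshold.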
Figure 2. Running times. Plot of running time as a function of cluster radius for various tools and settings, on the twins dataset. The dataset contains 1.1 million pyrosequencing reads from the V2 region of the 16S rRNA gene. The reads have an average length of 231 base pairs. The running times were measured on a single 1.8 GHz processor of an Intel x86-64 Linux laptop with 4 GB RAM. The command line options were:
dnaclust infile.fasta -s 0.9x -k 3 [--approximate-filter] > outfile.cluster
uclust --input infile-sorted.fasta --uc outfile.cluster --id 0.9x [--exact]
Number of clusters
| Tool | 0.99 | 0.97 | 0.95 |
|---|---|---|---|
| DNACLUST exact | 233879 | 73726 | 28241 |
| DNACLUST inexact | 240125 | 76391 | 28661 |
| UCLUST exact | 144339 | 48418 | 20039 |
| UCLUST inexact | 253108 | 71361 | 26685 |
| CD-HIT | 245851 | 100280 | 55208 |
The number of clusters produced by DNACLUST, UCLUST and CD-HIT at various identity/similarity thresholds, on the twins dataset. Since each tool uses a slightly different distance measure, the number of clusters cannot be directly compared between tools. (Namely, the identity measure used by UCLUST and CD-HIT underestimates the distance between two sequences, as computed by the similarity measure used by DNACLUST.) Instead we compare the change in the number of clusters when switching between the exact and inexact modes of each tool; a smaller change indicates better performance.
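The comparison described in this caption can be made concrete with a short calculation. The sketch below uses the cluster counts from the table; expressing the change as a relative difference, and the names `counts` and `relative_change`, are assumptions for illustration rather than the authors' stated procedure.

```python
# Cluster counts copied from the table (twins dataset).
thresholds = [0.99, 0.97, 0.95]
counts = {
    "DNACLUST": {"exact": [233879, 73726, 28241],
                 "inexact": [240125, 76391, 28661]},
    "UCLUST": {"exact": [144339, 48418, 20039],
               "inexact": [253108, 71361, 26685]},
}

def relative_change(exact, inexact):
    """Change in cluster count when switching from exact to inexact mode."""
    return (inexact - exact) / exact

changes = {
    tool: [relative_change(e, i) for e, i in zip(modes["exact"], modes["inexact"])]
    for tool, modes in counts.items()
}

for tool, values in changes.items():
    for threshold, value in zip(thresholds, values):
        print(f"{tool} at {threshold}: {value:+.1%}")
```

Under this measure the DNACLUST counts change by a few percent at every threshold, while the UCLUST counts change by roughly a third or more.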
Running times on RDP dataset
| Tool | 0.99 | 0.97 | 0.95 |
|---|---|---|---|
| DNACLUST exact | 204 | 372 | 960 |
| UCLUST exact | 7800 | 5040 | |
| DNACLUST inexact | 74 | 76 | 150 |
| UCLUST inexact | 43 | 29 | 16 |
The running times (minutes) of DNACLUST and UCLUST with various similarity thresholds on the RDP dataset.
The running times were measured on a single 2.8 GHz processor of an AMD64 Linux workstation. The command line options were:
dnaclust infile.fasta -l -s 0.9x -k 5 [--approximate-filter] > outfile.cluster
uclust --input infile-sorted.fasta --uc outfile.cluster --id 0.9x [--exact]
Multiple sequence alignment building times
| MSA method | Time (sec.) | Diameter (DNADIST) |
|---|---|---|
| ClustalW | 1545.5 | 0.251 |
| ClustalW -quicktree | 87.6 | 0.264 |
| MUSCLE | 197.8 | 0.198 |
| UCLUST | 0.1 | 0.156 |
| DNACLUST | 0.8 | 0.094 |
Time spent building a multiple sequence alignment (MSA) of a sample cluster using different tools, and the diameter of the MSA produced, as reported by DNADIST. The diameter is expected to be less than or equal to 0.10.
Figure 3. Distribution of cluster MSAs based on their average pairwise distance. Figures 3a, 3b and 3c show the distribution of sampled cluster multiple sequence alignments based on their average pairwise distance for thresholds 99%, 97% and 95%, respectively. The figures show that DNACLUST cluster MSAs (thick blue line) are tighter (i.e., have smaller average pairwise distance) than UCLUST cluster MSAs (thick red line). Furthermore, computing a "traditional" MSA using ClustalW from the clusters produced by DNACLUST and UCLUST results in an overestimation of the distances between sequences (dashed lines).
Average frequency of gaps in the multiple sequence alignments
| Tool | 0.99 | 0.97 | 0.95 |
|---|---|---|---|
| DNACLUST | 0.016 | 0.071 | 0.103 |
| UCLUST | 0.071 | 0.117 | 0.146 |
The average frequency of gaps in multiple sequence alignments for sampled clusters at various similarity thresholds. For each MSA, the frequency of gaps is the number of gaps divided by the total number of characters in the MSA. Gaps before the beginning and after the end of each sequence are excluded. Note that since an insertion in one sequence results in a gap in all other sequences in the MSA, the ratio of gaps may be higher than the clustering threshold. Since the sequence identity measure used by UCLUST does not take gaps into account, the number of gaps in UCLUST MSAs is higher than in DNACLUST MSAs, especially at more stringent thresholds.
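The gap-frequency measure defined in this caption can be sketched as follows. The function name `gap_frequency`, the use of `-` as the gap character, and the decision to leave the excluded leading and trailing gaps out of the denominator as well are assumptions; the caption does not say how the excluded gaps affect the character total.

```python
def gap_frequency(msa_rows, gap="-"):
    """Fraction of internal gap characters in an MSA.

    `msa_rows` is a list of equal-length aligned strings. Gaps before the
    first and after the last residue of each row are ignored (assumption:
    they are also left out of the character total).
    """
    gaps = 0
    total = 0
    for row in msa_rows:
        core = row.strip(gap)  # drop leading and trailing gaps only
        gaps += core.count(gap)
        total += len(core)
    return gaps / total if total else 0.0

# Toy alignment, not taken from the paper's data.
example = ["--ACG-T--",
           "TTAC-GTAA",
           "-TACGGT--"]
print(gap_frequency(example))
```

In the toy alignment, two internal gaps remain among 20 counted characters once the terminal gaps are stripped.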