Alexander F Auch, Hans-Peter Klenk, Markus Göker.
Abstract
DNA-DNA hybridization (DDH) is a widely applied wet-lab technique for estimating the overall similarity between the genomes of two organisms. Microbiologists chose to base the prokaryotic species concept ultimately on DDH as a pragmatic approach to deciding whether to recognize novel species; this choice also allowed a relatively high degree of standardization compared to other areas of taxonomy. However, DDH is tedious and error-prone and, first and foremost, cannot be used to incrementally establish a comparative database. Recent studies have shown that in-silico methods for the comparison of genome sequences can be used to replace DDH. Considering the ongoing rapid technological progress of sequencing methods, genome-based prokaryote taxonomy is coming into reach. However, calculating distances between genomes depends on multiple choices of software and program settings. Here we provide an overview of the modifications that can be applied to distance methods based on high-scoring segment pairs (HSPs) or maximally unique matches (MUMs) and that need to be documented. General recommendations on determining HSPs using BLAST or other algorithms are also provided. As a reference implementation, we introduce the GGDC web server (http://ggdc.gbdp.org).
Keywords: BLAST; GBDP; GGDC web server; MUMmer; genomics; microbial taxonomy; phylogeny; species delineation
Year: 2010 PMID: 21304686 PMCID: PMC3035261 DOI: 10.4056/sigs.541628
Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277
Figure 1. Flowchart outlining the steps required to calculate in-silico DDH values. Either GenBank accession numbers or FASTA files are uploaded to the server. The final values are received via e-mail.
HSP determination and filtering
| | WU-BLAST (a) | NCBI BLAST (b) | BLAT (c) | MUMmer (d) | BLASTZ (e) |
|---|---|---|---|---|---|
| Run time | Very high [M] | Low [M] | High [M] | Very low [M] | Moderate [M] |
| Memory consumption | High [M] | Moderate [M] | Moderate [M] | Very low [M] | Low [M] |
| Typical effect on | decrease [M] | increase [M] | increase [M] | moderate | decrease [M] |
| Seed parameter | W= | -W | -tileSize | -l | T=0 W= |
| Typical effect on runtime, | higher → speedup | higher → speedup | higher → speedup; | higher → speedup | higher → speedup |
| Typical effect on | N/A | N/A | lower → decrease | higher → increase | N/A |
| Identity parameter | score based, i.e., | score based, i.e., | -minIdentity | 100% (fixed) | score based, i.e., |
| Typical effect on runtime, | N/A | N/A | insignificant [M] | (none) | N/A |
| Typical effect on | N/A | N/A | lower → increase | N/A | N/A |
| | e-value | e-value | substitution score | (makes no sense) | substitution score |
| Typical effect on | insignificant [E] | insignificant [E] | lower → small | (none) | lower → small |
| Typical effect on | insignificant [E] | insignificant [E] | lower → small | N/A | higher → slight |
The table shows different parameters of the similarity search algorithms and their influence on the correlation with DDH values (for details, see [1]). Note that the best possible correlation of DDH values (similarities) with genome-to-genome distances (GGD; dissimilarities) is -1.0; that is, a 'higher' correlation here means a more strongly negative one. Seed parameter: minimum length of a stretch of DNA used as an HSP starting point. Identity parameter: minimum identity within an HSP required for its extension. Evidence codes: [M] measured; [E] extrapolated.
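The sign convention in the note above can be illustrated with a minimal sketch. The numbers below are made up for illustration and are not measurements from the study; the helper `pearson` is a hypothetical name, written out by hand so the example has no dependencies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT data from the study.
ddh_similarity = [90.0, 70.0, 50.0, 30.0]  # DDH similarity in percent
distance = [0.1, 0.2, 0.3, 0.4]            # genome-to-genome distance

r = pearson(ddh_similarity, distance)
print(round(r, 6))
```

Because similarity falls as distance rises, a perfectly linear relationship yields a coefficient of -1.0 (up to rounding), which is why the most negative correlation is the best one in this setting.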
(a) Version 2.0MP-WashU [04-May-2006], website http://blast.wustl.edu/ [2]
(b) Version 2.2.18, website ftp://ftp.ncbi.nlm.nih.gov/blast/executables/ [2]
(c) Version 34, website http://users.soe.ucsc.edu/~kent/src/ [3]
(d) Version 3.0, website http://mummer.sourceforge.net/ [5]
(e) Version 7, website http://www.bx.psu.edu/miller_lab/ [4]
Command line parameters for similarity search tools as used by the web server. Recommended parameters are in bold.
| Program | Command line |
|---|---|
| NCBI BLAST | blastall -p blastn -i QUERY -d SUBJECT -m 7 -a 1 -S 3 -e 10 |
| WU-BLAST | blastn SUBJECT QUERY mformat=7 cpus=1 E=10 |
| BLAT | blat SUBJECT QUERY OUTFILE -t=dna -q=dna -out=blast |
| BLASTZ | blastz QUERY SUBJECT B=2 C=2 |
| MUMmer | mummer -b -c -F |
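As a minimal sketch of how the command lines in this table could be assembled in a script, the following fills in the QUERY, SUBJECT, and OUTFILE placeholders without running anything. The function name `build_command`, the default output file name, and the example file names are hypothetical; how each tool expects its inputs to be prepared (for example, database formatting for BLAST) is outside the scope of this sketch.

```python
# Templates copied from the table; QUERY, SUBJECT and OUTFILE are placeholders.
# The MUMmer entry lists only flags in the table, so no file arguments
# are added for it here.
COMMAND_TEMPLATES = {
    "NCBI BLAST": "blastall -p blastn -i QUERY -d SUBJECT -m 7 -a 1 -S 3 -e 10",
    "WU-BLAST": "blastn SUBJECT QUERY mformat=7 cpus=1 E=10",
    "BLAT": "blat SUBJECT QUERY OUTFILE -t=dna -q=dna -out=blast",
    "BLASTZ": "blastz QUERY SUBJECT B=2 C=2",
    "MUMmer": "mummer -b -c -F",
}

def build_command(program, query, subject, outfile="hits.out"):
    """Return the argument list for one program with placeholders filled in."""
    values = {"QUERY": query, "SUBJECT": subject, "OUTFILE": outfile}
    template = COMMAND_TEMPLATES[program]
    # Tokens that are not placeholders are passed through unchanged.
    return [values.get(token, token) for token in template.split()]

print(build_command("BLASTZ", "genomeA.fna", "genomeB.fna"))
# → ['blastz', 'genomeA.fna', 'genomeB.fna', 'B=2', 'C=2']
```

Returning a list rather than a single string keeps the arguments separate, so the result could be handed to a process launcher without shell quoting issues.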
Figure 2. Example of a CGVIZ file. The e-value is stored using its logarithmic value (base 10).
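To make the base-10 storage concrete, here is a small sketch of converting an e-value to its logarithm and back. Both function names are hypothetical, and mapping an e-value of zero to negative infinity is an assumption of this sketch; how the CGVIZ format itself represents such a value is not stated here.

```python
import math

def evalue_to_log10(evalue):
    """Return log10 of an e-value.

    ASSUMPTION: an e-value of exactly 0.0 is mapped to -inf, because
    log10(0) is undefined; the file format may handle this differently.
    """
    if evalue < 0:
        raise ValueError("e-values cannot be negative")
    return math.log10(evalue) if evalue > 0 else float("-inf")

def log10_to_evalue(stored):
    """Invert evalue_to_log10: recover the e-value from its stored logarithm."""
    return 10.0 ** stored

stored = evalue_to_log10(1e-50)
print(stored, log10_to_evalue(stored))
```

An e-value of 1e-50 is thus stored as approximately -50, which keeps very small values compact and readable.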
HSP overlap filtering and distance calculation.
| | | | |
|---|---|---|---|
| Algorithm/implementation | no filtering (coverage | HSP overlap filtering using the | |
| Typical effect on correlation | decrease (except for | increase [M] | |
| | | | |
| Typical effect on correlation | increase [M] | moderate decrease; | increase [M] |
| | | | |
| Typical effect on correlation | insignificant [M] | insignificant [M] | |
| | | | |
| Typical effect on correlation | no effect [M] | no effect [M] | |
Evidence codes: [M] measured; [E] extrapolated.
Distance thresholds and conversion values.
| | | | | | |
|---|---|---|---|---|---|
| NCBI BLAST | Trimming (1) | 0.2676 | 0.0860 | 96.8979 | -121.4848 |
| NCBI BLAST | Trimming (2) | 0.0412 | 0.0430 | 90.3998 | -438.3134 |
| NCBI BLAST | Trimming (3) | 0.2945 | 0.0860 | 98.6313 | -118.8770 |
| WU-BLAST | Trimming (1) | 0.0436 | 0.2796 | 122.9402 | -406.2128 |
| WU-BLAST | Trimming (2) | 0.0870 | 0.0430 | 82.1068 | -166.2293 |
| WU-BLAST | Trimming (3) | 0.2870 | 0.0860 | 115.4105 | -191.9086 |
| BLAT | Trimming (1) | 0.2672 | 0.0753 | 97.7166 | -127.9852 |
| BLAT | Trimming (2) | 0.0416 | 0.0430 | 87.0748 | -376.3038 |
| BLAT | Trimming (3) | 0.2811 | 0.0645 | 100.6280 | -122.5151 |
| BLASTZ | Trimming (1) | 0.2389 | 0.2043 | 89.8757 | -102.0887 |
| BLASTZ | Trimming (2) | 0.0575 | 0.0538 | 85.1650 | -273.1803 |
| BLASTZ | Trimming (3) | 0.3344 | 0.1828 | 111.9235 | -125.1989 |
| MUMmer | Coverage (1) | 0.6110 | 0.0430 | 130.9618 | -116.4258 |
1 - Analogous to 70% DDH
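The following is a minimal sketch of how values such as those in this table might be applied. It rests on an assumption that is not confirmed by this copy of the table, whose column headers are missing: the first numeric column is read as the distance threshold analogous to 70% DDH, and the last two columns as the intercept and slope of a linear conversion from distance to a DDH estimate. The second numeric column is carried along unused. All function and variable names are hypothetical.

```python
# Rows copied from the table: (program, method) -> the four numeric columns.
# ASSUMPTION: columns are read as (threshold, unused, intercept, slope);
# the header row is missing, so this interpretation is unconfirmed.
CONVERSION_ROWS = {
    ("NCBI BLAST", "Trimming (1)"): (0.2676, 0.0860, 96.8979, -121.4848),
    ("BLAT", "Trimming (1)"): (0.2672, 0.0753, 97.7166, -127.9852),
    ("MUMmer", "Coverage (1)"): (0.6110, 0.0430, 130.9618, -116.4258),
}

def estimate_ddh(distance, program, method):
    """Assumed linear model: DDH estimate = intercept + slope * distance."""
    _threshold, _unused, intercept, slope = CONVERSION_ROWS[(program, method)]
    return intercept + slope * distance

def within_threshold(distance, program, method):
    """True if the distance does not exceed the assumed threshold column."""
    threshold = CONVERSION_ROWS[(program, method)][0]
    return distance <= threshold

d = 0.2  # made-up example distance, not a value from the study
print(round(estimate_ddh(d, "NCBI BLAST", "Trimming (1)"), 2))
print(within_threshold(d, "NCBI BLAST", "Trimming (1)"))
```

Under this reading the threshold and the conversion values are separate columns, so the linear estimate evaluated at the threshold need not equal exactly 70%.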