Markus Göker, Guido W Grimm, Alexander F Auch, Ralf Aurahs, Michal Kučera.
Abstract
Identifying species is challenging in the case of organisms for which primarily molecular data are available. Even if morphological features are available, molecular taxonomy is often necessary to revise taxonomic concepts and to analyze environmental DNA sequences. However, clustering approaches to delineate molecular operational taxonomic units often rely on arbitrary parameter choices. Also, distance calculation is difficult for highly alignment-ambiguous sequences. Here, we applied a recently described clustering optimization method to highly divergent planktonic foraminifera SSU rDNA sequences. We determined the distance function and the clustering setting that result in the highest agreement with morphological reference data. Alignment-free distance calculation, when adapted for use with partly non-homologous sequences caused by distinct primer pairs, outperformed multiple sequence alignment. Clustering optimization offers new perspectives for the barcoding of species diversity and for environmental sequencing. It bridges the gap between traditional and modern taxonomic disciplines by specifically addressing the issue of how to optimally account for both genetic divergence and given species concepts.
Keywords: SSU rDNA; automated taxonomy; linkage clustering; parameter optimization; planktonic foraminifera
Year: 2010 PMID: 21037964 PMCID: PMC2964048 DOI: 10.4137/ebo.s5504
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Corrected alignment-free distance formulae.
Notes: How to correct GBDP for the violation of fragment homology (trimmed sequence ends). Symbols used: x and y, sequences; grey boxes, location of HSPs; f, first position; F, first position within the first HSP; L, last position within the last HSP; and l, globally last position, within sequence x; the corresponding positions within sequence y are defined analogously. Without background information, fragment homology is only established explicitly between the F – L part of x and the F – L part of y. If the sequences violate the fragment homology condition, using the full sequence lengths in the denominator (λ0; Formula 2) will thus overestimate the number of base pairs that can be compared in a biologically meaningful way. The corresponding distances that use λ0 will thus be overestimated (Formula 1). The modifications of the denominator in Formulae 3 and 4 correct the distances that use λ1 and λ2 downwards.
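The effect of the denominator correction can be illustrated with a minimal Python sketch. Formulae 1 to 4 are not reproduced in this record, so the functions below are not the published GBDP formulae: they assume a generic distance of the form 1 minus (HSP-covered positions divided by a denominator) and only contrast a full-length denominator with one restricted to the F – L spans. The class `HspSummary`, its fields, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HspSummary:
    """Hypothetical per-sequence summary of HSP locations (1-based positions)."""
    length: int     # full sequence length, i.e. positions f..l
    first_hsp: int  # F: first position within the first HSP
    last_hsp: int   # L: last position within the last HSP
    covered: int    # number of positions covered by HSPs

    def __post_init__(self):
        if not 1 <= self.first_hsp <= self.last_hsp <= self.length:
            raise ValueError("expected 1 <= F <= L <= sequence length")
        if not 0 <= self.covered <= self.span:
            raise ValueError("covered positions must fit into the F..L span")

    @property
    def span(self) -> int:
        """Length of the F..L part, the only part with established homology."""
        return self.last_hsp - self.first_hsp + 1


def distance_full_length(x: HspSummary, y: HspSummary) -> float:
    """Assumed generic form: denominator uses the full sequence lengths."""
    return 1.0 - (x.covered + y.covered) / (x.length + y.length)


def distance_span_corrected(x: HspSummary, y: HspSummary) -> float:
    """Assumed corrected form: denominator uses only the F..L spans."""
    return 1.0 - (x.covered + y.covered) / (x.span + y.span)


# Toy values (invented for illustration): both sequences have trimmed ends.
x = HspSummary(length=1000, first_hsp=101, last_hsp=800, covered=600)
y = HspSummary(length=800, first_hsp=1, last_hsp=650, covered=600)
d_full = distance_full_length(x, y)
d_corrected = distance_span_corrected(x, y)
```

Because the F – L spans can never exceed the full sequence lengths, the corrected denominator is never larger than the uncorrected one, so the corrected distance is never larger than the uncorrected distance. This matches the downward correction described in the caption.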
Table 1. Optimal clustering results in dependence on the distance formula.
| Alignment method / distance function | Distance formula | T | F | Highest MRI | Mean MRI |
|---|---|---|---|---|---|
| GBDP, uncorrected (λ0) | ∼ | 0.27295 | 0.05 | 0.77440 | 0.74153 |
| GBDP, corrected (λ1) | ∼ | 0.25475/0.25735 | 0.70/0.75 | 0.80006 | 0.77958 |
| GBDP, corrected (λ2) | ∼ | 0.12705 | 0.00 | 0.78781 | 0.77282 |
| clustalw | F84+G | 0.39070/0.40250 | 0.40/0.45 | 0.77177 | 0.73574 |
| clwopt | F81+G/TamNei+G | 0.37270/0.38265/0.38330/0.38690/0.38800/0.40380/0.40525/0.40670/0.40805/0.41165 | 0.25/0.35/0.40/0.45/0.50/0.30 | 0.76263 | 0.73247 |
| einsi | GTR/GTR+G/TamNei/TamNei+G | 0.10300/0.10490/0.10935/0.11420/0.11615/0.12260/0.12125/0.12340/0.12955/0.13600/0.13860/0.14860/0.11355/0.11550/0.12190/0.11560/0.12145/0.12785/0.13360/0.13650/0.14565000 | 0.15/0.20/0.25/0.30/0.35/0.40/0.10 | 0.75983 | 0.72110 |
| ginsi | GTR+G | 0.70360 | 1.00 | 0.76705 | 0.72616 |
| kalign | RAxML/F81/F81+G/F84/F84+G/GTR/GTR+G/JC/JC+G/K2P/K2P+G/K3P/K3P+G/LogDet/TamNei/TamNei+G | 0.06205/0.05910/0.06410/0.05915/0.06420/0.05950/0.06560/0.06405/0.05920/0.06430/0.05785/0.05680/0.064450 | 0.00 | 0.77756 | 0.75664 |
| linsi | GTR/GTR+G/LogDet/TamNei/TamNei+G | 0.12520/0.12625/0.12955/0.15285/0.15495/0.15805/0.12365/0.12465/0.12440/0.12525/0.14930/0.15090/0.15540000 | 0.35/0.40/0.45 | 0.73862 | 0.71920 |
| mafft | P/TamNei | 0.10545/0.12495/0.131550 | 0.25/0.35/0.40 | 0.76322 | 0.73867 |
| muscle | TamNei | 0.17965 | 0.15 | 0.74915 | 0.70428 |
| nralign | F81/F84/GTR+G/JC/JC+G | 0.16595/0.16600/0.21925/0.16575/0.20845 | 0.15 | 0.76217 | 0.71718 |
| poa | F81+G/JC/JC+G/K2P/K2P+G/K3P/TamNei+G | 0.16755/0.13815/0.16710/0.13830/0.16765/0.13840/0.16870 | 0.10 | 0.76758 | 0.73450 |
| poaglo | TamNei | 0.13610 | 0.10 | 0.78435 | 0.71880 |
Note: Highest MRI and corresponding best values of T and F for the MSA-based and GBDP distance functions. For each GBDP formula, the arithmetic mean over all F values is also indicated and for each MSA the mean over all F values and distance formulae.
Figure 2. Partition agreement optimization plot.
Note: The relationship between distance threshold T, MRI and number of clusters using the globally optimal F value (0.75) and distance function λ1 (Formula 3).
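The kind of threshold scan summarized in such a plot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OPTSIL implementation: it uses plain single-linkage clustering and does not model the F parameter at all, it uses the Hubert–Arabie adjusted Rand index as a stand-in for the MRI, and the distance matrix, reference labels, and all function names are invented for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / total_pairs
    maximum = (same_a + same_b) / 2
    if maximum == expected:  # both partitions trivial and identical
        return 1.0
    return (same_both - expected) / (maximum - expected)


def single_linkage(dist, threshold):
    """Join all items connected by a chain of distances <= threshold."""
    parent = list(range(len(dist)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(dist))]


def scan_thresholds(dist, reference, thresholds):
    """Return (threshold, agreement, number of clusters) for each threshold."""
    rows = []
    for t in thresholds:
        labels = single_linkage(dist, t)
        rows.append((t, adjusted_rand_index(labels, reference), len(set(labels))))
    return rows


# Toy data (invented): five sequences, three reference taxa.
TOY_DIST = [
    [0.00, 0.05, 0.30, 0.30, 0.50],
    [0.05, 0.00, 0.30, 0.30, 0.50],
    [0.30, 0.30, 0.00, 0.08, 0.50],
    [0.30, 0.30, 0.08, 0.00, 0.50],
    [0.50, 0.50, 0.50, 0.50, 0.00],
]
TOY_REFERENCE = ["A", "A", "B", "B", "C"]
scan = scan_thresholds(TOY_DIST, TOY_REFERENCE, [0.01, 0.10, 0.35, 0.60])
best = max(scan, key=lambda row: row[1])
```

Each row of `scan` holds one threshold with its agreement score and cluster count, which are the three quantities related in the plot. In the toy data the agreement peaks at the threshold that separates the three reference taxa and drops again once clusters start to merge.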
Figure 3. Neighbor-joining GBDP tree.
Notes: Radial NJ phylogram based on λ1 GBDP distances. Branch lengths are scaled in terms of the estimates from the distance values. Colors represent the affiliation to TU; their abbreviations are explained in Table 2.
Figure 4. Maximum-likelihood tree.
Notes: ML phylogram inferred from the MAFFT alignment and rooted according to the separation of planktonic foraminifera into spinose macroperforate, non-spinose macroperforate and non-spinose microperforate taxa. Branch lengths are scaled in terms of the number of substitutions per site. Bootstrap support values are shown on the branches. TU annotations from optimal settings (as in Fig. 3 and Table 2) are provided on the right side. The names of the morphotaxa are provided near the leaves of the tree. Stars indicate morphotaxa that are not supported as monophyletic (cf. ref. 28); the position of the two sequences with missing data (white dots) is indicated by respective cluster numbers in brackets (see also Table 2).
Table 2. Interpretation of taxonomic units.
| Reference taxon, accession nos. or individuals | Original cluster number | TU | Interpretation |
|---|---|---|---|
| | 17 | BUL | OK |
| | 17 | BUL | Identified using clustering |
| | 17 | BUL | Identified using clustering |
| Indiv. R043, determined as | 5 | [Cluster 5] | Missing data (‘N’) artefact |
| Indiv. R043, determined as | 12 | SIP-A | Possible misdetermination |
| | 3 | FAL | Few data |
| | 12 | SIP-A | Possible misdetermination |
| | 20 | SIP-B | Includes |
| | 12 | SIP-A | Includes |
| Undetermined spinose individual P125 | 12 | SIP-A | Identified using clustering |
| Undetermined spinose individual P155 | 20 | SIP-B | Identified using clustering |
| | 22 | GLU | OK |
| | 4 | UVU-A | Provisional, very few data |
| Two small individuals, possibly | 2 | UVU-B | Provisional, few data |
| | 9 | CON | Very few data |
| | 8 | RUB | Synonym of |
| | 9 | CON | Synonym of |
| G. sacculifer Z69600 | 9 | CON | Known misnomer |
| | 7 | SAC | OK |
| | 13 | MAC-A | Possible misdetermination (Kimoto and Tsuchiya, unpublished; available in GenBank |
| | 13 | MAC-A | OK |
| | 1 | [Cluster 1] | Missing data (‘N’) artefact |
| | 18 | HIR | OK |
| R021, undetermined globorotaliid (possibly | 18 | HIR | Identified using clustering |
| R034, undetermined globorotaliid | 14 | MAC-B | Could be first true |
| | 11 | MEN | Very few data |
| | 19 | TRU | OK |
| | 6 | PEL-A | OK |
| | 21 | PEL-B | OK |
| | 14 | MAC-B | Few data |
| | 16 | INC | Synonym of |
| | 13 | MAC-A | Synonym of N. pachyderma type I; |
| | 10 | ORB | Synonym of Orbulina Mediterranean, Caribbean, Sargasso type |
| | 14 | MAC-B | Very few data |
| | 0 | QUI-A | Very few data |
| | 15 | QUI-B | Few data |
Note: Molecular taxonomic units (TU) from clustering using the optimal parameters and their correspondence with the reference taxonomy. The “original cluster number” is an arbitrary number directly found in the OPTSIL results with a 1:1 correspondence to the TU names.
Accession nos. refer to sequences downloaded from GenBank; individuals refer to data of Aurahs et al. (ref. 28; cf. Material and Methods). If no accession nos. or individuals are given, all sequences assigned to the respective taxon are addressed.
Both sequences comprise missing data in the center of the sequences (incompletely sequenced clones). The distance formulae have not been corrected for this situation.