Manuel Zahariev, Wen Chen, Cobus M Visagie, C André Lévesque.
Abstract
BACKGROUND: Oligonucleotide signatures (signatures) have been widely used for studying microbial diversity and function in wet-lab settings, but their use for accurate in silico identification of organisms from high-throughput sequencing (HTS) data has so far been shown only as a proof of concept. Existing design programs for sequence signatures (signatures matching exactly one sequence) or clade signatures (signatures matching every sequence in a phylogenetic clade) cannot identify all possible polymorphic sites for highly similar sequences, and they perform poorly on large genome sequencing datasets.
Keywords: DNA hybridization; Metabarcoding; Metagenomics; Oligonucleotide signatures; Regulated pathogens
Year: 2018 PMID: 30522439 PMCID: PMC6284311 DOI: 10.1186/s12859-018-2363-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of the matching algorithm using the 4Mycotoxins training set (1,338 sequences) and the 97AerobiotaSamples testing set by signature length λ
| λ | μ98 | | | | t |
|---|---|---|---|---|---|
| aodp | | | | | |
| 16 | 1352 | 0.93 | 0.317 | 17.41 | 17039 |
| 24 | 1353 | 0.94 | 0.311 | 13.27 | 9720 |
| 32 | 1342 | 0.95 | 0.299 | 11.83 | 6362 |
| 40 | 1325 | 0.94 | 0.298 | 11.06 | 3031 |
| USEARCH | | | | | 32560 |
| BLAST | | | | | 74335 |
μ98: number of matching query sequences with similarity α ≥ 1−2ε = 0.98; t: running time in seconds (system description in the “Comparisons with other algorithms” section). Average values (algorithm 1) are reported for the size of the matching kernel and for the number of sequences in all matching clusters, together with the ratio of the average size of the result set to the average size of the matching kernel. Running times are also reported for USEARCH and BLAST.
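The two quantities defined in this footnote can be illustrated with a short sketch. It computes the similarity threshold α = 1−2ε and compares the reported running times. It assumes that the last numeric column of the table is t in seconds; the function and constant names are hypothetical and are not part of aodp, USEARCH, or BLAST.

```python
# Minimal sketch, not the authors' code: names below are illustrative only.

def similarity_threshold(error_rate: float) -> float:
    """Return the minimum similarity alpha = 1 - 2*epsilon for a given error rate."""
    return 1.0 - 2.0 * error_rate


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate ran than the baseline."""
    return baseline_seconds / candidate_seconds


# Running times in seconds as listed in the table (assumed to be column t).
AODP_SECONDS = {16: 17039, 24: 9720, 32: 6362, 40: 3031}
USEARCH_SECONDS = 32560
BLAST_SECONDS = 74335

if __name__ == "__main__":
    print(f"alpha for epsilon=0.01: {similarity_threshold(0.01):.2f}")
    for length, seconds in AODP_SECONDS.items():
        print(
            f"lambda={length}: "
            f"{speedup(USEARCH_SECONDS, seconds):.1f}x vs USEARCH, "
            f"{speedup(BLAST_SECONDS, seconds):.1f}x vs BLAST"
        )
```

The ratios only restate the table; they say nothing about accuracy, which the table reports separately.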
Data sets included in database 17DataSets
| Data set | N | S | i | n | L̄ | σ(L) | n∗ | c | c/n | n∗/n | s0 | s0 (% of S) | Signable sequences | (% of S) | Unique signable clade patterns | (% of S) | Unique cluster patterns | (% of S) | δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 15248 | 28 | 26 | 54 | 545 | 94 | 33 | 139 | 2.6 | 61% | 1 | 4% | 24 | 86% | 27 | 96% | 28 | 100% | 4% |
| | 72624 | 37 | 42 | 79 | 1671 | 290 | 43 | 258 | 3.3 | 54% | - | - | 25 | 68% | 28 | 76% | 35 | 95% | 19% |
| | 24645 | 37 | 35 | 72 | 647 | 60 | 36 | 137 | 1.9 | 50% | 7 | 19% | 24 | 65% | 25 | 68% | 34 | 92% | 24% |
| | 23078 | 48 | 46 | 94 | 481 | 64 | 45 | 143 | 1.5 | 48% | 7 | 15% | 32 | 67% | 37 | 77% | 48 | 100% | 23% |
| | 54964 | 88 | 86 | 174 | 625 | 220 | 126 | 626 | 3.6 | 72% | - | - | 87 | 99% | 88 | 100% | 88 | 100% | - |
| | 79740 | 132 | 63 | 195 | 586 | 146 | 54 | 199 | 1.0 | 28% | 1 | 1% | 37 | 28% | 40 | 30% | 43 | 33% | 2% |
| | 77453 | 140 | 139 | 279 | 553 | 45 | 92 | 376 | 1.3 | 33% | 16 | 11% | 58 | 41% | 63 | 45% | 82 | 59% | 14% |
| | 112291 | 193 | 179 | 372 | 582 | 205 | 115 | 631 | 1.7 | 31% | 52 | 27% | 74 | 38% | 82 | 42% | 149 | 77% | 35% |
| | 201815 | 253 | 238 | 491 | 798 | 24 | 319 | 1103 | 2.2 | 65% | - | - | 149 | 59% | 166 | 66% | 184 | 73% | 7% |
| | 213202 | 399 | 338 | 737 | 530 | 99 | 196 | 1008 | 1.4 | 27% | 149 | 37% | 140 | 35% | 150 | 38% | 266 | 67% | 29% |
| | 428994 | 513 | 400 | 913 | 824 | 377 | 349 | 1984 | 2.2 | 38% | 64 | 12% | 200 | 39% | 222 | 43% | 310 | 60% | 17% |
| | 280418 | 551 | 550 | 1101 | 509 | 11 | 187 | 734 | 0.7 | 17% | - | - | 78 | 14% | 86 | 16% | 101 | 18% | 3% |
| | 547127 | 1032 | 1032 | 2064 | 530 | 39 | 591 | 2331 | 1.1 | 29% | 19 | 2% | 285 | 28% | 313 | 30% | 414 | 40% | 10% |
| | 691867 | 1198 | 918 | 2116 | 576 | 297 | 477 | 2010 | 0.9 | 23% | 562 | 47% | 379 | 32% | 397 | 33% | 667 | 56% | 23% |
| | 743335 | 1200 | 915 | 2115 | 618 | 259 | 574 | 2666 | 1.3 | 27% | 394 | 33% | 376 | 31% | 403 | 34% | 649 | 54% | 20% |
| | 743954 | 1438 | 1437 | 2875 | 517 | 12 | 597 | 2675 | 0.9 | 21% | 57 | 4% | 310 | 22% | 325 | 23% | 413 | 29% | 6% |
| | 1604775 | 2946 | 2261 | 5207 | 533 | 133 | 1165 | 4417 | 0.8 | 22% | 1492 | 51% | 969 | 33% | 1001 | 34% | 1778 | 60% | 26% |
N: size of data set (nucleotides); S: number of sequences (other than sequences with more than 5 ambiguous bases); i: number of internal clades in the phylogenetic tree; n: total number of phylogenetic clades, n = S + i; L̄: average length of sequences in the data set (rounded to closest integer); σ(L): corrected sample standard deviation for the sequence length (rounded to closest integer); n∗: number of signable clades; c: number of clusters (λ=36) identified by aodp; c/n: ratio between clusters and phylogenetic clades; n∗/n: ratio between signable clades and phylogenetic clades; s0: number of sequences that are not included in any signable clades. Counts are also given for signable sequences (also unique signable sequence patterns), unique signable clade patterns, and unique cluster patterns, each with its percentage of S. δ: discrimination increase attributable to clusters (difference between unique cluster patterns and unique signable clade patterns).
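The derived columns in this footnote follow from a few counts, which the sketch below reproduces for the first row of the table. It assumes, based on the listed values, that the percentage columns and δ are expressed relative to S; the class and field names are hypothetical and do not come from aodp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetRow:
    """Counts for one data set; field names are illustrative, not from aodp."""

    sequences: int         # S
    internal_clades: int   # i
    signable_clades: int   # n*
    clusters: int          # c
    clade_patterns: int    # unique signable clade patterns
    cluster_patterns: int  # unique cluster patterns

    @property
    def clades(self) -> int:
        """Total number of phylogenetic clades, n = S + i."""
        return self.sequences + self.internal_clades

    @property
    def clusters_per_clade(self) -> float:
        """Ratio c/n between clusters and phylogenetic clades."""
        return self.clusters / self.clades

    @property
    def signable_fraction(self) -> float:
        """Ratio n*/n between signable clades and phylogenetic clades."""
        return self.signable_clades / self.clades

    @property
    def discrimination_increase(self) -> float:
        """Delta as a fraction of S (assumed normalisation)."""
        return (self.cluster_patterns - self.clade_patterns) / self.sequences


# First row of the table: S=28, i=26, n*=33, c=139, 27 clade and 28 cluster patterns.
first_row = DataSetRow(
    sequences=28,
    internal_clades=26,
    signable_clades=33,
    clusters=139,
    clade_patterns=27,
    cluster_patterns=28,
)

if __name__ == "__main__":
    print("n =", first_row.clades)
    print("c/n =", round(first_row.clusters_per_clade, 1))
    print("n*/n =", f"{round(100 * first_row.signable_fraction)}%")
    print("delta =", f"{round(100 * first_row.discrimination_increase)}%")
```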
Fig. 1 Distribution of number of sequences per cluster (cumulative percentage), data set Unite, λ=36
Fig. 2 Distribution of number of signatures per cluster (cumulative percentage), data set Unite, λ=36
Fig. 3 Dependency of the number of clusters (groups of sequences for which at least one signature can be found) and number of signable clades (phylogenetic clades to which oligonucleotide signatures can be assigned) on number of sequences within each data set (database 17DataSets, λ=36)
Fig. 4 Dependency of the number of clusters (groups of sequences for which at least one signature can be found) and number of signable clades (phylogenetic clades to which oligonucleotide signatures can be assigned) on signature length, data set Penicillium
Fig. 5 Sequence by cluster incidence matrix for the Ceratorhiza data set (λ=36). Each row contains the cluster matches associated with the sequence whose numeric identifier is on the y-axis (a fingerprint of the sequence). Each column represents the sequences contained in a given cluster. Signable sequences and signable internal clades are at the left. The remaining (bare) clusters are at the right
Precision and recall of our matching algorithm (aodp) and USEARCH using the 4Mycotoxins training set and the 4MycotoxinsBootstrap testing set
| ε | Precision | | | | | Recall | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| aodp, λ | 8 | 16 | 24 | 32 | 40 | 8 | 16 | 24 | 32 | 40 |
| 0.05 | 0.74 | 0.90 | 0.91 | 0.92 | 0.93 | 0.71 | 0.49 | 0.25 | 0.15 | 0.08 |
| 0.04 | 0.78 | 0.92 | 0.92 | 0.92 | 0.93 | 0.76 | 0.64 | 0.38 | 0.23 | 0.14 |
| 0.03 | 0.83 | | 0.95 | 0.95 | 0.96 | 0.80 | | 0.55 | 0.38 | 0.24 |
| 0.02 | 0.89 | | 0.97 | 0.97 | 0.97 | 0.84 | | 0.74 | 0.57 | 0.43 |
| 0.01 | 0.95 | | 0.99 | 0.99 | 0.99 | 0.87 | | 0.88 | 0.78 | 0.68 |
| 0.00 | 1.00 | | 1.00 | 1.00 | 1.00 | 1.00 | | 1.00 | 1.00 | 1.00 |
| USEARCH, χ | 4 | 16 | 64 | 256 | 1024 | 4 | 16 | 64 | 256 | 1024 |
| 0.05 | 0.21 | 0.40 | 0.62 | | | 0.19 | 0.38 | 0.60 | | |
| 0.04 | 0.21 | 0.41 | 0.63 | | | 0.20 | 0.38 | 0.59 | | |
| 0.03 | 0.21 | 0.41 | 0.64 | 0.89 | | 0.20 | 0.39 | 0.60 | 0.84 | |
| 0.02 | 0.22 | 0.43 | 0.66 | 0.91 | | 0.20 | 0.40 | 0.61 | 0.84 | |
| 0.01 | 0.24 | 0.45 | 0.67 | 0.92 | | 0.21 | 0.41 | 0.61 | 0.83 | |
| 0.00 | 0.28 | 0.50 | 0.72 | 0.95 | | 0.26 | 0.50 | 0.72 | 0.95 | |
Each row corresponds to a given error rate ε. For aodp, each column corresponds to a given signature length λ. For USEARCH, each column corresponds to a given value χ of the “maxaccepts” parameter. Cells where USEARCH outperforms aodp on the F measure are in bold. Cells where aodp outperforms USEARCH on the F measure for χ≤256 are also in bold.
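The F measure used for this comparison combines precision and recall into a single score. A minimal sketch follows, assuming the balanced form (F1, the harmonic mean of precision and recall), which the footnote does not state explicitly; the function name and the `beta` parameter are illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1).

    Assumption: the table's "F measure" is the balanced F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values taken from the table for error rate 0.05:
# aodp with signature length 8, USEARCH with maxaccepts 64.
aodp_f = f_measure(precision=0.74, recall=0.71)
usearch_f = f_measure(precision=0.62, recall=0.60)

if __name__ == "__main__":
    print(f"aodp    F = {aodp_f:.3f}")
    print(f"USEARCH F = {usearch_f:.3f}")
```

Comparing one cell pair at a time in this way mirrors how the bold cells are defined, although which pairs the authors compared is not recoverable from the table alone.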