Kessy Abarenkov, Erik Kristiansson, Martin Ryberg, Sandra Nogal-Prata, Daniela Gómez-Martínez, Katrin Stüer-Patowsky, Tobias Jansson, Sergei Põlme, Masoomeh Ghobad-Nejhad, Natàlia Corcoll, Ruud Scharn, Marisol Sánchez-García, Maryia Khomich, Christian Wurzbacher, R. Henrik Nilsson.
Abstract
The international DNA sequence databases abound in fungal sequences not annotated beyond the kingdom level, typically bearing names such as "uncultured fungus". These sequences beget low-resolution mycological results and invite further deposition of similarly poorly annotated entries. What do these sequences represent? This study uses a 767,918-sequence corpus of public full-length fungal ITS sequences to estimate what proportion of the 95,055 "uncultured fungus" sequences represent truly unidentifiable fungal taxa, and what proportion would have been straightforward to annotate to some more meaningful taxonomic level at the time of sequence deposition. Our results suggest that more than 70% of these sequences would have been trivial to identify to at least the order/family level at the time of sequence deposition, hinting that factors other than poor availability of relevant reference sequences explain the low-resolution names. We speculate that researchers' perceived lack of time and lack of insight into the ramifications of this problem are the main explanations for the low-resolution names. We were surprised to find that more than a fifth of these sequences seem to have been deposited by mycologists rather than by researchers unfamiliar with the consequences of poorly annotated fungal sequences in molecular repositories. The proportion of these needlessly poorly annotated sequences does not decline over time, suggesting that this problem must not be left unchecked.
Keywords: DNA barcoding; Data interoperability; data mining; scientific practice; species identification; taxonomic annotation
Year: 2022 PMID: 35153529 PMCID: PMC8828591 DOI: 10.3897/mycokeys.86.76053
Source DB: PubMed Journal: MycoKeys ISSN: 1314-4049 Impact factor: 2.984
Figure 1. A screenshot from species hypothesis SH1159264.08FU (https://dx.doi.org/10.15156/BIO/SH1159264.08FU) in UNITE. Identifying an ITS sequence to at least the genus level is trivial, yet the screenshot hints at the swathes of kingdom level-annotated sequences regularly deposited in the INSDC. SequenceID – INSDC accession number. UNITE taxon name – taxonomic annotation in UNITE. INSD taxon name – original taxonomic annotation in INSDC. RefSeq – indicates a type-derived sequence. More than thirty studies have deposited kingdom-level annotations in this species hypothesis. The ones shown primarily stem from Nishizawa et al. (2010).
Figure 2. Pie chart representing all the 95,055 kingdom-level ITS sequences and the proportion of these that were true-positives (had no or only very distant taxonomically more well-annotated BLAST matches at the time of sequence deposition/release; red, 10%), false-negatives (had only reasonable matches; green, 17%) and false-negatives (had close matches; blue, 73%). The chart suggests that nearly all kingdom-level fungal ITS sequences in INSDC could have been given a more taxonomically-resolved name at the time of sequence deposition/release.
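The three-way split in Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the function names, and both identity thresholds (`CLOSE_MATCH`, `REASONABLE_MATCH`) are assumptions made for illustration, since this record does not state the cut-offs used. The sketch only considers matches that were annotated below the kingdom level and already public when the query sequence was deposited.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

# Hypothetical percent-identity thresholds; the source does not give the real ones.
CLOSE_MATCH = 97.0
REASONABLE_MATCH = 90.0


@dataclass
class Hit:
    identity: float   # percent identity of the BLAST match
    released: date    # release date of the matched reference sequence
    annotated: bool   # True if the reference is named below the kingdom level


def classify(deposited: date, hits: list[Hit]) -> str:
    """Label one kingdom-level sequence by its best usable match at deposition time."""
    usable = [h.identity for h in hits if h.annotated and h.released <= deposited]
    best = max(usable, default=0.0)
    if best >= CLOSE_MATCH:
        return "close match"
    if best >= REASONABLE_MATCH:
        return "reasonable match"
    return "no or distant match"


def proportions(labels: list[str]) -> dict[str, float]:
    """Share of each label among all classified sequences."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}
```

Under these assumptions, a sequence whose only close match was released after its own deposition date falls into the "no or distant match" group, which is the distinction the caption draws with "at the time of sequence deposition/release".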
Figure 3. The 15 most common countries of collection for the publication-associated sequences annotated at or beyond the phylum level (green), expressed as the proportion of the sequences stemming from each country out of all phylum-level-and-beyond sequences. The corresponding value for publication-associated sequences annotated only at the kingdom level (orange) is similarly expressed as the proportion of sequences stemming from that country out of all kingdom-level sequences. Countries are shown in decreasing order of their share of the phylum-level sequences.
Figure 4. The proportion of false-negative sequences (had reasonable matches; green) and false-negative sequences (had close matches; blue) out of all kingdom-level sequences over time (2001-2020). The figure suggests that the practice of taking sequence annotation very lightly shows no sign of abating. The data for 2020 extend through early November 2020 and are thus partial.
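The per-year proportions plotted in Figure 4 amount to a grouped count. The Python sketch below shows one way to compute them, assuming each kingdom-level sequence has already been reduced to a hypothetical `(year, label)` pair; the function name and input shape are illustrative and not taken from the study.

```python
from collections import defaultdict


def yearly_share(records):
    """Return {year: {label: share}} from (year, label) pairs, one per sequence."""
    totals = defaultdict(int)
    by_label = defaultdict(lambda: defaultdict(int))
    for year, label in records:
        totals[year] += 1
        by_label[year][label] += 1
    # Divide each label count by that year's total number of kingdom-level sequences.
    return {
        year: {label: n / totals[year] for label, n in by_label[year].items()}
        for year in sorted(totals)
    }
```

Because each share is taken out of that year's own total, a partial year such as 2020 remains comparable to the complete years, although it rests on fewer sequences.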