R Henrik Nilsson, Erik Kristiansson, Martin Ryberg, Karl-Henrik Larsson.
Abstract
BACKGROUND: During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly so for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for comparison in public databases such as GenBank increases exponentially, only a minuscule fraction of all organisms have been sequenced, leaving taxon sampling a momentous problem for sequence-based taxonomic identification. When querying GenBank with a set of unidentified sequences, a considerable proportion typically lack fully identified matches, forming an ever-mounting pile of sequences that the researcher will have to monitor manually in the hope that new, clarifying sequences have been submitted by other researchers. To alleviate these concerns, a project to automatically monitor selected unidentified sequences in GenBank for taxonomic progress through repeated local BLAST searches was initiated. Mycorrhizal fungi, a field where species identification is often prohibitively complex, and the much-used ITS locus were chosen as a test bed.
Year: 2005 PMID: 16022740 PMCID: PMC1186019 DOI: 10.1186/1471-2105-6-178
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The fly agaric: a common mycorrhizal fungus. a) Fruiting-bodies of the ectomycorrhizal fly agaric (Amanita muscaria). b) Root-tip mycelia of the Amanita type.
Functions of the emerencia web-service. Functions of the emerencia web-service at ; some examples and an informal walkthrough are also given at this address. The output of the functions features relayed hyperlinks to GenBank, Google, and Tree of Life for quick retrieval of additional information; where applicable, insufficiently identified sequences are also hyperlinked to the SEARCH FOR INSUFFICIENTLY IDENTIFIED SEQUENCE BY ACCESSION NUMBER function for a more detailed description of the sequence and its matches.
| SEARCH FOR INSUFFICIENTLY IDENTIFIED SEQUENCE BY ACCESSION NUMBER | For any given accession number of an insufficiently identified sequence, this function shows the present and previous best BLAST matches from the table of identified sequences together with match scores and relevant annotation. A Clustal W multiple alignment [34] of the sequences is generated and shown as an important aid in interpreting the BLAST match values. In addition, all the above is shown for the present and previous best BLAST matches in the table of insufficiently identified sequences. This function requires that the accession number provided by the user be present in the table of insufficiently identified sequences. |
| CHECK SPECIFIC PUBLICATION FOR INSUFFICIENTLY IDENTIFIED SEQUENCES AND THEIR IDENTITY | This function retrieves all insufficiently identified sequences stemming from the user-specified publication and shows the present best identified BLAST match (and some additional information) for those sequences. The function expects 5–10 distinct key words from the title/author/journal fields of the publication and requires that at least one insufficiently identified sequence was released together with the publication in question. |
| SEARCH FOR INSUFFICIENTLY IDENTIFIED SEQUENCES MATCHING ACCESSION NUMBER OF IDENTIFIED TAXA | For any given accession number in the table of identified sequences, this function retrieves and details all entries in the table of insufficiently identified sequences for which this accession number represents the best BLAST match. It requires that the specified accession number be present in the table of identified sequences and will proceed only if that accession number indeed constitutes the best BLAST match of at least one insufficiently identified sequence. |
| SEARCH FOR INSUFFICIENTLY IDENTIFIED SEQUENCES BY KEY WORD | This function lets the user query the species annotation field of the table of insufficiently identified sequences using 2–5 key words, and displays all insufficiently identified sequences matching the key words. For those sequences, the best BLAST match to the table of identified sequences will be shown together with some additional information. |
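The lookups described in the table above can be illustrated with a minimal in-memory sketch. It is not the emerencia implementation: the class and function names, the accession numbers, and the annotations are all hypothetical, and the rule that every key word must occur in the annotation is an assumption. The publication-based search is left out because the source does not say how publications are stored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Match:
    accession: str  # accession number of the matching identified sequence
    e_value: float  # BLAST E-value of the match


@dataclass
class Entry:
    """One row of a hypothetical table of insufficiently identified sequences."""
    accession: str
    annotation: str  # species annotation field
    present_best: Optional[Match] = None
    previous_best: Optional[Match] = None


def record_best_match(entry: Entry, match: Match) -> None:
    """Store the best match of the latest BLAST run; keep the old one if it changed."""
    current = entry.present_best
    if current is not None and current.accession != match.accession:
        entry.previous_best = current
    entry.present_best = match


def search_by_accession(table: Dict[str, Entry], accession: str) -> Entry:
    """Look up one insufficiently identified sequence; it must be in the table."""
    if accession not in table:
        raise KeyError(f"{accession} is not an insufficiently identified sequence")
    return table[accession]


def search_by_identified_accession(table: Dict[str, Entry],
                                   identified_accession: str) -> List[Entry]:
    """Return all entries whose present best match is the given identified sequence."""
    return [e for e in table.values()
            if e.present_best is not None
            and e.present_best.accession == identified_accession]


def search_by_key_words(table: Dict[str, Entry], key_words: List[str]) -> List[Entry]:
    """Return entries whose annotation contains all key words (assumed rule)."""
    if not 2 <= len(key_words) <= 5:
        raise ValueError("expected 2-5 key words")
    wanted = [w.lower() for w in key_words]
    return [e for e in table.values()
            if all(w in e.annotation.lower() for w in wanted)]


# Made-up data: two insufficiently identified sequences and three BLAST results.
table = {
    "UNID001": Entry("UNID001", "uncultured ectomycorrhizal fungus"),
    "UNID002": Entry("UNID002", "uncultured fungus"),
}
record_best_match(table["UNID001"], Match("ID100", 1e-120))  # earlier run
record_best_match(table["UNID001"], Match("ID200", 0.0))     # later run, new best match
record_best_match(table["UNID002"], Match("ID200", 1e-150))
```

Keeping the previous best match next to the present one is what lets the first function show whether a sequence has gained a better identified match between two BLAST runs.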
A brief summary of the sequence data underlying the emerencia web service at as of May 2005. The threshold BLAST E-values for "good" and "poor" matches were arbitrarily set to 0.0 and 1e-100, respectively. Graphical illustrations showing the population of the database over time and additional aspects of emerencia are generated automatically on a monthly basis and are available at the above address.
| NUMBER OF INSUFFICIENTLY IDENTIFIED SEQUENCES | 7528 (21% of total) |
| NUMBER OF IDENTIFIED SEQUENCES | 28959 (79% of total) |
| NUMBER OF INSUFFICIENTLY IDENTIFIED SEQUENCES WITH GOOD MATCHES (E-VALUE = 0.0) | 4791 (64% of the insufficiently identified sequences) |
| NUMBER OF INSUFFICIENTLY IDENTIFIED SEQUENCES WITH POOR MATCHES (E-VALUE > 1E-100) | 1135 (15% of the insufficiently identified sequences) |
| TOTAL NUMBER OF SEQUENCES LAST UPDATED BEFORE 1995-01-01 | 180 (0.5%) |
| TOTAL NUMBER OF SEQUENCES LAST UPDATED BEFORE 2000-01-01 | 3651 (10%) |
| TOTAL NUMBER OF SEQUENCES LAST UPDATED BEFORE 2005-01-01 | 31858 (87%) |
| NUMBER OF INSUFFICIENTLY IDENTIFIED SEQUENCES LAST UPDATED BEFORE 2000-01-01 | 264 (3.5% of the insufficiently identified sequences) |
| NUMBER OF INSUFFICIENTLY IDENTIFIED SEQUENCES LAST UPDATED BEFORE 2000-01-01 AND WITH POOR MATCHES (E-VALUE > 1E-100) | 17 (0.2% of the insufficiently identified sequences) |
| NUMBER OF IDENTIFIED SEQUENCES HAVING AT LEAST ONE INSUFFICIENTLY IDENTIFIED COUNTERPART AS IDENTIFIED BY BLAST | 2981 (10% of the identified sequences) |
| NUMBER OF IDENTIFIED SEQUENCES WITHOUT INSUFFICIENTLY IDENTIFIED COUNTERPARTS | 25978 (90% of the identified sequences) |
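The E-value thresholds and percentages in the table above can be checked with a short sketch. The thresholds and counts come from the table; the function names and the label "intermediate" for matches falling between the two thresholds are assumptions made for illustration.

```python
GOOD_E_VALUE = 0.0     # "good" matches: E-value = 0.0
POOR_E_VALUE = 1e-100  # "poor" matches: E-value > 1e-100


def classify_match(e_value: float) -> str:
    """Classify a best BLAST match by its E-value using the table's thresholds."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value == GOOD_E_VALUE:
        return "good"
    if e_value > POOR_E_VALUE:
        return "poor"
    return "intermediate"  # hypothetical label for 0 < E-value <= 1e-100


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts taken from the table (May 2005).
insufficient = 7528
identified = 28959
total = insufficient + identified

summary = {
    "insufficiently identified, of total": percent(insufficient, total),
    "identified, of total": percent(identified, total),
    "good matches, of insufficiently identified": percent(4791, insufficient),
    "poor matches, of insufficiently identified": percent(1135, insufficient),
    "identified with counterpart, of identified": percent(2981, identified),
    "identified without counterpart, of identified": percent(25978, identified),
}

for label, value in summary.items():
    print(f"{label}: {value:.1f}%")
```

Rounded to whole numbers, these values agree with the percentages given in the table, and the two counts of identified sequences (2981 and 25978) sum to the 28959 identified sequences.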