Thomas Sauvage, Sophie Plouviez, William E Schmidt, Suzanne Fredericq.
Abstract
OBJECTIVE: The body of DNA sequence data lacking taxonomically informative sequence headers is rapidly growing in user and public databases (e.g. sequences lacking identification and contaminants). In the context of systematics studies, sorting such sequence data for taxonomic curation and/or molecular diversity characterization (e.g. crypticism) often requires the building of exploratory phylogenetic trees with reference taxa. The subsequent step of segregating DNA sequences of interest based on observed topological relationships can represent a challenging task, especially for large datasets.
Keywords: Barcoding; Biodiversity; Clone; Contaminant; Cryptic; Environmental; FigTree; Forensic; Metabarcoding; OTU; Phylogeny; Systematics
Year: 2018 PMID: 29506565 PMCID: PMC5838971 DOI: 10.1186/s13104-018-3268-y
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Fig. 1 Simulated phylogeny displaying taxa named ‘A’ to ‘T’. (a) Basic workflow for FASTA sequence extraction with TREE2FASTA. An exploratory tree is built following multiple-alignment of FASTA data. The Newick tree string (NWK) is visualized and edited in the tree-viewer FigTree and saved as a NEXUS file (NEX). TREE2FASTA uses the FASTA alignment and the NEXUS file (NEX) to produce subsetted FASTA files according to the user's selection scheme (here color). (b) Example of possible color and/or annotation selection schemes in FigTree for TREE2FASTA sequence extraction. The FASTA icon marked with an asterisk ‘*’ contains FASTA sequences for taxa H and I, which lack color selection (i.e. achromatic) or annotation. For figure clarity, annotations ‘Group1’ to ‘Group4’ are abbreviated G1 to G4 within FASTA file icons. FASTA files output to different folders are delimited by dashed boxes.
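The color-based extraction step in the workflow above can be sketched in a few lines of Python. This is an illustration only and not TREE2FASTA's own code: the function names (`parse_taxon_colors`, `read_fasta`, `split_by_color`) are hypothetical, the sketch assumes that FigTree stores tip colors in the NEXUS `taxlabels` block as entries such as `A[&!color=#ff0000]` with one taxon per line, and it assumes that FASTA headers match the tree labels exactly. Annotation-based schemes and folder output are not covered.

```python
import re
from collections import defaultdict

# Assumed layout of one taxlabels entry: a bare or single-quoted label,
# optionally followed by a FigTree-style comment such as [&!color=#ff0000].
TAXON_RE = re.compile(r"^\s*('(?:[^']|'')*'|[^\s\[;]+)(?:\[&([^\]]*)\])?\s*$")
COLOR_RE = re.compile(r"!color=(#[0-9A-Fa-f]{6})")


def parse_taxon_colors(nexus_text):
    """Map each taxon label to its hex color, or None when uncolored."""
    colors = {}
    in_taxlabels = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.lower() == "taxlabels":
            in_taxlabels = True
            continue
        if in_taxlabels:
            if stripped == ";":  # end of the taxlabels statement
                break
            match = TAXON_RE.match(line)
            if not match:
                continue
            label = match.group(1).strip("'")
            color = COLOR_RE.search(match.group(2) or "")
            colors[label] = color.group(1).lower() if color else None
    return colors


def read_fasta(fasta_text):
    """Return {header: sequence}, joining multi-line sequences."""
    records = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def split_by_color(fasta_text, nexus_text):
    """Group FASTA records by tip color; uncolored or unknown taxa go to 'achromatic'."""
    colors = parse_taxon_colors(nexus_text)
    groups = defaultdict(dict)
    for header, sequence in read_fasta(fasta_text).items():
        groups[colors.get(header) or "achromatic"][header] = sequence
    return dict(groups)


# Toy input mirroring taxa A, B and H from the simulated phylogeny.
demo_nexus = """#NEXUS
begin taxa;
    dimensions ntax=3;
    taxlabels
    A[&!color=#ff0000]
    B[&!color=#0000ff]
    H
;
end;
"""
demo_fasta = ">A\nACGT\n>B\nAC-T\n>H\nA-GT\n"

for color, records in sorted(split_by_color(demo_fasta, demo_nexus).items()):
    print(color, sorted(records))
```

Each key of the returned dictionary corresponds to one subsetted FASTA file in the figure; writing the groups to disk would be a separate step.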
Fig. 2 Sorting GenBank 16S rDNA for red seaweeds with TREE2FASTA. (a) Successive edits done in FigTree to establish an annotated design nested by color for reference and environmental red seaweeds (Florideophytes). (b) Folders and subsetted FASTA files output by TREE2FASTA for downstream analyses (folder content separated by dashed lines). For figure clarity, the Florideophyceae annotation was abbreviated to ‘Flo’ within FASTA file icons. The tree was produced with the 500 closest matches to Taenioma perpusillum (MF101452) on GenBank®.
Elapsed time for TREE2FASTA execution on 1000+ sequence FASTA datasets
| Database (edited scheme) | Sequences | Length (bp) | Wall-clock | CPU |
|---|---|---|---|---|
|  | 1957 | 483 | 1.055 | 1.025 |
|  | 1957 | 483 | 1.056 | 1.040 |
| 16S PhytoREF (color) | 4191 | 3379 | 12.973 | 12.646 |
| 16S PhytoREF (color + annotation) | 4191 | 3379 | 13.164 | 12.871 |
Times are reported as wall-clock (= ‘real’) and CPU (‘user’ + ‘sys’). See text for computing system specifications. Length refers to the sequence multiple-alignment length [in base pairs (bp)].
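To make the distinction between the two timing columns concrete, the following Python sketch measures both quantities for a single function call. It is not how the values in the table were obtained; the helper name `timed` is hypothetical, and it measures the current process rather than an external script run.

```python
import os
import time


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    result = func(*args)
    cpu_end = os.times()
    wall = time.perf_counter() - wall_start
    # CPU time is the sum of user-mode and kernel-mode ('sys') time.
    cpu = (cpu_end.user - cpu_start.user) + (cpu_end.system - cpu_start.system)
    return result, wall, cpu


total, wall, cpu = timed(sum, range(1_000_000))
print(f"wall-clock: {wall:.3f} s, CPU (user + sys): {cpu:.3f} s")
```

Wall-clock time includes any waiting on disk or other processes, so for a single-threaded run it is normally at least as large as the CPU time, as in the table.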