Sarah E Schmedes1, Jonathan L King1, Bruce Budowle2.
Abstract
Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes quickly and with relative ease, at a substantially reduced cost per nucleotide and hence per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, it is challenging to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of the supporting data that accompany genomes downloaded from the National Center for Biotechnology Information. AutoCurE curates local genomic databases automatically: it compares the downloaded supporting data to the genome reports to verify genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions, and it flags any inconsistencies or errors. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and genome reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation of large-scale genome data prior to downstream analyses.
Keywords: AutoCurE; automation; bacteria; curation; database; genomes; metadata
Year: 2015 PMID: 26442252 PMCID: PMC4566056 DOI: 10.3389/fbioe.2015.00138
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Inconsistencies between genome downloads and genome reports.
| Round | Description | Bacteria | Archaea | Total |
|---|---|---|---|---|
| Round 1 (genome name search) | Downloaded genomes from ftp site | 2,605 | 164 | 2,769 |
| Round 2 (genome accession number search) | Accession number found in genome report, genome name change (includes strain name) | 87 | | |
| Round 3 (“one-by-one” manual curation) | Starting number of genomes | 2,402 | | |
AutoCurE Genome Filename and Report Tools.
| Prints list directory of downloaded genomes and file paths |
| Pulls out first line of text from files to provide RefSeq accession number and sequence file description |
| Parses metadata from genome reports and data downloads into lists to compare BioProject/UID, RefSeq accession number, genome folder name, file name, and file description |
| File manipulation to eliminate manual searching within directories. Allows the user to check desired genomes in the Excel workbook and a copy of the genome files is made to another directory for downstream use, thus keeping an unaltered master copy of the database |
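The functions listed above can be approximated outside Excel. The following is a minimal Python sketch, not AutoCurE's own implementation (AutoCurE is an Excel-based tool). It assumes a local database laid out as one folder per genome containing `.fna` files, and it assumes the first line of each file is a FASTA header in either the legacy `>gi|<number>|ref|<accession>| <description>` form or the plain `><accession> <description>` form. The function names `parse_fasta_header`, `list_genome_files`, and `copy_selected` are hypothetical.

```python
from pathlib import Path
import shutil


def parse_fasta_header(first_line):
    """Split the first line of text of an fna file into (accession, description).

    Assumption: the header is either '>gi|<number>|ref|<accession>| <description>'
    or '><accession> <description>'.
    """
    text = first_line.strip().lstrip(">").strip()
    ident, _, description = text.partition(" ")
    accession = ident
    if "|" in ident:
        fields = ident.split("|")
        # In the legacy form the accession follows the 'ref' tag.
        if "ref" in fields and fields.index("ref") + 1 < len(fields):
            accession = fields[fields.index("ref") + 1]
    return accession, description.strip()


def list_genome_files(database_root):
    """List every .fna file with its folder name, file name, path, and header fields."""
    rows = []
    for path in sorted(Path(database_root).rglob("*.fna")):
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            first_line = handle.readline()
        accession, description = parse_fasta_header(first_line)
        rows.append({
            "folder": path.parent.name,
            "file": path.name,
            "path": path,
            "accession": accession,
            "description": description,
        })
    return rows


def copy_selected(rows, destination):
    """Copy the selected genome files to another directory, leaving the master copy untouched."""
    destination = Path(destination)
    copied = []
    for row in rows:
        target_dir = destination / row["folder"]
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(row["path"], target_dir)))
    return copied
```

Copying rather than moving the selected files mirrors the design choice described in the table: downstream work operates on the copy, so the downloaded database stays unaltered.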
Figure 1. AutoCurE Genome Report Tool. AutoCurE compared content from the genome report, genome folder name, and fna file description to flag inconsistencies for nine metadata categories. Flags, shown as red Xs, were generated, indicating that a RefSeq accession number was not found in the genome report and that the genus and species names were inconsistent. Additional columns in the Genome Report Tool, not shown, include a Comments section and metadata taken from the NCBI genome reports associated with each downloaded file. Columns E and F group the files associated with a particular genome by color (Column E) and by number (Column F). (FLT, First Line of Text within the fna file).
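The kind of comparison shown in the figure can be illustrated with a short Python sketch. It is an illustration only, not the tool's actual logic: it covers three of the flags mentioned in the caption (accession not found in the genome report, genus mismatch, species mismatch) rather than all nine metadata categories. It assumes that the genus and species are the first two words of the folder name and of the fna file description, and that the genome report has been reduced to a dictionary mapping RefSeq accession to organism name. The names `genus_species`, `flag_genome`, and the flag keys are hypothetical.

```python
def genus_species(name):
    """Return (genus, species) as the first two words of a name, lowercased.

    Assumption: folder names separate words with underscores, and the genus
    and species come first in both folder names and file descriptions.
    """
    parts = name.replace("_", " ").split()
    genus = parts[0].lower() if parts else ""
    species = parts[1].lower() if len(parts) > 1 else ""
    return genus, species


def flag_genome(folder_name, accession, description, report_by_accession):
    """Compare one downloaded file against the genome report; True means flagged."""
    report_name = report_by_accession.get(accession)
    folder_genus, folder_species = genus_species(folder_name)
    file_genus, file_species = genus_species(description)

    flags = {
        "accession_not_in_report": report_name is None,
        "genus_mismatch": folder_genus != file_genus,
        "species_mismatch": folder_species != file_species,
    }
    if report_name is not None:
        # Also compare against the organism name recorded in the genome report.
        report_genus, report_species = genus_species(report_name)
        flags["genus_mismatch"] = flags["genus_mismatch"] or report_genus != folder_genus
        flags["species_mismatch"] = flags["species_mismatch"] or report_species != folder_species
    return flags
```

A record whose flags are all `False` is consistent across the three sources; any `True` value corresponds to a red X that would need review.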