VecScreen_plus_taxonomy: imposing a tax(onomy) increase on vector contamination screening.

Alejandro A Schäffer, Eric P Nawrocki, Yoon Choi, Paul A Kitts, Ilene Karsch-Mizrachi, Richard McVeigh.

Abstract

Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches.
Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines: one for retrospective analysis of contamination in GenBank and one for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions.
Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy.
Contact: aschaffe@helix.nih.gov.
Supplementary information: Supplementary data are available at Bioinformatics online.
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
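
The taxonomy idea in the Results can be illustrated with a small sketch. This is not the authors' pipeline: the toy parent map, the function names, and the `max_steps` threshold are all hypothetical, chosen only to show how a match might be flagged as a likely false positive when the query organism and the vector segment's source organism are taxonomically close.

```python
# Toy child -> parent map; a heavily simplified, hypothetical taxonomy
# used for illustration only (not real NCBI Taxonomy data).
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Eukaryota",
    "Bacteria": "root",
    "Eukaryota": "root",
}


def lineage(taxon):
    """Return the path from `taxon` up to the top of the toy tree, taxon first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def steps_to_common_ancestor(a, b):
    """Count steps from `a` up to the lowest taxon that is also on b's lineage."""
    on_lineage_of_b = set(lineage(b))
    for steps, taxon in enumerate(lineage(a)):
        if taxon in on_lineage_of_b:
            return steps
    return None  # no shared ancestor in the toy tree


def likely_false_positive(query_taxon, vector_source_taxon, max_steps=1):
    """Flag a match as a likely false positive when both taxa lie within
    `max_steps` of their lowest common ancestor (hypothetical threshold)."""
    up_from_query = steps_to_common_ancestor(query_taxon, vector_source_taxon)
    up_from_vector = steps_to_common_ancestor(vector_source_taxon, query_taxon)
    if up_from_query is None or up_from_vector is None:
        return False
    return max(up_from_query, up_from_vector) <= max_steps


# An E. coli sequence matching an E. coli-derived vector segment is suspect
# as a false positive; a human sequence matching the same segment is not.
same_organism = likely_false_positive("Escherichia coli", "Escherichia coli")
distant_organism = likely_false_positive("Homo sapiens", "Escherichia coli")
```

A real implementation would presumably read lineages from the NCBI Taxonomy database and the source annotations distributed with UniVec rather than from a hard-coded map, and would choose its closeness criterion empirically.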

Year:  2018        PMID: 29069347      PMCID: PMC6030928          DOI: 10.1093/bioinformatics/btx669

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


