
A curated dataset of complete Enterobacteriaceae plasmids compiled from the NCBI nucleotide database.

Alex Orlek^1,2, Hang Phan^1,2, Anna E Sheppard^1,2, Michel Doumith^3, Matthew Ellington^2,3, Tim Peto^1,2, Derrick Crook^1,2, A Sarah Walker^1,2, Neil Woodford^2,3, Muna F Anjum^2,4, Nicole Stoesser^1.

Abstract

Thousands of plasmid sequences are now publicly available in the NCBI nucleotide database, but they are not reliably annotated to distinguish complete plasmids from plasmid fragments, such as gene or contig sequences; retrieving complete plasmids for downstream analyses is therefore challenging. Here we present a curated dataset of complete bacterial plasmids from the clinically relevant Enterobacteriaceae family. The dataset was compiled from the NCBI nucleotide database using curation steps designed to exclude incomplete plasmid sequences and chromosomal sequences misannotated as plasmids. Over 2000 complete plasmid sequences are included in the curated plasmid dataset. Protein sequences produced by translating each complete plasmid nucleotide sequence in all 6 frames are also provided. Further analysis and discussion of the dataset are presented in an accompanying research article: "Ordering the mob: insights into replicon and MOB typing…" (Orlek et al., 2017) [1]. The curated plasmid sequences are publicly available in the Figshare repository.


Keywords: Complete genomes; Enterobacteriaceae family; Plasmids; Sequence data curation

Year: 2017 PMID: 28516137 PMCID: PMC5426034 DOI: 10.1016/j.dib.2017.04.024

Source DB: PubMed Journal: Data Brief ISSN: 2352-3409

Value of the data

• To our knowledge, this is currently the only large curated dataset of complete plasmids, compiled according to well-defined, and transparently validated, inclusion and exclusion criteria.
• The data could be used to benchmark the performance of plasmid typing schemes [1].
• The data could be used for reference-based plasmid analyses [2]; for example, contigs could be queried against the curated plasmid sequences with the aim of distinguishing plasmid from chromosomal contigs [3] or assessing plasmid genetic content [4].
• The protein dataset is a useful resource for MOB typing [5]. Information about sequence conservation from aligned protein database sequences can be harnessed using more powerful profile-based homology searching [6], enabling improved MOB typing compared with standard protein BLAST. A bioinformatic protocol and code for MOB typing using the protein dataset are provided on GitHub (https://github.com/AlexOrlek/MOBtyping).
• Those interested in the epidemiology of plasmid-mediated antibiotic resistance in the Enterobacteriaceae family could use the data to extend previous analyses [1].

Data

The data consist of nucleotide sequences of 2097 complete Enterobacteriaceae plasmids, compiled from the NCBI nucleotide database (‘nucleotideseq.fa’). In addition, we provide a corresponding dataset of 12,582 protein sequences (‘translatedproteinseq.fa’), derived from translating each plasmid nucleotide sequence in all 6 frames (2097 plasmids × 6 frames). Nucleotide and protein sequence datasets are formatted as FASTA files. Headers in the protein FASTA file are in the following format: >accession id|strand|frame|protein sequence length. NCBI GenBank files, with detailed information on accessions, are also provided. One GenBank file contains the 2097 complete curated plasmid accessions (‘filtered_2097plasmids.gb.gz’). Another GenBank file contains 6952 accessions (‘6952plasmids.gb.gz’), obtained using an initial query, prior to removing duplicate sequences or applying inclusion/exclusion criteria.
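As an illustration of how the protein records relate to the nucleotide records, the following minimal sketch translates one sequence in all 6 frames and builds headers in the accession id|strand|frame|protein sequence length layout described above. It is not the authors' code: the function names, the "+"/"-" strand labels, the 1 to 3 frame numbering, and the choice to keep stop codons as "*" within a single per-frame sequence are all assumptions made for the example, and "ACC1" is a made-up accession.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate whole codons only; stops become '*', unknown codons 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_records(accession, seq):
    """Yield one (header, protein) pair per frame: 3 forward, 3 reverse."""
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in (1, 2, 3):  # assumed 1-based frame labels
            protein = translate(template[frame - 1:])
            yield f"{accession}|{strand}|{frame}|{len(protein)}", protein


def parse_header(header):
    """Split a protein FASTA header into its four '|'-separated fields."""
    accession, strand, frame, length = header.lstrip(">").rsplit("|", 3)
    return {
        "accession": accession,
        "strand": strand,
        "frame": int(frame),
        "length": int(length),
    }


records = list(six_frame_records("ACC1", "ATGGCCTAA"))
```

Each input sequence yields exactly six records, which is consistent with the 12,582 protein sequences reported for 2097 plasmids. The labels actually used in ‘translatedproteinseq.fa’ should be checked against the file itself before relying on `parse_header`.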

Experimental design, materials and methods

Putative complete plasmid accessions were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/) on 26th August 2016, using an Entrez query with filters to exclude some incomplete or non-plasmid accessions at this stage. Following this initial query, duplicate sequences (those sharing 100% nucleotide sequence identity with another retrieved sequence) were removed. Biopython scripts [7] were used to filter out non-coding sequences. Regular expression searches of accession title descriptions were used to apply exclusion and inclusion criteria. Subsequent filtering involved conducting multi-locus sequence typing (MLST) to exclude chromosomal accessions misannotated as plasmids. In addition, the ‘completeness’ annotation (included as accession metadata in NCBI) was used to further exclude partial plasmid sequences. Finally, putative plasmids at the tails of the sequence length distribution were inspected manually to remove remaining accessions that represented chromosomal sequences or partial plasmid sequences. A more detailed description of these methods can be found in the accompanying research article [1].
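Two of the filtering steps above, removal of identical sequences and regular expression screening of title descriptions, can be sketched as follows. This is a hedged illustration only: the regular expressions, function names, accessions and titles are hypothetical stand-ins (the criteria actually applied are described in the accompanying article [1]), and treating "100% identity" as exact string equality is an assumption that would miss rotated or reverse-complemented copies of a circular plasmid.

```python
import hashlib
import re


def remove_exact_duplicates(records):
    """Keep the first (accession, title, seq) record for each distinct sequence."""
    seen = set()
    unique = []
    for accession, title, seq in records:
        # Hash the upper-cased sequence so long sequences are compared cheaply.
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((accession, title, seq))
    return unique


# Hypothetical patterns for illustration; not the criteria used in the article.
INCLUDE = re.compile(r"\bplasmid\b", re.IGNORECASE)
EXCLUDE = re.compile(r"\b(partial|contig|gene|cds)\b", re.IGNORECASE)


def passes_title_filter(title):
    """True if the title mentions a plasmid and no fragment-like keyword."""
    return bool(INCLUDE.search(title)) and not EXCLUDE.search(title)


# Made-up example records: (accession, title, sequence).
example = [
    ("A1", "Escherichia coli plasmid pExample, complete sequence", "ATGGCC"),
    ("A2", "Escherichia coli plasmid pCopy, complete sequence", "atggcc"),
    ("A3", "Klebsiella pneumoniae plasmid pDemo gene, partial cds", "ATGTTT"),
]
deduplicated = remove_exact_duplicates(example)
kept = [rec for rec in deduplicated if passes_title_filter(rec[1])]
```

In this toy input, the second record is dropped as a duplicate of the first and the third is dropped by the title filter. The MLST-based and manual steps described above are not represented in this sketch.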

Specifications Table

Subject area	Microbiology, Bioinformatics
More specific subject area	Plasmids
Type of data	Sequence data
How data was acquired	Plasmid nucleotide sequences were compiled from GenBank and RefSeq accessions contained within the NCBI nucleotide database. Corresponding protein sequences were generated by translating each plasmid nucleotide sequence in all 6 frames.
Data format	FASTA files, GenBank files (zipped)
Experimental factors	N/A
Experimental features	N/A
Data source location	Sequences were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/); geographic location metadata was not retrieved.
Data accessibility	Data is publicly available in the Figshare repository: https://figshare.com/s/18de8bdcbba47dbaba41 (DOI: 10.6084/m9.figshare.4609303).

References

1. The diversity of conjugative relaxases and its application in plasmid classification.

Authors: María Pilar Garcillán-Barcia; María Victoria Francia; Fernando de la Cruz
Journal: FEMS Microbiol Rev Date: 2009-05 Impact factor: 16.408

2. A comprehensive review and comparison of different computational methods for protein remote homology detection.

Authors: Junjie Chen; Mingyue Guo; Xiaolong Wang; Bin Liu
Journal: Brief Bioinform Date: 2018-03-01 Impact factor: 11.622

3. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

4. Plasmid Classification in an Era of Whole-Genome Sequencing: Application in Studies of Antibiotic Resistance Epidemiology.

Authors: Alex Orlek; Nicole Stoesser; Muna F Anjum; Michel Doumith; Matthew J Ellington; Tim Peto; Derrick Crook; Neil Woodford; A Sarah Walker; Hang Phan; Anna E Sheppard
Journal: Front Microbiol Date: 2017-02-09 Impact factor: 5.640

5. Beginner's guide to comparative bacterial genome analysis using next-generation sequence data.

Authors: David J Edwards; Kathryn E Holt
Journal: Microb Inform Exp Date: 2013-04-10

6. Ordering the mob: Insights into replicon and MOB typing schemes from analysis of a curated dataset of publicly available plasmids.

Authors: Alex Orlek; Hang Phan; Anna E Sheppard; Michel Doumith; Matthew Ellington; Tim Peto; Derrick Crook; A Sarah Walker; Neil Woodford; Muna F Anjum; Nicole Stoesser
Journal: Plasmid Date: 2017-03-09 Impact factor: 3.466
