Literature DB >> 34901351

Escherichia coli strain INF32/16/A: Dataset of raw reads and assembled draft genome.

Shuhaila Mat-Sharani1, Suhaila Sulaiman2, Nik Yusnoraini Yusof3.   

Abstract

Escherichia coli strain INF32/16/A is a gram-negative bacteria which is an extended-spectrum beta-lactamases (ESBL). ESBL is an enzyme that is produced by bacteria to become resistant to existing antibiotic such as extended-spectrum penicillin, cephalosporins, and have been threatening the ability to treat an infection. Therefore, genome analysis will provide an insight of how this bacteria able to evolve and the information obtained will able to facilitate in designing new antibiotics. The genome of E. coli strain was sequenced using Illumina MiSeq and raw genome sequence have been submitted into NCBI SRA database (SRR15334628) under Bioproject accession number PRJNA726861.
© 2021 The Authors. Published by Elsevier Inc.

Entities:  

Keywords:  Escherichia coli; Extended-spectrum beta-lactamase; Genome sequencing; Pathogenic

Year:  2021        PMID: 34901351      PMCID: PMC8639416          DOI: 10.1016/j.dib.2021.107640

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

The draft genome data of ESBL-producing E. coli strain from Malaysia could contribute fundamental knowledge of the emergence of ESBL-type. The data is crucial as it will be benefited researchers, medical and health sector to gain information on antimicrobial resistance genes of ESBL-producing E. coli strains from Malaysia. Whole genome sequence data from ESBL strain would be useful for comparative genomic analysis E. coli strains with other types of ESBL genes isolated in different countries. By unravelling the genome of this strain, the data may be leveraged by researchers to plan and design new antibiotics targeting the homologous pathogenic bacteria for better future management on emergence of new resistant strains.

Data Description

This data consists of raw reads of the E. coli strain INF32/16/A genome that was sequenced via Illumina MiSeq technology [1]. The paired-end data sets were named as 32-16-A_R1.fastq and 32-16-A_R2.fastq. Here, we report the pre-processing of the raw reads, assembly data statistics, assembled genome completeness and similarity search of the assembled genome with a curated public database. A total of 1,592,134 raw reads (consolidated from a paired-end dataset) were generated from the genome sequencing of E. coli strain INF32/16/A, that resulted into 381,590,477 total bases (Table 1). The reads were then pre-processed to filter out reads with low-quality, short, and adapter sequences, that accounts into 53.29% of clean reads. The clean reads were successfully assembled into 97 scaffolds with the longest scaffold being the same as N50 scaffold length of 3,201,741 bases (Table 2). The assembled genome size is 5,212,612 at 74 × sequence coverage with 50.32% GC content. This genome has 4771 protein coding sequences and 313 (9 rRNA, 69 tRNA, 6 ncRNA and 229 pseudo genes) non-coding sequences. The completeness of the genome assembly was evaluated by using Benchmarking Universal Single-Copy Orthologs (BUSCO) [2] with the lineage of enterobacterales_odb10, showing a 99.8% of complete BUSCOs found in the assembled data (Fig. 1). Of 4923 predicted genes by Prodigal [3], homology-based search indicated that 92.2% of them had homology with known curated proteins in Swiss-Prot database [4] with annotated functions in Gene Ontology (Fig. 2).
Table 1

Statistics of the pre-processing data of the genome reads containing forward (32-16-A_R1.fastq) and reverse (32-16-A_R2.fastq) reads.

Sample NameR1R2Total
Total Raw Reads796,067796,0671,592,134
Total Raw Reads Bases190,618,311190,972,166381,590,477
Total Clean Reads424,248424,248848,496
Total Clean Reads Bases85,679,12657,343,443143,022,569
Clean Reads (%)53.2953.2953.29 (average)
GC Content Clean Reads (%)5051-
Table 2

The main assembly statistics of the Medusa-assembled draft genome of E. coli strain INF32/16/A.

AttributesValue
Number of scaffolds97
Total size of scaffolds5,212,612
Longest scaffold3,201,741
Shortest scaffold220
Number of scaffolds > 1K nt62 (63.9%)
Number of scaffolds > 10K nt18 (18.6%)
Number of scaffolds > 100K nt4 (4.1%)
Number of scaffolds > 1M nt1 (1.0%)
Number of scaffolds > 10M nt0 (0.0%)
Mean scaffold size53,738
Median scaffold size1,667
N50 scaffold length3,201,741
L50 scaffold count1
GC Content50.32%
Fig. 1

Genome completeness of the assembled genome of E. coli strain INF32/16/A by using BUSCO tool with enterobacterales_odb10 lineage.

Fig. 2

Functional classification based on Gene Ontology for 99.2% of predicted gene models from Prodigal.

Statistics of the pre-processing data of the genome reads containing forward (32-16-A_R1.fastq) and reverse (32-16-A_R2.fastq) reads. The main assembly statistics of the Medusa-assembled draft genome of E. coli strain INF32/16/A. Genome completeness of the assembled genome of E. coli strain INF32/16/A by using BUSCO tool with enterobacterales_odb10 lineage. Functional classification based on Gene Ontology for 99.2% of predicted gene models from Prodigal.

Experimental Design, Materials and Methods

Material

The clinical isolate of ESBL E. coli strain was drawn from a patient's blood with nosocomial infection at Hospital Universiti Sains Malaysia. The clinical sample was collected and confirmed according to the previously described [5]. Briefly, the blood sample was cultured in Bactec 9240 blood culture system (Becton, Dickinson, USA). Then, the strain was tested with gram staining and biochemical techniques prior to evaluate the ESBL screening using Clinical and Laboratory Standards Institute (CLSI) [6]. It was then subcultured into nutrient broth and incubated in a shaking incubator overnight at 37 °C and 200 rpm. The DNA was extracted and examined with 2% electrophoresis gel and NanoDrop® ND-1000 Spectrophotometer (Thermo Scientific) for the purity quantification.

Genome sequencing

A total of 2.5 µg of DNA was used to prepare a 251 bp paired-end sequencing library using an Illumina paired-end DNA sample preparation kit. The quality of the library was assessed by real time PCR before continuing to cluster generation. Sequencing was performed using one lanes of Illumina MiSeq paired-end flow cell using 500 cycles to produce 2 × 251 bp paired-end reads.

Quality assessment and reads pre-processing

During preparation for sequencing, the bacteria genomic DNA fragments were attached to the sequencing adapters, which contains the anchoring site of the sequencing primers. Thus, all sequencing reads were scanned to filter the sequencing adapters to retain the portion containing the bacteria genomic DNA. Besides, trimming was also done for low quality bases (

Genome draft assembly

The high-quality reads of Illumina were assembled de novo using SPAdes v3.9.0 [7]. These contigs were subjected to scaffolding against the closest reference genomes [5] to produce a draft genome using Medusa (Multi-Draft based Scaffolder) software [8]. The top two hits complete genome from GenBank, E. coli strain AR-0427 (CP044148.1) and E. coli strain AR216.2b (CP043942), were used to construct the draft genome scaffold. The assembly statistics is shown in Table 2. The completeness of assembled draft genome was assessed by using BUSCO on a LINUX server. Bacteria dataset of enterobacterales_odb10 was used as its single-copy orthologs database (Fig. 1). The predicted genes by Prodigal that was generated by BUSCO analysis were searched against Swiss-Prot database to see the similarity of the assembled sequence. The similarity search shows about 92.2% of the predicted genes were similar to the manually curated protein database. These known genes were associated to different gene ontology classes, with the highest in biological process, molecular function and cellular component being cellular process (3490), binding (2,685) and cellular anatomical entity (3,077) (Fig. 2).

Ethics Statement

The study protocol was approved by the ethics committee of the Universiti Sains Malaysia (USM/JEPeM/20030152).

CRediT authorship contribution statement

Shuhaila Mat-Sharani: Software, Formal analysis, Data curation, Writing – review & editing. Suhaila Sulaiman: Conceptualization, Software, Formal analysis, Data curation, Writing – original draft. Nik Yusnoraini Yusof: Conceptualization, Software, Methodology, Resources, Writing – review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
SubjectHealth and medical sciences
Specific subject areaMicrobiology and genomics.Genome sequencing of pathogenic bacteria by using next generation sequencing approach.
Type of dataTableFigureRaw reads of sequenced genomeAssembled draft genome of E. coli strain INF32/16/A
How data were acquiredPaired-end reads of extended spectrum beta lactamase (ESBL)-producing E. coli strain INF32/16/A genome were sequenced using Illumina MiSeq.
Data formatRaw and analyzed.
Parameters for data collectionGenomic DNA from pure culture. 10 µg/ng of DNA was utilized for a 251 bp paired-end sequencing library using an Illumina paired-end DNA sample preparation kit.
Description of data collectionWhole genome sequencing performed by Illumina MiSeq system. Raw reads were trimmed using BBDuk (BBTools v36) and assembled using SPAdes v3.9.0. The scaffolding was conducted using Medusa v1.6. The genome completeness of the assembled genome was assessed using BUSCO tool.
Data source locationInstitution: Institute for Research in Molecular Medicine (INFORMM)City/Town/Region: Kubang Kerian, KelantanCountry: MalaysiaLatitude and longitude for collected samples/data: 6.10 N 102.28 E
Data accessibilityThe data is hosted on a public repository.Bioproject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA726861Biosample: https://www.ncbi.nlm.nih.gov/biosample/SAMN18971244NCBI GenBank Accession Number: NZ_JAGWDO010000000.1https://www.ncbi.nlm.nih.gov/nuccore/2035338809Repository name: NCBI SRA databaseData identification number: SRR15334628Direct URL to data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR15334628
  7 in total

1.  SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors:  Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal:  J Comput Biol       Date:  2012-04-16       Impact factor: 1.479

2.  The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000.

Authors:  A Bairoch; R Apweiler
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

3.  MeDuSa: a multi-draft based scaffolder.

Authors:  Emanuele Bosi; Beatrice Donati; Marco Galardini; Sara Brunetti; Marie-France Sagot; Pietro Lió; Pierluigi Crescenzi; Renato Fani; Marco Fondi
Journal:  Bioinformatics       Date:  2015-03-25       Impact factor: 6.937

4.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09       Impact factor: 6.937

5.  Prodigal: prokaryotic gene recognition and translation initiation site identification.

Authors:  Doug Hyatt; Gwo-Liang Chen; Philip F Locascio; Miriam L Land; Frank W Larimer; Loren J Hauser
Journal:  BMC Bioinformatics       Date:  2010-03-08       Impact factor: 3.169

6.  Illumina MiSeq Sequencing for Preliminary Analysis of Microbiome Causing Primary Endodontic Infections in Egypt.

Authors:  Sally Ali Tawfik; Marwa Mohamed Azab; Ali Abdellah Abdelrahman Ahmed; Dalia Mukhtar Fayyad
Journal:  Int J Microbiol       Date:  2018-04-03
  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.