Literature DB >> 30581915

Escherichia coli O157:H7 transcriptome datasets for comparison of RNA-seq and microarray platforms.

Ewa Grabowiecka1, David Martin1, Louise Crozier2, Nicola Holden2.   

Abstract

Whole transcriptome analysis to investigate differential gene expression and regulatory adaption can be carried out on two different technological platforms: by probe hybridisation to microarrays or by RNAseq for deep sequencing. Since there are difference in terms of their genome coverage, sensitivity and cost, there is a requirement for robust comparisons to determine the platform of choice. Here, we present datasets for the whole transcriptional response verocytoxigenic Escherichia coli (VTEC) obtained from RNA-seq and microarray platforms in response to spinach, together with a comparison between the datasets (available at Array Express: E-MTAB-3249, E-MTAB-4120, E-MTAB-7441).

Entities:  

Year:  2018        PMID: 30581915      PMCID: PMC6297237          DOI: 10.1016/j.dib.2018.11.136

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications table Value of the data Direct comparison between transcriptome platforms can allow for the optimal approach to be chosen. Microarray platforms can offer a cheap and easy approach for transcriptional analysis for model organisms, like Escherichia coli O157:H7, but are limited by the probe set and potentially, sensitivity. RNA-seq does not have the same limitations but is costlier and requires specific skills for analysis. There are few published reports that make direct comparisons of the platforms; most adopt either one or the other. Here, the same sample set was applied to RNAseq and microarray transcriptome platforms to provide the most robust comparison. The comparison of the datasets showed a strong correlation between the platforms (R > 0.8) but the presence of outliers highlights differences in their outputs.

Data

The microarray [1] and RNA-seq datasets are available in ArrayExpress. A comparative analysis pipeline (Fig. 1) was implemented for bioinformatics analysis and downstream assessment. The microarray dataset required additional processing steps since it is based on probes rather than genes and contains non-target probes from multiple Escherichia coli isolates. Correlation between the datasets from each platform required normalization before the comparison of the Log-fold change (spinach leaf lysate condition relative to the control no-plant condition) could be made. The Pearson and Kolmogorov–Smirnov tests of the Log-fold change datasets indicated strong correlation, although the distribution was tailed outside the range of −2 to +2-fold change (Fig. 2).
Fig. 1

Flow diagram of analysis steps and processes. RNAseq (blue) and microarray (green) process steps are in blue and green respectively. Other steps were done in R (pink) or externally (yellow).

Fig. 2

Correlation graph of expression profiles. Correlation of differential expression of genes by Log fold-change (logFC), acquired from the RNAseq and microarray datasets.

Flow diagram of analysis steps and processes. RNAseq (blue) and microarray (green) process steps are in blue and green respectively. Other steps were done in R (pink) or externally (yellow). Correlation graph of expression profiles. Correlation of differential expression of genes by Log fold-change (logFC), acquired from the RNAseq and microarray datasets.

Experimental design, materials and methods

Sample preparations

RNA and cDNA samples from the E. coli O157:H7 Sakai strain were obtained and used for microarray analysis as described previously [1]. The cDNA library for RNA-Seq was constructed from the same RNA samples, using the approach described for previously Xanthomonas [2], and run on an Illumina Hiseq. 2000 (Genomic Sequencing Unit, the University of Dundee, Dundee, UK).

Data analysis

All bioinformatics scripts and processes are listed in Table 1 and illustrated in Fig. 1.
Table 1

Analysis steps (EA) and associated scripts and processes.

EA #DescriptionScripts (refer to GitHub) and codes
1Process flow diagramn/a: graphics (Fig. 1)
2Script for microarray analysis1.Microarray_Data_Analysis
3AScript for generating the Bowtie2 indexBowtie2-build –f sakaigenome.fasta EcoliSakai
3BScript for both paired end and unpaired alignment using Bowtie2Unpaired: bowtie2 -N -x EcoliSakai -U file1.fastq.gz, file2.fastq.gz.sam file
Paired end: bowtie2 -N -x EcoliSakai -1 file.fastq.gz -2 file2.fastq.gz –S.sam file
3CScript for index of featureCounts{ printf ׳GeneID/tChr/tStart/tEnd/tStrand/n׳; (grep -v CDS Sakai.gff | grep gene| sed -e ׳s/[^/t]*gene=//׳ -e ׳s//.[0–9]//׳| awk ׳BEGIN{OFS="/t"}{print $9, $1, $4, $5,$7}׳)} |uniq > exons.saf
3DfeatureCounts scriptfeatureCounts –a exons.saf –F SAF –o outputfile.txt.sam file
3EScript for using blastdb for creation of Sakai databaseformatdb name.fasta –n databaseName –t title –p F
3FScript for using matching the probe sequences to the databaseblastall -p blastn –m8 –I Agilentprobes.fasta –d SakaiDatabase –v |-o output.txt
4Script for RNA-Seq analysis2.RNASeq_Data_Analysis/3. DEG_analysis
5Python Script for changing sequence headers in a fasta file3.Microarray_Vs_RNASeq/1.fasta_file
6Pyhon Script for removing blast hits which are not suitable3.Microarray_Vs_RNASeq/2.blastn and sort
7FastQC reports2.RNASeq_Data_Analysis/1.FastQC
8Generation of Volcano PlotsVolcano plot
9Microarray Top Table1.Microarray_Data_Analysis
10RNASeq Top Table2.RNASeq_Data_Analysis/3. DEG_analysis
11Comparison between RNAseq samples for paired end and random alignment2.RNASeq_Data_Analysis/2.Alighment_and_Count
12Comparison between Microarray and RNASeq3.Microarray_Vs_RNASeq/3.table merge
Analysis steps (EA) and associated scripts and processes.

Microarray

The published microarray dataset [1] was reanalysed to permit a dataset comparison. Raw data were normalised between samples to ensure consistency across data sets. Probes from each sample were filtered in accordance with the corresponding fluorescence of the negative control probes on the microarray plate. Only probes which were 10% brighter than the 95% percentile of the negative probes were maintained for the analysis. Linear modelling was applied to the data using the R (v3.0.2) statistical language in the R studio software (v0.97.551) [3] utilising the Bioconductor Limma library (v3.18.7) [4] and following the Limma user guide (Script EA 2 – Table 1). A dendrogram was produced using the gPlots 2.12.1 [5].

RNA-seq

FastQC software (v0.10.1) [6] was used to perform a quality check of the raw data, according to the software specifications. The reads were then aligned to the Sakai reference genome (GenBank accession number BA000007) using the Bowtie (v2 2.1.0) aligner [7]. Firstly, an index was built using the Index Builder provided by Bowtie2 (Script EA 3A) and this was then used to align the RNA samples to the reference genome (Script EA 3B). Paired-end and random alignment was performed to compare the two (Script EA 11), producing one.sam file per each sample. Reads were then summarised using the featureCount software [8]. First, an annotation file was generated using the Sakai reference genome (GenBank accession number BA000007) (Script EA 3C). The generated reference file was then used to extract the count information from the aligned.sam files (Script EA 3D). The output was a tab delimited text file containing the name of the gene, start and end positions of the gene on the strand, count of mRNA and the strand direction. Files were then imported into R studio, normalised, and fitted into a linear model using the Voom function from the Bioconductor libraries Limma [9] and edgeR (v3.2.4) [10]. The log of the fold change between experiment and control cultures was obtained from this linear model (Script EA 4). Volcano plots were generated (Script EA 12) using the ggplots2 package [11].

Dataset comparisons

E. coli O157:H7 Sakai sequences, containing gene identifiers, were downloaded from the NCBI database (Sakai: NC_002695.1, pOSAK1: NC_002127.1, pO157: NC_002128.1) and converted into the appropriate fasta format using the Sequence Format Converter. A Python (v2.7.6) script [12] was written to convert fasta identifiers into a suitable format for the microarray against the RNA-Seq comparison (Script EA 5). The re-formatted fasta file was then used to construct a searchable database by using Blast (v2.2.17) [13] tool blastdb (Script EA 3E). The Blastall tool was used to match the microarray probe sequences to the above described database (Script EA 3F), resulting in a list of microarray to ECs number matches. The list of matches was filtered to extract only matches in which the sequence length was higher than 50 and mismatch was lower or equal to 7 (Script EA 6). Raw data and Log-transformed fold-change (logFC) values of normalised microarray and RNA-Seq data were compared and plotted using R studio (Script EA 12); (Fig. 2).
Subject areaBiology
More specific subject areaBioinformatics; Microbiology
Type of dataTable, graph, figure
How data were acquiredHigh-throughput RNA-sequencing; Microarray
Data formatFiltered and analysed with statistical tests
Experimental factorsE. coli O157:H7 was grown in minimal medium at 18 °C to mid-log phase and transferred to medium containing 40% (v/v) spinach leaf lysate for 1 h
Experimental featuresTotal RNA was extracted using commercial kits and a cDNA library generated with enterobacteria-specific primers and hybridized to a microarray (E. coli v2 array – Agilent), or rRNA was depleted and paired-end cDNA libraries generated for sequence on an Illumina Hi-Seq. 2000. A series of statistical analyses was used for comparison between the datasets.
Data source locationJames Hutton Institute, Dundee, DD2 5DA, UK.
Data accessibilityData are with this article and also available at ArrayExpress:
E-MTAB-3249 (microarray) https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3249/
E-MTAB-4120 (microarray) https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4120/
E-MTAB-7441 (RNAseq) https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7441
Scripts used for data analysis are available on GitHub: https://github.com/TheMicroGirl/SakaiRNASeq
Related research articleL. Crozier, P. Hedley, J. Morris, C. Wagstaff, S.C. Andrews, I. Toth, R.W. Jackson, N. Holden, Whole-transcriptome analysis of verocytotoxigenic Escherichia coli O157:H7 (Sakai) suggests plant-species-specific metabolic responses on exposure to spinach and lettuce extracts, Front Microbiol, 7, 2016. doi: 10.3389/fmicb.2016.01088[1]
  1 in total

1.  Dataset of Escherichia coli O157: H7 genes enriched in adherence to spinach root tissue.

Authors:  Ashleigh Holmes; Leighton Pritchard; Peter Hedley; Jenny Morris; Sean P McAteer; David L Gally; Nicola J Holden
Journal:  Data Brief       Date:  2020-05-28
  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.