Literature DB >> 30942868

Digital expression explorer 2: a repository of uniformly processed RNA sequencing data.

Mark Ziemann^1,2, Antony Kaspi², Assam El-Osta^2,3.

Abstract

BACKGROUND: RNA sequencing (RNA-seq) is an indispensable tool in the study of gene regulation. While the technology has brought with it better transcript coverage and quantification, there remain considerable barriers to entry for the computational biologist to analyse large data sets. There is a real need for a repository of uniformly processed RNA-seq data that is easy to use.
FINDINGS: To address these obstacles, we developed Digital Expression Explorer 2 (DEE2), a web-based repository of RNA-seq data in the form of gene-level and transcript-level expression counts. DEE2 contains >5.3 trillion assigned reads from 580,000 RNA-seq data sets including species Escherichia coli, yeast, Arabidopsis, worm, fruit fly, zebrafish, rat, mouse, and human. Base-space sequence data downloaded from the National Center for Biotechnology Information Sequence Read Archive underwent quality control prior to transcriptome and genome mapping using open-source tools. Uniform data processing methods ensure consistency across experiments, facilitating fast and reproducible meta-analyses.
CONCLUSIONS: The web interface allows users to quickly identify data sets of interest using accession number and keyword searches. The data can also be accessed programmatically using a specifically designed R package. We demonstrate that DEE2 data are compatible with statistical packages such as edgeR or DESeq. Bulk data are also available for download. DEE2 can be found at http://dee2.io.

Entities: CellLine Chemical Disease Gene Species

Keywords: RNA-seq; data reuse; gene expression; transcriptome

Mesh：

Year: 2019 PMID： 30942868 PMCID： PMC6446219 DOI： 10.1093/gigascience/giz022

Source DB: PubMed Journal: Gigascience ISSN： 2047-217X Impact factor: 6.524

Background

Since its first description 10 years ago, RNA sequencing (RNA-seq) has become a powerful method in transcriptomics, allowing highly accurate gene expression quantification [1]. As the cost of sequencing decreases, RNA-seq data are becoming more ubiquitous in the scientific literature. It is standard practice in the field and a compulsory requirement for journals to deposit these data to Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) [2,3] in the form of raw and processed files, with the aim of fostering greater reuse and transparency. In practice, however, there are several hurdles that impede widespread reuse by biologists. First, processing raw sequence data from SRA requires significant computational resources and command-line expertise. Second, the processed RNA-seq data hosted by GEO are prepared in assorted formats that utilize various software tools and genome annotation sets, which complicates meta-analyses. Despite the value of these data to the scientific community and tremendous cost to generate them, RNA-seq data aggregation efforts have been largely limited to human and mouse [4,5] or are closed source/subscription services [6]. BgeeDB provides array and sequencing-based expression data on many animal species with a particular focus on high-quality measurements of baseline samples at different life stages (excluding disease, treatments, or genetic perturbations) [7]. Expression Atlas is one of the most comprehensive repositories of processed expression microarray data with an informative graphical interface, but only a comparatively small number of RNA-seq data sets are currently included [8]. In an effort to boost reuse of public transcriptome data, we developed Digital Expression Explorer 2 (DEE2), an open-access web-based repository of uniformly processed RNA-seq digital gene-level and transcript-level expression data for several major organisms that is compatible with many types of downstream analyses.

Data Processing

DEE2 consists of 3 parts: (i) a pipeline that downloads and process raw data sets from SRA; (ii) a data repository where processed files are collected, filtered, and organized/stored and job queues are generated; and (iii) a web server where users can search metadata and obtain data sets of interest. A schematic diagram of the organization of DEE2 is provided in Fig. 1. Data processing nodes request SRA run accession numbers from the web server and obtain raw data from SRA. Processed data are sent to the web server, validated, and relayed to the DEE2 repository server. The repository server performs further validation checks, incorporates new data sets into the repository, collects corresponding metadata from SRAdbV2 [9], and queues outstanding jobs. The repository server then sends updated metadata and job queue information to the web server. End users obtain data from the web browser, command line, or bulk dumps.

Figure 1:

Overview of RNA-seq data processing, storage, and provision.

Pipeline Features

The DEE2 pipeline uses containerization to enable rapid application deployment and guarantees analytical reproducibility across different computer systems. End users can run the Docker image [10] on their own hardware to process SRA data sets of interest as specified with a species name and SRA run accession. After completion of the processing, users will have immediate access to the outputs, and after validation by the DEE2 repository server, the data sets will be available publicly. In this way, power users obtain benefit by using an established analysis pipeline and simultaneously contribute to expanding the public resource. One concern with Docker images is that they cannot be run without administrator "root" permissions, e.g., by users of a shared high-performance computing system. To address this limitation, the image can be converted for use by Singularity [11] or UDocker [12] without root permissions. The steps involved in data processing are summarized in Fig. 2. The pipeline fetches the appropriate reference genome, annotation, and complementary DNA sequence data from Ensembl (August 2017 version) [13]. Transcriptome sequencing data sets are downloaded from SRA using Aspera. The pipeline handles both single-end (SE) and paired-end (PE) sequencing data with the exclusion of colorspace sequence data. A sample of 4,000 reads is used to perform basic checks including read and quality string format using FastQC [14] prior to extraction of fastq files with a parallel implementation of fastq-dump [15]. Skewer [16] is used to trim bases with phred quality <10 on the 3′ ends and discards reads shorter than 18 nucleotides. Adapter sequences at the 3′ end are detected using Minion, part of the Kraken package [17]. Adapter sequences are clipped using Skewer if the predicted adapter sequence is not present in the genome and exceeds a frequency of 2.5%. To handle nonreference 5′ bases including unique molecular identifiers, a sample of 10,000 reads undergo progressive clipping of 5′ ends (4, 8, 12, 20 nucleotides) followed by genomic mapping with STAR to determine the optimal number of bases to clip from the 5′ end, as determined by the proportion of uniquely mapped reads. STAR [18] is then used to map all reads that pass quality control (QC) to the genome and generate gene-wise expression counts with the "–quantMode GeneCounts" (no alignment files are generated). STAR output is also used to diagnose whether the data set is strand specific. To classify a data set as strand specific, there needs to be a 5:1 strand bias in assigned reads according to STAR. This option is passed to Kallisto, which maps reads to the transcriptome to generate estimated transcript counts [19]. Gene and transcript counts, along with analysis logs and QC metrics, are zipped and transferred to the web server by sftp. The pipeline has the added ability to process users' own fastq files using the same pipeline, although the results remain private. The pipeline code is open source and available online [20]. Software versions and parameters used in the pipeline are provided in Supplementary Table S1.

Figure 2:

Overview of steps in the RNA-seq data processing pipeline.

Data Provided

Currently DEE2 hosts data from 9 organisms selected because they are important model organisms and have a large number of corresponding transcriptome data sets in SRA. Currently, there are >580,000 RNA-seq data sets available, with each data set corresponding to a specific SRA run number. Together the 9 species included constitute 73.5% of all transcriptome data sets available from SRA. DEE2 consists of >5.3 trillion assigned sequence reads (Table 1). The data provided include gene-wise expression counts, transcript-wise estimated counts, gene information, transcript information, summary metadata, full metadata, and QC metrics, provided as 7 separate tables in tsv format. The gene information table contains the gene accession number, corresponding gene symbol, and gene length as calculated by GTFtools v0.6.5 [21]. The transcript information file contains transcript-parent gene relationships, gene symbol, and transcript length. Gene and transcript length information will allow straightforward normalization of expression by contig length. The full metadata table contains all corresponding metadata from SRAdbV2, while the summary metadata contains only corresponding SRA accession numbers and experiment title. Moreover, analysis logs for each data set are provided. Classification of data sets by QC metrics is discussed below.

Table 1:

Hosted gene expression data as of 11 January 2019

Species	Projects	Experiments	Runs	QC classification pass/warn/fail	Assigned reads (STAR)	Assigned reads (Kallisto)
Arabidopsis thaliana	986	17,095	26,061	5,602/15,122/5,337	2.87e+11	2.92e+11
Caenorhabditis elegans	339	5,759	7,722	1,647/2,446/3,629	8.71e+10	7.88e+10
Drosophila melanogaster	678	14,401	18,713	4,410/7,471/6,832	1.75e+11	1.87e+11
Danio rerio	457	26,246	28,100	1,084/5,826/21,190	1.11e+11	6.20e+10
Escherichia coli	180	1,488	1,638	355/376/907	1.26e+10	9.40e+09
Homo sapiens	6,768	197,836	229,634	42,225/77,254/110,155	2.27e+12	2.51e+12
Mus musculus	7,078	204,850	252,058	23,840/85,874/142,344	1.84e+12	2.08e+12
Rattus norvegicus	349	4,965	5,799	426/2,651/2,900	5.95e+10	6.42e+10
Saccharomyces cerevisiae	442	10,239	11,369	3,025/2,783/5,561	7.41e+10	7.32e+10
Total	17,277	482,879	581,094	82,614/199,803/298,855	4.92e+12	5.35e+12

Hosted gene expression data as of 11 January 2019

Quality Control Metrics

QC is paramount for a resource such as this. A range of quality metrics are accessible and can be viewed on the search results page, which includes mean base quality scores, number of reads, alignment rates, and read assignment statistics. Detailed analysis logs are distributed alongside expression data. Summary statistics for human data sets are provided in Fig. 3. There are roughly equal numbers of runs with SE and PE sequencing (Fig. 3A). Almost all data sets are encoded in Illumina 1.9 format (also known as Sanger), and a small number of data sets with Illumina 1.5 quality encoding (Fig. 3B). The median read length is 75 base pairs ( bp) and mode is 50 bp (Fig. 3C). Most (71.3%) data sets had ≥95% of reads pass QC filtering (Fig. 3D). The median number of reads passing QC filtering was 4.6 million (Fig. 3E). The median proportion of STAR unique mapping was 82% (Fig. 3F). Median assignment proportion recorded a median of 59.4% (Fig. 3G). The majority of data sets were classified as unstranded (67.3%), with smaller numbers of runs biased towards each strand (Fig. 3H). The median proportion of reads mapped with Kallisto was 63.2% (Fig. 3I).

Figure 3:

Summary QC metrics for human data sets. (A) sequence format. (B) Base quality encoding, Illumina version 1.9 and 1.5. (C) Read length histogram. (D) Proportion of reads that pass QC filtering. (E) Number of QC passed reads per run. (F) Proportion of STAR uniquely mapped reads. (G) Proportion of reads assigned to genes. (H) Classification of reads by strandedness. (I) Proportion of reads mapped with Kallisto. Data accessed 20 December 2018. Although there are no definitive thresholds for what constitutes a valid RNA-seq data set, there are 2 main principles: (i) the digital nature of RNA-seq means that data sets with more reads will provide more accurate quantification, and (ii) data sets with a large proportion of reads excluded from downstream analysis will be less representative of the original sample, and suggest issues with sample quality, library preparation, or sequencing instrumentation. Using these principles, we have classified the data sets as “pass,” “warn,” and “fail” according to heuristics outlined in Table 2. Each rule has a numeric code, and this is provided in the search results. Because sequencing depth recommendations are larger for more complex organisms, the metrics describing integer counts are proportional to transcriptome complexity (the number of protein-coding genes as defined by Ensembl).

Table 2:

Criteria for data set quality classification

Metric	Meaning	Fail threshold	Warn threshold	Code
NumReadsQcPass	No. reads passed QC filtering	<50 reads per gene^a	<500 reads per gene^a	1
QcPassRate	Proportion of reads passed QC filtering	<60%	<80%	2
STAR_UniqMapRate	Proportion of reads mapped uniquely to the reference genome using STAR	<50%	<70%	3
STAR_AssignRate	Proportion of reads assigned to genes with STAR	<40%	<60%	4
STAR_AssignedReads	No. reads assigned to genes with STAR	<50 reads per gene^a	<500 reads per gene^a	5
Kallisto_MapRate	Proportion of reads assigned to transcripts with Kallisto	<40%	<60%	6
Kallisto_MappedReads	No. reads assigned to transcripts with Kallisto	<50 reads per gene^a	<500 reads per gene^a	7
DatasetCorrel	Pearson correlation coefficient to passed data average	–	<0.5	8

Number of protein-coding genes was obtained from Ensembl and used as an estimator of transcriptome complexity.

Criteria for data set quality classification Number of protein-coding genes was obtained from Ensembl and used as an estimator of transcriptome complexity. Furthermore, if a data set profile is substantially different to the bulk of “pass” data sets, this may be useful information for end users. To quantify this, an average gene expression profile (STAR) of “pass” data sets is calculated and each data set is compared by Pearson correlation (Methods). If the correlation coefficient is <0.5, then the data set is flagged as “warn.” As new data sets are periodically added, these correlation values may vary slightly over time. To understand why some data sets have low correlation to the bulk of "pass" data sets, we undertook an unsupervised clustering analysis of correlation in 5,808 S. cerevisiae data sets. While most Spearman correlation coefficients were >0.7, there was a small fraction <0.5 (Fig. 4A). Most data sets (5,571) were classified into 2 large clusters (Fig. 4B; blue and light blue). The remaining 236 data sets belonged to several smaller clusters. These smaller clusters mostly contained data sets derived from nonstandard RNA-seq library construction protocols such as 3′ end RNA-seq (ERP004367, SRP048715, SRP048715, SRP021938), Ribo-Seq (SRP075766, SRP082147), and RNA-IP-Seq (SRP032276). One of the smaller clusters contained data sets of cells undergoing sporulation and meiosis (e.g., SRP092588, SRP061166, SRP032309). From this analysis, we can conclude that highly correlated data sets are standard RNA-seq/mRNA-seq and data sets with low correlation are mostly due to the use of nonstandard library construction protocols, but also some data sets derived from less characterized biological states (e.g., meiosis/sporulation in the case of S. cerevisiae).

Figure 4:

Unsupervised clustering analysis of the correlation of 5,807 S. cerevisiae data sets. (A) Colour key and histogram of Spearman correlation coefficients. (B) Heat map of pairwise correlation values with data sets clustered by similarity. Red indicates high correlation and blue indicates low correlation.

Pipeline Validation

To demonstrate the accuracy of the pipeline, we performed a simulation study. Synthetic Illumina HiSeq RNA-seq data were generated from Ensembl transcripts and processed with the pipeline (see Methods). The reads per million (RPM) values were compared between the ground truth and DEE2-processed data, and Spearman correlation coefficients (ρ) were calculated (Fig. 5; Supplementary Table S2). We observed that analyses of simpler organisms were, in general, more accurate than for more complex transcriptomes of human and mouse. Overall there was only a small improvement in accuracy in PE over SE reads. Transcript quantification results from Kallisto were less accurate than gene level quantification with STAR. On the other hand, Kallisto transcript counts collapsed into their parent gene were substantially more accurate than STAR gene counts (Fig. 5; Supplementary Table S2), consistent with previous a previous report [22].

Figure 5:

Comparison of ground truth and DEE2 pipeline inferred expression profiles. Human SE 100 bp RNA-seq reads simulated with ART [24] underwent mapping with the DEE2 pipeline, generating gene-level and transcript-level expression counts. Inferred expression count values were normalized for library size and plotted against the corresponding ground truth values. Dashed lines show the 2 and –2 fold expression differences. STAR gene counts were generated with the –quantMode GeneCounts feature. Kallisto-estimated counts were used to quantify transcripts. Kallisto gene counts were calculated by aggregating (sum) estimated transcript counts to their parent gene. In a separate validation exercise, we compared author-supplied expression count data present in GEO with corresponding DEE2-STAR counts, and quantified the similarity at the level of individual runs as well as across contrasts (see Methods for details). At the level of individual runs, there was a tight correlation between DEE2-derived and author-supplied RPM values, with Spearman coefficients in the range of 0.95–0.99 (Fig. 6A). After differential expression analysis with edgeR [23], genes were ranked by significance. Author-derived differential expression results were then compared with DEE2-derived differential expression results, enabling us to generate a single Spearman correlation coefficient for each contrast. Using this approach, the correlation in differential expression results between DEE2-STAR and author-supplied counts ranged between 0.55 and 0.95, with a median of 0.81 (Fig. 6B). Differential expression correlation was higher in comparisons with more replicates (ρ = 0.757, n = 9, P = 0.018). Both exercises support the validity of DEE2 data.

Figure 6:

Comparison of DEE2 data with author-uploaded gene-level count data. (A) Sample level RPM correlation between author-provided and DEE2-processed expression profiles. (B) Correlation of differential expression results. Ath: A. thaliana; cel: C. elegans; dme: D. melanogaster; dre: D. rerio; eco: E. coli, hsa: H. sapiens; mmu: M. musculus; rno: R. norvegicus; sce: S. cerevisiae.

A Brief Meta-analysis of Yeast Gene Expression

To demonstrate the utility of DEE2 data we undertook an exploratory analysis of gene expression in S. cerevisiae. We correlated the expression of all genes in 5,808 data sets in DEE2 and performed unsupervised hierarchical clustering. This resulted 7,126 genes being classified into 10 clusters (Fig. 7A). The largest cluster consisted of 3,634 genes (light blue), and the remaining clusters contained between 634 and 175 genes. Gene ontology analysis was performed to detect overrepresented biological pathways in each cluster (Fig. 7B). Interestingly, each cluster was involved in different biochemical specializations. For example, the dark green cluster was overrepresented in genes involved in translation, while the nearest neighbour, light purple, was overrepresented in amino acid metabolism. Similarly the light orange cluster was enriched for genes involved in mitochondrial function and the nearest neighbour, pink, was involved in adenosine triphosphate metabolism. These findings illustrate one way in which DEE2 facilitates meta-analysis of gene expression.

Figure 7:

Unsupervised hierarchical clustering of gene expression in S. cerevisiae. (A) Clustering and correlation heat map. Red indicates high correlation and blue indicates low correlation. (B) Biological process gene ontology enrichments of each cluster. Only the top 3 biological processes are shown (as ranked by significance).

Reuse Potential

The financial cost of generating these raw data sets is substantial. A rough estimate of the cost to generate raw data included in DEE2 is ∼$162 million US. In contrast, the estimated cost to process these data sets on Amazon EC2 infrastructure is estimated at just $97,000 but could be reduced to ∼$24,000 using off-peak resources. Therefore, data aggregation efforts like DEE2 can, with a modest budget, add substantial value to these existing data by enabling straightforward reuse. Another benefit of aggregation is that genome annotations are updated over time as compared to author-submitted data that remain static. To enhance the reuse potential, we have designed a simple and easy-to-use website to access the data. Users select 1 of the 9 species featured and provide either keywords or accession numbers to identify data sets of interest (Fig. 8A). The web interface provides data sets in batches of ≤500 runs. When 501–5000 matches are obtained, users can download the corresponding metadata and are given options to access expression data (see below). The search results page contains corresponding SRA accession numbers, experiment title, and keyword context if a keyword was used (Fig. 8B). The results page provides links to QC information so that users can be assured of data set quality. If ≤500 matches are found, the QC information can be seen simply by hovering the mouse over the QC summary field (Fig. 8B). Users then tick the box of every data set they would like to download, and by hitting the "Get Counts" button, the data sets are downloaded. The searching and retrieval steps for the example depicted in Fig. 8 took 13 seconds. Fig. 8C demonstrates how data are delivered to end users: as a zip archive containing tab-separated expression count, contig information, metadata, and QC information. The web server is limited to fetching 500 data sets at a time. To enable easy access to large data sets we provide zip "bundles" for each project with ≥200 runs that have been fully processed by DEE2; there are 425 such bundles as of 11 January 2019 [27].

Figure 8:

Example of the DEE2 web interface. (A) Users select the species of interest and supply either accession numbers or keywords to search. (B) In this case, a search for "fatty acid" for S. cerevisiae yields 20 hits. Mouse pointer hovering over the QC summary field shows the key quality criteria of the processed data sets. (C) Users can select the data sets to download by ticking the boxes and hitting the "Get counts" button at the bottom of the page. The downloaded zip file (C left side) contains detailed processing logs of each run, along with a matrix of gene-wise counts, transcript-wise counts, metadata information, QC information, and information on transcripts and genes. The gene-wise count matrix and QC information are shown in C, middle and right side, respectively. Because R is the main language for downstream statistical analysis of RNA-seq data, we also provide an R package getDEE2 to obtain DEE2 data. In the example shown in Box 1, data sets belonging to an experiment with GEO series GSE33569 are obtained. Transcript-wise counts can be aggregated to gene-level counts with a single command (Tx2Gene).

Box 1:

An example of obtaining gene and transcript expression data sets using the R functions (GEO series: GSE33569). The Tx2Gene function is used to aggregate (sum) transcript counts to gene-level counts.

An example of obtaining gene and transcript expression data sets using the R functions (GEO series: GSE33569). The Tx2Gene function is used to aggregate (sum) transcript counts to gene-level counts. For power users, bulk data dumps are available and will be of use to researchers wishing to do wholesale meta-analyses, as 3 studies already have [28-30]. Irrespective of the method of acquisition, DEE2 data are compatible with many different downstream applications including R/Bioconductor [31,32], Degust [33], and Galaxy [34].

Conclusion

DEE2 provides a unique framework and user-friendly resource of processed RNA-seq data that alleviates many of the bottlenecks researchers currently face with analysis of public RNA-seq data. Our testing shows that DEE2 and Degust enable analysis of public RNA-seq data on mobile devices such as smartphones. Bulk data provided by DEE2 are a useful starting point for researchers performing meta-analyses of RNA-seq data.

Methods

Analysis of correlation

Correlation of a data set to the bulk of other high-quality data sets could be a useful metric to filter by when performing meta-analyses. To make this a tractable calculation, first a mean gene expression profile is generated using STAR gene expression data from ≤10,000 randomly selected data sets that pass all other QC checks. Next, each data set is compared to the average pass profile and a Pearson correlation coefficient is calculated. As new data sets will be added to DEE2 regularly, the average pass profile will vary slightly over time and as such, the correlation coefficient is calculated to 2 significant figures only. In the analysis of data set correlation (Fig. 4), the Spearman correlation of all 5,808 S. cerevisiae data sets classified as "pass" and "warn" was calculated in a pairwise fashion (data as of 3 January 2019). Clustering was performed using the hclust function in R with the complete linkage method, followed by tree cutting at 0.375 of the maximum branch length. The heatmap.2 package was used to generate the heat map. In the analysis of gene expression correlation (Fig. 7A), the S. cerevisiae data were transposed prior to correlation analysis, then tree cutting was performed at 0.17 of the maximum branch length. The clusters obtained were subjected to gene ontology analysis at the level of biological pathways using the enrichGO tool in the clusterProfiler package version 3.8.1 (clusterProfiler, RRID:SCR_016884) [35]. Only the top 3 pathways for each cluster were plotted.

Pipeline validation using simulated data

To validate the accuracy of the DEE2 pipeline, we generated Illumina HiSeq2500-like sequence reads from Ensembl complementary DNA sequences using ART (v2016-06-05) [24] with a defined seed (1,540,165,885) and a uniform fold coverage of 2. Read lengths were 50 and 100 bp in SE and PE format, respectively. The read sets were processed with the DEE2 pipeline and the observed expression data were compared with the ground truth, using Spearman correlation of library size normalized profiles in RPM as an indicator of accuracy. For Kallisto-based analysis, estimated transcript counts "est_counts" were used. Transcript-estimated counts were totalled for each parent gene to generate gene-wise expression counts. These analyses were performed for all 9 organisms currently included in DEE2.

Pipeline validation using public data

Another way to validate the accuracy of DEE2 data is to compare with author-submitted results available on GEO. We searched for studies that reported expression data as raw counts with official gene names or Ensembl accession numbers, ≥2 replicates, and acceptable read depth and genome mapping rate. Author-supplied counts were obtained from GEO for the data sets listed in Table 3 [36-44]. Spearman correlation of RPM values was used to quantify the similarity of DEE2 and author-supplied data at the level of individual runs. To determine the similarity in differential expression results, the same edgeR v3.22.3 [23] analysis was performed in parallel on author-supplied counts and DEE2 counts. The runs defined as control and case for each experiment are listed in Table 3. To rank genes by significance in differential expression, the sign of the fold change was multiplied by the negative log2 P-value. Spearman correlation analysis was used to quantify the similarity in differential expression results using these 2 data sources.

Table 3:

Details of author-supplied processed data used to compare to DEE2 gene expression counts

Species, GEO series	Contrast (control/case)	Spots	Author pipeline
A. thaliana, GSE53078 [36]	GSM1281703	15,143,653	Genome: TAIR10
	GSM1281704	12,498,123	Annotation version: Unknown
	GSM1281705	22,721,359	Mapper: TopHat
	GSM1281706	17,255,612	Counter: HTSeq
C. elegans, GSE46344 [37]	GSM1128862	30,650,959	Genome: WS220/ce10
	GSM1128863	47,245,721	Annotation: Ensembl v66
	GSM1128864	54,573,311	Mapper: TopHat
	GSM1128868	49,315,179	Counter: HTSeq
	GSM1128869	56,295,663
	GSM1128870	68,641,842
D. melanogaster, GSE43180 [38]	GSM1057982	24,902,977	Genome: dm3
	GSM1057983	36,434,276	Annotation: Ensembl v64
	GSM1057984	32,591,508	Mapper: Tophat
	GSM1057985	35,375,654	Counter: HTSeq
D. rerio, GSE80768 [39]	GSM2136810	19,404,674	Genome: Zv10
	GSM2136811	22,820,115	Annotation: Ensembl (version unknown)
	GSM2136812	25,181,184	Mapper: USeq and Novoalign
	GSM2136813	21,487,514	Counter: USeq
	GSM2136814	24,831,643
	GSM2136815	22,664,352
	GSM2136816	22,629,782
	GSM2136817	21,842,104
	GSM2136818	20,601,291
	GSM2136819	18,183,746
	GSM2136820	21,007,467
	GSM2136821	20,992,396
	GSM2136822	24,708,106
	GSM2136823	21,105,462
	GSM2136824	28,069,482
E. coli, GSE80251 [40]	GSM2122743	5,221,858	Genome: E. coliK12 MG1655
	GSM2122744	6,503,454	Annotation: GenBank NC_000913.3
	GSM2122745	6,209,263	Mapper: TMAP (map4)
	GSM2122746	6,391,549	Counter: Bedtools
	GSM2122747	6,197,872
	GSM2122748	5,090,669
H. sapiens, GSE63776 [41]	GSM1556982	30,007,994	Genome: hg19
	GSM1556983	27,252,897	Annotation: UCSC (version unknown)
	GSM1556984	42,212,497	Mapper: Bowtie2 (after adapter clipping)
	GSM1556985	31,456,271	Counter: HTSeq
	GSM1556986	31,569,339
	GSM1556987	37,477,777
M. musculus, GSE59970 [42]	GSM1462883	32,015,112	Genome: GRCm38.70/mm10
	GSM1462884	30,997,187	Annotation: Ensembl v70
	GSM1462885	32,612,584	Mapper: Olego
	GSM1462886	31,485,760	Counter: BedTools
	GSM1462887	30,207,461
	GSM1462888	31,028,501
R. norvegicus, GSE65715 [43]	GSM1604049	42,296,446	Genome: rn4
	GSM1604050	34,887,323	Annotation: Ensembl (version unknown)
	GSM1604051	42,725,865	Mapper: Tophat2
	GSM1604052	28,210,194	Counter: HTSeq
	GSM1604053	30,748,641
	GSM1604054	28,450,626
S. cerevisiae, GSE76444 [44]	GSM2809655	35,869,614	Genome: EF 4
	GSM2809656	37,425,737	Annotation: Ensembl v72
	GSM2809657	39,227,797	Mapper: Bowtie
	GSM2809658	33,974,055	Counter: HTSeq
	GSM2809659	33,339,067
	GSM2809660	37,546,069

Details of author-supplied processed data used to compare to DEE2 gene expression counts

Availability of source code and requirements

Project name: Digital Expression Explorer 2 Project home page: http://dee2.io Operating systems (data set): Platform independent Operating systems (pipeline): Unix and MacOS License: GNU GPL v3 Any restrictions on use by nonacademics: none

Availability of supporting data and materials

Data set access: http://dee2.io (RRID:SCR_016929) https://dee2.io/bulk.htmlBulk data access: Source code: https://github.com/markziemann/dee2 (RRID:SCR_016930) Pipeline Docker image: https://hub.docker.com/r/mziemann/tallyup/(RRID:SCR_016931) A snapshot of the latest update of the bulk data presented in this article is available in the GigaScience GigaDB repository [45].

Additional files

Supplementary Table S1. Software versions and parameters used in the pipeline. Supplementary Table S2. Spearman correlation coefficients (ρ) between ground truth and DEE2 processed expression profiles (RPM) from simulated data.

Abbreviations

bp: base pairs; DEE2: Digital Expression Explorer 2; GEO: Gene Expression Omnibus; PE: paired end; QC: quality control; RNA-seq: RNA sequencing; RPM: reads per million; SE: single end; SRA: Sequence Read Archive; SRAdbV2: An R Package to Query the Sequence Read Archive.

Competing interests

The authors declare that they have no competing interests.

Funding

A.E.-.O is a Senior Research Fellow supported by National Health and Medical Research Council (APP1154650). A.E.-O. receives funding from the National Health and Medical Research Council—Natural Science Foundation of China (NHMRC-NSFC International Joint Call APP1113188). A.E.-O receives funding from the National Health and Medical Research Council—European Union Collaborative Research Grants Scheme (APP1075563).

Author contributions

M.Z. and A.E.-O. conceived and designed the study. M.Z. and A.K. wrote the computer code. M.Z. coordinated data processing and drafted the manuscript. All authors read, revised, and approved the final manuscript. Click here for additional data file. Click here for additional data file. Click here for additional data file. 12/4/2018 Reviewed Click here for additional data file. 2/5/2019 Reviewed Click here for additional data file. 12/10/2018 Reviewed Click here for additional data file. 2/4/2019 Reviewed Click here for additional data file. Click here for additional data file.

30 in total

Review 1. Orchestrating high-throughput genomic analysis with Bioconductor.

Authors: Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal: Nat Methods Date: 2015-02 Impact factor: 28.547

2. STAR: ultrafast universal RNA-seq aligner.

Authors: Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal: Bioinformatics Date: 2012-10-25 Impact factor: 6.937

3. Reproducible RNA-seq analysis using recount2.

Authors: Leonardo Collado-Torres; Abhinav Nellore; Kai Kammers; Shannon E Ellis; Margaret A Taub; Kasper D Hansen; Andrew E Jaffe; Ben Langmead; Jeffrey T Leek
Journal: Nat Biotechnol Date: 2017-04-11 Impact factor: 54.908

4. NCBI GEO: archive for functional genomics data sets--update.

Authors: Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

5. The Sequence Read Archive: explosive growth of sequencing data.

Authors: Yuichi Kodama; Martin Shumway; Rasko Leinonen
Journal: Nucleic Acids Res Date: 2011-10-18 Impact factor: 16.971

6. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.

Authors: Charlotte Soneson; Michael I Love; Mark D Robinson
Journal: F1000Res Date: 2015-12-30

7. Context-dependent deposition and regulation of mRNAs in P-bodies.

Authors: Congwei Wang; Fabian Schmich; Sumana Srivatsa; Julie Weidner; Niko Beerenwinkel; Anne Spang
Journal: Elife Date: 2018-01-03 Impact factor: 8.140

8. Promoter architecture determines cotranslational regulation of mRNA.

Authors: Lorena Espinar; Miquel Àngel Schikora Tamarit; Júlia Domingo; Lucas B Carey
Journal: Genome Res Date: 2018-03-22 Impact factor: 9.043

9. Kraken: a set of tools for quality control and analysis of high-throughput sequence data.

Authors: Matthew P A Davis; Stijn van Dongen; Cei Abreu-Goodger; Nenad Bartonicek; Anton J Enright
Journal: Methods Date: 2013-06-29 Impact factor: 3.608

10. The BET protein FSH functionally interacts with ASH1 to orchestrate global gene activity in Drosophila.

Authors: Tobias Kockmann; Moritz Gerstung; Tommy Schlumpf; Zhu Xhinzhou; Daniel Hess; Niko Beerenwinkel; Christian Beisel; Renato Paro
Journal: Genome Biol Date: 2013-02-25 Impact factor: 13.583

18 in total

1. Mesenchymal stromal cell exosomes prevent and revert experimental pulmonary fibrosis through modulation of monocyte phenotypes.

Authors: Nahal Mansouri; Gareth R Willis; Angeles Fernandez-Gonzalez; Monica Reis; Sina Nassiri; S Alex Mitsialis; Stella Kourembanas
Journal: JCI Insight Date: 2019-11-01

2. SARS-CoV-2 Infection Causes Dopaminergic Neuron Senescence.

Authors: Shuibing Chen; Yuling Han; Liuliu Yang; Tae Kim; Manoj Nair; Oliver Harschnitz; Pengfei Wang; Jiajun Zhu; So Yeon Koo; Xuming Tang; Lauretta Lacko; Vasuretha Chandar; Yaron Bram; Tuo Zhang; Wei Zhang; Feng He; James Caicedo; Yaoxing Huang; Todd Evans; Paul van der Valk; Maarten J Titulaer; Jochem K H Spoor; Robert L Furler; Peter Canoll; James Goldman; Serge Przedborski; Robert Schwartz; David Ho; Lorenz Studer
Journal: Res Sq Date: 2021-05-21

3. An integrated brain-specific network identifies genes associated with neuropathologic and clinical traits of Alzheimer's disease.

Authors: Cui-Xiang Lin; Hong-Dong Li; Chao Deng; Weisheng Liu; Shannon Erhardt; Fang-Xiang Wu; Xing-Ming Zhao; Yuanfang Guan; Jun Wang; Daifeng Wang; Bin Hu; Jianxin Wang
Journal: Brief Bioinform Date: 2022-01-17 Impact factor: 11.622

4. mitch: multi-contrast pathway enrichment for multi-omics and single-cell profiling data.

Authors: Antony Kaspi; Mark Ziemann
Journal: BMC Genomics Date: 2020-06-29 Impact factor: 3.969

5. A Human Pluripotent Stem Cell-based Platform to Study SARS-CoV-2 Tropism and Model Virus Infection in Human Cells and Organoids.

Authors: Liuliu Yang; Yuling Han; Benjamin E Nilsson-Payant; Vikas Gupta; Pengfei Wang; Xiaohua Duan; Xuming Tang; Jiajun Zhu; Zeping Zhao; Fabrice Jaffré; Tuo Zhang; Tae Wan Kim; Oliver Harschnitz; David Redmond; Sean Houghton; Chengyang Liu; Ali Naji; Gabriele Ciceri; Sudha Guttikonda; Yaron Bram; Duc-Huy T Nguyen; Michele Cioffi; Vasuretha Chandar; Daisy A Hoagland; Yaoxing Huang; Jenny Xiang; Hui Wang; David Lyden; Alain Borczuk; Huanhuan Joyce Chen; Lorenz Studer; Fong Cheng Pan; David D Ho; Benjamin R tenOever; Todd Evans; Robert E Schwartz; Shuibing Chen
Journal: Cell Stem Cell Date: 2020-06-19 Impact factor: 24.633

6. Pluripotent stem cell-derived epithelium misidentified as brain microvascular endothelium requires ETS factors to acquire vascular fate.

Authors: Tyler M Lu; Sean Houghton; Tarig Magdeldin; José Gabriel Barcia Durán; Andrew P Minotti; Amanda Snead; Andrew Sproul; Duc-Huy T Nguyen; Jenny Xiang; Howard A Fine; Zev Rosenwaks; Lorenz Studer; Shahin Rafii; Dritan Agalliu; David Redmond; Raphaël Lis
Journal: Proc Natl Acad Sci U S A Date: 2021-02-23 Impact factor: 11.205

7. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure.

Authors: Jan Zrimec; Christoph S Börlin; Filip Buric; Azam Sheikh Muhammad; Rhongzen Chen; Verena Siewers; Vilhelm Verendel; Jens Nielsen; Mats Töpel; Aleksej Zelezniak
Journal: Nat Commun Date: 2020-12-01 Impact factor: 14.919

8. Performance of Regression Models as a Function of Experiment Noise.

Authors: Gang Li; Jan Zrimec; Boyang Ji; Jun Geng; Johan Larsbrink; Aleksej Zelezniak; Jens Nielsen; Martin Km Engqvist
Journal: Bioinform Biol Insights Date: 2021-06-27

9. Iron Hack - A symposium/hackathon focused on porphyrias, Friedreich's ataxia, and other rare iron-related diseases.

Authors: Gloria C Ferreira; Jenna Oberstaller; Renée Fonseca; Thomas E Keller; Swamy Rakesh Adapa; Justin Gibbons; Chengqi Wang; Xiaoming Liu; Chang Li; Minh Pham; Guy W Dayhoff Ii; Ben Busby; Rays H Y Jiang; Linh M Duong; Luis Tañón Reyes; Luciano Enrique Laratelli; Douglas Franz; Segun Fatumo; Atm Golam Bari; Audrey Freischel; Lindsey Fiedler; Omkar Dokur; Krishna Sharma; Deborah Cragun
Journal: F1000Res Date: 2019-07-19

10. Style transfer with variational autoencoders is a promising approach to RNA-Seq data harmonization and analysis.

Authors: Nikolai Russkikh; Denis Antonets; Dmitry Shtokalo; Alexander Makarov; Yuri Vyatkin; Alexey Zakharov; Evgeny Terentyev
Journal: Bioinformatics Date: 2020-12-22 Impact factor: 6.937