Literature DB >> 36035514

A toolkit for enhanced reproducibility of RNASeq analysis for synthetic biologists.

Benjamin J Garcia¹, Joshua Urrutia², George Zheng³, Diveena Becker⁴, Carolyn Corbet⁴, Paul Maschhoff⁴, Alexander Cristofaro¹, Niall Gaffney², Matthew Vaughn², Uma Saxena¹, Yi-Pei Chen³, D Benjamin Gordon¹, Mohammed Eslami³.

Abstract

Sequencing technologies, in particular RNASeq, have become critical tools in the design, build, test and learn cycle of synthetic biology. They provide a better understanding of synthetic designs, and they help identify ways to improve and select designs. While these data are beneficial to design, their collection and analysis is a complex, multistep process that has implications on both discovery and reproducibility of experiments. Additionally, tool parameters, experimental metadata, normalization of data and standardization of file formats present challenges that are computationally intensive. This calls for high-throughput pipelines expressly designed to handle the combinatorial and longitudinal nature of synthetic biology. In this paper, we present a pipeline to maximize the analytical reproducibility of RNASeq for synthetic biologists. We also explore the impact of reproducibility on the validation of machine learning models. We present the design of a pipeline that combines traditional RNASeq data processing tools with structured metadata tracking to allow for the exploration of the combinatorial design in a high-throughput and reproducible manner. We then demonstrate utility via two different experiments: a control comparison experiment and a machine learning model experiment. The first experiment compares datasets collected from identical biological controls across multiple days for two different organisms. It shows that a reproducible experimental protocol for one organism does not guarantee reproducibility in another. The second experiment quantifies the differences in experimental runs from multiple perspectives. It shows that the lack of reproducibility from these different perspectives can place an upper bound on the validation of machine learning models trained on RNASeq data. Graphical Abstract.

Entities: Chemical

Keywords: automation; machine learning; standardization; transcriptomics

Year: 2022 PMID： 36035514 PMCID： PMC9408027 DOI： 10.1093/synbio/ysac012

Source DB: PubMed Journal: Synth Biol (Oxf) ISSN： 2397-7000

Introduction

Synthetic design is often an iterative process, where components are improved over time. However, despite the improvement of components, controls can often stay the same across multiple iterations (unless alterations are made to the base strain in such a way that the altered strain becomes the new control). As such, it is important to know whether or not changes are the result of design choices, experimental variability or other unintended perturbations due to inherent variability in the pipeline. Furthermore, researchers conducting experiments in synthetic biology often have combinatorial designs that test all conditions in replicates, resulting in sample sizes in the 1000s or greater. It is not always feasible to run all of these samples at the same time, so it is important to consider how experiments change across batches (i.e. differences in the execution of the pipeline, differences in experiment conditions, different technicians running the pipeline, dilutions, culturing, etc.). It then becomes important to ensure that variability is low and that controls are repeatable across different batches to ensure that experiments can be aggregated and compared. RNASeq has become a ubiquitous tool in the design, build, test and learn cycle for synthetic biology (1, 11, 14, 17, 18, 42). Applications include debugging design failures (13, 17, 33) and optimization of genetic designs (31). Transcriptional analysis provides an opportunity to measure chassis response holistically and identify differentially expressed genes (DEGs) (17), infer transcriptional networks or co-regulated genes (22, 25, 32, 43), identify read-through and unexpected expression in synthetic parts (13, 17, 33) and also infer the impacts on mechanistic pathways encompassing a multitude of biomolecular ‘entities’ (‘transcripts’, proteins and metabolites) (15, 19, 35, 38). Collection and processing of RNASeq data, however, is a complex, multistep process (9, 14). Processing pipelines are highly parameterized, computationally intensive and often not comparable across experiments without significant effort in normalization and harmonization of the output measurements (1, 11). Furthermore, these tools often require a significant amount of additional biological context to explain how and why the dataset was generated, which is often disconnected from the data itself (16, 36, 45). Collection and processing is one aspect of reproducibility, which can be more broadly delineated into three categories: (i) biological, (ii) experimental and (iii) analytical (21, 44). Biological variability results from different responses to similar stimuli, either as a mechanism for survival, due to responses not being strictly regulated or due to experimental conditions being variable enough to elicit different biological responses. Experimental variability is the result of imperfect tool precision, slight alterations in handling during cell growth, lysis, ribonucleicacid (RNA) extraction, library construction or data analysis. Experimental variability can also impact biological reproducibility. To minimize the experimental variability, consistency of conditions, reagents, instruments and protocols used to grow organisms, extract RNA and process the sample are required. Finally, analytical variability arises from either an inconsistency in the use or a lack of information about processing pipelines, tools and parameterization used to analyze the data and extract conclusions (7, 26, 28). While there are a number of consortiums that have provided guidance on methods to assess the reproducibility of RNASeq, a toolkit to enable that reproducibility for synthetic biology is seldom found (23, 47). The toolkit provided in this paper helps mitigate analytical variability by providing a clear and consistent set of methodologies that can be applied across different RNASeq experiments. Given the pipeline provides a clear set of reproducible analytical methods, we are then able to explore biological and experimental reproducibility that can then impact analytical results. Here, we seek to present the minimal set of information required from both the experimental conditions as well as the data processing pipelines to maximize the analytical reproducibility. We first present an overview of the software infrastructure required to make analyses reproducible across RNASeq experiments. The tool outputs a set of consistently formatted datasets for downstream analyses and can be used by researchers that seek to analyze RNASeq at scale (100s to 1000s of sequencing runs). We then detail a configuration-driven pipeline that enables the reproducible analysis of data that were utilized to analyze data from a large experimental condition space (e.g. many genetic designs and inducers) (12). We use these tools to measure the variability of controls in multiple experimental runs with perturbing controls of two popular model organisms: Escherichia coli MG1655 and Bacillus subtilis Marburg 168. After quantifying the variability in controls, we examine the impact of the variability on machine learning models built using two B. subtilis datasets collected using nearly identical protocols surveying a large space of growth conditions.

Materials and methods

Infrastructure to automate and track secondary analysis stages

Secondary analysis of the sequencing data involves a set of hierarchical steps that depend on the experimental design, conditions and reference files to generate counts and normalized counts data. Given the scale of experimental conditions (e.g. genetic designs, organisms and induction conditions), software infrastructure is required to ensure the analytical reproducibility of an RNASeq pipeline to be parameterized and automated to process RNASeq data at scale. The software infrastructure we have developed has three main components: (i) applications, (ii) databases and (iii) actors.

Applications.

Allow the software infrastructure to capture versions, any configuration parameters used during the alignment and quality control (QC) process and input/output data identifiers. They ensure that the specific processing or analysis pipeline that was run by the application can be run again exactly the same way, in the same environment, on any system. Tools such as Docker containers provide an opportunity for researchers to ensure that their code is self-contained, shareable and reproducible (20, 34). Applications are containerized versions for each stage in the pipeline. The main motivation behind modularization was to provide us the flexibility to rerun specific stages, if needed, without rerunning the entire pipeline. Furthermore, the modularity has the added benefit of updating specific components and tracking versions of software and configurable parameters for every execution.

Databases.

Allow the software infrastructure to store locations of raw data and processed data, metadata and full parameterizations and provenance of processing pipelines that link the data together in a queryable manner. We had two databases used in our infrastructure, integrated together as part of a larger system for handling data, metadata and knowledge in the Defense Advanced Research Projects Agency SD2 program (6, 41). The first was a data catalog that stored links to the raw data, universal identifiers and the experimental conditions. The second was a database specifically designed to store genetic parts known as SynBioHub. Some experimental conditions, such as strain and inducer names, stored links to SynBioHub (29) that led to more complete information about these materials.

Actors.

Allow the software infrastructure to respond to events (e.g. file upload and completion of an intermediate stage) to automate the processing of data at scale. In our system, reads/writes to the data catalog and job submissions are managed by Abaco Actors (5). Actors are lightweight, containerized scripts that are triggered in response to events. Actors do not perform computationally intensive tasks but instead oversee and coordinate job processing throughout the pipeline. Computationally intensive tasks (trimming, alignment and aggregation) are performed by Tapis applications, a framework that provides a web-based application programming interface (API) to manage computational workloads developed by the Texas Advanced Computer Center (8). Application jobs are triggered by actors and have a unique job identification number. Given software for each stage of the secondary analysis pipeline (preprocessing, alignment, post-processing and QC), our emphasis was to use the software infrastructure to ensure that the pipeline is manageable and reproducible for the large datasets made available over time (Figure 1).

Figure 1.

Diagram of the RNASeq processing pipeline. (A) The ingester monitors a metadata file uploaded to the data catalog to see if the experiment includes RNASeq data to trigger the processing pipeline. (B) The preprocessing actor triggers and stores the job ID and version of the preprocessing app to the data catalog. (C) The auditing actor receives notifications from applications to validate the outputs and triggers the next actor. The auditing actor will resubmit applications up to three times to handle stochastic failures. (D) The alignment actor queries the data catalog to determine the reference genome that is used for each preprocessed sample and triggers the alignment application. (E) The post-processing application annotates the alignments and aggregates samples into counts (raw, FPKM and TPM) dataframes. (F) The QC and metadata actor use metadata stored in the data catalog and information from the logs/outputs of each job to add experimentally relevant metadata and QC flags to the counts dataframes. The applications used in our RNASeq pipeline consist of trimming and quality filtering raw RNASeq data with Trimmomatic (v0.36) (3), and FastQC (2) is used to generate reports for paired, trimmed reads. Trimmomatic was chosen due to its high accuracy, although trimming methods had low overall impact on the overall RNASeq accuracy (10). Preprocessed reads are aligned to an indexed reference genome with BWA (v0.7.17) (8, 27). Burrows-Wheeler Aligner (BWA) was chosen because, for short reads, it has high coverage and alignment rates, for minimal cost in run time efficiency (30). For alignment, there is the prerequisite that the relevant reference genome has been identified, FASTA and GFF have been provided and the linkage between sample and reference genome (strain) is enumerated in the provided metadata. After alignment, the resulting Sequence Alignment Map (SAM) files are sorted by Picard tools (v2.18.15) function SortSam (36, 4) and then AddOrReplaceGroups is run on the output of SortSam. Gene-level quantification of counts is performed using the featureCounts function of Rsubread (v1.34.4) (3). Annotations are performed using the General Feature Format (GFF) provided. For RNA-sequencing, we identify any coding sequence (CDS) feature type from the GFF and annotate these in the dataframe using the ‘Name=’ attribute. All annotated samples are aggregated into a dataframe, and outputs include both raw counts and rudimentary normalization functions: transcripts per million (TPM) and fragment per kilobase per million mapped reads (FPKM). The following two Boolean metrics were used to measure sample quality: Number of mapped reads ≥ 500 kb. Replicate correlation of TPM values of a condition >0.8. If any of these metrics did not pass, the sample would be flagged as a low-quality sample and not used for downstream analysis. These applications can then be run individually on a high-performance computing system. For each job that is run in the software infrastructure, an actor queries a centralized database of information (the data catalog) to identify the required metadata for processing. For each job submission, there is also a record created in the data catalog enumerating the inputs, references, software versions and outputs for a given job. Tapis applications stage assets (inputs, references and container images) to a temporary working directory, submit resource requirements to a Simple Linux Utility for Resource Management (46) queue on a high-performance computing system, archive outputs to a predefined location (aka storage system) and send a notification to an actor providing the exit status of a job (success/failure). By leveraging Tapis applications, Abaco Actors and the data catalog, an automated pipeline is constructed that responds to events (raw data and metadata uploads) and processes samples in parallel for trimming and alignment and serially for aggregation and annotation, without the need for human intervention. This serves both to increase efficiency and throughput and also reduce the probability for human error. Since the core components for processing are containerized and the inputs/parameters for every job are stored in a centralized database, the results are reproducible across different machines.

Configurable tertiary analysis pipeline

A challenge often faced in synthetic biology is that experimental designs are often combinatorial in nature (multiple inducers, designs, timepoints, etc.); therefore, the number of differential expression comparisons can become unwieldy and unreliable to set up and analyze in a manual fashion. Additionally, when an experiment has hundreds to thousands of different comparisons, manual processing of differential expression is often impossible, so automated solutions are necessary to analyze the data. To help facilitate the automated calculation of differential expression between the different factors, we developed a Python-based configurable toolkit, which we call ‘omics_tools’ (Figure 2). Instead of having to write R scripts for every desired comparison, omics_tools automatically generates each R script, runs them and then generates consistently labeled output. The automated process increases repeatability/reliability by standardizing output and by reducing the risk of producing manual errors, such as mislabeling comparisons, mislabeling groups, including the wrong samples, etc. Standardizing output across multiple different experiments also allows for development and use of tools that analyze differential expression results without having to generate formatting scripts for each different experiment, facilitating cross-experiment comparisons. The tool aggregates the outputs from the parallelized runs and combines all the data into a single unified dataset where each row represents a gene, its differential expression, statistical significance and the condition for downstream machine learning.

Figure 2.

Omics_tools facilitates the analysis of combinatorial design RNASeq by integrating metadata (inducers, growth conditions, genetic manipulations, etc.) to automatically calculate differential expression with edgeR across all conditions of interest. The tool kit allows for parallel processing of thousands of samples and comparisons in a high-throughput manner and utilizes a standardized schema to help facilitate reproducibility in analytical analysis. Omics_tools uses edgeR (39) with a generalized linear model (GLM) to conduct differential expression analysis (DEA) using trimmed mean of M-values (TMM) normalization across the set of design variables (40). EdgeR was chosen, specifically with GLM and TMM, because the combination is a well-known method that has a high accuracy across a variety of datasets (Corchete et al., 2020). Additionally, TMM normalization has a high accuracy irrespective of other components within the tool chain (such as trimmer, mapper, differential expression, etc.) (Corchete et al., 2020). Other methods can replace the aforementioned method by either replacing omics_tools with another tool of your choice (starting with the counts data generated in previous steps) or modifying the R script generation of omics_tools to replace edgeR with another method that utilizes R to function.

Results and discussion

Automation enhances scale and processing of RNASeq data at scale

Data used for this experiment consisted of raw, gzipped RNASeq data of 2.4 TB in total that was processed into datasets that were 35 MB in total. The data included both metadata as well as data from sequencers for 1344 samples. Raw and processed data are available via the Gene Expression Omnibus (GSE206047). The infrastructure presented in Figure 1 was not available during all of the experiments conducted with E. coli. During this time, it would typically take ∼3 months to process the data, where most of the time was spent by a developer finding, verifying and troubleshooting data processing and metadata integration issues. This included mismatches of number of files to metadata rows, incomplete strain references and gaps in inducer concentrations as well as timepoints. The infrastructure now checks for all of this with the auditing actor and can alert users immediately if there is a mismatch, gap or processing error in any stage. The experiments with B. subtilis were able to benefit from the complete infrastructure and took a ∼3-month process down to 3 days, where the majority of the time was processing time. We also tested our infrastructure of Gene Expression Omnibus (GEO)-derived datasets to ascertain the extendability of our pipeline to data that were not designed for internal purposes. For example, with the manual restructuring of the metadata associated with a Pseudomonas putida experiment (37), we were able to run their data through our pipeline to get gene counts. While the results are not published with this paper, details of this process can be found in our documentation.

Using omics_tools to analyze the repeatability of controls

To quantify the reproducibility of controls, we ran two experiments involving E. coli MG1655 in media at three different timepoints (5, 8 and 18 h growth) on two different days. The timepoints were selected to represent common phases of growth in circuit characterization (17). Four replicates were used for each timepoint on each day. Our processing pipeline and omics_tools were used to analyze the data for all comparisons across all timepoints. The results from this comparison are presented in Table 1.

Table 1.

Percentage of the total 4097 genes assayed that were significantly differentially expressed across comparisons (sample_time, FDR < 0.05 and log2FC > 2)

Timepoint	Day 1: 5 h	Day 1: 8 h	Day 1: 18 h	Day 2: 5 h	Day 2: 8 h	Day 2: 18 h
Day 1: 5 h	0.0%	3.8%	29.4%	0.3%	4.6%	32.5%
Day 1: 8 h	3.8%	0.0%	20.2%	2.4%	0.3%	24.2%
Day 1: 18 h	29.4%	20.2%	0.0%	27.2%	20.1%	0.2%
Day 2: 5 h	0.3%	2.4%	27.2%	0.0%	2.2%	29.6%
Day 2: 8 h	4.6%	0.3%	20.1%	2.2%	0.0%	22.6%
Day 2: 18 h	32.5%	24.2%	0.2%	29.6%	22.6%	0.0%

Percentage of the total 4097 genes assayed that were significantly differentially expressed across comparisons (sample_time, FDR < 0.05 and log2FC > 2) Samples from the same timepoint on different days were found to be most similar (average 0.3% of the genome differentially expressed), and the least similar were the 5 h versus 18 h comparison (average 29.9% differential expression). The differences in 5 h versus 8 h, 5 h versus 18 h and 8 h versus 18 h timepoints are as expected and can be largely attributed to differences in growth phase (log phase versus transition versus stationary), whereas the minor differences in the same hour comparison can be attributed to slight differences in the state of the bacteria and aggregate of error and variability across the pipeline. We also compared different means of performing sample-to-sample normalization and identifying DEGs. The sample similarities can also be observed when using an FPKM normalization (Figure 3). We also utilized FPKMs to compute the differential expression of genes with t-test and Benjamini–Hochberg multiple testing correction, log2FC > 2 (fold change) and false discovery rate (FDR) < 0.05 (Supplemental Table S1) and compared those to omics_tools. Despite having a similar number of DEGs between the two methods, the resulting significant differential genes have a subset of genes that is not shared between the two. Unsurprisingly, the Jaccard coefficients (Supplemental Table S1) for the same timepoints across the 2 days averaged only 0.214 (range, 0.083–0.308) due to the very low number of genes differentially expressed (10.8 on average with a range of 1–21). The different timepoint and day comparisons had much higher similarities in differential genes with an average Jaccard of 0.598 (range 0.468–0.762) and an average percentage of 82.6% (range 75.3–88.8%) of genes that were significant from both omics_tools and FPKM comparisons. The main difference between omics_tools and FPKM is from intersample normalization of TMM. While normalization and differential strategy do matter, the majority of genes that are significant are shared between the methods.

Figure 3.

Pearson’s correlations (upper right numbers) between log2 FPKM of E. coli samples at each of the measured timepoints on both days. Samples taken on different days, at the same hour, are much more similar (average 0.99) than at different hours. The 5 and 18 h have high correlations (average 0.95), suggesting that they are in similar growth states, whereas the 5 and 18 h have the least similar expression profiles (average 0.64) due to their differences in growth states. Scatterplots are gene–gene log2 FPKM comparisons, with a red linear regression line. Histogram plots are for frequencies of gene FPKMs. To better understand how different species are affected by similar experimental conditions, we also examined data collected for B. subtilis controls at 5 h on three different days. The data were processed through the same high-throughput sequencing pipeline and analyzed with omics_tools. Unlike the E. coli strain, B. subtilis had much higher variability with an average of 4.87% significant differential expression (range 1.1–7.16%) (Supplemental Table S2) compared to 0.269% for the 5 h E. coli timepoint. The FPKM correlations (Figure 4) have a similar pattern as the 5–8 h E. coli comparison (Figure 3), suggesting that the growth rates may be different between the different days for B. subtilis. It is hypothesized that B. subtilis was more impacted by overnight growth and recovery in wells as compared to E. coli. The differences in dilution potentially impact growth and sporulation as well. Fundamentally, we showed that experimental conditions that produce reproducible results for one organism do not necessarily produce reproducible results for different organisms.

Figure 4.

Pearson’s correlations (upper right numbers) between log2 FPKM of 5 h timepoints for B. subtilis on each of the three different days. Samples 2 and 3 have much more similar patterns compared to 1–2 and 1–3. Additionally, these two comparisons have a lower correlation than the 5–8 h E. coli comparisons, suggesting greater variability in the growth/biological conditions for B. subtilis. Scatterplots are gene–gene log2 FPKM comparisons, with a red linear regression line. Histogram plots are for frequencies of gene FPKMs.

The impact of reproducibility on the validation of machine learning predictions

One challenge associated with the validation of machine learning models for high-throughput experiments is regarding reproducibility of the training data, specifically, if a model is built from a set of data that lacks or has underdeveloped biological and experimental context (i.e. metadata). Despite the immediate accuracy concerns, there is no guarantee that the model will generalize to future runs of experiments with similar experimental/biological context. Furthermore, extensibility of a given model has an upper bound of the experimental reproducibility of the set of conditions being modeled. Here, we evaluate the implications of reproducibility on a machine learning model trained with response from single inducers to predict the transcriptional response to a combination of inducers. The test data were generated twice on different days to measure the impact reproducibility has on the evaluation of the model. The samples were prepared in 96-well plates using a combination of automation and manual liquid handling methods that benefited from consistent sample volume throughout the process. We noticed that optical density (OD) values varied from 0.5 to 2.0 OD units across the replicates per condition with the first run of our experiment at 500 µl volume. Given the variability in OD, increasing the amount of culture collected allowed for more samples to meet the minimum material input criteria for concentration in the starting volume of samples. Thus, the only difference between the two experimental runs was the increase in volume of cell culture used for RNASeq from 500 to 900 µl to increase the amount of RNA available. We hypothesized that the increase would improve the quality of the samples generated, which would result in data for conditions that would drop out. Upon increasing the volume, we now observed a variability of 0.35–0.45 OD units. The decrease in variability with the increased volume was encouraging. Interestingly, cell culture volume is a factor that is often not used in downstream normalization, differential expression and pathway analyses directly, but as we will show, does have implications for sample quality that in turn impacts all other analyses. Furthermore, the increase in volume did not unanimously increase the number of replicates/samples available for analysis either, which we found surprising. This comparison can be made at multiple levels:

Sample QC.

(Figure 5A) Given the set of experimental parameters, QC can filter out a set of different samples per condition. For example, a sample induced with xylose and vanillic acid measured at 18 h post-induction has lost all four replicates in the repeated test condition. This implies that the condition will not generate the data necessary to identify how reproducible the model is at that condition. A researcher could look to see if the model predicted any impact on essential genes or pathways that can be linked to dropout, especially if the sample failed in both experimental runs. This could help the researcher determine if the dropout is biological in nature and not an experimental error. The more difficult scenario is when a set of samples pass QC in one run, but not the other. In total, there are 5/18 discrepancies across the two experimental runs, which means that the day-to-day variability could cause upward of 27% difference depending on the experimental run that is used.

Figure 5.

Comparison of experimental conditions across test and test repeat experimental runs. The 0/1 indicates the absence/presence of inducers for Panels A/B and differentially expressed status (0 not DEG and 1 is DEG for D). Each comparison will have a different impact on the experiment’s ability to validate a machine learning model trained on single inducers (conditions not shown here). (A) Different sample dropouts for the two runs mean that if both experiments were not run, some predictions would not be validatable. (B) Different genes being filtered will impact the set of genes from a condition that can be validated. (C) Genes can have quantitatively different responses, which can further add complications to validation and the gene dropouts (horizontal black and vertical red lines) across the two experimental runs means different genes will be validated. (D) While the set of DEGs that are different between days are in the minority, these discrepancies can have mechanistic consequences on the inferences made.

Gene QC.

(Figure 5B) Tertiary analysis, such as DEA, uses expression levels across genes to identify outlying genes that do not fit a dispersion model. Genes that do not fit within a dispersion estimate are filtered out. Filtering out genes means that a model that is making gene-level predictions will not have predictions of those genes validated quantitatively. However, if the model predicts the gene is not differentially expressed, then its expression likely falls below a noise threshold and that could also be why the dispersion model filtered it out. Thus, it is necessary to choose a dispersion model that accounts for expression level changes across experiments. Here, we use a local regression dispersion method to determine if a gene is an outlier or not. Looking across the conditions, the impact of filtering out genes leads to a different set of genes being measured per condition. On average, there are 93 out of 4267 genes that are different between each experimental run. Of the 93, there is an average of 67 unique genes or 1.6% of the transcriptome that are different between the experiments. Namely, these genes will have different responses given the day the experiment was run and thus place an upper bound on the repeatability of the model.

Gene quantitative response.

(Figure 5C) The next question to address is if the response levels of the genes (differential expression) are consistent across experimental runs. Response is measured with a quasi-likelihood negative binomial generalized log-linear model that outputs an FC as compared to a control. As mentioned in (2), the black and red lines are the genes that were filtered out as outliers by the dispersion model for a single experimental run. Beyond the different set of genes, it is clear that the largest variabilities in response between the experimental runs are genes that have lower expression in both runs (those closer to the origin). Genes with low expression are inherently unreliable through traditional processing and analysis pipelines (24). These are genes that would not pass a threshold for subsequent analysis in either experimental run and so their responses are statistically insignificant. There are, however, 44 unique genes whose response falls above the noise threshold in both experiments but are outside the 95% confidence bounds of the experiment. Repeatability of a machine learning model with common metrics such as R2 should not take these genes into account as they fall below a noise threshold.

If a gene is differentially expressed.

(Figure 5D) In some applications, such as pathway analysis, the quantitative response of a gene to a perturbation is not as important as whether or not the gene was dysregulated. In such instances, genes are labeled as DEGs if they satisfy magnitude (often in terms of log FC) and statistical significance (FDR) thresholds. DEGs are fed to downstream enrichment and pathway analysis tools to identify mechanistic and functional changes. Thus, evaluating whether the set of DEGs remains consistent is also an important aspect of reproducibility. Classification methods that are trained to predict if a gene is differentially expressed should account for both class imbalance and the inconsistencies between runs. In our case, we see that the majority of genes across the experimental runs are consistently labeled as impacted (1) or not impacted (0). There are, on average, 140 genes, or 3% of the transcriptome that is labeled differently (labeled as 0–1 or 1–0 in the bar plot). If these genes are essential genes, the set of mechanistic insights that are predicted by the model can be very different based on the experimental run used for validation.

Conclusion

Reproducibility is a cornerstone for utilizing RNASeq data to analyze and improve synthetic designs. One important aspect of reproducible designs is computation pipelines that help standardize files, metadata, analysis techniques and protocols in a high-throughput and easy to use environment. As such, we have developed and released both a secondary analysis pipeline and omics_tools to help facilitate analysis of synthetic biology RNASeq experiments. We have also utilized these two to better understand reproducibility in two different contexts: similarity of controls run on different days and the impact that reproducibility has on the test and training data utilized in machine learning techniques. We identified that the experimental protocols utilized for E. coli MG1655 produced highly reproducible results. However, when those same protocols were applied to B. subtilis, the reproducibility decreased in a manner that suggested the growth states (or some other biological phenomena) were not the same on different days. Given that we standardized the analysis through the analytical pipeline, any differences we observe are guaranteed to be biological or experimental in nature. Additionally, when exploring induction conditions, we found that while the majority of induction conditions were reproducible across the different days/extraction volumes, some were different. The differences have an impact on the ability to validate a model to predict response to inducers if you only have one run to choose from (i.e. you were only able to run the experiment once). While the majority of genes were consistently present and consistently differentially expressed across the two days/volumes, some genes were not. Such differences limit the upper bound of accuracy of the model, as areas where the test data are inconsistent produce a response distribution versus a single response. A standardized analytical pipeline will enable researchers to identify areas of low reproducibility (either experimental or biological) and focus their experiments on reducing the variability of those response distributions. Ultimately, computation pipelines that help facilitate reproducibility will allow for more consistent analysis of the data. Click here for additional data file.

37 in total

1. SynBioHub: A Standards-Enabled Design Repository for Synthetic Biology.

Authors: James Alastair McLaughlin; Chris J Myers; Zach Zundel; Göksel Mısırlı; Michael Zhang; Irina Dana Ofiteru; Angel Goñi-Moreno; Anil Wipat
Journal: ACS Synth Biol Date: 2018-01-30 Impact factor: 5.110

2. Reproducibility of High-Throughput Plate-Reader Experiments in Synthetic Biology.

Authors: Michael Chavez; Jonathan Ho; Cheemeng Tan
Journal: ACS Synth Biol Date: 2016-11-09 Impact factor: 5.110

3. Round Trip: An Automated Pipeline for Experimental Design, Execution, and Analysis.

Authors: Daniel Bryce; Robert P Goldman; Matthew DeHaven; Jacob Beal; Bryan Bartley; Tramy T Nguyen; Nicholas Walczak; Mark Weston; George Zheng; Josh Nowak; Peter Lee; Joe Stubbs; Niall Gaffney; Matthew W Vaughn; Chris John Myers; Robert C Moseley; Steven Haase; Anastasia Deckard; Bree Cummins; Nick Leiby
Journal: ACS Synth Biol Date: 2022-01-31 Impact factor: 5.110

4. Data-driven mechanistic analysis method to reveal dynamically evolving regulatory networks.

Authors: Jukka Intosalmi; Kari Nousiainen; Helena Ahlfors; Harri Lähdesmäki
Journal: Bioinformatics Date: 2016-06-15 Impact factor: 6.937

5. GNE: a deep learning framework for gene network inference by aggregating biological information.

Authors: Kishan Kc; Rui Li; Feng Cui; Qi Yu; Anne R Haake
Journal: BMC Syst Biol Date: 2019-04-05

6. A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways.

Authors: Martín Garrido-Rodriguez; Daniel Lopez-Lopez; Francisco M Ortuno; María Peña-Chilet; Eduardo Muñoz; Marco A Calzado; Joaquin Dopazo
Journal: PLoS Comput Biol Date: 2021-02-11 Impact factor: 4.475

7. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

8. Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors: Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal: Bioinformatics Date: 2014-04-01 Impact factor: 6.937

9. The Escherichia coli transcriptome mostly consists of independently regulated modules.

Authors: Anand V Sastry; Ye Gao; Richard Szubin; Ying Hefner; Sibei Xu; Donghyuk Kim; Kumari Sonal Choudhary; Laurence Yang; Zachary A King; Bernhard O Palsson
Journal: Nat Commun Date: 2019-12-04 Impact factor: 14.919

10. Reproducibility in systems biology modelling.

Authors: Krishna Tiwari; Sarubini Kananathan; Matthew G Roberts; Johannes P Meyer; Mohammad Umer Sharif Shohan; Ashley Xavier; Matthieu Maire; Ahmad Zyoud; Jinghao Men; Szeyi Ng; Tung V N Nguyen; Mihai Glont; Henning Hermjakob; Rahuman S Malik-Sheriff
Journal: Mol Syst Biol Date: 2021-02 Impact factor: 11.429