Dobril K Ivanov1,2, Gerrit Bostelmann1, Benoit Lan-Leung2, Julie Williams2, Linda Partridge3,4, Valentina Escott-Price2, Janet M Thornton1. 1. European Molecular Biology Laboratory, The European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, United Kingdom. 2. UK Dementia Research Institute at Cardiff (UKDRI), College of Biomedical and Life Sciences, Cardiff University, Cardiff, United Kingdom. 3. Max Planck Institute for Biology of Ageing, Cologne, Germany. 4. Institute of Healthy Ageing, and Department of Genetics, Evolution and Environment, UCL, London, United Kingdom.
Abstract
Many research teams perform numerous genetic, transcriptomic, proteomic and other types of omic experiments to understand molecular, cellular and physiological mechanisms of disease and health. Often (but not always), the results of these experiments are deposited in publicly available repository databases. These data records often include phenotypic characteristics following genetic and environmental perturbations, with the aim of discovering underlying molecular mechanisms leading to the phenotypic responses. A constrained set of phenotypic characteristics is usually recorded, and these are mostly hypothesis driven or limited to what can be recorded within financial or practical constraints. We present a novel proof-of-principle computational approach for combining publicly available gene-expression data from control/mutant animal experiments that exhibit a particular phenotype, and we use this approach to predict unobserved phenotypic characteristics in new experiments (data derived from EBI's ArrayExpress and ExpressionAtlas, respectively). We utilised available microarray gene-expression data for two phenotypes (starvation-sensitive and sterile) in Drosophila. The data were combined using a linear mixed-effects model with the inclusion of consecutive principal components to account for variability between experiments, in conjunction with Gene Ontology enrichment analysis. We show how available data can be ranked according to the likelihood of exhibiting these two phenotypes using random forest. The results from our study show that it is possible to integrate seemingly different gene-expression microarray data and predict a potential phenotypic manifestation with a relatively high degree of confidence (>80% AUC). This provides thus far unexplored opportunities for inferring unknown and unbiased phenotypic characteristics from already performed experiments, in order to identify studies for future analyses.
Molecular mechanisms associated with gene and environment perturbations are intrinsically linked and give rise to a variety of phenotypic manifestations. Therefore, unravelling the phenotypic spectrum can help to gain insights into disease mechanisms associated with gene and environmental perturbations. Our approach uses public data that are set to increase in volume, thus providing value for money.
Despite the flood of molecular omics data, with a few notable exceptions, such as the Genotype-Tissue Expression (GTEx) project [1], most datasets are rarely re-used, mainly due to challenges with combining the data from different sources. However, in most experimental studies, additional measurements are made of biochemical and physiological changes and of the changes in phenotypic characteristics that they bring about. Phenotypes can include, for instance, morphology, behaviour and pathology. Usually, a limited number of phenotypes are recorded, due to various study constraints. An intermediate phenotype, or sub-phenotype, is one that underlies the study phenotype, but crucially is influenced by fewer genes [2]. For instance, sub-phenotypes of Parkinson’s Disease (PD) can include olfactory impairment, gut function disturbance, motor impairments and cognitive decline, each of which may be mediated by subsets of the genes that together result in PD pathology. Quantifying a wide variety of sub-phenotypes associated with animal models of a disease could therefore help to identify causal mechanisms.
The aim of the present study was to develop an in-silico approach for inferring unobserved phenotypic characteristics from published gene-expression data resulting from genetic or environmental perturbations. To do this, we generated molecular signatures for two target phenotypes in the fruit fly Drosophila, starvation stress response defective (starvation-sensitive) and sterile, using available gene-expression data. Using machine learning, we showed that these molecular signatures can reliably predict the starvation-sensitive and sterile phenotypic traits using only expression datasets from studies where these phenotypes were not originally measured, thus adding value to already deposited data.
Materials and methods
A schematic overview of the generation of a gene-expression molecular signature for a specific phenotype of interest is presented in Fig 1.
Fig 1
Flow diagram of the overall generation of molecular signatures for a phenotype of interest.
a) Building the molecular signature and selecting model parameters for a particular phenotype. b) Predicting phenotypic manifestation in unknown experiments utilising the molecular signature.
Data collection
Linking phenotypes to perturbed genes in Drosophila
In order to identify perturbed genes that lead to a particular phenotype, we downloaded several datasets from FlyBase (http://flybase.org/). These comprised allele phenotypic data, synonyms, annotation identifiers, the controlled vocabulary and allele-to-gene identifier mappings. Using in-house custom programs, we parsed and linked all these identifiers with the phenotypic data. That is, for each FlyBase phenotype, we obtained a list of identifiers (e.g. FlyBase gene numbers, allele symbols, synonyms).
Obtaining expression data from EBI’s ArrayExpress
To maximise the number of experiments for each phenotype chosen for this study, we used the Affymetrix GeneChip Drosophila Genome 2.0 Array (EBI’s ArrayExpress identifier A-AFFY-35). At the time of conducting the analysis, the largest number of experiments had been performed using the Affymetrix Genome 2.0 microarray platform (number of experiments: 330).
Using the above-mentioned FlyBase identifiers (linking phenotypes to perturbed genes) we searched EBI’s ArrayExpress for any potential match using the textual representation of EBI’s web resource, i.e. REST-style queries. The identifiers were used as keywords to form a URL and the XML result was parsed using a custom-made Perl program. The nature of the allele constructs for experiments deposited in EBI’s ArrayExpress does not follow a specific nomenclature and the authors/depositors are allowed relative freedom in describing the gene constructs. For example, EBI’s ArrayExpress identifier E-GEOD-18576 lists a genotype description as a DHR96 mutant. We did not assume that different allele constructs for the same gene will exhibit the same phenotype. Therefore, for each of the experiments that matched any of the FlyBase identifiers for the two target phenotypes, we manually curated the data first by reading all the accompanying manuscripts and subsequently retained experiments where the same allele construct was used. Furthermore, only experiments with raw gene-expression data (data with available raw cel files) were retained.
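The keyword search described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' Perl program. The base URL, the `keywords` parameter name and the XML layout (`experiment` and `accession` elements) are assumptions made for the example and may not match the actual ArrayExpress interface.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Hypothetical endpoint: a placeholder, not the documented ArrayExpress URL.
BASE_URL = "https://example.org/arrayexpress/xml/experiments"


def build_query_url(identifier, base_url=BASE_URL):
    """Form a REST-style query URL using a FlyBase identifier as the keyword."""
    return base_url + "?" + urllib.parse.urlencode({"keywords": identifier})


def parse_accessions(xml_text):
    """Extract experiment accessions from an XML response (assumed layout)."""
    root = ET.fromstring(xml_text.strip())
    return [node.findtext("accession") for node in root.iter("experiment")]


# Illustrative response in the assumed layout; real responses will differ.
SAMPLE_XML = """
<experiments>
  <experiment><accession>E-GEOD-18576</accession></experiment>
  <experiment><accession>E-GEOD-24978</accession></experiment>
</experiments>
"""

print(build_query_url("DHR96"))
print(parse_accessions(SAMPLE_XML))
```

Each returned accession would still need the manual curation step described above, since a keyword match does not guarantee that the same allele construct was used.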
Normalised gene-expression values
Raw gene-expression data (cel files) were downloaded from EBI’s ArrayExpress (https://www.ebi.ac.uk/arrayexpress/). Throughout this manuscript, unless otherwise specified, an ‘experiment’ is a set of control/mutant gene-expression microarray assays submitted to EBI’s ArrayExpress under the same identifier and exhibiting the phenotype of interest (see Fig 2). For each experiment separately, the raw data were summarised and normalised using the rma function (Bioconductor package affy [3]). Log2-normalised expression data for all experiments that exhibited a particular phenotype were combined in a single dataset.
Fig 2
Definition of an experiment exhibiting a phenotype of interest.
EBI’s ArrayExpress identifier: E-GEOD-24978.
Removal of batch effects within an experiment
Individual experiments for the two target phenotypes were examined for the presence of batch effects. For each ArrayExpress accession number, all individual microarray cel files were downloaded, including any microarray assays that did not exhibit the phenotypes in question but were submitted under the same ArrayExpress identifier. For each experiment, we performed principal component analysis (PCA) of the log2-normalised microarray expression data. Where significant batch effects were detected, we used the ber package in R [4] to correct for them. For example, if an experiment that exhibited the phenotype of interest had sets of controls/mutants derived from different tissues, and therefore showed significant heterogeneity in gene-expression patterns, the tissue was used as a factor in the batch effect correction.
Generation of the molecular signatures (linear-mixed effects model)
A random-intercept linear mixed-effects model (LMEM) was used to generate normalised residuals for each gene on the Affymetrix Genome 2.0 microarray, accounting for a number of consecutive principal components. The principal components were the fixed effects and the different experiments were the random effect, with gene expression as the dependent variable. The residuals were then used to perform a logistic regression to assess the statistical significance of each gene. For the LMEM, the lmer function in R was used. The number and nature of the underlying biological and technical factors that differ between the experiments (e.g. sex, tissue) are largely unknown. To determine how many principal components to use, the molecular signatures for the two target phenotypes were therefore generated using LMEMs that included an increasing number of consecutive principal components to account for these biological/technical effects. We started with an LMEM with no principal components and progressed up to an LMEM with the first seven consecutive principal components included (eight different models).
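The idea of removing experiment-level and principal-component effects before testing each gene can be sketched as below. This is a simplified stand-in, not the lmer model used in the study: each experiment gets a fixed intercept instead of a random one, and the fit is ordinary least squares with NumPy. The function name and the samples-by-genes input layout are assumptions for the example.

```python
import numpy as np


def residuals_after_pcs(expr, experiment_ids, n_pcs):
    """Residuals of log2 expression (samples x genes) after removing
    per-experiment intercepts (fixed, as a stand-in for random intercepts)
    and the first n_pcs principal components."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)
    # Sample scores on the principal components, obtained from the SVD.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]
    # One indicator column per experiment, followed by the PC scores.
    labels = sorted(set(experiment_ids))
    dummies = np.array(
        [[1.0 if e == lab else 0.0 for lab in labels] for e in experiment_ids]
    )
    design = np.hstack([dummies, scores])
    # Least-squares fit of every gene at once; residuals are what is left.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef
```

Calling the function with `n_pcs` from 0 to 7 would give the eight sets of residuals that correspond to the eight models compared in the text.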
Gene Ontology (GO) enrichment analysis
The Wilcoxon rank sum test, as implemented in Catmap [5], was used to perform functional analysis to test for significant enrichment of Gene Ontology categories. Ranks of genes were based on the p-value derived from the logistic regression, irrespective of beta-coefficients. To account for multiple hypothesis testing, the Benjamini-Hochberg false discovery rate (FDR) was used. To assess whether there was a significant enrichment of GO terms associated with the two target phenotypes of interest in the derived molecular signatures, we selected GO terms that we considered representative of the two phenotypes (S1 and S2 Figs in S1 File).
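For reference, the Benjamini-Hochberg adjustment mentioned above can be written in a few lines. This is a generic sketch of the standard procedure, not the Catmap implementation, and the function name is chosen for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A GO term would then be called significant when its adjusted p-value falls below 0.05, the threshold marked in Figs 3 and 4.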
Leave-one-out cross-validation
To assess how well the molecular signatures could be used to predict the target phenotype in other experiments that exhibit a phenotype of interest, we used the randomForest package in R (default parameters with 1,000 trees). We used leave-one-out cross-validation (LOOCV) to calculate the area under the curve (AUC). Iteratively, for all experiments, we left one experiment out and derived the molecular signature using the rest of the experiments that exhibited the target phenotype. For example, one iteration comprised removing the controls/mutants of the crol experiment (starvation-sensitive) and generating the molecular signature using the rest of the experiments (dhr96, mir14, p53 and rbf). Crucially, we derived the residuals from the random-intercept LMEM, along with consecutive principal components, for all experiments that exhibited the target phenotype, and only then left one experiment out. This ensured that the model was corrected for underlying technical factors before performing the LOOCV. The AUC was calculated using the class (control/mutant) probabilities derived from the randomForest package, using the top 200 genes from the molecular signature (based on the p-values from the logistic regression). We also tested different numbers of top genes (from 50 to 3,000 genes; S6 and S7 Figs in S1 File for the starvation-sensitive and sterile phenotypes, respectively). In addition, we formally tested whether the mean of the class probabilities was different from 0.5 using a t-test, separately for controls and mutants, for the left-out experiment. A probability of 0.5 is the null hypothesis and is equivalent to a random assignment of the controls/mutants.
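The bookkeeping of the cross-validation above can be sketched without the classifier itself. The snippet below shows only how samples might be split by experiment and how an AUC can be computed from class probabilities; the random forest step is omitted, and the function names are assumptions for the example.

```python
def leave_one_experiment_out(experiment_ids):
    """Yield (held_out, train_idx, test_idx) for each experiment in turn."""
    for held_out in sorted(set(experiment_ids)):
        train = [i for i, e in enumerate(experiment_ids) if e != held_out]
        test = [i for i, e in enumerate(experiment_ids) if e == held_out]
        yield held_out, train, test


def auc(labels, scores):
    """AUC as the fraction of (mutant, control) pairs in which the mutant
    (label 1) has the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

In each iteration the signature would be rebuilt from the `train` indices and the mutant class probabilities predicted for the `test` indices; pooling those probabilities across iterations gives the scores passed to `auc`.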
Predicting the presence of phenotypic expression in freely available data
Similarly to the LOOCV, we used the molecular signature (top 200 genes based on the p-value from the logistic regression) for the starvation-sensitive and sterile phenotypes to predict the presence of the phenotypes in all available data in ExpressionAtlas (Affymetrix GeneChip Drosophila Genome 2.0 Array). Iteratively, for each deposited experiment in ExpressionAtlas, we first derived residuals from a random-intercept LMEM, including consecutive PCs, from the combined log2-normalised data for that experiment and the experiments that were part of the two phenotypes (starvation-sensitive and sterile). This ensured that we accounted for any technical variability between experiments. These residuals were then used to derive the probabilities for class (control/mutant) separation with the randomForest package in R. Each individual control/mutant sample within an experiment was assigned a class probability (control or mutant). The probabilities were then averaged across samples, separately for controls and mutants. This mean probability was used to infer the target phenotype quantitatively.
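The averaging and ranking step can be illustrated with a short sketch. The data layout (a list of class and probability pairs per experiment) and the accessions in the usage note are hypothetical and serve only to show the calculation.

```python
from statistics import mean


def mean_mutant_probability(samples):
    """samples: list of (sample_class, p_mutant) pairs for one experiment."""
    return mean(p for cls, p in samples if cls == "mutant")


def rank_experiments(experiments):
    """experiments: dict mapping accession -> samples.
    Returns (accession, mean mutant probability) pairs, highest first."""
    scored = {acc: mean_mutant_probability(s) for acc, s in experiments.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_experiments({"EXP-A": [("mutant", 0.9), ("mutant", 0.7)], "EXP-B": [("mutant", 0.6), ("control", 0.2)]})` places the made-up accession `EXP-A` first.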
Results
Experiments and expression data
Using the above protocol, we identified five and six experiments, respectively, with specific perturbed genes for which gene-expression data for the starvation stress response defective (FlyBase controlled vocabulary identifier FBcv:0000708) and the female sterile (FBcv:0000366) target phenotypes were available. These were dhr96 (E-GEOD-18576), mir-14 (E-GEOD-20202), rbf (E-GEOD-38430), p53 (E-GEOD-37404) and crol (E-GEOD-8775) for the starvation-sensitive phenotype and loj (E-GEOD-10940), ovo (E-GEOD-48145), pxt (E-GEOD-29815), su(HW) (E-GEOD-36528), ttk (E-GEOD-42758) and vret (E-GEOD-30360) for the sterile phenotype. Additional information can be found in S1 and S2 Tables in S1 File. Following normalisation and excluding transcripts that did not match any known or predicted gene, there were 12,630 genes left for analysis. The normalised gene-expression data are available upon request.
GO-terms enrichment analysis
Figs 3 and 4 show the results for the GO-terms associated with the two target phenotypes respectively (full numerical data are shown in S5 and S6 Tables in S1 File). Enrichment of starvation-related GO terms for the starvation-sensitive phenotype was observed for LMEM with the inclusion of one to four PCs (Fig 3). In contrast, sterile-related GO terms were found to be mostly enriched with LMEM without the inclusion of PCs (Fig 4). This suggests that there is more inter-experiment variability associated with the starvation-sensitive phenotype as compared to the sterile. All of the individual gene perturbation experiments that exhibited the sterile phenotype comprised female flies and more homogeneous tissue used to derive the expression data (S2 Table in S1 File), whereas the individual experiments for the starvation-sensitive phenotype were mixed sex and the expression data were derived from a variety of tissues (S1 Table in S1 File).
Fig 3
Top GO terms for the starvation-sensitive molecular signature.
Red vertical line represents FDR p-value 0.05.
Fig 4
Top GO terms for the sterile molecular signature.
Red vertical line represents FDR p-value 0.05.
We also performed a GO enrichment analysis associated with individual control/mutant experiments exhibiting the two target phenotypes (e.g. crol part of E-GEOD-8775). Ranks of genes were derived using the limma package in R. Only two experiments showed any statistically significant evidence of GO-terms enrichment associated with the starvation phenotype (crol and p53; S3 Table in S1 File), whereas all of the experiments that were identified to exhibit the sterile phenotype showed statistically significant enrichment of reproduction-related GO terms (S4 Table in S1 File).
Only one experiment (loj), with the sterile phenotype, exhibited a significant batch effect. The controls and mutants comprised two tissues (abdomen and head/thorax). We used the ber package to correct for the batch effect using the tissue as a factor. We observed two clusters for the first PC (89.34% variance explained) that separated the loj by tissue (S3a Fig in S1 File). Correcting for the tissue batch effect eliminated the tissue separation and the loj controls/mutants separated by the second PC (S3b Fig in S1 File).
Determining the number of PCs for unwanted variation
The maximum AUC in the leave-one-out cross-validation was 97% for the starvation-sensitive phenotype (LMEM with six consecutive PCs) and 85% for the sterile phenotype (LMEM with no PCs) (Figs 5 and 6).
AUC: area under the curve; a through h: LMEM with 0 to 7 PCs.
Nevertheless, GO term enrichment analysis showed that the statistical significance of starvation-related GO terms disappeared (FDR p-value >0.05) when the first five or six PCs were included in the LMEM (Fig 3). GO terms enrichment results for the sterile phenotype are shown in Fig 4. Furthermore, PCA of the residuals of the starvation sensitive LMEM with five or six PCs showed near complete separation of the controls and mutants (S4f and S4g Fig in S1 File). Taken together, these results suggest that the first four PCs account for biological/technical variability, that the overall molecular signature is enriched with starvation-related GO terms, and the 5th and 6th PCs account for the starvation-sensitive phenotype. We hypothesise that when we account for the first 5–6 PCs, the signal that is left is a form of global gene-expression regulation following a gene perturbation. Thus, accounting for the first five or six PCs results in a prediction of the class separation, rather than the manifestation of the phenotype. A gene perturbation disrupts the global gene-expression equilibrium and results in differential expression of compensatory gene mechanisms. In other words, control/mutant experiments with seemingly different gene perturbations may result in a higher than expected by chance overlap of differentially expressed genes, i.e. genes that are part of the compensatory gene-expression regulatory network. In order to test this hypothesis, we performed 1,000 permutations, whereby we chose five random control/mutant experiments from EBI’s ArrayExpress. The number of controls/mutants per experiment was matched to the number of controls/mutants in the five experiments for the starvation-sensitive phenotype. Thus, the number of controls/mutants in a randomly chosen experiment was reduced to match the number of controls/mutants in S1 Table in S1 File. 
For each of these experiments we derived normalised gene-expression values using the same procedure as for the starvation-sensitive phenotype. We derived differentially expressed genes using the limma package in R. For each of these random sets of experiments, we selected the top 200 genes and calculated the number of genes that overlap within each set of experiments in a pairwise manner. For each of these permutations we calculated the median of the -log10 of the p-value for each pairwise overlap using hypergeometric distribution. We compared these results to the pairwise overlap of random 200 genes as part of 1,000 sets of experiments. The distributions of the results for the random 1,000 sets of experiments and for what is expected by chance are shown in Fig 7.
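The pairwise overlap statistic described above can be sketched as follows, assuming an upper-tail hypergeometric test (the probability of an overlap at least as large as the one observed). The function names are chosen for the example. The value is computed as a difference of logarithms of exact integers, so that very small p-values do not underflow to zero.

```python
from itertools import combinations
from math import comb, log10
from statistics import median


def neg_log10_overlap_p(overlap, n_top, n_genes):
    """-log10 of P(X >= overlap) when two sets of n_top genes are drawn
    at random from n_genes (hypergeometric upper tail)."""
    total = comb(n_genes, n_top)
    tail = sum(
        comb(n_top, k) * comb(n_genes - n_top, n_top - k)
        for k in range(overlap, n_top + 1)
    )
    # Work with logs of exact integers to avoid floating-point underflow.
    return log10(total) - log10(tail)


def median_pairwise_neg_log10_p(top_gene_sets, n_genes):
    """Median -log10 p-value over all pairs of equally sized top-gene sets."""
    values = [
        neg_log10_overlap_p(len(a & b), len(a), n_genes)
        for a, b in combinations(top_gene_sets, 2)
    ]
    return median(values)
```

With the figures given in the text, each set would hold the top 200 genes and `n_genes` would be 12,630; one median is obtained per permutation of five experiments.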
Fig 7
Distribution of the pairwise overlap of genes in 1,000 random sets of five experiments, derived from ArrayExpress, as compared to expected by chance.
Y-axis- Median -log10 hypergeometric p-value for significance of pairwise overlap.
The results presented in Fig 7 clearly show that random sets of five experiments exhibit a significantly greater overlap of differentially expressed genes between experiments than expected by chance alone. This observation has also been reported in humans [6]. Thus, for the leave-one-out cross-validation for the starvation-sensitive phenotype we used the first four PCs to account for biological/technical variation. For the sterile phenotype we did not use PCs (LMEM with 0 PCs). PCA graphs for the sterile molecular signature LMEM with 0 to 7 PCs are shown in S5 Fig in S1 File. For the calculation of the AUC for the LOOCV we tested a range of top genes (from 50 to 3,000). For the starvation-sensitive phenotype the AUC changed little with the number of top genes, although choosing more genes resulted in a slightly higher AUC (50 genes, 87.76% AUC; 3,000 genes, 90.31% AUC; S6 Fig in S1 File, with 4 PCs). The opposite was noted for the sterile phenotype, where fewer top genes resulted in a higher AUC (50 genes, 90.58% AUC; 3,000 genes, 73.68% AUC; S7 Fig in S1 File, with 0 PCs). These trends could potentially reflect the size of the transcriptional network involved in each phenotype; for example, it has been previously reported that starvation stress resistance involves a transcriptional response of ~25% of the genome in Drosophila [7].
The means of the control/mutant class probabilities from the random forest for both the starvation-sensitive and sterile phenotypes were significantly different from 0.5 (Table 1). The results in Table 1, along with the AUC for both phenotypes (Figs 5 and 6), show that we can confidently predict the phenotypic manifestation of a separate experiment that exhibits the phenotype of interest.
Table 1
One sample t-test for class probabilities (controls/mutants) in the two phenotypes following LOOCV.
Class | Starvation-sensitive p-value (μ = 0.5) | Sterile p-value (μ = 0.5)
Controls | 5.72 × 10^-3 | 3.87 × 10^-3
Mutants | 5.84 × 10^-3 | 3.10 × 10^-3
Predicting freely available experiments for the presence of both phenotypes
In order to obtain freely available experiments we utilised EBI’s ExpressionAtlas (https://www.ebi.ac.uk/gxa/home) instead of ArrayExpress. We used EBI’s ExpressionAtlas due to the availability of normalised gene-expression values for a large number of the already available raw cel gene-expression data in ArrayExpress. This eliminated the need to normalise all of the available raw gene-expression data within ArrayExpress. For all experiments available in EBI’s ExpressionAtlas (total number of control/mutant experiments at the time of conducting the study: 211) we used the molecular signatures for the starvation sensitive and sterile phenotypes to derive a mean probability separately for controls and mutants in an experiment. The mean mutant probability was used to suggest a degree of phenotypic manifestation. Ranking of all available experiments is given in S7 and S8 Tables in S1 File for the starvation-sensitive and sterile phenotypes respectively.
Ranking EBI’s ExpressionAtlas experiments for the starvation-sensitive phenotype
The top three ranked experiments had all been used to generate the molecular signature (dhr96, crol and rbf), so it is not unexpected that we can predict these experiments with the highest accuracy. The p53 (E-GEOD-37404) and mir-14 (E-GEOD-20202) experiments are not included in the EBI’s ExpressionAtlas datasets.
For the rest of the freely available experiments in EBI’s ExpressionAtlas we found no results from a direct lab-based assay of starvation sensitivity. Nevertheless, for some of the top-ranked experiments we found additional evidence that can potentially be used to support the results of our prediction. All three gene mutants (rbf120a, rbf120a wtslatsX1 and wtslatsX1), part of one experiment (E-GEOD-24978), were ranked with mutant class probabilities of 83%, 74% and 64% respectively. The two genes, rbf and wts, regulate cell proliferation via the p16 and Hippo tumour suppressor pathways. A direct lab-based measurement of the starvation-sensitive phenotype exists only for rbf120a, which was used as part of the molecular signature. We speculate that wtslatsX1 and the double mutant rbf120a wtslatsX1 may also exhibit the starvation-sensitive phenotype.
Several of the top-ranked experiments included fly lines from the Drosophila Genetic Reference Panel (DGRP) [8]. These included genes (esg, Pdcd4, mub, Gbs-70E) that were reported to exhibit a reduced starvation resistance, tested at six weeks.
Ranking EBI’s ExpressionAtlas experiments for sterile phenotype
The top four ranked experiments in the EBI’s ExpressionAtlas comprise four control/mutant experiments already used for the sterile molecular signature (ovo (ovo and ovo/cako) and loj (head and thorax)), so it is not surprising that we can detect these with high accuracy. The rest of the experiments that were part of the molecular signature were not analysed as part of EBI’s ExpressionAtlas (not all experiments from ArrayExpress are analysed in ExpressionAtlas). As with the starvation-sensitive molecular signature, we found no direct evidence that the top-ranked experiments will exhibit the sterile phenotype. Nevertheless, there was additional evidence for some of the top-ranked experiments. For example, experiment E-GEOD-55187 comprises sesb1 homozygous female mutants that are predicted to exhibit the sterile phenotype with a mean probability of 85% across the individual mutants. Sesb1 is listed as female sterile in FlyBase (http://flybase.org/reports/FBal0015434-phenotypic_data_sub). Due to lack of information, we could not verify whether the gene mutant shown as sterile [9] is exactly the same as the gene mutants with the microarray data in EBI’s ArrayExpress [10]. Similarly, in experiment E-MTAB-3546 [11], a 3-week reproductive diapause under cold conditions (11 °C) was predicted to exhibit the sterile phenotype with a mean mutant probability of 91% across the individual mutants. The mutant female flies are very likely to exhibit the sterile phenotype, as they were induced into a diapause that is associated with reproductive arrest. The 10- and 40-day-old dietary restricted female flies (E-GEOD-26726) also showed evidence of the sterile phenotype (84% and 79% respectively). There is a well-defined reduction in daily and lifetime fecundity under dietary restriction [12]; it is therefore likely that the 10- and 40-day-old flies will exhibit the sterile phenotype.
Discussion
In this paper we present a novel computational approach for integrating gene-expression data for two specific phenotypes (starvation-sensitive and sterile) in Drosophila from the vast and largely unutilised freely available public repositories. This integration is multi-layered with phenotypic information derived from a species-specific database (FlyBase) and gene-expression from the largest repository of publicly available genomic data, the ExpressionAtlas at the European Bioinformatics Institute. Crucially, we present an approach to utilise gene-expression data generated by completely independent groups across the scientific community.
The results of this proof-of-concept study show that it is possible to integrate seemingly different gene-expression microarray data using a combination of linear mixed-effects models and principal component analysis and predict a potential phenotypic manifestation with a relatively high degree of confidence. Nevertheless, the applicability of this methodology to capture a wide range of phenotypes and organisms requires a considerable amount of additional work that is beyond the scope of this article.
Our methodology rests on the assumption that specific cellular and physiological phenotypes are underpinned by or associated with similar gene-expression changes. In addition, the number of such gene-expression changes that are shared between different perturbations and are associated with a specific phenotype is likely to differ between phenotypes. Currently, there is no simple way to derive a set number of gene-expression changes that describe a particular phenotype, and this number is also likely to depend on the nature of the phenotype. We used an empirically derived number of genes for the two phenotypes that we tested (top 200 genes, based on p-value for differential expression), although this selection can potentially be automated using a different number of genes. 
Our approach might not be directly applicable if a specific phenotype is underpinned by independent biological pathways or caused by mechanisms that do not result in changes in gene expression. Nevertheless, additional genomic measurements can be incorporated as and when they become available. Furthermore, our methodology relies on freely available gene-expression data, the volume of which is only set to increase [13]. Thus, as repository data grow, our approach has great potential to estimate the relative degree of independence of the biological pathways that influence or give rise to specific phenotypes.
Biological phenotypes are rarely binary features, although they are often binarised for ease of use. For example, the gravitaxis-defective phenotype (movement away from the source of gravity) can be expressed as defective/normal, or a more complex measure can be used to account for the continuous nature of the phenotype [14]. Nevertheless, even with considerable efforts to standardise experimental protocols and measurement assays, differences will arise between laboratories across the world. As such, it is difficult to utilise continuous phenotype response measurements. In this study we only considered control/mutant experiments. For such experiments the measured phenotypes can be taken as relative to controls, thus minimising the effect of differences in protocols. Nevertheless, for most such experiments in Drosophila, there is no unified system or database that collects and archives the outcomes of such measurements; currently these have to be extracted manually from the corresponding manuscripts, and an assessment made of how similar the protocols are. Our methodology for predicting potential phenotypic manifestation uses a machine learning approach, namely random forest.
This could potentially be used to infer the two phenotypes probabilistically, although it is unclear what the relationship is between the similarity in gene expression and the degree of phenotype manifestation.
Although our study utilises gene-expression microarray data, and this type of data has clearly been superseded by RNA sequencing [13], we do not foresee any major challenges in adapting our methodology to work with RNA-seq data. For example, raw RNA-seq counts can be relatively easily transformed into transcripts per million (TPM), and log2 of TPM can be used in the linear-mixed effect models.
Our methodology relies on linear-mixed effect models that account for unwanted biological effects in the form of principal components (PCs). To estimate the number of PCs we utilised Gene Ontology (GO) enrichment analysis, whereby we chose the number of consecutive PCs that maximised GO enrichment. One potential limitation is that there might be some degree of circularity when using GO terms to define phenotypic enrichment, since GO categories could have been partially defined using similar data. The other limitation is that the combination of PCs and a linear-mixed effect model is likely to be overconservative, in that some variation in the phenotype of interest may already be captured by the PCs. Other approaches, such as probabilistic estimation of expression residuals (PEER) [15], could be used to facilitate estimation of unwanted factors.
The proof-of-concept study presented here is a novel approach for predicting the manifestation of two phenotypes in Drosophila from gene-expression data. While similar attempts have been made previously [16-19], these studies rely on a single or a few well-defined datasets with few measured phenotypes. Our approach goes beyond single studies and is not restricted to selective phenotypic measurements in a few datasets.
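The RNA-seq adaptation mentioned above (raw counts to TPM, then log2) can be sketched as follows. This is a minimal illustration under stated assumptions rather than part of the published pipeline: the gene names, counts and transcript lengths are hypothetical, and the pseudocount of 1 added before taking log2 is our assumption to avoid log2(0) for unexpressed genes.

```python
import math


def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts: dict gene -> raw read count for one sample.
    lengths_bp: dict gene -> transcript length in base pairs.
    """
    # Length-normalise first (reads per kilobase), then scale to one million.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def log2_tpm(tpm, pseudocount=1.0):
    """Log2-transform TPM values; the pseudocount is an assumption, see above."""
    return {gene: math.log2(value + pseudocount) for gene, value in tpm.items()}


# Hypothetical counts and transcript lengths for one sample.
example_counts = {"gene_a": 100, "gene_b": 300, "gene_c": 0}
example_lengths = {"gene_a": 1000, "gene_b": 3000, "gene_c": 2000}

tpm = counts_to_tpm(example_counts, example_lengths)
log_expr = log2_tpm(tpm)
print(tpm)
print(log_expr)
```

In this example gene_a and gene_b receive the same TPM despite different raw counts, because gene_b is three times longer; the log2 values would then take the place of the normalised microarray intensities in the models.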
The methodology described here captures the diverse genetic background and gene-perturbations from all the publicly available repository data and links them to phenotypic characteristics, thereby adding value to already deposited and largely unutilised data.
9 Jul 2020
PONE-D-20-03484
A novel computational approach for predicting complex phenotypes by deriving their gene expression signatures from public data
PLOS ONE
Dear Dr. Ivanov,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Please submit your revised manuscript by Aug 23 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.
Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols
We look forward to receiving your revised manuscript.
Kind regards,
Xia Li, Ph.D.
Academic Editor
PLOS ONE
Journal Requirements:
When submitting your revision, we need you to address these additional requirements.
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf
2. Thank you for stating the following in the Acknowledgments Section of your manuscript: [This work was supported by the UK Dementia Research Institute which receives its funding from DRI Ltd, funded by the UK Medical Research Council, Alzheimer's Society and Alzheimer's Research UK. The project was also part-funded by the European Regional Development Fund through the Welsh Government.]
We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.
Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.
Currently, your Funding Statement reads as follows: [The author(s) received no specific funding for this work.]
Additional Editor Comments (if provided):
According to the reviewers' comments, my final decision is "Major" revision.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The manuscript is written well and is understandable. However I think it should be made clearer in the abstract and within the text that both Array Express and Expression Atlas were used and an explanation of why the Expression Atlas was used.
Reviewer #2: Since the data was restricted to starvation and sterility, would suggest that the title reflect so, as phenotype is a very big bracket. And it is likely that the spectrum would vary between phenotypes e.g. colour, height etc vs binary cases like sterility. Also the data encompasses only Drosophila, so suggestion to be safer to be narrow in coverage in title and rest of the writing. By all means discuss on applicability across multiple organisms and phenotype in discussion bearing in mind limitations. It would be good to discuss applicability of towards different ranges of phenotypes. There is also a claim of capturing environmental influences, which is not really reflected in the results. Would suggest limiting claims and overall applicability, otherwise a lot of work should be done to substantiate those claims - multiple organisms, multiple range of phenotypes, etc.
Reviewer #3: The paper by Dobril K. Ivanov et al. demonstrated a combination of linear mixed effect models and principal components analyses approach for integrating gene expression data for specific phenotypes from independent labs widely available in public repositories.
As a proof-of-concept study to show the promising in-silico approach for inferring phenotypic characteristics behind the published gene-expression data resulting from genetic or environmental perturbations, the data they selected is excellent for testing their methods and hypothesis. The results they presented are partial but convincing. I would like to see this approach be further explored, tested, and improved by more computational researchers. Along this line, since there are still some ambiguous or arbitrary steps or parameters in the study, I recommend that this paper only be accepted after the following major comments/issues are resolved:
The two represented phenotypes (starvation sensitive and sterile) were nicely investigated and presented. But in the current workflow, there are few steps that involve manual curation, like the selection of expression data and the representative GO terms. To show the application's universality, please add at least one additional test phenotype of interest that the related experiments and significant enrichment of GO terms were selected by predefined rules or algorithms.
The current flow diagram in Fig.1 is quite confusing, at least to me. Please consider separating the signature finding and phenotype prediction into two parts.
Meanwhile, since the paper already discussed the complexities of the relation between the gene-expression changes and the phenotypes and described the limitation of the current approach.
If possible, I would like to see an evaluation part by the end of the signature-finding section to evaluate whether the selected phenotype could be significantly presented or identified by the expression profiles of the currently available microarray experiments.
In the cross-validation section, the paper shows "the AUC was calculated using the class (control/mutant) probabilities derived from the randomForest package, using the top 200 genes from the molecular signature (based on the p-values from the logistic regression)". How this 200 was determined, whether this top-N influence the AUCs? I would like to know the results of a series of Ns to have a better understanding of this step. And prefer a more rigorous or data-driven way to calculate or determine the numbers of the signature genes for different phenotypes.
6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reviewer #3: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
14 Aug 2020
Response to Reviewers
RE: PONE-D-20-03484R1
"A novel computational approach for predicting complex phenotypes by deriving their gene expression signatures from public data" by Ivanov et al.
We would like to express our gratitude to the reviewers for their time and constructive comments on our manuscript, especially during these difficult times.
We have amended the main text of the manuscript and the supplementary materials, taking into account and addressing all of the reviewers' comments. We believe that this has improved the manuscript.
We believe that the detailed response and major revision of the manuscript will be satisfactory to the reviewers and the Editor and that the manuscript will be accepted for publication in PLOS ONE.
Reviewer #1:
The manuscript is written well and is understandable. However I think it should be made clearer in the abstract and within the text that both Array Express and Expression Atlas were used and an explanation of why the Expression Atlas was used.
Response:
We are pleased to see that the Reviewer thought that the manuscript was well written and understandable. We have amended the abstract to state that the publicly available gene-expression data and the new experiments' data were derived from EBI's Array Express and Expression Atlas, respectively. We have also provided an explanation of why the Expression Atlas was used. We utilised EBI's Expression Atlas because of the availability of normalised gene-expression values and contrasts for individual experiments. Without this, we would have needed to normalise all the raw microarray CEL data files within EBI's Array Express. This would have required a substantial amount of time and resources.
The text (section "Predicting freely available experiments for the presence of both phenotypes") has been amended to include an explanation of why we used the normalised data within Expression Atlas and not the raw data in Array Express. We think that adding the explanation above makes the applied methodology clearer.
Reviewer #2:
Since the data was restricted to starvation and sterility, would suggest that the title reflect so, as phenotype is a very big bracket. And it is likely that the spectrum would vary between phenotypes e.g. colour, height etc vs binary cases like sterility. Also the data encompasses only Drosophila, so suggestion to be safer to be narrow in coverage in title and rest of the writing. By all means discuss on applicability across multiple organisms and phenotype in discussion bearing in mind limitations. It would be good to discuss applicability of towards different ranges of phenotypes. There is also a claim of capturing environmental influences, which is not really reflected in the results. Would suggest limiting claims and overall applicability, otherwise a lot of work should be done to substantiate those claims - multiple organisms, multiple range of phenotypes, etc.
Response:
We thank the reviewer for the constructive criticism and helpful suggestions. The suggestion is very helpful and we have amended the title along with the manuscript body to reflect the use case of two phenotypes and that the approach was only tested in Drosophila.
The new title is: "A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data".
The Abstract and Introduction were amended throughout to be more specific that we have generated gene-expression signatures for two phenotypes in Drosophila. We completely agree that it is safer to narrow the applicability. We have also amended the Discussion to be more precise in the range of claims.
To this effect, the Discussion was amended to always specify that the results and claims refer to the two specific Drosophila phenotypes. We also included a new paragraph in the Discussion section regarding different ranges of phenotypes and, more specifically, binary vs. continuous measures of phenotypes.
Throughout the text we used the term environmental perturbation in a very limited scope. We do not claim that the derived gene-expression results for the two phenotypes capture environmental influences. Rather, we were interested in whether genetic (e.g. gene knock-out) or environmental perturbations can induce or result in the presence of the two tested phenotypes. The term environmental perturbation was meant to denote, for example, a chemical compound or another type of intervention that has induced the phenotypes under investigation. For example, an ExpressionAtlas experiment (E-MTAB-3546; 3-week reproductive diapause under cold conditions (11 °C)) was predicted to exhibit the sterile phenotype with a mean mutant probability of 91% across the individual mutants within that experiment. The cold conditions are an environmental perturbation that induces the reproductive diapause. In this specific instance we are able to confidently predict the phenotypic manifestation.
Reviewer #3:
The paper by Dobril K. Ivanov et al. demonstrated a combination of linear mixed effect models and principal components analyses approach for integrating gene expression data for specific phenotypes from independent labs widely available in public repositories. As a proof-of-concept study to show the promising in-silico approach for inferring phenotypic characteristics behind the published gene-expression data resulting from genetic or environmental perturbations, the data they selected is excellent for testing their methods and hypothesis. The results they presented are partial but convincing. I would like to see this approach be further explored, tested, and improved by more computational researchers.
Along this line, since there are still some ambiguous or arbitrary steps or parameters in the study, I recommend that this paper only be accepted after the following major comments/issues are resolved:
1. The two represented phenotypes (starvation sensitive and sterile) were nicely investigated and presented. But in the current workflow, there are few steps that involve manual curation, like the selection of expression data and the representative GO terms. To show the application's universality, please add at least one additional test phenotype of interest that the related experiments and significant enrichment of GO terms were selected by predefined rules or algorithms.
2. The current flow diagram in Fig.1 is quite confusing, at least to me. Please consider separating the signature finding and phenotype prediction into two parts.
3. Meanwhile, since the paper already discussed the complexities of the relation between the gene-expression changes and the phenotypes and described the limitation of the current approach. If possible, I would like to see an evaluation part by the end of the signature-finding section to evaluate whether the selected phenotype could be significantly presented or identified by the expression profiles of the currently available microarray experiments.
4. In the cross-validation section, the paper shows "the AUC was calculated using the class (control/mutant) probabilities derived from the randomForest package, using the top 200 genes from the molecular signature (based on the p-values from the logistic regression)". How this 200 was determined, whether this top-N influence the AUCs? I would like to know the results of a series of Ns to have a better understanding of this step. And prefer a more rigorous or data-driven way to calculate or determine the numbers of the signature genes for different phenotypes.
Response:
We thank the reviewer for the helpful and thorough comments and suggestions.
We are pleased to see that the Reviewer would like to see this approach further explored, tested, and improved by more computational researchers in the future, and we hope that this will happen once the study is in the public domain.
We have addressed all of the comments below:
1. The reviewer suggested adding one additional phenotype to test the general applicability of the methods. While we completely agree that adding more phenotypes would improve the overall methods and design, this would take a considerable amount of time and effort. The overall goal of the presented work was to see whether it was possible to implement a set of methods (a combination of linear mixed effect models, GO terms and principal components) that could be used to predict complex phenotypes using gene-expression data. We also agree that there is scope for automating the generation of the molecular signatures, although this would require a large amount of time and effort and was beyond the scope of our original hypothesis, i.e. whether it is possible to predict complex phenotypes using gene-expression data in Drosophila.
This study is a proof-of-concept and we make this clear throughout the manuscript, which is why we feel that adding an additional phenotype is beyond the scope of this manuscript.
2. We agree that Figure 1 was confusing and we have separated the signature finding and phenotype prediction into two parts to make it clearer.
3. To make the evaluation part clearer, we have amended the Discussion section to include a more detailed discussion of whether a particular phenotype can be represented or identified/predicted. We further described that there could be phenotypes that are not well predicted by changes in gene expression, and that these could be better represented or predicted using other types of data, for example methylation. In addition, the type of leave-one-out cross-validation that we perform tests precisely the question raised by the reviewer.
That is, we leave out one whole experiment (for example, all the controls/mutants that are part of the crol experiment), create the molecular signature with the rest of the controls/mutants and test whether we can predict the controls/mutants that were left out. This ensures that the overall AUC reflects the ability of the selected microarray experiments to predict the phenotype of interest. Of course, in order to gain a better understanding of what types of phenotypes, and how many, can be predicted with gene-expression data, we need to investigate a large number of such phenotype/gene-expression combinations. All of the above is described in detail in the Discussion section, as suggested by the Reviewer.
4. We agree with the reviewer that the selection of the 200 genes to predict and calculate AUC is relatively arbitrary, and that is why we performed a series of additional experiments, which are included in the revised manuscript. We selected a range of top genes and performed the leave-one-out cross-validation for each of these for all principal components. This ranged from 50 to 3,000 genes (15 different numbers of top genes). Overall, these were 120 leave-one-out cross-validations and AUCs for the starvation-sensitive and sterile phenotypes respectively. These new results are summarised in detail in supplementary figures 6 and 7 (Figs S6 and S7). Furthermore, we have also amended the Materials and Methods and Results sections to describe these analyses (sections "Leave-one-out cross-validation" and "Determining the number of PCs for unwanted variation" in the Materials and methods and Results respectively). For the starvation-sensitive phenotype there was little difference when choosing the number of top genes, although there is a trend for higher numbers of top genes to deliver a higher AUC. For the sterile phenotype the opposite trend was noted: fewer genes resulted in a better AUC.
This could potentially be caused by the size of the transcriptional network responsible for the phenotype; for example, it has been previously reported that starvation stress resistance involves a transcriptional response of ~25% of the genome in Drosophila. Nevertheless, this is speculation, and a large number of phenotypes need to be examined to assess this and gain further understanding. All of the above is described in detail in the Results section and the data are summarised in the Supplementary materials.
5 Oct 2020
A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data
PONE-D-20-03484R1
Dear Dr. Ivanov,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance.
Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Xia Li, Ph.D.
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
My final decision is also "Accept"
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: All comments have been addressed
Reviewer #3: All comments have been addressed
Reviewer #4: All comments have been addressed
2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: N/A
Reviewer #3: Yes
Reviewer #4: Yes
4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes
6. Review Comments to the Author
Reviewer #2: Satisfied that comments are addressed in the various sections, abstract, intro and discussion. Can better clarify on the environmental influence part.
Reviewer #3: The authors have clarified most of the questions I raised in my previous review. I am glad to see the updated manuscript with a more focused title to reflect the study and clearer descriptions of the pipeline and the evaluation part.
Reviewer #4: I have a suggestion if the authors want to consider. The bracket in the title can be avoided with something like this: A novel computational approach for predicting complex phenotypes in starvation-sensitive and sterile Drosophila by deriving their gene expression signatures from public data
7. PLOS authors have the option to publish the peer review history of their article.
If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #2: No
Reviewer #3: No
Reviewer #4: No

12 Oct 2020

PONE-D-20-03484R1
A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data

Dear Dr. Ivanov:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Prof. Xia Li
Academic Editor
PLOS ONE