MOTIVATION: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible when images across multiple time points are jointly accounted for, such that spatial and temporal dependencies are captured simultaneously. METHODS: We describe a discriminative undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a non-parametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy incomplete samples, i.e. it can tolerate data missing from individual time points. RESULTS: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared with previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages.
The use of high-throughput image acquisition, such as in phenotypic screens, has been increasing quickly and thus provides a new source of data for computational biologists. Microscopy of colored or fluorescent probes, followed by imaging, is able to deliver spatial and temporal quantitative phenotype information, such as gene expression, at high resolution (Busch ; Ljosa ; Walter ). In addition, expression patterns can be documented and distributed over the internet as a valuable resource to the research community. Recent advances in throughput, or long-term investment in specific projects, have by now generated large collections of images. Such image databases are traditionally analyzed through direct inspection by human curators; an example is the Berkeley Drosophila Genome Project (BDGP) gene expression pattern database (Tomancak , 2007). In this dataset, images are assigned to stage ranges within the 17 embryonic stages defined by developmental features, and annotated collectively in small groups using a controlled vocabulary (CV), i.e. annotation terms. This allows researchers to search image databases and compare spatial and temporal embryonic development.

Given the very diverse nature of imaging technology, samples and biological questions, computational approaches have often been tailored to a specific dataset. For example, the image-based profiling of gene expression patterns via in situ hybridization (ISH) requires the development of accurate and automatic image analysis systems for using such data, to understand regulatory networks and the development of multicellular organisms. Images are affected by multiple sources of noise due to experiments or microscopy (incomplete or multiple embryos, variance of probes across genes, illumination artifacts), making the extraction and registration of embryos non-trivial (Fowlkes , 2008; Harmon ; Keranen ; Kumar ; Mace ; Puniyani ).
Peng and Myers (2004) and Peng introduced an automatic image annotation framework using various high-dimensional feature representations and classification frameworks: Principal Component Analysis (PCA), wavelets, Gaussian mixture models, Support Vector Machines (SVM) and Quadratic Discriminant Analysis. Each image may show the embryo under different views: lateral, dorsal or ventral; this is a challenge for gene annotation, as embryonic structures may be visible in only certain views. Yet, recent studies have shown that incorporating images from multiple views can consistently improve annotation accuracy (Ji ; Pruteanu-Malinici ).

It is desirable to represent images in a way that takes advantage of image features and offers robustness to image distortions. In contrast to such large feature sets, prone to high redundancy and high computational costs, Frise identified a set of basic expression patterns in Drosophila: 39 well-defined clusters describing specific regions of embryo expression were determined from 2693 lateral views of early development. As with the majority of the approaches described, this study involved a high level of human intervention in selecting ‘good’ images for training/testing purposes, a potential drawback considering the rapid increase in the size of ISH image collections. In contrast, Pruteanu-Malinici proposed a new approach for automatic annotation of spatial expression patterns using a ‘vocabulary’ of basic patterns that involved little to no human intervention. This work provided a flexible unsupervised framework for competitively predicting gene annotation terms, while using only a small set of features.

A particular aspect that has been largely neglected by computational approaches so far is that data acquired from such experiments often span multiple time points or conditions.
Phenotypes are typically annotated stage-by-stage, without jointly learning the salient temporal dependencies across multiple time points, which should allow for an overall higher accuracy; e.g. the annotation terms predicted for earlier stages should inform the prediction at later stages. Furthermore, many genes are annotated with more than one term from the vocabulary, creating an additional dependency structure between annotations within the same stage range.

In this article, we address the automatic annotation of Drosophila embryo gene expression sequences, building on state-of-the-art models from computer vision and machine learning. There are several challenges that need to be addressed when approaching this problem through computational methods. As we mentioned previously, the image acquisition process results in embryonic structures with multiple perspectives, shapes and locations. Moreover, the shape/position of the same embryonic structure may vary from image to image: ‘variation in morphology and incomplete knowledge of the shape and position of various embryonic structures’ have made the gene annotation task more prohibitive (Ji ).

We first show that a non-parametric Bayesian factor analysis (BFA) approach, the infinite factor model, allows for an efficient and sparse feature representation of the Drosophila gene expression dataset. Then, we propose a conditional random field (CRF) to tackle the time-evolving annotation task. Experiments show that the exploitation of dependencies across adjacent developmental stages leads to annotation accuracy superior to existing Drosophila gene expression annotation approaches. The proposed framework also tackles the missing data scenario: for many genes, one or more stage ranges are absent from the image collection; in such cases, human annotators would take into account the entire set of expression data from adjacent stages to successfully annotate the available images.
The challenge of automating this process is novel and represents a step closer toward a fully automatic gene annotation pipeline. These predictions can later be analyzed by biologists to assess the correctness of the image acquisition and the level of interest for that particular gene. Finally, for a given gene, the described framework predicts the entire set of annotation terms simultaneously, taking full advantage of the term dependencies that exist at the stage-range level.

The rest of this article is organized as follows: in Section 2, we focus on data description and introduce the sparse BFA-CRF framework. Experimental results are reported in Section 3, followed by conclusions and future work in Section 4.
2 MATERIALS AND METHODS
2.1 Data description
One of the most popular large image expression datasets is the BDGP collection of embryonic expression patterns. The project started with a first release of images for 2000 genes; the second release followed in 2007 with 6000 genes. Release 3 came in 2010, bringing the total to 7500 genes, including 97% of the sequence-specific transcription factor genes. As of today, the collection consists of over 105 000 images, which document patterns of embryonic gene expression for over 7400 of the 13 659 protein-coding genes identified in the Drosophila melanogaster genome. Expression is visualized by RNA ISH, which provides an effective way of locating specific mRNA sequences by hybridizing complementary mRNA-binding oligonucleotides and a suitable dye (Tautz ).

The mRNA expression apparent in the captured in situ images was verified by independently derived microarray time-course analysis using Affymetrix GeneChip technology (more details can be found at http://insitu.fruitfly.org and in Tomancak ). Gene expression patterns were documented by taking low (2×) and high (20×) magnification images at multiple developmental stages. The low-magnification digital images were taken to capture groups of embryos, to provide a permanent record of the hybridization in each well. Each slide was then further examined under higher magnification using a Zeiss Axiophot optical microscope. Images were assigned to developmental stage ranges following the sequence of events taking place at specific times after fertilization, using the 17 stages defined in Campos-Ortega (1985). In this analysis, we focused on the first 15 h of Drosophila development, spanning embryonic stages 4–6, 7–8, 9–10, 11–12 and 13–16. Developmental stages 1–3 were skipped owing to predominantly ubiquitous expression patterns not of interest to our analysis.

Any gene is represented, on average, by approximately 12 images; however, the number of images per gene varies from 1 to 80.
This variability reflects the BDGP strategy of documenting highly dynamic, complex and novel patterns, while lowering the number of images documenting common expression patterns. Among those, there are images with non-informative patterns due to poor-quality staining/washing or non-tissue-specific expression (maternal or ubiquitous). Images within the same stage-range window can show different patterns due to embryo orientation or the relatively long developmental time spanned by a stage range. Images are annotated with ontology terms from a CV describing developmental expression patterns (Fig. 1). This vocabulary has been developed and refined by FlyBase (The FlyBase Consortium, 2002) over the past few years, allowing human curators to compare their findings with expression data assembled from the literature, to expand annotations to greater detail and to search datasets thoroughly based on the Gene Ontology schema. The annotations used throughout this project consisted of a subset of about 300 of the 5800 annotation terms in the FlyBase CV, many of which only apply to later stages of development.
Fig. 1.
Examples of Drosophila embryo ISH images and associated annotation terms (BDGP database) for gene Actn (FBgn0000667), across five developmental stage ranges. The dark blue stained regions highlight areas where genes are expressed; darker colors correspond to higher gene expression levels
As mentioned previously, we use all available images in our approach, i.e. including those taken with any embryo orientation: lateral, dorsal and ventral. Before extracting features, we segmented and registered images using a previously described probabilistic segmentation approach based on statistical shape models (Mace ). This provides us with 240 × 120 pixel images, mostly containing a single embryo in a standard dorsal (up) / anterior (down) orientation and no background. In Figure 1, we show a particular gene expression pattern across five developmental stage ranges of interest. The complexity and variability of the image data led to competitive but not perfect results, in terms of precise embryo extraction as well as embryo orientation, which increased the challenge of automatic gene annotation.

We here use the average intensities in a down-sampled fixed grid of 80 × 40 pixels as input features for the entire collection of images within the BDGP dataset.
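As a concrete illustration of the grid features above, each registered 240 × 120 image can be block-averaged down to an 80 × 40 grid of mean intensities. The sketch below (NumPy) shows one way to do this; the random array stands in for a real registered embryo image, and the function name is ours, not part of the described pipeline.

```python
import numpy as np

def block_average(img, out_h, out_w):
    """Down-sample by averaging non-overlapping blocks.

    Assumes the input dimensions are integer multiples of the output
    dimensions (240 x 120 -> 80 x 40 uses 3 x 3 blocks).
    """
    h, w = img.shape
    fh, fw = h // out_h, w // out_w
    return img.reshape(out_h, fh, out_w, fw).mean(axis=(1, 3))

# a registered embryo image: 120 rows x 240 columns in this sketch
img = np.random.rand(120, 240)
feats = block_average(img, 40, 80)   # 40 x 80 grid of average intensities
x = feats.ravel()                    # 3200-dimensional feature vector
```

Each grid cell simply holds the mean of the pixels it covers, so the global mean intensity of the image is preserved by construction.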
2.2 Feature extraction—sparse BFA
Sparse Bayesian factor analysis (sBFA) is a statistical method for modeling many dependent random variables through linear combinations of a few hidden variables (Gorsuch, 1983). This model is designed to address the high-dimensional setting where hundreds or thousands of genes are simultaneously examined. The sparsity assumption is the key feature in our model that allows us to scale stable and accurate inference to a large number of images/genes represented by many image input features.

For high-dimensional models, sparsity helps prevent sampling errors from swamping out the true signal in the data, leading to stable parameter estimates. In our framework, sparsity implies that each image/gene is affected only by a few underlying estimated factors, and as a result, many of the mixing weights in the model will be (near) zero.

The sparse Bayesian factor model was derived using the following matrix representation:

X = AS + E,
where X = [x1, x2, …, xn] is a p × n dimensional data matrix, with n the number of features, quantifying the associated gene-expression values for p images (genes) under consideration. Each row of X is called a gene pattern, with dimension 1 × n. Here, we assume that each gene pattern has already been normalized to zero mean. A is the factor loading matrix, with dimension p × k, which contains the linear weights. S is the factor matrix, with dimension k × n, with each element modeled by a standard normal distribution. Each column of S is the factor score for feature i (i = 1, 2, …, n) and each row is called a factor. E is the additive Gaussian noise, with dimension p × n. Both A and S are inferred by the model simultaneously.

From the model we can see that each row of X is modeled by a linear combination of the factors (rows of S), indicating that the variability of the original p feature patterns can be explained by only k latent factors. The model can also be written in vector form as follows:

xj = ajS + ej,
where xj and aj denote the j-th rows of X and A, respectively, and the basis matrix S is shared across all samples. Indeed, factor analysis is an unsupervised dimensionality reduction method used widely in data analysis and signal processing (Prince ).

To impose the sparsity required by the underlying biological assumption, whereby spatial gene expression patterns are modeled by only a few domains (factors), we used the Student-t distribution, which consists of a Gaussian distribution with a Gamma prior on the precision parameter. The sparseness is directly controlled by the precision parameter α; the objective of imposing sparseness is to automatically shrink most elements in A toward zero. The updating equations, along with a full description of the sparse factor model used in this manuscript, can be found in Pruteanu-Malinici .

To extend our previous work, which largely focused on two developmental stage ranges only, the number of factors (k) for every developmental stage range needed to be determined in an ideally unbiased fashion. For this, we used an adaptive Gibbs sampler, which automatically truncated the loading and factor matrices through an adaptive selection of the number of important factors. This sparse Bayesian infinite factor model, first introduced by Bhattacharya and Dunson (2011), obviates the need for pre-specifying the number of factors; the effective number of factors (here denoted by k*) is chosen such that the contribution from adding further factors is negligible. This approach has been shown to produce accurate estimates of the true effective number of factors k*; the adaptation of the Gibbs sampler occurs every 10 iterations at the beginning of the Markov chain but decreases in frequency exponentially fast, so as to satisfy the diminishing adaptation condition in Theorem 5 of Roberts and Rosenthal (2007). More specifically, the decreasing frequency is modeled as an exponential
p(t) = exp(α0 + α1t) at the t-th Gibbs iteration, with α0 and α1 chosen so that adaptation occurs every 10 iterations initially but then decreases in frequency exponentially fast. The loadings matrix is adaptively modified by monitoring the columns with all elements within some pre-specified small neighborhood of zero. At some iterations, columns may be discarded, or a new column may simply be added to the matrix; the remaining parameters of the model are modified accordingly. The parameters of any added factors are drawn from their prior distribution to fill in the necessary values.
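To make the factor model and the adaptive truncation concrete, the following self-contained sketch simulates data from X = AS + E with sparse loadings and applies the column-dropping rule under an exponentially decaying adaptation probability. All constants here (α0, α1, the zero-neighborhood ε, the matrix sizes and sparsity level) are illustrative assumptions, not the values used in our experiments, and the sketch is not a substitute for the full Gibbs sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 100, 3200, 10            # images, grid features, candidate factors

# Simulate X = A S + E with sparse loadings: each image mixes only a few factors.
A = rng.normal(size=(p, k))
A[rng.random(A.shape) < 0.8] = 0.0   # ~80% of mixing weights exactly zero
A[:, 7:] *= 1e-9                     # three candidate factors are effectively unused
S = rng.normal(size=(k, n))          # rows of S are spatial basis patterns
X = A @ S + 0.1 * rng.normal(size=(p, n))

def truncate(A, eps=1e-4):
    """Drop loading columns whose entries all lie within eps of zero."""
    keep = ~np.all(np.abs(A) < eps, axis=0)
    return A[:, keep]

# Adaptation fires at iteration t with probability p(t) = exp(a0 + a1 * t),
# so truncation is frequent early in the chain and exponentially rarer later.
a0, a1 = -1.0, -5e-4
for t in range(200):
    if rng.random() < np.exp(a0 + a1 * t):
        A = truncate(A)
```

In the actual sampler the loadings are re-estimated between adaptation steps, and new columns can also be added with parameters drawn from the prior; here the loop only illustrates the probabilistic gating of the truncation move.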
2.3 Conditional random fields
Probabilistic graphical models such as Bayesian networks and random fields are increasingly popular choices for statistical modeling of complex biological relationships. Although Bayesian networks provide a viable solution for directed acyclic relationships where the direction of causality can be easily identified, undirected graphical models offer a clear advantage for highly connected relational structures that are not simple chains or trees. Among undirected models, CRFs (Lafferty ) have proven to be among the most powerful predictors owing to their inherently discriminative (rather than generative) nature.

In a CRF, the observable variables (X = {Xi}) and unobservable variables (Y = {Yi}) are treated separately, with the unobservables globally conditioned on the observables. The relationships among the unobservables are modeled via an undirected graph G = (Y, E), in which the Yi are the nodes (vertices), and the edges E ⊆ Y × Y are pairs (Yi, Yj); the edges serve to denote direct non-independence relations between pairs of nodes. In particular, a node Yi is taken to be conditionally independent of all other nodes Yj, given the immediate neighbors N(Yi) of Yi in the graph:

P(Yi | X, {Yj : j ≠ i}) = P(Yi | X, N(Yi)),
for N(Yi) = {Yj, j ≠ i | (Yi, Yj) ∈ E}.

The well-known Hammersley–Clifford theorem (Hammersley and Clifford, 1971) provides a means of computing conditional densities via decomposition of the graph into cliques. In particular,

P(Y | X) = (1/Z(X)) exp( ΣI ΦI(YI, X) ),
where YI denotes the nodes in clique I, and Z(X) is a normalizing constant; it is assumed that P(Y) > 0 for all possible joint assignments to Y. ΦI is called the potential function for clique I; in practice, these are often pooled among like-sized cliques. Because cliques larger than some reasonable size N are typically ignored, modeling is accomplished by choosing a suitable set {Φ1, Φ2, …, ΦN} of potential functions for different clique sizes; the λs and any additional parameters of the Φs can be trained discriminatively via cross validation.

Exact inference with a CRF is tractable if the graph can be converted into a chain or a tree. To this end, a junction tree can be obtained by collapsing tight clusters of nodes into meta-nodes and extracting a maximal spanning tree from the resulting structure (Jensen ; Lauritzen and Spiegelhalter, 1988). The sum–product algorithm (Pearl, 1988) can then be applied to propagate local densities across the tree, permitting exact computation of posterior probabilities for joint or individual value assignments to nodes in the graph, or identification of the maximum a posteriori assignment; for linear-chain CRFs, these are analogous to the well-known Forward–Backward and Viterbi algorithms for hidden Markov models (Rabiner, 1989).

To infer the presence or absence of specific annotation terms for individual embryo images, we constructed a CRF structured as shown in Figure 2. Each node Yi denotes the status of an annotation term: Yi = 1 means present (the annotation term applies to the image), Yi = 0 means absent (the annotation term does not apply). Columns correspond to developmental stages. All of the nodes in a column are directly connected via an edge to all nodes in adjacent columns (blue lines). Within a column, the nodes are connected in a linear chain (i.e.
each node has exactly one or two neighbors within the column), with the order of the chain chosen so as to maximize the total mutual information between all adjacent pairs in the chain; this maximization was carried out via a standard traveling-salesman heuristic in Matlab. Each column was constrained to include only the most popular annotation terms in the training partition (for more details, see Results). The sparse image factors (previous section) were included as observables X; the Xs were specific to each column and numbered from k = 57 to k = 160, depending on developmental stage, with later stages having more factors.
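The ordering step can be sketched as follows: estimate the pairwise mutual information between annotation terms from their binary training labels, then chain the terms so that adjacent terms are highly informative about each other. The greedy nearest-neighbour pass below is a simple stand-in for the traveling-salesman heuristic we used in Matlab, on synthetic labels where one term nearly copies another; sizes and data are illustrative.

```python
import numpy as np

def pairwise_mi(Y):
    """Empirical mutual information between columns of a binary label matrix."""
    n, m = Y.shape
    MI = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((Y[:, i] == a) & (Y[:, j] == b))
                    p_a = np.mean(Y[:, i] == a)
                    p_b = np.mean(Y[:, j] == b)
                    if p_ab > 0:
                        MI[i, j] += p_ab * np.log(p_ab / (p_a * p_b))
    return MI

def greedy_chain(MI):
    """Greedy nearest-neighbour chain: a crude stand-in for a TSP heuristic."""
    m = MI.shape[0]
    order, left = [0], set(range(1, m))
    while left:
        nxt = max(left, key=lambda j: MI[order[-1], j])
        order.append(nxt)
        left.remove(nxt)
    return order

rng = np.random.default_rng(3)
Y = rng.integers(0, 2, size=(500, 5))
Y[:, 3] = Y[:, 0] ^ (rng.random(500) < 0.05)   # term 3 nearly copies term 0
order = greedy_chain(pairwise_mi(Y))           # places term 3 next to term 0
```

A full TSP heuristic optimizes the whole tour rather than committing greedily, but the objective (total adjacent-pair mutual information) is the same.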
Fig. 2.
CRF framework used for the automated annotation of time-course Drosophila embryo ISH images. Nodes correspond to annotation terms, and edges denote relationships. The order of the annotation terms within a given stage range was determined using a standard traveling–salesman heuristic in Matlab
We defined potential functions for cliques of size up to 2: Φ1(Yi, X) = λ1 log P(X|Yi), and Φ2(Yi, Yj, X) = λ2 log P(Yi, Yj), where P(Yi, Yj) is a multinomial and P(X|Yi) is a multivariate Gaussian with diagonal covariance, both trained by simple counts (maximum likelihood) from the training partition during cross validation (see Results). Coefficients λ1 and λ2 were estimated by maximizing the training-partition classification accuracy via simple hill climbing. We refer to Φ1 as the node potential, as it is associated with single-node cliques, and to Φ2 as the edge potential, as it is associated with two-node cliques (individual edges in the graph).
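Under the diagonal-Gaussian and multinomial assumptions above, the maximum-likelihood fits reduce to per-class means and variances and normalized co-occurrence counts. The sketch below illustrates this for a single term pair on synthetic data; λ1 = λ2 = 1 here, whereas in practice they were tuned by hill climbing, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.integers(0, 2, size=(300, 2))        # two annotation terms, 300 training genes
X = rng.normal(size=(300, 3)) + Y[:, [0]]    # 3 image factors, shifted when term 0 is on

# Node potential: diagonal-covariance Gaussian for X given Y_i, fit by ML
mu = np.array([X[Y[:, 0] == c].mean(axis=0) for c in (0, 1)])
var = np.array([X[Y[:, 0] == c].var(axis=0) for c in (0, 1)])

def log_gauss(x, c):
    return -0.5 * np.sum(np.log(2 * np.pi * var[c]) + (x - mu[c]) ** 2 / var[c])

# Edge potential: multinomial over joint label assignments, fit by simple counts
joint = np.zeros((2, 2))
for a, b in Y:
    joint[a, b] += 1
joint /= joint.sum()

lam1, lam2 = 1.0, 1.0                        # would be tuned by hill climbing
phi1 = lambda x, yi: lam1 * log_gauss(x, yi)         # node potential
phi2 = lambda yi, yj: lam2 * np.log(joint[yi, yj])   # edge potential
```

On this synthetic data the node potential is, on average, larger under the true label than under the wrong one, which is what makes it useful as local evidence in the CRF.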
3 RESULTS
In this section, we describe the application of the sparse BFA-CRF framework for automatic time-course gene expression pattern annotation. Our procedure starts by extracting sparse, meaningful features (sBFA) from the entire collection of Drosophila embryos, suitable for downstream temporal analysis based on a conditional random field approach. We then use the estimated factor loadings as observed variables in the CRF framework, so as to infer the most likely annotation terms for previously unseen images.
3.1 Factor inference/decomposition of expression patterns
The BDGP collection divides early embryogenesis of Drosophila into six developmental stage ranges (1–3, 4–6, 7–8, 9–10, 11–12 and 13–16), and most of the CV terms are stage-range specific. As mentioned previously, we skipped stage range 1–3 owing to a lack of informative images, as well as the low number of annotation terms associated with it. We applied the sBFA model to the entire set of images from the five stage ranges of interest. These spanned thousands of images (Table 1), with each stage being annotated by a set of 40–150 annotation terms.

To illustrate the potential of the sBFA for decomposing expression patterns into meaningful features, we show selected estimated factors from developmental stages 9–10. The model began with the set of 5929 embryo images and estimated the loadings and factor matrices while having full control of the degree of sparsity (throughout our analysis, the sparsity of the factor loading matrix A was controlled by the scale parameter of the Gamma prior distribution on the precision parameter α, h0 = 10−6). Figure 3 illustrates a selection of the estimated factors that, per ensemble, correspond to lateral, dorsal and ventral views, demonstrating the ability of the model to automatically extract distinct patterns for different embryo orientations. As mentioned in the ‘Materials and Methods’ section, the number of factors was determined in an unsupervised fashion, for every developmental stage range, using the sparse Bayesian infinite factor model. The estimated numbers of factors can be found in Table 2; in addition, we compared these findings with an empirically determined estimation akin to Pruteanu-Malinici . As illustrated in Table 2, the range of factors is similar for both scenarios: fully unsupervised (infinite factor models) or estimated from underlying biological assumptions (generate and test). A convergence check on the estimated number of factors for two randomly chosen stage ranges is illustrated in Figure 4.
Similar to the sBFA analysis, the Bayesian infinite factor model was run for 6000 Gibbs iterations, discarding the first 1000 and estimating the model parameters on the remaining 5000 iterations.
Table 1.
Statistics of the images from the BDGP database before and after the filtering process
BDGP image statistics                          Stage range
                                               4–6     7–8     9–10    11–12    13–16
Original number of images                      9484    5744    5929    13 737   19 784
Number of images after filtering process       8722    5227    5523    13 245   19 269
Number of images shared across 1807 genes      6610    4615    4468    9315     11 111
Number of annotation terms                     14      8       9       9        8
Note: The annotation terms represent a fraction of the total CV; for any given stage range, they cover ∼85% of the total number of genes of interest, being frequent enough to show statistical significance.
Fig. 3.
Selected factors, estimated from a total of k = 80 factors and a grid size of 80 × 40 (developmental stage range 9–10). Different background colors are an artifact and not part of the model
Table 2.
Comparison of the number of estimated factors in the BDGP set
Method                    Stage range
                          4–6       7–8        9–10       11–12      13–16
Generate and test         k = 60    k = 100    k = 100    k = 150    k = 150
Infinite factor models    k = 57    k = 60     k = 80     k = 140    k = 160
Note: First row corresponds to number selection based on biological prior knowledge followed by generate-and-test procedures. Second row shows the estimated number of factors, fully unsupervised (the infinite BFA).
Fig. 4.
Convergence of the estimated number of factors for two developmental stage ranges (7–8 and 11–12): 5000 Gibbs iterations
The feature extraction/selection process was followed by filtering out non-informative (such as ubiquitous) gene expression patterns. Using the Euclidean distances between the estimated sparse factor analysis weights and a null reference vector, we separated informative images from non-informative ones (for a full description, see Pruteanu-Malinici ). We successfully removed 2.6–9% of the images as non-informative (Table 1).
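The filtering rule amounts to thresholding each image's Euclidean distance between its sparse factor weight vector and the zero vector: ubiquitous or empty patterns load on essentially no factors and sit near the origin. The sketch below uses synthetic weights and an illustrative threshold in place of the actual learned quantities.

```python
import numpy as np

rng = np.random.default_rng(5)
# rows: per-image sparse factor weights; informative images load on some factors,
# non-informative (e.g. ubiquitous) ones have near-zero weights everywhere
W = np.vstack([rng.normal(0.0, 1.0, size=(95, 20)),     # informative
               rng.normal(0.0, 0.01, size=(5, 20))])    # non-informative

dist = np.linalg.norm(W, axis=1)     # Euclidean distance to the null vector
thresh = 0.5                         # illustrative threshold
informative = dist > thresh          # keep only images far from the origin
```

In practice the threshold would be chosen from the empirical distribution of distances rather than fixed a priori.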
3.2 Large-scale annotation of time-course expression patterns
In evaluating the performance of the sBFA-CRF framework, we used the estimated sparse loadings/features only on the set of genes common to all five stage ranges of interest and a repertoire of annotation terms from a CV. The most popular annotation terms were independently selected for each stage range, to cover ∼85% of the entire set of genes. This resulted in a set of 1807 genes and 48 annotation terms, distributed as follows: 14 terms for stage range 4–6, 8 for stage range 7–8, 9 for stage range 9–10, 9 for stage range 11–12 and 8 for the last stage range, 13–16 (Table 1).

To assess the relative utility of various parts of our model, we determined the prediction accuracy of the full model compared with versions of the model handicapped in various ways. In particular, we considered including (in separate experiments) the following sets of edges in the CRF:

(A) relationships across adjacent stage ranges and within stage ranges (between annotation terms): blue and green lines in Figure 2 (full model);
(B) relationships across adjacent stage ranges only: blue lines in Figure 2;
(C) relationships within stage ranges only: green lines in Figure 2 (baseline).

For example, when images are annotated for individual stage ranges in isolation, the relationships indicated by the edges between adjacent stages are ignored. Two examples of Drosophila expression pattern images across time are shown in Figure 5: we are interested in modeling the edge potentials between developmental stage ranges, as well as those between annotation terms within each stage range.
Fig. 5.
Drosophila embryonic gene expression across six stage ranges. Images can display different embryo orientations due to the relatively long developmental time spanned by a stage range. Using the edge potentials between adjacent stage ranges, as well as within stage ranges, translates into a significant increase in annotation accuracy
Annotation accuracy was computed as a global measure, across all annotation terms and stage ranges, by comparing the ground truth (human-curated labels) with the sBFA-CRF predictions. As previous methods had largely focused on the annotation of individual stage ranges, one term at a time, and were largely trained on heavily curated benchmark data rather than the whole BDGP dataset, a fair comparison with these approaches was not feasible. To put our approach into context, we therefore compared the results generated by the sBFA-CRF framework with our own recent binary Support Vector Machine (SVM)-based classification system described in Pruteanu-Malinici . We had previously shown that this approach provides competitive and often superior classification results compared with the best competing approaches, even when working with the full BDGP image data instead of ‘cleaned’ benchmarks.

Our previous method used independent SVM classifiers for each annotation term and stage range, disregarding relationships within and between adjacent stage ranges. This resulted in lower annotation accuracy, as shown in Table 3. The SVM results are comparable with the new CRF baseline scenario, which only considers edge potentials between annotation terms within the same stage range (both models simply ignore any temporal/transition information that might improve the overall accuracy). The advantage of using the edge potentials between adjacent stage ranges translates into an absolute increase of 3–4% in accuracy (i.e. a relative reduction of the error rate of >20%).
All models were applied to the same set of 1807 images, using 10-fold cross validation; mean values across five runs are shown.
Table 3.
Summary of annotation accuracy

Annotation accuracy analysis    CRF scenario (A)    CRF scenario (B)    CRF scenario (C)    SVM
Mean annotation accuracy        86.75               85.69               82.93               83.32
Note: sBFA-CRF and SVM models: mean annotation accuracy, over 5 N-fold cross validation runs (N = 10). Scenario (A) includes relationships between adjacent stage ranges and within stage ranges; scenario (B) considers only relationships between adjacent stage ranges; scenario (C) models only relationships within stage ranges (baseline). For the SVM model, we used independent classifiers for each annotation term and stage range.
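The global accuracy measure used above pools all binary (gene, stage range, annotation term) decisions into a single score. A minimal sketch of that computation is given below; the function name and the toy array shapes are our own illustration, not part of the published pipeline.

```python
import numpy as np

def global_annotation_accuracy(y_true, y_pred):
    """Fraction of (gene, stage-range, term) binary labels predicted
    correctly, pooled across all annotation terms and stage ranges."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Toy example: 4 genes x 6 stage ranges x 3 annotation terms.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=(4, 6, 3))
pred = truth.copy()
pred[0, 0, 0] ^= 1          # flip a single label
acc = global_annotation_accuracy(truth, pred)  # 71/72 labels correct
```

In the reported experiments this score is additionally averaged over the folds of the cross-validation and over the five runs.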
3.3 Missing-data annotation analysis
In addition to improved gene annotation accuracy, the sBFA-CRF framework provides an elegant means of annotating images in missing-data scenarios. During CRF decoding, the most likely configuration of the model (i.e. values of the unobservables, Y) is computed using relationships between adjacent stage ranges, as well as within each stage range. In the case of missing data, the most likely state for a given node Y with no directly related observables X is estimated entirely through relationships in the random field. This allows us to infer annotation terms for missing images, which is of utmost importance in scenarios where, for a given gene, data have been collected for only a subset of the stage ranges.
In evaluating the performance of the sBFA-CRF model in annotating missing data, we manually removed 5–25% of images from the 1807 gene set by randomly selecting genes and removing their corresponding images; for missing images, the node potentials were set to 1. Results are shown in Figure 6, where the model included edge potentials between adjacent stage ranges as well as within stage ranges (CRF scenario A); we were able to annotate with 80% or better accuracy even for scenarios with 25% missing data. Remarkably, our model outperforms the SVM classification framework (which had access to full data) even when 15% of the data are withheld from the sBFA-CRF. Figure 7 illustrates particular cases of genes that are correctly annotated for stages where their images were missing. As previously mentioned, this is of particular interest to biologists who require predictions for stages where images have not yet been collected. Our results confirm that the proposed time-course pipeline leads to highly successful expression pattern classification, despite the presence of uninformative images, registration errors and considerable amounts of missing data.
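The mechanism of setting node potentials to 1 for missing images can be illustrated on a simple chain: with uninformative (uniform) unary potentials at the missing time point, the most likely label there is determined entirely by the transition potentials to its neighbors. The sketch below uses plain Viterbi decoding on a linear chain as a stand-in for the paper's junction-tree decoding; the potentials are invented toy values.

```python
import numpy as np

def viterbi_chain(node_pot, trans_pot):
    """Most likely state sequence for a chain-structured CRF.
    node_pot: (T, K) unary potentials; trans_pot: (K, K) pairwise potentials."""
    T, K = node_pot.shape
    delta = np.log(node_pot[0]).copy()       # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(trans_pot)  # (prev state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(node_pot[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# 3 time points, 2 states; the middle observation is missing, so its
# node potentials are set to 1 (uninformative), as described in the text.
node_pot = np.array([[0.9, 0.1],
                     [1.0, 1.0],   # missing stage range
                     [0.8, 0.2]])
trans_pot = np.array([[0.9, 0.1],  # labels tend to persist across stages
                      [0.1, 0.9]])
path = viterbi_chain(node_pot, trans_pot)  # state 0 is carried through the gap
```

Because the transition potentials favor label persistence, the decoder fills the missing time point with the label supported by its observed neighbors, which is exactly the behavior exploited in the missing-data experiments.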
Fig. 6.
Annotation accuracy results for missing data scenarios. The accuracy values were computed as global measures, across the entire set of 1807 genes. For each case, we randomly selected 5–25% of the complete gene set and removed their corresponding images, so as to simulate missing data scenarios. The green bar indicates the annotation accuracy for the full dataset scenario (previous analysis); the red bar corresponds to the SVM analysis
Fig. 7.
Missing-data gene annotation analysis. Shaded boxes indicate the stage range for which data were missing (we manually removed those images). In two cases, the entire set of annotation terms is correctly annotated within the stage range despite the fact that no images were available (third and sixth rows). In these two scenarios, the CRF used the relationships across adjacent stages, to estimate the most likely configuration. A selection of the correctly and incorrectly annotated terms for the stage range with missing data are shown in the last two columns
Lastly, we compared the sBFA-CRF predicted labels with the human-curated ones (the ground truth), so as to identify genes and annotation terms for which the annotations differed but the outcome from our model appeared consistent. We recognize that, for the same annotation term, the corresponding regions in different images may vary significantly in visual appearance, which makes manual annotation difficult and can sometimes produce ambiguous outcomes. We show three examples for which the sBFA-CRF annotations are the opposite of the human-curated ones and are likely correct given the full context (Fig. 8). For all three scenarios, we confirmed our findings with the BDGP human curators.
Arguably, the model was correct in predicting different annotation terms in the following examples. For gene FBgn0003502, human curators initially decided that expression in ‘procephalic ectoderm AISN’ is not detected for stage range 4–6; however, the sBFA-CRF predicted label, as well as a second careful visual inspection, suggests the contrary. In addition, we identified another instance where ‘ventral ectoderm anlage’ should have been annotated for gene FBgn0022073 in stage range 7–8. The last scenario (gene FBgn0033988) corresponds to a case where all images are extremely difficult to annotate owing to out-of-focus staining or overall noise. On a second inspection, the model was arguably correct in labeling the annotations for both stage ranges 4–6 and 9–10.
Fig. 8.
Examples of genes/annotation terms for which the sBFA-CRF predictions differ from the human curated ones. Yes—gene is annotated; No—gene is not annotated. Note the consistency of model predictions within the context of annotations for neighboring stage ranges
4 CONCLUSIONS
We have described a novel sBFA-CRF model to automatically annotate Drosophila embryo gene-expression time-course data. The sparse BFA framework represents an efficient feature selection technique, which automatically determines the feature-space dimension using a non-parametric implementation. The learned features are then used as observed variables in the CRF framework, so as to infer the most likely annotation terms for previously unseen images. The CRF encodes temporal relationships between adjacent stage ranges throughout Drosophila development. By capturing the temporal sequence, the model is able to predict the entire collection of annotation terms in a single run and achieves superior performance when compared with highly competitive models that annotate stages in isolation. In addition to improved annotation accuracy, the experimental results demonstrate the success of the method in handling missing-data scenarios. This is extremely useful in real-life settings where estimates are needed over the ensemble of annotation terms, with only partial data having been collected.
One promising extension to our approach would be to include ‘latent’ annotation terms in the CRF structure, to account for additional rare annotation terms for which we would not attempt to obtain a prediction. These latent terms could have limited connectivity in the graph, so as to allow large numbers of latencies to be included without compromising decoding efficiency. Such an extension may well improve prediction accuracy for the primary terms, even if the latent terms are themselves difficult to accurately predict (due to the paucity of training data for those terms).
It would also increase the flexibility of the resulting model: while we currently select the primary annotation terms manually based on their popularity among genes in a given stage range, a simple threshold on the number of genes being annotated, together with an appropriate means of ranking terms, would allow us to automatically partition the primary versus latent sets. Based on our experience with the sBFA-CRF model described here, additional work along these lines seems promising.
Finally, we are continuing to develop this approach in close collaboration with biologists, so as to suggest outliers or interesting patterns during the anticipated expansion of the BDGP collection to the whole Drosophila genome. We plan to incorporate the sBFA-CRF framework into a Fiji plug-in (Schindelin et al., 2012)—a gene annotation tool that would accurately assign annotation terms to new/unseen images in a timely manner.
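The proposed automatic partition of primary versus latent terms amounts to ranking terms by how many genes carry them and applying a popularity threshold. A minimal sketch follows; the function name, the threshold value and the example counts are hypothetical illustrations, not values from this work.

```python
def partition_terms(term_gene_counts, min_genes=50):
    """Split annotation terms into primary vs latent sets by popularity.
    term_gene_counts: dict mapping term -> number of genes annotated with it.
    min_genes is a hypothetical threshold, not a value from the paper."""
    primary = {t for t, n in term_gene_counts.items() if n >= min_genes}
    latent = set(term_gene_counts) - primary
    return primary, latent

# Invented example counts for four terms within one stage range.
counts = {"ventral ectoderm anlage": 120, "procephalic ectoderm AISN": 80,
          "rare term A": 5, "rare term B": 2}
primary, latent = partition_terms(counts)  # popular terms become primary
```

In practice the threshold (or an equivalent top-k rank cutoff) would be chosen per stage range, trading off decoding cost against coverage of the term vocabulary.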
REFERENCES
Schindelin,J. et al. (2012) Nat. Methods.
Busch,W. et al. (2012) Nat. Methods.
Keränen,S.V.E. et al. (2006) Genome Biol.
Tomancak,P. et al. (2007) Genome Biol.
Wu,S. et al. (2016) Proc. Natl Acad. Sci. USA.