Xiao Cai1, Hua Wang, Heng Huang, Chris Ding. 1. Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA.
Abstract
MOTIVATION: Staining the mRNA of a gene via in situ hybridization (ISH) during the development of a Drosophila melanogaster embryo delivers the detailed spatio-temporal patterns of the gene expression. Many related biological problems such as the detection of co-expressed genes, co-regulated genes and transcription factor binding motifs rely heavily on the analysis of these image patterns. To provide the text-based pattern searching for facilitating related biological studies, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated with developmental stage term and anatomical ontology terms manually by domain experts. Due to the rapid increase in the number of such images and the inevitable bias annotations by human curators, it is necessary to develop an automatic method to recognize the developmental stage and annotate anatomical terms. RESULTS: In this article, we propose a novel computational model for jointly stage classification and anatomical terms annotation of Drosophila gene expression patterns. We propose a novel Tri-Relational Graph (TG) model that comprises the data graph, anatomical term graph, developmental stage term graph, and connect them by two additional graphs induced from stage or annotation label assignments. Upon the TG model, we introduce a Preferential Random Walk (PRW) method to jointly recognize developmental stage and annotate anatomical terms by utilizing the interrelations between two tasks. The experimental results on two refined BDGP datasets demonstrate that our joint learning method can achieve superior prediction results on both tasks than the state-of-the-art methods. AVAILABILITY: http://ranger.uta.edu/%7eheng/Drosophila/.
MOTIVATION: Staining the mRNA of a gene via in situ hybridization (ISH) during the development of a Drosophila melanogaster embryo delivers the detailed spatio-temporal patterns of the gene expression. Many related biological problems such as the detection of co-expressed genes, co-regulated genes and transcription factor binding motifs rely heavily on the analysis of these image patterns. To provide the text-based pattern searching for facilitating related biological studies, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated with developmental stage term and anatomical ontology terms manually by domain experts. Due to the rapid increase in the number of such images and the inevitable bias annotations by human curators, it is necessary to develop an automatic method to recognize the developmental stage and annotate anatomical terms. RESULTS: In this article, we propose a novel computational model for jointly stage classification and anatomical terms annotation of Drosophila gene expression patterns. We propose a novel Tri-Relational Graph (TG) model that comprises the data graph, anatomical term graph, developmental stage term graph, and connect them by two additional graphs induced from stage or annotation label assignments. Upon the TG model, we introduce a Preferential Random Walk (PRW) method to jointly recognize developmental stage and annotate anatomical terms by utilizing the interrelations between two tasks. The experimental results on two refined BDGP datasets demonstrate that our joint learning method can achieve superior prediction results on both tasks than the state-of-the-art methods. AVAILABILITY: http://ranger.uta.edu/%7eheng/Drosophila/.
The mRNA in situ hybridization (ISH) provides an effective way to visualize gene expression patterns. The ISH technique can precisely document the localization of gene expression at the cellular level via visualizing the probe by colorimetric or fluorescent microscopy to allow the production of high quality images recording the spatial location and intensity of the gene expression (Fowlkes ; Hendriks ; L'ecuyer ; Megason and Fraser, 2007). Such spatial and temporal characterizations of expressions paved the way for inferring regulatory networks based on spatio-temporal dynamics. The raw data produced from such experiments includes digital images of the Drosophila embryo (examples are visualized in Fig. 1) showing a particular gene expression pattern revealed by a gene-specific probe (Grumbling ; Lyne ; Tomancak , 2007). The fruit fly Drosophila melanogaster is one of the most used model organisms in developmental biology.
Fig. 1.
Examples of Drosophila embryo ISH images and associated anatomical annotation terms in the stages 4–6, 7–8, 9–10, 11–12 and 13–16 in the BDGP database. The darker stained region highlights the place where the gene is expressed. The darker color the region has, the higher the gene expression level is
Examples of Drosophila embryo ISH images and associated anatomical annotation terms in the stages 4–6, 7–8, 9–10, 11–12 and 13–16 in the BDGP database. The darker stained region highlights the place where the gene is expressed. The darker color the region has, the higher the gene expression level isTraditionally, such ISH images are analyzed directly by the inspection of microscope images and available from well-known databases, such as the Berkeley Drosophila Genome Project (BDGP) gene expression pattern database (Tomancak , 2007) and Fly-FISH (L'ecuyer ). To facilitate spatio-temporal Drosophila gene expression pattern studies, researchers needed to solve two challenging tasks first: Drosophila gene expression pattern stage recognition (temporal descriptions) and anatomical annotation (spatial descriptions). As shown in Figure 1, Drosophila embryogenesis has been subdivided into 17 embryonic stages. These stages are defined by prominent features that are distinguishable in living Drosophila embryos (Weigmann ). To recognize the stages of the Drosophila, embryos provide their time course patterns. On the other hand, the Drosophila gene expression patterns are often recorded by controlled vocabularies from the biologist's perspective (Tomancak ). Such anatomical ontology terms describe the spatial biological patterns and often cross stages. What is more, because the ISH images are attached to each other collectively becoming bags of images, the corresponding stage label as well as anatomical controlled terms are the descriptions of the whole group of images instead of each individual image inside the bag. A Drosophila embryo ISH image bag belongs to only one stage, but has multiple related anatomical terms. Previously, those two tasks are tackled by domain experts. However, due to the rapid increase in the number of such images and the inevitable bias annotation by human curators, it is necessary to develop an automatic method to classify the developmental stage and annotate anatomical structure using controlled vocabulary.Recently, a lot of research works have been proposed to solve the above two problems. They considered the stage recognition as a single-label multi-class classification problem while the anatomical annotation was treated as a multi-label multi-class classification problem. (Kumar ) first developed an embryo enclosing algorithm to find the embryo outline and extract the binary expression patterns via adaptive thresholding. (Peng and Myers, 2004) and (Peng ) developed new ways to represent ISH images based on Gaussian mixture models, principal component analysis and wavelet functions. Besides that, they utilized min-redundancy max-relevance to do the feature selection and automatically classify gene expression pattern developmental stages. Recently, (Puniyani ) constructed a system (called SPEX2) and concluded that the local regression (LR) method taking advantage of the controlled term–term interactions can get the best enhanced anatomical controlled term annotation results. The LR method was proposed by Ji et al. and developed based on their previous works (Ji ; 2010; Li ; Shuiwang ;). All of the above methods have provided new inspirations and insights for classifying or annotating Drosophila gene expression patterns captured by ISH. However, none of them considered doing those two tasks simultaneously. As we know, intuitively, anatomical controlled vocabulary terms provide evidence for the stage label and vice versa. For example, the early stage range is more likely annotated with the controlled terms such as ‘statu nascendi’ and ‘celluar’ than the terms ‘embryonic’ and ‘epidermis’. Therefore, besides the image–stage and image–annotation relationships which have been well studied and applied in the previous research, it is necessary to take advantage of the correlations between stage classes and annotation terms.In this article, we propose a novel Tri-Relational Graph (TG) model that comprises the data graph, anatomical controlled terms graph, developmental stage label graph to jointly classify the stage of images and annotate anatomical terms simultaneously. Upon the TG model, we introduce a Preferential Random Walk (PRW) method to simultaneously produce image-to-stage, image-to-annotation, image-to-image, stage-to-image, stage-to-annotation, stage-to-stage, annotation-to-image, annotation-to-stage and annotation-to-annotation relevances to jointly learn the salient patterns among images that are predictive of their stage label and anatomical annotation terms. Our method achieves superior developmental stage classification performance and anatomical terms annotation results compared with the state-of-the-art methods.We consider each image bag as a data point and extract the bag-of-word features that are widely used in computer vision research as the corresponding descriptors. Since the real object is 3D and each image can only provide 2D observation from a certain perspective, we integrate the bag-of-word features for different views. We summarize our contributions as follows:This article is the first one to propose a novel solution to the questions ‘What is the developmental stage?’ and ‘What are the anatomical annotations’ simultaneously, given an unlabeled image bag.Via the new TG model that we constructed, the relationships between stage label and anatomical controlled terms as well as the correlations among anatomical terms can be naturally and explicitly exploited by the graph-based semi-supervised learning methods.We propose a new PRW method to seek the hidden annotation–annotation and annotation–stage relevances. Other than only using image-to-image relevance conducted by existing methods, we can directly predict the stage label and annotate anatomical controlled terms for unknown image bags.
2 DATA DESCRIPTORS
As we known, the Drosophila embryos are 3D objects. However, the corresponding image data can only demonstrate 2D information from a certain view. Since recent study has shown that incorporating images from different views can improve the classification performance consistently (Ji ), we will use the images taken from multiple views instead of one perspective as the data descriptor. We only consider the lateral, dorsal and ventral images in our experiment due to the fact that the number of images taken from other views is much less than that of the above three views. All the images from BDGP database have been pre-processed, including alignment and resizing to 128×320 gray images. For the sake of simplicity, we extract the popular SIFT (Lowe, 2004) features from the regular patches with the radius as well as the spacing as 16 pixels (Shuiwang ), which is shown in Figure 2. Specifically, we extract one SIFT descriptor with 128 dimensions on each patch and each image is represented by 133 (7×19) SIFT descriptors. Nevertheless, the above SIFT features cannot be directly used to measure similarity between data points (image bags), because the number of images in each image bag is different. In order to get a desired equal length descriptor to release the burden of later learning task, we need to build codebook for all extracted SIFT features first and then redo the data representations for each image bag based on the constructed codebook.
Fig. 2.
Demonstration of the regular patches. We extract one SIFT feature on one patch, where the radius and spacing of the regular patches are set to 16 pixels
Demonstration of the regular patches. We extract one SIFT feature on one patch, where the radius and spacing of the regular patches are set to 16 pixels
2.1 Codebook construction
Usually the codebook is established by conducting the clustering algorithms on a subset of the local features, and the cluster centers are then chosen as the visual words of the codebook. In our study, we use K-means to do the clustering on the training image bags. Since the result of K-means depends on the initial centers, we repeat it with 10 random initializations from which the one resulting in the smallest objective function value is selected. The number of clusters is set to 1000, 500 and 250 for lateral, dorsal and ventral images, respectively, according to the total number of images for each view as shown in Table 1. (Other codebook sizes gave similar performance.)
Table 1.
The statistics summary of the refined BDGP images with 79 terms
Stage range
4–6
7–8
9–10
11–12
13–16
Total
Size of control term
11
12
12
20
31
79
No. of image bags
500
500
500
500
500
2500
No. of lateral images
1514
812
727
1356
1004
5414
No. of dorsal images
226
324
431
447
724
2152
No. of ventral images
164
137
81
214
216
812
The statistics summary of the refined BDGP images with 79 terms
2.2 Data (image bag) representations
After we get three codebooks, the images in each bag are quantized separately for each view. Features computed from patches on images with a certain view are compared with the visual words in the corresponding codebook and the visual word closest to the feature in terms of Euclidean distance is utilized to represent it. Therefore, if an image bag encompasses the images from three views, then it could be represented by three bags of words, one for each view. We concatenate the three vectors so that the images with different views (lateral, dorsal and ventral) in one bag can be represented by one vector. To be specific, Let x∈ℝ1000, x∈ℝ500 and x∈ℝ250 denote the bag-of-words vector for images in a bag with lateral, dorsal and ventral view, respectively. The descriptor for this image bag can be represented as x=[x; x; x]∈ℝ1750. Since not all the image bags enclose the images from all three views, the corresponding bag-of-words representation is a vector of zeroes if a specific view is absent. Moreover, in order to capture the variability of the number of images in each view and each bag, we normalized the bag-of-words vector to unit length. At last, each image bag is represented by a normalized vector x.
3 METHODS
In this section, we first construct a TG to model Drosophila gene expression patterns followed by proposing a novel PRW method. Using PRW on TG, we jointly make stage classification and annotate anatomical terms of Drosophila gene expression patterns.For the Drosophila gene expression pattern data, we have n gene expression images bags X={x1,···, x}, where each image bag is abstracted as a data point x∈ℝ. Each data point x belongs to one of K stage classes ={c1,···, c} represented by y∈{0, 1}, such that y(k)=1 if x is classified into class c, and 0 otherwise. Meanwhile, each image bag x is also annotated with a number of anatomical ontology terms ={a1,···, a} represented by y∈{0, 1}, such that y(k)=1 if x is annotated with term a, and 0 otherwise. Also, for convenience, we write y=[y, y]∈{0, 1}. Without loss of generality, we assume the first lSection 1, the stage class and anatomical terms have some relations. We utilize the following affinity matrix to model their interrelations, R∈ℝ, where R(i, j) indicates how closely class c and term a are related. In this work, we compute it as
where is the i-th row of Y and is the j-th row of Y. Throughout this article, we denote a vector as a bold lowercase character and a matrix as an uppercase character. We denote the i-th entry of a vector v as v(i), and the entry at the i-th row and j-th column of a matrix M as M(i, j). ||v|| denotes the Euclidian norm of vector v. And the inner product of two vector v1 and v2 is defined as =v1
v2.
3.1 The construction of TG
Given the dataset X, pairwise similarity W∈ℝ between data points can be computed using the Gaussian kernel function,
where the vector x is calculated using bag-of-word features for one image bag. Regarding the parameter σ, we resort to self-tuning method (Zelnik-Manor and Perona, 2004). W(i, j) indicates how closely x and x are related. From W, a graph =(, ) can be induced, where =X and ⊆×. And we use kNN graph. To be specific, we connect x, x if one of them is among the other's k nearest neighbor and define the value of the edge connecting them by Equation (2). Because characterizes the relationships between data points, it is usually called as data graph, such as the middle subgraph in Figure 3. Existing graph-based semi-supervised learning methods (Kang ; Zha ) only make use of the data graph, on which the class label information is propagated.
Fig. 3.
The TG constructed from the gene expression data. Solid lines indicate affinity between vertices within in a same subgraph, dashed lines indicates associations between vertices in two different subgraphs
The TG constructed from the gene expression data. Solid lines indicate affinity between vertices within in a same subgraph, dashed lines indicates associations between vertices in two different subgraphsDifferent from conventional single-label classification learning problem in which classes are mutual exclusive, the anatomical terms are interrelated with one another. Again, we resort to cosine similarity to calculate the controlled term affinity matrix W, where W(i, j) indicates the correlation between a and a. Thus, a graph =(, ) is induced, where = and ⊆×. We call as annotation label subgraph, which is shown as the right subgraph in Figure 3. Similarly, stage classification label graph shown as the left subgraph in Figure 3, =(, ) can be constructed from stage classification labels, where = and ⊆×, where we define the value of the edge connecting two stage labels as W(i, j)=‖S−S‖, where ‖ ‖ means Frobenius norm and S denotes the between class scatter matrix for stage i. Connecting and by the annotation associations via the green dashed lines, connecting and by the class associations via the blue dashed lines and connecting and by the stage-term association via the purple dashed lines, we construct a TG as following:
which is illustrated in Figure 3. Obviously, the subgraph =(, V, ) connects G and G, whose adjacency matrix is Y. Similarly, the adjacency matrix of =(, , ) is Y. The subgraph (, , ) characterizes the associations between stage classes and anatomical terms whose adjacency matrix is R defined in Equation (1).In contrast to existing graph-based semi-supervised learning methods that only use information conveyed by , we aim to simultaneously classify and annotate an unlabeled data point using all the information encoded in . Since all data points (gene expression image bags), stage terms and annotation terms are equally regarded as vertices on , our task is to measure the relevance between a class/anntotation term vertex and a data point vertex. As each class/annotation term has a set of associated training data points, which convey the same biological record information as the class/annotation term, we consider both a class/annotation term vertex and its labeled training image bag vertices as a group set,
which is illustrated as the vertices with orange boundary in 3. As a result, instead of measuring vertex-to-vertex relevance between a class/annotatation term vertex and an unlabeled data point vertex, we may measure the set-to-vertex relevance between the group set and the data point. Motivated by Brin and Page (1998); Tong ), we consider to further develop standard random walk and use its equilibrium probability to measure the relevance between a group set and an unlabeled data point.
3.2 Preferential random walk
Standard random walk on a graph W can be described as a Markov process with transition probability M=D−1W, where d=∑
W(i, j) is the degree of vertex i and D=diag(d1,···, d). Clearly, M ≠ M and ∑M(i, j)=1. When W is symmetric, it corresponds to an undirected graph. When W is asymmetric, it corresponds to a directed graph and d is the out degree of vertex i. Let p( be the distribution of the random walker at time t, the distribution at t+1 is given by p((j)=∑
p((i)M(i, j). Thus, the equilibrium (stationary) distribution of the random walk p*=p( is determined by M
p*=p*. It is well known that the solution is simply given by p*=We/(∑d)=d/(∑d), where d=[d1,···, d].It can be seen that the equilibrium distribution of a standard random walk is solely determined by the graph itself, but independent of the location where the random walk is initiated. In order to incorporate label information, we propose the following PRW:
where 0≤α≤1 is a fixed parameter, and h, called preferential distribution, is a probability distribution such that h(i)≥0 and ∑h(i)=1 . Equation (5) describes a random walk process in which the random walker hops on the graph W according to the transition matrix M with probability 1 − α, and meanwhile it takes a preference to go to other vertices specified by h with probability α. The equilibrium distribution of PRW in Equation (5) is determined by p*=(1 − α )M
p*+αh, which leads to:
Due to Perron-Frobenius theorem, the maximum eigenvalue of M is less than max ∑M(i, j)=1. Thus, I−(1 − α)M is positive definite and invertible. Equation (5) takes a similar form to two existing works: random walk with restart (RWR) method (Tong ) and PageRank algorithm (Brin and Page, 1998). In the former, h is a vector with all entries to be 0 except one entry to be 1 indicating the vertex where the random walk could be restarted; while in the latter, h is a constant vector called as damping factor (Brin and Page, 1998). In contrast, the preferential distribution vector h in Equation (5) is a generic probability distribution, which is flexible thereby more powerful. Most importantly, through h we can assess group-to-vertex relevance, while RWR and PageRank methods measure vertex-to-vertex relevance.Similar to RWR (Tong ), when we set the h to be a probability distribution in which all the entries are 0 except for those corresponding to , p*(i) measures how relevant the k-th group is to the i-th vertex on .
3.3 Preferential random walk on TG
In order to classify and annotate unlabeled data points using the equilibrium probabilities in Equation (6) of the PRW on TG, we need to construct the transition matrix M and the preferential distribution h from .Construction of the transition matrix M:Let
where M, M and M are the intrasubgraph transition matrices of , and respectively, and the rest six sub-matrices are the intersubgraph transition matrices among , and . Let β1∈[0, 1] be the jumping probability, i.e. the probability that a random walker hops from to and vice versa. And let β2∈[0, 1] be the jumping probability from to or vice versa. Therefore, β1 and β2 regulates the reinforcement between and one of the other two subgraphs. When both β1=0 and β2=0, the random walk are performed independently on , which is equivalent to existing graph-based semi-supervised learning methods only using the data graph . Similarly, we define λ as the jumping probability from to or vice versa.During a random walk process, if the random walker is on a vertex of the data subgraph which has at least one connection to the label subgraph, such as vertex x1 in Figure 3, she can hops to the class label or annotation subgraph with probability β1 or annotation subgraph with probability β2, or stay on the data subgraph with probability 1 − β1 − β2 and hop to other vertices of the data subgraph. If the random walker is on a vertex of the data subgraph without a connection to the class label or annotation subgraph, she stays on the data subgraph and hops to other vertices on it as in standard random walk process. To be more precise, let d=∑Y(i, j), the transition probability from x to c is defined as following:Similarly, let d=∑
Y(i, j), the transition probability from c to x is:Following the same definition, the rest four inter-subgraph transition probability matrices are defined as:
where d=∑
Y(i, j) and d=∑
Y(i, j), and
where d=∑
R(i, j) and d=∑
R(i, j).Let d=∑
W(i, j), d=∑Y(i, j), d=∑Q(i, j), d=∑Q(i, j) where Q=R+Y and Q=R+Y.The data subgraph intra transition probability from x to x is computed as:Similarly, let d=∑
W(i, j), the annotation label subgraph intra transition probability from a to a is:let d=∑
W(i, j), the classification label subgraph intra transition probability from c to c is:
It can be easily verified that, ∑M(i, j)=1, i.e. M is a stochastic matrix.
Construction of the preferential distribution H:
the preferential distribution vector specifies a group of vertices to which the random walker prefers to moving in every iteration step. The relevance between this group and an vertex is measured by the equilibrium distribution of the random walk process. Therefore, we construct K=K+K preferential distribution vectors, one for each semantic group G:
where h((i)=1/∑
y(k) if y(k)=1 and h((i)=0, otherwise; h((i)=1, if i=k, γ∈[0, 1] controls how much the random walker prefers to go to the data subgraph and other two subgraphs , . It can be verified that ∑
h((i)=1, i.e, h( is a probability distribution. Let I be the identity matrix of size K×K, we write
PRW on TG:
given the TG of a dataset, using the transition matrix M defined in Equation (7) and the preferential probability matrix H defined in Equation (18), we can perform PRW on the TG. According to Equation (6), its equilibrium distribution matrix P* is computed as:
P*=[p1*,···, p*]∈ℝ(, and p* is the equilibrium distribution of the PRW taking the k-th semantic group as preference. Therefore, p*(i) (l+1≤i≤n) measures the relevance between the k-th class and an unlabeled image bag x. We can predict the stage class from the sub-matrix P* by Equation (20) and annotate the controlled terms for x using the adaptive decision boundary method (Wang ) on the submatrix P* by Equation (21).
Since the stage prediction is a single-label classification problem, we select the stage label y(i) with the maximum probability as the stage label for image bag x.
where is the i-th row vector of matrix P*. Therefore, we can do the stage classification and anatomical controlled term annotation simultaneously.
4 DATA REFINEMENT
In this section, we will introduce the details of the data used in our experiment. Drosophila embryogenesis has 17 stages, which are divided into 6 major ranges, i.e. stages 1–3, 4–6, 7–8, 9–10, 11–12 and 13–16 (the stage 17 is usually studied individually), in the BDGP database (Tomancak ). Each image bag is labeled with one stage term and many controlled vocabulary terms. The total number of anatomical controlled vocabulary terms is 303.We used the following way to refine the dataset. First, we only keep the image bag data with lateral, dorsal and ventral view information. And then, we eliminate six common annotation terms, that is, ‘no staining, ubiquitous, strong ubiquitous, faint ubiquitous, maternal, rapidly degraded’, which can be regarded as outliers because they can neither provide stage-specific information nor record anatomical structures. After that, we remove the anatomical terms whose data sample is <50. We ignore the stage 1–3 data since the number of anatomical terms after the above procedure becomes 2, too small to be compared with other stages. And finally we get 79 anatomical annotation terms in total that we will consider to annotate the unlabeled image bag.We refine the data mainly based on the following two reasons. On one hand, the annotation terms which appear in too few image bags are statistically too weak to be learned effectively. On the other hand, since we will use 5-fold cross-validations in our experiments, we have to guarantee there is at least one data point associated with each anatomical term in each fold. Moreover, in order to balance the number of image bags for different stages, we randomly sample 500 image bags as the data points for each stage. At last, the summary of the refined dataset is shown in Table 1.
5 EXPERIMENT
In this section, we will conduct experiments to evaluate PRW empirically on the refined dataset and compare it with other state-of-art classification methods. Since our method can do joint classification, in order to evaluate the benefit of joint learning, we compare its performance with that of the state-of-art multiclass single label or multiclass multilabel algorithms which can only handle either stage classification or anatomical term annotation problem. Our procedure is to train our model with stage labeled and anatomical term annotated image bags. All testing image bags are unlabeled with developmental stage and unannotated with anatomical controlled terms.
5.1 Experimental setup
When constructing PRW on TG, we used kNN graph setting k=9. We used ‘inverse’ 5-fold cross-validation to determine the values of the following five parameters, that is, using 1-fold for training and using the remaining 4-folds for testing to mimic the scenario in the real application where the number of training data is much less than the testing data. In our experiment, we found that the following five parameters are not sensitive in certain ranges with good performances. β1, β2 and λ controls the jumping between different subgraphs and cannot affect the result much if they are assigned in the range of (0.1, 0.45). α controls initial preference of the random walker and will get stable result if it is assigned in the range of (0, 0.1). γ controls how much the random walker prefers to go to the data subgraph or to go to two other subgraphs and it is usually in the range of (0.1, 0.3).Besides those parameters, we also need to initialize the stage as well as anatomical controlled terms for the testing image bag x, where i=l+1,…, n, l is the number of training image bag. In our experiment, we used k-nearest neighbor (KNN) method to do the initializations for both stage classification and anatomical term annotations tasks because of its simplicity and clear intuition. To be specific, we use k=1 and we abbreviate it as 1NN. Our joint classification framework will self-consistently amend the incorrect labels for stage and controlled terms. We perform 10 random splits of the data and report the average performance over the 10 trials. Please note that, in each trial, we still do ‘inverse’ 5-fold cross validation and record the average performance result as the result of that trial.
5.2 Image bag stage classification
Drosophila gene expression pattern stage categorization is a single-label multi-class problem. We compare the result of our method with that of support vector machine (SVM) with radial basis function (RBF) kernel (Chang and Lin, 2001). We use the optimal parameter values for C and γ got from cross-validation as well. We also compare the classification result of 1NN that we use to do the initialization. We assess the classification in terms of the average classification accuracy and the average confusion matrices. Since the data that we used is class balanced, the mean value of the entries on the diagonal of the confusion matrix is also the average classification accuracy. From the resulting average confusion matrices shown in Figure 5, we can see that the average prediction accuracy of our method is better than that of the other two state-of-art methods, especially in the last stage 13–16, where the number of anatomical terms is greatly larger than that of the other stages.
Fig. 5.
Stage classification results in terms of confusion matrices on 79-term dataset: (a) the confusion matrix calculated by SVM (b) the confusion matrix calculated by 1NN. (c) the confusion matrix calculated by our method. (a) SVM: acc. 84.50%; (b) 1NN: acc. 77.40%; (c) our: acc. 85.20%
5.3 Image bag controlled vocabulary terms annotation
Besides the stage classification task, we also validate our method by predicting the anatomical controlled terms for the Drosophila gene expression patterns, which can be considered as a multi-class multi-label classification problem. The conventional classification performance metrics in statistical learning, precision and F1 score, are utilized to evaluate the proposed methods. For every anatomical term, the precision and F1 score are computed following the standard definition for the binary classification problem. To address the multi-label scenario, following Tsoumakas and Vlahavas (2007), macro and micro average of precision and F1 score are used to assess the overall performance across multiple labels. We compared four state of art multi-label classification methods: local shared subspace (LS) (Ji ), local regression (LR) (Ji ), harmonic function (HF) (Zhu ) and random walk (RW) (Zhou and Schölkopf, 2004). All of them are proposed recently to solve the multilabel annotation problem. In addition, we compare the results of 1NN as well. For the first three methods we use the published codes posted on the corresponding author's websites. And we implement the RW method following the original work (Zhou and Schölkopf, 2004). For HF and RW methods, we follow the original work to solve the multilabel annotation only. Therefore, we only evaluate those two methods on data subgraph and annotation label subgraph without using any information derived from the classification label subgraph such as the stage–term correlation. Table 2 shows the average anatomical annotation performance of 79-term dataset. Compared to the above five stat-of-the-art methods, our method has the best results by all metrics. Figure 6 illustrates the average Micro F1 score of our method, 1NN, LS, LR, RW and HF approaches for all the anatomical terms on 79-term dataset. And again, our method consistently achieves best performance for most of the anatomical controlled terms.
Table 2.
Annotation prediction performance comparison on the 79-term dataset
Method
Ma Pre
Ma F1
Mi Pre
Mi F1
1NN
0.3455
0.3595
0.2318
0.2230
LS
0.5640
0.3778
0.3516
0.1903
LR
0.6049
0.4425
0.3953
0.2243
RW
0.4019
0.3385
0.2808
0.1835
HF
0.3727
0.3296
0.2756
0.1733
Our method
0.6125
0.4434
0.4057
0.2336
Ma Pre, Avg. Macro Precision; Ma F1, Avg. Macro F1; Mi Pre, Avg. Micro Precision; Mi F1, Avg. Micro F1.
Fig. 6.
The Avg. Micro F1 score of five methods on each term in 79-term dataset. (It is better to be viewed in colorful and zoomed in mode.)
Annotation prediction performance comparison on the 79-term datasetMa Pre, Avg. Macro Precision; Ma F1, Avg. Macro F1; Mi Pre, Avg. Micro Precision; Mi F1, Avg. Micro F1.
5.4 The advantage of joint learning
Unlike the traditional work, our proposed method can take advantage of all the information to do the stage classification and anatomical term annotation simultaneously. Therefore, when the number of training data is scare, we can resort to both intrarelations and interrelations to make the decision for stage classification and anatomical controlled term annotation simultaneously. When there are strong correlation between those two tasks, we expect that the performance of both tasks will be enhanced by joint learning work than treating them individually and independently. Figure 4 shows the pairwise label correlations of the 79 terms and stage–term correlations between 5 stages and 79 terms. As highlighted by purple arrows, we can observe that there are high pairwise correlations between the terms ‘embryonic brain’,‘ventral nerve cord’ as well as ‘embryonic/larval muscle system’. Moreover, all the above three terms have high correlations with the stage 13–16, which can provide strong evidence that the given testing image bag could belong to the last developmental stage besides the induction from the data graph only. If our joint classification framework annotates it with all those three terms, although from the data similarity we cannot get strong evidence for the stage prediction, we can take advantage of the term–term as well as term–stage high correlations to adjust its stage to stage 13–16. In other words, relevant anatomical terms could help us to predict the stage label since they provide the spatial and temporal information of local structure corresponding to a specific embryo development stage. Nevertheless, not all anatomical terms will definitely benefit stage classification, which is consistent with our stage classification result. From Figure 5, we can see that our method may have competitive result compared with SVM with respect to some certain stage. However, given more anatomical term information, the performance of our method will gradually outperform the other methods, especially for the prediction result of stage 13–16.
Fig. 4.
The middle part demonstrates the terms–stages correlation and the right part shows the terms–terms correlation of 79 terms. The stage unknown test data shown in the left part is classified correctly as Stage 13–16, because of the strong correlation between the predicted stage and its predicted anatomical terms and vice versa, NOT the similarity of its first and second nearest neighboring data induced from the data graph only
The middle part demonstrates the terms–stages correlation and the right part shows the terms–terms correlation of 79 terms. The stage unknown test data shown in the left part is classified correctly as Stage 13–16, because of the strong correlation between the predicted stage and its predicted anatomical terms and vice versa, NOT the similarity of its first and second nearest neighboring data induced from the data graph onlyStage classification results in terms of confusion matrices on 79-term dataset: (a) the confusion matrix calculated by SVM (b) the confusion matrix calculated by 1NN. (c) the confusion matrix calculated by our method. (a) SVM: acc. 84.50%; (b) 1NN: acc. 77.40%; (c) our: acc. 85.20%The Avg. Micro F1 score of five methods on each term in 79-term dataset. (It is better to be viewed in colorful and zoomed in mode.)
5.5 The more meaningful asymmetric correlation matrix
When we build the TG, at first, we assume the term–term correlation and stage–term correlation are both symmetric, since we used cosine similarity to represent their correlations. However, the above assumption does not always hold in the real data. In Drosophila embryo gene expression images, we found that the conditional probability of the occurrence of term ‘ventral nerve cord’ given term ‘embryonic brain’ is higher than that of the ‘embryonic brain’ given ‘ventral nerve cord’, which satisfies the biology meaning that ventral nerve cord occurs earlier than embryonic brain. After learning, our method can automatically discover the above hidden asymmetric correlation information, that is,
In order to see the learned asymmetric term–term correlation more clearly, in Figure 7, we show the difference matrix got by P*−P*
. Taking those more accurate asymmetric correlation into consideration, our method can potentially improve both stage classification and anatomical annotation results even more.
Fig. 7.
The learned difference matrix. (It is better to be viewed in colorful and zoomed in mode.) In order to see the asymmetric entries more clearly, we plot P*−P*
. After PRW, the entries marked as brighter square have higher conditional probability (positive correlation) than its counterpart which is marked as darker color. This asymmetric reflects more accurate term–term correlation than the original symmetric assumption
The learned difference matrix. (It is better to be viewed in colorful and zoomed in mode.) In order to see the asymmetric entries more clearly, we plot P*−P*
. After PRW, the entries marked as brighter square have higher conditional probability (positive correlation) than its counterpart which is marked as darker color. This asymmetric reflects more accurate term–term correlation than the original symmetric assumption
6 CONCLUSION
In this article, we proposed a novel TG model to learn the task interrelations between stage recognition and anatomical terms annotation of Drosophila gene expression patterns. The standard bag-of-word features and three major views (lateral, dorsal and ventral) were used to describe the 3D Drosophila images. A new PRW method was introduced to simultaneously propagate the stage labels and anatomical controlled terms via TG model. Both stage classification and anatomical controlled term annotation tasks are jointly completed. We evaluated the proposed method using one refined BDGP dataset. The experimental results demonstrated in the real application, when the number of training data is scarce, our joint learning method can achieve superior prediction results on both tasks than the state-of-the-art methods. What is more, we can discovery more accurate asymmetric term–term correlation, which can potentially improve the results of both tasks even more.
Authors: Katrin Weigmann; Robert Klapper; Thomas Strasser; Christof Rickert; Gerd Technau; Herbert Jäckle; Wilfried Janning; Christian Klämbt Journal: Trends Genet Date: 2003-06 Impact factor: 11.639
Authors: Cris L Luengo Hendriks; Soile V E Keränen; Charless C Fowlkes; Lisa Simirenko; Gunther H Weber; Angela H DePace; Clara Henriquez; David W Kaszuba; Bernd Hamann; Michael B Eisen; Jitendra Malik; Damir Sudar; Mark D Biggin; David W Knowles Journal: Genome Biol Date: 2006 Impact factor: 13.583
Authors: Pavel Tomancak; Benjamin P Berman; Amy Beaton; Richard Weiszmann; Elaine Kwan; Volker Hartenstein; Susan E Celniker; Gerald M Rubin Journal: Genome Biol Date: 2007 Impact factor: 13.583
Authors: Rachel Lyne; Richard Smith; Kim Rutherford; Matthew Wakeling; Andrew Varley; Francois Guillier; Hilde Janssens; Wenyan Ji; Peter Mclaren; Philip North; Debashis Rana; Tom Riley; Julie Sullivan; Xavier Watkins; Mark Woodbridge; Kathryn Lilley; Steve Russell; Michael Ashburner; Kenji Mizuguchi; Gos Micklem Journal: Genome Biol Date: 2007 Impact factor: 13.583