René Donner1, Bjoern H Menze, Horst Bischof, Georg Langs. 1. Computational Image Analysis and Radiology Lab, Department of Radiology, Medical University of Vienna, Austria; Institute for Computer Graphics and Vision, Graz University of Technology, Austria. Electronic address: rene.donner@meduniwien.ac.at.
Abstract
The accurate localization of anatomical landmarks is a challenging task, often solved by domain specific approaches. We propose a method for the automatic localization of landmarks in complex, repetitive anatomical structures. The key idea is to combine three steps: (1) a classifier for pre-filtering anatomical landmark positions that (2) are refined through a Hough regression model, together with (3) a parts-based model of the global landmark topology to select the final landmark positions. During training landmarks are annotated in a set of example volumes. A classifier learns local landmark appearance, and Hough regressors are trained to aggregate neighborhood information to a precise landmark coordinate position. A non-parametric geometric model encodes the spatial relationships between the landmarks and derives a topology which connects mutually predictive landmarks. During the global search we classify all voxels in the query volume, and perform regression-based agglomeration of landmark probabilities to highly accurate and specific candidate points at potential landmark locations. We encode the candidates' weights together with the conformity of the connecting edges to the learnt geometric model in a Markov Random Field (MRF). By solving the corresponding discrete optimization problem, the most probable location for each model landmark is found in the query volume. We show that this approach is able to consistently localize the model landmarks despite the complex and repetitive character of the anatomical structures on three challenging data sets (hand radiographs, hand CTs, and whole body CTs), with a median localization error of 0.80 mm, 1.19 mm and 2.71 mm, respectively.
The accurate localization of anatomical landmarks is a challenging task, often solved by domain specific approaches. We propose a method for the automatic localization of landmarks in complex, repetitive anatomical structures. The key idea is to combine three steps: (1) a classifier for pre-filtering anatomical landmark positions that (2) are refined through a Hough regression model, together with (3) a parts-based model of the global landmark topology to select the final landmark positions. During training landmarks are annotated in a set of example volumes. A classifier learns local landmark appearance, and Hough regressors are trained to aggregate neighborhood information to a precise landmark coordinate position. A non-parametric geometric model encodes the spatial relationships between the landmarks and derives a topology which connects mutually predictive landmarks. During the global search we classify all voxels in the query volume, and perform regression-based agglomeration of landmark probabilities to highly accurate and specific candidate points at potential landmark locations. We encode the candidates' weights together with the conformity of the connecting edges to the learnt geometric model in a Markov Random Field (MRF). By solving the corresponding discrete optimization problem, the most probable location for each model landmark is found in the query volume. We show that this approach is able to consistently localize the model landmarks despite the complex and repetitive character of the anatomical structures on three challenging data sets (hand radiographs, hand CTs, and whole body CTs), with a median localization error of 0.80 mm, 1.19 mm and 2.71 mm, respectively.
The accurate localization of anatomical landmarks in medical imaging data is a challenging problem, due to rich variability and frequent ambiguity of their appearance. In this paper we propose a generic algorithm for global landmark localization in 2D and 3D medical imaging data.The detection of landmarks is relevant in different contexts. Anatomical landmarks can serve as basis for morphometric measurements in anatomical structures for clinical applications such as the assessment of spinal flexibility (Lamarre et al., 2009), knee alignment angles (Issa et al., 2007; Fakhrai et al., 2010), joint space narrowing in rheumatoid arthritis (Langs et al., 2009; Peloschek et al., 2007), or fracture detection in osteoporosis diagnosis (Roberts et al., 2006). It can facilitate navigation through large medical imaging data by pointing to relevant locations, and is necessary if subsequent image analysis algorithms need initial location estimates as for instance segmentation algorithms, or accurate identification of anatomical landmarks for localized follow-up analysis. In computer aided diagnosis it can serve as a means to identify target regions for further analysis (Doi, 2007).Segmentation approaches such as Level-Sets (Cremers et al., 2007), Active Shape and Appearance Models (Cootes et al., 1995; Cootes et al., 2001), Graph Cuts (Boykov et al., 2001), or Random Walks (Grady, 2006) typically require at least a coarse initial localization, either by user input or by an appropriate algorithm to set either seed points (Graph Cuts, Random Walks) or to initialize contour positions (Mičušík and Pajdla, 2007). Registration based segmentation approaches such as those applied in brain imaging (e.g., FreeSurfer (Fischl et al., 2002)) typically assume coarse alignment across individuals. While this is the case in adult neuroimaging studies it does not hold for studies involving other anatomical structures, or fetal brain imaging data (Dittrich et al., 2011). To make atlas based approaches applicable to large scale studies on other anatomical structures, prior automatic alignment is necessary, for which landmarks offer a feasible solution. Landmark localization can also be regarded as a form of semantic parsing (Seifert et al., 2009) when point-wise rather than regional information is required.Typically, landmarks are identified by anatomy-specific algorithms that first obtain coarse location estimates for selected landmarks – sometimes aided by manual interaction – and then apply a domain specific model for their refinement. Among the reasons for the difficulties are noise (including local and global intensity changes), cluttered image data (overlapping structures in 2D projections, highly structured background in 3D organ segmentation), and anatomical structures that exhibit a high degree of similarity (e.g., fingers or vertebrae).We propose an algorithm that copes with these challenges and offers a general approach to accurately localize landmarks without initialization or the need for subsequent refinement.The method first generates very accurate position hypotheses for landmarks in a global search, and then disambiguates the landmarks based on their spatial configuration. A global pre-filtering classification with local Hough Forests yields the hypotheses, while the disambiguation is performed by solving a Markov Random Field (MRF). The method starts from local image information, and includes the global, geometrical information of the anatomical structure in the graph matching step.The proposed approach performs global search. It learns optimal landmark detectors and features from the training data. It adapts the topology of the shape model to the anatomical structure based on the variability of the training examples, and aggregated appearance evidence for stable and accurate landmark location estimates. Due to the relatively low number of candidates for each landmark it scales well to large 3D data.
State of the art
Several related approaches to anatomical structure localization exist in recent literature. They mainly differ regarding the type of semantic representation that is obtained to describe the image data. We distinguish between approaches that either (1) indicate the positions of individual landmarks, (2) result in model parameters which describe the position and shape of the object or (3) provide voxel-wise labels for different organs.Localizing anatomical landmarks using the positions of selected interest points has been the objective of Donner et al. (2007a), Donner et al. (2010b). The methods employ interest point detectors (such as symmetry maxima, Harris corners) to find candidates for the individual model landmarks, and perform discrete optimization of a graph matching problem to obtain the final localization. The model landmarks are thus restricted to positions where these interest points are reliably detected. The methods in Bergtholdt et al. (2010),Schmidt et al. (2007), Donner et al. (2010a) employ learnt interest point detectors to be able to obtain candidate points which directly represent anatomical landmarks. In both cases, the final representation consists of the matched model graph vertices (i.e. the landmarks) in the space of the target image or volume. To balance the low accuracy of the initial landmark estimates, (Bergtholdt et al., 2010; Schmidt et al., 2007) employ a subsequent refinement step after matching the model. While the investigated anatomical structure (spine) is embedded in a 3D data set, the landmarks are all contained in a 2D subspace. In contrast to this, (Donner et al., 2010a) matches a 3D geometric model to candidates obtained as mean-shift cluster centers of 3D probability volumes resulting from the classification of CT data using Haar-like wavelets and Random Forests.In related work (Shotton et al., 2011) predicts joint positions in 2D depth images by classification and mean-shift clustering of the resulting labeling. Due to the characteristics of the data set, a disambiguating, final optimization step is not required. Seifert et al. (2009) parse whole body CT data in a hierarchical fashion, to detect larger scale organs. To reduce the complexity of the task this parsing is performed on axial slices. They first search for one salient slice in each dimension and consequently only localize landmarks within these slices. While substantially speeding up the localization this only works for objects which are rather large in respect to the volume size, as all the objects have to be visible in at least one of the three slices.The idea of using ensemble regressors to estimate model parameters by having parts of the image vote for positions in the parameter space has been very successfully employed outside of the domain of medical imaging in the form of Hough Forests (Gall and Lempitsky, 2009; Shotton et al., 2011). A first application to the localization of organs in thorax CTs has been proposed in Criminisi et al. (2010). They train Random Regression Forests on Haar-like long-range features to predict the position and size of bounding boxes. An extension using Hough ferns was presented in Pauly et al. (2011) to predict the bounding boxes of multiple organs at once in full-body MR data. Working on axial slices of CT scans, (Zhou et al., 2012) estimates bounding boxes through a boosted learning scheme and the combination of the independent axial predictions to obtain a localization in 3D.The task of assigning pixel-wise labels to segment entire organs or organ structures has been approached by Criminisi et al. (2009), Montillo et al. (2011) using Random Forest classification. The work in Montillo et al. (2011) extends the Auto-Context (Tu and Bai, 2010) idea to include intermediate decisions within a tree to speed up the classification, which also incorporates information about the relative spatial positions of the objects.Relying on stochastic optimization instead of ensemble classification or regression, Marginal Space Learning (Zheng et al., 2009) tries to find the parameters of a bounding box or a parametric and data-driven shape model (Cootes et al., 1995) to localize and segment anatomical structures. In contrast to standard Particle Filters (de Bruijne and Nielsen, 2004), this is not performed on all parameter dimensions at once, but rather the number of searched parameter dimensions is increased after each convergence. Iterative approaches have been proposed to cope with repetitive structures such as the spine (Kelm et al., 2010).
Limitations of existing methods
The existing approaches exhibit a number of limitations if the goal is to accurately identify and precisely localize anatomical landmarks.
Local search leads to local minima especially in the case of repetitive anatomical structures
Approaches without a global cost function which optimizes the matching of the model to the target volume cannot cope with the high degree of repetitive sub-parts existing in anatomical structures. Different anatomical sub-parts may show a very similar local appearance pattern, and landmark detectors based on classification may yield highly inter-changeable responses requiring further disambiguation. For this reason, a global search and robust disambiguation of possible landmark configurations is necessary.
A fixed graph structure cannot represent arbitrary anatomical structures optimally
Previously proposed methods for landmark-based localization employ a limited geometric model using simple mean ± standard deviation length distributions on the model’s edges (Donner et al., 2010b). To effectively deal with the topological and geometrical richness of human anatomy, non-parametric geometric models that adapt to the structure based on the training data are preferable.
Manageable numbers of accurate candidates
To achieve high accuracy during global search, the candidate points have to be precise even before disambiguation. This is not possible with purely classifier-based approaches. Furthermore, the number of candidates must be low in order to make the matching of large models feasible.The alternative of using only Hough Forests, i.e. letting all voxels predict a landmark’s position as in Criminisi et al. (2010) and Pauly et al. (2011) leads to low accuracy predictions: while the long range, global predictions are required to indicate the region in the image where the landmark is located, only the local predictions indicate the landmark’s position with high accuracy. A pre-filtering step (as proposed in this work) or an iterative, course-to-fine Hough Regression scheme is required to cope with this limitation.Lastly, to be of practical relevance, the resulting approach should not require a delicate estimation of algorithm parameters or model topology, while model training as well as the matching of the model to the target volume should be fast.
Contribution
The proposed algorithm is a global search that tackles scalability, accuracy and adaptability of representation. We demonstrate that a combined classification and regression approach to candidate point estimation yields highly accurate positions. The candidate detector exhibits high specificity, reducing the number of candidates per landmark while increasing their likelihood and thus substantially reducing the computational complexity of the graph matching, which is performed using global optimization. The use of state of the art ensemble methods ensures fast training and prediction times. The graphical model topology is derived automatically from the data. Finally, the parameters of the system generalize well, and were identical for all experiments reported in this paper.
Paper structure
The paper is structured as follows: Section 2.1 details the training and computation of the initial probability maps per landmark, whose agglomeration in Section 2.2 yields the candidate points. The construction of the MRF is explained in Section 2.3. Section 3 introduces the experiments, with the results presented in Section 3.5. A discussion and an outlook can be found in Sections 4 and 5.
Methods
The approach is divided into a training phase and a localization phase as shown in Fig. 2. During training the algorithm constructs a geometric model by estimating the mutual predictive accuracy between landmark positions, trains a Random Forest classifier for predicting landmark locations and a random regression forest (Hough Forests) for refining position estimates.
Fig. 2
Outline of the proposed anatomical structure localization approach.
During localization (Fig. 2, right) Random Forest classifiers pre-filter the target volume and Hough Forests aggregate the classification probabilities into accurate landmark candidates. Subsequently an MRF that encodes the mutual location information of the landmark configuration is solved to disambiguate landmarks with multiple candidates.The following sections detail the training of the Random Forest classifiers and regressors, and the formulation of the MRF to perform the model matching.
Landmark candidate pre-filtering – Random Forest classification
We train the algorithm with a set of N training images or volumes I each containing corresponding landmark annotations. The annotations are coordinates of L landmarks of the anatomical structure in question, and are present in each of the training volumes. From the annotated examples we learn their local appearance for the classification and regression, and their spatial relations for the geometric model.First we train a classifier for the detection of landmark candidates. For each landmark l and each training volume I all voxels v within a small radius are considered as positive samples for the landmark l. This results in a classification task with L + 1 classes, with the background class consisting of a random subset of the voxels not labeled as a landmark.We capture the local image information at each voxel v through a vector of gray value differences between the voxel and F = 100 randomly chosen voxels in its vicinity, similar to Lepetit and Fua (2006) and Bergtholdt et al. (2010), resulting in a feature vector f. Using intensity differences to model local appearance aids with dealing with low-frequency brightness changes. The random offsets are taken from a Gaussian distribution with standard deviation σ = 10. A different set of F random offsets is used for each decision tree. We train the random forest using n = 32 extremely randomized trees as an L + 1 class classifier. We chose extremely randomized trees (Geurts et al., 2006) as they are similar to conventional random forests (Breiman, 2001) in performance and accuracy, while they are at the same time extremely fast to train and simple to implement.During localization each voxel in the target volume I is classified, and each tree votes for a class for each voxel in the test volume. Normalizing these votes yields L + 1 class conditional probability volumes P for the target volume. The training and prediction is performed on down-sampled volumes according to a downscaling factor δ, which brings considerable performance benefits for both training and localization (compare Fig. 5). We improve the localization accuracy in a following step – through the aggregation of candidate probabilities with Hough Regression Forests which are trained on the full resolution data.
Fig. 5
Left: Time required for the classification of one whole body CT depending on the chosen downscaling factor δ. Downscaling the volume is crucial to maintain reasonable run-times. Right: Comparison of the median landmark error and runtime for solving the MRF with landmark candidates generated by the proposed approach, by non-maxima suppression Bergtholdt et al., 2010, and by mean-shift Shotton et al., 2011 for whole body CT localization with δ = 0.2.
Accurate landmark candidates by probability aggregation – particle Hough Forests
The classification results indicate those regions of the image or volume that are likely to contain one of the landmarks. We want to condense this information into precise and discrete landmark candidate positions, weighted with the aggregated probability information. This is achieved by training Hough regressors (Gall and Lempitsky, 2009) for each landmark on the full-resolution data, which learn to vote for the exact position of a landmark for the voxels within a local neighborhood of that landmark. We extend this idea by using the Hough Forest in a particle-like probability aggregation.To train the Hough regressors, we obtain features/offset pairs as follows: For each landmark l we take voxels v around l from all training images. For every v we calculate the offset δx between its spatial position and the location of the corresponding landmark l: δx = (δx, δ y, δz); and we extract the local image features f at the position of v. Using features and offsets of a large number of voxels as training samples we then train a regressor – the Hough forest – that is predicting the relative position offset δx using features f. As the accuracy of this prediction decreases with increasing distance from l, the regressor is only trained within a local neighborhood of l, i.e. from small sub-volumes centered around each l.During localization, after the probability maps P have been computed by the classifier, the Hough forest shifts the positions of all voxels with probabilities above the threshold β = 0.5∗max(P). By updating the position of each of these voxels according to the regressor predictionit aggregates the probability mass in P towards specific positions in the target volume. Due to noise in the predictions and higher prediction accuracy once the voxels move closer to the true landmark location, this refinement is iteratively repeated n = 3 times to yield the final voxel positions. Each unique voxel position, i.e. each candidate c gets assigned the accumulated probability mass . Thereby the original probability mass of P is now highly localized. Finally, the number of candidates is reduced by non-maxima suppression with radius r = 10. The result of this refinement step through Hough regression is a set of highly accurate candidate positions for each landmark. Examples of the refined candidates are shown in Fig. 3b and c.
Fig. 3
(a) Topology of the MRF learnt from the 2D radiographs data set (Eqs. (4) and (5)). (b) For 3 landmarks the candidates and their weights are visualized (red is highest weight). (c) MRF disambiguation result for the entire landmark set (red circles). In addition all candidates are shown. To visualized the distribution the candidates are connected to the ground truth. (d) The normalized weights assigned to the candidates, which form the basis for the unary terms of the MRF, exhibit very fast drop-off, i.e. the are typically only 2–3 candidates of interest for each landmark, ensuring fast and robust MRF solutions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Model localization via graph matching
For most landmarks the previous step returns multiple candidate positions. To select a single candidate for each landmark, i.e. to obtain the localization result, we make use of the high level information encoded into a parts-based anatomy model.Having obtained the landmark candidates c for a given target volume, we perform the matching of the graph of model landmarks onto these candidates. The MRF’s task is to select, for each landmark, one of the landmark’s candidates, such that the overall probability of the match is maximized.While we could build a parts-based model using our anatomical knowledge, or relying on basic geometries arising, for example, from next neighbor triangulation, we choose to actually learn the MRF geometry from the annotated training data instead. We construct a Markov Random Field with L nodes , connected by E edges. The aggregated probabilities p(c) for each candidate c are used to derive the unary terms U, while the incorporation of geometric constraints is achieved through the binary terms B. The landmark candidates c corresponding to landmark l are the possible labels for node . The discrete optimization objective function for a match of the model to the target volume is thusThe MRF is defined by the unary and binary terms, and by its topology. In the following we will explain how to derive each of them.
Unary terms – aggregated candidate probabilities
Only the d = {1, … , D} candidates with the largest weights p(c) are considered for the MRF. Their weights are normalized such thatEach node of the MRF has at most D labels. The D × L unary terms U are thus set such that . Fig. 3(d) shows the typical, fast drop-off of the candidates’ weights.
Graph topology
The topology of the MRF is derived automatically from the training data based on the strength of the relationship between landmark positions. Looking at the variations of the geometric distances between the landmarks in the training data, differential entropy is used to estimate the reliability of one landmark’s position in predicting another’s position. The most predictive landmark pairs are then connected in the topology of the geometric model and thus the MRF. This topology estimation is performed once, during training.Given two landmarks s, t and their coordinates , in all the N training volumes, their offset distances are . We associate each with a Gaussian distribution and compute the differential entropyby integrating over all voxels v of the volume I, whereby and σ is empirically chosen (identical for all landmarks and data sets). We thus obtain a low entropy h when the landmarks s and t show little variance in their relative positions, or when these relative positions form clusters. The T smallest entropies inyield the indices of the landmarks s is connected to. This specifies the set of e ∈ {1, … , E = L · T} unique undirected edges of the geometric model and thus the MRF. An example for the derived topology is depicted in Fig. 3a. Alternatively, but not investigated in this work, T can be adapted for each landmark s individually by further investigating the distribution of the corresponding entropies h.
Binary terms – non-parametric geometric model constraints
The D × D × E binary terms B of the MRF relate the relative positions of the landmark candidates connected by edges to the spatial information contained in the training set. The distance between two candidates for the landmarks s and t of an edge 〈s, t〉, being , is compared to the N edges between s and t in the training set, each modeled by the distribution . The resulting value for the edge representing an edge 〈s, t〉 is thus given byEvaluating this measure for all edges which are part of the MRF’s topology defines the values of the binary terms
Solving the MRF
Having defined the unary and binary terms as well as the topology of the MRF, the MRF’s solution, the so called labeling, can now be obtained by optimizing Eq. (2):which assigns each of the L model nodes to one landmark candidate in the target image, matching the model to the target volume. This yields the final position estimates in the target image for each landmark. While this optimization is in general NP-hard several approximate algorithms exist to obtain a locally optimal solution. Due to the simple structure of the MRF employed in this approach (low number of nodes/landmarks, low number of labels/candidates per node with a quick decrease of confidence for unlikely labels and a topology with few edges) the choice of MRF solver (believe propagation, graph cuts, tree-reweighted message passing, dual decomposition) is of little importance. We employ loopy believe propagation (Wainwright and Jordan, 2008) throughout this work.
Experiments
Data sets
We evaluated the proposed approach on the three separate data sets shown in Fig. 1: (1) 20 hand radiographs, (2) 12 high resolution hand CTs and (3) 20 whole body CTs.
Fig. 1
Examples from the three data sets employed in this paper. (a) Hand radiographs, (b) high resolution hand CTs and (c) whole body CTs. The objective of the proposed method is to localize the depicted anatomical landmarks in an unseen target image or volume.
Data set 1: Hand Radiographs. N = 20 hand radiographs with an average size of 460 × 260 pixels with a resolution of 0.423 mm/pixel were annotated with L = 24 landmarks. The landmarks include the five finger tips, as well as the distal interphalangeal (DIP), proximal interphalangeal (PIP), metacarpophalangeal (MCP) and carpometacarpal (CMC) joints for each finger.Data set 2: Hand CTs. The 3D hand CTs have a voxel size of 0.5 mm × 0.5 mm × 0.66 mm resulting in an average size of 256 × 384 × 330 voxels. They are annotated with the same 24 landmarks as the hand radiographs, with three additional landmarks placed around the carpus at the radiocarpal, radioulnar, and ulnocarpal joints, totaling in L = 27.Data set 3: Whole body CTs. 57 landmarks were annotated for the first 20 whole body CTs from the Whole Body Morphometry Project (Mallinckrodt Institute of Radiology Washington University School of Medicine, 2010). The volumes have an average size of 512 × 512 × 1900 voxels, with a voxel size of 1.3 mm × 1.3 mm × 1 mm. The landmarks are distributed throughout the entire body, to be able to localize all major body parts. The landmarks in the whole body CT are depicted in Fig. 1c.
Setup
The experiments were run in a leave-one-out cross validation framework, training the Random Forest classifier and the L Hough Forests and the MRF topology on N − 1 images/volumes and performing the localization on the remaining image/ volume. The main measure of interest for each landmark is the residual distance between the position of the selected candidate and the corresponding ground truth. The Random Forest classifier and the regressors were trained using 32 extremely randomized trees, which were grown to their full depth. The number of connecting edges between landmarks in the MRF topology was set to L/10 for each data set. The downscaling factor δ was set to 0.5, 0.5 and 0.2 for the hand radiographs, the hand CTs and the whole body CTs, respectively. All other parameter settings are identical for the experiments on the three data sets.
Comparison with existing candidate generation approaches
Apart from investigating the accuracy of the overall landmark localization, we compared the influence of the proposed candidate probability aggregation (Hough regression) with two recently published approaches for candidate position estimation: mean-shift clustering and simple non-maxima suppression of the classification probabilities.
Mean-shift based interest point generation
Mean-shift (Cheng, 1995) is a method for the density estimation and cluster analysis of a set of points in a feature space, and was employed in Shotton et al. (2011) on classification probabilities to obtain landmark positions. Given a d-dimensional dataset D mean-shift iteratively moves each data point d towards the weighted mean of the data points, according to a specified kernel with bandwidth σ around d. The process is repeated until equilibrium is reached, i.e. when no data point is shifted anymore. The candidate positions for landmark l are thus estimated by employing mean-shift with a Gaussian kernel on a point set obtained from P through rejection sampling. As in the present approach, the aggregated probability mass in each cluster is assigned to the cluster center, i.e. the landmark candidate.
Interest points from local non-maxima suppression
As suggested in Bergtholdt et al. (2010), the candidates for each landmark class can be chosen by finding local maxima directly from the probability maps. For each probability map P, only points above a threshold T∗max(P) are considered. While (Bergtholdt et al., 2010) employs empirically estimated individual thresholds for every single landmark, we use T = 0.5 throughout our experiments. To further reduce the number of candidates, non-maxima suppression within the radius σ is performed. Note that, in contrast to our approach and the mean-shift approach, no aggregation of probabilities takes place.
Comparison with Random Fern Regression localization
We compare our approach with the method recently proposed in Pauly et al. (2011), which uses random regression ferns on binary Haar-like features to predict all landmark positions at once. Random ferns are similar to random forests in that they partition the input space into cells and record class membership histograms (classification) or linear models between input and output features (regression) of the contained data points. The partition is obtained by randomly splitting random projections of the input space n times, yielding cell per fern.The input features employed in Pauly et al. (2011) are three scales of binary Haar-like features, i.e. at each scale the mean intensity 26 boxes with random offsets and sizes are compared to the mean intensity of a box centered on the voxel in question. The target values for each voxel are its relative positions to all L landmarks that are to be predicted, leading to a n · L target vector δx. For each cell of each fern, a linear regression model is estimated linking the input and output spaces.During localization, the binary input feature vector is computed for each voxel v, the cell in fern f containing this data point is calculated and the linear model predicts the output feature vector . A weighted sum of these output vectors over all ferns, following Gaussian distributions modeling the input data points of each cell, yields δx. Voxel v thus estimates the landmarks to be found at x + δx. Integrating these predictions over the entire target volume yields the final landmark positions estimates x for each landmark. To allow for a fair comparison with the proposed method we refine these estimates locally: For each landmark l an additional random fern regressor is trained to predict δx in a local neighborhood of l, and x is refined accordingly. All parameters of the method were set to the values reported in Pauly et al. (2011).
Results
Candidate accuracy – comparison with Mean-Shift and Non-Maxima Suppression
The comparison of the three different strategies evaluated in this paper for the estimation of landmark candidates is shown in Fig. 5. Two important aspects in this evaluation are the time required to perform the Random Forest classification and to solve the Markov Random Field, as these are the dominating factors.Fig. 5 (right) shows the median localization residuals of the three landmark candidate estimation schemes Hough Regression, Mean-Shift and Non-Maxima Suppression on a leave-one-out run on the whole body CT data set, with a down-scaling factor of δ = 0.2. First, it can be observed how learnt aggregation in the Hough Regression approach (whose regressors are trained on full resolution data) is able to cope with the strongly down-sampled data used during classification. In contrast to this, Mean-shift and Non-Maxima Suppression cannot exploit this additional information and are thus adversely affected when limited to down-scaled image data.The iterative Hough predictions performed in our approach exhibit run-times very similar to mean-shift clustering, the main time difference between these methods stems from the time required to solve the MRF. The non-maxima suppression approach does not perform any candidate position refinement, leading to faster candidate generation times at the expense of poorer accuracy.The main influence on the MRF runtime comes from the number of candidates per landmark, which depends on the kernel bandwidth σ (mean-shift) and the non-maxima suppression radius σ, as detailed in Fig. 5 (right). Using a smaller σ yields more accurate landmark candidates, but their larger quantity results in considerably slower MRF inference. Albeit much less pronounced, a similar effect can be observed in the mean-shift data. At very small bandwidths (σ = 2) results are slightly worse, presumably due to the fact that the MRF solver gets stuck at a local optimum.In summary, the results show that the proposed Hough regression approach is able to yield highly accurate localization results while minimizing the runtime of both the classification and the MRF inference. This balance of accuracy and performance results from the landmark-specific aggregation functions which are learnt on the full resolution data.
Localization accuracy on the three data sets
The results of the leave-one-out evaluation of the landmark localization is presented in Table 1 and the visualization in Fig. 4 which shows the localization accuracy for each individual landmark in the three data sets and the two compared methods.
Table 1
Localization accuracy: Landmark localization error for Random Fern Regression and for the proposed approach.
Method
Data set
Residual in voxels
Residual in mm
p
Median
Mean
Std.
Median
Mean
Std.
Random Fern Regression
Hand radiographs
3.40
3.76
2.48
1.44
1.59
1.05
Hand CTs
4.42
7.43
10.08
2.60
4.16
5.15
Full body CTs
37.30
48.53
82.00
43.37
54.80
86.98
Pre-filtered Hough Regression
Hand Radiographs
2.00
2.34
1.95
0.80
0.99
0.82
<10−19
Hand CTs
2.23
2.61
1.96
1.19
1.45
1.13
<10−16
Full body CTs
2.23
4.34
11.96
2.71
5.25
15.08
<10−66
Fig. 4
Resulting localizations on the three data sets visualized on one image/volume, for Random Fern Regression Pauly et al., 2011 (a, c, and e) and the proposed pre-filtered Hough Forests (b, d, and f).
Data set 1: Hand Radiographs. For this 2D data set, both the median and mean residual are below 1 mm, at 0.8 mm and 0.99 mm respectively (median/mean error of 2.00/2.34 pixels). As we will see below, this level of accuracy is representative for the method, with the other data sets showing similar results. Fig. 4b shows the distribution of the localization residual over the individual landmarks. The localization of the CMC joints shows slightly elevated residuals in comparison to the other landmarks. Comparing their positions in Fig. 1 shows that the CMC landmarks are in a region of more cluttered appearance compared to the PIP and MCP joints. The highly symmetric structure of the more distal joints (in both x and y direction) allows the Random Forests to make more precise predictions from the features f. We also attribute the larger error in the CMC joints to the fact that the manual annotations are more difficult to perform. The overall localization is very robust, with only 12 outliers (out of L · N = 480 localizations) with a residual of more than 6 mm, i.e. 97.5% of the localized positions were within 6 mm of the ground truth position. No residual was larger than 14 mm and only 4 outliers’ residuals were larger than 10 mm.Data set 2: Hand CTs. The results for the 3D hand CT data set show similar characteristics as the 2D case (Fig. 4d). The median/mean residuals of 2.23 and 2.61 voxels correspond to 1.19 and 1.45 mm, respectively. Again we see the larger errors for the CMC joints and the additional three joints, which due to the lack of local scale image information are more difficult to localize (and to annotate). For the 24 landmarks with positions comparable to those in the 2D data, 97% of the localized positions were within 6 mm of the ground truth position, and only 5 outliers’ residuals were larger than 10 mm.Data set 3: Whole body CTs. This 3D data set was the most complex data set, with 57 landmarks. The median residual, 2.23 voxels, closely follows the trend set in the other two data sets. Due to a larger number of difficult to localize landmarks, the mean error for this data set is slightly higher at 4.34 voxels. These results correspond to 2.71 mm and 5.25 mm, respectively.Looking at the results presented in Fig. 4(f) in more detail, clear trends can be detected in the performance on different body parts. In the landmarks located in the hands motion artifacts in two of the volumes introduce a number of outliers. In the volumes without artifacts, the performance is on a par with the results in the above 2D/3D hand data evaluations. Mismatches between the 3rd and 4th toes can be observed in both feet, in addition to a single instance where a right foot was localized as being a left one. We attribute this to the fact that the geometric model, by connecting each landmark to several of its neighbors, in some cases is overpowered by (sets of) very high candidate probabilities – which can occur due to the high degree of left/ right similarity in the body.Putting these results into perspective, out of the 1140 individual landmarks localizations, only 12 outliers (1.05%) had errors > 30 mm. We consider the results on the three data sets to clearly demonstrate the ability of the proposed approach to find the landmark positions in the target volume with high accuracy, with the consistent localization of detailed anatomical structures with a median residual of ≈2 pixels/voxels, independent of the data set.
Comparison with Random Fern Regression localization
Fig. 4a, c, and e shows the results of performing random ferns based localization (Pauly et al., 2011) on the three data sets. In the 2D hand data set the method is able to localize the structure, with only minimal outliers, but overall the localized positions exhibit more deviation from the ground truth than in the proposed method. This is also reflected in the median/mean error of 3.40/3.76 pixels, corresponding to 1.44/1.59 mm. In the 3D case the localization residual increases considerably, with a median/mean error of 4.42/7.43 pixels, which is 2.60/4.16 mm. The localization error affects all parts of the structure equally, i.e. while the initial prediction of x does predict the right parts of the image it seems to fail to do so with high enough accuracy that the local refinement can have impact. This is even more pronounced in the case of the full body CTs (Fig. 4e). The size of the individual anatomical structures in a full body CT are, in relation to the size of the volume, comparatively small, putting an even higher burden on good initial position estimates. In this case the median/mean localization error rises to 37.30/48.53 pixels and 43.37/54.80 mm respectively.
Run times
The run-times of the proposed approach for the localization on three data sets were in the order of 20 s/90 s/120 s on a 2009 8-core Xeon MacPro. The implementation was performed in Matlab except for the C-based Random Forests and the loopy belief propagation MRF solver (Andres et al., 2010). Solving the MRF only takes <3 s, even for the whole body data set, i.e. the vast majority of runtime is spent on evaluation of the random forests. Recent advances in GPU-based Random Forests (Sharp, 2008) showed classification runtime for whole body CTs of about 5 s (Criminisi et al., 2010). By exploiting such an implementation in the two steps of low-resolution classification and subsequent refinement the runtime of the random forest evaluation of our approach could be lowered to below 5 s, resulting in a potential total runtime of less than 8 s including the time so solve the MRF.
Discussion
The experimental results demonstrate how a generic approach that performs global search for landmarks can yield high localization accuracy in a variety of medical imaging domains. Ensemble learning techniques enable the algorithm to obtain landmark candidates with sufficiently high specificity to allow for successfully disambiguation of locations via discrete optimization even on large scale data. The localization accuracy can be improved by Hough regression, and the MRF topology can be learnt from training data.The results clearly show that the spatial accuracy and low number of candidates stems from combined classification and regression. In addition, the incorporation of the geometric information into the MRF is crucial for the disambiguation between the landmark candidates.The distribution of the probability mass within the candidates depends on both the number of candidates and their specificity. A lower number of candidates is preferable, as it reduces the number of labels of the MRF, while a steep drop-off of the candidates’ weights enables the MRF to find the solution in a very short runtime.Since classification is performed on a per-voxel basis, computation time has a cubic relation to the downscaling factor δ, as measured for the whole body CT in Fig. 5 (left). Therefore, performing initial classification on a down-sampled volume, and then local regression refinement as learnt on full-resolution data results in highly accurate candidates while keeping computational costs low.The aggregation of the classification probabilities through a learnt regressor instead of mean-shift (or non-maxima suppression), allows to find more accurate candidates while reducing their number, resulting in faster MRF solution as shown in Fig. 5 (right).The comparison of the proposed approach with the localization method based on random ferns shows the importance of predicting first local model characteristics and to include global information at a later stage. In the random ferns approach, local image features vote for all landmarks in the volume, which are potentially far away from the voting voxel. Using local refinement is thus only successful when the initial prediction of the landmarks is close enough to the actual landmark position, i.e. within the capture range of the local refinement regressor.The proposed approach on the other hand initially only estimates local parts of the model, i.e. the candidates, and focuses on their spatial accuracy, low number and high specificity. Only then the global structure of the model is exploited to select from these candidates, advancing from the local to the global characteristics of the model.Approaches such as marginal space learning (Zheng et al., 2009; Kelm et al., 2010) always work with the full model and do not distinguish between local and global characteristics. They have so far been used for bounding box based localization of anatomical structures, and an extension to focus on individual anatomical landmarks is an alternative approach for future research.
Limitations
In contrast to Donner et al. (2010b), the model is not explicitly rotationally invariant. The geometric model proposed in that work employs a parametric model of spatial relations by encoding mean/standard deviations of both edge lengths and angles between local interest point orientations and neighboring landmarks. But several factors as well as the result from this work seem to indicate that rotation invariance might not be necessary in the medical context. First, the acquisition protocols for the individual organs/ modalities ensure very consistent image data with regard to rotations. Second, the relative angles between landmarks when using a non-parametric model and a sufficient number of training examples can describe variations in the data due to the superposition of example geometries. A similar argument is valid for the landmark classifier; in cases with small training sets but considerable amounts of rotation, the training set can be artificially inflated by randomly rotating and perturbing the training data.The algorithm was implemented on a CPU. Fast implementations of random forests on a GPU exist with speed-ups by factors up to 100 (Sharp, 2008), and we expect this to have a substantial impact on the run time of the algorithm, since the classification and regressions forms a large part of computation.Although it did not occur in the experiments it is possible that no landmark candidate sufficiently close to the true position can be found, and disambiguation can fail. To cope with this so-called dummy labels (Donner et al., 2007b) can be introduced in the MRF, that model absent landmarks.The proposed model does only use unary and binary terms. In principle higher-order terms can be used to capture spatial relationships among e.g., triplets of landmarks.
Conclusion and outlook
We present an approach for localizing complex, partly repetitive anatomical structures in 3D volumes, which yields highly accurate, robust results with fast run-times. Based on Random Forests for classification and Hough regression, the method detects landmark candidates in a target volume. The model of the anatomical structure is matched to these candidates by solving a Markov Random Field. The algorithm does not rely on predefined interest point detectors or a manually designed graph topology but learns both from the training data. It achieves high accuracy of the best candidate while at the same time keeping the number of candidates low. The crucial feature of the proposed global localization approach is the combination of Random Forest classifiers and the Hough regressors. They can be trained on full-resolution image data, while the Random Forest classifiers can be applied to strongly down-sampled volumes.Several areas of research seem promising for future work. In particular, the application of the approach to the localization of structures with occlusions due to the field of view or due to missing parts because of anatomical differences is an important issue for clinical practice. To reduce the amount of work required for the annotation of the training data, a scheme similar to the one presented in Donner et al. (2009) could enable the building of models from a single manual annotation. Similarly, building a unified model of the entire human anatomy would allow a matching of the model to any given depicted region. Other variants of Random Forests such as the recently introduced Oblique Random Forests (Menze et al., 2011) should be investigated. Their oblique splits allow for a higher adaptivity to the data and a better class separation, as well as for correlated features and random offset and scaling that occurs as noise in between observations. The learning paradigm for feature detectors may be particularly helpful for a transfer to MR image data, where such artifacts occur as a result of local and global field inhomogeneities. As a next step, we will evaluate the method on a substantially larger data set.
Authors: Olivier Pauly; Ben Glocker; Antonio Criminisi; Diana Mateus; Axel Martinez Möller; Stephan Nekolla; Nassir Navab Journal: Med Image Comput Comput Assist Interv Date: 2011
Authors: René Donner; Branislav Micusik; Georg Langs; Lech Szumilas; Philipp Peloschek; Klaus Friedrich; Horst Bischof Journal: Med Image Comput Comput Assist Interv Date: 2007
Authors: Negar Fakhrai; Peter Widhalm; Catharina Chiari; Michael Weber; Georg Langs; René Donner; Helmut Ringl; Marion Jantsch; Philipp Peloschek Journal: Eur J Radiol Date: 2009-03-13 Impact factor: 3.528
Authors: Philipp Peloschek; Georg Langs; Michael Weber; Johannes Sailer; Michael Reisegger; Herwig Imhof; Horst Bischof; Franz Kainberger Journal: Radiology Date: 2007-10-19 Impact factor: 11.105
Authors: Bulat Ibragimov; Jerry L Prince; Emi Z Murano; Jonghye Woo; Maureen Stone; Boštjan Likar; Franjo Pernuš; Tomaž Vrtovec Journal: Med Image Anal Date: 2014-11-23 Impact factor: 8.545