Literature DB >> 36170279

A semi-supervised learning approach for automated 3D cephalometric landmark identification using computed tomography.

Hye Sun Yun¹, Chang Min Hyun¹, Seong Hyeon Baek¹, Sang-Hwy Lee², Jin Keun Seo¹.

Abstract

Identification of 3D cephalometric landmarks that serve as proxy to the shape of human skull is the fundamental step in cephalometric analysis. Since manual landmarking from 3D computed tomography (CT) images is a cumbersome task even for the trained experts, automatic 3D landmark detection system is in a great need. Recently, automatic landmarking of 2D cephalograms using deep learning (DL) has achieved great success, but 3D landmarking for more than 80 landmarks has not yet reached a satisfactory level, because of the factors hindering machine learning such as the high dimensionality of the input data and limited amount of training data due to the ethical restrictions on the use of medical data. This paper presents a semi-supervised DL method for 3D landmarking that takes advantage of anonymized landmark dataset with paired CT data being removed. The proposed method first detects a small number of easy-to-find reference landmarks, then uses them to provide a rough estimation of the all landmarks by utilizing the low dimensional representation learned by variational autoencoder (VAE). The anonymized landmark dataset is used for training the VAE. Finally, coarse-to-fine detection is applied to the small bounding box provided by rough estimation, using separate strategies suitable for the mandible and the cranium. For mandibular landmarks, patch-based 3D CNN is applied to the segmented image of the mandible (separated from the maxilla), in order to capture 3D morphological features of mandible associated with the landmarks. We detect 6 landmarks around the condyle all at once rather than one by one, because they are closely related to each other. For cranial landmarks, we again use the VAE-based latent representation for more accurate annotation. In our experiment, the proposed method achieved a mean detection error of 2.88 mm for 90 landmarks using only 15 paired training data.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36170279 PMCID： PMC9518928 DOI： 10.1371/journal.pone.0275114

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.752

1 Introduction

Cephalometric analysis is commonly used by dentists, orthodontists, and oral and maxillofacial surgeons to provide morphometrical guidelines for diagnosis, surgical planning, growth analysis, and treatment planning by analyzing dental and skeletal relationships in the craniofacial complex [1]. It is based on cephalometric landmarks, which serve as proxy to the skull morphological data pertaining to craniofacial characteristics [2]. Conventional cephalometric analysis uses two-dimensional (2D) cephalometric radiographs (lateral and frontal radiographs), which have drawbacks including geometric distortions, superimpositions, and the dependence on correct head positioning [3]. Due to recent advances in image processing techniques and the need for accurate craniofacial analysis, a three-dimensional (3D) approach to the cephalometric landmarks obtaining 3D computerized tomography (CT) images is gaining preference over the conventional 2D techniques [4-6]. Recently, there have been many studies conducted on automated cephalometric landmark identification that aims to find the landmarks and enable immediate cephalometric analysis, because manual landmarking and cephalometric analysis are labor-intensive and cumbersome tasks even for the trained experts. Due to recent advances in deep learning techniques, the automated annotation of 2D cephalometric landmarks may now be used for clinical application [7, 8]. Conversely, automated 3D cephalometric tracing (for 90 landmarks) may not yet be utilized in clinical applications, wherein the required average error is commonly designated to be less than 2 mm [9-13]. The high dimensionality of the input data (e.g., 512 × 512 × 512) and limited number of training data are the main factors that hinder the training of deep learning networks for learning the 3D landmark positional vectors from 3D CT data. Moreover, due to the current legal and ethical restrictions on medical data, it is very difficult to utilize CT data from patients. To overcome the above-mentioned learning problems caused by the high input dimensions and training data deficiencies, the method proposed in this study utilizes semi-supervised learning that takes advantage of a large number of anonymized landmark dataset (without using the corresponding CT dataset) which have been used in surgical planning and treatment evaluation. We use these landmark dataset to obtain their low dimensional representations, reducing the dimensions of the total landmark vectors (270 = 90 × 3 dimension) to only 9 latent variables via a variational autoencoder (VAE) [14]. For training the VAE, a normalized landmark dataset is used to efficiently learn skull shape variations while ignoring unnecessary scaling factors. With this dimensionality reduction technique, the positions of all 90 landmarks can be roughly estimated by identifying a small number of easy-to-find reference landmarks (10 landmarks), which can be accurately and reliably identified via a simple deep learning method [11]. The rough estimation of all landmarks is used to provide a small 3D bounding box for each landmark in the 3D CT images. Following this, we apply convolutional neural networks (CNNs) to these small bounding boxes to enable the accurate placement of landmarks. Our fine detection strategy is divided into two parts; mandible and cranium. It is desirable to accurately capture the morphological variability of the mandible because the shape of the mandible can be affected by a variety of factors, including the masticatory occlusal force, muscular force, functional activity such as breathing and swallowing, and age [15]. Noting that landmarks on the mandible represent morphological features of a 3D mandibular surface geometry, we apply 3D CNN to a segmented image of the mandible (separated from the cranium). We follow a recent study [16] for a segmentation method to separate the mandible from the cranium. Because several landmarks around the condyle are closely related to each other, it is better to detect these landmarks all at once. For the landmarks on the midsagittal plane, it is better to further reduce the dimensionality of the input by using a partially integrated 2D image of the midsagittal plane. For the remaining landmarks lying on the cranium, we again use the anonymized landmark dataset to obtain a more accurate latent representation of all landmarks on the cranium, due to its rigidity. The proposed approach achieved a mean detection error of 2.88 mm for 90 landmarks, which nearly meets the clinically acceptable precision standard. It should be emphasized that this accuracy has been achieved using a very small amount of training data.

2 Method

We begin by introducing the following notations. Five easy-to-find reference landmarks (CFM, Bregma, Na, and Po (L/R)) are used as the basis for constructing a coordinate system to determine the midsagittal and axial planes, and they are utilized for data normalization (methods for obtaining these five reference landmarks will be described in Section 2.1). denotes a 3D CT image, which is defined on a voxel grid Ω ≔ {v = (v1, v2, v3):v = 1, ⋯, 512 for j = 1, 2, 3}. Here, we set v1 as the normal direction of the midsagittal plane. denotes a binarized CT image of (i.e., skull segmentation from the CT image), defined by where ρ is a thresholding value. In our experiment, the value of ρ was consistently chosen as ρ = 500HU, which is known as an effective choice for thresholding-based bone segmentation [17]. mid denotes a partially integrated 2D image of in the normal direction of the midsagittal plane, defined by where [a, b] determines the truncated volume of . and denote the concatenated vectors of 46 cranial and 44 mandibular 3D landmarks, respectively. The entirety of the landmarks is defined by . See S1 Table for more detailed information of the landmarks. denotes a concatenated vector of the landmarks (Bregma, CFM, Na, ANS, Or (L/R), and Po (L/R)) in the cranium and denotes a concatenated vector of the landmarks (MF (L/R)) in the mandible. A reference landmark vector is defined by . We mention here details of reference landmarks; Bregma is the point of junction of the coronal and sagittal sutures of the skull. (i) CFM, (ii) Na, (iii) ANS, (iv) Or, (v) Po, and (vi) MF are the abbrevations of (i) the center of foramen magnum, (ii) nasion, (iii) anterior nasal spine, (iv) orbitale, (v) porion, and (vi) mental foramen, which are defined by (i) the center of an opening for spinal cord, (ii) the center of the midline bony depression between the eyes, where the frontal and two nasal bones meet, just below the glabella, (iii) the projection formed by the fusion of the two maxillary bones at the intermaxillary suture, (iv) lower most point on the lower margin of the left or right orbit, (v) the most superior point of the upper margin of each ear canal, and (vi) a bilateral opening in the vestibular portion of the mandible through which nerve endings, such as the mental nerve, emerge. See Fig 1.

Fig 1

Reference landmarks.

These are easy-to-find through CNN with input of the illuminated images because they have strong geometric cues that can be revealed in illuminated 2D images.

Reference landmarks.

These are easy-to-find through CNN with input of the illuminated images because they have strong geometric cues that can be revealed in illuminated 2D images. The 3D cephalometric landmarking aims to develop a function that maps a 3D CT image to all landmarks . To learn the landmark detection map f, deep learning techniques can be used. Unfortunately, due to the legal and ethical restrictions on medical data, a few paired data are available. This severe shortage of paired data makes it difficult to obtain an accurate and reliable map in the following supervised learning framework: where N is the small number of paired training data, is a paired dataset, is a deep learning network, and ‖ ⋅ ‖ is the standard Euclidean norm. In our study, only 15 paired data are available (i.e., N = 15). Even with a certain amount of paired data, the learning process (3) of the direct detection map f can be difficult because the dimension of the input image is very large (greater than 108). The proposed method attempts to address this problem by taking advantage of a semi-supervised learning framework that permits the utilization of the N number of anonymized landmark data whose corresponding CT data are not provided. In this research, specifically, 229 anonymized data (i.e., N = 229) are utilized. As shown in Fig 2, the proposed method comprises the following three main steps: (i) To obtain easy-to-find reference landmarks , we apply CNN with 2D illuminated images generated from a binarized CT image and normalize the output with respect to the cranial volume. (ii) A rough estimation of entire landmarks is obtained using the partial knowledge and a VAE-based low dimensional representation of . (iii) Using this estimation, coarse-to-fine detection for is conducted, wherein separate strategies are utilized for the mandibular and cranial landmarks. For the mandibular landmarks, the landmarks are accurately identified by applying 3D patch-based CNNs to capture the morphological features on a 3D surface geometry associated with the landmarks, wherein an input patch is extracted based on the coarse estimation. For cranial landmarks, we first detect three landmarks lying on the midsagittal plane by applying a 2D CNN whose input is an extracted 2D patch from a partially integrated image mid in basis of the coarse estimation. By utilizing the three finely-detected landmarks and cranial reference landmarks as the partial information of , the remaining cranial landmarks are finely annotated via a VAE-based local-to-global estimation utilizing the same method in the previous step.

Fig 2

Schematic diagram of the proposed method for the 3D landmark annotation system.

Each of these steps is described in detail as follows.

2.1 Detection of easy-to-find reference landmarks and uniform scaling for skull normalization with respect to the cranial volume

The first step of the proposed method is to find 10 reference landmarks from a given . Initially, a CT image is converted into a binarized image by (1). From , 2D illuminated images are generated by manipulating various lighting and viewing directions (see Fig 1). By applying VGGNet [18] to these illuminated images, the reference landmarks are accurately and automatically identified. This detection method is based on that presented in the recent study [11]. Using these identified reference landmarks, data normalization is conducted for efficient feature learning of skull shape variations in further steps. By applying a uniform scaling with respect to the cranial volume, the landmark vector is normalized, wherein the cranial volume is defined via a product of the distance between the v1-coordinate of Po (L) and Po (R) (cranial length), the distance between the v2-coordinate of Po (L) and Na (depth), and the distance between the v3-coordinate of CFM and Bregma (height). This data normalization minimizes the positional dependencies of the landmarks on the translation, rotation, and overall size of the skull; therefore, shape information of the skull (regarding facial deformities) can be effectively learned in further VAE-based steps. From here on, we will denote all landmark vectors as normalized vectors (e.g., and are normalized vectors for total landmarks and reference landmarks).

2.2 Rough estimation of all landmarks from reference landmarks using VAE-based low dimensional representation

This subsection provides a method for roughly estimating all landmarks from the reference landmarks that are accurately annotated in the previous step. Based on the method in [13], we build a bridge that connects and by taking advantage of a low-dimensional representation of learned by the variational autoencoder (VAE) [14]. The rough estimation obtained from , denoted by , is given by where is a local-to-global landmark estimation map as described in Fig 3. The map is constructed as follows: First, we train VAE that consists of an encoder () and a decoder (), to learn low dimensional representation of . Afterwards, we train the nonlinear map Φ that provides so that .

Fig 3

Initial estimation of all 90 landmarks using the knowledge of 10 reference landmarks .

This is possible because all landmarks can be roughly represented by only 9 latent variables.

Initial estimation of all 90 landmarks using the knowledge of 10 reference landmarks .

This is possible because all landmarks can be roughly represented by only 9 latent variables. Specifically, the VAE learns an encoder and a decoder , where is a d-dimensional latent variable (d ≪ 270). The maps and learn landmark patterns by leveraging dataset that consists of the unpaired landmark dataset as well as the paired dataset. The training is achieved via the following energy minimization sense: where N = N + N is the total number of training landmark data, is a class of functions in the form of a given VAE network, is a d-dimensional normal distribution with a mean μ( and a diagonal covariance matrix Σ( = diag((σ((1))2, ⋯, (σ((d))2), is a standard normal distribution, and the last term in the loss function is the Kullback-Leibler (KL) divergence defined by Here, μ( = (μ((1), ⋯, μ((d)) and σ( = (σ((1), ⋯, σ((d)) are the mean and standard deviation vectors obtained in the interim of the encoding process of an i-th training data (i.e., ). The encoder can be expressed in the following nondeterministic form: where hnoise is a noise sampled from , ⊙ is the Hadamard product (i.e., element-wise product), and vectors μ and σ are given by Here, the matrices represent fully-connected layers and ReLU is an element-wise activation function defined by ReLU(t) = max(t, 0). The decoder is the reverse process of the encoder , which can be represented by where the matrices {D1, D2, D3, D4} represent fully-connected layers. The detailed network architecture is described in the red box of Fig 3. After pretraining the VAE, the nonlinear map is learned, which connects reference landmarks with a latent variable . The architecture of the map Φ is a fully-connected neural network as described in Fig 3. The training is achieved by the following minimization sense: Here, we remark that Φ relies on the pretrained VAE, whose encoder () is used only for training and decoder () is the component of the local-to-global detection map. The resultant map estimates all landmarks from the partial knowledge , based on the learned patterns of landmarks by VAE.

2.3 Coarse-to-fine detection

This subsection explains coarse-to-fine detection obtained using the initial estimation . We put a final touch on by utilizing CT image data. The coarse-to-fine detection is based on suitable strategies that rely on the landmark locations (i.e. on the mandible or cranium). The details are explained in the following subsections.

2.3.1 Mandible-cranium segmentation

In the binarized skull image , we segment the mandible and cranium separately using the connected component labeling (CCL) technique [19]. Among all connected components generated from the CCL method, the largest component and the second largest are the cranium and the mandible respectively. The segmented cranium and mandible images are denoted as and (as shown in Fig 2). Using these images and the rough estimation , the following fine detection processes are conducted.

2.3.2 Detection of mandibular landmarks

For the landmarks on the mandible being articulated to the skull, a patch-based 3D CNN is applied to capture the morphological variability of the 3D mandibular surface geometry associated with the landmarks. Let be a roughly estimated position of a landmark with index j in . See S1 Table for the details of the landmark index. For each mandibular landmark (i.e., j ∈ {49, ⋯, 90}), we extract a 3D image patch that is defined by a cube with edge length of η and center of . By using 3D CNN, we obtain a map , where is an accurate positional estimation for the landmark with index j (i.e., ). To learn the fine detection map , we generate training dataset by using the paired dataset : For a given landmark index j, we generate The 3D CNN is trained by the dataset in the following sense: In practice, data augmentation using translation and horizontal flip is applied. The architecture of the 3D CNN is a modified version of VGGNet [18], which is described in Fig 4.

Fig 4

Mandibular landmarks detection.

Mandibular landmarks detection.

Patch-based 3D CNN is applied to the segmented image of the mandible (separated from the maxilla), in order to capture 3D morphological features of mandible associated with the landmarks. For six landmarks on condyle, we detect them all at once, instead of one by one, because they are positionally related to each other. In practice, several landmarks are identified in a group at once. We simultaneously identify six landmarks on the condyle (COR, MCP, LCP, Cp, Ct-in, and Ct-out) that are positionally related to one another, as well as landmarks with bilaterality (e.g. left/right mandibular foramen) that are associated with the symmetric structure of the mandible. For this group detection, we construct a 3D CNN to produce a concatenated vector of all landmark positions on the same group from one 3D image patch. Here, COR is the abbreviation of coronoid point, defined by the anterior process of the superior border of the ramus of the mandible. (i) MCP, (ii) LCP, (iii) Cp, (iv) Ct-in, and (v) Ct-out are the abbreviations of (i) medial condylar point, (ii) lateral condylar point, (iii) posterior condylar point, (iv) medial temporal condylar point, and (v) lateral temporal condylar point.

2.3.3 Detection of cranial landmarks

Landmarks on the cranium that demonstrates rigidity have less variability between subjects. According to [13], cranial landmarks have smaller variance compared to mandibular landmarks with the normalization presented in Section 2.1. Moreover, our empirical experiment shown in Fig 7 demonstrates that the rough local-to-global estimation achieved using the VAE-based low dimensional representation provides more accurate annotations for cranial landmarks. Therefore, we again utilize a VAE-based low dimensional representation in the same manner as in Section 2.2 by using only the cranial landmarks . To increase the detection accuracy, we enrich the partial knowledge of by accurately detecting three additional cranial landmarks lying near the midsagittal plane (MxDML, Od, and PNS). Here, (i) MxDML, (ii) Od, (iii) PNS are the abbrevation of (i) maxillary dental midline, (ii) odontoid process, and (iii) posterior nasal spine, which are defined by (i) the midsagittal line and point of the maxillary central incisors, usually defined by the junctional line and point of right and left incisal edge and medial surface on maxillary central incisors, (ii) a protuberance (process or projection) of the axis (second cervical vertebra), and (iii) a part of the horizontal plate of the palatine bone of the skull. The overall process is illustrated in Fig 5.

Fig 7

VAE-based local-to-global estimation errors (mm) of 90 landmarks over 9 test data.

Fig 5

3D cranial landmark detection using VAE-based low dimensional representation combined with easy-to-find landmarks.

Here, the entire cranial landmarks are estimated directly from the knowledge of the reference landmarks and three landmarks on midsagittal plane that are obtained by 2D CNN.

3D cranial landmark detection using VAE-based low dimensional representation combined with easy-to-find landmarks.

Here, the entire cranial landmarks are estimated directly from the knowledge of the reference landmarks and three landmarks on midsagittal plane that are obtained by 2D CNN. First, we compute a partially integrated image mid from using (2) so that the center of the truncated volume of lies on the midsagittal plane. Next, a 2D patch is extracted, which is defined by a square with edge length of η and center of . Here, is a vector eliminating the v1 component in the and j ∈ {24, 25, 26}. Using a 2D CNN, we learn a function that infers an accurate position of a landmark in v2- and v3-coordinates () from the 2D image patch . The landmark position in the v1-coordinate is determined by the location of the midsagittal plane. In the similar manner as in (11), the following training dataset is generated: With the training dataset, the 2D CNN is trained as follows: Here, data augmentation using translation is applied. The architecture of the 2D CNN is modified from VGGNet [18], as illustrated in Fig 4. Let be a concatenated positional vector with cranial reference landmarks and three finely detected landmarks obtained by . Using this partial knowledge , we find accurate cranial landmark positions via where is a nonlinear map and is a decoder of VAE. Here, is a -dimensional latent variable given by and is an encoder of VAE. The maps and are trained in the same manner of the method presented in Section 2.2 using cranial landmarks . The detailed architectures of and are illustrated in Fig 5.

3 Result

3.1 Dataset and experimental settings

Our experiment used a dataset containing 24 paired data (multi-detector CT images and landmark data) and 229 anonymized landmark data. This dataset was provided by Yonsei University, Seoul, Korea. The paired dataset was obtained from normal Korean adult volunteers (9 males and 15 females; 24.22±2.91 years old) with skeletal class I occlusion and was approved by the local ethics committee of the Dental College Hospital, Yonsei University (IRB number: 2–2009-0026). All informed consents were obtained from each subject. Among the 24 paired data, we used 15 data pairs for training (i.e., N = 15) and 9 data pairs for testing. The anonymized landmark dataset with 3D landmark coordinates was acquired in an excel format from 229 anonymized subjects with dentofacial deformities and malocclusions (i.e., N = 229). Manual landmarking for both dataset was performed by one of the authors (S.-H. Lee) who is an expert in 3D cephalometry with more than 20 years of experience. Our deep learning method was implemented with Pytorch [20] in a computer system with 4 GPUs (GeForce RTX 1080 Ti), two Intel(R) Xeon(R) CPU E5–2630 v4, and 128GB DDR4 RAM. In the training process, the Adam optimizer [21] was consistently adopted, which is known as an effective adaptive gradient descent method. In our experiment, optimal values of all learning parameters (epoch and learning rate) were empirically selected via cross validation. For image-based methods, 15-fold cross validation was applied, where 15 paired training data were split into 1-fold for validation and the others for training. For the training of VAE parts, 5-fold cross-validation was applied to 244 training data (229 unpaired and 15 paired ones). Fold values were empirically selected, depending on the amount of available training data. The nonlinear map Φ in (10) can be trained by pairs of and , where is an estimated vector of 10 reference landmarks in the first step and is a latent variable obtained by . Here, is the encoder of the pretrained VAE and is the corresponding global landmark. Due to the limited number of CT data, only 15 outputs were provided in the first-step. Hence, an additional dataset was generated from the unpaired landmark dataset, where and ( are given by and . Here, denotes a subsampling operator of 10 reference landmarks. This dataset was used in our implementation. Likewise, the map Φ in (15) was trained as the same manner. For quantitative evaluation, we used mean detection error (MDE) computed as follows: Let be a set of N landmarks output to be evaluated. The MDE is computed by where is the corresponding ground truth for and is defined by Here, is the vector corresponding to j-th landmark in , N is the number of landmarks contained in (e.g., N = 90 for entire global landmarks), and is given by where denotes the v-coordinate of .

3.2 Results of reference landmark detection

The detection of the 10 reference landmarks () provided very accurate and robust results (see Table 1 and Fig 6). These results almost meet clinical requirements, while the intra-observer repeatability has a precision of less than 1 mm and the overall median inter-observer precision is approximately 2 mm in the 3D landmarking system [22].

Table 1

Mean detection error for 10 reference landmarks.

Most of the landmarks are annotated almost within clinical requirements.

Landmark	Mean (mm)	Landmark	Mean (mm)
ANS	1.2	Na	1.7
Bregma	1.9	Or (R)	1.3
CFM	2.32	Po (R)	1.63
Or (L)	1.6	MF (L)	1.96
Po (L)	2.21	MF (R)	1.72

Fig 6

Final localization errors (mm) of 90 cephalometric landmarks after coarse-to-find detection over 9 test data.

Blue and brown dots denote errors for cranial and mandibular landmarks, respectively. Red, blue, and brown lines represent average error over all, cranial, and mandibular landmarks.

Final localization errors (mm) of 90 cephalometric landmarks after coarse-to-find detection over 9 test data.

Blue and brown dots denote errors for cranial and mandibular landmarks, respectively. Red, blue, and brown lines represent average error over all, cranial, and mandibular landmarks.

Mean detection error for 10 reference landmarks.

Most of the landmarks are annotated almost within clinical requirements. By using reference landmarks, we normalized the landmark data via uniform scaling by fixing the cranial volume of each subject as the average value of the cranial volume for the training dataset.

3.3 Results of VAE

For the local-to-global detection, the VAE was trained using 45000 epochs, a full batch-size, and a learning rate of 0.001. Here, the full batch-size indicates that our dataset was not divided into several batches in the training process. These learning parameters were empirically chosen by comparing validation errors, which were obtained by varying parameters when training VAEs. To investigate the effect on the dimension of the latent space, we trained VAE with varying the latent space size. The latent dimension is preferred to be as small as possible compared to that of the vector of reference landmarks () as well as global landmarks (). Taking this into account, the latent dimension (9) and epochs (45000) were chosen as empirical optimal values based on the validation error. Table 2 shows the variation of the averaged test error for the epoch and latent space dimension. The error tendency for the test set was almost the same as that for the validation set.

Table 2

Mean detector error (mm) of VAE over 9 test data.

This is obtained by varying the number of epochs and latent space dimension.

dim/epoch	35000	40000	45000	50000	55000
3	4.93	5.11	5.20	5.19	5.23
5	4.18	4.29	4.26	4.41	4.51
7	3.23	3.32	3.35	3.41	3.36
9	3.04	3.27	3.06	3.21	3.18
11	3.55	3.34	3.30	3.23	3.27
13	3.09	3.19	3.19	3.10	3.15
15	3.12	3.16	3.14	3.00	3.08

Mean detector error (mm) of VAE over 9 test data.

This is obtained by varying the number of epochs and latent space dimension. The averaged representation errors of VAE for 9 test data were 2.89 mm, 3.11 mm, and 3.06 mm for the cranial, mandibular, and all landmarks.

3.4 Results of the initial local-to-global detection

The nonlinear map Φ was trained with 5400 epochs, a full batch-size, and a learning rate of 0.0001. For each landmark, Fig 7 shows the performance evaluation achieved using 9 test data with respect to the averaged error in the sense of (17). The mean detection error was 3.27 mm for the cranial landmarks, 3.90 mm for the mandibular landmarks, and 3.59 mm for all landmarks. The error of the cranial landmark estimation was much smaller than that of the mandibular landmark estimation.

VAE-based local-to-global estimation errors (mm) of 90 landmarks over 9 test data.

From 10 reference landmarks obtained in the first step, 90 landmarks are roughly estimated. Blue and brown dots denote errors for the cranial and mandibular landmarks, respectively. Red, blue, and brown lines represent average error over all, cranial, and mandibular landmarks. The reference landmark detection outputs were selected as the estimation results instead of VAE outputs. This is because the 2D CNN is specially designed to detect the reference landmarks that are placed on a point with a morphologically distinct feature, whereas the VAE-based estimation focuses on capturing global landmark patterns within acceptable tolerance rather than on accurately detecting the specific landmarks. This also applied for the 2D CNN-based cranial landmark detection (in Section 2.3.3).

3.5 Result for coarse-to-fine detection

3.5.1 Mandibular landmark detection

For fine detection of the mandibular landmarks, 3D image patches were extracted with size of 80 × 80 × 80 voxels (≈ 4 × 4 × 4 cm3). To generate the training data in (11), the center location of patch was varied to cover 2 times the maximum error of the initial estimation of for the training data. Using the parameters of 20000 epochs, a full batch size, and a learning rate of 0.0005, nine 3D CNNs were trained. Figs 6 and 8(b) show the quantitative and qualitative results of the 3D CNNs. The mean distance error decreased to 2.68 mm from the initial detection error of 3.90 mm. The proposed method achieved an error range of 1 to 4 mm for the detection of most landmarks. In addition, as shown in Fig 10(b), the proposed method significantly reduced the mean and variance of error for the test subjects, compared to the initial detection.

Fig 8

Qualitative evaluation of detection for (a) cranial landmarks and (b) mandibular landmarks.

The red and green dots denote the ground truth and detected output landmarks respectively.

Qualitative evaluation of detection for (a) cranial landmarks and (b) mandibular landmarks.

The red and green dots denote the ground truth and detected output landmarks respectively.

3.5.2 Cranial landmark detection

To generate the partially integrated image mid, we set the interval for the truncated volume as ± 7.5 mm v1-directionally from the midsagittal plane. Next, 2D image patches were cropped into sizes of 80 × 80 pixels (≈ 4 × 4 cm2). For training the 2D CNNs, we used the learning parameters of 5000 epochs, a full batch-size, and a learning rate of 0.0001. In Fig 9 and Table 3, qualitative and quantitative evaluations of the 2D CNN-based detection of three cranial landmarks on the midsagittal plane are provided. The detection achieved relatively accurate annotation on the three target landmarks.

Fig 9

Results of coarse-to-fine landmark detection on 2D patch.

Yellow dot is the output of the coarsely detected VAE. Green dot is the output of detection using patch-based CNN. Red dot is the ground truth.

Table 3

Error evaluation of the landmarks on the midsagittal plane.

Initial error and 2D CNN error are presented in the table. The errors are reduced after the 2D CNN is applied.

Landmark name	Error before 2D CNN (mm)	Error after 2D CNN (mm)
MxDML	2.81	1.18
Od	3.85	2.65
PNS	2.70	1.49

Results of coarse-to-fine landmark detection on 2D patch.

Yellow dot is the output of the coarsely detected VAE. Green dot is the output of detection using patch-based CNN. Red dot is the ground truth.

Error evaluation of the landmarks on the midsagittal plane.

Initial error and 2D CNN error are presented in the table. The errors are reduced after the 2D CNN is applied. For the estimation of all cranial landmarks, VAE was trained with 80000 epochs, a full batch, and a learning rate of 0.001. The map Φcr was trained with 23000 epochs, a full batch-size, and a learning rate of 0.0001. The latent dimension was empirically set to 15. Figs 6 and 8(a) show the final cranial landmark estimation results in quantitative and qualitative formats, respectively. The mean detection error for all cranial landmarks was 3.08 mm, decreasing from the initial estimation error of 3.27 mm (Fig 10(a)). The error for most cranial landmarks fell within the range of 1 to 4 mm.

Fig 10

The mean detection error of the cranial, mandibular, and all landmarks for each of the 9 test data.

The mean detection error of the cranial, mandibular, and all landmarks for each of the 9 test data.

Blue bars represent the error from the initial estimation, red bars represent the error from the final estimation, and the lines represent the averaged error over test subjects. (a) Cranial Landmarks. (b) Mandibular Landmarks. (c) Entire Landmarks. In terms of all landmarks, as described in Fig 10(c), our proposed method achieved an error of 2.88 mm (Fig 6), which is much lower than the initial detection error of 3.59 mm (Fig 7).

4 Discussion and conclusion

This article proposes a fully automatic landmarking system for 3D cephalometry in 3D CT. The proposed method provides the accurate and reliable identification of cephalometric landmarks that can be used in subsequent clinical studies, such as in the development of morphometrical guidelines for diagnosis, surgical planning, and the treatment of craniofacial diseases. The proposed semi-supervised method is designed to use many anonymized landmark dataset to address the severe shortage of training CT data. Currently, only 24 CT data pairs are available due to the legal and ethical restrictions on medical data, while approximately 200 anonymized landmark data are available. The proposed method is based on the benchmark model [13], which provides 3.63 mm error for annotation of 90 landmarks. This model motivated the backbone structure of the coarse estimation step. The proposed method reduces the average detection error from 3.63 mm to 2.88 mm by employing the coarse-to-fine detection, where appropriate strategies for mandibular and cranial landmarks were considered for their different properties. We expect the detection accuracy to be further improved with increasing amount of available training data. While it may be possible to directly learn the map from the partial knowledge to the global landmarks, the use of VAE is an effective approach for obtaining a meaningful latent representation in terms of skull morphology while learning the local-to-global estimation map. Human skull morphology follows certain patterns and the positions of landmarks are closely interrelated. A previous study [13] provided empirical evidence that VAE can learn a low-dimensional representation that is strongly associated with the factors determining facial skeletal morphology. It is also a well-known advantage of VAE that the learned latent space is dense and smooth [13, 14, 23]. Hence, it is expected that the VAE-based local-to-global estimation map not only provides the connection between partial and global landmarks but also follows the merits of VAE for latent representation. Landmarks on the cranium have smaller variability between subjects compared to those on mandible due to the rigid property of the cranium; therefore, cranial landmarks are relatively suitable for the effective estimation in the sense of finding certain common patterns over the training dataset. Moreover, the use of 3D CNN-based fine annotation for all landmarks requires high computational memory consumption and power budget due to the increased use of 3D networks. Hence, the VAE-based approach can be regarded as an effective strategy to finely detect cranial landmarks with a sufficient level of accuracy. Meanwhile, the positional estimation of the summit position of the cranium (SC) obtained from the relation learned via VAE exhibited the lowest accuracy (see Fig 6). This appears to have occurred because the SC may weakly depend on the positions of other landmarks. A rigorous factor analysis using VAE may be undertaken in future research. The proposed method has the potential to alleviate the experts’ hectic workflow by introducing an automated cephalometric landmarking with high accuracy. In clinical practice, our method allows all 3D landmarks to be estimated from partial information obtained via 3D CT data. Although the error level of some landmarks does not meet the requirement of clinical applications (less than 2 mm), the proposed method may still aid in decisions of clinicians in determining landmark positions, thereby improving their working processes. Recently, as concerns about the radiation doses have increased, there have been attempts to use dental cone-beam CT for cephalometric analysis instead of the conventional multi-detector CT because cone-beam CT utilizes a much lower radiation dose than multi-detector CT. The investigation of an automated 3D landmarking system for cone-beam CT will therefore be a topic of our future research.

About 90 cephalometric landmarks.

(ZIP) Click here for additional data file.

15 in total

Review 1. Three-dimensional cephalometric analysis in orthodontics: a systematic review.

Authors: P Pittayapat; N Limchaichana-Bolstad; G Willems; R Jacobs
Journal: Orthod Craniofac Res Date: 2013-12-22 Impact factor: 1.826

2. Automatic 3D cephalometric annotation system using shadowed 2D image-based machine learning.

Authors: Sung Min Lee; Hwa Pyung Kim; Kiwan Jeon; Sang-Hwy Lee; Jin Keun Seo
Journal: Phys Med Biol Date: 2019-02-20 Impact factor: 3.609

3. Three-dimensional architectural and structural analysis--a transition in concept and design from Delaire's cephalometric analysis.

Authors: S-H Lee; T-J Kil; K-R Park; B C Kim; J-G Kim; Z Piao; P Corre
Journal: Int J Oral Maxillofac Surg Date: 2014-05-01 Impact factor: 2.789

4. Computer-aided cephalometric landmark annotation for CBCT data.

Authors: Marina Codari; Matteo Caffini; Gianluca M Tartaglia; Chiarella Sforza; Giuseppe Baselli
Journal: Int J Comput Assist Radiol Surg Date: 2016-06-29 Impact factor: 2.924

5. Learning-based local-to-global landmark annotation for automatic 3D cephalometry.

Authors: Hye Sun Yun; Tae Jun Jang; Sung Min Lee; Sang-Hwy Lee; Jin Keun Seo
Journal: Phys Med Biol Date: 2020-04-23 Impact factor: 3.609

6. Cephalometric analysis as a tool for treatment planning and evaluation.

Authors: F V Tenti
Journal: Eur J Orthod Date: 1981 Impact factor: 3.075

7. A new mandible-specific landmark reference system for three-dimensional cephalometry using cone-beam computed tomography.

Authors: Pisha Pittayapat; Reinhilde Jacobs; Michael M Bornstein; Guillaume A Odri; Min Sung Kwon; Ivo Lambrichts; Guy Willems; Constantinus Politis; Raphaël Olszewski
Journal: Eur J Orthod Date: 2015-12-17 Impact factor: 3.075

8. The morphology of the human mandible: A computational modelling study.

Authors: Ravin Vallabh; Ju Zhang; Justin Fernandez; George Dimitroulis; David C Ackland
Journal: Biomech Model Mechanobiol Date: 2019-03-02

9. A comparison of two-dimensional radiography and three-dimensional computed tomography in angular cephalometric measurements.

Authors: R Nalçaci; F Oztürk; O Sökücü
Journal: Dentomaxillofac Radiol Date: 2010-02 Impact factor: 2.419

10. A Fully Automated Method for 3D Individual Tooth Identification and Segmentation in Dental CBCT.

Authors: Tae Jun Jang; Kang Cheol Kim; Hyun Cheol Cho; Jin Keun Seo
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2022-09-14 Impact factor: 9.322