Young Hyun Kim1, Chena Lee1, Eun-Gyu Ha1, Yoon Jeong Choi2, Sang-Sun Han1,3. 1. Department of Oral and Maxillofacial Radiology, Yonsei University College of Dentistry, Seoul, Korea. 2. Department of Orthodontics, Institute of Craniofacial Deformity, Yonsei University College of Dentistry, Seoul, Korea. 3. Center for Clinical Imaging Data Science (CCIDS), Yonsei University College of Medicine, Seoul, Korea.
Abstract
PURPOSE: This study aimed to propose a fully automatic landmark identification model based on a deep learning algorithm using real clinical data and to verify its accuracy considering inter-examiner variability. MATERIALS AND METHODS: In total, 950 lateral cephalometric images from Yonsei Dental Hospital were used. Two calibrated examiners manually identified the 13 most important landmarks to set as references. The proposed deep learning model has a 2-step structure (a region of interest machine and a detection machine), each step consisting of 8 convolution layers, 5 pooling layers, and 2 fully connected layers. The distance errors of detection between the 2 examiners were used as a clinically acceptable range for performance evaluation. RESULTS: The 13 landmarks were automatically detected using the proposed model. Inter-examiner agreement for all landmarks indicated excellent reliability based on the 95% confidence interval. The average clinically acceptable range for all 13 landmarks was 1.24 mm. The mean radial error between the reference values assigned by 1 expert and the proposed model was 1.84 mm, exhibiting a successful detection rate of 36.1%. The A-point, the incisal tip of the maxillary and mandibular incisors, and ANS showed lower mean radial error than the calibrated expert variability. CONCLUSION: This experiment demonstrated that the proposed deep learning model can perform fully automatic identification of cephalometric landmarks and achieve better results than examiners for some landmarks. It is meaningful to consider between-examiner variability for clinical applicability when evaluating the performance of deep learning methods in cephalometric landmark identification.
Cephalometric analysis, which has been established as a gold standard for orthodontic diagnoses, has involved measurements of multiple linear and angular parameters using lateral cephalograms since Broadbent introduced the method in 1931.12 Each parameter is calculated based on clinically important landmarks and provides clinicians with useful information to facilitate diagnosis, growth assessment, orthodontic treatment planning, orthognathic surgery, and treatment outcome assessment.34 Although landmark identification is an indispensable part of the diagnostic process in orthodontics, image-related errors56 and expert bias7891011 can cause variability in landmark detection.38
Since the 1990s, various approaches have been tried to identify cephalometric landmarks automatically. In the early stages, pixel intensity and knowledge-based methods were used.12131415 However, the results showed sensitive fluctuations depending on the image quality and were difficult to apply in clinical settings. Template matching1617 and mathematical model1218 approaches have also been attempted, and multiple methods have been combined to overcome the disadvantages of each method.41920 In recent years, deep learning algorithms have been widely introduced to detect landmarks automatically on lateral cephalograms.212223 Despite the wide variety of approaches, it remains challenging to automatically detect all landmarks that are essential for diagnosis and achieve clinically acceptable performance.
Previous researchers evaluated performance in terms of how many landmarks are identified within a precision range of about 2 to 4 mm.320212425 However, there has been no official consensus approved by an academic society specializing in orthodontics regarding the clinically acceptable precision range for landmark identification. Due to the complexity of lateral cephalograms, some landmarks are more difficult to identify than others, even for experts.24
In this study, a fully automatic landmark identification model was developed using deep learning and its performance was evaluated in terms of a clinically acceptable range determined based on the inter-examiner reliability of experts.
Materials and Methods
Ethics approval
Every image was anonymized to avoid identification of the patients, and ethical approval (IRB No. 2-2017-0054) was obtained from the research ethics committee of the Institutional Review Board (IRB) of Yonsei University Dental Hospital. All experiments were performed in accordance with relevant guidelines and ethical regulations, and the requirement for patient consent was waived by the IRB of Yonsei University Dental Hospital due to the retrospective nature of this study.
Data preparation
A total of 950 lateral cephalometric images were used. The images were acquired with a Rayscan unit (Ray Co. Ltd., Hwaseong, Korea) at the Department of Oral and Maxillofacial Radiology and extracted from the picture archiving and communication system of Yonsei University Dental Hospital. The 950 radiographic images were randomly divided into 800 training images, 100 validation images, and 50 test images, with no overlap between the sets. Thirteen clinically important landmarks based on the hard tissue were selected for this study. The landmarks included the sella (S), nasion (N), orbitale (Or), porion (Po), A-point (A), B-point (B), pogonion (Pog), menton (Me), upper incisor border (UIB), lower incisor border (LIB), posterior nasal spine (PNS), anterior nasal spine (ANS), and articulare (Ar) (Fig. 1). Two calibrated orthodontists manually annotated these 13 landmarks using OrthoVision software (Ewoosoft Co. Ltd., Hwaseong, Korea). They were trained in the same orthodontic department and had 15 and 5 years of clinical experience, respectively. To improve inter-examiner reliability, they had 3 training sessions before identifying the landmarks for this study. The landmark data obtained from the expert with 15 years of experience were regarded as the reference landmarks, and the distance error of detection between the 2 experts (expert variability) was used to establish a clinically acceptable range for each landmark. To verify inter-examiner reproducibility, 50 cephalometric images were randomly selected and the landmarks were annotated manually by the 2 experts.
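The random, non-overlapping 800/100/50 division described above can be sketched as follows. This is only an illustration under stated assumptions: the image identifiers, the function name `split_dataset`, and the fixed seed are hypothetical, and the paper does not describe how the split was actually implemented.

```python
import random


def split_dataset(image_ids, n_train=800, n_val=100, n_test=50, seed=0):
    """Shuffle image IDs and cut them into three non-overlapping subsets."""
    if n_train + n_val + n_test != len(image_ids):
        raise ValueError("split sizes must sum to the number of images")
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 950 cephalograms.
ids = [f"ceph_{i:04d}" for i in range(950)]
train_ids, val_ids, test_ids = split_dataset(ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Slicing a single shuffled list guarantees that no image can appear in more than one subset.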
Fig. 1
Cephalometric identification of the 13 landmarks used in this study. S: sella, N: nasion, Or: orbitale, Po: porion, A: A-point, B: B-point, Pog: pogonion, Me: menton, UIB: upper incisor border, LIB: lower incisor border, PNS: posterior nasal spine, ANS: anterior nasal spine, Ar: articulare.
Convolutional neural network model for landmark detection
Figure 2 shows the proposed fully automatic landmark detection model using a convolutional neural network (CNN). The proposed deep learning model has a 2-step structure, comprising a region of interest (ROI) machine and a detection machine (Fig. 2A). Each CNN consisted of 8 convolution layers, 5 pooling layers, and 2 fully connected layers (Fig. 2B).
Fig. 2
The structure of the proposed fully automatic landmark detection model using a convolutional neural network (CNN). A. The overall workflow of the 2-step machines of the proposed model. B. The structure of the CNN model. ROI: region of interest; PNS: posterior nasal spine; ELU: exponential linear units.
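To make the layer counts concrete, the sketch below traces how the side length of a feature map would shrink through 8 convolution and 5 pooling layers, using the 3×3 filters, ELU activation, and 512×512 input reported in the Methods. The ordering of the layers, the stride of 1, the zero padding of 1, and the 2×2 non-overlapping pooling are all assumptions made for illustration; the paper reports only the layer counts, so this is not the authors' exact architecture.

```python
import math

# Hypothetical layer ordering: the paper reports only the layer counts
# (8 convolution, 5 pooling, 2 fully connected), not their arrangement.
LAYER_PLAN = [
    "conv", "pool",
    "conv", "pool",
    "conv", "conv", "pool",
    "conv", "conv", "pool",
    "conv", "conv", "pool",
    "fc", "fc",
]


def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def trace_spatial_size(input_size, plan, kernel=3, padding=1, pool=2):
    """Return the feature-map side length after each conv/pool layer."""
    sizes = []
    size = input_size
    for layer in plan:
        if layer == "conv":
            size = size + 2 * padding - kernel + 1  # stride 1 assumed
        elif layer == "pool":
            size = size // pool  # non-overlapping pooling assumed
        else:
            continue  # fully connected layers have no spatial size
        sizes.append(size)
    return sizes


sizes = trace_spatial_size(512, LAYER_PLAN)
print(sizes[-1])  # side length entering the fully connected layers
```

Under these assumptions, the 5 pooling layers reduce a 512-pixel side by a factor of 32 before the fully connected layers regress the landmark coordinates.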
The convolution layers had a filter size of 3×3 and the exponential linear unit (ELU) function was used for activation. The original images were 1,956×2,238 pixels with a pixel spacing of 0.12 mm. Before being input into the deep learning model, the original images were resized to 512×512 pixels for efficient and fast detection of the 13 landmarks. The first step, the ROI machine, was designed to crop 13 target areas, each of which contained 1 landmark. Once the resized image was input, the convolution layers found the local features of the image, and then the fully connected layers combined the associations of these features to predict the coordinates for the 13 landmarks. Since each of the landmarks has 2-dimensional coordinates (x and y), the predicted landmarks were represented by a 26-dimensional vector, and based on this result, the ROIs were cropped from the input image. The second step, the detection machine, was designed to identify the 13 landmarks from each ROI. Once the 13 cropped images were input into 13 CNN models, respectively, the models predicted the coordinates of the landmarks. To optimize the hyperparameters, the distance errors between the results predicted by the detection machine and the expert-annotated results were used. After the automatic detection of the landmarks was completed, the cropped images were merged into their original positions and output as a single image of 1,956×2,238 pixels that contained the 13 identified landmarks. The model was run on a Linux server running Ubuntu 18.04 with 128 GB of memory and 12 GB of GPU memory (NVIDIA Titan Xp; NVIDIA Corporation, Santa Clara, CA, USA).
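The coordinate bookkeeping implied by this 2-step pipeline can be sketched as follows: a coarse prediction on the 512×512 image is scaled back to the original resolution, an ROI is cropped around it, and a refined prediction inside the ROI is mapped back to full-image coordinates. The ROI size of 256 pixels, the reading of 1,956×2,238 as width × height, and all function names are assumptions; the paper does not report these details.

```python
ORIG_W, ORIG_H = 1956, 2238  # assumed to be width x height, in pixels
NET_SIZE = 512               # side length of the resized network input


def to_original(x_net, y_net):
    """Scale a coordinate predicted on the resized image to the original image."""
    return x_net * ORIG_W / NET_SIZE, y_net * ORIG_H / NET_SIZE


def roi_box(cx, cy, roi_size=256):
    """Return (left, top, right, bottom) of a square ROI kept inside the image.

    roi_size is a hypothetical value; the paper does not state the ROI size.
    """
    half = roi_size // 2
    left = min(max(int(round(cx)) - half, 0), ORIG_W - roi_size)
    top = min(max(int(round(cy)) - half, 0), ORIG_H - roi_size)
    return left, top, left + roi_size, top + roi_size


def roi_to_original(x_roi, y_roi, box):
    """Map a coordinate inside the ROI back to original-image coordinates."""
    left, top, _, _ = box
    return left + x_roi, top + y_roi


coarse = to_original(256, 256)             # step 1: ROI machine output
box = roi_box(*coarse)                     # crop window around the coarse guess
fine = roi_to_original(130.5, 120.0, box)  # step 2: detection machine output
print(coarse, box, fine)
```

Because the original image is not square, the horizontal and vertical scale factors differ, which is why the two axes are rescaled separately.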
Statistical evaluation
The intraclass correlation coefficient (ICC) was calculated to confirm the degree of reliability of the 2 experts. The mean radial error (MRE) and standard deviation (SD) for the 13 landmarks between the 2 examiners were calculated to establish the clinically acceptable range. The radial error (R) and MRE are defined in equations (1) and (2), where Δx and Δy denote the absolute distances in the x- and y-directions, respectively, between the predicted and reference landmarks, and n denotes the number of radial errors being averaged.
$$R = \sqrt{\Delta x^{2} + \Delta y^{2}} \tag{1}$$
$$\mathrm{MRE} = \frac{1}{n}\sum_{i=1}^{n} R_{i} \tag{2}$$
The distance between the detected and reference landmarks, each given as coordinates (x, y), was obtained by calculating the Euclidean distance in pixels and multiplying it by the pixel spacing (0.12 mm).
Two types of successful detection rate (SDR) were also calculated to assess the performance of the proposed model: the general SDR and the expert variability SDR. The general SDR was calculated as the percentage of cases in which the difference between the predicted and reference landmarks was within a given distance (2.0 mm, 2.5 mm, 3.0 mm, or 4.0 mm). The expert variability SDR was calculated by evaluating whether the difference between the reference and the predicted landmarks was within the inter-expert difference values.
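The radial error, MRE, and general SDR described above can be computed as in the following minimal sketch. The pixel coordinates are made up for illustration, the function names are hypothetical, and two details are assumptions because the paper does not state them: the SD is taken as the sample SD, and "within" a threshold is treated as less than or equal to it.

```python
import math
from statistics import mean, stdev

PIXEL_SPACING_MM = 0.12  # pixel spacing reported in the Methods


def radial_error_mm(pred, ref, spacing=PIXEL_SPACING_MM):
    """Euclidean distance between two (x, y) pixel coordinates, in mm."""
    dx = abs(pred[0] - ref[0])
    dy = abs(pred[1] - ref[1])
    return math.hypot(dx, dy) * spacing


def mre_and_sd(errors_mm):
    """Mean radial error and sample standard deviation (sample SD is assumed)."""
    return mean(errors_mm), stdev(errors_mm)


def general_sdr(errors_mm, threshold_mm):
    """Percentage of errors at or below the threshold (inclusive is assumed)."""
    hits = sum(1 for e in errors_mm if e <= threshold_mm)
    return 100.0 * hits / len(errors_mm)


# Hypothetical predicted and reference coordinates for one landmark, in pixels.
preds = [(100, 100), (200, 210), (330, 340)]
refs = [(103, 104), (200, 200), (300, 300)]

errors = [radial_error_mm(p, r) for p, r in zip(preds, refs)]
mre, sd = mre_and_sd(errors)
sdr_by_threshold = {t: general_sdr(errors, t) for t in (2.0, 2.5, 3.0, 4.0)}
print(errors, mre, sd, sdr_by_threshold)
```

In this toy example the pixel distances are 5, 10, and 50, so two of the three predictions fall within every general precision range.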
Results
The ICCs of the 2 examiners were above 0.99 for all landmarks, including the lower bounds of the 95% confidence intervals (Table 1). Because all ICC values exceeded 0.7, which is generally used as the criterion for high agreement, the 2 examiners showed almost perfect agreement. Figure 3 presents 4 images of the results predicted by the proposed CNN model compared with the reference landmarks.
Table 1
Reliability of manually annotated landmarks by 2 examiners
Figure 4 displays the MREs of expert variability and the predicted results for all landmarks. For expert variability, ANS showed the highest MRE (2.25 mm), while Po had the lowest (0.47 mm). Five landmarks (S, Or, Po, Pog, and PNS) had expert variability MREs of less than 1.00 mm. The MREs of the predicted results for the A-point, UIB, LIB, and ANS were 1.89 mm, 1.55 mm, 1.37 mm, and 2.14 mm, which were lower than the corresponding expert variability values of 1.97 mm, 1.66 mm, 1.40 mm, and 2.25 mm, respectively. Among the predicted results, S, Po, B, ANS, and Ar had MREs of more than 2.00 mm. UIB showed the lowest MRE (1.37 mm).
Fig. 4
Comparison of the mean radial errors of expert variability and predicted results. S: sella, N: nasion, Or: orbitale, Po: porion, A: A-point, B: B-point, Pog: pogonion, Me: menton, UIB: upper incisor border, LIB: lower incisor border, PNS: posterior nasal spine, ANS: anterior nasal spine, Ar: articulare.
Table 2 presents the performance of the proposed algorithm in terms of the general SDR and expert variability SDR. Six landmarks (N, A, Me, LIB, UIB, and ANS) had SDRs over 50% according to expert variability. Although Po and Or showed SDRs of 47.3% and 66.7%, respectively, when calculated with a general precision range of 2.0 mm, the 2 landmarks showed expert variability SDRs of only 3.3% and 7.3%, respectively.
Table 2
The successful detection rate (SDR) of landmark identification using the proposed algorithm
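The contrast between the general SDR and the expert variability SDR can be illustrated with the sketch below. The thresholds for Po (0.47 mm) and ANS (2.25 mm) are the expert variability MREs reported in the Results, but using a single per-landmark mean as the threshold is an assumption about how the "inter-expert difference values" were applied, and the prediction errors are invented purely for illustration.

```python
def sdr(errors_mm, threshold_mm):
    """Percentage of radial errors at or below the threshold."""
    hits = sum(1 for e in errors_mm if e <= threshold_mm)
    return 100.0 * hits / len(errors_mm)


# Expert variability MREs taken from the Results (mm).
EXPERT_VARIABILITY_MM = {"Po": 0.47, "ANS": 2.25}

# Hypothetical radial errors of the model for 5 test images (mm).
predicted_errors_mm = {
    "Po": [0.4, 1.1, 1.6, 1.9, 2.6],
    "ANS": [0.9, 1.8, 2.1, 2.4, 3.0],
}

results = {}
for name, errors in predicted_errors_mm.items():
    results[name] = {
        "general_2mm": sdr(errors, 2.0),
        "expert_variability": sdr(errors, EXPERT_VARIABILITY_MM[name]),
    }
    print(name, results[name])
```

With a tight expert threshold, as for Po, a landmark can score well at 2.0 mm and poorly against expert variability; with a loose one, as for ANS, the ordering can reverse.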
Discussion
In orthodontic procedures, cephalometric landmark detection is essential for accurate diagnosis and proper assessment of treatment progress, although it is a time-consuming, bothersome, and error-prone task for dentists.12 To overcome these drawbacks, grand challenges at the International Symposium on Biomedical Imaging were held, and the participants proposed various methods to automatically identify landmarks using cephalometric images.26 Researchers have been trying to achieve more accurate performance using state-of-the-art methods.12327
This study proposed a fully automatic deep learning method for landmark identification in cephalometric images. Similar to previous researchers,2127 we designed an ROI machine that detects small areas, including target landmarks, from the entire image using a CNN model. In the model proposed in this study, based on the detected ROIs, an individual CNN model is applied to each of the 13 landmarks that need to be identified. This architecture may allow each CNN model to extract feature maps for only 1 target landmark, resulting in faster and more accurate results.
In previous studies, however, these distances were not evaluated considering a clinically acceptable range. Because expert variability in landmark detection is common, it may be difficult to identify exactly the same position.51228 The clinically acceptable error range in landmark identification is still debatable.7
In this study, the general SDR was based on general precision ranges (2.0, 2.5, 3.0, and 4.0 mm), and the expert variability SDR was based on the difference between the 2 examiners for each of the 13 landmarks, providing a clinically acceptable evaluation. Two trained orthodontists annotated the 13 landmarks using randomly selected cephalometric images. Despite the different levels of experience of the 2 examiners, the inter-examiner agreement for all landmarks indicated excellent reliability.29 This may be because the 2 examiners had reached sufficient agreement in training sessions beforehand. Thus, the differences (in terms of the Euclidean distance) between the locations of the landmarks identified by the trained experts were considered as the clinically acceptable precision ranges.
Expert variability SDR ranged from 3.3% to 63.3%, and the rate was higher when calculated on the basis of the general precision range. For example, Or and Po exhibited SDRs of 66.67% and 47.33%, respectively, with 2.0 mm of precision, but 7.33% and 3.33%, respectively, with the expert variability SDR. This means that even though the identified landmarks showed high SDR values based on general precision ranges, they may still be clinically unacceptable. Furthermore, the A-point, LIB, UIB, and ANS showed lower MRE values when detected automatically by the proposed model than the variability between trained experts, indicating excellent detection performance for those landmarks.
Some researchers103031 reported that landmarks located anatomically on curves, such as the A-point, are prone to identification errors. The precision of landmark identification can be affected by various factors such as the level of the examiner's knowledge,7 individual understanding of landmark definitions,3233 and the quality of cephalometric images.33 The proposed CNN model may have shown lower errors than the experts for some landmarks because it is less affected by the human-induced variability arising from the above-mentioned factors.
The proposed algorithm was developed for fully automatic landmark detection using real clinical data, and expert variability was considered for the evaluation of the 13 detected landmarks. This can be useful when evaluating the clinical applicability of the developed model.