Research and application of tongue and face diagnosis based on deep learning.

Li Feng¹, Zong Hai Huang¹, Yan Mei Zhong¹, WenKe Xiao¹, Chuan Biao Wen¹, Hai Bei Song¹, Jin Hong Guo².

Abstract

Objective: To explore the technical research and application characteristics of deep learning in tongue-facial diagnosis.
Methods: The merits and demerits of current image processing techniques used in traditional medical tongue and face diagnosis were summarized; the research status of deep learning in tongue image preprocessing, segmentation, and classification was analyzed and reviewed; and the algorithms were compared and verified on real tongue and face images. Images of the face and tongue used for diagnosis in traditional medicine were systematically reviewed, from acquisition and preprocessing to segmentation, classification, algorithm comparison, result analysis, and application.
Results: Deep learning improved the speed and accuracy of tongue and face diagnostic image processing. The mean intersection ratio (intersection over union) of the U-Net and Seg-Net models exceeded 0.98, and the segmentation time ranged from 54 to 58 ms.
Conclusion: There is no unified standard for the objectification of tongue and face diagnosis in terms of image acquisition conditions and image processing methods, so further research is needed. Using images acquired by mobile devices for medical image analysis is feasible, provided that the influence of environmental and other factors on image quality is reduced and the efficiency of image processing is improved.

Keywords: Deep learning; face diagnosis; image processing; tongue diagnosis; traditional medicine

Year: 2022 PMID: 36159155 PMCID: PMC9490485 DOI: 10.1177/20552076221124436

Source DB: PubMed Journal: Digit Health ISSN: 2055-2076

Introduction

Traditional medicine employs a variety of diagnostic methods. Among them, diagnosis in traditional Chinese medicine (TCM) involves a doctor's visual observation of the external manifestations and discharges of the human body to determine its health status.[1,2] According to TCM, the manifestations of the face and tongue are important indicators of the qi and blood status and the overall functions of the body. However, the authenticity and accuracy of diagnostic information are easily affected by the physician's clinical experience, the patient's expression, and differences in patients' constitutions, so it is challenging to form a unified standard for TCM diagnosis. At present, intelligent tongue and face diagnosis systems primarily use modern instruments to acquire tongue and facial images; processing generally involves data preprocessing, image positioning and segmentation, feature extraction, pattern recognition, and other steps, with the aim of making tongue and face diagnosis objective, quantified, and information-based and of processing diagnostic data efficiently and accurately. In recent years, scholars have studied deep-learning-assisted diagnosis across the medical field. Gautam and Sharma focused on deep learning in diagnosing nervous system diseases such as cerebrovascular diseases, Alzheimer's disease, and Parkinson's disease. They described the performance of the convolutional neural network (CNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), deep Boltzmann machine (DBM), and deep neural network (DNN), listed the optimal deep learning method for diagnosing each disease, and emphasized the relationship between deep learning methods and those diseases.
Ahsan and Siddique reported that, when patient data are imbalanced, deep learning methods can improve model performance but cannot establish model reliability. Subsequent work is therefore directed at extending deep learning experiments to increase the explanatory power of model predictions. Liu et al. proposed using an active matrix sensing array assisted by a machine learning approach to measure plantar pressure during walking, so as to improve the diagnostic process of lumbar degenerative disease (LDD). The system can classify common human movements such as squatting, jumping, walking, and jogging with an accuracy as high as 99.2%. It can accurately identify personal activities, achieves a diagnostic accuracy of 100% for LDD patients, and provides postoperative recovery evaluation. These studies show that deep learning can be applied to almost every aspect of medicine, with far-reaching impact. Deep learning can also be combined with other technologies to play a substantial role in smart medicine, telemedicine, and related areas in the future. For example, combined with Internet of Things technology, it can be used to monitor and manage patients with psychological and neurological disorders and COVID-19 patients, which is conducive to the recovery of patients' physical and mental state. Deep learning technology therefore has unique advantages in diagnosis, prognosis management, health monitoring, risk assessment, and so on. Combining deep learning technology with traditional medicine will improve the diagnosis and treatment efficiency of traditional medicine and contribute to its development. This work not only analyzes the development of tongue image processing technology and examines face image processing, but also summarizes the hardware conditions of image acquisition.
It is a systematic overview of the face and tongue images used in traditional medicine for diagnosis, covering acquisition, preprocessing, segmentation, classification, algorithm comparison, result analysis, and application. With the evolution of mobile technology, real-time operation and universal availability are the direction of future medical development. The use of images collected by mobile terminals for disease diagnosis may provide new ideas for mobile medicine.

Related works

Equipment for tongue and face diagnostic image acquisition

Image acquisition under unrestricted scenarios

Specialized image acquisition equipment such as tongue diagnostic devices is inconvenient because it can only be used at a fixed location and time. In contrast, acquiring tongue and face images with mobile devices is less demanding and has a wider range of applications, so analyzing tongue images taken by mobile devices is the future trend of intelligent tongue and face diagnosis. Images captured by mobile devices differ from those captured by standard tongue diagnostic devices in noise and resolution, and processing them requires algorithms that exclude the influence of lighting and background. Song et al. applied LBP (Local Binary Patterns) combined with AdaBoost on a cell phone terminal to achieve near real-time tongue image detection, and obtained accurate image analysis with a residual network deep learning algorithm without a controlled scene. To construct a color correction model for tongue images from cell phone terminals, Wang et al. compared tongue images taken by different cell phone models with those taken by the TDA-1 micro tongue diagnostic instrument and analyzed more than 300 tongue images to acquire the L*a*b* color values of a standard color card; with an optimized back propagation (BP) neural network as the model, the constructed network can correct tongue color and coating color. Because tongue images captured by mobile devices often involve extremely complex lighting and backgrounds, they need to be algorithmically preprocessed before analysis.
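The LBP descriptor mentioned above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a 2D list and the basic 3 × 3 neighbourhood; the function names and the sample patch are hypothetical, and the AdaBoost detection stage used by Song et al. is not reproduced here.

```python
def lbp_code(img, r, c):
    """Return the 8-bit LBP code of pixel (r, c) in a 2D list of gray values."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Histogram of LBP codes over all interior pixels, usable as a texture feature."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


# Hypothetical 3 x 3 gray patch; only the centre pixel has a full neighbourhood.
patch = [
    [10, 20, 30],
    [40, 25, 5],
    [60, 25, 15],
]
print(lbp_code(patch, 1, 1))  # 228
```

Because each code depends only on comparisons with the centre pixel, the histogram is unchanged by a uniform brightness shift, which is one reason such features suit images taken under uncontrolled lighting.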

Image acquisition equipment under specific scenarios

TCM physicians tend to determine the clinical significance of tongue and facial manifestations based on their own experience and treatment conceptions, which is highly subjective. By setting parameters and using mathematical models to describe established targets, a machine can objectively characterize the content of the images qualitatively, quantitatively, and by location. Since the 1980s, Yan et al. have focused on TCM tongue-face diagnosis and conducted qualitative research on the objectification of the four diagnostic methods of TCM. In 2002, Wei et al. developed the first computer-based digital TCM tongue analysis instrument, which broke through the subjective limitations of traditional TCM tongue diagnosis and was of great significance to the objectification of TCM clinical diagnosis and the development of TCM scientific research. Since then, the objectification of TCM tongue-face diagnosis has entered a stage of rapid development. At present, research on the objectification of tongue-face diagnosis centers on two aspects: (1) developing accurate instruments; (2) exploring specific indicators that directly reflect the tongue and face diagnosis images. During image acquisition, environmental factors usually directly affect image quality and even the analysis results. To reduce the influence of these factors, Ding and Ding designed a tongue diagnostic instrument based on the theory of TCM inspection diagnosis, which preserves the authenticity of the tongue signs and excludes environmental influences such as the background, while fixing the light source parameters for the set scene.

Research progress of the facial diagnostic instrument

The facial diagnostic instrument is designed based on the theory of TCM facial diagnosis and integrates facial image acquisition, image segmentation, and information processing.[17,18] The light source in the instrument affects how faithfully facial color is reproduced, so keeping the light source stable is crucial. Li et al. employed light-emitting diodes with color temperatures of 4000 K to 11,000 K as the light source of a dark box system for collecting facial diagnosis information, which enhanced the standardization of collection; Shi et al. concluded, by comparison with natural light, that D50 can be used as the illumination light source for objectified acquisition; Zheng et al. used a xenon lamp with a color temperature of 5500 K to simulate daylight and a digital camera with a resolution of 4752 × 3168 to acquire images. The stability, color rendering, and uniformity of their light source were all greater than 95%, so the collected data could meet the needs of TCM color diagnosis and contribute to the objectified study of color diagnosis information.

Research progress of tongue diagnostic instrument

The tongue diagnostic instrument mainly consists of two parts: the digital tongue image acquisition system and the tongue feature processing system. The hardware of the acquisition system mainly combines image acquisition equipment, a light source, an illumination environment, and an image closure platform. The tongue feature processing system extracts and identifies tongue features such as tooth marks, fissures, sublingual vessels, and tongue coating, mainly by means of tongue classification and tongue segmentation (see Figures 1 and 2).

Figure 1.

Composition of tongue diagnostic instrument.

Figure 2.

The constitution of tongue image feature processing system.

Jiang et al., inspired by the small handheld tongue analyzer, the distributed tongue analyzer, and the integrating sphere tongue diagnostic instrument, designed an all-in-one closed tongue acquisition system and investigated a method to manually adjust the camera white balance parameters. Their work laid the foundation for later image color correction. Zhang et al. proposed a non-invasive instrument based on three groups of tongue features that captures the tongue image and completes image correction simultaneously, improving the quality of the tongue image. To solve the problem of inconsistent color in captured images during tongue objectification, some scholars compared and evaluated the imaging camera and illumination environment of the image acquisition equipment using three imaging quality indexes: peak signal-to-noise ratio, blur coefficient, and quality factor. The effect of the illumination environment on image color was explored using LED light sources of different wavelengths to produce a new tongue image acquisition device, and experiments showed that the color consistency of images acquired by the new device was significantly improved. The vast majority of tongue image acquisition devices capture images from a fixed position, and because patients' tongues differ in size and shape, complete and clear tongue images cannot be guaranteed. Zhang et al. proposed the concept of dynamic acquisition for the first time and realized vertical photography of the tongue body by controlling the camera with an embedded system. The captured images were clear and complete, with tongue color close to that under natural light, overcoming the incomplete reflection of tongue information caused by the original static acquisition.
After tongue objectification instruments were put into clinical practice, it was found that differences in tongue exposure between subjects can cause inconsistent measurement results. Specifying the subjects' tongue exposure on the basis of the ultrasonic ranging principle helped improve the reliability of tongue diagnosis image data. Separating the image of the tongue texture from the coating is a difficult step in tongue image feature processing, and the traditional K-means clustering algorithm separates them unstably. Li and Li therefore proposed an optimized K-means clustering model for texture and coating image separation, which integrates the properties of three color spaces: RGB, HSV, and L*a*b*. A comparison of single-channel, dual-channel, and three-channel tongue image clustering confirmed that dual-channel clustering was better than the other two segmentation schemes. In addition, using the histogram peaks of the single-channel tongue image as the initial cluster centroids separated the tongue coating and tongue texture more accurately and stably than the traditional K-means clustering algorithm. Tongue fissures are relatively special features. To extract them accurately, Zhang et al. proposed a fissured tongue detection method based on a local grayscale threshold. After Laplace enhancement and median filtering to remove noise, it separates the cracked region from the background region by the local grayscale variance and applies the regional consistency principle to determine whether cracks exist. This method separates tongue cracks from the background effectively and easily, which provides technical support for tongue diagnosis objectification.
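The idea of seeding K-means from histogram peaks can be sketched as follows. This is a simplified one-channel illustration, not the model of Li and Li: the bin count, the helper names, and the sample pixel values are all assumptions, and a fuller version would also keep adjacent bins from both being chosen as peaks.

```python
def histogram_peaks(values, k, bins=16, vmax=256):
    """Use the centres of the k most populated histogram bins as initial centroids."""
    width = vmax / bins
    counts = [0] * bins
    for v in values:
        counts[min(int(v / width), bins - 1)] += 1
    top = sorted(range(bins), key=lambda b: counts[b], reverse=True)[:k]
    return sorted((b + 0.5) * width for b in top)


def assign(values, centroids):
    """Index of the nearest centroid for every value."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j])) for v in values]


def kmeans_1d(values, centroids, iterations=20):
    """Plain K-means on scalar values, starting from the given centroids."""
    for _ in range(iterations):
        labels = assign(values, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(sum(members) / len(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Hypothetical single-channel values: a bright group and a darker group of pixels.
channel = [200, 205, 210, 198, 202, 90, 95, 100, 92, 97]
start = histogram_peaks(channel, k=2)
centroids, labels = kmeans_1d(channel, start)
print(start, centroids, labels)
```

Starting from the histogram peaks makes the result deterministic, whereas random initialisation can converge to different partitions on different runs, which is one plausible reading of the instability attributed to the traditional algorithm.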
The sublingual vein condition can provide an auxiliary diagnostic basis, but the size of the sublingual vein is much smaller than that of the tongue, and the difficulty of sublingual vein segmentation training is also much higher than that of tongue segmentation training. To improve the accuracy of sublingual vein segmentation, Yang et al. proposed a collaborative attention network that decomposes the entire coding and decoding framework and collaboratively updates the parameters, which improves the training speed and maintains the training stability.

Research and progress of image processing algorithms

After the tongue diagnosis images are captured and transferred to the database, the image information needs to be identified, read, and output completely and accurately. Image processing algorithms can improve the speed and accuracy of this information processing. Early objectification of tongue diagnosis converted the collected light of different wavelengths into digital signals by means of photoelectric sensors. In recent years, objectification of tongue and face diagnosis has been done by capturing images with a camera, preprocessing the images, and using machine learning to segment and classify them[34,35] (see Figure 3).

Figure 3.

Flowchart of objectification of lingual diagnosis.

Facial diagnosis image segmentation algorithm

There are few algorithms for facial diagnosis image segmentation and classification. Traditional image segmentation methods do not take into account the strongly three-dimensional nature of facial images, which leads to problems such as insufficient processing of facial details and distortion of chromaticity information during color space conversion. Liu et al. designed an automatic facial image segmentation algorithm for TCM facial diagnostic instruments that uses a grayscale adaptive enhancement method for preprocessing. It is combined with an adaptive nonlinear conversion method to reduce the risk of color conversion distortion, while details are processed using clustering methods and mathematical morphological operations, so that the facial diagnosis image is analyzed accurately. Lin et al. combined elements of color space theory, statistical features of facial texture, and lip color features, and used machine learning methods such as k-nearest neighbors (KNN), support vector machine (SVM), and BP neural network to classify and recognize the extracted facial features and achieve automatic segmentation of facial regions; the best recognition rate reached 91.03%. To solve the visual inconsistency caused by different angles during face diagnosis image acquisition, Ning and Chen adopted a cylindrical projection method based on facial features and combined the SIFT (scale-invariant feature transform) feature matching algorithm with the RANSAC (random sample consensus) matching optimization algorithm to extract image feature vectors. This eliminates matching errors and enables efficient, accurate, and smooth image matching, so that face images can be generated quickly and effectively, contributing to the objectification of TCM face diagnosis.
Facial expressions convey a variety of emotions that are directly related to the patient's condition, so the recognition of facial expressions is also part of intelligent inspection diagnosis. Huang et al. used a residual deep neural network with an internal evolutionary mechanism and a feature fusion algorithm to recognize facial expressions in images. It achieved high recognition rates on both standard and low-quality face datasets. This method eliminates the effect of lighting conditions, occlusions, and age changes on image quality. Because facial expression changes are very subtle, technical limitations can still lead to a low recognition rate. Some scholars have applied CNN models to expression recognition: Jin et al. trained a model based on the VGG network structure, which was optimized to continuously reduce the loss and improve recognition accuracy. Wu et al. designed a three-dimensional convolutional neural network (3D-CNN) micro-expression recognition algorithm, adding batch normalization and dropout on top of 3D-VGG-Block to increase the network depth and training speed and to prevent overfitting. At present, few studies incorporate facial expressions into the objectification indexes of facial diagnosis, but facial expressions have special significance in TCM diagnosis, and research on facial expression objectification is deepening and developing rapidly. Face classification algorithms can be used not only for analyzing facial features but also for gender classification. In social software, identity authentication and login permission are achieved through facial recognition. Accurate gender classification can improve the accuracy of facial recognition and reduce its difficulty.
Fekri-Ershad designed a rotation invariant method for gender classification of face images based on an improved version of the local binary pattern (iLBP), which reduces the computational complexity of smartphone applications, reduces memory and CPU usage, and improves the performance of synchronous applications in smartphones. Zhang et al. proposed a new method based on a multi-scale facial fusion feature (ms3f) to classify gender from facial images. The fusion features are extracted by the combination of a local binary pattern (LBP) and local phase quantization (LPQ) descriptors, and multi-scale features are generated by multi-block (MB) and multi-level (ML) methods. SVM is used as a classifier for gender classification. Those results indicate that the application of multi-scale fusion features greatly improves the performance of gender classification. In the future, gender classification can be applied to many scenes such as smartphones and intelligent medical treatment. The progress of facial image processing technology can promote the accuracy of gender classification.

Progress of tongue image feature processing algorithm

Before a tongue or facial diagnosis image is input into the model, it must be preprocessed. Image preprocessing includes light compensation, color correction, geometric transformations, etc. Light compensation is widely used in image processing; common methods are the Gray World color equalization algorithm and algorithms based on a white reference image. Yu et al., however, proposed a method to remove light imbalance without a white reference image: the background is obtained by segmenting the image, the light difference is estimated from different background images, and light compensation is then performed. This method can be used to segment microscopic medical images as well as other medical images. Tongue images acquired by a camera contain noise caused by uneven illumination, which seriously affects image quality. The noise can be handled with low-level image processing, such as point processing and median filtering, by thresholding, stretching the contrast, reducing the gray value, and filtering the noise. This reduces the deviations generated during preprocessing while improving the image processing speed. Color correction mainly consists of ambient lighting correction, color space correction, color card correction, and algorithmic correction. Ambient lighting correction specifies the illuminant and light source so that color correction is completed while the image is captured. However, standard light sources still differ noticeably from natural light, and factors such as distance also affect the illumination, so further algorithmic correction is still required. Shang et al. superimposed the positive image of a healthy tongue and the color negative image of a diseased tongue, selected the information-sensitive area of the tongue image, and used the conversion between the RGB and CIE Lab color models to obtain the discrete chromatographic distribution, showing that the Lab color model is suitable for color discrimination. Yang et al. explored the role of chromaticity in tongue color quantification and emphasized the significance of establishing color correction specifications for building a unified tongue image library. Because of their physical characteristics, images acquired by different standardized devices still show color differences. He and Du converted color images to grayscale images and then to binary images that can be fed into algorithmic models. Image preprocessing enhances the detectability of tongue diagnosis information in terms of light and color and lays the foundation for subsequent image processing. Because acquisition instruments differ, the collected images differ in size. To improve training efficiency and save memory, deep learning networks place restrictions on the input images, so the image size must be unified during preprocessing, generally by reshaping the input image to facilitate model training. The commonly used reshaping methods are “interpolation,” “cropping,” “inclusion,” “tiling,” and “mirroring.”
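Two of the preprocessing steps named in this section, Gray World light compensation and the color-to-grayscale-to-binary conversion, can be sketched in a few lines. This is an illustration under stated assumptions rather than any cited author's implementation: pixels are plain RGB tuples, the grayscale weights are the common ITU-R BT.601 luma coefficients, and the fixed threshold of 128 and the sample values are hypothetical.

```python
def gray_world(pixels):
    """Gray World balance: scale each RGB channel so its mean equals the overall mean."""
    n = len(pixels)
    means = [sum(p[ch] for p in pixels) / n for ch in range(3)]
    target = sum(means) / 3
    gains = [target / m if m else 1.0 for m in means]
    return [tuple(min(255, round(p[ch] * gains[ch])) for ch in range(3)) for p in pixels]


def to_gray(pixel):
    """Weighted sum of R, G, B (BT.601 luma weights, an assumed choice)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(gray_values, threshold=128):
    """Map gray values to 1 (at or above the threshold) or 0 (below it)."""
    return [1 if g >= threshold else 0 for g in gray_values]


# Two hypothetical pixels with a strong red cast.
pixels = [(200, 100, 100), (100, 50, 50)]
balanced = gray_world(pixels)
gray = [to_gray(p) for p in balanced]
mask = binarize(gray)
print(balanced, gray, mask)
```

Gray World assumes the scene averages to neutral gray, which a close-up of a tongue may not satisfy; that limitation is one reason color card and algorithmic correction are discussed alongside it.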

Progress of tongue image classification algorithm research

Tongue classification covers tongue color classification and tongue coating classification, which classify features of the tongue such as tongue color, tongue coating, and coating color. Improvements to tongue color classification algorithms mainly increase the speed and accuracy of tongue color and coating color discrimination, while improvements to tongue coating classification algorithms increase the classification accuracy of particular tongue features, such as cracks and tooth marks; together they advance tongue classification algorithms. In recent years, tongue classification models based on deep CNNs[54,55] have been a hot research topic, and transfer-learning-based methods have been proposed to address the high equipment requirements and long training times of deep learning in tongue classification. Current research on tongue image classification algorithms is shown in Table 1.

Table 1.

Research status of tongue image classification algorithm.

References	Methods	Classification object	Indicator	Effect
Tang et al.⁵⁶	Joint multi-task learning model based on ImageNet and an RPN network	Tongue color, coating color, fissure lines, tooth marks	Speed, accuracy	40% faster and 33% more accurate than the traditional method
Xiao et al.⁵⁷	Tongue color classification method based on the AlexNet network	Tongue color	Accuracy	94.85%
Yang and Zhang⁵⁸	Tongue classification methods based on Inception_v3 + 2NN and Inception_v3 + 3NN	Tongue features	Accuracy	Inception_v3 + 2NN: 90.30%; Inception_v3 + 3NN: 93.98%
Song et al.⁵⁹	Deep transfer learning method for tongue study based on GoogLeNet, ResNet, and ImageNet	Tongue features	Accuracy	Inception-v highest accuracy 94.88%
Kanawong et al.⁶⁰	Tongue classification model built on HSV- and RGB-based feature extraction	Tongue color, tongue coating	Comprehensive evaluation index (F-measure)	Mean F-measure 0.837
Yan et al.⁶¹	Tooth-marked tongue classification method based on the YoloV5 deep learning algorithm and random forest	Tongue features (tooth marks)	Accuracy	93.7%
Jiao et al.⁶²	Weighted SVM method	Unbalanced tongue samples	Accuracy	More than 20% higher on average than the standard SVM method

Advances in tongue segmentation methods

Traditional tongue segmentation methods

Compared with tongue image classification methods, tongue image segmentation methods have developed more rapidly. Traditional tongue image segmentation algorithms can be divided into five categories: region-based segmentation algorithms, active contour model-based algorithms,[63-65] threshold-based segmentation methods, edge-based segmentation methods, and graph theory-based segmentation methods. In recent years, the rise of deep-learning-based tongue segmentation methods has brought breakthroughs in tongue image processing. Early tongue image segmentation algorithms were mainly based on template matching,[67-69] especially the Snake algorithm. The Snake algorithm is given an initial contour curve, which is then adjusted so that it evolves toward the real contour, so research on the Snake algorithm centers on obtaining the initial contour and on curve evolution. The initial contour curve is a set of control points that control the shape of the curve, joined end to end to form a closed contour. Let $v(s) = (x(s), y(s))$ denote the coordinate position of each control point in the image, where $s \in [0, 1]$ is the independent variable describing the boundary (the normalized arc length). In the standard formulation of the active contour model, the energy function of the Snake curve is expressed as

$$E_{snake} = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \right] ds$$

where $E_{int}$ is the internal energy, $E_{image}$ is the image energy, and $E_{con}$ is the external constraint energy; the image energy and external constraint energy are collectively referred to as the external energy, $E_{ext} = E_{image} + E_{con}$. The internal energy consists of the elastic energy (modulus of the first-order derivative) and the bending energy (modulus of the second-order derivative), with weighting coefficients $\alpha$ and $\beta$:

$$E_{int} = \frac{1}{2} \left( \alpha \left| v_s(s) \right|^2 + \beta \left| v_{ss}(s) \right|^2 \right)$$

The image energy is a weighted sum of three terms, $E_{image} = w_{line} E_{line} + w_{edge} E_{edge} + w_{term} E_{term}$, where $I(x, y)$ is the image intensity and $C(x, y)$ is the smoothed image.

Linear energy: $$E_{line} = I(x, y)$$

Edge energy: $$E_{edge} = -\left| \nabla I(x, y) \right|^2$$

End energy: $$E_{term} = \frac{C_{yy} C_x^2 - 2 C_{xy} C_x C_y + C_{xx} C_y^2}{\left( C_x^2 + C_y^2 \right)^{3/2}}$$

Solving: the optimization objective is to locally minimize the total energy function. The termination of the iteration is controlled by the minimal energy function or the number of iterations, and the minimized energy function is solved by the Euler equation:

$$\alpha v_{ss}(s) - \beta v_{ssss}(s) - \nabla E_{ext} = 0$$

Based on the force balance $F_{int} + F_{ext} = 0$, the above equation is rewritten with the internal force

$$F_{int} = \alpha v_{ss}(s) - \beta v_{ssss}(s)$$

Introducing external energy:

$$F_{ext} = -\nabla E_{ext}$$

The problem is turned into an iterative evolutionary process, with each step:

$$\frac{\partial v(s, t)}{\partial t} = \alpha v_{ss}(s, t) - \beta v_{ssss}(s, t) - \nabla E_{ext}$$

Pang et al.[70,71] proposed a double elliptic deformation contour method based on the Snake algorithm, which combines a double elliptic deformation template (BEDT) and an active contour model to obtain the approximate shape features of the target and fit the local details. The BEDT deformation template is used to roughly describe the tongue body. The initial contour curve is obtained by converting the BEDT energy function, which replaces the traditional internal energy of the active contour model in evolving the curve and leads to the tongue segmentation result. Ma et al. proposed a dual Snake algorithm based on the Snake model. After a median filter removes image noise and the color space is transformed, the complete tongue image is obtained using the double Snake model, which has better tongue segmentation performance. However, the Snake algorithm requires repeated interaction, and if the initial contour is not accurately delineated, the subsequent segmentation will underfit, so the limitations of the Snake algorithm are relatively obvious. It was gradually replaced by other image segmentation algorithms. Thresholding is the classical image segmentation method, of which the Otsu algorithm is widely applied. The Otsu algorithm applies one or more thresholds to the grayscale version of the original image, compares the gray value of each pixel with the corresponding threshold, and assigns the pixel to the tongue or to the background according to the result. Zhang et al.
proposed combining the threshold segmentation method with the grayscale projection method,[73,74] the Snake active contour method, and the HSV spatial thresholding method for tongue image segmentation; however, the algorithm's performance can be improved further. Jiang et al. designed a tongue image segmentation algorithm that combines the Otsu algorithm with a mathematical morphological algorithm. After Otsu threshold segmentation, a morphological opening operation is applied to correct the tongue image. However, this method involves more operations in the correction step, resulting in a longer segmentation time and lower segmentation efficiency. Huang et al. proposed a tongue image segmentation method based on Otsu and regional sub-block growth, incorporating classification based on image color features and a fast merging algorithm for the classified sub-blocks to achieve automatic tongue segmentation. Zhang and Liu proposed a tongue extraction algorithm based on dynamic thresholding and a correction model. The HSI color model[80,81] was used to extract information such as lips and face from the tongue image. A dynamic threshold segmentation algorithm was used to extract the initial outline of the tongue body, and a tongue body correction model was used to obtain the final tongue body. The data showed that the images obtained by this extraction method are superior in noise immunity and accuracy and are also segmented well in the concave and convex areas of the tongue body. The Otsu algorithm is also called the adaptive threshold method.
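Following the derivation of the adaptive threshold method, the following relations hold, where $p_i$ is the fraction of pixels with gray level $i$, $L$ is the number of gray levels, and $k$ is the candidate threshold:

$$p_1 m_1 + p_2 m_2 = m_G \quad (1)$$

$$p_1 + p_2 = 1 \quad (2)$$

According to the concept of variance, the expression for the variance between classes is

$$\sigma^2 = p_1 (m_1 - m_G)^2 + p_2 (m_2 - m_G)^2 \quad (3)$$

Substituting (1) into (3) and using (2), we get

$$\sigma^2 = p_1 p_2 (m_1 - m_2)^2 \quad (4)$$

The gray level $k$ that maximizes the above equation is the Otsu threshold, with

$$p_1 = \sum_{i=0}^{k} p_i$$

The cumulative mean of the gray levels up to $k$ is

$$m = \sum_{i=0}^{k} i\,p_i$$

The global mean of the image is

$$m_G = \sum_{i=0}^{L-1} i\,p_i$$

Since $m_1 = m / p_1$ and $m_2 = (m_G - m) / p_2$, the final inter-class variance equation is

$$\sigma^2 = \frac{(m_G\,p_1 - m)^2}{p_1 (1 - p_1)}$$

Parameters:
TH: assumed threshold (the gray level $k$)
C1: pixels with gray level up to the threshold TH
C2: pixels with gray level greater than the threshold TH
p1: probability that a pixel is assigned to C1
p2: probability that a pixel is assigned to C2
m1: average value of all pixels in C1
m2: average value of all pixels in C2
mG: average value of all pixels in the image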
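The inter-class variance criterion described above can be illustrated with a minimal, dependency-free sketch. It assumes the image has already been reduced to a gray-level histogram. The function names `otsu_threshold` and `binarize` are hypothetical and are not taken from the paper's implementation.

```python
def otsu_threshold(histogram):
    """Return (k, variance) for the gray level k that maximizes the
    between-class variance. `histogram[i]` is the pixel count at level i."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    probs = [count / total for count in histogram]
    global_mean = sum(i * p for i, p in enumerate(probs))  # m_G

    best_k, best_var = 0, 0.0
    p1 = 0.0  # cumulative probability of class C1 (levels 0..k)
    m = 0.0   # cumulative mean of the gray levels up to k
    for k, p in enumerate(probs):
        p1 += p
        m += k * p
        # Skip thresholds that leave one class empty (division by zero).
        if p1 < 1e-12 or p1 > 1.0 - 1e-12:
            continue
        variance = (global_mean * p1 - m) ** 2 / (p1 * (1.0 - p1))
        if variance > best_var:  # strict '>' keeps the first maximum
            best_k, best_var = k, variance
    return best_k, best_var


def binarize(pixels, threshold):
    """Label pixels above the threshold as 1 (class C2), others as 0 (C1)."""
    return [[1 if value > threshold else 0 for value in row] for row in pixels]


if __name__ == "__main__":
    # Assumed toy histogram with 8 gray levels and two well-separated peaks.
    k, variance = otsu_threshold([10, 10, 0, 0, 0, 0, 10, 10])
    print(k, variance)
    print(binarize([[0, 7], [1, 6]], k))
```

A production pipeline would more likely call an image library's Otsu routine; the loop is written out here only to make the variance computation visible.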

Deep-learning-based tongue segmentation methods

Deep-learning-based tongue image segmentation models fall mainly into two categories: U-Net[83,84] and Seg-Net,[85,86] which evolved from the fully convolutional network (FCN), and Mask R-CNN, which was developed from the CNN. Both categories are widely used in many types of medical image segmentation. FCN, the pioneer of semantic segmentation, classifies images at the pixel level. The FCN model accepts input images of any size and uses deconvolution layers to upsample the feature map of the last convolutional layer back to the size of the input image, which preserves the spatial information of the original input. The upsampled map is then classified pixel by pixel, and the softmax classification loss is calculated pixel by pixel, so each pixel corresponds to one training sample. The defining features of FCN are therefore full convolution, upsampling, and the skip-level structure (see Figure 4).
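The pixel-by-pixel softmax loss mentioned above can be sketched as follows. This is an illustrative pure-Python version that assumes the network output is given as nested lists of per-class scores. The function name `pixelwise_softmax_loss` is hypothetical, and a real FCN would compute the same quantity with a deep learning framework.

```python
import math


def pixelwise_softmax_loss(logits, labels):
    """Mean softmax cross-entropy over all pixels.

    logits: H x W x C nested lists of raw class scores for each pixel.
    labels: H x W nested lists of integer class indices (ground truth).
    """
    total_loss = 0.0
    pixel_count = 0
    for logit_row, label_row in zip(logits, labels):
        for scores, label in zip(logit_row, label_row):
            # Log-sum-exp with the maximum subtracted for numerical stability.
            peak = max(scores)
            log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
            # Cross-entropy of one pixel: -log softmax(scores)[label].
            total_loss += log_sum - scores[label]
            pixel_count += 1
    return total_loss / pixel_count


if __name__ == "__main__":
    # One row of two pixels, two classes (assumed: 0 = background, 1 = tongue).
    logits = [[[0.0, 0.0], [4.0, -4.0]]]
    labels = [[0, 0]]
    print(pixelwise_softmax_loss(logits, labels))
```

Each pixel contributes one cross-entropy term, which is what "each pixel corresponds to one training sample" means in practice.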

Figure 4.

Fully convolutional network (FCN) model diagram.

The U-Net model, an image processing method based on the FCN architecture, is one of the three most classical fully convolutional network models. It can handle a wide range of medical image segmentation tasks regardless of organ or imaging modality. Li et al. developed a U-Net model for accurate retinal vascular segmentation, which improved on existing image segmentation methods. Because organ image segmentation[90,91] is error-prone owing to the inhomogeneous and irregular shape of organs, Li et al. designed an attention-based nested segmentation network, ANU-Net. They introduced an attention mechanism between nested convolutional blocks so that features extracted at different levels can be selected and fused according to their relevance to the task. The model also uses a hybrid loss function to make full use of full-resolution feature information. Experiments show that ANU-Net is competitive with various medical image segmentation methods. As the demand for medical image segmentation grows, the U-Net model plays an important role because of its excellent segmentation accuracy and robustness, and using U-Net for tongue segmentation is a likely future trend in tongue objectification (see Figures 5 and 6).

Figure 5.

U-Net algorithm composition.

Figure 6.

U-Net algorithm diagram.

A CNN consists of convolutional layers followed by several fully connected layers that produce a fixed-length feature vector for classification, so the CNN structure suits image classification and regression tasks. CNN architectures can be divided into two classes. The first class, which includes LeNet, AlexNet, and ResNet, is mainly used for image classification. LeNet is the foundation of the modern convolutional neural network and provided ideas for the later development of deep learning network structures; it is mainly used for recognizing and classifying handwritten characters, with an accuracy of 98%. AlexNet extends LeNet to reduce training time and enhance the generalization ability of the model. ResNet (residual neural network) allows much deeper networks to be trained while avoiding the saturation of accuracy that otherwise appears as depth increases. The second class is mainly used for target detection and includes R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN;[100,101] medical image segmentation mostly uses the Mask R-CNN structure. Yan et al. devised a tongue image segmentation method based on Mask R-CNN that yields more accurate tongue edges. Of the four quantitative evaluation indexes, the mean pixel accuracy was 93.03%, the mean intersection over union was 86.69%, and the frequency-weighted intersection over union was 87.16%, which alleviated the problem of unclear tongue segmentation contours. Zhang et al. also designed an end-to-end tongue image segmentation method for the same problem, combining a DCNN with a fully connected CRF to refine the edges of the segmented image. This algorithm outperforms traditional tongue image segmentation algorithms and mainstream deep learning methods, with a mean intersection over union of 95.41%.
Traditional semantic segmentation faces several problems: successive pooling and downsampling in a CNN degrade spatial resolution, and detection at different scales requires rescaling and aggregating feature maps, which is computationally expensive. In addition, image classification requires invariance to spatial transformations. The DeepLab structure was introduced to address these issues. DeepLab v1 is modified from VGG16: the fully connected layers are converted to convolutional layers, and the last two pooling layers are removed and atrous (dilated) convolution is used instead. DeepLab v2 adds the atrous spatial pyramid pooling (ASPP) module to the DeepLab v1 structure, and DeepLab v3 improves the ASPP module, proposing a more general framework that can be applied to any network and improves the efficiency of image segmentation.

In addition to mainstream deep learning algorithms for tongue image segmentation, scholars have compared the Chan-Vese method with the Canny algorithm. The results showed that the Canny algorithm generates a large number of pseudo-edges after edge detection, whereas the Chan-Vese method automatically selects the best edge information and has greater clinical value. Li et al. proposed mixing calibrated neural networks with other deep learning methods to improve the accuracy of tongue detection and of recognizing tongue images taken under natural conditions. Li et al. proposed capturing tongue images with hyperspectral imaging and then using hidden Markov models to classify tongue fissures into 12 classes; the method performed well in tongue classification. Shi et al. proposed a tongue edge detection method based on the double geo-vector flow (DGF), which detects tongue edges and segments tongue regions by means of geo-gradient vector flow. Cui et al. proposed a new tongue segmentation method, GaborFM.
This method uses a fast-marching algorithm to connect discontinuous contour segments into a closed, continuous tongue contour for ACM-based tongue segmentation. Qualitative and quantitative results show that GaborFM outperforms other methods. Zhou et al. proposed a semi-supervised probabilistic model for tongue segmentation that combines image reconstruction constraints with adversarial learning, together with an inference model that has a segmentation decoder and a reconstruction decoder. The method was validated by generating tongue segmentation images and reconstructed tongue images and supervising them with a discriminator. All these models advance the objectification of tongue diagnosis and provide new ideas for it. At the same time, they offer only a reference and do not amount to a standardized, effective set of objectification criteria, so further research and development is still needed.

In the last two years, tongue image segmentation algorithms have been developed from both traditional segmentation methods and deep learning algorithms. To solve the problems of edge blurring and detail interference during tongue extraction, Huang et al. designed an automatic tongue image segmentation method using an enhanced fully convolutional network with an encoder-decoder structure; its average sensitivity of 98.97% is better than that of the four algorithms SegNet, FCN, PSPNet, and DeepLab v3+. Gao et al. proposed LSM-SEC, a convolutional-neural-network-based model that combines level sets with symmetry and edge constraints derived from the geometric features of the tongue; it is suitable for tongue image segmentation under most conditions and can also improve the accuracy of subsequent model evolution.
The multi-head attention mechanism has mostly been applied to human-computer interaction, such as speech emotion recognition, and there have been attempts to apply it to image segmentation. Lin et al. proposed a module that enhances and fuses convolutional feature maps using multi-head attention, optimizing the feature transformation algorithm for scene recognition and obtaining a framework that outperforms many state-of-the-art techniques. This shows that the rapid development of deep learning algorithms offers the possibility of applying more novel algorithmic models to tongue diagnosis image segmentation (see Figure 7).

Figure 7.

Types of deep learning algorithms for tongue and face image segmentation.

Performance comparison of tongue image algorithm segmentation

In this study, four models (Snake, Otsu, Seg-Net, and U-Net) were used for tongue image segmentation, and algorithm performance was assessed by three parameters: time, memory usage, and accuracy. The Snake model depends heavily on the correctness of the initial contour, and the accuracy of its iterative result is proportional to the accuracy of that contour. The Otsu algorithm determines the binarization threshold of an image from its grayscale characteristics by maximizing the inter-class variance between the foreground and the background produced by the binarization. It is computationally simple and is not affected by image brightness and contrast, so it is widely used in digital image processing. However, it is sensitive to image noise, and it segments poorly when the target and the background differ greatly in size, so it is not suitable for images captured from too far away or too close. Seg-Net and U-Net both belong to the FCN family of semantic segmentation algorithms. Semantic segmentation labels each pixel of the image with a category that depends both on the categories of neighboring pixels and on the overall category to which the pixel belongs. With an image classification network as the underlying structure, feature vectors from different levels can meet this need. U-Net is one of the early algorithms to use multi-scale features for semantic analysis, and its distinctive U-shaped structure has inspired subsequent algorithms. Its disadvantage is that valid (unpadded) convolution makes the model harder to design and less general. Seg-Net uses a fully symmetric encoder-decoder structure, in which convolution mirrors deconvolution and pooling mirrors unpooling, and this compensates for the lower resolution of images segmented by FCN.

Methods

The evaluation metrics of image segmentation generally include inference time, memory occupation, and accuracy. Inference time is a valuable measure of segmentation efficiency. Training speed is affected by hardware, back-end implementation, and other factors, so it cannot fully represent segmentation efficiency; inference time measured in the same environment, by contrast, allows the methods to be compared directly and shows which one runs fastest. Memory is another important factor in evaluating segmentation methods. The resources an algorithm occupies during image processing affect processing efficiency, and CPU occupancy reflects the efficiency of the algorithm, so when the running time is the same, memory occupancy is directly related to processing efficiency. Many metrics measure accuracy in image segmentation, usually as variations of pixel accuracy and IoU, including PA (pixel accuracy), MPA (mean pixel accuracy), MIoU (mean intersection over union), and FWIoU (frequency-weighted intersection over union); MIoU is the most commonly used because of its simplicity and representativeness. To compare segmentation performance, real tongue images captured by a tongue diagnostic instrument were used to validate the Snake, Otsu, Seg-Net, and U-Net algorithms. The experiment was run on the Windows Server 2016 operating system. The CPU is an Intel(R) Xeon(R) E5-2678 v3, the GPU is a Tesla V100, and the CUDA version is 10.1. The models were built in Python 3.7.4 with TensorFlow 2.0.2 and Keras 2.3.1 as the main frameworks. First, the image is preprocessed.
Because the images may differ in pixel dimensions, they must first be normalized to the same size. Each image is padded at its edges with white, taking the long edge of the image as the target length, so that it becomes a square whose side equals the long edge. The padded square is then resized to a standard 512 × 512 × 3 image. After the four algorithm models were constructed, 1233 preprocessed training images were input for training. After training, 98 random tongue images were input into the models for testing, and the accuracy and processing speed of each algorithm were verified with three indicators: PA, runtime, and MIoU. PA is the pixel accuracy, that is, the proportion of pixels whose predicted category is correct among all pixels. Runtime is the running time of the algorithm. MIoU is the ratio of the intersection to the union of the predicted result and the ground truth for each category, summed over the categories and then averaged; it measures the accuracy of image segmentation. The validation results are shown in Figure 8 and Table 2.
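The padding and resizing steps described above can be sketched as follows. This is a dependency-free illustration in which an image is a list of rows of RGB tuples. The source does not say whether the white padding is centered or which interpolation is used, so the centered padding and nearest-neighbor resize here are assumptions, and the names `pad_to_square` and `resize_nearest` are hypothetical.

```python
WHITE = (255, 255, 255)


def pad_to_square(image, fill=WHITE):
    """Pad an H x W image with `fill` so it becomes a square whose side
    equals the long edge. Centering the original is an assumption."""
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    square = [[fill] * side for _ in range(side)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            square[top + y][left + x] = pixel
    return square


def resize_nearest(image, size):
    """Resize an image to size x size with nearest-neighbor sampling
    (an assumed interpolation; the source does not specify one)."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]


if __name__ == "__main__":
    # A tiny 2 x 4 black image stands in for a tongue photograph.
    black = (0, 0, 0)
    photo = [[black] * 4 for _ in range(2)]
    standard = resize_nearest(pad_to_square(photo), 512)
    print(len(standard), len(standard[0]), len(standard[0][0]))
```

Padding before resizing keeps the aspect ratio of the tongue unchanged, which a direct resize to 512 × 512 would not.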

Figure 8.

Comparison of the segmentation results of the four tongue image segmentation algorithms.

Table 2.

Comparison results of four tongue segmentation algorithms.

Algorithm	Snake	Otsu	Seg-Net	U-Net
Sample size of training set	1233	1233	1233	1233
Sample size of test set	98	98	98	98
PA	0.988	0.983	0.999	0.999
Runtime	14,645 ms	6070 ms	54 ms	58 ms
MIOU	0.536	0.594	0.945	0.957
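The PA and MIoU values in Table 2 follow the definitions given in the Methods. A minimal sketch of how such metrics can be computed from a confusion matrix is shown below. It assumes flattened lists of integer class labels, and the function names are hypothetical rather than the authors' evaluation code.

```python
def confusion_matrix(truth, prediction, num_classes):
    """matrix[t][p] counts pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, prediction):
        matrix[t][p] += 1
    return matrix


def segmentation_metrics(matrix):
    """Return PA, MIoU and FWIoU computed from a confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    ious = []
    fwiou = 0.0
    for i in range(n):
        true_positive = matrix[i][i]
        truth_i = sum(matrix[i])                       # pixels of class i
        predicted_i = sum(matrix[r][i] for r in range(n))
        union = truth_i + predicted_i - true_positive
        if union == 0:
            continue  # class absent from both truth and prediction
        iou = true_positive / union
        ious.append(iou)
        fwiou += (truth_i / total) * iou               # weight by frequency
    return {
        "PA": correct / total,
        "MIoU": sum(ious) / len(ious),
        "FWIoU": fwiou,
    }


if __name__ == "__main__":
    # Assumed toy case: 10 tongue pixels (1), 90 background pixels (0),
    # and a prediction that labels every pixel as background.
    truth = [1] * 10 + [0] * 90
    prediction = [0] * 100
    print(segmentation_metrics(confusion_matrix(truth, prediction, 2)))
```

The toy case shows that PA can remain high while MIoU is low when one class dominates the image, which is one reason MIoU is usually reported alongside PA.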


Results

With the same sample size, the Snake algorithm was the slowest of the four (14,645 ms) and had the lowest MIoU (0.536). The Otsu algorithm outperformed the Snake algorithm in operation speed (6070 ms) and segmentation accuracy (MIoU 0.594), but its pixel accuracy was not improved (0.983 versus 0.988). Compared with these two, the Seg-Net and U-Net algorithms improved substantially in operation speed (54 ms and 58 ms), segmentation accuracy (MIoU 0.945 and 0.957), and pixel accuracy (0.999 for both), and the performance difference between the two is small. This indicates that deep learning methods are superior to traditional image processing methods in operation speed and segmentation accuracy.

Discussion

These four algorithms performed differently, and each has pros and cons. The advantage of the Snake algorithm is that it finds the contour of the target effectively and extracts and tracks the target edge well within a specific region. Its limitation is that it needs repeated interaction: if the initial contour is not accurately described, the subsequent segmentation under-fits. The Otsu algorithm is computationally simple and is not affected by image brightness and contrast. Its limitation is that it cannot accurately separate the target from the background when their areas differ greatly, so that the histogram has no obvious double peak or the two peaks differ greatly in size, or when the gray levels of the target and the background overlap substantially. The Seg-Net algorithm performs excellently in speech, semantic, visual, and various game tasks, and it can be adjusted quickly to adapt to new problems. Its limitations are that training needs a large amount of data and a high hardware configuration, and that the model is a black box whose internal mechanism is difficult to understand. The advantage of the U-Net algorithm is that it can be trained with a small number of samples and introduces an image mirroring operation to make better use of the training data; every pixel is classified, which yields higher segmentation accuracy. Its limitation is that valid (unpadded) convolution makes the model harder to design and less general.

Conclusion

As an important diagnosis and treatment method in TCM, inspection diagnosis is primarily a medical activity performed through the physician's subjective judgment of the patient's signs and symptoms. Doctors vary greatly in the efficiency of diagnosis and treatment, and forming a unified treatment standard is very difficult because of the variation in treatment ideas, diagnosis, and therapeutic methods. Therefore, using machines to capture images and then analyzing them with deep learning algorithms can form an objective specification for tongue-facial diagnosis, which has been a research hotspot in recent years. It is also an essential way to improve the efficiency of TCM diagnosis and treatment, and research on the objectification of tongue-facial diagnosis is progressing quickly. Deep learning algorithms can analyze the images captured by tongue-facial diagnosis instruments and process far more images than manual processing can. The future direction of research on the objectification of tongue-facial diagnosis is to analyze images captured by mobile devices with deep learning algorithms. Image processing algorithms are mainly divided into image segmentation and image classification, and both are dominated by FCN, CNN, and their extensions. Tongue images need pre-processing steps such as light compensation and color correction before processing. Because the face is strongly three-dimensional and rich in detail, and changes of expression can reflect the patient's condition to some extent, handling facial angle and detail features is very important during image acquisition and analysis.
The processing of facial images is more likely to combine machine learning methods with other optimization algorithms, such as combining residual neural networks with feature fusion algorithms, to eliminate the influence of other factors on image quality and improve recognition accuracy. Tongue image classification judges certain feature attributes of the tongue, such as tongue color and tongue coating, whereas tongue image segmentation segments and labels the target; both enable accurate processing of tongue-facial diagnosis images. Traditional tongue diagnosis image processing is represented by algorithms based on the active contour model; the Snake algorithm and the Otsu algorithm have been applied and developed, but their segmentation accuracy and speed are not ideal. With the development of deep learning, convolutional neural networks became popular for segmenting inspection-diagnosis images. The U-Net model, developed from FCN, can handle a variety of medical image segmentation tasks, the Mask R-CNN model achieves high segmentation accuracy, and many scholars have also built efficient algorithms by improving traditional image processing algorithms. However, many issues remain in the objectification of inspection diagnosis. Scholars currently study it without unified standards or specifications; the acquisition environment, light source type, and light intensity lack unified national or industry standards; and the acquired images are therefore not generalizable. Traditional deep learning algorithms for image classification focus on finding similar features, and they classify poorly when the differences between samples are small, as in tongue image classification.
At present, research on the objectification of TCM diagnosis remains at the stage of “color” inspection. It ignores the diagnostic significance of tongue and facial morphology and cannot fully convey the concepts of TCM diagnosis. The current research is also still at the small-sample stage, whereas the advantages of machine learning algorithms show on large data sets, so a large standard database for the objective study of tongue and face diagnosis should be constructed. Forming standards for the objectification of inspection diagnosis will help traditional medicine inspection diagnosis play a more important role in modern medicine and provide auxiliary functions in subsequent processes. Combining machine learning algorithms with intelligent inspection diagnosis is an inevitable trend.

23 in total

1. The bi-elliptical deformable contour and its application to automated tongue segmentation in Chinese medicine.

Authors: Bo Pang; David Zhang; Kuanquan Wang
Journal: IEEE Trans Med Imaging Date: 2005-08 Impact factor: 10.048

2. Remote monitoring of physical and mental state of 2019-nCoV victims using social internet of things, fog and soft computing techniques.

Authors: Manik Sharma; Samriti Sharma; Gurvinder Singh
Journal: Comput Methods Programs Biomed Date: 2020-06-17 Impact factor: 5.428

10. Mask-R²CNN: a distance-field regression version of Mask-RCNN for fetal-head delineation in ultrasound images.

Authors: Sara Moccia; Maria Chiara Fiorentino; Emanuele Frontoni
Journal: Int J Comput Assist Radiol Surg Date: 2021-06-22 Impact factor: 2.924