YunZhi Chen, HuiJuan Lu, LanJuan Li.
Abstract
ICD-10 (International Classification of Diseases, 10th revision) is a classification of diseases, symptoms, procedures, and injuries. In patients' medical records, diseases are often described in free text, such as terms, phrases, and paraphrases, which differ significantly from the names used in the ICD-10 classification. This paper presents an improved approach, based on the Longest Common Subsequence (LCS) and semantic similarity, for automatically mapping Chinese diagnoses from the disease names given by clinicians to the disease names in ICD-10. LCS refers to the longest string that is a subsequence of every member of a given set of strings. The improved LCS method proposed in this paper can increase the accuracy of Chinese disease mapping.
Year: 2017 PMID: 28306739 PMCID: PMC5356997 DOI: 10.1371/journal.pone.0173410
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Example of Standard Diagnosis Library.
| Standard diagnosis | English translation | code |
|---|---|---|
| JIA XING BING DU XING GAN YAN BAN GAN HUN MI | hepatitis A with hepatic coma | B15.000 |
| JI XING JIA XING BING DU XING GAN YAN BAN GAN HUN MI | acute hepatitis A with hepatic coma | B15.001 |
| JI XING ZHONG XING JIA XING BING DU XING GAN YAN BAN GAN HUN MI | acute severe hepatitis A with hepatic coma | B15.002 |
| YA JI XING ZHONG XING JIA XING BING DU XING GAN YAN BAN GAN HUN MI | subacute severe hepatitis A with hepatic coma | B15.003 |
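As an illustration of the LCS definition in the abstract, the following is a minimal sketch of the textbook dynamic-programming computation of LCS length, applied to the first two library entries in the table above. Splitting the romanized names on spaces is an assumption made here for readability; it is not the word segmentation used in the paper, and `lcs_length` is a hypothetical helper name.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # prev[j] is the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching tokens extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better result of dropping one token.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Assumption: whitespace-separated syllables stand in for segmented words.
hepatitis_a = "JIA XING BING DU XING GAN YAN BAN GAN HUN MI".split()
acute_hepatitis_a = "JI XING JIA XING BING DU XING GAN YAN BAN GAN HUN MI".split()

print(len(hepatitis_a), len(acute_hepatitis_a))    # 11 13
print(lcs_length(hepatitis_a, acute_hepatitis_a))  # 11
```

The result equals the length of the shorter name because B15.000 is contained, in order, in B15.001. This is why a length-normalized similarity is needed to tell such closely related entries apart.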
Fig 1. Flowchart of LCS algorithm.
Fig 2. Corpus of Chinese word segmentation of 181 kinds of hepatitis.
Result words of Chinese word segmentation.
| Source of Chinese strings | Number of hepatitis disease names | Number of words after segmentation | Number of words after filtering | Segmentation accuracy |
|---|---|---|---|---|
| Disease names from SDL | 181 | 660 | 115 | 96% |
| Disease names from clinicians | 112 | 532 | 86 | 97.1% |
Similarity calculation based on LCS method.
| L(A) | L(B) | LCSL | LCS | D-LCS | W-LCS |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1.00 | 1.00 | 1.00 |
| 1 | 2 | 1 | 0.50 | 0.67 | 0.67 |
| 1 | 3 | 1 | 0.33 | 0.50 | 0.50 |
| 1 | 4 | 1 | 0.25 | 0.40 | 0.40 |
| 1 | 5 | 1 | 0.20 | 0.33 | 0.33 |
| 2 | 2 | 2 | 1.00 | 1.00 | 1.00 |
| 2 | 3 | 2 | 0.67 | 0.80 | 0.86 |
| 2 | 4 | 2 | 0.50 | 0.67 | 0.75 |
| 2 | 5 | 2 | 0.40 | 0.57 | 0.67 |
| 2 | 6 | 2 | 0.33 | 0.50 | 0.60 |
| 3 | 3 | 3 | 1.00 | 1.00 | 1.00 |
| 3 | 4 | 3 | 0.75 | 0.86 | 0.92 |
| 3 | 5 | 3 | 0.60 | 0.75 | 0.86 |
| 3 | 6 | 3 | 0.50 | 0.67 | 0.80 |
| 3 | 7 | 3 | 0.43 | 0.60 | 0.75 |
| 4 | 4 | 4 | 1.00 | 1.00 | 1.00 |
| 4 | 5 | 4 | 0.80 | 0.89 | 0.95 |
| 4 | 6 | 4 | 0.67 | 0.80 | 0.91 |
| 4 | 7 | 4 | 0.57 | 0.73 | 0.87 |
| 4 | 8 | 4 | 0.50 | 0.67 | 0.83 |
| 5 | 5 | 5 | 1.00 | 1.00 | 1.00 |
| 5 | 6 | 5 | 0.81 | 0.90 | 0.92 |
| 5 | 7 | 5 | 0.75 | 0.82 | 0.90 |
| 5 | 8 | 5 | 0.67 | 0.82 | 0.87 |
| 5 | 9 | 5 | 0.55 | 0.68 | 0.84 |
| 6 | 6 | 6 | 1.00 | 1.00 | 1.00 |
| 6 | 7 | 6 | 0.80 | 0.85 | 0.91 |
| 6 | 8 | 6 | 0.71 | 0.76 | 0.83 |
| 6 | 9 | 6 | 0.66 | 0.72 | 0.78 |
| 6 | 10 | 6 | 0.60 | 0.65 | 0.70 |
| 0.60 | 0.72 | 0.78 | |||
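For the rows with L(A) ≤ 4, the LCS and D-LCS columns above agree with two common length normalizations: LCSL divided by the longer length, and the Dice-style ratio 2·LCSL / (L(A) + L(B)). This is inferred from the tabulated values and is not a formula stated in this record. Several rows with L(A) = 5 or 6 do not follow it, and the W-LCS weighting is not given here, so it is not reproduced. The function names below are hypothetical.

```python
def lcs_ratio(lcs_len, len_a, len_b):
    """Assumed plain LCS similarity: LCS length over the longer name length."""
    return lcs_len / max(len_a, len_b)


def dice_lcs(lcs_len, len_a, len_b):
    """Assumed D-LCS similarity: Dice-style ratio 2*LCSL / (L(A) + L(B))."""
    return 2 * lcs_len / (len_a + len_b)


# In the table, LCSL always equals L(A), i.e. the shorter name is fully matched.
for len_a, len_b in [(1, 2), (2, 3), (3, 4), (4, 7)]:
    lcs_len = len_a
    print(
        len_a,
        len_b,
        round(lcs_ratio(lcs_len, len_a, len_b), 2),
        round(dice_lcs(lcs_len, len_a, len_b), 2),
    )
```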
Fig 3. Similarity line chart when L(A) = 1.
Fig 8. Similarity line chart when L(A) = 6.
Accuracy analysis under similarity threshold (F-score).
| Threshold | LCS | T-LCS | W-LCS |
|---|---|---|---|
| 0.90 | 0.740 | 0.750 | 0.811 |
| 0.80 | 0.712 | 0.737 | 0.800 |
| 0.70 | 0.676 | 0.703 | 0.751 |
| 0.60 | 0.643 | 0.663 | 0.713 |
| 0.50 | 0.628 | 0.635 | 0.662 |
Fig 9. Accuracy analysis chart under similarity threshold (n = 1000).
Accuracy comparison of algorithm experiment result.
| Algorithm | Precision | Recall | F-score |
|---|---|---|---|
| W-LCS | 0.821 | 0.802 | 0.811 |
| bigrams | 0.819 | 0.793 | 0.806 |
| HowNet | 0.813 | 0.756 | 0.783 |
| word matching | 0.781 | 0.741 | 0.760 |
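The F-score column above is consistent with the balanced F-score (F1), the harmonic mean of precision and recall. Assuming that definition, which this record does not state explicitly, the short check below recomputes the column from the tabulated precision and recall values; `f_score` is a hypothetical helper name.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table.
results = {
    "W-LCS": (0.821, 0.802),
    "bigrams": (0.819, 0.793),
    "HowNet": (0.813, 0.756),
    "word matching": (0.781, 0.741),
}

for name, (precision, recall) in results.items():
    print(f"{name}: {f_score(precision, recall):.3f}")
```

Rounded to three decimals, the recomputed values are 0.811, 0.806, 0.783, and 0.760, matching the table.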
Fig 10. Given threshold of coding accuracy and percentage.