Rui Zhang, Jialin Liu, Yong Huang, Miye Wang, Qingke Shi, Jun Chen, Zhi Zeng.
Abstract
BACKGROUND: Entities in everyday clinical text are often expressed differently from how they appear in the nomenclature. Physician notes in a clinical information system (CIS) contain many synonyms, abbreviations, medical jargon terms, and even misspellings, so a terminology without enough synonyms may not be adequate for the task of Chinese clinical term recognition.
Keywords: Chinese term of clinical finding; Clinical term recognition; Concept mapping of terminology; SNOMED CT; Synonyms enrichment; Terminology localization
Year: 2017 PMID: 28464923 PMCID: PMC5414139 DOI: 10.1186/s12911-017-0455-z
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 The system framework
Key words group for CTCF retrieving and recognition
| Key words group | Function | Example |
|---|---|---|
| Time Period | Matches the time unit in a duration phrase. | 年(year),月(month),周(week),天(day),小时(hour)… |
| Number | Matches the numbers used to describe a time duration. | 1,2,3,4,5,6,7,8,9,0,半(half),一(one),二(two),三(three),+(more than)… |
| Modifier in CTCF | Modifiers mingled within a CTCF that can be ignored in the recognition task. | 持续(constantly),逐渐(gradually),明显(obviously), 稍显(slightly),反复(recurrently)… |
| Exception with context | Phrases that, even when matched, are invalid in context. | 抗乙肝药物(anti-HBV drugs), 多尿期(the polyuria stage),最高血压(the highest blood pressure)… |
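The key word groups above can be illustrated with a small Python sketch. It uses only the examples listed in the table (the lists are elided with "…" in the source, so they are incomplete here), and the function names, the regular expression, and the way the groups are combined are assumptions made for illustration, not the authors' implementation. The term 乙肝 in the usage note is likewise a hypothetical matched term.

```python
import re

# Examples copied from the table; the source lists are longer ("…").
TIME_PERIODS = ["年", "月", "周", "天", "小时"]
MODIFIERS = ["持续", "逐渐", "明显", "稍显", "反复"]
EXCEPTIONS = ["抗乙肝药物", "多尿期", "最高血压"]

# Assumption: a duration phrase is one or more "Number" characters
# followed by one "Time Period" key word.
DURATION_RE = re.compile(r"[0-9半一二三+]+(?:" + "|".join(TIME_PERIODS) + ")")


def find_durations(text):
    """Return all time-duration phrases found in the text."""
    return DURATION_RE.findall(text)


def strip_modifiers(text):
    """Remove the listed modifiers so they do not block term matching."""
    for modifier in MODIFIERS:
        text = text.replace(modifier, "")
    return text


def match_is_valid(text, start, end):
    """Reject a match text[start:end] that lies inside an exception phrase."""
    for phrase in EXCEPTIONS:
        for hit in re.finditer(re.escape(phrase), text):
            if hit.start() <= start and end <= hit.end():
                return False
    return True
```

For instance, `find_durations("反复咳嗽3天,发热半月")` picks out the two duration phrases, `strip_modifiers("持续发热")` leaves 发热, and a match of 乙肝 inside 抗乙肝药物 is rejected by `match_is_valid`.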
Fig. 2 General picture of rCTCF and CTCF retrieved
Fig. 3 CTCFs sorted by frequency of occurrence in HPI
Fig. 4 Number of characters in each valid CTCF in HPI
Some system-mapping results of rCTCF candidates for each CTCF
| CTCF | The 1st rCTCF candidate | The 2nd rCTCF candidate | The 3rd rCTCF candidate | The 4th rCTCF candidate | … |
|---|---|---|---|---|---|
| 主动脉夹层 | 主动脉夹层动脉瘤:0.91 | 主动脉扩张:0.75 | 腹主动脉动脉瘤:0.75 | 主动脉瓣关闭不全:0.67 | … |
| 共同性外斜 | 共同性外斜视:0.95 | 共同性内斜视:0.89 | 共同性斜视:0.81 | 间歇性外斜视:0.70 | … |
Each rCTCF candidate is followed by its Hybrid Similarity (HS) score
Fig. 5 Average ranking place of ‘true synonym’ in the rCTCF candidates with different weights
Fig. 6 The ranking place of ‘true synonym’ in the rCTCF candidates
Fig. 7 Detection rate of ‘true synonyms’ in each HS score
The performance of CRF with feature CWS and n-gram
| Models | TP | FP | FN | P | R | F | Total |
|---|---|---|---|---|---|---|---|
| CWS1 + U | 9427 | 1598 | 1791 | 0.855 | 0.840 | 0.848 | 11,218 |
| CWS1 + UB | 9650 | 1210 | 1568 | 0.889 | 0.860 | 0.874 | 11,218 |
| CWS1 + UBT | 9616 | 1104 | 1602 | 0.897 | 0.857 | 0.877 | 11,218 |
| CWS1 + UBTQ | 9510 | 1012 | 1708 | 0.903 | 0.848 | 0.875 | 11,218 |
| CWS2 + U | 9367 | 1676 | 1851 | 0.848 | 0.835 | 0.842 | 11,218 |
| CWS2 + UB | 9646 | 1250 | 1572 | 0.885 | 0.860 | 0.872 | 11,218 |
| CWS2 + UBT | 9635 | 1111 | 1583 | 0.897 | 0.859 | 0.877 | 11,218 |
| CWS2 + UBTQ | 9571 | 1042 | 1647 | 0.902 | 0.853 | 0.877 | 11,218 |
| CWS3 + U | 9282 | 1745 | 1936 | 0.842 | 0.827 | 0.835 | 11,218 |
| CWS3 + UB | 9614 | 1296 | 1604 | 0.881 | 0.857 | 0.869 | 11,218 |
| CWS3 + UBT | 9637 | 1145 | 1581 | 0.894 | 0.859 | 0.876 | 11,218 |
| CWS3 + UBTQ | 9559 | 1057 | 1659 | 0.900 | 0.852 | 0.876 | 11,218 |
| CWS4 + U | 9184 | 1749 | 2034 | 0.840 | 0.819 | 0.829 | 11,218 |
| CWS4 + UB | 9553 | 1331 | 1665 | 0.878 | 0.852 | 0.864 | 11,218 |
| CWS4 + UBT | 9599 | 1156 | 1619 | 0.893 | 0.856 | 0.874 | 11,218 |
| CWS4 + UBTQ | 9542 | 1083 | 1676 | 0.898 | 0.851 | 0.874 | 11,218 |
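The P, R and F columns are consistent with the standard definitions of precision, recall and the balanced F-score computed from the TP, FP and FN counts, with Total equal to TP + FN. A minimal sketch, assuming those standard definitions (the function name `prf` is illustrative):

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F-score) from raw counts."""
    precision = tp / (tp + fp)   # share of predicted terms that are correct
    recall = tp / (tp + fn)      # share of gold-standard terms that are found
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Row "CWS1 + U" from the table: TP = 9427, FP = 1598, FN = 1791.
p, r, f = prf(9427, 1598, 1791)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.855 0.84 0.848
```

Rounded to three decimals, the result reproduces the values reported for that row.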
Fig. 8 The performance of CRF with feature CWS and n-gram
The performance of CRF with all features for CTCF recognition
| Round | Models | TP | FP | FN | P | R | F | Total |
|---|---|---|---|---|---|---|---|---|
| - | M0 (baseline) | 9635 | 1111 | 1583 | 0.897 | 0.859 | 0.877 | 11,218 |
| 1st | M0 + F1 | 9654 | 1135 | 1564 | 0.895 | 0.861 | 0.877 | 11,218 |
| 1st | M0 + F2 | 9671 | 1127 | 1547 | 0.896 | 0.862 | 0.879 | 11,218 |
| 1st | M0 + F3 | 9664 | 1135 | 1554 | 0.895 | 0.862 | 0.878 | 11,218 |
| 1st | M0 + F4 | 9678 | 1101 | 1540 | 0.898 | 0.863 | 0.880 | 11,218 |
| 1st | M0 + F5 (M1) | 9711 | 1057 | 1507 | 0.902 | 0.866 | 0.883 | 11,218 |
| 2nd | M1 + F1 | 9717 | 1066 | 1501 | 0.901 | 0.866 | 0.883 | 11,218 |
| 2nd | M1 + F2 | 9725 | 1083 | 1493 | 0.900 | 0.867 | 0.883 | 11,218 |
| 2nd | M1 + F3 | 9732 | 1082 | 1486 | 0.900 | 0.868 | 0.883 | 11,218 |
| 2nd | M1 + F4 (M2) | 9725 | 1071 | 1493 | 0.901 | 0.867 | 0.884 | 11,218 |
| 3rd | M2 + F1 (M3) | 9754 | 1062 | 1464 | 0.901 | 0.870 | 0.885 | 11,218 |
| 3rd | M2 + F2 | 9752 | 1080 | 1466 | 0.900 | 0.869 | 0.885 | 11,218 |
| 3rd | M2 + F3 | 9758 | 1094 | 1460 | 0.899 | 0.870 | 0.884 | 11,218 |
| 4th | M3 + F2 (M4) | 9785 | 1069 | 1433 | 0.902 | 0.872 | 0.887 | 11,218 |
| 4th | M3 + F3 | 9765 | 1097 | 1453 | 0.899 | 0.871 | 0.885 | 11,218 |
F1: the stop character feature; F2: the current word feature; F3: the current and context word feature; F4: the word POS tag feature; F5: the word associative feature
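The round structure of the table (M0 to M4) reads like a greedy forward feature selection: in each round every remaining feature is added to the current model, and the best-scoring combination becomes the next model. The sketch below replays that procedure on the TP/FP/FN counts from the table. It is an interpretation rather than the authors' code: the function names are illustrative, the F-score is recomputed from the counts, and feature sets not reported in the table are treated as unavailable, which is an assumption.

```python
# (TP, FP, FN) per feature set, copied from the table; the empty set is M0.
COUNTS = {
    frozenset(): (9635, 1111, 1583),
    frozenset({"F1"}): (9654, 1135, 1564),
    frozenset({"F2"}): (9671, 1127, 1547),
    frozenset({"F3"}): (9664, 1135, 1554),
    frozenset({"F4"}): (9678, 1101, 1540),
    frozenset({"F5"}): (9711, 1057, 1507),
    frozenset({"F5", "F1"}): (9717, 1066, 1501),
    frozenset({"F5", "F2"}): (9725, 1083, 1493),
    frozenset({"F5", "F3"}): (9732, 1082, 1486),
    frozenset({"F5", "F4"}): (9725, 1071, 1493),
    frozenset({"F5", "F4", "F1"}): (9754, 1062, 1464),
    frozenset({"F5", "F4", "F2"}): (9752, 1080, 1466),
    frozenset({"F5", "F4", "F3"}): (9758, 1094, 1460),
    frozenset({"F5", "F4", "F1", "F2"}): (9785, 1069, 1433),
    frozenset({"F5", "F4", "F1", "F3"}): (9765, 1097, 1453),
}


def f_score(features):
    """F-score of a feature set; -inf if the table does not report it."""
    counts = COUNTS.get(frozenset(features))
    if counts is None:
        return float("-inf")  # assumption: unreported means unavailable
    tp, fp, fn = counts
    return 2 * tp / (2 * tp + fp + fn)


def greedy_forward_selection(candidates, evaluate):
    """Add one feature per round while the score keeps improving."""
    selected = []
    best = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        score, feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


order, best = greedy_forward_selection(["F1", "F2", "F3", "F4", "F5"], f_score)
print(order, round(best, 3))  # ['F5', 'F4', 'F1', 'F2'] 0.887
```

Under these assumptions the selection order F5, F4, F1, F2 matches the models marked M1 to M4 in the table.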
Fig. 9 The performance of CRF with all features for CTCF recognition. F1: the stop character feature, F2: the current word feature, F3: the current and context word feature, F4: the word POS tag feature, F5: the word associative feature
The performance of RTBA for CTCF recognition
| Models | TP | FP | FN | P | R | F | Total |
|---|---|---|---|---|---|---|---|
| The best CRF model (Baseline) | 9785 | 1069 | 1433 | 0.902 | 0.872 | 0.887 | 11,218 |
| RTBA with original SCCSE (R1) | 10,165 | 743 | 1053 | 0.932 | 0.906 | 0.919 | 11,218 |
Fig. 10 The CTCF predicted by CRF model