Hui Zong, Jinxuan Yang, Zeyu Zhang, Zuofeng Li, Xiaoyan Zhang.
Abstract
BACKGROUND: Semantic categorization of clinical trial eligibility criteria using natural language processing is crucial for optimizing clinical trial design and building automated patient recruitment systems. However, most related research has focused on English eligibility criteria, and to the best of our knowledge, no study has examined Chinese eligibility criteria. In this study, we therefore aimed to explore the semantic categories of Chinese eligibility criteria.
Keywords: Classification; Clinical trials; Clustering; Eligibility criteria; Semantic category
Year: 2021 PMID: 33858409 PMCID: PMC8050926 DOI: 10.1186/s12911-021-01487-w
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
An example of eligibility criteria sentences of Chinese clinical trials registered in ChiCTR
| Chinese inclusion criteria | English inclusion criteria |
|---|---|
| 1. 首次于本中心行肝癌切除术, 术后组织病理学证实HCC; | 1. Patient receives hepatectomy of HCC, which was confirmed with pathology; |
| 2. 术前血清HBsAg ( +); | 2. Serum HBsAg (+) is confirmed preoperatively; |
| 3. 血清HBV-DNA低于检测下限 (内标法); | 3. Serum HBV-DNA is lower than minimum of detection; |
| 4. 肿瘤未侵犯门静脉、肝静脉或胆管的主要分支; | 4. Tumor did not invade portal vein, hepatic vein or major branch of biliary tract; |
| 5. 肝功能Child-Pugh A或B级; | 5. Liver function with Child–Pugh A or B; |
| 6. 年龄18 ~ 65岁, 性别不限; | 6. Aged from 18–65 years male or female; |
| 7. 单发肿瘤者, 最大径 ≤ 5 cm; 多发肿瘤者, 瘤体数 ≤ 3个且各瘤体最大径 ≤ 3 cm; | 7. Single tumor ≤ 5 cm. For multiple tumors, number of tumor ≤ 3 and every tumor ≤ 3 cm |
Registration number: ChiCTR1800016069
Fig. 1 Pipeline of eligibility criteria processing and clustering. a An example of pre-processing an English EC sentence. b The process of transforming Chinese eligibility criteria into a feature matrix based on UMLS semantic types. neop, Neoplastic Process; ftcn, Functional Concept; bpoc, Body Part, Organ, or Organ Component; qlco, Qualitative Concept
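The feature-matrix step in panel b can be illustrated with a minimal Python sketch. It assumes each sentence has already been mapped to a list of UMLS semantic type abbreviations (for example by MetaMap) and that the matrix is binary; both the function name `build_feature_matrix` and the binary encoding are assumptions for illustration, since the caption does not say whether presence or counts were used.

```python
from typing import List, Tuple


def build_feature_matrix(sentence_types: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix); matrix[i][j] is 1 if sentence i
    contains semantic type vocabulary[j], else 0 (binary encoding is assumed)."""
    # Column order: alphabetically sorted semantic type abbreviations.
    vocabulary = sorted({t for types in sentence_types for t in types})
    index = {t: j for j, t in enumerate(vocabulary)}
    matrix = []
    for types in sentence_types:
        row = [0] * len(vocabulary)
        for t in types:
            row[index[t]] = 1
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical semantic types for two sentences, using abbreviations from the caption.
example = [
    ["neop", "ftcn", "bpoc"],
    ["qlco", "neop"],
]
vocabulary, matrix = build_feature_matrix(example)
```

Each row of `matrix` can then be passed to a clustering algorithm as the representation of one criteria sentence.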
Descriptions of the pre-processing steps for English eligibility criteria sentences
| Pre-processing step | Description |
|---|---|
| Delete ordinal number | Ordinal numbers appear in many forms (e.g., “1.”, “①”, “(1)”) and were deleted with regular expressions |
| Replace the ASCII code | We replaced the ASCII code with a format that MetaMap can handle, based on rules |
| Lemmatization | Lemmatization groups the different inflected forms of a word so that they can be analyzed as the canonical form of the word. We did this with the Python package NLTK |
| Replace abbreviation | We replaced abbreviations with their full spelling, based on a dictionary |
| Delete symbols of number, operator and unit | The varied formats of numbers, operators and units sometimes interfere with the output of MetaMap, so they were deleted with regular expressions |
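Three of the steps in the table (deleting ordinal numbers, replacing abbreviations, and deleting numbers, operators and units) can be sketched with Python's standard `re` module. This is a minimal illustration, not the authors' code: the regular expressions, the unit list, and the one-entry `ABBREVIATIONS` dictionary are all assumptions, and the character replacement and NLTK lemmatization steps are omitted.

```python
import re

# Hypothetical abbreviation dictionary; the source does not list its entries.
ABBREVIATIONS = {"HCC": "hepatocellular carcinoma"}

# Leading ordinal markers such as "1.", "(1)", or circled digits like "①".
ORDINAL_RE = re.compile(r"^\s*(?:\(\d+\)|\d+[.)](?!\d)|[\u2460-\u2473])\s*")
# Comparison-style operators (illustrative subset).
OPERATOR_RE = re.compile(r"[≤≥<>=~±]")
# A number, optionally followed by a unit from a small illustrative list.
NUMBER_UNIT_RE = re.compile(
    r"\d+(?:\.\d+)?\s*(?:(?:cm|mm|kg|years?)\b|%)?", re.IGNORECASE
)


def preprocess(sentence: str) -> str:
    """Apply a simplified version of three pre-processing steps."""
    text = ORDINAL_RE.sub("", sentence)           # delete ordinal number
    for abbr, full in ABBREVIATIONS.items():      # replace abbreviation
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    text = OPERATOR_RE.sub(" ", text)             # delete operators
    text = NUMBER_UNIT_RE.sub(" ", text)          # delete numbers and units
    return re.sub(r"\s+", " ", text).strip()      # collapse leftover spaces


cleaned = preprocess("7. Single tumor ≤ 5 cm")
```

With these assumed patterns, `"7. Single tumor ≤ 5 cm"` is reduced to `"Single tumor"`, leaving only the words that a concept mapper would need.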
The 8 topic groups and 44 criteria categories of Chinese eligibility criteria, with the ratio (count) in 19,185 criteria sentences and the average incidence and prevalence in 272 HCC-related clinical trials
| Topic group | Criteria category | Ratio (count) | Average incidence | Prevalence |
|---|---|---|---|---|
| Health status | Disease | 23.40% (4489) | 2.7904 (759) | 68.75% (187) |
| | Symptom | 0.36% (70) | 0.0110 (3) | 1.10% (3) |
| | Sign | 1.64% (314) | 0.0294 (8) | 2.94% (8) |
| | Pregnancy-related activity | 4.65% (893) | 0.4118 (112) | 33.82% (92) |
| | Neoplasm status | 0.14% (26) | 0.1397 (38) | 11.03% (30) |
| | Non-neoplasm disease stage | 0.62% (118) | 0.0074 (2) | 0.74% (2) |
| | Allergy intolerance | 2.80% (538) | 0.2132 (58) | 19.49% (53) |
| | Organ or tissue status | 1.67% (321) | 0.2574 (70) | 21.32% (58) |
| | Life expectancy | 0.66% (127) | 0.1838 (50) | 17.65% (48) |
| | Oral related | 0.16% (31) | 0.0000 (0) | 0.00% (0) |
| Treatment or health care | Pharmaceutical substance or drug | 3.77% (724) | 0.1213 (33) | 9.93% (27) |
| | Therapy or surgery | 6.96% (1336) | 1.2463 (339) | 54.78% (149) |
| | Device | 0.40% (77) | 0.0221 (6) | 2.21% (6) |
| | Nursing | 0.06% (12) | 0.0000 (0) | 0.00% (0) |
| Diagnostic or lab test | Diagnostic | 5.19% (995) | 0.6103 (166) | 40.81% (111) |
| | Laboratory examinations | 4.93% (945) | 0.8934 (243) | 29.41% (80) |
| | Risk assessment | 2.86% (549) | 0.7757 (211) | 42.65% (116) |
| | Receptor status | 0.04% (8) | 0.0074 (2) | 0.74% (2) |
| Demographic characteristics | Age | 4.12% (790) | 0.2353 (64) | 22.79% (62) |
| | Special patient characteristic | 0.31% (59) | 0.0074 (2) | 0.74% (2) |
| | Literacy | 0.17% (32) | 0.0000 (0) | 0.00% (0) |
| | Gender | 0.11% (21) | 0.0000 (0) | 0.00% (0) |
| | Education | 0.07% (14) | 0.0000 (0) | 0.00% (0) |
| | Address | 0.17% (32) | 0.0110 (3) | 1.10% (3) |
| | Ethnicity | 0.08% (15) | 0.0000 (0) | 0.00% (0) |
| Ethical consideration | Consent | 5.74% (1101) | 0.4632 (126) | 38.60% (105) |
| | Enrollment in other studies | 2.35% (451) | 0.1176 (32) | 11.76% (32) |
| | Researcher decision | 1.98% (379) | 0.1324 (36) | 12.50% (34) |
| | Capacity | 0.73% (140) | 0.0184 (5) | 1.84% (5) |
| | Ethical audit | 0.02% (3) | 0.0037 (1) | 0.37% (1) |
| | Compliance with protocol | 1.92% (368) | 0.2022 (55) | 17.65% (48) |
| Lifestyle choice | Addictive behavior | 1.27% (244) | 0.0221 (6) | 2.21% (6) |
| | Bedtime | 0.02% (4) | 0.0000 (0) | 0.00% (0) |
| | Exercise | 0.10% (20) | 0.0000 (0) | 0.00% (0) |
| | Diet | 0.24% (46) | 0.0000 (0) | 0.00% (0) |
| | Alcohol consumer | 0.05% (10) | 0.0000 (0) | 0.00% (0) |
| | Sexual related | 0.02% (3) | 0.0000 (0) | 0.00% (0) |
| | Smoking status | 0.22% (42) | 0.0000 (0) | 0.00% (0) |
| | Blood donation | 0.08% (16) | 0.0000 (0) | 0.00% (0) |
| Data or Patient source | Encounter | 0.22% (43) | 0.0000 (0) | 0.00% (0) |
| | Disabilities | 0.02% (4) | 0.0000 (0) | 0.00% (0) |
| | Healthy | 0.11% (22) | 0.0000 (0) | 0.00% (0) |
| | Data accessible | 0.28% (53) | 0.0294 (8) | 2.94% (8) |
| Others | Multiple | 19.29% (3700) | 1.4632 (398) | 78.68% (214) |
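The three statistics in this table are not defined in the text here, but the reported values are consistent with simple definitions: ratio = category count / 19,185 sentences, average incidence = category count in the HCC trials / 272 trials, and prevalence = number of HCC trials containing the category / 272 trials (for Disease, 759/272 ≈ 2.7904 and 187/272 = 68.75%). The Python sketch below implements those inferred definitions; the function names and the toy trial data are hypothetical.

```python
from typing import List


def ratio(category_count: int, total_sentences: int) -> float:
    """Share of all criteria sentences that fall in the category."""
    return category_count / total_sentences


def average_incidence(trials: List[List[str]], category: str) -> float:
    """Mean number of sentences of the category per trial."""
    return sum(labels.count(category) for labels in trials) / len(trials)


def prevalence(trials: List[List[str]], category: str) -> float:
    """Share of trials with at least one sentence of the category."""
    return sum(1 for labels in trials if category in labels) / len(trials)


# Toy data (hypothetical): each inner list holds the category labels
# of one trial's criteria sentences.
toy_trials = [
    ["Disease", "Disease", "Age"],
    ["Disease", "Consent"],
    ["Age"],
    ["Consent", "Multiple"],
]

disease_incidence = average_incidence(toy_trials, "Disease")   # 3 sentences / 4 trials
disease_prevalence = prevalence(toy_trials, "Disease")         # 2 trials / 4 trials
disease_ratio = ratio(4489, 19185)                             # counts from the Disease row
```

Note that the ratio column is computed over the full 19,185-sentence corpus, while the other two columns use only the 272 HCC-related trials, which is why the counts in parentheses differ between columns.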
Fig. 2 Comparison of our proposed semantic categories with previous work. The alignment of our 44 semantic categories (green) with Luo’s 27 semantic classes (red) and Van Spall’s 38 categories (blue). The histograms represent prevalence. In Luo’s work, prevalence was calculated from 1578 randomly selected clinical trials from www.clinicaltrials.gov. In Van Spall’s work, prevalence was calculated from 283 randomly selected clinical trials published in high-impact medical journals. In our study, prevalence was calculated from 272 HCC-related clinical trials from ChiCTR. A line between two categories indicates that they have the same meaning. The 13 semantic categories that are not linked to any other category are novel categories in Chinese eligibility criteria
Fig. 3 The detailed data distribution of each semantic category for training and testing
Fig. 4 The F1 score of each semantic category by 9 classifiers
Overall classification performance of the 9 classifiers, averaged at the macro and micro levels
| Models | Macro precision | Macro recall | Macro F1-score | Micro precision | Micro recall | Micro F1-score |
|---|---|---|---|---|---|---|
| Machine learning algorithms | | | | | | |
| NB | 0.5398 | 0.7403 | 0.5965 | 0.6312 | 0.6312 | 0.6312 |
| kNN | 0.7531 | 0.6693 | 0.6948 | 0.7632 | 0.7632 | 0.7632 |
| LR | 0.8017 | 0.7574 | 0.7732 | 0.8173 | 0.8173 | 0.8173 |
| SVM | | 0.7712 | 0.7899 | 0.8293 | 0.8293 | 0.8293 |
| Deep learning algorithms | | | | | | |
| CNN | 0.8004 | 0.6951 | 0.7258 | 0.8142 | 0.8142 | 0.8142 |
| RNN | 0.7837 | 0.6925 | 0.7170 | 0.8138 | 0.8138 | 0.8138 |
| FastText | 0.7645 | 0.7188 | 0.7341 | 0.8182 | 0.8182 | 0.8182 |
| Pre-trained language models | | | | | | |
| BERT | 0.7994 | 0.8023 | 0.7958 | 0.8447 | 0.8447 | 0.8447 |
| ERNIE | 0.7964 | | | | | |
Bold indicates the best value per metric
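The difference between the macro and micro columns can be made concrete with a short Python sketch. Macro averaging computes precision, recall and F1 per category and then takes the unweighted mean, whereas micro averaging pools the true positives, false positives and false negatives over all categories first. In single-label multi-class classification the pooled false positives equal the pooled false negatives, so micro precision, recall and F1 coincide, which matches the identical micro values in each row above. The function name and toy labels are hypothetical, and taking macro F1 as the mean of per-category F1 scores is an assumption about how it was computed.

```python
from collections import Counter
from typing import List, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F1)


def macro_micro_scores(y_true: List[str], y_pred: List[str]) -> Tuple[Scores, Scores]:
    """Return ((macro P, R, F1), (micro P, R, F1)) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative

    def prf(tp_: int, fp_: int, fn_: int) -> Scores:
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    # Macro: unweighted mean of the per-category scores.
    macro = tuple(sum(s[i] for s in per_label) / len(labels) for i in range(3))
    # Micro: scores computed from the pooled counts.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels (hypothetical), using category names from the table of categories.
y_true = ["Age", "Age", "Disease", "Consent"]
y_pred = ["Age", "Disease", "Disease", "Consent"]
macro, micro = macro_micro_scores(y_true, y_pred)
```

Because macro averaging weights every category equally, rare categories with poor scores pull the macro values down more than the micro values, which is one reason the two sets of columns can differ noticeably for the same model.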