Pilar López-Úbeda, Alexandra Pomares-Quimbaya, Manuel Carlos Díaz-Galiano, Stefan Schulz.
Abstract
BACKGROUND: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but they have limitations such as the lexical ambiguity of clinical terms. However, most such terms are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical texts by the clinical specialty to which they belong.
Keywords: Clinical specialty; Medical sub-domain; Medical sub-language; Natural language processing; Vocabulary
Year: 2021 PMID: 33947365 PMCID: PMC8094531 DOI: 10.1186/s12911-021-01495-w
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Overview of the extraction method
Fig. 2 Clinical specialty selection process, including the variations in the number of specialties (left) when applied to the Spanish case (right)
Fig. 3 Example of MeSH term information
Method notation
| Notation | Description |
|---|---|
| | The term under scrutiny |
| | The total number of texts in the corpus |
| | The number of texts belonging to the specialty |
| | The number of texts that contain the term |
| | The number of texts belonging to a specialty |
| | The number of specialties that contain the term |
| | The number of occurrences in texts of the specialty |
| | The number of occurrences of the term |
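The quantities described in the notation table can be illustrated with a small counting sketch. This is not the authors' implementation: the function name `notation_counts`, the dictionary keys, and the corpus layout (one specialty label and one token list per text) are assumptions made for illustration, and terms are treated as single tokens for simplicity.

```python
def notation_counts(corpus, term, specialty):
    """Count the quantities named in the notation table.

    corpus: list of (specialty_label, tokens) pairs. This layout is
    hypothetical; each text carries exactly one specialty label here.
    """
    return {
        # Total number of texts in the corpus
        "total_texts": len(corpus),
        # Number of texts belonging to the given specialty
        "texts_in_specialty": sum(1 for s, _ in corpus if s == specialty),
        # Number of texts that contain the term at least once
        "texts_with_term": sum(1 for _, toks in corpus if term in toks),
        # Number of distinct specialties with a text containing the term
        "specialties_with_term": len({s for s, toks in corpus if term in toks}),
        # Occurrences of the term in texts of the given specialty
        "occurrences_in_specialty": sum(
            toks.count(term) for s, toks in corpus if s == specialty
        ),
        # Occurrences of the term in the whole corpus
        "total_occurrences": sum(toks.count(term) for _, toks in corpus),
    }


# Toy corpus, for illustration only
corpus = [
    ("cardiology", ["insuficiencia", "arterial", "riesgo"]),
    ("cardiology", ["arterial", "arterial"]),
    ("nephrology", ["insuficiencia", "renal"]),
    ("urology", ["renal", "trasplante"]),
]

counts = notation_counts(corpus, "arterial", "cardiology")
```

In this toy corpus, "arterial" occurs only in cardiology texts, whereas "insuficiencia" is shared between cardiology and nephrology, which is the kind of contrast a specialty-aware term score would need to capture.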
Fig. 4 Terms per specialty and number of Spanish PubMed titles and abstracts. The value inside each point indicates how many standard deviations a specialty is from the mean (the z-score). The average number of tokens is 13.3 in the titles and 249.72 in the abstracts
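As a reminder of how a z-score such as the one shown in Fig. 4 is obtained, the sketch below standardizes a set of per-specialty counts. The specialty names and counts are invented, and the use of the population standard deviation is an assumption; the caption does not say which variant was used.

```python
from statistics import mean, pstdev


def z_scores(counts):
    """Return how many standard deviations each value lies from the mean."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # All values are identical, so every item sits on the mean
        return {name: 0.0 for name in counts}
    return {name: (n - mu) / sigma for name, n in counts.items()}


# Hypothetical term counts per specialty, not taken from the paper
term_counts = {"specialty_a": 2, "specialty_b": 4, "specialty_c": 6}
z = z_scores(term_counts)
```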
Most frequent n-grams and the clinical specialties in which they appear. Riesgo: risk, población: population, atención primaria: primary care
| N-gram | Total frequency | Specialties in which the n-gram mainly occurs |
|---|---|---|
| Cáncer | 21,545 | Medical oncology, preventive medicine, geriatrics, pathology, general surgery |
| Riesgo | 15,796 | Preventive medicine, epidemiology, cardiology, geriatrics, general surgery |
| Salud | 14,608 | Preventive medicine, epidemiology, community psychiatry, geriatrics, family practice |
| Renal | 13,884 | Urology, nephrology, preventive medicine, general surgery, geriatrics |
| Evaluación | 10,655 | Preventive medicine, geriatrics, epidemiology, general surgery, cardiology |
| Población | 9,592 | Preventive medicine, epidemiology, geriatrics, cardiology, endocrinology |
| Tumor | 8,598 | Medical oncology, pathology, preventive medicine, geriatrics, general surgery |
| Virus | 7,997 | Preventive medicine, epidemiology, venereology, immunochemistry, medical oncology |
| Carcinoma | 7,973 | Medical oncology, pathology, geriatrics, preventive medicine, general surgery |
| Atención primaria | 7,526 | Family practice, preventive medicine, geriatrics, epidemiology, cardiology |
| Factor de riesgo | 6,747 | Preventive medicine, epidemiology, cardiology, geriatrics, endocrinology |
| Mortalidad | 6,572 | Preventive medicine, epidemiology, geriatrics, neonatology, cardiology |
| Arterial | 6,425 | Cardiology, preventive medicine, geriatrics, epidemiology, nephrology |
| Programa | 6,216 | Preventive medicine, community psychiatry, epidemiology, geriatrics, family practice |
| Trasplante | 6,103 | General surgery, preventive medicine, urology, nephrology, thoracic surgery |
| Pronóstico | 6,040 | Preventive medicine, geriatrics, medical oncology, cardiology, pathology |
| Insuficiencia | 5,995 | Cardiology, preventive medicine, nephrology, urology, geriatrics |
Fig. 5 Similarity between specialties according to their n-grams
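One simple way to compare specialties by their n-grams is the Jaccard index over their n-gram sets. The caption does not state which similarity measure underlies Fig. 5, so the choice of Jaccard here is an assumption, and the vocabularies below are toy sets rather than the paper's data.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of n-grams common to both sets, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(vocab):
    """Jaccard similarity for every unordered pair of specialties."""
    return {
        (s1, s2): jaccard(vocab[s1], vocab[s2])
        for s1, s2 in combinations(sorted(vocab), 2)
    }


# Toy n-gram sets per specialty, for illustration only
vocab = {
    "cardiology": {"arterial", "insuficiencia", "riesgo"},
    "nephrology": {"renal", "insuficiencia", "arterial"},
    "urology": {"renal", "trasplante"},
}

similarity = pairwise_similarity(vocab)
```

With these toy sets, cardiology and nephrology share two of four distinct n-grams, while cardiology and urology share none.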
Multi-label classification results on the annotated data, using the SCOVACLIS score and with stop n-grams removed (− stop n-grams)
| Classifier | Word representation | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| Random forest | TF-IDF | 71.7 | 25.1 | 38.4 |
| Decision tree | TF-IDF | 47.9 | 38.1 | 42.4 |
| KNeighbors | TF-IDF | 63.3 | 39.0 | 48.2 |
| MLP | TF-IDF | 75.1 | 53.3 | 59.3 |
| Random forest | TF-IDF + SCOVACLIS score | 70.0 | 17.5 | 28.7 |
| Decision tree | TF-IDF + SCOVACLIS score | 46.2 | 43.5 | 44.8 |
| KNeighbors | TF-IDF + SCOVACLIS score | 69.3 | 42.6 | 52.7 |
| MLP | TF-IDF + SCOVACLIS score | 74.7 | 57.4 | 64.9 |
| Random forest | SCOVACLIS score | 76.0 | 32.6 | 45.6 |
| Decision tree | SCOVACLIS score | 42.6 | 43.1 | 42.8 |
| KNeighbors | SCOVACLIS score | 69.4 | 42.1 | 52.4 |
| MLP | SCOVACLIS score | 75.8 | 43.5 | 55.3 |
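| Random forest | TF-IDF + SCOVACLIS score − stop n-grams | 70.3 | 18.9 | 30.6 |
| Decision tree | TF-IDF + SCOVACLIS score − stop n-grams | 46.4 | 43.9 | 45.1 |
| KNeighbors | TF-IDF + SCOVACLIS score − stop n-grams | 68.9 | 42.7 | 52.7 |
| MLP | TF-IDF + SCOVACLIS score − stop n-grams | | | |
| Random forest | SCOVACLIS score − stop n-grams | 76.8 | 32.6 | 45.8 |
| Decision tree | SCOVACLIS score − stop n-grams | 43.1 | 43.1 | 43.1 |
| KNeighbors | SCOVACLIS score − stop n-grams | 68.9 | 42.7 | 52.7 |
| MLP | SCOVACLIS score − stop n-grams | 75.6 | 43.5 | 55.4 |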
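To make the P, R and F1 columns concrete, the sketch below computes precision, recall and F1 for multi-label predictions, where each text may carry several specialties. It uses micro-averaging, which is an assumption: the table does not state the averaging scheme, so this code is not expected to reproduce the reported figures. The function name `micro_prf` and the example labels are illustrative.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 for multi-label output.

    gold, predicted: parallel lists of label sets, one set per text.
    """
    # A true positive is a predicted label that is also a gold label
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted specialties for three texts
gold = [
    {"cardiology", "nephrology"},
    {"urology"},
    {"medical oncology", "pathology"},
]
predicted = [
    {"cardiology"},
    {"urology", "nephrology"},
    {"pathology"},
]

precision, recall, f1 = micro_prf(gold, predicted)
```

Here three of the four predicted labels are correct and three of the five gold labels are found, which gives a precision of 0.75 and a recall of 0.6.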
Multi-label classification results on the annotated data, using the SCOVACLIS score, with terms filtered by SNOMED CT
| Classifier | Word representation | Original P (%) | Original R (%) | Original F1 (%) | SNOMED CT filter P (%) | SNOMED CT filter R (%) | SNOMED CT filter F1 (%) |
|---|---|---|---|---|---|---|---|
| Random forest | TF-IDF + SCOVACLIS score | 70.0 | 17.5 | 28.7 | 69.1 | 14.0 | 22.6 |
| Decision tree | TF-IDF + SCOVACLIS score | 46.2 | 43.5 | 44.8 | 37.3 | 34.0 | 35.8 |
| KNeighbors | TF-IDF + SCOVACLIS score | 69.3 | 42.6 | 52.7 | 62.8 | 31.1 | 41.6 |
| MLP | TF-IDF + SCOVACLIS score | 74.7 | 57.4 | 64.9 | 70.1 | 56.1 | 60.5 |
| Random forest | SCOVACLIS score | 76.0 | 32.6 | 45.6 | 65.4 | 14.0 | 23.1 |
| Decision tree | SCOVACLIS score | 42.6 | 43.1 | 42.8 | 31.5 | 26.8 | 28.9 |
| KNeighbors | SCOVACLIS score | 69.4 | 42.1 | 52.4 | 57.9 | 29.0 | 38.7 |
| MLP | SCOVACLIS score | 75.8 | 43.5 | 55.3 | 73.4 | 24.3 | 36.6 |