Aleksei Dudchenko, Georgy Kopanitsa.
Abstract
This paper extends work originally presented at the 16th International Conference on Wearable, Micro and Nano Technologies for Personalized Health. Despite the adoption of electronic medical records, free narrative text is still widely used in medical documentation. Supervised machine learning algorithms can be applied to make the data in these texts available to decision support systems. In this work, we developed a prototype medical data extraction system and compared different artificial neural network architectures for processing free medical texts in the Russian language. Three classifiers were applied to extract entities from snippets of text. Multi-layer perceptron (MLP) and convolutional neural network (CNN) classifiers showed similar results with all three embedding models. The MLP outperformed the CNN on pipelines that used the embedding model trained on medical records with preliminary lemmatization. Nevertheless, the highest F-score (0.9763) was achieved by the CNN, which slightly outperformed the MLP when the largest word2vec model was applied.
Keywords: data extraction; machine learning; medical records; word embedding
Year: 2019 PMID: 31717300 PMCID: PMC6888408 DOI: 10.3390/ijerph16224360
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. Word embeddings and sample preprocessing.
SNOMED codes and sample distribution.
| SNOMED Code | SNOMED Names | Count |
|---|---|---|
| 443502000 | Atherosclerosis of coronary artery (disorder) | 235 |
| 57546000 | Asthma with status asthmaticus (disorder) | 11 |
| 72866009 | Varicose veins of lower extremity (disorder) | 22 |
| 4556007 | Gastritis (disorder) | 28 |
| 70153002 | Hemorrhoids (disorder) | 15 |
| 25064002 | Headache (finding) | 52 |
| 386705008 | Lightheadedness (finding) | 34 |
| 1201005 | Benign essential hypertension (disorder) | 125 |
| 84229001 | Fatigue (finding) | 71 |
| 235856003 | Disorder of liver (disorder) | 11 |
| 84089009 | Hiatal hernia (disorder) | 10 |
| 76581006 | Cholecystitis (disorder) | 18 |
| 45816000 | Pyelonephritis (disorder) | 29 |
| 44054006 | Diabetes mellitus type 2 (disorder) | 32 |
| 413838009 | Chronic ischemic heart disease (disorder) | 101 |
| 162864005 | Body mass index 30+—obesity (finding) | 25 |
| 266556005 | Calculus of kidney and ureter (disorder) | 20 |
| 298494008 | Scoliosis of thoracic spine (disorder) | 18 |
| 235494005 | Chronic pancreatitis (disorder) | 18 |
| 51868009 | Duodenal ulcer disease (disorder) | 24 |
| 191268006 | Chronic anemia (disorder) | 11 |
| 102572006 | Edema of lower extremity (finding) | 13 |
| 709044004 | Chronic kidney disease (disorder) | 34 |
| (other) | Snippets without any disorder or finding | 25 |
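The counts above are noticeably imbalanced, which matters when reading a single aggregate F-score. The following is a minimal sketch that summarizes the distribution from the table. The inverse-frequency weighting in `balanced_weights` is a common heuristic shown purely for illustration; the source does not say whether the authors used any class weighting, and the variable and function names are hypothetical.

```python
# Counts copied from the SNOMED table above.
# "other" stands for snippets without any disorder or finding.
COUNTS = {
    "443502000": 235, "57546000": 11, "72866009": 22, "4556007": 28,
    "70153002": 15, "25064002": 52, "386705008": 34, "1201005": 125,
    "84229001": 71, "235856003": 11, "84089009": 10, "76581006": 18,
    "45816000": 29, "44054006": 32, "413838009": 101, "162864005": 25,
    "266556005": 20, "298494008": 18, "235494005": 18, "51868009": 24,
    "191268006": 11, "102572006": 13, "709044004": 34, "other": 25,
}

total = sum(COUNTS.values())   # sum of all counts in the table
n_classes = len(COUNTS)        # number of table rows, including "other"
imbalance_ratio = max(COUNTS.values()) / min(COUNTS.values())


def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count).

    Assumption: an illustrative heuristic, not the authors' method.
    """
    grand_total = sum(counts.values())
    return {label: grand_total / (len(counts) * n) for label, n in counts.items()}


weights = balanced_weights(COUNTS)
print(f"total={total}, classes={n_classes}, max/min={imbalance_ratio}")
```

The most frequent class (443502000) is 23.5 times larger than the rarest one (84089009), so per-class scores can differ considerably even when the overall F-score is high.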
Word2vec model parameters.
| Embedding Model | Corpus | Total Words | Normalization | Words in the Model | Vector Size |
|---|---|---|---|---|---|
| Word2vec model 1 | 220 medical records | 1,418,728 | no | 7505 | 50 |
| Word2vec model 2 | 220 medical records | 1,418,728 | yes | 3879 | 300 |
| Word2vec model 3 | Russian National Corpus (RNC) and Wikipedia | 788,000,000 | yes | 248,000 | 300 |
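To make the size differences between the three embedding models concrete, the sketch below derives the number of embedding weights (words in the model times vector size) for each row of the table. The `EmbeddingModel` class and the 4-byte float32 storage size are assumptions made for illustration; the source does not describe how the vectors were stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModel:
    """One row of the word2vec parameter table (hypothetical helper type)."""
    name: str
    total_words: int   # corpus size in words
    normalized: bool   # "Normalization" column
    vocab_size: int    # "Words in the Model" column
    vector_size: int

    @property
    def n_weights(self) -> int:
        # Size of the embedding matrix: one vector per vocabulary word.
        return self.vocab_size * self.vector_size


MODELS = [
    EmbeddingModel("Word2vec model 1", 1_418_728, False, 7_505, 50),
    EmbeddingModel("Word2vec model 2", 1_418_728, True, 3_879, 300),
    EmbeddingModel("Word2vec model 3", 788_000_000, True, 248_000, 300),
]

BYTES_PER_WEIGHT = 4  # assumption: float32 storage

for m in MODELS:
    mib = m.n_weights * BYTES_PER_WEIGHT / 2**20
    print(f"{m.name}: {m.n_weights:,} weights, about {mib:.1f} MiB")

# Share of the vocabulary that remains after normalization on the same corpus.
vocab_ratio = MODELS[1].vocab_size / MODELS[0].vocab_size
print(f"Model 2 keeps {vocab_ratio:.1%} of model 1's vocabulary")
```

On the same 220 medical records, normalization roughly halves the vocabulary (7505 to 3879 words), while the third model's vocabulary is larger by about two orders of magnitude.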
F-score.
| Prediction Model | Pipeline 1 | Pipeline 2 | Pipeline 3 |
|---|---|---|---|
| Multi-layer perceptron (MLP) | 0.9374 | 0.9590 | 0.9741 |
| Convolutional neural networks (CNN) | 0.9353 | 0.9525 | 0.9763 |
| Long short-term memory networks (LSTMs) | 0.9351 | 0.9355 | 0.9375 |
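For readers unfamiliar with the metric, the following is a minimal sketch of a multi-class F-score computed from true and predicted labels. The source does not state how scores were averaged across classes, so the macro average used here is an assumption, and the toy labels are hypothetical rather than taken from the study's data.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy example using SNOMED codes as class labels.
y_true = ["25064002", "25064002", "84229001", "other"]
y_pred = ["25064002", "84229001", "84229001", "other"]
print(round(macro_f1(y_true, y_pred), 4))
```

Macro averaging gives every class equal influence regardless of its frequency; a micro or frequency-weighted average would instead be dominated by the largest classes.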
Figure 2. F-score: (a) grouped by pipeline; (b) grouped by predictive model.