Wenke Xiao1, Lijia Jing2, Yaxin Xu1, Shichao Zheng1, Yanxiong Gan1, Chuanbiao Wen1.
Abstract
The amount of medical text data is increasing dramatically. Medical text data record the progress of medicine and contain a large amount of medical knowledge. Because they are written in natural language, they are semistructured, high-dimensional, and high in volume, they carry semantics, and they cannot be used directly in arithmetic operations. Extracting useful knowledge or information from the available data is therefore an important task, and data mining techniques can be used to do so. In this study, we reviewed different approaches to medical text data mining. We analyzed the advantages and shortcomings of each technique across the different stages of processing medical text data. We also explored applications of the algorithms, to give users insight and help them apply these resources to specific challenges in medical text data. Further, we discussed the main challenges in medical text data mining. The findings of this paper can help researchers choose reasonable techniques for mining medical text data and make them aware of the main challenges in the field.
Year: 2021 PMID: 34912530 PMCID: PMC8668297 DOI: 10.1155/2021/1285167
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
The detailed algorithm information for missing values in medical text data.
| Algorithm | Principle | Purpose |
|---|---|---|
| Multiple imputation | Estimate the value to be interpolated, and add different noises to form multiple groups of optional interpolation values; select the most appropriate interpolation value according to a certain selection basis. | Repeat the simulation to supplement the missing value |
| Expectation maximization | Compute maximum likelihood estimates or posterior distributions with incomplete data. | Supplement missing values |
| | Select its | Estimate missing values with samples |
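
As a rough illustration of the multiple imputation row above, the sketch below fills each missing entry with the observed mean plus random Gaussian noise, repeats this `m` times, and pools the results. The function names, the use of `None` to mark a missing value, the Gaussian noise model, and pooling by averaging are all assumptions made for this example and are not taken from the reviewed studies; in particular, the table's step of selecting the most appropriate value is replaced here by simple pooling.

```python
import random
import statistics


def multiple_impute(values, m=5, seed=0):
    """Return m completed copies of `values`; None marks a missing entry (assumed convention)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    # The sample standard deviation needs at least two observed values.
    sd = statistics.stdev(observed) if len(observed) > 1 else 0.0
    completed = []
    for _ in range(m):
        # Estimate each missing value as the mean plus a fresh noise draw.
        completed.append(
            [v if v is not None else rng.gauss(mean, sd) for v in values]
        )
    return completed


def pooled_mean(completed):
    """Pool the m completed datasets by averaging their means."""
    return statistics.mean(statistics.mean(d) for d in completed)


# Hypothetical measurements with two missing entries.
data = [1.0, None, 3.0, None, 5.0]
datasets = multiple_impute(data, m=4, seed=1)
estimate = pooled_mean(datasets)
```

Observed values are left untouched in every copy; only the missing positions vary between the `m` datasets, which is what lets the spread across copies reflect the uncertainty of the imputation.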
Figure 1. Schematic of natural language processing flow.
The information of analysis methods for medical text data.
| Methods | Purpose | Algorithms | Advantages | Shortcomings |
|---|---|---|---|---|
| Clustering | Classify similar subjects in medical texts | | 1. Simple and fast | 1. Large amount of data and time-consuming |
| Classification | Read medical text data for intention recognition | ANN | 1. Solve complex mechanisms in text data | 1. Slow training |
| | | Decision tree | 1. Handle continuous variables and missing values | 1. Overfitting |
| | | Naive Bayes | 1. The learning process is easy | Higher requirements for data independence |
| Association rules | Mine frequent items and corresponding association rules from massive medical text datasets | Apriori | Simple and easy to implement | Low efficiency and time-consuming |
| | | FP-tree | 1. Reduce the number of database scans | High memory overhead |
| | | FP-growth | 1. Improve data density structure | Harder to achieve |
| Logistic regression | Analyze how variables affect results | Logistic regression | 1. Visual understanding and interpretation | 1. Easy underfitting |
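
To make the Apriori row above more concrete, here is a minimal sketch of level-wise frequent itemset mining. The function name `apriori`, the `min_support` count threshold, and the symptom records are hypothetical and chosen only for illustration. The sketch rescans every transaction once per level, which is the behavior the table summarizes as low efficiency.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset seen in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One full pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets ...
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune any candidate that has an infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical symptom sets extracted from four records.
records = [
    {"fever", "cough"},
    {"fever", "cough", "fatigue"},
    {"fever", "fatigue"},
    {"cough"},
]
result = apriori(records, min_support=2)
```

Association rules would then be derived from `result` by comparing the support of an itemset with the support of its subsets; that step is omitted here to keep the sketch short.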
The information of ANN mining techniques.
| ANN mining techniques | Advantages | Shortcomings |
|---|---|---|
| Backpropagation | 1. Strong nonlinear mapping capability | 1. Local minimization |
| Radial basis function | 1. Fast learning speed | Complex structure |
| FNN | 1. Reduce feature engineering | Limited modeling capability |
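
The backpropagation row above can be illustrated with a small sketch: one batch gradient-descent step for a network with a single hidden layer of sigmoid units and a squared-error loss. The network size, the loss, the learning rate, the XOR toy data, and all function names are assumptions made for this example, not details from the reviewed studies. Because the update only follows the local gradient, repeated steps can settle in a local minimum, which is the shortcoming listed in the table.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_inputs, n_hidden, seed=0):
    """Random weights in [-1, 1]; layout is (w1, b1, w2, b2)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return w1, b1, w2, rng.uniform(-1, 1)


def forward(params, x):
    """Return the hidden activations and the scalar output for input x."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    out = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, out


def loss(params, data):
    """Sum of squared errors (halved) over all (x, y) pairs."""
    return sum(0.5 * (forward(params, x)[1] - y) ** 2 for x, y in data)


def backprop_step(params, data, lr=0.5):
    """Apply one batch gradient-descent update and return the new parameters."""
    w1, b1, w2, b2 = params
    g_w1 = [[0.0] * len(row) for row in w1]
    g_b1 = [0.0] * len(b1)
    g_w2 = [0.0] * len(w2)
    g_b2 = 0.0
    for x, y in data:
        hidden, out = forward(params, x)
        # Error signal at the output unit (derivative of loss w.r.t. its input).
        d_out = (out - y) * out * (1.0 - out)
        g_b2 += d_out
        for j, h in enumerate(hidden):
            g_w2[j] += d_out * h
            # Propagate the error back through hidden unit j.
            d_h = d_out * w2[j] * h * (1.0 - h)
            g_b1[j] += d_h
            for i, xi in enumerate(x):
                g_w1[j][i] += d_h * xi
    new_w1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(w1, g_w1)]
    new_b1 = [b - lr * g for b, g in zip(b1, g_b1)]
    new_w2 = [w - lr * g for w, g in zip(w2, g_w2)]
    return new_w1, new_b1, new_w2, b2 - lr * g_b2


# XOR as a toy nonlinear mapping (hypothetical data).
xor = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params = init_params(n_inputs=2, n_hidden=3, seed=0)
params = backprop_step(params, xor, lr=0.01)
```

In practice the step would be repeated many times and the loss monitored; whether training on this toy problem converges depends on the initial weights and the learning rate.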
Figure 2. ANN algorithm analysis process.
Figure 3. NB algorithm analysis process.
Figure 4. C4.5 algorithm application flow.
Figure 5. Application process of association rules.