Qingyu Chen, Jingcheng Du, Sun Kim, W John Wilbur, Zhiyong Lu.
Abstract
BACKGROUND: Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models remain limited in the biomedical and clinical domains. The BioCreative/OHNLP 2018 organizers made the first attempt to annotate 1068 sentence pairs from clinical notes and called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.
Keywords: Deep learning; Electronic medical records; Machine learning; Sentence similarity
Year: 2020 PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 An overview of our models. The Random Forest uses manually crafted features (word tokens, character n-grams, sequence similarity, semantic similarity and named entities). The feature selection of the Random Forest was done on the validation set. The Neural Network uses vectors generated by sentence embeddings as inputs. The validation set was used to monitor the early stopping process of the neural network. The ensembled (stacking) model incorporates both the Random Forest and Neural Network models. The validation set was used to train the ensembled model
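The stacking setup described in the caption above can be sketched with scikit-learn: two base models are fit on the training split, and a simple meta-model is trained on their validation-set predictions. This is a minimal illustrative sketch with synthetic data; the feature columns, model sizes, and meta-learner are assumptions, not the authors' actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 14))                    # 14 hand-crafted features (illustrative)
y = X[:, 0] * 4 + rng.normal(0, 0.1, 200)    # synthetic similarity scores

# Base models train on the training part; the meta-model on the validation part.
X_tr, y_tr = X[:120], y[:120]
X_val, y_val = X[120:], y[120:]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_tr, y_tr)

# Stacking: the base models' validation predictions become meta-features.
meta_X = np.column_stack([rf.predict(X_val), nn.predict(X_val)])
meta = LinearRegression().fit(meta_X, y_val)

def ensemble_predict(X_new):
    """Combine base-model predictions through the trained meta-model."""
    return meta.predict(np.column_stack([rf.predict(X_new), nn.predict(X_new)]))
```

The key point mirrored from Fig. 1 is that the meta-model never sees the base models' training data, only their held-out predictions, which limits overfitting.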
Fig. 2 Three primary deep learning models to capture sentence similarity. The first is the Convolutional Neural Network model (1.1 and 1.2), which applies image-style convolutional processing to text. The second is the LSTM, or Recurrent Neural Network (2 in the figure), which aims to learn semantics aligned with the input sequence. The third is the Encoder-Decoder network (3 in the figure), where the encoder aims to compress the semantics of a sentence into a vector and the decoder aims to regenerate the original sentence from that vector. FC layers: fully-connected layers. All three models use fully-connected layers as final stages
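All three architectures end in fully-connected layers that map a pair representation to a similarity score. A toy numpy sketch of that final stage, with random weights and hypothetical dimensions purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def fc_head(emb_a, emb_b, hidden=8):
    """Toy fully-connected head: concatenate two sentence embeddings,
    apply one ReLU layer, then a linear output for the similarity score.
    Weights are random here; in practice they are learned."""
    x = np.concatenate([emb_a, emb_b])
    W1 = rng.normal(size=(hidden, x.size))
    b1 = np.zeros(hidden)
    W2 = rng.normal(size=hidden)
    h = np.maximum(0, W1 @ x + b1)   # ReLU activation
    return float(W2 @ h)             # scalar similarity score

score = fc_head(rng.random(16), rng.random(16))
```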
Evaluation results on the official test set
| Model | # input models | Test set correlation |
|---|---|---|
| Random Forest (submission #1) | 1 | 0.8106 |
| Random Forest + Dense Network (submission #2) | 2 | 0.8246 |
| Ensemble model (submission #3) | 8 | |
| Random Forest + Encoder Network (submission #4) | 2 | 0.8258 |
| Random Forest | 1 | 0.8246 |
| Encoder Network | 1 | 0.8384 |
| Ensemble model | 3 | |
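The test-set scores in the table above are Pearson correlations between predicted and gold similarity scores. As a minimal illustration with made-up numbers (not the challenge data):

```python
import numpy as np

gold = np.array([0.0, 1.5, 2.0, 3.5, 4.0, 5.0])  # annotator similarity scores
pred = np.array([0.2, 1.0, 2.4, 3.1, 4.2, 4.8])  # hypothetical model outputs

# Pearson correlation: covariance normalized by both standard deviations.
r = np.corrcoef(gold, pred)[0, 1]
```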
Fig. 3 Performance of each individual hand-crafted feature on the dataset. The y-axis shows the Pearson correlation. The left panel shows the correlation over the entire set; the right panel shows the correlation over the training & validation set and the test set
Feature ablation study on the Random Forest model. Each set of features is removed in turn, and the change in performance is measured
| Model | # features | Validation set | Test set |
|---|---|---|---|
| Full model | 14 | 0.8832 | 0.8246 |
| - Token-based | 5 | 0.8689 (−1.5%) | 0.8129 (−1.2%) |
| - Character-based | 2 | 0.8655 (−1.8%) | 0.8154 (−0.9%) |
| - Sequence-based | 4 | 0.8697 (−1.4%) | 0.8034 (−2.1%) |
| - Semantic-based | 1 | 0.8704 (−1.3%) | 0.8235 (−0.1%) |
| - Entity-based | 2 | 0.8738 (−0.9%) | 0.8150 (−0.9%) |
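An ablation like the table above can be automated by retraining with each feature group dropped and recording the change in score. A hypothetical sketch on synthetic data; the group-to-column mapping and model size are placeholders, not the paper's setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.random((150, 14))
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.05, 150)

# Hypothetical mapping of feature groups to column indices.
groups = {"token": range(0, 5), "char": range(5, 7),
          "sequence": range(7, 11), "semantic": range(11, 12),
          "entity": range(12, 14)}

def score(cols):
    """Fit on the first 100 rows, Pearson-correlate predictions on the rest."""
    rf = RandomForestRegressor(n_estimators=30, random_state=0)
    rf.fit(X[:100][:, cols], y[:100])
    return np.corrcoef(rf.predict(X[100:][:, cols]), y[100:])[0, 1]

full = score(list(range(14)))
# Drop each group in turn; positive deltas mean the group helped.
ablation = {name: full - score([c for c in range(14) if c not in idx])
            for name, idx in groups.items()}
```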
Fig. 4 A visualization of important features ranked by a single tree of the Random Forest model. We randomly picked a tree and repeated this multiple times; the top-ranked features were consistent. A tree makes decisions from top to bottom: the more important the feature, the higher it ranks. In this case, Q-Gram is the most important feature. From left to right, different colors represent sentence pairs at different degrees of similarity; darker means more similar
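The ranking in Fig. 4 comes from the trees inside the forest; scikit-learn exposes the forest-wide impurity-based ranking directly via `feature_importances_`. A sketch on synthetic data, where the feature names are placeholders standing in for the paper's Q-Gram and related features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.random((200, 4))
# Make the first feature dominate, mimicking Q-Gram's role in the paper.
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.01, 200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
names = ["q_gram", "jaccard", "lcs", "wmd"]  # hypothetical feature names
ranked = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
```

Importances sum to 1 across features, so the ranking is directly comparable across retrained models.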
Fig. 5 The mean squared errors made by the Random Forest and Encoder models, with sentence pairs categorized into different similarity regions
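Fig. 5's per-region errors can be reproduced by bucketing pairs by their gold score and computing the mean squared error within each bucket. A numpy-only sketch with made-up scores (the region boundaries are an assumption):

```python
import numpy as np

gold = np.array([0.5, 1.2, 2.4, 3.1, 4.6, 4.9])  # gold similarity in [0, 5]
pred = np.array([0.7, 1.0, 2.0, 3.5, 4.4, 4.1])  # hypothetical predictions

# Bucket pairs into similarity regions [0,1), [1,2), ..., [4,5].
bins = np.digitize(gold, [1, 2, 3, 4])
mse_by_region = {int(r): float(np.mean((gold[bins == r] - pred[bins == r]) ** 2))
                 for r in np.unique(bins)}
```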