Yuhan Su1, Hongxin Xiang1, Haotian Xie2, Yong Yu1, Shiyan Dong3, Zhaogang Yang3, Na Zhao1.
Abstract
The identification of profiled cancer-related genes plays an essential role in cancer diagnosis and treatment. According to the literature, genetic mutations are still classified manually. Manual classification is pathologist-dependent, subjective, and time-consuming. To improve the accuracy of clinical interpretation, and with the advent of next-generation sequencing technologies, researchers have proposed computational approaches for the automatic analysis of mutations. Nevertheless, challenges such as multiple classes, complex texts, redundant descriptions, and inconsistent interpretation have limited the development of algorithms. To overcome these difficulties, we adapted a deep learning method, Bidirectional Encoder Representations from Transformers (BERT), to classify genetic mutations based on text evidence from an annotated database. During training, three challenging features of the data were addressed: the extreme length of the texts, biased data presentation, and high repeatability. The BERT+abstract method achieved satisfactory results, with a logarithmic loss of 0.80, a recall of 0.6837, and an F-measure of 0.705. These results indicate that BERT can feasibly classify genomic mutation text within literature-based datasets. Consequently, BERT is a practical tool for facilitating and significantly speeding up cancer research on tumor progression, diagnosis, and the design of more precise and effective treatments.
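To make the three reported metrics concrete, the following is a minimal sketch of how multi-class logarithmic loss, recall, and F-measure can be computed for labels numbered 1 to 9. It is not the authors' evaluation code. The function names `log_loss` and `macro_recall_f1` are hypothetical, and macro averaging over classes is an assumption, since the abstract does not state which averaging scheme was used.

```python
import math

NUM_CLASSES = 9  # the record lists nine annotated classes


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: class labels in 1..NUM_CLASSES
    probs:  one row of class probabilities per sample
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is overconfident and wrong.
        p = min(max(row[label - 1], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


def macro_recall_f1(y_true, y_pred, num_classes=NUM_CLASSES):
    """Macro-averaged recall and F1 (assumed averaging scheme)."""
    recalls, f1s = [], []
    for c in range(1, num_classes + 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        if tp + fn == 0:
            continue  # class never occurs in y_true, so recall is undefined
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(recalls), sum(f1s) / len(f1s)
```

As a reference point, a classifier that assigns a uniform probability of 1/9 to every class has a logarithmic loss of ln 9 ≈ 2.197, so lower values indicate that the model concentrates probability on the correct class.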
Year: 2020 PMID: 33083472 PMCID: PMC7563092 DOI: 10.1155/2020/5491963
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The cut-off document views of the datasets.
Figure 2. Distribution of the text entry lengths.
Figure 3. Distribution of the text entry lengths among different classes.
Figure 4. Distribution of the number of genes among 9 classes.
Class information corresponds to the annotated number.
| Annotated number | Class information |
|---|---|
| 1 | Likely loss of function |
| 2 | Likely gain of function |
| 3 | Neutral |
| 4 | Loss of function |
| 5 | Likely neutral |
| 6 | Inconclusive |
| 7 | Gain of function |
| 8 | Likely switch of function |
| 9 | Switch of function |
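The mapping in the table above can be expressed as a small lookup structure, which is convenient when converting between annotated numbers and the zero-based indices that most classifiers expect. This is only an illustrative sketch: the names `CLASS_INFO`, `to_index`, and `from_index` are hypothetical and do not come from the article.

```python
# Annotated number -> class information, copied from the table.
CLASS_INFO = {
    1: "Likely loss of function",
    2: "Likely gain of function",
    3: "Neutral",
    4: "Loss of function",
    5: "Likely neutral",
    6: "Inconclusive",
    7: "Gain of function",
    8: "Likely switch of function",
    9: "Switch of function",
}


def to_index(annotated_number: int) -> int:
    """Convert a 1-based annotated number to a 0-based class index."""
    if annotated_number not in CLASS_INFO:
        raise ValueError(f"unknown annotated number: {annotated_number}")
    return annotated_number - 1


def from_index(index: int) -> str:
    """Return the class information for a 0-based class index."""
    return CLASS_INFO[index + 1]
```

For example, `from_index(to_index(7))` returns `"Gain of function"`.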
List of top 20 genes in the datasets.
| Rank | Gene name | Rank | Gene name |
|---|---|---|---|
| 1 | EGFR | 11 | FLT3 |
| 2 | TP53 | 12 | MTOR |
| 3 | CDKN2A | 13 | MAP2K1 |
| 4 | ERBB2 | 14 | PTEN |
| 5 | PDGFRA | 15 | BRCA1 |
| 6 | TSC2 | 16 | BRAF |
| 7 | PIK3CA | 17 | BRCA2 |
| 8 | FGFR2 | 18 | KIT |
| 9 | ALK | 19 | KRAS |
| 10 | VHL | 20 | RET |
Figure 5. Distribution of genes among classes.
Figure 6. Confusion matrix analysis of the similarity of the texts in different classes.
Figure 7. Scheme of the training.
Figure 8. Evaluation of four methods.
Figure 9. ROC curves of the proposed methods.
Figure 10. Confusion matrix tables of the four proposed methods.