Ronghui You1, Yuxuan Liu1, Hiroshi Mamitsuka2,3, Shanfeng Zhu1,4,5,6. 1. School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China. 2. Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Japan. 3. Department of Computer Science, Aalto University, Espoo, Finland. 4. Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Shanghai 200433, China. 5. Ministry of Education, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China. 6. Bio-Med Big Data Center, Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Science, Chinese Academy of Sciences, Shanghai 200031, China.
Abstract
MOTIVATION: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture some pre-defined sections only in full text and (iii) ignores the whole MEDLINE database. RESULTS: We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which makes BERTMeSH capture deep semantics of full text. (ii) A transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantages of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH with the difference being statistically significant. Also prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture some pre-defined sections only in full text and (iii) ignores the whole MEDLINE database. RESULTS: We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which makes BERTMeSH capture deep semantics of full text. (ii) A transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantages of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH with the difference being statistically significant. Also prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Authors: Fernanda S Tonin; Vanessa Gmünder; Aline F Bonetti; Antonio M Mendes; Fernando Fernandez-Llimos Journal: Explor Res Clin Soc Pharm Date: 2022-08-20