Simon Baker (1), Ilona Silins (2), Yufan Guo (3), Imran Ali (2), Johan Högberg (2), Ulla Stenius (2), Anna Korhonen (3). 1. Computer Laboratory, University of Cambridge, Cambridge, UK. 2. Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden. 3. Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK.
Abstract
MOTIVATION: The hallmarks of cancer have become highly influential in cancer research. They reduce the complexity of cancer into 10 principles (e.g. resisting cell death and sustaining proliferative signaling) that explain the biological capabilities acquired during the development of human tumors. Since new research depends crucially on existing knowledge, technology for semantic classification of scientific literature according to the hallmarks of cancer could greatly support literature review, knowledge discovery and applications in cancer research.
RESULTS: We present the first step toward the development of such technology. We introduce a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer. We use this corpus to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining. We report good performance in both intrinsic and extrinsic evaluations, demonstrating both the accuracy of the methodology and its potential in supporting practical cancer research. We discuss how this approach could be developed and applied further in the future.
AVAILABILITY AND IMPLEMENTATION: The corpus of hallmark-annotated PubMed abstracts and the software for classification are available at: http://www.cl.cam.ac.uk/~sb895/HoC.html.
CONTACT: simon.baker@cl.cam.ac.uk.
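The classification task described in the abstract is multi-label: a single abstract may provide evidence for several hallmarks, or for none. The sketch below illustrates one common way to set up such a task, training one independent yes/no classifier per hallmark over bag-of-words counts. It is only an illustration under stated assumptions, not the system reported here: the naive Bayes learner, the plain word features, the function names and the four toy sentences are all hypothetical stand-ins (the toy sentences are invented and are not taken from the annotated corpus), whereas the reported system uses rich features based on biomedical text mining.

```python
import math
import re
from collections import Counter

# Two of the 10 hallmarks named in the abstract; the rest are omitted for brevity.
HALLMARKS = ["resisting cell death", "sustaining proliferative signaling"]

# Invented toy sentences for illustration only (NOT from the annotated corpus).
TOY_EXAMPLES = [
    ("treatment induced apoptosis in tumor cells", {"resisting cell death"}),
    ("growth factor signaling increased cell proliferation",
     {"sustaining proliferative signaling"}),
    ("apoptosis was suppressed while proliferation signaling increased",
     {"resisting cell death", "sustaining proliferative signaling"}),
    ("patients were recruited from two hospitals", set()),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class BinaryNaiveBayes:
    """Multinomial naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, flags):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for tokens, flag in zip(docs, flags):
            self.counts[flag].update(tokens)
            self.n_docs[flag] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, tokens, flag):
        total_docs = self.n_docs[True] + self.n_docs[False]
        # Smoothed class prior, then smoothed per-token likelihoods.
        score = math.log((self.n_docs[flag] + 1) / (total_docs + 2))
        denom = sum(self.counts[flag].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # tokens unseen in training are ignored
                score += math.log((self.counts[flag][tok] + 1) / denom)
        return score

    def predict(self, tokens):
        return self._log_score(tokens, True) > self._log_score(tokens, False)


def train(examples, labels):
    """Fit one independent binary classifier per label."""
    docs = [tokenize(text) for text, _ in examples]
    return {
        label: BinaryNaiveBayes().fit(docs, [label in tags for _, tags in examples])
        for label in labels
    }


def classify(models, text):
    """Return the set of labels whose classifier answers yes for this text."""
    tokens = tokenize(text)
    return {label for label, model in models.items() if model.predict(tokens)}


models = train(TOY_EXAMPLES, HALLMARKS)
print(classify(models, "apoptosis in tumor cells"))
```

Treating each hallmark as a separate binary decision keeps the sketch simple and lets an abstract receive any number of labels, at the cost of ignoring correlations between hallmarks.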