Vatsala Nundloll, Robert Smail, Carly Stevens, Gordon Blair.
Abstract
Data heterogeneity is a pressing issue, and it is compounded when the data come from textual documents. Because such documents are unstructured, collating, comparing and analysing the information they contain can be a challenging task. Automating these processes can help to uncover knowledge that otherwise remains buried in them. Moreover, integrating the information extracted from the documents with other related information enables more information-rich queries. In this context, the paper presents a comprehensive review of text extraction and data integration techniques that enable this automation in an ecological context. The paper investigates the extraction of valuable floristic information from a historical botany journal, with the aim of bringing to light relevant pieces of information contained within the document. The paper also explores the need to integrate the extracted information with related information from disparate sources. All the information is then rendered into a queryable form so that unified queries can be made. To achieve this, the paper uses a combination of Machine Learning, Natural Language Processing and Semantic Web techniques. The proposed approach is demonstrated through the information extracted from the journal and the information-rich queries made possible by the integration process. The paper shows that the approach has merit in extracting relevant information from the journal, discusses how the machine learning models were designed to classify complex information, and gives a measure of their performance. The paper also shows that the approach has merit in terms of query time when querying floristic information from a multi-source linked data model.
Keywords: Data extraction; Machine learning; Natural language processing; Ontologies; Semantic integration; Unstructured data
Year: 2022 PMID: 36262290 PMCID: PMC9573881 DOI: 10.1016/j.heliyon.2022.e10710
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. Types of information extracted from the Journal of Botany (figure (a) - courtesy of the Botany journal (1885)).
Figure 2. Methodology proposed to extract information from historical text.
Linguistic features of spaCy.
| Features | Description |
|---|---|
| Tokenization | splits a text into meaningful segments, called tokens |
| Part-of-Speech | parses and tags each token with its grammatical category (e.g. proper noun, adjective) and predicts which tag or label best applies in context |
| Named Entity Recognition | assigns labels to contiguous spans of tokens |
| Entity Linking | assigns a unique identifier to each entity to perform entity linking |
| Merging & Splitting | enables merging and splitting of tokens |
| Dependency Parse | parses a sentence into its components (nouns, verbs, etc.) and allows these components to be navigated as a tree |
| Sentence Segmentation | splits sentences |
NER features used in Prodigy (v1.6.1).
| Features | Description | Function name |
|---|---|---|
| Fully manual | Manually annotate raw text | ner.manual |
| Add suggestions from patterns and update existing model | Use pattern files to annotate part of the text, manually add further annotations, and add a new entity to the existing model | ner.teach |
| Train the model | Train the model using the training data | ner.batch-train |
| Error-check the model | Check and manually correct the suggestions made by the model | ner.make-gold |
Figure 3. Creation of plant species pattern file.
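As a minimal sketch of what creating such a pattern file could look like, the snippet below assumes the file follows Prodigy's JSONL pattern format (one JSON object per line, with a `label` and a token-level `pattern`). The species names, the `make_patterns` and `to_jsonl` helpers, and the use of `plant` as the label are illustrative assumptions, not the authors' actual inputs or code.

```python
import json


def make_patterns(species_names, label="plant"):
    """Build one case-insensitive token pattern per species name."""
    patterns = []
    for name in species_names:
        # One {"lower": ...} dict per whitespace-separated token.
        tokens = [{"lower": tok.lower()} for tok in name.split()]
        if tokens:  # skip empty or whitespace-only names
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialise patterns as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(p) for p in patterns)


# Illustrative species names (assumption: not taken from the journal).
species = ["Drosera rotundifolia", "Primula farinosa"]
patterns = make_patterns(species)
print(to_jsonl(patterns))
```

Writing the returned string to a `.jsonl` file would give a pattern file that a pattern-based annotation recipe such as `ner.teach` could consume.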
Figure 6. Annotated extracts from Journal of Botany (figure (a) - courtesy of the Botany journal (1885)).
Figure 7. Annotating extracts from Journal of Botany using Prodigy.
Figure 4. MongoDB queries showing extracted entities together with their contextual information.
Figure 5. Vocabularies for the linked data model.
Figure 9. Querying the linked data model on GraphDB.
Training and Test data sets.
| Data used in NER model | Size |
|---|---|
| Training Data | 30000 words |
| Test Data 1 | 300 words |
| Test Data 2 | 400 words |
| Test Data 3 | 500 words |
Figure 8. Evaluation of the NER model trained in Prodigy (v1.6.1).
Metrics for accuracy of the NER model.
| Entity | Annotations | Accepted | Rejected | Ignored |
|---|---|---|---|---|
| plant | 2645 | 489 | 54 | 2102 |
| observer | 3821 | 645 | 3176 | 0 |
| abundance | 1507 | 254 | 1253 | 0 |
| spatialRelations | 1247 | 312 | 935 | 0 |
| topographicTerms | 1247 | 312 | 935 | 0 |
| location | 3953 | 1179 | 2774 | 0 |
| location_terms (spatialRelations+topographicTerms+location) | 2501 | 390 | 67 | 2044 |
| lakedistrict | 13970 | 2820 | 5946 | 5904 |
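One way to read the counts above is as the share of accepted suggestions among those that received an accept or reject decision. The sketch below computes that ratio per entity. It is an assumption about how the counts might be summarised, not the accuracy measure reported by the authors or by Prodigy, and `accept_rate` is a hypothetical helper.

```python
# (entity, annotations, accepted, rejected, ignored), copied from the table.
ROWS = [
    ("plant", 2645, 489, 54, 2102),
    ("observer", 3821, 645, 3176, 0),
    ("abundance", 1507, 254, 1253, 0),
    ("spatialRelations", 1247, 312, 935, 0),
    ("topographicTerms", 1247, 312, 935, 0),
    ("location", 3953, 1179, 2774, 0),
    ("location_terms", 2501, 390, 67, 2044),
    ("lakedistrict", 13970, 2820, 5946, 5904),
]


def accept_rate(accepted, rejected):
    """Accepted share of decided (accepted + rejected) annotations.

    Ignored annotations are left out of the denominator (assumption).
    """
    decided = accepted + rejected
    return accepted / decided if decided else 0.0


rates = {name: accept_rate(acc, rej) for name, _, acc, rej, _ in ROWS}

for name, rate in rates.items():
    print(f"{name:18s} {rate:.3f}")
```

Leaving ignored annotations out of the denominator is a design choice. Including them would lower the ratio sharply for `plant` and `location_terms`, where most annotations were ignored.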
Time taken to query from GraphDB (in seconds).
| Query | Time Taken (s) |
|---|---|
| Query a sentence from the journal extract | 0.1000 |
| Query from the geocoordinates database | 0.1000 |
| Query from plant taxonomy dataset | 0.1000 |
| Unified query from journal extract and synonyms dataset | 0.3000 |
| Unified query from journal extract and geocoordinates dataset | 0.2000 |
| Unified query from journal extract and taxonomy dataset | 0.2000 |