Tim Beck, Tom Shorter, Yan Hu, Zhuoyu Li, Shujian Sun, Casiana M Popovici, Nicholas A R McQuibban, Filip Makraduli, Cheng S Yeung, Thomas Rowlands, Joram M Posma.
Abstract
To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a simple, community-driven data structure for sharing text and annotations; however, there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates each abbreviation to its full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.
Keywords: biomedical literature; health data; natural language processing; semantics; text mining
Year: 2022 PMID: 35243479 PMCID: PMC8885717 DOI: 10.3389/fdgth.2022.788124
Source DB: PubMed Journal: Front Digit Health ISSN: 2673-253X
Publishers and journals included in the publisher dataset.
| Publisher | Journal | Full-text files | Overlap with OA dataset | Table type | Linked table files |
|---|---|---|---|---|---|
| American Heart Association | Circulation Cardiovascular Genetics | 52 | 39 | Inline | – |
| American Physical Society | Physical Review Letters | 6 | – | Inline | – |
| American Psychological Association | Psychological Bulletin | 3 | – | Inline | – |
| American Society of Hematology | Blood | 31 | 25 | Inline | – |
| American Thoracic Society | American Journal of Respiratory and Critical Care Medicine | 20 | 18 | Inline | – |
| BioMed Central | BMC Medical Genetics | 43 | 43 | Linked HTML | 160 |
| Cell Press | American Journal of Human Genetics | 5 | 5 | Inline | – |
| Elsevier | Biological Psychiatry | 5 | 5 | Inline | – |
| | Gastroenterology | 5 | 2 | Inline | – |
| Frontiers | Frontiers in Genetics | 20 | 20 | Linked images | n/a |
| | Frontiers in Physics | 3 | – | Inline | – |
| | Frontiers in Psychology | 4 | – | Inline | – |
| Massachusetts Medical Society | The New England Journal of Medicine | 20 | 12 | Linked images | n/a |
| Mosby | The Journal of Allergy and Clinical Immunology | 5 | 3 | Inline | – |
| Nature Portfolio | European Journal of Human Genetics | 50 | 50 | Linked HTML | 123 |
| | Journal of Human Genetics | 37 | 3 | Linked HTML | 90 |
| | Molecular Psychiatry | 103 | 78 | Linked HTML | 262 |
| | Nature Physics | 3 | – | – | – |
| | Scientific Reports | 80 | 80 | Linked HTML | 190 |
| | The Pharmacogenomics Journal | 37 | 16 | Linked HTML | 116 |
| | Translational Psychiatry | 41 | 41 | Linked HTML | 87 |
| Oxford University Press | Human Molecular Genetics | 254 | 186 | Inline | – |
| PLOS | PLOS One | 20 | 20 | Linked images | n/a |
| SAGE Publications | Psychological Science | 3 | – | Inline | – |
| Springer | Human Genetics | 5 | 2 | Linked HTML | 13 |
| Wiley-Blackwell | American Journal of Medical Genetics | 5 | 0 | Inline | – |
| Total | | 860 | 648 | | 1,041 |
The full-text files were downloaded in HTML format and the linked table files were downloaded when available in HTML formats. The full-text files that overlap with the OA dataset were used to assess the consistency of outputs generated from different sources.
These publications are not part of the publisher dataset for evaluating tables, but are used for evaluating the accuracy of IAO header mapping.
Figure 1. An extract of the Auto-CORPus BioC JSON created from the PMC3606015 full-text HTML file. Each section is annotated with IAO terms. The “autocorpus_fulltext.key” file describes the contents of the full-text JSON file (https://github.com/omicsNLP/Auto-CORPus/blob/main/keyFiles/autocorpus_fulltext.key).
Figure 2. An extract from the Auto-CORPus abbreviations JSON created from the PMC4068805 full-text HTML file. For each abbreviation, the corresponding long-form definition is given along with the algorithm(s) used to detect the abbreviation. Most of the abbreviations shown were independently identified in both the full text and the abbreviations section of the publication. A variation in the definition of “RP” was detected: the abbreviations section defines it as “reverse phase,” whereas the full text defines it as “reversed phase.” The “autocorpus_abbreviations.key” file describes the contents of the abbreviations JSON file (https://github.com/omicsNLP/Auto-CORPus/blob/main/keyFiles/autocorpus_abbreviations.key).
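The kind of variation described for “RP” can be surfaced by grouping definitions per abbreviation across detection sources. The following is a minimal sketch only: the function names, the source labels and the dictionary layout are assumptions for illustration and do not reproduce the Auto-CORPus abbreviations JSON schema, and the “MS” entry is a hypothetical example rather than data from PMC4068805.

```python
from collections import defaultdict


def merge_abbreviations(sources):
    """Merge {source: {abbreviation: definition}} into
    {abbreviation: {definition: [sources that reported it]}}."""
    merged = defaultdict(lambda: defaultdict(list))
    for source, pairs in sources.items():
        for abbreviation, definition in pairs.items():
            merged[abbreviation][definition].append(source)
    # Convert nested defaultdicts to plain dicts for easy serialization.
    return {abbr: dict(defs) for abbr, defs in merged.items()}


def conflicting(merged):
    """Return abbreviations that have more than one distinct definition."""
    return sorted(abbr for abbr, defs in merged.items() if len(defs) > 1)


# Hypothetical input: "RP" mirrors the caption, "MS" is illustrative only.
sources = {
    "fulltext": {"RP": "reversed phase", "MS": "mass spectrometry"},
    "abbreviations_section": {"RP": "reverse phase", "MS": "mass spectrometry"},
}
merged = merge_abbreviations(sources)
variants = conflicting(merged)
```

Here `variants` lists only “RP”, because “MS” has a single definition reported by both sources.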
Figure 3. Flow diagram demonstrating the process of classifying publication sections with IAO terms. The unfiltered digraph is visualized in Supplementary Figure 1, and the process of combining DPGs and mapping unmapped nodes using anchor points in Supplementary Figure 2. DPG, directed path graph; G(V,E), graph(vertex, edge); IAO, information artifact ontology.
New synonyms identified for existing IAO terms from the fuzzy and digraph mappings of 2,441 publications.
| IAO term (identifier) | Existing synonyms | New synonyms |
|---|---|---|
| abbreviations (IAO:0000606) | abbreviations, abbreviations list, abbreviations used, list of abbreviations, list of abbreviations used | |
| abstract (IAO:0000315) | abstract | |
| acknowledgments (IAO:0000324) | acknowledgments, acknowledgments | |
| author contributions (IAO:0000323) | author contributions, contributions by the authors | |
| author information (IAO:0000607) | author information, authors' information | |
| availability (IAO:0000611) | availability, availability and requirements | |
| conclusion (IAO:0000615) | concluding remarks, conclusion, conclusions, findings, summary | conclusion and perspectives, summary and conclusion |
| conflict of interest (IAO:0000616) | competing interests, conflict of interest, conflict of interest statement, declaration of competing interests, disclosure of potential conflicts of interest | |
| consent (IAO:0000618) | consent | Informed consent |
| discussion (IAO:0000319) | discussion, discussion section | |
| ethical approval (IAO:0000620) | ethical approval | ethics approval and consent to participate, |
| footnote (IAO:0000325) | endnote, footnote | |
| funding source declaration (IAO:0000623) | funding, funding information, funding sources, funding statement, funding/support, source of funding, sources of funding | |
| future directions (IAO:0000625) | future challenges, future considerations, future developments, future directions, future outlook, future perspectives, future plans, future prospects, future research, future research directions, future studies, future work | |
| introduction (IAO:0000316) | background, introduction | introductory paragraph |
| materials (IAO:0000633) | materials | data, data description |
| methods (IAO:0000317) | experimental, experimental procedures, experimental section, materials and methods, methods | analytical methods, concise methods, |
| references (IAO:0000320) | bibliography, literature cited, references | |
| statistical analysis (IAO:0000644) | statistical analysis | statistical methods, statistical methods and analysis, statistics |
| study limitations (IAO:0000631) | limitations, study limitations | strengths and limitations, study strengths and limitations |
| supplementary material (IAO:0000326) | additional information, appendix, supplemental information, supplementary material, supporting information | |
IAO v2020-06-10.
Elements in italics have previously been submitted by us for inclusion into IAO and added in the v2020-12-09 IAO release.
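Synonym lists such as those in the table above can drive a fuzzy match from a raw section header to an IAO term. The sketch below is an illustration under stated assumptions: it uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure, the 0.8 threshold is arbitrary, and `map_header` is a hypothetical helper; none of this is confirmed to be the matching method or cut-off used by Auto-CORPus. Only three terms are included to keep the example short.

```python
from difflib import SequenceMatcher

# A small subset of the synonym table, keyed by IAO term and identifier.
IAO_SYNONYMS = {
    "introduction (IAO:0000316)": ["background", "introduction"],
    "methods (IAO:0000317)": [
        "experimental",
        "experimental procedures",
        "experimental section",
        "materials and methods",
        "methods",
    ],
    "discussion (IAO:0000319)": ["discussion", "discussion section"],
}


def map_header(header, synonyms=IAO_SYNONYMS, threshold=0.8):
    """Return the IAO term whose synonym is most similar to the header,
    or None when no synonym reaches the (assumed) similarity threshold."""
    text = header.strip().lower()
    best_term, best_score = None, 0.0
    for term, names in synonyms.items():
        for name in names:
            score = SequenceMatcher(None, text, name).ratio()
            if score > best_score:
                best_term, best_score = term, score
    return best_term if best_score >= threshold else None


exact = map_header("Materials and Methods")
typo = map_header("Discusion")
unmapped = map_header("Patient recruitment")
```

Headers that return `None` here correspond to the unmapped case, which the paper handles separately through the digraph of section order rather than by string similarity.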
Figure 4. Unmapped nodes in the digraph (Figure 3) connected to “abstract” as ego node, excluding corpus-specific nodes, grouped into different categories. Unlabeled nodes are titles of paragraphs in the main text.
(A) Proposed new IAO terms to define publication sections that were derived from analyzing the sections of 2,441 publications. (B) Proposed new IAO terms to define parts of a table section. Elements in italics have previously been submitted by us for inclusion into IAO and added in the v2020-12-09 IAO release.
| Term | Definition | Synonyms |
|---|---|---|
| (A) | | |
| Disclosure | “A part of a document used to disclose any associations by authors that might be perceived as to potentially interfere with or prevent them from reporting research with complete objectivity.” | Author disclosure statement, declarations, disclosure, disclosure statement, disclosures |
| | “ | Central illustration, |
| Highlights | “A short collection of key messages that describe the core findings and essence of the article in concise form. It is distinct and separate from the abstract and only conveys the results and concept of a study. It is devoid of jargon, acronyms and abbreviations and targeted at a broader, non-technical audience.” | Author summary, editors' summary, highlights, key points, overview, research in context, significance, TOC |
| Participants | “A section describing the recruitment of subjects into a research study. This section is distinct from the ‘patients’ section and mostly focusses on healthy volunteers.” | Participants, sample |
| (B) | | |
| Table title | “A textual entity that names a table.” | |
| Table caption | “A textual entity that describes a table.” | |
| Table footer | “A part of a table that provides additional information about a specific other part of the table. Footers are spatially segregated from the rest of the table and are usually indicated by a superscripted number or letter, or a special typographic character such as †.” | Table key, table note, table notes |
Figure 5. Final digraph model used in Auto-CORPus to classify paragraphs after fuzzy matching to IAO terms (v2020-06-10). This model includes new (proposed) section terms, and each section contains new synonyms identified in this analysis. “Associated Data” is included because it is a PMC-specific header found before abstracts and can be used to indicate the start of most articles. All IAO terms are indicated in orange.
Figure 6. Extracts of the Auto-CORPus table JSON file generated to store metadata and content for an example table. (A) The parts of a table stored in table JSON. The section titles are underlined. The table shown is the PMC version (PMC4245044) of Table 1 from (15). (B) The title and caption table metadata stored in table JSON. (C) Each column heading in the table content is split between two rows, so the strings from both cells are concatenated with a pipe symbol in the table JSON. A header that spans multiple columns of sub-headers is replicated in each header cell and joined to the sub-header with the pipe symbol, as shown here. (D) The table content for the first row from the first section is shown in table JSON. Superscript characters are identified using HTML markup. (E) The footer table metadata stored in table JSON. The “autocorpus_tables.key” file describes the contents of the tables JSON file (https://github.com/omicsNLP/Auto-CORPus/blob/main/keyFiles/autocorpus_tables.key).
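The header handling described in panel (C) can be sketched as follows. This is a minimal illustration, assuming each header cell has already been parsed into a `(text, colspan)` pair; the helper names and the example headers are hypothetical and are not taken from the Auto-CORPus code or from the table in PMC4245044.

```python
def expand_row(cells):
    """Expand (text, colspan) pairs into one string per physical column,
    replicating a spanning header across every column it covers."""
    expanded = []
    for text, colspan in cells:
        expanded.extend([text] * colspan)
    return expanded


def flatten_headers(header_rows, sep="|"):
    """Join the header strings of each column, top row first, with a pipe.
    Empty cells are skipped so single-row headers gain no stray separator."""
    columns = zip(*(expand_row(row) for row in header_rows))
    return [sep.join(part for part in column if part) for column in columns]


# Hypothetical two-row header: two spanning headers over four sub-headers.
header_rows = [
    [("SNP", 1), ("Discovery", 2), ("Replication", 2)],
    [("", 1), ("OR", 1), ("P", 1), ("OR", 1), ("P", 1)],
]
flat = flatten_headers(header_rows)
# flat == ["SNP", "Discovery|OR", "Discovery|P", "Replication|OR", "Replication|P"]
```

Replicating the spanning header keeps every column self-describing, so a downstream consumer can read one cell without reconstructing the original layout.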
Differences between the Auto-CORPus BioC and PMC BioC JSON outputs.
| | Auto-CORPus BioC JSON | PMC BioC JSON |
|---|---|---|
| Section titles | Section titles, subtitles, subsubtitles (and so on) are linked to the passage text they apply to | Section titles, subtitles, subsubtitles (and so on) precede the passage text they apply to |
| Section types | Section types are annotated using IAO terms | Section types are described using custom labels |
| Offset counts | Offset increased by 1 for every character (including whitespace) in a passage | Offset increased by the number of bytes in the text of a passage plus one space |
| Table and figure sections | Structured table data are stored in table JSON. Figure captions are included in the BioC JSON in the sequential order in which they occur within paragraphs. | Table data and figure captions occur at the end of the JSON document. Table content is given as XML. |
| Abbreviations section | Abbreviations section stored in abbreviations JSON. Abbreviation and definition components are related. Incomplete/one-sided definitions are not stored. | Abbreviations and definitions from the abbreviations section are stored separately as text with no relations between the two components. Incomplete/one-sided definitions are stored. |
| Link anchor text | Link anchor text retained (HTML element tags removed). | Link anchor text removed. |
| Character encoding | UTF-8 used for outputs | Available in Unicode and ASCII |
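The two offset conventions in the table diverge as soon as a passage contains a multi-byte character. The sketch below is one reading of the table row, not code from either tool: the first function advances the offset by the number of characters, the second by the number of UTF-8 bytes plus one for a separating space. Both function names and the example passages are assumptions for illustration.

```python
def char_offsets(passages):
    """Start offset of each passage when counting one per character."""
    offsets, position = [], 0
    for text in passages:
        offsets.append(position)
        position += len(text)
    return offsets


def byte_offsets(passages):
    """Start offset of each passage when counting UTF-8 bytes plus one space."""
    offsets, position = [], 0
    for text in passages:
        offsets.append(position)
        position += len(text.encode("utf-8")) + 1
    return offsets


# "β" is one character but two bytes in UTF-8.
passages = ["β-cell function", "Methods"]
by_char = char_offsets(passages)   # [0, 15]
by_byte = byte_offsets(passages)   # [0, 17]
```

Annotations that carry offsets are therefore not interchangeable between the two outputs without recomputing their positions.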