Maaly Nassar, Alexander B Rogers, Francesco Talo', Santiago Sanchez, Zunaira Shafique, Robert D Finn, Johanna McEntyre.
Abstract
Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally/taxonomically), either from a longitudinal study or cross-sectional studies, can provide clues as to how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and the molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. Nevertheless, these missing metadata may be found in the narrative of publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in the European Nucleotide Archive (ENA) and MGnify. Using this framework, a new metagenomics annotations pipeline was developed and integrated into Europe PMC to regularly enrich up-to-date ENA and MGnify metagenomics studies with metadata extracted from research articles. These metadata are now available for researchers to explore and retrieve on the MGnify and Europe PMC websites, as well as through the Europe PMC annotations API.
Year: 2022 PMID: 35950838 PMCID: PMC9366992 DOI: 10.1093/gigascience/giac077
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 7.658
Metagenomics: entity types
| Entity | Definition |
|---|---|
| Ecoregion | Microbiome natural environment |
| Host | Microbiome living organism or host |
| Engineered | Microbiome human-made environment |
| Date | Microbiome sample collection date |
| Place | The place of microbiome environment or host |
| Site | The site of microbiome sample within place |
| Body-site | The organ or tissue of microbiome sample |
| Sample-material | The material of the microbiome sample (e.g., water, mucus, soil) |
| State | The state of the microbiome environment or host (e.g., disease) |
| Treatment | Any treatment performed on the host (e.g., medicine) or the environment (e.g., fertilizer) from which the sample was collected |
| Kit | DNA extraction kit |
| Primer | PCR primers |
| Gene | Microbiome target genes (e.g., ribosomal RNA subunit and amplified region(s)) |
| LS | Library source or library strategy (e.g., amplicon, whole genome) |
| LCM | Library construction method or layout (e.g., paired end, single end) |
| Sequencing | Sequencing platform |
Figure 1: Receiver operating characteristic (ROC) curves of biome classifiers, using TF-IDF (A) or Doc2VecMGnify (B) as training features.
Figure 4: A clustered stacked bar plot showing the overlap of metadata terms between ENA (author-submitted terms that are part of the study) and those derived from Europe PMC (EPMC) articles using our framework. We evaluated 20 of the most populated MIxS fields in 19,209 studies to establish whether the metadata provided by the two sources were identical (green), nonidentical (yellow), or provided by only one of the sources (unique, red). For each MIxS field, the stacked bars show the number of studies having identical, nonidentical, and unique metadata from each source.
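The three-way comparison described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function name `compare_field`, the case-insensitive exact-match rule, and the toy records are all assumptions made for the example.

```python
from collections import Counter


def compare_field(ena_value, epmc_values):
    """Classify one MIxS field of one study by comparing the two sources.

    Assumed matching rule (not from the paper): terms are identical when
    the ENA term equals any EPMC-derived term, ignoring case and
    surrounding whitespace.
    """
    ena = (ena_value or "").strip().lower()
    epmc = {v.strip().lower() for v in epmc_values if v and v.strip()}
    if ena and epmc:
        return "identical" if ena in epmc else "nonidentical"
    if ena:
        return "unique_ena"    # only the submitter provided a term
    if epmc:
        return "unique_epmc"   # only the literature provided a term
    return "missing"           # neither source has a term


# Hypothetical toy records for a single MIxS field, one dict per study.
toy_studies = [
    {"ena": "soil", "epmc": ["Soil"]},
    {"ena": "soil", "epmc": ["sediment"]},
    {"ena": None, "epmc": ["water"]},
    {"ena": "mucus", "epmc": []},
]

counts = Counter(compare_field(s["ena"], s["epmc"]) for s in toy_studies)
print(dict(counts))
```

Tallying these labels per MIxS field would give the bar heights of a plot like Figure 4; a real comparison would likely need a more tolerant matching rule (synonyms, ontology terms) than the exact match assumed here.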
Precision, recall, and F1-scores of the best-performing random forest biome classifiers
| Classifier features | Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|---|
| TF-IDF | Engineered | 0.86 | 0.9 | 0.88* | 20 |
| TF-IDF | Environmental | 0.86 | 0.9 | 0.88* | 20 |
| TF-IDF | Host-associated | 0.94 | 0.85 | 0.89* | 20 |
| Doc2VecMGnify | Engineered | 0.77 | 0.85 | 0.81 | 20 |
| Doc2VecMGnify | Environmental | 0.89 | 0.8 | 0.84 | 20 |
| Doc2VecMGnify | Host-associated | 0.85 | 0.85 | 0.85 | 20 |
Classifier features were either TF-IDF or Doc2VecMGnify (embeddings generated from MGnify cross-referenced publications). Support = number of publications in test data sets per class. *TF-IDF biome classifiers outperformed the ones trained on Doc2VecMGnify. n_estimators = 300 (TF-IDF) and 250 (Doc2VecMGnify). max_depth = 25. Random state = 9.
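As a quick consistency check on the table above, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall values; the helper name `f1_score` and the row layout are illustrative choices, not part of the original work.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (features, class, precision, recall, reported F1) copied from the table.
rows = [
    ("TF-IDF", "Engineered", 0.86, 0.90, 0.88),
    ("TF-IDF", "Environmental", 0.86, 0.90, 0.88),
    ("TF-IDF", "Host-associated", 0.94, 0.85, 0.89),
    ("Doc2VecMGnify", "Engineered", 0.77, 0.85, 0.81),
    ("Doc2VecMGnify", "Environmental", 0.89, 0.80, 0.84),
    ("Doc2VecMGnify", "Host-associated", 0.85, 0.85, 0.85),
]

for features, label, p, r, reported in rows:
    computed = f1_score(p, r)
    print(f"{features:14s} {label:16s} computed={computed:.2f} reported={reported:.2f}")
```

Because the published precision and recall are themselves rounded to two decimals, a recomputed F1 can differ from a reported one in the last digit; for this table the values agree after rounding.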
Token-wise precision, recall, and F1-score of the 16 best-performing NER models
| Entity | Learning rate | Epoch | Recall | Precision | F1-score |
|---|---|---|---|---|---|
| Ecoregion | 4e-5 | 50 | 0.95 | 1 | 0.98 |
| Host | 2e-5 | 90 | 0.89 | 0.93 | 0.9 |
| Engineered | 2e-5 | 10 | 0.65 | 0.93 | 0.75 |
| Date | 4e-5 | 90 | 0.78 | 0.91 | 0.83 |
| Place | 3e-5 | 90 | 0.78 | 0.86 | 0.82 |
| Site | 4e-5 | 10 | 0.71 | 0.85 | 0.77 |
| Body-site | 4e-5 | 90 | 0.98 | 0.95 | 0.97 |
| Sample-material | 5e-5 | 110 | 0.8 | 0.9 | 0.85 |
| State | 5e-5 | 110 | 0.65 | 0.8 | 0.71 |
| Treatment | 4e-5 | 30 | 0.66 | 0.8 | 0.73 |
| Kit | 2e-5 | 70 | 0.94 | 0.91 | 0.92 |
| Primer | 5e-5 | 70 | 0.94 | 0.97 | 0.96 |
| Gene | 1e-5 | 10 | 0.86 | 0.92 | 0.89 |
| LS | 5e-5 | 50 | 0.8 | 0.95 | 0.85 |
| LCM | 4e-5 | 50 | 0.86 | 1 | 0.92 |
| Sequencing | 5e-5 | 110 | 0.84 | 0.89 | 0.87 |
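To make the "token-wise" scores above concrete, the following sketch computes precision, recall, and F1 over individual tokens for a single entity type. It assumes that token-wise evaluation means each token is judged independently against its gold label; the function name, the `O` label for non-entity tokens, and the toy sentence are illustrative assumptions rather than details taken from the source.

```python
def token_metrics(gold, predicted, entity):
    """Token-level precision, recall, and F1 for one entity label.

    gold and predicted are equal-length lists of per-token labels.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == entity and p == entity)
    fp = sum(1 for g, p in zip(gold, predicted) if g != entity and p == entity)
    fn = sum(1 for g, p in zip(gold, predicted) if g == entity and p != entity)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy sentence (hypothetical): "DNA was extracted using the PowerSoil kit"
gold      = ["O", "O", "O", "O", "O", "Kit", "Kit"]
predicted = ["O", "O", "O", "O", "O", "Kit", "O"]

precision, recall, f1 = token_metrics(gold, predicted, "Kit")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the model tags one of the two kit tokens, so every predicted token is correct (high precision) while half of the gold tokens are missed (lower recall), the same pattern seen for entities such as Engineered and State in the table.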
Figure 2: Screenshot of a Europe PMC article. The annotations panel lists metagenomics entities and annotations (right), and annotations are highlighted in the full text using the SciLite tool (left) (SciLite article view from PMC8791192).
Figure 3: Composite screenshot showing annotations-enriched metadata for a MGnify study. Annotated publications are highlighted within a study (left) and annotations are shown in context (right).
Some examples of how the metagenomics annotations extracted from the publications describing ENA records enrich the metadata for those studies.
| Study | PMCID | MIxS (entity type) | ENA metadata | Metagenomics annotations |
|---|---|---|---|---|
| PRJDB8863 | PMC6941062 | pcr_primers (primer) | – | RNA (ribosomal RNA) genes (forward: 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGCCAGCMGCCGCGGTAA-3′; reverse: 5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGACTACHVGGGTWTCTAAT-3′) |
| PRJDB9293 | PMC8147061 | target_gene (gene) | – | 16S ribosomal RNA, V3–V4 |
| PRJDB5614 | PMC5745017 | nucl_acid_ext (kit) | – | RNA PowerSoil total RNA isolation kit |
| PRJEB27411 | PMC6072794 | Env_package (state) | – | 100-year drought, 75.6 mm of rainfall, mesotrophic, denitrification |
| PRJEB15392 | PMC5405060 | health_disease_stat (state) | – | Caries, gingivitis, medically healthy, oropharyngeal mucositis, poor oral hygiene, pulpal diseases |
| PRJEB22207 | PMC7044117 | Env_package (treatment) | – | Antibiotic, benzylpenicillin, cefotaxime, gentamicin, meropenem, metronidazole, probiotic supplementation, vancomycin |
| PRJDB10581 | PMC8151423 | env_local_scale (body-site) | body-site | Oral, rectal, cervical, posterior vaginal fornix |
For example, in the last row of the table, the project PRJDB10581 is linked to PMCID PMC8151423, from which the terms “Oral, rectal, cervical, posterior vaginal fornix” have been extracted to supplement the term “body-site” from the ENA record. Other examples include metadata about PCR primers (pcr_primers), marker genes (target_gene), nucleic acid extraction kits (nucl_acid_ext), body sites (env_local_scale), environmental phenomena (state), health or disease states (state), and treatments.
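The abstract notes that these annotations can also be retrieved through the Europe PMC annotations API. The sketch below only builds a request URL and performs no network call. The endpoint path, the `articleIds` and `format` parameter names, and the `SOURCE:ID` identifier form are assumptions based on the public Europe PMC annotations API and should be checked against its current documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC annotations API (verify before use).
BASE_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def build_annotations_url(source, ext_id, fmt="JSON"):
    """Build a query URL for the annotations of a single article.

    source/ext_id follow the assumed SOURCE:ID convention, e.g. MED plus a PMID.
    """
    query = urlencode({"articleIds": f"{source}:{ext_id}", "format": fmt})
    return f"{BASE_URL}?{query}"


# PMID of this article, taken from the record header.
url = build_annotations_url("MED", "35950838")
print(url)
```

The resulting URL could then be fetched with any HTTP client; the structure of the response, and whether metagenomics annotations are present for a given article, is not assumed here.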