Giuseppe Serna Garcia, Michele Leone, Anna Bernasconi, Mark J Carman.
Abstract
The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI's ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases. Database URL: http://gmql.eu/gemi/.
Year: 2022 PMID: 35657113 PMCID: PMC9216561 DOI: 10.1093/database/baac036
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
Figure 1. Number of samples (GSM, left y-axis) and experiments (GSE, right y-axis) made available by the GEO portal; raw data were extracted from https://www.ncbi.nlm.nih.gov/geo/browse/?view=samples and https://www.ncbi.nlm.nih.gov/geo/browse/?view=series.
List of attributes considered from the Cistrome and ENCODE datasets and how they are mapped into GeMI’s output
| Cistrome | ENCODE | GeMI |
|---|---|---|
| Cell Line | Cell Line | |
| Cell Type | Classification | Cell Type |
| Tissue Type | Biosample term name | Tissue |
| Assay Name | Technique | |
| Assay Type | Technique Type | |
| Target of Assay | Target | |
| Organism | Species | |
| Life stage | Life stage | |
| Age | Age | |
| Age units | Age units | |
| Sex | Sex | |
| Ethnicity | Ethnicity | |
| Health status | Disease | |
| Classification | Classification | |
| Investigated as | Feature | |
Figure 2. Example input document and possible output format.
| Input | Task | Output |
|---|---|---|
| <BOS> [Input sentence] <SEP> | Cell line: | HeLa-S3 <EOS> |
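The input/task/output layout in the table above can be sketched in a few lines of Python. This is only an illustration of the prompt format, not GeMI's actual code: the function names `build_prompt`, `parse_completion` and `extract_all` are hypothetical, the example description is made up, and `fake_generate` is a stub standing in for the fine-tuned GPT2 model.

```python
from typing import Callable, Dict, Iterable

# Special tokens as they appear in the table above.
BOS, SEP, EOS = "<BOS>", "<SEP>", "<EOS>"


def build_prompt(description: str, attribute: str) -> str:
    """Concatenate the sample description and the task prefix (e.g. 'Cell line:')."""
    return f"{BOS} {description.strip()} {SEP} {attribute}:"


def parse_completion(generated: str) -> str:
    """Keep only the text generated before the first <EOS> marker (assumed post-processing)."""
    value, _, _ = generated.partition(EOS)
    return value.strip()


def extract_all(
    description: str,
    attributes: Iterable[str],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Query the model once per attribute and collect key-value pairs."""
    return {
        attr: parse_completion(generate(build_prompt(description, attr)))
        for attr in attributes
    }


def fake_generate(prompt: str) -> str:
    """Stub model for demonstration only; a real system would call the language model here."""
    return " HeLa-S3 <EOS>" if prompt.endswith("Cell line:") else " unknown <EOS>"


# Hypothetical sample description, used only to exercise the functions.
pairs = extract_all("ChIP-seq on HeLa-S3 cells", ["Cell line", "Sex"], fake_generate)
print(pairs)
```

Conditioning on the attribute name in the prompt is what lets a single model serve every attribute, instead of training one model per field.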
Figure 3. The gradient-based saliency map implemented in the GeMI tool. For the prediction of the ‘Sex’ attribute of sample GSM1348947, the words ‘ms’, ‘benign tissue’, ‘benign prostate’ and ‘ms 36c7’ are highlighted because the model relies on them to make the prediction.
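To give an intuition for how a gradient-based saliency map assigns a weight to each word, here is a minimal sketch on a toy linear scorer. It is not GeMI's implementation: GeMI computes saliency on a GPT2 model, whereas the two-dimensional embeddings, the weight vector and the `saliency` function below are hypothetical values chosen purely for illustration. The variant shown is gradient-times-input.

```python
from typing import Dict, List, Sequence, Tuple


def saliency(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    weights: List[float],
) -> List[Tuple[str, float]]:
    """Gradient-times-input saliency for a toy linear scorer.

    The score is the mean over tokens of (weights . embedding), so the gradient
    of the score with respect to each token embedding is weights / n.
    """
    n = len(tokens)
    grad = [w / n for w in weights]  # analytic gradient, identical for every token
    raw = [
        abs(sum(g * x for g, x in zip(grad, embeddings[tok])))
        for tok in tokens
    ]
    top = max(raw) or 1.0  # avoid division by zero when all values are 0
    return [(tok, value / top) for tok, value in zip(tokens, raw)]


# Hypothetical toy embeddings and weights (not learned from any data).
toy_embeddings = {
    "benign": [0.1, 0.0],
    "prostate": [0.9, 0.2],
    "tissue": [0.1, 0.1],
    "sample": [0.0, 0.1],
}
toy_weights = [1.0, 0.5]

scores = saliency(["benign", "prostate", "tissue", "sample"], toy_embeddings, toy_weights)
most_salient = max(scores, key=lambda pair: pair[1])[0]
print(most_salient)
```

In a user interface, the normalized values would typically drive the highlight intensity of each word, so that the terms the model leaned on most stand out.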
Figure 4. The four iterative phases of the AL framework.
Figure 5. Schematic representation of the comparative experimental setting. On the right, separate models were trained on the ENCODE and Cistrome datasets (as presented in (7)). On the left, the task conditional setting presented in this work, employing the two training datasets together.
Figure 6. Bar plot representing the accuracy of experiments on the two separate baselines (trained on ENCODE or Cistrome data) and on our new model. On the x-axis we report all the attributes considered for prediction.
Comparison of training and inference times between (i) GeMI (with ENCODE-derived attributes) and the ENCODE baseline and (ii) GeMI (with Cistrome-derived attributes) and the Cistrome baseline
| Model | Training time | Inference time per sample |
|---|---|---|
| GeMI (Cistrome attr. only) | 10 h | 0.27 s |
| Baseline Cistrome | 2.49 h | 0.38 s |
| GeMI (ENCODE attr. only) | 10 h | 0.49 s |
| Baseline ENCODE | 0.58 h | 0.81 s |
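The trade-off in the table above (longer training, faster inference) is easier to see as ratios. The short sketch below only re-derives ratios from the reported numbers; the dictionary layout and function names are illustrative choices, not part of GeMI.

```python
# (training time in hours, inference time per sample in seconds), copied from the table above.
timings = {
    "Cistrome": {"gemi": (10.0, 0.27), "baseline": (2.49, 0.38)},
    "ENCODE": {"gemi": (10.0, 0.49), "baseline": (0.58, 0.81)},
}


def inference_speedup(dataset: str) -> float:
    """How many times faster GeMI is than the baseline at inference."""
    return timings[dataset]["baseline"][1] / timings[dataset]["gemi"][1]


def training_cost_ratio(dataset: str) -> float:
    """How many times longer GeMI takes to train than the baseline."""
    return timings[dataset]["gemi"][0] / timings[dataset]["baseline"][0]


for name in timings:
    print(
        f"{name}: inference {inference_speedup(name):.2f}x faster, "
        f"training {training_cost_ratio(name):.1f}x longer"
    )
```

On these figures, GeMI is roughly 1.4 times faster per sample than the Cistrome baseline and roughly 1.65 times faster than the ENCODE baseline, at the price of about 4 and 17 times the training time, respectively.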
Figure 7. Overview of the GeMI interface, divided into four panels. Panel A shows the loaded samples with original and predicted information. Panel B provides the gradient-based saliency map for the sample selected in the table above. Panel C shows the predicted values for the selected sample, also reporting the accuracy of the prediction. Panel D allows users to actively modify the prediction of the model and save the suggestions.
Attributes describing SARS-CoV-2 sequences in four data sources
| Semantic annotations predicted by GeMI | |
|---|---|
| POLR2A, CTCF, H3K4me3, H3K27ac | 16 |
| | |
| POLR2A, CTCF, H3K4me1, H3K27ac | 15 |
| POLR2A, CTCF, H3K4me3, H3K27me3 | 15 |
| POLR2A, CTCF, H3K4me3, unknown | 15 |
| POLR2A, CTCF, H3K4me1, H3K36me3 | 14 |
| POLR2A, CTCF, H3K4me3, H3K36me3 | 14 |
| POLR2A, CTCF, H3K4me1, unknown | 14 |
| POLR2A, CTCF, H3K4me1, H3K27me3 | 14 |
| POLR2A, CTCF, H3K27ac, H3K27me3 | 14 |
| | |
| POLR2A, CTCF, H3K27me3, H3K4me3 | 13 |
| POLR2A, CTCF, H3K27ac, H3K4me3 | 11 |
| POLR2A, CTCF, H3K27me3, H3K27ac | 10 |
| POLR2A, CTCF, H3K36me3, H3K4me3 | 8 |
| POLR2A, CTCF, H3K27me3, H3K36me3 | 8 |
| POLR2A, CTCF, MYC, H3K4me3 | 8 |
| | |
| POLR2A, CTCF, H3K4me1, H3K27me3 | 7 |
| POLR2A, CTCF, H3K4me1, H3K27ac | 7 |
| POLR2A, CTCF, H3K27ac, H3K36me3 | 6 |
List of semantic annotations for the set {POLR2A, CTCF, H3K4me1, H3K4me3}, using OnASSiS or GeMI
| Source | Cell type | Disease |
|---|---|---|
| OnASSiS | Cell, erythroblast | Unknown |
| | Cell | Unknown |
| | Endodermal cell | Colorectal cancer, cancer |
| | Fibroblast | Unknown |
| | Lining cell, mesodermal cell | Cancer, chronic myeloid leukemia |
| | Lining cell | Unknown |
| | Lining cell | Neuroblastoma |
| | Progenitor cell | Unknown |
| GeMI | B lymphocyte | Unknown |
| | Embryonic stem cell | Healthy |
| | Embryonic stem cell | Unknown |
| | Epithelium | Breast cancer (adenocarcinoma) |
| | Epithelium | Cervical adenocarcinoma |
| | Epithelium | Hepatocellular carcinoma |
| | Epithelium | Mammary ductal carcinoma |
| | Epithelium | Prostate adenocarcinoma |
| | Epithelium | Unknown |
| | Erythroblast | Chronic myelogenous leukemia |
| | Erythroblast | Unknown |
| | Fibroblast | Unknown |
| | Keratinocyte | Unknown |
| | Lymphoblastoid | Unknown |
Figure 8. Bar plot of the answers to the question about the intuitiveness of GeMI according to the survey participants.
Figure 9. Bar plot of the answers to the question about the usefulness of GeMI for users’ future research according to the survey participants.
Taxonomy of user-provided suggestions for improvement of GeMI
| Suggestion |
|---|
| Consider additional fields provided by GEO (e.g. platform) |
| Allow a free schema, to include user-defined attributes (possibly, information about genotype or treatment) |
| Add more possibilities to denote unknown attribute values: not specified, not applicable, none |
| User interface |
| Add information regarding the GSM in Panel B for user reference |
| Reshape screen to provide a more comprehensive initial view of panels |
| Normalize user corrections by providing guidelines and reference values (e.g. for life stage values) |
| Provide feedback to the user when the sample to be annotated changes |
| Return number and type of updated values after re-training |
| Perform a stronger training of the model with gene expression-related datasets (now skewed toward ChIP-seq) |
| Integrate the possibility of using parts of sentences suggested by users as ‘relevant for prediction’ |