Mustafa H Gunturkun, Efraim Flashner, Tengfei Wang, Megan K Mulligan, Robert W Williams, Pjotr Prins, Hao Chen.
Abstract
Interpreting and integrating results from omics studies typically requires a comprehensive and time-consuming survey of the existing literature. GeneCup is a literature mining web service that retrieves sentences containing user-provided gene symbols and keywords from PubMed abstracts. The keywords are organized into an ontology and can be extended to include results from human genome-wide association studies. As an example, we provide a drug addiction keyword ontology that contains over 300 keywords. The literature search is conducted by querying the PubMed server through a programming interface and then retrieving the matching abstracts from a local copy of the PubMed archive. The main results presented to the user are sentences in which a gene symbol and keywords co-occur. These sentences are presented through an interactive graphical interface or as tables. All results are linked to the original abstract in PubMed. In addition, a convolutional neural network is employed to distinguish sentences describing systemic stress from those describing cellular stress. The automated and comprehensive search strategy provided by GeneCup facilitates the integration of new discoveries from omics studies with existing literature. GeneCup is free and open source software. The source code of GeneCup and a link to a running instance are available at https://github.com/hakangunturkun/GeneCup.
Keywords: PubMed; addiction; custom ontology; literature mining; web service
Year: 2022 PMID: 35285473 PMCID: PMC9073678 DOI: 10.1093/g3journal/jkac059
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.542
Fig. 1. Overview of the workflow of GeneCup. GeneCup allows users to query the relationship of any gene with a list of keywords hierarchically organized into a custom ontology. This information is automatically extracted from PubMed and the NHGRI-EBI GWAS Catalog. Users can choose which keyword categories to include in the search. Searches are conducted using EUtils against the PubMed database, but abstracts are retrieved from a locally mirrored copy of PubMed. The results are displayed as a Cytoscape graph (Fig. 3) and a table. The graph and the table have many interactive elements, including the display of sentences that contain both the gene symbols and the keywords. Custom ontologies and search results are archived on the server if the user chooses to log in. When the default addiction ontology is used, sentences containing the keyword stress are classified by a CNN into 1 of 2 classes: systemic stress or cellular stress (Figs. 2 and 4). Dashed lines: server operations invoked as needed. Solid lines: server operations for default queries.
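The search step in this workflow can be illustrated with a minimal sketch that builds an NCBI E-utilities ESearch request for one gene and a set of keywords. The gene symbol, the keywords, the function name `build_esearch_url`, and the way the query string is composed are assumptions for illustration only; GeneCup's actual query construction may differ.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(gene, keywords, retmax=100):
    """Build an ESearch URL for abstracts mentioning a gene and any keyword.

    Hypothetical helper: the query shape is an assumption, not GeneCup's code.
    """
    # OR the keywords together, then AND the result with the gene symbol.
    keyword_clause = " OR ".join(f'"{k}"' for k in keywords)
    term = f'"{gene}" AND ({keyword_clause})'
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Illustrative gene symbol and keywords (not taken from the paper).
url = build_esearch_url("OPRM1", ["addiction", "relapse"])
print(url)
```

Fetching this URL should return a list of matching PMIDs, which, following the caption, would then be used to read the abstracts from the local PubMed mirror rather than from the remote server.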
Fig. 3. An interactive Cytoscape graph visualizing gene–keyword relationships. Nodes (circles) represent either search terms (in red) or keywords (colored according to the mini ontology; GWAS results are in gray). Clicking a keyword node displays the individual terms that are included in the search. Clicking a gene symbol displays its synonyms. The edges represent relationships between nodes. The number of PubMed abstracts in which the gene symbol and keyword co-occur in the same sentence is displayed on each edge. The width of the edge is correlated with the number of abstracts. Clicking on an edge shows these sentences, which are linked back to the PubMed abstracts. Nodes can be moved about for better visibility of relationships. These genes were taken from a recent genome-wide association study of opioid cessation (Cox ).
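The edge counts described in this caption can be sketched as a sentence-level co-occurrence count. The example below is a simplified illustration, not GeneCup's implementation: the sentence splitter is naive, and the gene symbol `GENE1`, the abstract identifiers, and the abstract texts are all made up.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def contains_term(sentence, term):
    """Case-insensitive whole-word match."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def cooccurrence_edges(abstracts, gene, keywords):
    """Map each keyword to its abstract count and matching sentences.

    abstracts: dict mapping an abstract identifier to its text.
    """
    edges = {}
    for keyword in keywords:
        ids, sentences = set(), []
        for abstract_id, text in abstracts.items():
            for sentence in split_sentences(text):
                if contains_term(sentence, gene) and contains_term(sentence, keyword):
                    ids.add(abstract_id)
                    sentences.append((abstract_id, sentence))
        if ids:
            # The abstract count is what would be drawn on the edge.
            edges[keyword] = {"abstracts": len(ids), "sentences": sentences}
    return edges


# Made-up abstracts and a placeholder gene symbol, for illustration only.
abstracts = {
    "A": "GENE1 expression increased after stress. Relapse was not measured.",
    "B": "We studied relapse in rats. GENE1 knockout reduced relapse and stress responses.",
    "C": "Stress is common. GENE1 is a gene.",
}
edges = cooccurrence_edges(abstracts, "GENE1", ["stress", "relapse"])
for keyword, edge in edges.items():
    print(keyword, edge["abstracts"])
```

Abstract C contributes no edge because the gene symbol and the keyword appear in different sentences, which is the distinction the sentence-level criterion is meant to capture.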
Fig. 2. Pipeline for training the CNN that classifies sentences containing the word “stress.” Terms specific to “systemic stress” or “cellular stress” were obtained by using the cosine similarity tool in Python’s Gensim library against the word2vec embeddings derived from PubMed and PMC text. Abstracts including these terms were fetched from PubMed. The words were then tokenized and split into training and validation sets. The input layer of the model passes the training data to the embedding layer. After a 1D convolutional layer, downsampling is implemented by a max pooling layer. The output is flattened and connected to 2 fully connected (dense) layers. The rectified linear unit (ReLU) function activates the neurons in the convolutional layer and the first dense layer. The last dense layer is activated by the sigmoid function. The trained model classifies input sentences into either systemic stress or cellular stress.
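The layer sequence in this caption (embedding, 1D convolution with ReLU, max pooling, flatten, a ReLU dense layer, and a sigmoid output) can be traced with a small forward pass in plain Python. This is only a sketch of the data flow: the weights are random and untrained, and every size below (vocabulary, embedding dimension, sequence length, number of filters, kernel width, pool size, hidden units) is an assumption, since the caption does not give the hyperparameters.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv1d_relu(seq, filters, biases):
    """Valid 1D convolution along the token axis, followed by ReLU."""
    width = len(filters[0])
    out = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        row = []
        for kernel, bias in zip(filters, biases):
            total = bias
            for kernel_row, vector in zip(kernel, window):
                total += sum(w * x for w, x in zip(kernel_row, vector))
            row.append(relu(total))
        out.append(row)
    return out


def max_pool1d(seq, pool):
    """Non-overlapping max pooling along the token axis, per channel."""
    return [[max(channel) for channel in zip(*seq[i:i + pool])]
            for i in range(0, len(seq) - pool + 1, pool)]


def dense(vector, weights, biases, activation):
    return [activation(b + sum(w * x for w, x in zip(row, vector)))
            for row, b in zip(weights, biases)]


def forward(token_ids, p):
    embedded = [p["embedding"][t] for t in token_ids]   # embedding lookup
    conv = conv1d_relu(embedded, p["filters"], p["conv_bias"])
    pooled = max_pool1d(conv, p["pool"])                 # downsampling
    flat = [v for row in pooled for v in row]            # flatten
    hidden = dense(flat, p["w1"], p["b1"], relu)         # first dense layer
    return dense(hidden, p["w2"], p["b2"], sigmoid)[0]   # sigmoid output


# All sizes are illustrative assumptions; weights are random, not trained.
rng = random.Random(0)
VOCAB, EMB, SEQ, N_FILTERS, WIDTH, POOL, HIDDEN = 20, 4, 10, 3, 3, 2, 5
flat_size = ((SEQ - WIDTH + 1) // POOL) * N_FILTERS
params = {
    "embedding": rand_matrix(rng, VOCAB, EMB),
    "filters": [rand_matrix(rng, WIDTH, EMB) for _ in range(N_FILTERS)],
    "conv_bias": [0.0] * N_FILTERS,
    "pool": POOL,
    "w1": rand_matrix(rng, HIDDEN, flat_size),
    "b1": [0.0] * HIDDEN,
    "w2": rand_matrix(rng, 1, HIDDEN),
    "b2": [0.0],
}
token_ids = [rng.randrange(VOCAB) for _ in range(SEQ)]
probability = forward(token_ids, params)
print(f"untrained output: {probability:.3f}")
```

The single sigmoid output lies between 0 and 1 and would be thresholded to choose between the two classes; which class maps to values near 1 is not stated in the caption and is left open here. In practice the same stack would be declared and trained with a deep learning library rather than written by hand.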
Fig. 4. Steps for classifying sentences using a trained neural network. Abstracts are fetched from the locally mirrored copy of PubMed and are parsed into sentences. Punctuation marks and stop words are removed, and the remaining words of each sentence are stemmed. The words are tokenized using the Tokenizer class of the Keras API. The weight matrices of the trained model are multiplied by the sentence matrix to predict whether the input sentences are related to systemic stress or cellular stress.
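The preprocessing steps in this caption can be sketched in plain Python. The stop-word list is a tiny illustrative subset, `naive_stem` is a crude stand-in for a real stemmer, and `fit_word_index` only loosely mirrors what the Keras Tokenizer does (integer ids assigned by descending word frequency, starting at 1). The example sentences and all function names are assumptions, not GeneCup code.

```python
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "were", "by", "that"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    if word.endswith("ss"):
        return word  # keep words such as "stress" intact
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(sentence):
    """Lowercase, drop punctuation and stop words, then stem."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(w) for w in cleaned.split() if w not in STOP_WORDS]


def fit_word_index(token_lists):
    """Assign ids by descending frequency, starting at 1 (0 is kept for padding)."""
    counts = {}
    for tokens in token_lists:
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: i for i, token in enumerate(ordered, start=1)}


def to_sequence(tokens, word_index, max_len):
    """Convert tokens to ids, truncate to max_len, and pad with zeros at the end."""
    ids = [word_index[t] for t in tokens if t in word_index][:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up example sentences, one per class, for illustration only.
sentences = [
    "Oxidative stress increased in the cultured cells.",
    "Chronic stress exposure altered the behavior of rats.",
]
token_lists = [preprocess(s) for s in sentences]
word_index = fit_word_index(token_lists)
sequences = [to_sequence(tokens, word_index, max_len=8) for tokens in token_lists]
print(token_lists[0])
print(sequences[0])
```

The fixed-length integer sequences are what an embedding layer would consume. Padding at the end is a choice made for this sketch; the padding side and the maximum length used by GeneCup are not given in the caption.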
Confusion matrix of CNN on test data.
| | Predicted: Systemic stress | Predicted: Cellular stress | |
|---|---|---|---|
| Actual: Systemic stress | 4,853 (TP) | 147 (FN) | Sensitivity: 97% |
| Actual: Cellular stress | 310 (FP) | 4,690 (TN) | Specificity: 94% |
| | Precision: 94% | Negative predictive value: 97% | Accuracy: 95% |
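The percentages in the confusion matrix follow directly from the four counts, with systemic stress treated as the positive class. The short sketch below recomputes them using the standard definitions; the function name `confusion_metrics` is only illustrative.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                # true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "precision": tp / (tp + fp),                  # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Counts taken from the confusion matrix above.
metrics = confusion_metrics(tp=4853, fn=147, fp=310, tn=4690)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Rounded to whole percentages, the computed values agree with those reported in the table.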