Maha Soliman, Olfa Nasraoui, Nigel G F Cooper.
Abstract
BACKGROUND: The volume of biomedical literature and its underlying knowledge base is expanding rapidly, beyond the ability of any single person to read it all. Several automated methods have been developed to help address this problem. The present study reports the results of a text mining approach that extracts gene interactions from the data warehouse of published experimental results; these interactions are then used to benchmark an interaction network associated with glaucoma. To the best of our knowledge, there is as yet no glaucoma interaction network derived solely from text mining approaches. Such a network could provide a useful summative knowledge base to complement other forms of clinical information related to this disease.
Keywords: Glaucoma; Interaction network; Relation extraction; Text mining
Year: 2016 PMID: 27152122 PMCID: PMC4857381 DOI: 10.1186/s13040-016-0096-2
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 The workflow pipeline followed to build the glaucoma interaction network. Step 1: PubMed Central is queried for glaucoma-related articles. Step 2: all glaucoma articles are collected and a glaucoma collection is constructed. Step 3: each document in the resulting collection is processed using the text mining pipeline detailed in Fig. 2, and a set of relations is obtained. Step 4: relations are stored in a database and filtered using SQL queries. Step 5: filtered relations are manually inspected to identify meaningful relations worthy of validation. Step 6: inspected relations are validated and evaluated against external reference databases. Step 7: validated relations are mapped to nodes and edges to form a potential glaucoma network. Step 8: network analysis of the resulting network is performed. The left panel contains the external databases needed by each step of the workflow. See Table 1 for definitions of BD and BO
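Step 4 of the workflow stores relations in a database and filters them with SQL, but the schema and the queries are not given here. The following is a minimal sketch under stated assumptions: the `relations` table, its columns, the placeholder gene names, and the filter itself (drop self-relations, deduplicate gene pairs) are all hypothetical and only illustrate the kind of query that step could use.

```python
import sqlite3

# Hypothetical schema: one row per relation extracted from one article.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relations "
    "(gene1 TEXT, verb TEXT, gene2 TEXT, confidence REAL, pmcid TEXT)"
)

# Placeholder data, not taken from the study.
rows = [
    ("GENE_A", "regulates", "GENE_B", 0.93, "PMC_1"),
    ("GENE_A", "regulates", "GENE_B", 0.81, "PMC_2"),  # same pair, second article
    ("GENE_C", "binds", "GENE_C", 0.90, "PMC_3"),      # self-relation, filtered out
    ("GENE_D", "inhibits", "GENE_E", 0.40, "PMC_4"),
]
conn.executemany("INSERT INTO relations VALUES (?, ?, ?, ?, ?)", rows)

# Illustrative filter: keep each distinct gene pair once, excluding self-relations.
filtered = conn.execute(
    """
    SELECT DISTINCT gene1, gene2
    FROM relations
    WHERE gene1 <> gene2
    ORDER BY gene1, gene2
    """
).fetchall()

print(filtered)
```

An in-memory SQLite database keeps the sketch self-contained; any relational store would serve the same purpose.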
Fig. 2 The text mining pipeline, corresponding to step 3 in Fig. 1. First, the segmenter module segments each article into its constituent sentences, denoted s1 to sn. Second, the sentence tokenizer module tokenizes each sentence into a bag of words, denoted w1 to wn. Third, the part-of-speech (POS) module identifies the role of each word in a sentence. Fourth, the named entity recognition (NER) module extracts gene mentions, denoted E1 to En, from the words of the sentence. Finally, the relation extraction (RE) module extracts relations, denoted R1 to Rn, from the words of the sentence. The output interaction from applying this sequence of modules has the form “Es, Rs, Es” and is saved in a database of interactions
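The sequence of modules in Fig. 2 can be illustrated with a toy sketch. It is not the pipeline used in the study: the regular-expression segmenter and tokenizer, the small gene lexicon standing in for NER, and the verb list standing in for relation extraction are all assumptions made for illustration, and the POS step is omitted entirely. Gene names are placeholders.

```python
import re

# Toy stand-ins (assumptions): a lexicon instead of an NER model,
# and a fixed verb list instead of a relation extractor.
GENE_LEXICON = {"GENE_A", "GENE_B", "GENE_C"}
RELATION_VERBS = {"regulates", "activates", "represses", "inhibits"}


def segment(text):
    """Split an article into sentences after terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens (letters, digits, '_' and '-')."""
    return re.findall(r"[A-Za-z0-9_-]+", sentence)


def extract_relations(text):
    """Return (entity, relation, entity) triples found in the text."""
    triples = []
    for sentence in segment(text):
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token.lower() not in RELATION_VERBS:
                continue
            left = [t for t in tokens[:i] if t in GENE_LEXICON]
            right = [t for t in tokens[i + 1:] if t in GENE_LEXICON]
            if left and right:
                # Nearest gene before the verb, first gene after it.
                triples.append((left[-1], token.lower(), right[0]))
    return triples


sample = "GENE_A regulates GENE_B. The eye is complex. GENE_C was measured."
print(extract_relations(sample))
```

Each triple has the “entity, relation, entity” shape described in the caption; only the first sample sentence contains two genes around a relation verb, so only it yields a triple.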
Glaucoma benchmark and non-benchmark genes used in building the network
| Abbreviation | Definition | Number | Percent |
|---|---|---|---|
| BO | Benchmark glaucoma genes from OMIM database queried with “Glaucoma” | 155 | 51 % |
| BD | Benchmark glaucoma genes from DisGeNET database queried with “Glaucoma” | 180 | 59 % |
| BC | Benchmark glaucoma genes from the intersection of OMIM and DisGeNET databases | 30 (BO∩BD) | 10 % |
| BG | Benchmark glaucoma genes from union of BO and BD | 305 (BO⋃BD) | 100 % |
| NBG | Non-benchmark genes from PubMed Central | 150 | N/A |
For simplicity, benchmark genes used to build the interaction network are abbreviated as BG. If BG are obtained from OMIM, then we call them BO. If BG are obtained from DisGeNET, then we call them BD. Benchmark genes, common to OMIM and DisGeNET, are called BC. Genes that are not benchmark genes are called NBG. The definition, number and percentages of all benchmark genes are listed in columns 2 to 4
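The counts in the table are related by inclusion-exclusion: the union BG equals BO plus BD minus their intersection BC. A short check, using only the numbers reported above, reproduces the union size and the percentages:

```python
# Counts reported in the table.
bo, bd, bc = 155, 180, 30

# Inclusion-exclusion: |BO ∪ BD| = |BO| + |BD| - |BO ∩ BD|
bg = bo + bd - bc

# Each subset as a percentage of the union, rounded to whole percent.
percent = {name: round(100 * n / bg) for name, n in
           {"BO": bo, "BD": bd, "BC": bc}.items()}

print(bg, percent)
```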
Fig. 7 Extracted glaucoma network. The glaucoma network is laid out with different node sizes. The node size reflects the degree of a gene, where the degree is the sum of its in-degree and out-degree links. Nodes colored in cyan belong to BC. Known relations are colored in black. Newly extracted relations are colored in blue. Relations with disconnected nodes are colored in green. Relations with unverified nodes are colored in red
Genes related to Primary Open Angle Glaucoma (POAG) and Exfoliation syndrome (XFS)
| Gene | Disease | Confidence | Support |
|---|---|---|---|
| | POAG | 0.98 | 30 |
| | XFS | 0.98 | 12 |
| | XFS | 0.98 | 1 |
| | POAG | 0.97 | 12 |
| | POAG | 0.97 | 4 |
| | POAG | 0.97 | 2 |
| | POAG | 0.96 | 2 |
| | POAG | 0.96 | 1 |
| | POAG | 0.94 | 7 |
| | POAG | 0.94 | 3 |
| | POAG | 0.93 | 17 |
| | POAG | 0.93 | 5 |
| | POAG | 0.92 | 13 |
| | POAG | 0.92 | 2 |
| | POAG | 0.92 | 2 |
| | POAG | 0.91 | 4 |
| | POAG | 0.89 | 2 |
| | XFS | 0.88 | 1 |
| | XFS | 0.88 | 2 |
| | POAG | 0.87 | 2 |
| | POAG | 0.86 | 3 |
| | POAG | 0.86 | 2 |
| | POAG | 0.85 | 2 |
| | POAG | 0.83 | 4 |
| | XFS | 0.83 | 1 |
| | POAG | 0.82 | 2 |
| | POAG | 0.81 | 2 |
| | POAG | 0.78 | 2 |
| | XFS | 0.67 | 1 |
| | XFS | 0.67 | 1 |
| | POAG | 0.66 | 1 |
The gene and its related disease are listed under the “Gene” and “Disease” columns respectively. The confidence column is the maximum of all confidence values reported by ReVerb for the same relation, extracted from multiple articles. Relations with low confidence are bolded. The support column is the count of articles listing the same gene relation
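The two aggregate columns can be sketched as follows. The input tuples are placeholders, not data from the study, and counting each article at most once per relation is an assumption about how repeated mentions within one article are handled.

```python
from collections import defaultdict

# Placeholder extractions: (gene, disease, confidence, article id).
extractions = [
    ("GENE_A", "POAG", 0.98, "PMC_1"),
    ("GENE_A", "POAG", 0.75, "PMC_2"),
    ("GENE_A", "POAG", 0.60, "PMC_2"),  # same article again: no extra support
    ("GENE_B", "XFS", 0.67, "PMC_3"),
]


def summarize(extractions):
    """Map each (gene, disease) relation to (max confidence, support)."""
    confidences = defaultdict(list)
    articles = defaultdict(set)
    for gene, disease, confidence, article in extractions:
        confidences[(gene, disease)].append(confidence)
        articles[(gene, disease)].add(article)
    return {key: (max(confidences[key]), len(articles[key]))
            for key in confidences}


summary = summarize(extractions)
print(summary)
```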
Fig. 3 Illustration of the types of extracted relations found by GeneMANIA in the glaucoma corpus. The workflow extracted 257 relations in total: 76 known, 149 new, 11 disconnected, and 21 unverified. Each type of relation is represented by a picture below it. A known relation is illustrated by three circles directly linked to each other, where a circle represents a gene. A new relation is illustrated by a dotted line between the blue and black genes, because an indirect path can be established from the blue gene to the black gene through the red gene. An unverified relation is illustrated by a question mark in the black gene and a dotted line between the blue and black genes. A disconnected relation is illustrated by a black gene that is disconnected from the rest of the connected genes
Twenty-one extracted relations with unverified links from GeneMANIA
| Gene1 | Gene2 | Confidence | Unverified node | PMC Excerpt | PMCID/Year | Remark |
|---|---|---|---|---|---|---|
| | | 0.93 | | CDKN2B-AS1 has been shown to be involved in the regulation of CDKN2B, CDKN2A and ARF expression. | PMC4132588/2014 | |
| | | 0.93 | | CDKN2B-AS1 has been shown to be involved in the regulation of CDKN2B, CDKN2A and ARF expression. | PMC4132588/2014 | CDKN2B-AS1 is CDKN2B antisense. |
| | | 0.93 | | CDKN2B-AS1 has been shown to be involved in the regulation of CDKN2B, CDKN2A and ARF expression. | PMC4132588/2014 | CDKN2B-AS1 is CDKN2B antisense. |
| | | 0.92 | | CDKN2BAS also regulates the expression of CDKN2A, a gene previously shown to be down-regulated in other neurodegenerative disorders, including Alzheimer’s disease, suggesting that regulation of CDKN2A expression by CDKN2BAS could also contribute to degeneration of the optic nerve in glaucoma. | PMC3343074/2012 | CDKN2BAS is CDKN2B antisense. GeneMANIA does not recognize gene anti-sense |
| | | 0.90 | | In mouse, human OSM activates the heterodimer of LIF receptor ß (LIFRß and gp130, like CNTF. | PMC4171539/2014 | LIFRB is a mouse gene that GeneMANIA did not recognize |
| | | 0.9 | | Protein levels of VEGFA were also down-regulated with miR410 overexpression and up-regulated with miR-410 interference. | PMC400246/2014 | GeneMANIA does not recognize microRNAs. |
| | | 0.89 | | The binding of STAT1 induces the expression of ANRIL, and represses CDKN2B in endothelial cells. | PMC3565320/2013 | GeneMANIA does not recognize locus |
| | | 0.83 | | DKK1 and KCNJ2 which were shown to be affected by PITX2 siRNAs by real time PCR experiments were each previously reported in one study. | PMC2654047/2009 | |
| | | 0.83 | | DKK1 and KCNJ2 which were shown to be affected by PITX2 siRNAs by real time PCR experiments were each previously reported in one study. | PMC2654047/2009 | |
| | | 0.82 | | LTBP2 was predicted to be regulated by KLF4 (at 10 promoters), SP1 (at eight promoters), GATA4 and TEAD (at five promoters) and XCPE1 (at four promoters) was associated with LTBP2. | PMC4019825/2014 | |
| | | 0.78 | | To narrow down the potential candidate CNVs (genes) and match the identified CNVs to target regions and/or genes, we first focused on known chromosomal loci for PCG, namely GLC3A (2p2-p21), which harbors CYP1B1, GLC3B (1p36.2-p36.1), and GLC3C (14q23). | PMC3250374/2011 | GeneMANIA does not recognize gene locus |
| | | 0.78 | | To narrow down the potential candidate CNVs (genes) and match the identified CNVs to target regions and/or genes, we first focused on known chromosomal loci for PCG, namely GLC3A (2p2-p21), which harbors CYP1B1, GLC3B (1p36.2-p36.1), and GLC3C (14q23). | PMC3250374/2011 | GeneMANIA does not recognize gene locus |
| | | 0.74 | | Recently, it was found that E50K mutant strongly interacted with TBK1, which evoked intracellular insolubility of OPTN, leading to improper OPTN transition from the endoplasmic reticulum to the Golgi body. | PMC4077773/2014 | GeneMANIA recognizes |
| | | 0.74 | | The 3′ deletion identified in family 86 contained ELP4 and DCD4, which are located downstream of PAX6. | PMC3044699/2011 | |
| | | 0.60 | | However, catalytically inactive CMT disease-related MTMR2 mutants lead to NEFL assembly defects and to pathologies similar to the one caused by NEFL mutations, suggesting that MTMR2 and NEFL may function in a common pathway in the development and maintenance of peripheral axons. | PMC3514635/2012 | GeneMANIA does not recognize |
| | | 0.50 | | It has been suggested that inhibition of EPO production could be caused by the toxicity of prefibrillar aggregates of TTR V30M. | PMC4087117/2014 | GeneMANIA recognizes |
| | | | | Further characterization of BDNF-AS indicates that BDNF-AS recruits EZH2 and the PRC2 complex to the BDNF promoter to repress BDNF transcription through H3K27me3 histone modifications. | PMC4047558/2014 | |
| | | | | Further characterization of BDNF-AS indicates that BDNF-AS recruits EZH2 and the PRC2 complex to the BDNF promoter to repress BDNF transcription through H3K27me3 histone modifications. | PMC4047558/2014 | |
| | | | | Further characterization of BDNF-AS indicates that BDNF-AS recruits EZH2 and the PRC2 complex to the BDNF promoter to repress BDNF transcription through H3K27me3 histone modifications. | PMC4047558/2014 | |
| | | | | It would be interesting to investigate whether the application of an inhibitor to CSTA, such as its siRNA, could restore the normal MYOC processing and affect the outcome of the disease. | PMC3352898/2012 | |
| | | | | More, recently, Minegishi and coworkers reported that the over-expression of a glaucoma causing-mutation in OPTN, Glu50Lys, produces an accumulation of insoluble OPTN protein that can be blocked with chemical inhibition of TBK1 activity in HEK293 cells. | PMC4038935/2014 | |
The genes in each extracted relation are listed under the “Gene1” and “Gene2” columns, respectively. A measure of confidence, reported by ReVerb, is listed under the “Confidence” column, and relations with low confidence (<0.5) are bolded. The unverified node is listed under the “Unverified node” column. The associated text that relates the two genes is listed under the “PMC Excerpt” column. Some genes were identified by their synonyms found in either GeneCards or GeneMANIA. The PMCID of the original article, coupled with the year of publication, is given under the “PMCID/Year” column. Important remarks and gene synonyms may be listed under the “Remark” column
Eleven extracted relations with disconnected gene nodes from GeneMANIA
| Gene1 | Gene2 | Confidence | Disconnected node | PMC Excerpt | PMCID/Year | Remark |
|---|---|---|---|---|---|---|
| | | 0.96 | | ELP4 and DCDC1 are located downstream of PAX6. | PMC2375324/2008 | |
| | | 0.93 | | ALB was used to normalize ELP4 and PAX6 values for the detection of the relative copy number of the deletion region. | PMC3859656/2013 | |
| | | 0.88 | | We found 10 candidate POAG genes that were highly expressed in both the CPE and NPE (AKAP13, C1QBP, CHSY1, COL8A2, CYP1B1, FBN1, IBTK, MFN2, TMCO1, and TMEM248), three genes that were expressed significantly higher in the CPE (CDH1, CDKN2B, and SIX1), and six genes that were expressed significantly higher in the NPE (ATOH7, CYP1B1, FBN1, MYOC, PAX6, and SIX6). | PMC3909915/2014 | |
| | | 0.88 | | We found 10 candidate POAG genes that were highly expressed in both the CPE and NPE (AKAP13, C1QBP, CHSY1, COL8A2, CYP1B1, FBN1, IBTK, MFN2, TMCO1, and TMEM248), three genes that were expressed significantly higher in the CPE (CDH1, CDKN2B, and SIX1), and six genes that were expressed significantly higher in the NPE (ATOH7, CYP1B1, FBN1, MYOC, PAX6, and SIX6). | PMC3909915/2014 | |
| | | 0.85 | | For example, GSK3B has a direct connection with IL4 and a secondary connection with MTHFR. | PMC2653647/2009 | |
| | | 0.85 | | Each bar represents the relative expression of VSX1 normalized to GAPDH in a different tissue/age; mean ± SD (Sc: sclera, Co: cornea, Ir: iris, CB: ciliary body, Len: lens, Cho: | PMC2267740/2008 | |
| | | 0.80 | | the HMGB1 inhibitor GA attenuated diabetes-induced upregulation of HMGB1 and downregulation of BDNF | PMC3671668/2013 | |
| | | 0.78 | | Thus the SHH and GDF11 regulate ATOH7, which in turn regulates Brn3b. | PMC2883590/2010 | |
| | | | | Recent immunohistological studies in NPS patients with severe glomerular disease suggest a possible regulation of type III collagen by LMX1B, while the homozygous | PMC2669506/2007 | |
| | | | | Research has demonstrated that retinal neurons and RGCs are mainly comprised of anteriorized NPS that express PAX6 and OTX2. | PMC3747054/2013 | |
| | | | | Research has demonstrated that retinal neurons and RGCs are mainly comprised of anteriorized NPS that express PAX6 and OTX2 | PMC3747054/2013 | |
The genes in each extracted relation are listed under the “Gene1” and “Gene2” columns, respectively. A measure of confidence, reported by ReVerb, is listed under the “Confidence” column, and relations with low confidence (<0.5) are bolded. The disconnected node in the relation is listed under the “Disconnected node” column. The associated text that relates the two genes is listed under the “PMC Excerpt” column. Some genes were identified by their synonyms found in either GeneCards or GeneMANIA. The PMCID of the original article, coupled with the year of publication, is given under the “PMCID/Year” column. Important remarks and gene synonyms may be listed under the “Remark” column
Percentages of extracted relations
| Finding Type | Description | Percentage |
|---|---|---|
| Known | Verified | 76/257 ~ 30 % |
| New | Can be verified via one or more indirect paths from the known network | 149/257 ~ 58 % |
| Disconnected | Potential discovery that can be verified by lab experiment in the future | 11/257 ~ 4 % |
| Unverified | Gene symbols could not be found in GeneMANIA, HUGO or GeneCards | 21/257 ~ 8 % |
The total number of unique and valid relations is 257; these are classified as known, new, disconnected, or unverified. The description and percentage of each class are given under the “Description” and “Percentage” columns
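As a quick consistency check on the table, the four class counts should sum to 257 and the rounded shares should match the reported percentages. The sketch below uses only the counts given above.

```python
# Class counts reported in the table.
counts = {"Known": 76, "New": 149, "Disconnected": 11, "Unverified": 21}

total = sum(counts.values())

# Share of each class, rounded to whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total, shares)
```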
Fig. 4 The top 50 gene pair occurrences in our filtered glaucoma corpus. The occurrence frequency of a pair is calculated as the number of articles that list this pair in their content. Multiple occurrences of a pair within one article are counted as one occurrence
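The counting rule in this caption (one occurrence per article, however often the pair appears) can be sketched as below. The article identifiers and gene names are placeholders, and treating any two genes mentioned in the same article as a pair is an assumption; the study may define a pair more narrowly, for example as the two genes of one extracted relation.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(articles):
    """Count, for each unordered gene pair, the number of articles listing it."""
    counts = Counter()
    for genes in articles.values():
        # De-duplicate within the article so repeats count once,
        # and sort so (A, B) and (B, A) map to the same key.
        unique = sorted(set(genes))
        counts.update(combinations(unique, 2))
    return counts


# Placeholder corpus: article id -> gene mentions in order of appearance.
articles = {
    "PMC_1": ["GENE_A", "GENE_B", "GENE_A"],
    "PMC_2": ["GENE_B", "GENE_A"],
    "PMC_3": ["GENE_A", "GENE_C"],
}

frequencies = pair_frequencies(articles)
print(frequencies.most_common(50))
```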
Fig. 5 Biological processes associated with extracted non-benchmark genes. A pie chart, generated with the aid of PANTHER, listing the biological processes associated with the 150 extracted non-benchmark genes
Functional analysis of the 150 extracted non-benchmark genes
| Biological Process | Gene Count | Corrected p-value |
|---|---|---|
| Regulation of apoptosis** | 25 | 2.73E-06 |
| Inflammatory Response* | 12 | 0.002 |
| Immune Response** | 17 | 0.004 |
| Regulation of response to stimulus** | 9 | 0.01 |
| Defense Response* | 15 | 0.01 |
Biological processes reported by DAVID are suffixed by * and are listed with their gene count and corrected p-value. Biological processes common to both PANTHER and DAVID are suffixed by ** and are listed with their gene count and corrected p-value, obtained from DAVID
Fig. 6 Pathways associated with extracted non-benchmark genes. Common pathways reported with the aid of PANTHER and GeneCodis for the 150 extracted non-benchmark genes
Pathway analysis of the 150 extracted NBG
| Pathway name | Count of genes in pathway | FDR | % of genes in pathway | Supporting References |
|---|---|---|---|---|
| Gonadotropin releasing hormone receptor pathway b | 9 | | 8.1 | [ |
| Interleukin signaling pathway b | 6 | | 5.4 | [ |
| Wnt signalling pathway a | 5 | 0.006 | 4.2 | [ |
| Jak-STAT signaling pathway a | 5 | 0.001 | 1.8 | [ |
| PDGF signaling pathway b | 5 | | 4.5 | [ |
| TGF-beta signaling pathway a | 4 | 0.01 | 3.6 | [ |
| Apoptosis signaling pathway b | 2 | | 1.8 | [ |
Common pathways, reported by both GeneCodis and PANTHER, are suffixed by a, and the associated false discovery rate (FDR) from GeneCodis is reported. Pathways reported by PANTHER only are suffixed by b. The percentage of total genes in the pathway is reported along with supporting references that link glaucoma to the pathway
Fig. 8 Glaucoma network path distribution by the Cytoscape network analyser
Fig. 9 Glaucoma network node degree distribution. The glaucoma node degree distribution, generated by the Cytoscape network analyser, follows a power law fitted to the form $y = 137.67\,x^{-1.99}$
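One common way to fit a power law of the form y = a·x^b is a least-squares line in log-log space, since log y = log a + b·log x. The sketch below applies that technique to synthetic points generated from the reported curve; it is not a claim about how the Cytoscape analyser performs its fit, and the degree values are illustrative only.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
             / sum((u - mean_x) ** 2 for u in log_x))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope  # (a, b)


# Synthetic data lying exactly on the reported curve (not the study's data).
degrees = [1, 2, 3, 4, 5]
node_counts = [137.67 * k ** -1.99 for k in degrees]

a, b = fit_power_law(degrees, node_counts)
print(round(a, 2), round(b, 2))
```

Because the synthetic points lie exactly on the curve, the fit recovers the coefficient and exponent up to floating-point error; real degree distributions are noisy and would not.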
Genes (nodes) with the top 10 degrees in the extracted glaucoma interaction network
| Gene (node) | Degree |
|---|---|
| CYP1B1 | 17 |
| | 14 |
| | 13 |
| | 11 |
| | 10 |
| | 9 |
| | 9 |
| | 9 |
| | 9 |
| | 9 |
The degree column represents the total number of a node’s ingoing and outgoing links. Note that CYP1B1 heads the list with a total of 17 links
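The degree defined here (ingoing plus outgoing links) can be computed directly from a directed edge list. The edges below are placeholders, not relations from the study.

```python
from collections import Counter


def node_degrees(edges):
    """Total degree per node: each directed edge adds one out-link to its
    source and one in-link to its target."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


# Placeholder directed edges (source gene, target gene).
edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_C", "GENE_A"),
    ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_D"),
]

degrees_by_node = node_degrees(edges)
print(degrees_by_node.most_common(10))
```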
Distribution of articles in the text retrieval step, depending on their accessibility and relevance
| | Relevant | Not Relevant | Total |
|---|---|---|---|
| Retrieved open access articles | 7425 | 1235 | 8660 |
| Restricted access (not Retrieved) | 22733 | unknown | __ |
| Total | 31393 | __ | __ |
Relevant articles are those that contain at least one occurrence of the word “glaucoma” in their text. The portion of restricted-access articles that are not relevant is unknown to us at the time of writing this article
Evaluation metrics for the retrieval step
| Metric | Value |
|---|---|
| Precision | 7425/8660 = 85 % |
| Recall | 7425/31393 = 23 % |
| F1 | 36 % |
Evaluation metrics are computed based on Table 9. Note that recall is limited by the number of open access articles at this time
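The three metrics can be recomputed from the counts given in the tables. In this sketch the unrounded values come out slightly above the reported ones (about 85.7 %, 23.7 % and 37.1 %), which suggests the table truncates rather than rounds; an F1 of 36 % is what results from the truncated 85 % and 23 %. That reading is an inference from the arithmetic, not something stated in the source.

```python
# Counts reported in the tables.
relevant_retrieved = 7425   # retrieved open access articles that are relevant
retrieved = 8660            # all retrieved open access articles
relevant_total = 31393      # denominator used for recall in the table

precision = relevant_retrieved / retrieved
recall = relevant_retrieved / relevant_total
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# F1 from the truncated percentages shown in the table.
f1_from_table = 2 * 0.85 * 0.23 / (0.85 + 0.23)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"f1 from truncated values={f1_from_table:.3f}")
```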
Performance measures of the used LingPipe NER tagger
| Tagger | Entity Type | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|---|
| GENIA | Protein | 81.41 | 65.82 | 72.79 |
| | DNA | 66.76 | 65.64 | 66.2 |
| | RNA | 68.64 | 60.45 | 64.29 |
| | Cell Line | 59.6 | 56.12 | 57.81 |
| | Cell Type | 70.54 | 78.51 | 74.31 |
| | Overall | 75.78 | 67.45 | 71.37 |
| GENTAG | Gene/Protein | 79 | 88 | 70.8 |
Reported measures for the GENIA tagger are based on the GENIA performance web site [62], while the performance measures of the GENTAG tagger are the averages of the measures reported in [63, 64]
Fig. 10 A smaller glaucoma interconnected subnetwork resulting from applying the modularity algorithm in Gephi on the original glaucoma network. The glaucoma network in Fig. 7 was subjected to the Gephi modularity clustering algorithm to identify communities and classes within the network. Five distinct classes, colored in green, purple, red, yellow, and blue, respectively, can be seen
Clusters extracted from the giant components in the glaucoma network and their associated profiles
| Cluster | # Nodes | BG | NBG | Node with highest degree | Known relations | New relations | Unverified relations | Disconnected relations |
|---|---|---|---|---|---|---|---|---|
| Green | 36 | 6 | 11 | | 10 | 14 | 8 | 4 |
| Purple | 23 | 1 | 9 | | 7 | 15 | 0 | 1 |
| Red | 15 | 0 | 5 | | 2 | 12 | 0 | 1 |
| Yellow | 13 | 2 | 5 | | 2 | 11 | 0 | 0 |
| Blue | 9 | 2 | 7 | | 4 | 5 | 0 | 0 |
The giant components in the glaucoma network depicted in Fig. 7 are clustered into five clusters. Clusters are ordered in descending order of the number of nodes in each cluster. Cluster properties include number of BG, NBG, highest degree, and the number of different types of relations contained within the cluster
Fig. 11 Modularity community classes and associated node sizes. The modularity classes are listed on the X axis, while the number of nodes is on the Y axis. The highest number of nodes is 36 in modularity class 3, while the lowest is 9 in modularity class 0. The value of modularity, before and after applying the resolution, is listed at the top left of the figure. A resolution value of 9.0 was used with the modularity algorithm to obtain dense, well-separated classes