Discovering and visualizing indirect associations between biomedical concepts.

Yoshimasa Tsuruoka, Makoto Miwa, Kaisei Hamamoto, Jun'ichi Tsujii, Sophia Ananiadou.

Abstract

MOTIVATION: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner.
RESULTS: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance.
AVAILABILITY: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/.
CONTACT: tsuruoka@jaist.ac.jp.

Year: 2011 PMID: 21685059 PMCID: PMC3117364 DOI: 10.1093/bioinformatics/btr214

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Text search engines such as PubMed are crucial in everyday research activities in biomedical sciences as a significant fraction of biomedical knowledge is still accessible only in textual form. We have previously developed FACTA, a text-mining system designed to help researchers find direct associations between biomedical concepts in an interactive fashion (Tsuruoka ). It is capable of producing ranked lists of important biomedical concepts, e.g. genes, diseases and chemical compounds, which are considered relevant to the query according to their co-occurrence statistics. The system has also been used as a search engine (Kemper ) to link biomolecular pathways to textual evidence. This article describes three new classes of functionality that are introduced to extend and improve FACTA. The first extension is the use of biomolecular events as semantic metadata used for search. Semantic metadata derived from text to index digital documents for retrieval purposes have been used in systems like SUISEKI (Blaschke and Valencia, 2002), iHOP (Hoffmann and Valencia, 2005), Chilibot (Chen and Sharp, 2004), GoWeb (Dietze and Schroeder, 2009), Hanalyzer (Leach ), Semantic MEDLINE (Kilicoglu ), MEDIE (Miyao ), EBIMed (Rebholz-Schuhmann ) and KLEIO (Nobata ). Automatic extraction of events has a broad range of applications in biology (Ananiadou ), ranging from support for the creation and annotation of pathways (Kemper ) to automatic population or enrichment of databases. This novel event extension allows users to specify a concept which involves a biomolecular event. The second extension is to help users discover indirectly associated concepts. Discovering hidden, previously unknown and potentially useful associations between biomedical concepts such as diseases and chemical compounds from the literature is a long-standing goal in biomedical text-mining (Swanson and Smalheiser, 1997). 
The pioneering work of Swanson (1986) hypothesized the role of fish oil in clinical treatment of Raynaud's disease, combining different pieces of information from the literature, and the hypothesis was later confirmed with experimental evidence. Among various approaches to the automatic generation of such hypotheses (Frijters ; Weeber ; 2005; Wren ), we adopt a simple approach using two-step associations. More specifically, we attempt to discover indirect associations by combining two known associations which are obtained from direct co-occurrence statistics. In this work, we give a probabilistic interpretation to the strengths of the discovered indirect associations, which allows the system to rank them in the order of their expected amount of information. The third extension is visualization. The output format of FACTA was tabular: the associated concepts found by the system are categorized, ranked and presented in multiple columns. Although the tabular format may be useful enough in most cases, visualizing the output can help the user grasp the results more intuitively. The visualization component we have introduced in FACTA+ uses a technique called treemapping (Shneiderman, 2009), which enables the user to easily understand the relative importance of many different concepts. This article is organized as follows. Section 2 describes our machine learning approach to detecting biomolecular events in text. Section 3 presents our algorithm for discovering indirect associations by using co-occurrence statistics. Section 4 describes the functionality of visualizing associations detected by the text-mining components. Concluding remarks are given in Section 5.

2 RECOGNIZING BIOMOLECULAR EVENTS

The first extension we have introduced in FACTA is the ability to detect biomolecular events mentioned in MEDLINE articles, thereby allowing the user to issue queries including such events. For example, FACTA+ allows the user to specify the documents that contain the word ‘ERK2’ and also mention positive regulation events, by using the query ‘ERK2 GENIA:Positive_regulation’. This extension is motivated by the fact that biomolecular events have recently received considerable attention as an important type of information in biomedical text-mining (Ananiadou ; Bjorne ; Miwa ). In this work, our definition of biomolecular events follows that of the GENIA event corpus (Kim ), in which events are basically characterized by verbs or nominalized verbs. For example, the sentence ‘In Escherichia Coli, glnAP2 may be activated by NifA.’ contains one event specified by the verb ‘activated’, with its arguments being ‘In Escherichia Coli’, ‘glnAP2’ and ‘NifA’. In the GENIA event definition, every event is represented with a trigger and its arguments. Table 1 shows some examples of the events in the corpus with the trigger words italicized. For example, ‘express’ is the trigger word for the gene expression event in the first row in the table.

Table 1.

Examples of event-describing phrases

Event type	Phrase
Gene expression	Although resting Jurkat cells express Fas, …
Positive regulation	Fas mRNA was induced approximately 10-fold in …
Binding	… responses induced by CD40 engagement.
Phosphorylation	Differential expression and phosphorylation of CTCF, a c-myc …
Regulation	Transcriptional regulation of the alpha2 integrin gene in …
Negative regulation	…, a specific inhibitor of MAPK kinase 1, …

The terms in bold are protein names, and the italicized words are event triggers.

Recognizing the complete information on events involves the task of detecting triggers and their arguments, and there is a large body of previous work tackling this problem (Airola ; Divoli and Attwood, 2005; Huang ; Miwa ; Miyao ). However, we are not concerned with the task of detecting arguments in this article, since FACTA+ only uses information on abstract-level occurrences of concepts. Since every event is represented with a trigger, what we need for event recognition is a component that can accurately detect triggers in text. Perhaps the most straightforward approach to detecting trigger words in text is to use a dictionary, but pure dictionary-matching is not suitable for event recognition, since trigger words are often very ambiguous. For example, as seen in Table 1, the word ‘express’ is used as a trigger word for the gene expression event, but ‘express’ is also a very common verb that is used with many different meanings. Therefore, including the word ‘express’ in the dictionary would produce many false positive matches. We use a machine learning-based approach to sidestep this ambiguity problem, and use the GENIA event corpus (Kim ) as training data. More specifically, we used the data released for the BioNLP'09 shared task (Kim ) for training and testing our machine learning models. This shared task data is derived from the GENIA event corpus and contains annotations on nine event types concerning protein biology, which are a subset of the biomolecular events defined in the GENIA event ontology. The machine learning models trained on the shared task data are used to recognize event triggers in text and their event types, and FACTA+ simply regards the detection of a trigger as an occurrence of the corresponding event in the abstract.
This simple approach risks producing false positives, because we disregard some semantically important types of information such as modality and negation (Garten ; Krallinger, 2010; Nawaz ). We leave the handling of such information for future work.

2.1 Related work

The most straightforward approach to detecting trigger words is to use a dictionary. Buyko ) created a dictionary by manually curating and extending a lexicon derived from the original GENIA corpus with the help of researchers with a background in biology. A disambiguation step is performed by considering the co-occurrence statistics between each trigger word and the event types in a training corpus. This disambiguation is used for some dictionary-based approaches [e.g. Kilicoglu and Bergler (2009), MacKinlay )]. Vlachos ) extracted frequent triggers using a one-sense-per-term assumption, and performed soft matching (using lemmas and stems) to alleviate the problem of potential variability of trigger words. Vlachos (2010) extended the extracted dictionary by incorporating ‘light’ and ‘ultra-light’ triggers, which represent the discriminative modifiers of triggers. Kaljurand ) extracted the dictionary from a training corpus, and disambiguated the trigger words by considering two kinds of co-occurrence statistics: one between each token and token considered to be a trigger and one between an event structure (event type and argument combination) and the trigger. Kilicoglu and Bergler (2009) manually cleaned the dictionary by removing ambiguous triggers, and also added variations of prefixes and nominal forms of verbs to the dictionary. Van Landeghem ) built two separate, manually cleaned dictionaries for unary events and other events. Cohen ) selected triggers by iteratively testing manually constructed patterns. Another popular approach is to use machine learning. Björne ) and Miwa ) used multi-class support vector machines (SVMs) to detect and disambiguate trigger words. Morante ) detected and disambiguated trigger words using the IB1 memory-based classifier. MacKinlay ) combined the outputs from a dictionary-based look-up tagger and a conditional random field (CRF)-based tagger. Some other approaches detected events without a specialized module for trigger detection.
Riedel ) and Poon and Vanderwende (2010) detected events using Markov logic networks (MLNs). Neves ) used case-based reasoning, which finds the ‘case-solution’ of a token, including event, trigger, and argument types, by retrieving the most similar, frequent case in the training data. Hakenberg ) extracted the shortest link paths on parse trees in events as queries, and also created regular expression-based patterns for regulation events. They grouped similar terms together manually, and applied both queries and patterns to the development and test datasets to detect triggers and arguments.

2.2 Detecting trigger words

To detect trigger words, we use a CRF model (Lafferty ). CRF models are log-linear probabilistic models for predicting sequences, which are widely used in biomedical text-mining as the machine learning model for named entity recognition (Okanohara ; Settles, 2004). The task of detecting trigger words can be performed with a CRF model by converting the task to a sequence prediction problem, in which the trigger sequences are represented with the ‘IOB2’ representation (Sang and Veenstra, 1999). In this representation, the beginning word of a trigger is given the ‘B’ tag. The following words are given the ‘I’ tag. The other words in the sentence are given the ‘O’ tag. The task of the CRF model is then to predict an ‘IOB’ sequence for a given sentence. In this work, the ‘IOB’ tags are combined with the nine different types of biomolecular events defined in the BioNLP'09 shared task data (Available at http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).
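The IOB2 encoding described above can be illustrated with a short sketch. This is not the authors' code: the function name `iob2_tags` and the span format (token offsets with an exclusive end) are assumptions made for illustration only.

```python
def iob2_tags(n_tokens, spans):
    """Encode trigger spans as IOB2 tags.

    spans: list of (start, end, event_type) token offsets, end exclusive.
    (Hypothetical input format, chosen for this sketch.)
    """
    tags = ["O"] * n_tokens                    # words outside any trigger
    for start, end, event_type in spans:
        tags[start] = "B-" + event_type        # first word of the trigger
        for i in range(start + 1, end):
            tags[i] = "I-" + event_type        # following words of the trigger
    return tags


# Phrase taken from Table 1; 'induced' is the trigger.
tokens = ["Fas", "mRNA", "was", "induced", "approximately", "10-fold"]
tags = iob2_tags(len(tokens), [(3, 4, "Positive_regulation")])
```

Combining the B/I prefix with the event type gives one tag per (position, type) pair, which is what lets a single sequence model both locate a trigger and classify its event type.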

2.3 Joint learning

In this work, we propose to use a model that performs joint learning to recognize event triggers and protein names simultaneously. The motivation for our joint learning approach is that the presence of a protein name often indicates the presence of a trigger word in its vicinity. It should be noted that, unlike the shared task setting, we cannot use the information from gold-standard annotations for protein names, because we need to process the whole MEDLINE corpus for FACTA+. The joint CRF model uses three additional tags: ‘B-Protein’, ‘I-Protein’ and ‘Filler’. Table 2 shows an example of an IOB tag sequence for the sentence ‘CD44 activated the transcription factor AP-1’. Note that the trigger word ‘activated’ is followed by a protein name but there is a gap between them. The tags assigned to ‘CD44’, ‘AP’, ‘-’ and ‘1’ are the ones added to recognize protein names. The ‘Filler’ tags are introduced to represent the regions that reside between the protein names and trigger words belonging to the same event. The filler tags enable the CRF model to propagate information from the existence of trigger words to non-adjacent protein names. In other words, the fact that a trigger word is followed by a protein name is captured by two transition features: (i) transition from ‘B-Positive_regulation’ to ‘Filler’ and (ii) transition from ‘Filler’ to ‘B-Protein’.

Table 2.

Joint learning of event triggers and protein names

Word	Tag
CD44	B-Protein
activated	B-Positive_regulation
the	Filler
transcription	Filler
factor	Filler
AP	B-Protein
-	I-Protein
1	I-Protein
.	O
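As a rough sketch of how a tag sequence like the one in Table 2 could be produced from annotations, the code below fills the gap between a trigger and a protein of the same event with ‘Filler’. The function name `joint_tags`, the span format and the explicit `links` argument (which trigger belongs with which protein) are assumptions for illustration; the article does not describe how the training data were actually generated.

```python
def joint_tags(n_tokens, proteins, triggers, links):
    """Build a joint protein/trigger tag sequence with Filler tags.

    proteins: list of (start, end) token spans, end exclusive
    triggers: list of (start, end, event_type) token spans
    links:    list of (trigger_index, protein_index) pairs that belong
              to the same event (hypothetical input for this sketch)
    """
    tags = ["O"] * n_tokens
    for start, end in proteins:
        tags[start] = "B-Protein"
        for i in range(start + 1, end):
            tags[i] = "I-Protein"
    for start, end, event_type in triggers:
        tags[start] = "B-" + event_type
        for i in range(start + 1, end):
            tags[i] = "I-" + event_type
    for t_idx, p_idx in links:
        t_start, t_end, _ = triggers[t_idx]
        p_start, p_end = proteins[p_idx]
        # The gap lies after whichever span comes first in the sentence.
        if t_end <= p_start:
            gap = range(t_end, p_start)
        else:
            gap = range(p_end, t_start)
        for i in gap:
            if tags[i] == "O":                 # never overwrite a real tag
                tags[i] = "Filler"
    return tags


tokens = ["CD44", "activated", "the", "transcription", "factor",
          "AP", "-", "1", "."]
table2 = joint_tags(
    len(tokens),
    proteins=[(0, 1), (5, 8)],
    triggers=[(1, 2, "Positive_regulation")],
    links=[(0, 0), (0, 1)],
)
```

With these inputs the gap between ‘CD44’ and ‘activated’ is empty, so only ‘the transcription factor’ receives Filler tags.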

2.4 Experiments

We present experimental results to evaluate the performance of our joint learning approach. We compare our joint learning approach against two baseline approaches (models). The three CRF models used in the experiments are as follows.

Triggers Only: A model limited to recognizing only trigger words. The training data contains only the annotations on trigger words. Since there are nine different types of events in the data, this model considers 19(=2×9+1) different possible tags for each word.

Joint: A model to recognize protein names and trigger words jointly. However, the training data for this model does not include the Filler tag. This model considers 21(=19+2) different possible tags for each word.

Joint + Filler: A model to recognize protein names and trigger words jointly. The training data also include the Filler tag as described in the previous subsection. This model considers 22(=21+1) different possible tags for each word.

We trained these CRF models using the training data (consisting of 800 MEDLINE abstracts) in the BioNLP'09 shared task corpus, and evaluated them using its development data (consisting of 150 abstracts). The corpus was preprocessed with simple rule-based scripts to perform sentence segmentation and tokenization. The feature templates used in our CRF models are shown in Table 3. The features include word n-grams, substrings and the shape of the current word, and tag transitions.
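The tag counts quoted for the three models (19, 21 and 22) can be reproduced by enumerating the tag sets. In this sketch the nine event types are those listed in Table 4; the underscore spelling of the tag strings and the helper name `tag_set` are assumptions, apart from ‘Positive_regulation’, which appears in the text.

```python
# The nine event types from Table 4 (underscore spelling is an assumption).
EVENT_TYPES = [
    "Gene_expression", "Transcription", "Protein_catabolism",
    "Localization", "Binding", "Phosphorylation",
    "Regulation", "Positive_regulation", "Negative_regulation",
]


def tag_set(event_types, with_protein=False, with_filler=False):
    """Enumerate the tags a model can assign to each word."""
    tags = ["O"]
    for event_type in event_types:
        tags += ["B-" + event_type, "I-" + event_type]
    if with_protein:
        tags += ["B-Protein", "I-Protein"]
    if with_filler:
        tags.append("Filler")
    return tags


triggers_only = tag_set(EVENT_TYPES)
joint = tag_set(EVENT_TYPES, with_protein=True)
joint_filler = tag_set(EVENT_TYPES, with_protein=True, with_filler=True)
```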

Table 3.

Feature templates used in the CRF tagger

Word unigram	w_i−5, w_i−4, w_i−3, w_i−2, w_i−1, w_i, w_i+1, w_i+2, w_i+3, w_i+4, w_i+5	& y_i
Word bigram	w_i−1w_i, w_iw_i+1	& y_i
Word trigram	w_i−1w_iw_i+1	& y_i
Substrings	substrings of w_i (up to length 10)	& y_i
Word shape	S(w_i)	& y_i
Tag bigram	True	& y_i−1y_i

w is the current word. y is the current tag. Word shape S(w) is produced by converting capital letters into ‘A’, small letters into ‘a’ and numerals into ‘#’.
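The word shape and substring features can be sketched as follows. Two details are assumptions, since the table note does not specify them: characters that are neither letters nor digits are left unchanged in the shape, and ‘substrings’ is read as all contiguous substrings of at most 10 characters.

```python
def word_shape(word):
    """Map capitals to 'A', small letters to 'a' and numerals to '#'.

    Other characters (e.g. hyphens) are kept as they are; that choice
    is an assumption of this sketch.
    """
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("#")
        else:
            shape.append(ch)
    return "".join(shape)


def substrings(word, max_len=10):
    """All contiguous substrings of `word` with length 1..max_len."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, min(len(word), i + max_len) + 1)
    }
```

For example, `word_shape("CD44")` gives `"AA##"`, so protein-like tokens with different surface forms share a feature.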

Table 4 shows the results. The first nine rows in the table correspond to the nine types of biomolecular events defined in the corpus, and the bottom row shows the micro-average of the scores. Our proposed approach (i.e. the ‘Joint + Filler’ model) significantly outperforms the ‘Triggers Only’ model. This shows that the contextual information from the protein names is useful in detecting trigger words. It should also be noted that the performance of the Joint model without the filler tag is worse than the ‘Joint + Filler’ model, suggesting that it is important to explicitly transfer the information on the neighbouring tags in a CRF model.

Table 4.

Accuracy of trigger detection

	Triggers Only			Joint			Joint + Filler
	Precision	Recall	F-score	Precision	Recall	F-score	Precision	Recall	F-score
Gene expression	70.9	60.8	65.5	74.9	57.4	65.0	77.9	66.4	71.7
Transcription	66.7	39.4	49.5	62.5	37.9	47.2	67.5	40.9	50.9
Protein catabolism	93.8	79.0	85.7	93.8	79.0	85.7	93.8	79.0	85.7
Localization	86.4	47.5	61.3	82.8	60.0	69.6	85.2	57.5	68.7
Binding	64.0	26.7	37.6	67.5	31.1	42.6	72.8	32.8	45.2
Phosphorylation	68.6	63.2	65.8	75.8	65.8	70.4	76.7	60.5	67.7
Regulation	57.8	19.0	28.6	54.5	21.9	31.2	50.0	13.9	21.7
Positive regulation	64.5	33.6	44.1	62.0	33.8	43.7	65.2	35.4	45.8
Negative regulation	61.3	30.7	40.9	58.2	30.7	40.2	61.8	28.0	38.5

Micro Average	67.2	38.4	48.9	67.1	39.1	49.4	70.5	40.4	51.4
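The F-scores in Table 4 appear to be the balanced harmonic mean of precision and recall; the article does not state the formula, so this is an assumption, but it reproduces the rows checked below. Because the published precision and recall are themselves rounded, other rows could differ in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 4.
micro_triggers_only = round(f_score(67.2, 38.4), 1)
micro_joint = round(f_score(67.1, 39.1), 1)
micro_joint_filler = round(f_score(70.5, 40.4), 1)
gene_expr_joint_filler = round(f_score(77.9, 66.4), 1)
```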

Our approach consistently improved the performance for detecting event triggers, but the performance of detecting binding and regulation events was not very high. This is because these events can take multiple arguments, and also because regulation events can take other events as arguments. Rich linguistic information is required to deal with such event structures, and such triggers are not our current focus. Note that the performance figures presented in this table are not comparable to those reported for the BioNLP shared task, since we did not use the gold standard information on the gene/protein names: our purpose was to evaluate the accuracy of trigger detection in a real-world setting where no gold standard annotation for gene/protein names is available. The machine learning model described above (i.e. ‘Joint + Filler’) was applied to the whole MEDLINE corpus containing 20 033 079 articles, and the recognized events are indexed by FACTA+ so that it can accept queries including biomolecular events. The number of articles indexed for each event type ranged from 53 262 (Protein catabolism) to 1 537 441 (Regulation). We have also carried out a small-scale experiment to assess the quality of this indexing for the whole MEDLINE. We asked a bioNLP researcher with a biology background to check the 10 latest articles returned by FACTA+ for each event class to see whether they are really relevant to the target class. In other words, the abstract-level precision was manually evaluated for each event class. The result was that 86 out of the 90 abstracts were actually relevant to the target event class. Although the recall of this event indexing is not completely clear, the precision is probably good enough to be used in practice.

3 DISCOVERING INDIRECT ASSOCIATIONS

A common approach to automatic discovery of useful hypotheses is to combine two (or more) known associations, i.e. if concept X is associated with concept Y, and concept Y is associated with concept Z, then the potential association between X and Z is considered as a useful hypothesis unless there is already a known association between X and Z. This approach is often called Swanson's ABC model approach after his seminal work on literature-based hypothesis generation (Swanson, 1990). Figure 1 illustrates this approach in the context of implementing it on FACTA+, where the user provides a starting concept as a query to the system. We call the concepts that are directly associated with the query the pivot concepts, and the concepts that are indirectly associated with the query through those pivot concepts the target concepts.

Fig. 1.

Finding indirectly associated concepts.
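The two-step idea can be sketched with a toy association graph. This is a minimal illustration of the ABC-style candidate generation described above, not the FACTA+ implementation; the function name and the dictionary-of-sets input are assumptions, and the concept labels are placeholders.

```python
def indirect_candidates(query, direct):
    """Return {target Z: set of pivots Y} reachable in two steps from query.

    direct: dict mapping each concept to the set of concepts it is
            directly associated with (hypothetical toy input).
    Targets already directly associated with the query are skipped,
    following the 'unless there is already a known association' rule.
    """
    pivots = direct.get(query, set())
    targets = {}
    for y in pivots:
        for z in direct.get(y, set()):
            if z == query or z in pivots:
                continue                      # already known or trivial
            targets.setdefault(z, set()).add(y)
    return targets


direct = {
    "X": {"Y1", "Y2"},
    "Y1": {"X", "Z1", "Y2"},
    "Y2": {"X", "Z1", "Z2"},
}
found = indirect_candidates("X", direct)
```

Here Z1 is supported by two pivot concepts and Z2 by one, which is the kind of evidence a ranking function can then weigh.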

3.1 Related work

There are a number of publicly available software tools that offer the functionality for discovering indirect associations. Anni 2.0 (Jelier ) is a Java client–server application which provides an ontology-based interface to MEDLINE. It can find concepts that have many intermediate concepts in common, thereby allowing the user to discover concepts that do not directly co-occur with the starting concept. Unlike in FACTA+, the starting concept is defined as a combination of predefined concepts provided by the system, i.e. free keywords cannot be used to define a concept. BITOLA (Hristovski ) is a web application which allows the user to retrieve target concepts using MeSH terms as pivot concepts. It can also incorporate biomedical knowledge (e.g. chromosomal location) to improve the precision of the output. BITOLA requires the user to specify each pivot concept manually. In other words, the pivot concepts that are not selected by the user are not used for computing the association strengths between the query concept and the target concepts, whereas FACTA+ retrieves target concepts by considering all potential pivot concepts of a specified class. Arrowsmith (Smalheiser ) is perhaps a better-known tool for literature-based hypothesis generation, but it is designed to find concepts or terms that interlink two distinct concepts; in our terminology, it is closer to finding pivot concepts given a query and a target concept. CoPub Discovery (Frijters ) is a web application based on a co-occurrence database in CoPub (Frijters ). It allows the user to discover not only target concepts by considering all potential pivot concepts of a specified class, but also pivot concepts given a query and a target concept.
It employs two concept classes, Gene and BioConcept, for pivot concepts, and uses regular expressions to search for keywords in MEDLINE and to link a query to the starting concept, while FACTA+ employs six concept classes, event-based detection and flexible keyword matching. CoPub Discovery shows the ranking of the target concepts, whereas FACTA+ incorporates a visualization for the indirect associations. As for the scoring scheme for ranking indirect associations, Yetisgen-Yildiz and Pratt (2009) give a comprehensive overview of existing approaches and compare the performance of four different criteria (Association Rules, TF-IDF, Z-score and Mutual Information Measure) using a cut-off date technique.

3.2 Interface

Like many text search engines, FACTA+ accepts arbitrary keywords, concept identifiers or their boolean combination as a query to specify the starting concept (i.e. concept X in Fig. 1). The system first retrieves pivot concepts using co-occurrence statistics from the literature, and then produces target concepts that are scored and ranked in accordance with their association strengths with the query and pivot concepts. One of the distinct features of our system is that it achieves real-time responses in most cases while allowing the user to use a very flexible query as the input to the system. Currently, FACTA+ accepts the following six classes of biomedical concepts as pivot and target concepts: human genes/proteins, diseases, symptoms, drugs, enzymes and chemical compounds. The user can choose one of these classes as pivot and target concepts when performing a search. As an example of indirect associations, the search result for the input query ‘E-cadherin and GENIA:Negative_regulation’ returned by FACTA+ is shown in Figure 2. The first row in the table shows that the query concept is indirectly associated with a disease ‘acute respiratory failure’ through multiple genes including tumour suppressor candidate 3. The visualized version of this search result is shown in Figure 4.

Fig. 2.

A screen-shot of FACTA+ search results for indirect associations. The links and icons in the table give the user a quick access to the textual evidence (snippets) of the associations.

Fig. 4.

Visualization of indirectly associated concepts using treemapping and links.

E-cadherin is a cell adhesion molecule involved with the binding between a cell and other cells or extracellular matrix. The search results shown in Figures 2 and 4 indicate that E-cadherin is associated with multiple nervous system disorders (e.g. Alzheimer's disease, Parkinson's disease, epilepsy) via several proteins/genes, even though E-cadherin itself rarely appears with such disorders (see also Fig. 3 for direct associations). This indirect search result suggests that E-cadherin could be a candidate drug target for nervous system disorders.

Fig. 3.

Visualization of directly associated concepts using treemapping.

3.3 Ranking

Since the number of indirectly associated concepts can be vast, it is important to rank them properly when they are presented to the user. On the one hand, the ranking criterion should favour ‘hidden’ associations, because the associations that can be easily observed in the existing literature are not interesting to users who are seeking novel associations. In fact, such ‘known’ associations can be browsed by using the existing functionality of FACTA. On the other hand, the indirect associations output by the system need to be plausible. To incorporate these two factors, FACTA+ defines a ranking score for each target concept Z as follows:

\[ \mathrm{Score}(Z) = -R(X,Z) \log D(X,Z) \]

where D(X,Z) is the strength of direct association between concept X and Z, and R(X,Z) is the reliability of the indirect association between the two. Notice that this score can translate into the expected amount of information (in the information theoretic sense) if R(X,Z) and D(X,Z) are given as probability values. The term −logD(X,Z) takes on a large value if the strength of association between X and Z is weak. In other words, this term represents how hidden or surprising the association is. If we assume that D(·,·) are given as probabilities, and that all associations connecting X with Z are independent, the reliability term can be computed as follows:

\[ R(X,Z) = 1 - \prod_{Y} \bigl(1 - D(X,Y)\, D(Y,Z)\bigr) \]

where the product runs over the pivot concepts Y, since the probability that the connection between X and Z is true is given by the probability that there is at least one true path connecting X with Z. Now the remaining question is how D(·,·) are computed as probabilities. In FACTA+, we approximate them using conditional probabilities:

\[ D(V,W) = \max\bigl(P(V \mid W),\, P(W \mid V)\bigr) \]

where P(V|W) is the (smoothed) conditional probability of the occurrence of concept V given that of concept W within the same document. P(W|V) is defined likewise. We take the maximum of the conditional probabilities of both directions to avoid missing any association detectable from direct co-occurrence statistics.
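A small numerical sketch may help make the ranking concrete. It follows one reading of the description above: the score is the reliability multiplied by the surprisal −log D(X,Z), a path through pivot Y is true with probability D(X,Y)·D(Y,Z), and paths are independent. The function names, the use of the natural logarithm and the example numbers are assumptions for illustration, not values from FACTA+.

```python
import math


def direct_strength(p_v_given_w, p_w_given_v):
    """D(V, W): the larger of the two (smoothed) conditional probabilities."""
    return max(p_v_given_w, p_w_given_v)


def reliability(path_strengths):
    """R(X, Z): probability that at least one two-step path is true.

    path_strengths: list of (D(X,Y), D(Y,Z)) pairs, one per pivot Y.
    """
    prob_no_true_path = 1.0
    for d_xy, d_yz in path_strengths:
        prob_no_true_path *= 1.0 - d_xy * d_yz
    return 1.0 - prob_no_true_path


def ranking_score(d_xz, path_strengths):
    """Expected information: reliability times surprisal of the direct link.

    d_xz must be positive, which smoothing is assumed to guarantee.
    """
    return reliability(path_strengths) * -math.log(d_xz)


# Toy numbers (made up): two pivots, each path true with probability 0.25.
paths = [(0.5, 0.5), (0.5, 0.5)]
r = reliability(paths)                 # 1 - 0.75 * 0.75
hidden = ranking_score(0.01, paths)    # weak direct link, so more surprising
known = ranking_score(0.5, paths)      # strong direct link, so less surprising
```

With the same supporting paths, the target whose direct association with the query is weaker receives the higher score, which is the behaviour the text asks of the ranking.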

3.4 Indexing

To achieve real-time responses, we pre-compute the association statistics between all predefined concepts (i.e. between all possible Ys and Zs) and store them in memory. This may sound prohibitively expensive, but it is possible because we need to store only the information about the pairs of concepts that actually co-occur in at least one document. The number of pairwise associations indexed for achieving real-time discovery of indirect associations was 49 620 438. The indexing for the predefined concepts was performed by dictionary matching. More specifically, we employed the longest matching method using six different dictionaries built for the aforementioned six classes of concepts. The number of unique concepts indexed for the whole MEDLINE was 107 060. One of the major difficulties in the indexing process was the problem of semantic ambiguity of the terms. We used a simple rule-based method to alleviate the problem of acronym ambiguity, but the ambiguity problems with terms like protein family names are difficult to solve and are left for future work.
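Longest matching against a dictionary can be sketched as below. This is a generic greedy left-to-right matcher, not the FACTA+ indexer: the function name, the lowercasing, the `max_len` window and the concept identifiers `D1` and `D2` are all hypothetical.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Greedy longest-match lookup of dictionary terms in a token list.

    dictionary: dict mapping a tuple of lowercased tokens to a concept id
                (hypothetical format for this sketch).
    Returns a list of (start, end, concept_id), end exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                found = (i, i + n, dictionary[key])
                break
        if found:
            matches.append(found)
            i = found[1]          # continue after the matched term
        else:
            i += 1
    return matches


disease_dict = {
    ("respiratory", "failure"): "D1",
    ("acute", "respiratory", "failure"): "D2",
}
hits = longest_match("Acute respiratory failure was reported".split(),
                     disease_dict)
```

Preferring the longest term means ‘acute respiratory failure’ is indexed as one concept rather than as the shorter ‘respiratory failure’ nested inside it.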

4 VISUALIZATION

A ranked list of found concepts is the usual way of presenting results, and it is used in FACTA+ and in other services such as Arrowsmith. Lists are useful for displaying the details of the extracted results, but they are not well suited to grasping the big picture, since the ranking scores are not intuitive to use. We therefore developed a visualization component to make the found concepts easy to understand. The primary technique used in this component is treemapping (Shneiderman, 2009), a method for visualizing hierarchical data by using a set of rectangles. Figures 3 and 4 show the visualization of directly and indirectly associated concepts for the query ‘E-cadherin and GENIA:Negative_regulation’. In these visualizations, each rectangle represents an individual concept, and its area is proportional to the score given to the concept. The component is built using Adobe Flash because it enables rich graphical expressions suitable for visualization like gradient fill and smooth animation.
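The core property of a treemap, area proportional to score, can be shown with a two-level ‘slice-and-dice’ layout. This is a deliberately simple stand-in, not the layout used by the FACTA+ Flash component; the function names, the category and concept labels and the scores are made up for illustration.

```python
def slice_layout(scores, x, y, width, height, horizontal=True):
    """Split one rectangle into strips whose areas are proportional to scores.

    scores: dict mapping a name to a positive score.
    Returns {name: (x, y, w, h)}.
    """
    total = sum(scores.values())
    rects = {}
    offset = 0.0
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        frac = score / total
        if horizontal:
            rects[name] = (x + offset, y, width * frac, height)
            offset += width * frac
        else:
            rects[name] = (x, y + offset, width, height * frac)
            offset += height * frac
    return rects


def treemap(categories, width, height):
    """Two-level layout: one parent strip per category, children inside."""
    totals = {cat: sum(s.values()) for cat, s in categories.items()}
    parents = slice_layout(totals, 0.0, 0.0, width, height, horizontal=True)
    layout = {}
    for cat, (px, py, pw, ph) in parents.items():
        layout[cat] = slice_layout(categories[cat], px, py, pw, ph,
                                   horizontal=False)
    return layout


example = treemap(
    {"Disease": {"d1": 3.0, "d2": 1.0}, "Gene": {"g1": 4.0}},
    width=8.0, height=4.0,
)
```

Slice-and-dice can produce long thin strips when scores are very uneven; layouts that keep aspect ratios closer to square trade this simplicity for readability.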

4.1 Related work

One of the most well-known sites which use treemapping is Newsmap (http://newsmap.jp/). Newsmap visualizes popular Google News stories. Stories are represented as rectangles and they are categorized into countries or topics. Users can browse the full text of the story by clicking a rectangle.

4.2 Visualizing direct and indirect associations

FACTA+ has two visualization components: one for presenting directly associated concepts (e.g. Fig. 3) and the other for indirectly associated concepts (e.g. Fig. 4). The functionality and usage of these visualization components are explained in this subsection. The concepts directly associated with a user query are visualized as rectangles. The concepts are grouped into six categories (human genes/proteins, diseases, symptoms, drugs, enzymes and chemical compounds) and each category forms a parent rectangle. The number of concepts shown is limited at first, but more results become visible by using a pager control. Users can easily grasp the relative importance of each concept from the size of its rectangle. The rectangles are arranged to maintain similar aspect ratios so that they remain visually recognizable. Users can also focus on a particular set of categories by using check boxes. The layout of rectangles is recalculated instantaneously. This is done using a smooth animation so that the change is visually traceable. The method for visualizing indirectly associated concepts is slightly different from the one for directly associated concepts. Pivot concepts that co-occur with the query are shown on the left-hand side, and target concepts that co-occur with the pivot concepts are shown on the right-hand side. In addition to the treemapping, we introduced a ‘link’ visualization between pivot concepts and target concepts. Links represent co-occurrences between pivot and target concepts, and the width of each link indicates the strength of its association. When users point the mouse cursor at a particular pivot concept, links from the concept to its corresponding target concepts appear. Similarly, when users point at a target concept, links from the concept to its corresponding pivot concepts appear. In both the direct and indirect views, users can browse the underlying documents by clicking a rectangle or a link and selecting ‘view documents’.

5 CONCLUSION

We have presented three extensions that have recently been introduced into FACTA+, a text search engine for MEDLINE. First, we have proposed a joint learning approach to detecting biomolecular events described in text. The performance of trigger word detection has been significantly improved by performing the task jointly with protein name recognition. Second, we have presented a method for detecting indirect associations. The associations are ranked by their level of novelty and reliability, which are estimated by combining the strengths of multiple known associations that are directly observable from co-occurrence statistics in the literature. Third, we have implemented a novel visualization component that provides an intuitive overview of the discovered concepts and their associations. Each concept is represented by a coloured rectangle; the colour shows the category and the area indicates the importance. Each association is displayed as a link whose width indicates its importance. The three classes of functionality described in this article are implemented in FACTA+. The system accepts concept identifiers, arbitrary keywords and their boolean combinations as a query and immediately produces a ranked list (or its visualization) of concepts that are indirectly associated with the query. FACTA+ is implemented in C++ and currently runs on a single Linux server with 32 GB of memory. The service is available at http://refine1-nactem.mc.man.ac.uk/facta/, and the visualization system is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/.


