Peggy Cellier, Thierry Charnois, Marc Plantevit, Christophe Rigotti, Bruno Crémilleux, Olivier Gandrillon, Jiří Kléma, Jean-Luc Manguin.
Abstract
BACKGROUND: Discovering gene interactions and their characterizations from biological text collections is a crucial issue in bioinformatics. These collections are large, and it is very difficult for biologists to take full advantage of the knowledge they contain. Natural Language Processing (NLP) methods have been applied to extract background knowledge from biomedical texts. Some existing NLP approaches rely on handcrafted rules; they are therefore time-consuming to build and often tied to a specific corpus. Machine-learning-based NLP methods give good results but produce outcomes that are not readily understandable by a user.
Keywords: Data mining; Gene interactions; Information extraction; Natural language processing; Sequential pattern mining
Year: 2015 PMID: 25992265 PMCID: PMC4436157 DOI: 10.1186/s13326-015-0023-3
Source DB: PubMed Journal: J Biomed Semantics
Figure 1 General framework to extract gene interactions. Figure 1 presents the overall process to detect and characterize gene interactions. There are two steps. The first step is the extraction of patterns. Sequential patterns are mined from a learning corpus that contains sentences representing gene interactions. To reduce the number of extracted patterns, constraints and recursive mining are applied. In the end, only a few sequential patterns, corresponding to candidate linguistic interaction or characterization rules, remain. A key point is that the sequence of words expressing the interaction in a pattern is automatically discovered. As an example, the sequence of words
Example of information for a sequential pattern
| PubMed ID | Gene 1 | Gene 2 | Oracle verdict | Sentence |
|---|---|---|---|---|
| 10204582 | SHC1 | CRKL | not in BioGRID | These results suggest a fundamental role for the tyrosine phosphorylation of Cbl, CrkL, SLP-76, and […] of Cbl and Nck in Fc gammaRI signaling in human macrophages. |
| 10204582 | CBL | CRKL | in BioGRID | PP1, a specific inhibitor of Src kinases, inhibited the Fc gammaRI-induced respiratory burst, as well as the tyrosine phosphorylation of […] |
Table 5 gives 2 interactions (highlighted in bold in the table) detected by the sequential pattern: “AGENE association with AGENE”.
The columns give the PubMed ID of the paper, the two genes that interact, the verdict of the oracle, and the sentence in which the interaction is recognized. For instance, the pattern detects an interaction between genes SHC1 and CRKL in paper 10204582, because it matches a sentence of the paper's abstract. According to BioGRID, however, there is no interaction between SHC1 and CRKL; the interaction discovered in paper 10204582 is therefore unexpected, because it is not in BioGRID, and interesting. The pattern also detects an interaction between genes CBL and CRKL in this paper, and BioGRID does indeed list an interaction between CBL and CRKL mentioned in paper 10204582.
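The way a pattern such as “AGENE association with AGENE” is applied and then checked against the oracle can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that gene names have already been replaced by the placeholder `AGENE`, that a pattern matches when its items occur in order with arbitrary gaps, and that the oracle is a plain set of gene pairs. The function names and the example sentence are hypothetical.

```python
from typing import Sequence


def matches(pattern: Sequence[str], tokens: Sequence[str]) -> bool:
    """True if every pattern item occurs in `tokens`, in order (gaps allowed)."""
    remaining = iter(tokens)
    # `item in remaining` advances the iterator, so the order is enforced.
    return all(item in remaining for item in pattern)


# Toy oracle limited to the pair that Table 5 reports as "in BioGRID".
ORACLE = {frozenset({"CBL", "CRKL"})}


def verdict(gene_a: str, gene_b: str) -> str:
    """Oracle verdict for an unordered gene pair."""
    pair = frozenset({gene_a, gene_b})
    return "in BioGRID" if pair in ORACLE else "not in BioGRID"


PATTERN = ["AGENE", "association", "with", "AGENE"]
# Hypothetical, already-normalised sentence (not taken from the corpus).
tokens = "the AGENE association with AGENE was observed".split()

print(matches(PATTERN, tokens))
print(verdict("SHC1", "CRKL"))
print(verdict("CBL", "CRKL"))
```

Storing pairs as `frozenset` makes the lookup independent of the order in which the two genes are mentioned.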
Example of a sequence database
| Sequence ID | Sequence |
|---|---|
| ... | ... |
| S1 | 〈 […] |
| S2 | 〈 […] |
| ... | ... |
Table 1 shows an excerpt of a sequence database which contains two interaction sentences:
S1: “Recent studies have suggested that c-myc may be vital for regulation of hTERT mRNA expression and telomerase activity.” and
S2: “Injection of frpHE mRNA in Xenopus embryos inhibited the Wnt-8 mediated dorsal axis duplication.”.
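As a rough illustration of what a sequence database is and how the support of a candidate pattern could be counted over it, the sketch below turns S1 and S2 into sequences of items. The tokenisation (lowercased words, no part-of-speech tags or gene normalisation) and the helper names `to_sequence` and `support` are assumptions made for this example; the representation used in the paper may differ.

```python
import re

S1 = ("Recent studies have suggested that c-myc may be vital for regulation "
      "of hTERT mRNA expression and telomerase activity.")
S2 = ("Injection of frpHE mRNA in Xenopus embryos inhibited the Wnt-8 "
      "mediated dorsal axis duplication.")


def to_sequence(sentence: str) -> list[str]:
    """Simplified tokenisation: lowercased runs of letters, digits and hyphens."""
    return re.findall(r"[a-z0-9-]+", sentence.lower())


def support(pattern: list[str], database: dict[str, list[str]]) -> int:
    """Number of sequences that contain `pattern` as an ordered subsequence."""
    def contains(sequence: list[str]) -> bool:
        remaining = iter(sequence)
        return all(item in remaining for item in pattern)

    return sum(contains(sequence) for sequence in database.values())


database = {"S1": to_sequence(S1), "S2": to_sequence(S2)}

print(support(["of", "mrna"], database))          # pattern present in S1 and S2
print(support(["mrna", "expression"], database))  # pattern present in S1 only
```

A pattern is usually kept only if its support reaches a chosen minimum threshold; the threshold itself is not shown here.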
Figure 2 Taxonomy for characterization patterns. Figure 2 describes the taxonomy used to classify the extracted sequential patterns.
Results of the application of the extracted patterns
| Corpus | Sentences | Recall | Precision | F-score | F-score range in [11] |
|---|---|---|---|---|---|
| | 1955 | 78.6 | 35.6 | | [34.7,41.5] |
| | 1100 | 46.5 | 25.3 | 32.8 | [15.9,40.6] |
| | 200 | 75.0 | 83.0 | 78.7 | - |
| | 145 | 66.8 | 46.7 | 55.0 | [38.3,69.8] |
Table 2 lists the four testing corpora used to evaluate the proposed approach, together with the evaluation results. The columns give the name of the corpus, the number of sentences in the corpus, and the recall, precision and f-score of the proposed approach on that corpus. The last column gives the range of the f-scores reported in [11], which were also obtained with cross-corpus validation.
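The relation between the recall, precision and f-score columns can be checked with a few lines of code. The sketch assumes that the f-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall, which the text does not state explicitly. Because the table reports rounded percentages, a recomputed value may differ from the printed one in the last digit.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall), in percent."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported f-score) for the rows of Table 2
# in which all three values are present.
rows = [
    (46.5, 25.3, 32.8),
    (75.0, 83.0, 78.7),
    (66.8, 46.7, 55.0),
]

for recall, precision, reported in rows:
    print(f"recomputed {f_score(recall, precision):.2f}, reported {reported}")
```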
Details of information presented in paper [11]
| | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | - | 41.5 | - | 34.7 | - | 40.3 | - | 39.6 | - | 37.9 |
| | 40.6 | - | 24.3 | - | 24.8 | - | 15.9 | - | 22.5 | - |
| | 59.0 | 61.8 | 43.2 | 51.3 | 51.0 | 69.8 | 38.3 | 62.4 | 61.6 | 62.1 |
The acronyms used in this table are the ones used in paper [11]: SL: Shallow linguistic kernel; SpT: Spectrum tree kernel; kBSPS: k-band shortest path spectrum kernel; edit: Edit distance kernel; APG: All-paths graph kernel. See paper [11] for more details.
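The f-score ranges quoted in the last column of Table 2 appear to be the minimum and maximum of each row of this table. The sketch below recomputes them under that assumption; a missing value ("-") is encoded as `None`, and the rows are identified only by their position because no labels are used here.

```python
from typing import Optional


def f_score_range(row: list[Optional[float]]) -> tuple[float, float]:
    """Smallest and largest f-score in a row, ignoring missing entries."""
    values = [value for value in row if value is not None]
    return min(values), max(values)


# The three data rows of the table, in the order in which they appear.
rows = [
    [None, 41.5, None, 34.7, None, 40.3, None, 39.6, None, 37.9],
    [40.6, None, 24.3, None, 24.8, None, 15.9, None, 22.5, None],
    [59.0, 61.8, 43.2, 51.3, 51.0, 69.8, 38.3, 62.4, 61.6, 62.1],
]

for row in rows:
    low, high = f_score_range(row)
    print(f"[{low},{high}]")
```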
Examples of information about the application of information extraction rules
| Sequential pattern | Interactions detected | Detected interactions in BioGRID |
|---|---|---|
| […] response@nn | 6 | 1 |
| | 3 | 3 |
| | 0 | undefined |
| […] with@in […] | 6 | 4 |
| | 8 | 5 |
Table 4 gives an excerpt of the information provided about patterns extracted from the PubMed corpus. The columns give the sequential pattern, the number of interactions detected by the pattern, and the number of detected interactions that are correct with respect to the oracle, i.e. interactions that also exist in BioGRID.
The first pattern can be read as “a gene followed by a gene then by the word the and the word response”. This pattern detects 6 interactions and 1 is in BioGRID. The second pattern can be read as “a gene followed by a gene then by the word serine”. It detects 3 interactions that are all in BioGRID. The third pattern can be read as “a gene followed by the verb reveal in past tense, then by a gene”. This pattern does not detect interactions in the rule validation corpus, thus no information is provided to evaluate it. The fourth pattern can be read as “a gene followed by the noun association, then by the word with and a gene name”. It detects 6 interactions out of which 4 are in BioGRID. The fifth pattern can be read as “a gene followed by the verb bind in present tense, then by the word to and a gene name”. This pattern detects 8 interactions and 5 of them are in BioGRID. For example, it detects that the following complex sentence “Cbl is a cytosolic protein that is rapidly tyrosine phosphorylated in response to Fc receptor activation and binds to the adaptor proteins Grb2, CrkL, and Nck.” contains an association between two signalling molecules (Cbl and Grb2).
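One way to compare the five patterns is the fraction of their detections that the oracle confirms. The sketch below derives this figure from the counts given above; the measure and the function name `confirmed_fraction` are introduced here for illustration and are not taken from the paper. A pattern with no detection gets `None`, mirroring the "undefined" entry in the table.

```python
from typing import Optional


def confirmed_fraction(detected: int, in_biogrid: Optional[int]) -> Optional[float]:
    """Share of detected interactions that the oracle confirms, or None if undefined."""
    if detected == 0 or in_biogrid is None:
        return None
    return in_biogrid / detected


# (interactions detected, detected interactions in BioGRID) for patterns 1 to 5.
counts = [(6, 1), (3, 3), (0, None), (6, 4), (8, 5)]

for number, (detected, in_biogrid) in enumerate(counts, start=1):
    fraction = confirmed_fraction(detected, in_biogrid)
    shown = "undefined" if fraction is None else f"{fraction:.2f}"
    print(f"pattern {number}: {shown}")
```

A low fraction does not by itself mean that a pattern is poor: as in the SHC1 and CRKL example, an interaction absent from BioGRID may still be genuinely stated in the text.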