Literature DB >> 23055626

A text mining approach to detect mentions of protein glycosylation in biomedical text.

Abstract

Protein Glycosylation is an important post translational event that plays a pivotal role in protein folding and protein is trafficking. We describe a dictionary based and a rule based approach to mine 'mentions' of protein glycosylation in text. The dictionary based approach relies on a set of manually curated dictionaries specially constructed to address this task. Abstracts are then screened for the 'mentions' of words from these dictionaries which are further scored followed by classification on the basis of a threshold. The rule based approaches also relies on the words in the dictionary to arrive at the features which are used for classification. The performance of the system using both the approaches has been evaluated using a manually curated corpus of 3133 abstracts. The evaluation suggests that the performance of the Rule based approach supersedes that of the Dictionary based approach.

Entities: Chemical Disease Gene Species

Keywords: Dictionary -based approach; Glycosylation; Rule-based approach; Text mining

Year: 2012 PMID： 23055626 PMCID： PMC3449393 DOI： 10.6026/97320630008758

Source DB: PubMed Journal: Bioinformation ISSN： 0973-2063

Background

Protein glycosylation is the most common post⁷ translational modification of proteins. It is a complex process involving many functional proteins resulting in a great diversity of carbohydrate–protein bonds and glycan structures. Glycosylation plays a key role in biological processes and is linked to several molecular and genetic disorders. Glycosylation of some proteins has a great impact on their structures and functions resulting in modulation of many important biological processes. Alterations in glycosylation occur in many pathological states, and genetically determined defects in glycosylation are the reason of severe diseases. Thus, literature on protein glycosylation has been of interest to several researchers. However, the ever increasing amount of scientific literature and biological data calls for tools and methods to make this information available from text into more computable forms. Existing approaches to annotate the data in biological databases rely heavily on expert human curation [1]. Given the growing volume of literature and new high-throughput methods, it is becoming necessary to provide tools that can reduce time and cost of curation, increase consistency of annotation, and provide the linkages to supporting evidence in the literature that make the annotations useful to researchers. Several dictionary and rule based approaches have been designed in the past decade to address similar needs. Dictionary based approaches have been extensively used in text mining systems. Oscar is an open source system for recognizing chemical entity ‘mentions‘; it integrates a dictionary of compound names, as well as using regular expressions, heuristics, and certain word combinations to find chemical names in text. Several recent works include dictionary look up: PLAN2L, a web-based online search system that integrates text mining and information extraction techniques [2]. Enzyme databases such as FRENDA and AMENDA are additional databases created by continuously improved text-mining procedures and employ the usage of dictionaries [3]. MGI scans full text articles and performs entity recognition for mouse gene ‘mentions‘ based on a dictionary of mouse genes and human orthologs [4]. Also, several teams at the Biocreative task [5] have also employed dictionary based approaches for term identification Rule based approaches are frequently used for text mining tasks. One of the most successful rules-based approaches to gene and protein NER (Named Entity Recognition) in biomedical texts has been the AbGene system of Tanabe and Wilbur. ProMiner [6] is a dictionary- and rule-based system that applies sophisticated algorithms for recognizing complex, multi-word named entities in abstracts and full text articles. Other methods include statistical and machine learning based methods such as Support Vector Machines, co-occurrence based approaches and C-value method. With this work we describe a text mining approach to categorize text that has ‘mentions‘ of protein glycosylation from those that do not. The work presented here employs both a dictionary based as well as a rule based approach to detect abstracts having ‘mentions‘ of glycosylation. The dictionary based approach was designed keeping in mind that certain concepts can be expressed using different type of grammar. However, the choice of usage of terms related to glycosylation is restricted to a limited set of words and their synonyms. The dictionary based approach tries to cover the scope of these words that one can use if possibly speaking of or referring to protein glycosylation. The abstracts were screened for ‘mentions‘ of these terms which have been categorized into 10 different dictionaries and then scored using several scoring schemes. For each scoring scheme a weighting methodology was used so that a different confidence is associated with each type of word as all the terms in the dictionary are not equally important with respect to glycosylation. Also these terms might be associated with text that is not related to glycosylation or merely as a passing reference. For example an abstract may speak of homology modeling of mucin (a glycosylated protein which is a part of the dictionaries constructed). However the text may be irrelevant with respect to glycosylation. This calls for the development of a robust scoring and thresholding scheme to distinguish such text from the relevant one. The rule based approach makes use of the J48 module available at the WEKA data mining tool [7]. J48 is an implementation of the C4.5 algorithm for generating a pruned or unpruned decision trees. A set of words having high information content were selected for framing these rules. Several combinations were tried to arrive at a set of rules with the best coverage and support. The performances of both the approaches, as well as variations within the approaches have been evaluated using a manually curated corpus.

Methodology

A summary of the methodology incorporated has been diagrammatically represented (Figure 1).

Figure 1

Summary of Methodology

Preparation of the Corpora:

Abstracts were fetched from PubMed by querying for the terms associated with protein glycosylation and subjected to the following process: Two datasets were prepared in this fashion. Corpus 1 which comprises of 300 positive & 300 negative abstracts was used in the initial studies to check for the ability of the approach to categorize the dataset. In order to validate the results obtained the dataset was scaled up to contain 1600 positive and 1533 negative abstract Dictionary based approach Abstracts were annotated as positive if they contained information on the glycosylation process along with at least a ‘mention’ of the glycoprotein, glycosylating enzyme or the site of glycosylation; Abstracts having some ‘mentions‘ from the dictionaries however not implying glycosylation were selected as the negative dataset.

Preparation of Dictionaries:

10 dictionaries containing several variations, alternative names and synonyms for words which are relevant to Protein glycosylation were constructed: Carbohydrate names; Amino acid common names, one letter code and three letter codes and chemical formula; Gene Ontology Terms; Pathway Names; Names and synonyms of enzymes that glycosylate proteins from several resources such as SwissProt, TrEMBL, BRENDA, CAZY and KEGG BRITE databases; Gene Names and synonyms of enzymes that glycosylate proteins obtained from the alternative names and recommended gene names section of the SwissProt database; E.C numbers of enzymes those glycosylate proteins; Names and synonyms of proteins that are observed/reported to have undergone glycosylation from the UniProt database, OGLYCBASE and MESH database; Gene Names and synonyms of proteins that are observed/reported to have undergone glycosylation from SwissProt entries; Words that are observed to appear most frequently in biomedical text having ‘mentions‘ of protein glycosylation with the help of LingPipe tool.

Derivation of weights:

Weighting schemes were devised in order to attach higher significance to dictionary terms that are strong markers of glycosylation vis-a-vis ones that are not. The Weighting methodologies used in this approach are those that are proposed by Lan et al [8] and are classified into two types.

Unsupervised weighting methods:

This approach included using normalized weights (weight for the dictionary depends on the contribution of words from that particular dictionary with respect to the total set of words from all the dictionaries), term frequency (TF), normalized term frequency (TF values are normalized with respect to the number of words in the abstract), TF-IDF as used by Lan et al [8] as the weights.

Supervised weighting methods:

This approach included using Relevance Factor (RF), TF-RF, and Normalized TF–RF as the weights. The weighting schemes in the Supervised category are also implemented as that used by Lan et al [8].

Scoring of abstracts and determination of thresholds:

PERL scripts were written to match occurrences of terms from the dictionary. The frequency of occurrence of terms in combination with the weights associated with dictionary to which they belong was used to compute a score for the abstracts. A combination of thresholds was used to arrive at a threshold that could accurately categorize abstracts. Once the abstracts were scored using all the schemes the results were then analyzed to assess the performance by means of statistical measures like Precision, Recall, and F-measure.

Rule based approach:

J48 module available at the WEKA data mining tool was used for rule based approach.

Feature Calculation:

In the rule based approach used, individual abstracts were considered as data points. The features for these abstracts were given as numerical values which were vectors of fixed lengths. Features used were the term frequencies in the abstracts.

Feature Selection:

A subset of high frequency words in the corpora were used to select the most informative attributes. On subjecting them to the Information Gain module for the ranking of attributes, the J48 algorithm returns a tree which contains decision rules. Rules were selected so as to contain only those with good Coverage and Support. Instances which were not covered by any rule were assigned to majority class of the left out examples.

Approaches using Decision Rules:

Approach 1: Subset of 100 features selected by Information Gain; Approach 2: Since most of the terms in these 100 features come from the ‘most-frequently occurring words dictionary’ an approach which filtered out these words was used; Approach 3: Subset of 500 features from the dictionary words excluding words from the most frequent dictionary; Approach 4: The term occurrences were normalized with respect to the occurrence of terms from the dictionary to which the term in question belong to.

Discussion

It was observed that the weighting methodologies were not able to assign very different weights to the different dictionaries. This resulted in similar resolving power for terms from all the dictionaries and a relatively poor performance of the dictionaries as compared to the rule based approach and the machine learning approach. Since the normalized weights did have some significant differences for different dictionaries, this approach fared better than other weighting methodologies. Several rule based approaches were tested for performance. In the first approach the attributes used as input comprised of frequencies of the top 100 high frequency words from the dictionaries in individual abstracts. Since the features in the above approach were independent of the dictionary from which the terms came from, another approach was devised such that dictionary information was also taken into account. Thus, in the second experiment the attributes used were the frequencies normalized with respect to the frequency of occurrence of the words from the dictionary from which the word came. It was observed that on normalizing the frequencies of occurrence of the words from the dictionaries with respect to the dictionary from which the word came there was no significant increase in the accuracy of categorization Table 1(see supplementary material). Since the approaches used so far resulted in maximum number of rules coming out of the “Most Frequently” occurring words dictionary another approach was devised. In this approach the same procedure was repeated as before. But this time the terms from the most frequent words Dictionary were not included in the feature set. This was done in order to arrive at some domain related rules from other dictionaries whose performance was being over shadowed by the “Most Frequently” occurring words dictionary. However, the removal of these words from the attributes resulted in a significant drop in accuracy for both the corpora, the results for which are tabulated. All the approaches discussed above were repeated, such that numbers of features selected were scaled from 100 to 519. This approach performed well when ‘Most frequent words’ dictionary was included. However on removal of the attributes from the ‘Most frequent words’ dictionary this approach too showed poor performance in spite of significantly increasing the words from the other dictionaries. Thus the strength of the dictionary approach lies with terms from the most frequently occurring word dictionary.

Conclusion

With the enormous increase in Scientific literature, researchers and manual curators are unable to keep at pace with the available information. There is thus a pressing need for text mining systems that automate the process and simplify this task. Several text mining systems have been successfully developed and tailored to the several problems at hand. With this study we have tried to use a text mining approach in order to detect mentions of protein glycosylation in text. The study has been conducted only with biomedical abstracts. The approach has used both a Dictionary-based and a Rule-based approach for this task. Categorical dictionaries have been created for this task. These dictionaries serve as the reference for the term identification in text. In the Dictionary-based approach a number of schemes for scoring have been experimented with. Both the supervised and unsupervised weighting do not show any significant difference in performance. The “Rule-based approach” fares better than the “Dictionary-based” for both the Corpora used for this task. Reduction in the dictionary size shows some improvement in performance.

8 in total

1. Tagging gene and protein names in biomedical text.

Authors: Lorraine Tanabe; W John Wilbur
Journal: Bioinformatics Date: 2002-08 Impact factor: 6.937

Review 2. A survey of current work in biomedical text mining.

Authors: Aaron M Cohen; William R Hersh
Journal: Brief Bioinform Date: 2005-03 Impact factor: 11.622

3. Supervised and traditional term weighting methods for automatic text categorization.

Authors: Man Lan; Chew Lim Tan; Jian Su; Yue Lu
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2009-04 Impact factor: 6.226

4. Integrating text mining into the MGI biocuration workflow.

Authors: K G Dowell; M S McAndrews-Hill; D P Hill; H J Drabkin; J A Blake
Journal: Database (Oxford) Date: 2009-11-21 Impact factor: 3.451

5. ESTree db: a tool for peach functional genomics.

Authors: Barbara Lazzari; Andrea Caprera; Alberto Vecchietti; Alessandra Stella; Luciano Milanesi; Carlo Pozzi
Journal: BMC Bioinformatics Date: 2005-12-01 Impact factor: 3.169

6. PLAN2L: a web tool for integrated text mining and literature-based bioentity relation extraction.

Authors: Martin Krallinger; Carlos Rodriguez-Penagos; Ashish Tendulkar; Alfonso Valencia
Journal: Nucleic Acids Res Date: 2009-06-11 Impact factor: 16.971

7. BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009.

Authors: Antje Chang; Maurice Scheer; Andreas Grote; Ida Schomburg; Dietmar Schomburg
Journal: Nucleic Acids Res Date: 2008-11-04 Impact factor: 16.971

8. Text mining for biology--the way forward: opinions from leading scientists.

Authors: Russ B Altman; Casey M Bergman; Judith Blake; Christian Blaschke; Aaron Cohen; Frank Gannon; Les Grivell; Udo Hahn; William Hersh; Lynette Hirschman; Lars Juhl Jensen; Martin Krallinger; Barend Mons; Seán I O'Donoghue; Manuel C Peitsch; Dietrich Rebholz-Schuhmann; Hagit Shatkay; Alfonso Valencia
Journal: Genome Biol Date: 2008-09-01 Impact factor: 13.583

8 in total