Combining text mining and sequence analysis to discover protein functional regions.

Abstract

Recently presented protein sequence classification models can identify relevant regions of the sequence. This observation has many potential applications to detecting functional regions of proteins. However, identifying such sequence regions automatically is difficult in practice, as relatively few types of information have enough annotated sequences to perform this analysis. Our approach addresses this data scarcity problem by combining text and sequence analysis. First, we train a text classifier over the explicit textual annotations available for some of the sequences in the dataset, and use the trained classifier to predict the class of the remaining, unlabeled sequences. We then train a joint sequence-text classifier over both the text contained in the functional annotations of the sequences and the actual sequences in this larger, automatically extended dataset. Finally, we project the classifier onto the original sequences to determine the relevant regions of the sequences. We demonstrate the effectiveness of our approach by predicting protein subcellular localization and determining localization-specific functional regions of these proteins.
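The three steps described in the abstract can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: the abstract does not say which classifier or sequence features were used, so the multinomial naive Bayes model, the 3-mer sequence features, the log-odds projection, all function names, and the tiny annotation/sequence dataset are assumptions made up for this example.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def words(text):
    """Lowercased whitespace tokens of an annotation string."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over token lists, with add-one smoothing.
    (An assumed stand-in; the abstract does not name the classifier.)"""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def log_prob(self, token, c):
        return math.log((self.counts[c][token] + 1) / (self.total[c] + len(self.vocab)))

    def predict(self, tokens):
        n = sum(self.prior.values())

        def score(c):
            known = (t for t in tokens if t in self.vocab)  # skip unseen tokens
            return math.log(self.prior[c] / n) + sum(self.log_prob(t, c) for t in known)

        return max(self.classes, key=score)


def train_pipeline(labeled, unlabeled, k=3):
    # Step 1: text classifier trained on the explicitly annotated sequences.
    text_clf = NaiveBayes().fit([words(t) for t, _, _ in labeled],
                                [y for _, _, y in labeled])
    # Step 2: extend the dataset with predicted labels for the unlabeled entries.
    extended = labeled + [(t, s, text_clf.predict(words(t))) for t, s in unlabeled]
    # Step 3: joint classifier over annotation words plus sequence k-mers.
    # Lowercase words and uppercase k-mers cannot collide as tokens.
    joint = NaiveBayes().fit([words(t) + kmers(s, k) for t, s, _ in extended],
                             [y for _, _, y in extended])
    return text_clf, joint


def region_scores(joint, seq, target, k=3):
    """Project the joint model onto a sequence: per-position log-odds of each
    k-mer for the target class against the best competing class."""
    others = [c for c in joint.classes if c != target]
    return [joint.log_prob(m, target) - max(joint.log_prob(m, c) for c in others)
            for m in kmers(seq, k)]


# Made-up toy data: (annotation text, sequence, label) and (annotation text, sequence).
LABELED = [
    ("nuclear transcription factor", "MSAKKRKVEDG", "nuclear"),
    ("dna binding nuclear protein", "MTGPKKRKLSE", "nuclear"),
    ("secreted signal peptide hormone", "MALLLLALSTQ", "secreted"),
    ("extracellular secreted enzyme", "MKLLLLAVGSD", "secreted"),
]
UNLABELED = [
    ("nuclear dna repair protein", "MDEKKRKTPSG"),
    ("secreted extracellular protease", "MRLLLLAQDTE"),
]

text_clf, joint = train_pipeline(LABELED, UNLABELED)
query = "MDEKKRKTPSG"
scores = region_scores(joint, query, "nuclear")
best = max(range(len(scores)), key=scores.__getitem__)
print("highest-scoring 3-mer for 'nuclear':", query[best:best + 3])
```

In this toy setup the k-mers shared by all sequences labeled as nuclear receive the highest log-odds, which is the sense in which a trained classifier can point back at candidate functional regions. A real analysis would need far more data, held-out evaluation, and a classifier chosen for the task.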

Substances: Proteins

Year: 2004; PMID: 14992511; DOI: 10.1142/9789812704856_0028

Source DB: PubMed; Journal: Pac Symp Biocomput; ISSN: 2335-6928
