Combining text mining and sequence analysis to discover protein functional regions.

E Eskin, E Agichtein.

Abstract

Recently presented protein sequence classification models can identify relevant regions of the sequence. This observation has many potential applications in detecting functional regions of proteins. However, identifying such sequence regions automatically is difficult in practice, as relatively few types of information have enough annotated sequences to perform this analysis. Our approach addresses this data scarcity problem by combining text and sequence analysis. First, we train a text classifier over the explicit textual annotations available for some of the sequences in the dataset, and use the trained classifier to predict the class of the remaining unlabeled sequences. We then train a joint sequence-text classifier over the text contained in the functional annotations of the sequences and the actual sequences in this larger, automatically extended dataset. Finally, we project the classifier onto the original sequences to determine the relevant regions of the sequences. We demonstrate the effectiveness of our approach by predicting protein sub-cellular localization and determining localization-specific functional regions of these proteins.
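
The three steps in the abstract (label extension with a text classifier, joint training over text and sequence, projection onto the sequence) can be illustrated with a minimal sketch. The abstract does not name the classifiers used, so everything below is an assumption made for illustration: a two-class multinomial naive Bayes model stands in for both classifiers, sequences are represented as overlapping 3-mers, and the four toy proteins and all function names are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Two-class multinomial naive Bayes; each doc is a list of tokens."""
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for toks, y in zip(docs, labels):
        counts[y].update(toks)
    vocab = {t for c in classes for t in counts[c]}
    return classes, counts, Counter(labels), vocab

def log_odds(model, token):
    """Add-one smoothed log-likelihood ratio: classes[1] versus classes[0]."""
    classes, counts, _, vocab = model
    def p(c):
        return (counts[c][token] + 1) / (sum(counts[c].values()) + len(vocab))
    return math.log(p(classes[1]) / p(classes[0]))

def predict(model, toks):
    classes, _, prior, vocab = model
    score = math.log(prior[classes[1]] / prior[classes[0]])
    score += sum(log_odds(model, t) for t in toks if t in vocab)
    return classes[1] if score > 0 else classes[0]

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def project(model, seq, k=3):
    """Per-residue score: sum of the log-odds of every k-mer covering it."""
    vocab = model[3]
    scores = [0.0] * len(seq)
    for i, km in enumerate(kmers(seq, k)):
        if km in vocab:
            w = log_odds(model, km)
            for j in range(i, i + k):
                scores[j] += w
    return scores

# Hypothetical toy data: (annotation text, sequence, localization label).
labeled = [
    ("nuclear dna binding", "MKRKRKAAGG", "nuclear"),
    ("secreted signal peptide", "MLLLLLAAGG", "secreted"),
]
# Annotated but unlabeled proteins: (annotation text, sequence).
unlabeled = [
    ("dna binding transcription", "GGAKRKRKTT"),
    ("signal peptide hormone", "TTLLLLLAGG"),
]

# Step 1: train on the labeled annotations, then label the rest.
text_model = train_nb([t.split() for t, _, _ in labeled],
                      [y for _, _, y in labeled])
predicted = [predict(text_model, t.split()) for t, _ in unlabeled]

# Step 2: train a joint model on words plus k-mers over the extended set.
extended = labeled + [(t, s, y) for (t, s), y in zip(unlabeled, predicted)]
joint_model = train_nb([t.split() + kmers(s) for t, s, _ in extended],
                       [y for _, _, y in extended])

# Step 3: project the joint model onto one of the original sequences.
# Negative scores favor classes[0] ("nuclear"), positive favor "secreted".
scores = project(joint_model, labeled[0][1])
for residue, score in zip(labeled[0][1], scores):
    print(residue, round(score, 2))
```

In this toy setting the lowest-scoring residues of the first sequence mark the stretch that the joint model associates most strongly with the nuclear class. Word tokens are lowercase and k-mers are uppercase, so the two feature types share one vocabulary without colliding; a real system would likely keep them as separate feature spaces.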

Year: 2004    PMID: 14992511    DOI: 10.1142/9789812704856_0028

Source DB: PubMed    Journal: Pac Symp Biocomput    ISSN: 2335-6928


  3 in total

1.  Translational drug-interaction corpus.

Authors:  Shijun Zhang; Hengyi Wu; Lei Wang; Gongbo Zhang; Luis M Rocha; Hagit Shatkay; Lang Li
Journal:  Database (Oxford)       Date:  2022-05-18       Impact factor: 4.462

2.  New directions in biomedical text annotation: definitions, guidelines and corpus construction.

Authors:  W John Wilbur; Andrey Rzhetsky; Hagit Shatkay
Journal:  BMC Bioinformatics       Date:  2006-07-25       Impact factor: 3.169

3.  ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes.

Authors:  Brian R King; Chittibabu Guda
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583