Protein classification using modified n-grams and skip-grams.

S M Ashiqul Islam¹, Benjamin J Heil², Christopher Michel Kearney¹,³, Erich J Baker¹,².

Abstract

Motivation: Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG).
Results: A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists.
Availability and implementation: m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg.
Contact: erich_baker@baylor.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
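
To make the terms in the abstract concrete, the following is a minimal sketch of plain n-gram and skip-gram counting on a protein sequence. It is not the authors' m-NGSG implementation: the abstract does not describe their modifications, so the function names (`ngrams`, `skipgrams`, `featurize`), the underscore gap notation, and the default parameters are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous substrings of length n."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Return residue pairs separated by exactly k skipped positions.

    The gap is written as k underscores (illustrative notation),
    e.g. 'M_V' for k=1.
    """
    gap = "_" * k
    return [seq[i] + gap + seq[i + k + 1] for i in range(len(seq) - k - 1)]


def featurize(seq, max_n=2, max_skip=1):
    """Count n-grams (n = 1..max_n) and skip-grams (k = 1..max_skip).

    Hypothetical helper: the parameter choices are assumptions,
    not values taken from the paper.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        counts.update(skipgrams(seq, k))
    return counts


features = featurize("MKVMK")
print(features["MK"])   # count of the bigram "MK"
print(features["M_V"])  # count of the skip-gram M, one skipped residue, V
```

Count vectors of this kind could then be passed to any supervised classifier; which classifier and which feature weighting m-NGSG uses is not stated in the abstract.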

Entities: Gene

Substances: Proteins

Year: 2018 PMID: 29309523 DOI: 10.1093/bioinformatics/btx823

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited

4 in total

1. Classes, Databases, and Prediction Methods of Pharmaceutically and Commercially Important Cystine-Stabilized Peptides. (Review)

Authors: S M Ashiqul Islam; Christopher Michel Kearney; Erich Baker
Journal: Toxins (Basel) Date: 2018-06-19 Impact factor: 4.546

2. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).

Authors: Ehsaneddin Asgari; Alice C McHardy; Mohammad R K Mofrad
Journal: Sci Rep Date: 2019-03-05 Impact factor: 4.379

3. Representation learning applications in biological sequence analysis. (Review)

Authors: Hitoshi Iuchi; Taro Matsutani; Keisuke Yamada; Natsuki Iwano; Shunsuke Sumi; Shion Hosoda; Shitao Zhao; Tsukasa Fukunaga; Michiaki Hamada
Journal: Comput Struct Biotechnol J Date: 2021-05-23 Impact factor: 7.271

4. Assigning biological function using hidden signatures in cystine-stabilized peptide sequences.

Authors: S M Ashiqul Islam; Christopher Michel Kearney; Erich J Baker
Journal: Sci Rep Date: 2018-06-13 Impact factor: 4.379
