Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters.

Trinh-Trung-Duong Nguyen, Nguyen-Quoc-Khanh Le, Quang-Thai Ho, Dinh-Van Phan, Yu-Yen Ou.

Abstract

Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, which makes it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, the main driver of the natural language processing breakthroughs of recent years, to protein sequences of transporters. We represented each protein sequence by the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that make up each protein sequence to find the optimal length, the one that generated the most discriminative feature set. Compared with four other feature types created from protein sequences, our proposed features helped prediction models yield superior performance. Our best models reached an average area under the curve of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on substrate specificities and provides a basis for further research that applies natural language processing techniques in bioinformatics.
Copyright © 2019 Elsevier Inc. All rights reserved.
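
The feature construction summarized in the abstract (biological words, their frequencies, and their embeddings) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the words are taken as overlapping k-mers, `toy_embedding` is a hypothetical stand-in for vectors that would be trained on a protein corpus, and the frequency-weighted average is only one plausible way to combine embeddings with frequencies. The downstream classifier is not shown.

```python
from collections import Counter


def biological_words(sequence, k=3):
    """Split a protein sequence into overlapping words of length k (assumed scheme)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding(word, dim=4):
    """Hypothetical placeholder for a trained word embedding.

    Returns a deterministic vector with values in [0, 1]. A real pipeline
    would look up a vector learned from a large set of protein sequences.
    """
    total = sum(ord(c) for c in word)
    return [((total * (j + 1)) % 11) / 10.0 for j in range(dim)]


def sequence_features(sequence, k=3, dim=4):
    """Combine word embeddings and word frequencies into one feature vector.

    Assumption: each word's embedding is weighted by its relative frequency
    in the sequence and the weighted vectors are summed.
    """
    words = biological_words(sequence, k)
    if not words:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(words)
    features = [0.0] * dim
    for word, count in counts.items():
        weight = count / len(words)  # relative frequency of this word
        vector = toy_embedding(word, dim)
        for j in range(dim):
            features[j] += weight * vector[j]
    return features


if __name__ == "__main__":
    example = "MKTLLVAGAMKTLL"  # made-up sequence, for illustration only
    # Vary the word length, as the abstract describes, and inspect the features.
    for k in (1, 2, 3):
        print(k, sequence_features(example, k=k))
```

Each resulting vector has a fixed length regardless of sequence length, which is what allows it to be passed to a standard classifier. Comparing cross-validated performance across values of `k` would mirror the search for the optimal word length.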

Keywords:  Feature extraction; Natural language processing; Protein function prediction; Substrate specificities; Support vector machine; Transporter; Word embeddings

Year:  2019        PMID: 31022378     DOI: 10.1016/j.ab.2019.04.011

Source DB:  PubMed          Journal:  Anal Biochem        ISSN: 0003-2697            Impact factor:   3.365


  6 in total

1.  Word2vec neural model-based technique to generate protein vectors for combating COVID-19: a machine learning approach.

Authors:  Toby A Adjuik; Daniel Ananey-Obiri
Journal:  Int J Inf Technol       Date:  2022-05-19

2.  Trader as a new optimization algorithm predicts drug-target interactions efficiently.

Authors:  Yosef Masoudi-Sobhanzadeh; Yadollah Omidi; Massoud Amanlou; Ali Masoudi-Nejad
Journal:  Sci Rep       Date:  2019-06-27       Impact factor: 4.379

3.  TooT-T: discrimination of transport proteins from non-transport proteins.

Authors:  Munira Alballa; Gregory Butler
Journal:  BMC Bioinformatics       Date:  2020-04-23       Impact factor: 3.169

4.  ISTRF: Identification of sucrose transporter using random forest.

Authors:  Dong Chen; Sai Li; Yu Chen
Journal:  Front Genet       Date:  2022-09-12       Impact factor: 4.772

5.  Representation learning applications in biological sequence analysis. (Review)

Authors:  Hitoshi Iuchi; Taro Matsutani; Keisuke Yamada; Natsuki Iwano; Shunsuke Sumi; Shion Hosoda; Shitao Zhao; Tsukasa Fukunaga; Michiaki Hamada
Journal:  Comput Struct Biotechnol J       Date:  2021-05-23       Impact factor: 7.271

6.  TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings.

Authors:  Trinh-Trung-Duong Nguyen; Nguyen-Quoc-Khanh Le; Quang-Thai Ho; Dinh-Van Phan; Yu-Yen Ou
Journal:  BMC Med Genomics       Date:  2020-10-22       Impact factor: 3.063
