Benchmark data for identifying multi-functional types of membrane proteins.

Shibiao Wan, Man-Wai Mak, Sun-Yuan Kung.

Abstract

Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. In this article, we provide data that are used for training and testing Mem-ADSVM (Wan et al., 2016. "Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins" [1]), a two-layer multi-label predictor for predicting multi-functional types of membrane proteins.

Year:  2016        PMID: 27294176      PMCID: PMC4889873          DOI: 10.1016/j.dib.2016.05.024

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

Knowing the functional types of membrane proteins can help to elucidate their biological functions. This article presents the first comprehensive dataset that contains non-membrane proteins, single-functional-type membrane proteins and multi-functional-type membrane proteins. The dataset presented here can be used as a benchmark to evaluate the performance of membrane-protein predictors.

Data

Using benchmark datasets to evaluate the performance of predictors is of great significance in various domains of bioinformatics [5], [6], [7], [8], [9], [10], such as membrane protein type prediction [11]. However, existing benchmark datasets for predicting membrane proteins are either incomplete or non-stringent. This data article describes a stringent and comprehensive benchmark dataset that comprises non-membrane proteins, single-functional-type membrane proteins and multi-functional-type membrane proteins. All of the benchmark datasets (Dataset II(C) together with Dataset I, Dataset II(A) and Dataset II(B)) are accessible from the link given under Data accessibility.

Experimental design, materials and methods

The dataset described here, which we call ‘Dataset II(C)’, is a benchmark dataset for evaluating Mem-ADSVM [1], a webserver that identifies membrane proteins and their multi-functional types. Dataset II(C) was created from two previous datasets [5], [8], which we call Dataset I [5] and Dataset II(A) [8].

First, we retrieved all of the 7965 non-membrane proteins in Dataset I. The procedure used to create Dataset I is as follows: (1) select proteins in the UniProtKB/Swiss-Prot database; (2) exclude protein sequences annotated with “fragment”; (3) exclude protein sequences with fewer than 50 amino acid residues; (4) remove protein sequences annotated with ambiguous words, such as “by similarity”, “potential”, “probable”, etc.; (5) remove sequences annotated with “membrane protein”; (6) use BLASTCLUST [12] to reduce the sequence similarity to no more than 80%. The procedure for obtaining Dataset II(A) is similar, except that Dataset II(A) collects membrane proteins instead of excluding them and reduces the sequence identity to 25% instead of 80%.

Because the sequence identity cutoff of Dataset I (80%) was much higher than that of Dataset II(A) (25%), we used BLASTCLUST to reduce the sequence similarity among the 7965 non-membrane proteins to 25%, which left 2009 non-membrane proteins. We then combined these 2009 non-membrane proteins with Dataset II(A) (5307 membrane proteins) to constitute Dataset II(C), with a total of 7316 proteins, of which 7126 belong to one type, 185 to two types and 5 to three types. Specifically, the distribution of Dataset II(C) is as follows: (1) 626 single-pass type I, (2) 299 single-pass type II, (3) 42 single-pass type III, (4) 73 single-pass type IV, (5) 2437 multi-pass, (6) 403 lipid-anchor, (7) 172 GPI-anchor, (8) 1450 peripheral and (9) 2009 non-membrane.
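The text-based selection criteria and the reported counts can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the record layout (a dict with `sequence` and `annotation` keys), the substring matching and the function name `passes_text_filters` are hypothetical, and the BLASTCLUST step (6) is not modelled. The second half only checks that the numbers quoted above are mutually consistent.

```python
# Hypothetical record layout (an assumption for this sketch):
#   {"sequence": "<amino acid string>", "annotation": "<free-text annotation>"}
AMBIGUOUS_WORDS = ("by similarity", "potential", "probable")
MIN_LENGTH = 50  # sequences with fewer than 50 residues are excluded


def passes_text_filters(record, keep_membrane=False):
    """Apply criteria (2)-(5) to one record.

    keep_membrane=False mimics Dataset I (membrane proteins excluded);
    keep_membrane=True mimics Dataset II(A) (membrane proteins collected).
    Step (6), redundancy reduction with BLASTCLUST, is not modelled here.
    """
    annotation = record["annotation"].lower()
    if "fragment" in annotation:
        return False
    if len(record["sequence"]) < MIN_LENGTH:
        return False
    if any(word in annotation for word in AMBIGUOUS_WORDS):
        return False
    is_membrane = "membrane protein" in annotation
    return is_membrane == keep_membrane


# Counts reported in the text for Dataset II(C).
TYPE_COUNTS = {
    "single-pass type I": 626,
    "single-pass type II": 299,
    "single-pass type III": 42,
    "single-pass type IV": 73,
    "multi-pass": 2437,
    "lipid-anchor": 403,
    "GPI-anchor": 172,
    "peripheral": 1450,
    "non-membrane": 2009,
}
# Number of proteins carrying one, two or three type labels.
PROTEINS_BY_LABEL_COUNT = {1: 7126, 2: 185, 3: 5}

total_proteins = sum(PROTEINS_BY_LABEL_COUNT.values())
total_labels = sum(k * n for k, n in PROTEINS_BY_LABEL_COUNT.items())

# 5307 membrane proteins plus 2009 non-membrane proteins.
assert total_proteins == 5307 + 2009
# Each protein contributes one count per type it belongs to, so the
# per-type counts sum to the number of labels, not the number of proteins.
assert sum(TYPE_COUNTS.values()) == total_labels
```

The per-type counts sum to more than the number of proteins because multi-functional proteins are counted once under each of their types, which is why the check compares against the label total.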

Specifications Table

Subject area: Biology
More specific subject area: Bioinformatics/Computational Biology
Type of data: Text
How data was acquired: By processing datasets that were obtained by searching the UniProtKB/Swiss-Prot database with a series of stringent criteria
Data format: Analyzed
Experimental factors: Proteins were manually annotated and were extracted from UniProtKB.
Experimental features: For each protein sequence, its associated gene ontology (GO) information was retrieved by searching a compact GO-term database [2], [3], [4] with its homologous accession number.
Data source location: Hong Kong SAR, China
Data accessibility: The dataset is available with this article and at http://bioinfo.eie.polyu.edu.hk/MemADSVMServer/datasets.html

  10 in total

1.  Mem-mEN: Predicting Multi-Functional Types of Membrane Proteins by Interpretable Elastic Nets.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2015-08-28       Impact factor: 3.710

2.  mPLR-Loc: an adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  Anal Biochem       Date:  2014-10-31       Impact factor: 3.365

3.  mLASSO-Hum: A LASSO-based interpretable human-protein subcellular localization predictor.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  J Theor Biol       Date:  2015-07-09       Impact factor: 2.691

4.  iMem-Seq: A Multi-label Learning Classifier for Predicting Membrane Proteins Types.

Authors:  Xuan Xiao; Hong-Liang Zou; Wei-Zhong Lin
Journal:  J Membr Biol       Date:  2015-03-22       Impact factor: 1.843

5.  R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  J Theor Biol       Date:  2014-07-02       Impact factor: 2.691

6.  Mem-ADSVM: A two-layer multi-label predictor for identifying multi-functional types of membrane proteins.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  J Theor Biol       Date:  2016-03-19       Impact factor: 2.691

7.  A multilabel model based on Chou's pseudo-amino acid composition for identifying membrane proteins with both single and multiple functional types.

Authors:  Chao Huang; Jing-Qi Yuan
Journal:  J Membr Biol       Date:  2013-04-02       Impact factor: 1.843

8.  GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  J Theor Biol       Date:  2013-01-29       Impact factor: 2.691

9.  mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  BMC Bioinformatics       Date:  2012-11-06       Impact factor: 3.169

10.  HybridGO-Loc: mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins.

Authors:  Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal:  PLoS One       Date:  2014-03-19       Impact factor: 3.240

  1 in total

1.  Accurate classification of membrane protein types based on sequence and evolutionary information using deep learning.

Authors:  Lei Guo; Shunfang Wang; Mingyuan Li; Zicheng Cao
Journal:  BMC Bioinformatics       Date:  2019-12-24       Impact factor: 3.169

