Sarthak Mishra1, Yash Pratap Rastogi1, Suraiya Jabin1, Punit Kaur2, Mohammad Amir1, Shabanam Khatoon1.
Abstract
Protein function prediction is one of the most studied and most challenging problems in computational biology. The vast majority of known proteins have not yet been characterised experimentally, and there is a significant gap between their structures and functions. New unannotated sequences are being added to public protein databases (e.g. UniProtKB) at an enormous pace [1]. Proteins with unknown functions might play key roles in metabolism, growth, and developmental regulation. Thus, if the functions of such proteins are left undiscovered, researchers may miss important information. Computational tools can provide insights into the function of proteins based on their sequence, structure, evolutionary history, and association with other proteins [2]. For proteins with well-characterised close relatives, inferring function is trivial. Orphan proteins without discernible sequence relatives present a greater challenge [3]. For these, experimental characterisation proceeds blind and becomes unwieldy. It is highly unlikely that all known proteins will ever be completely characterised experimentally [4]. Thus, there is a pressing need to develop fast and accurate computational approaches to fulfil this requirement. Towards this end, we prepared a dataset for protein function prediction by extracting the protein sequences and annotations of reviewed prokaryotic proteins (323,719 in total, as accessed on March 10, 2019) belonging to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Samples were filtered to those annotated with the 1739 most frequent Gene Ontology (Molecular Function) terms, which left 171,212 proteins for feature generation. The dataset was generated by calculating the sequence, subsequence, physicochemical, and annotation-based features for each of the 171,212 reviewed proteins using the method in [10].
These features constitute a total of 9890 attributes for each protein sequence, alongside the 1739 Gene Ontology terms. Each protein sequence is assigned one or more of the 1739 Gene Ontology (Molecular Function) terms as its target label. The dataset also contains the Entry and Entry name of each sequence as given in the UniProtKB database. Because the dataset is large (171,212 samples × 9890 features, 1739 classes with multiple values) and contains a sufficient number of positive and negative samples for each of the 1739 classes, it is well suited for testing the efficiency of new deep learning models [5]. We divided the full dataset of 171,212 reviewed proteins in the ratio 3:1 to form Train/Test Dataset 1: a train dataset with 128,409 samples and a test dataset with 42,803 samples, to facilitate training of a deep learning model. The train and test datasets are stratified so that each contains a good proportion of every one of the 1739 classes. We then prepared Dataset 2, consisting of pathogenic unreviewed proteins of the 9 bacterial phyla, each with the same 9890 features as the train/test dataset of reviewed proteins but without target labels, in order to predict their functions using the deep learning model proposed in [5].
Keywords: Annotation based features; Function prediction; Molecular function; Motif; Physicochemical features; Reviewed protein; Sequence-based features; Unreviewed protein
Year: 2019 PMID: 31921945 PMCID: PMC6950771 DOI: 10.1016/j.dib.2019.105002
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
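The sample counts quoted in the abstract for the 3:1 split can be checked with a short sketch. The helper names `split_sizes` and `shuffled_split` are hypothetical and are not taken from the authors' code. The shuffle shown here is a plain random split; the stratification over 1739 multi-label classes described in the abstract is not reproduced, since the source does not specify how it was done.

```python
import random


def split_sizes(n_samples, train_parts=3, test_parts=1):
    """Return (n_train, n_test) for a train:test ratio, rounding the test share."""
    n_test = round(n_samples * test_parts / (train_parts + test_parts))
    return n_samples - n_test, n_test


def shuffled_split(ids, seed=0):
    """Plain random 3:1 split of identifiers (NOT stratified; illustration only)."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train, _ = split_sizes(len(ids))
    return ids[:n_train], ids[n_train:]


# 171,212 reviewed proteins split 3:1, as stated in the abstract.
print(split_sizes(171_212))  # (128409, 42803)
```

The printed sizes agree with the 128,409 training and 42,803 test samples reported for Train/Test Dataset 1.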
Summary of different feature groups and their descriptors [5].
| S. No | Feature Group | Feature Name | Python Package used | Number of descriptor values |
|---|---|---|---|---|
| 1. | Sequence-based | Protein Length | Biopython | 1 |
| 2. | Sequence-based | Amino acid composition | ifeature | 20 |
| 3. | Sequence-based | Dipeptide composition | ifeature | 400 |
| 4. | Sequence-based | Tripeptide composition | ifeature | 8000 |
| 5. | Sequence-based | Pseudo amino acid composition | ifeature | 49 |
| 6. | Subsequence-based | Motif count | Biopython | 541 |
| 7. | Physicochemical-based | Molecular weight | Biopython | 1 |
| 8. | Physicochemical-based | Instability index | Biopython | 1 |
| 9. | Physicochemical-based | Isoelectric point | Biopython | 1 |
| 10. | Physicochemical-based | GRAVY | Biopython | 1 |
| 11. | Physicochemical-based | Extinction Coefficient | Biopython | 2 |
| 12. | Physicochemical-based | Secondary structure fraction | Biopython | 3 |
| 13. | Physicochemical-based | Grouped amino acid composition | ifeature | 5 |
| 14. | Physicochemical-based | Moran autocorrelation | ifeature | 232 |
| 15. | Physicochemical-based | Composition, Transition and Distribution | ifeature | 273 |
| 16. | Physicochemical-based | Conjoint Triad | ifeature | 343 |
| 17. | Annotation-based | Annotation-based features (subcellular localisation, binding preference and presence of transmembrane region) | urllib (web-scraping) | 17 |
| | TOTAL | | | 9890 |
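To illustrate where the 20, 400 and 8000 descriptor counts for amino acid, dipeptide and tripeptide composition come from, the following is a minimal pure-Python sketch. It is not the ifeature implementation: the function name `kmer_composition`, the toy sequence, and the choice to normalise by the number of sliding windows are assumptions made for illustration. The last lines re-add the descriptor counts listed in the table to confirm the stated total.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Relative frequency of every possible k-mer over the 20 standard residues.

    Returns a list of length 20**k in lexicographic k-mer order.
    Normalising by the number of windows is an assumption of this sketch.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(sequence) - k + 1
    for i in range(n_windows):
        window = sequence[i:i + k]
        if window in counts:  # windows with non-standard residues are skipped
            counts[window] += 1
    total = max(n_windows, 1)  # avoid division by zero for very short sequences
    return [counts[m] / total for m in kmers]


# Descriptor counts copied from the table, in row order (rows 1 to 17).
DESCRIPTOR_COUNTS = [1, 20, 400, 8000, 49, 541, 1, 1, 1, 1, 2, 3, 5, 232, 273, 343, 17]

toy_sequence = "MKVLAAGIVGLLLA"  # hypothetical sequence, not from the dataset
for k in (1, 2, 3):
    print(k, len(kmer_composition(toy_sequence, k)))
print(sum(DESCRIPTOR_COUNTS))
```

The tripeptide block alone accounts for 8000 of the 9890 attributes, so most of each feature vector is expected to be sparse for proteins of typical length.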
Specifications Table
| Subject | Biochemistry, Genetics and Molecular Biology (General) |
| Specific subject area | Deep learning task for protein function prediction of 9 bacterial phyla into multi-valued and multi-class labels |
| Type of data | Tables (Excel sheets) and FASTA files |
| How data were acquired | Web-Scraping and Feature Generation through Python libraries |
| Data format | FASTA sequences of 171,212 proteins of 9 bacterial phyla; Train/Test Dataset 1 with 9890 extracted features and 1739 GO terms in the form of training vectors for 171,212 proteins of 9 bacterial phyla; Test Dataset 2 with 9890 extracted features for unreviewed proteins of the 9 phyla extracted from UniProtKB for predictions using the deep neural network based protein function prediction model [ |
| Parameters for data collection | Both reviewed and unreviewed protein sequences belonging to 9 bacterial phyla were collected from UniProtKB. Reviewed proteins were used to generate the dataset for training and testing ( |
| Description of data collection | Data were collected from the UniProtKB and Prosite servers using a Python web-scraping library. The 323,719 reviewed protein sequences were downloaded from UniProtKB and their motifs were extracted from the Prosite server. The sequences were then filtered using the 1739 relevant Gene Ontology terms (Molecular Function domain). The sequence, subsequence (motif count), annotation, and physicochemical features for the filtered 171,212 protein sequences were generated using the method in [10]. The final dataset contains the Entry, Entry name, sequence, 9890 generated features and 1739 GO terms for each sample. |
| Data source location | |
| Data accessibility | With the article |
| Related research article | Author's name: Sarthak Mishra, Yash Pratap Rastogi, Suraiya Jabin, Punit Kaur, Mohammad Amir, Shabanam Khatoon |
This dataset contributes an important step towards solving the protein function prediction problem for bacterial species. Researchers designing new deep learning models can use this dataset to test the performance of their models. We provide 1739 GO terms from the molecular function domain as target labels for designing a supervised learning model, but these GO terms can also be used as features in other kinds of study, such as clustering bacterial proteins into functional groups. Because of its large size, the dataset can be used to design and test GPU-based parallelised deep learning algorithms for multi-class labelling.
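Since each protein carries one or more GO terms as its target, the labels are naturally represented as a multi-hot matrix restricted to the most frequent terms. The sketch below shows one way this could be done and is not the authors' pipeline: the function name `build_label_matrix`, the protein identifiers `P1` to `P4`, and the tie-breaking by term ID are all assumptions made for illustration.

```python
from collections import Counter


def build_label_matrix(annotations, top_k):
    """Keep the top_k most frequent GO terms and encode proteins as multi-hot rows.

    annotations: dict mapping a protein entry to a set of GO term IDs.
    Returns (entries, kept_terms, rows); proteins left with no kept term are dropped.
    """
    term_counts = Counter(t for terms in annotations.values() for t in terms)
    # Sort by descending frequency, then by term ID so the order is deterministic.
    kept_terms = sorted(term_counts, key=lambda t: (-term_counts[t], t))[:top_k]
    column = {term: j for j, term in enumerate(kept_terms)}

    entries, rows = [], []
    for entry, terms in annotations.items():
        row = [0] * len(kept_terms)
        for term in terms:
            if term in column:
                row[column[term]] = 1
        if any(row):  # filter out proteins without any retained term
            entries.append(entry)
            rows.append(row)
    return entries, kept_terms, rows


# Hypothetical toy annotations; the entry names are placeholders.
toy = {
    "P1": {"GO:0005524", "GO:0046872"},
    "P2": {"GO:0005524"},
    "P3": {"GO:0003677"},
    "P4": {"GO:0046872", "GO:0005524"},
}
entries, kept_terms, rows = build_label_matrix(toy, top_k=2)
print(entries, kept_terms, rows)
```

With `top_k=1739` and the full annotation set, the same idea would yield one 1739-column row per retained protein, which can serve either as the supervised target or as a feature block for clustering.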