
An analytical study on the identification of N-linked glycosylation sites using machine learning model.

Muhammad Aizaz Akmal1, Muhammad Awais Hassan2, Shoaib Muhammad2, Khaldoon S Khurshid2, Abdullah Mohamed3.   

Abstract

N-linked glycosylation is the most common type of glycosylation; it plays a significant role in identifying various diseases such as type I diabetes and cancer and helps in drug development. Most proteins cannot perform their biological and physiological functions without undergoing this modification. Because experimental identification has limitations, it is essential to identify such sites by computational techniques. This study aims to analyze and synthesize the progress made in discovering N-linked sites using machine learning methods. It also explores the performance of currently available tools for predicting such sites. Almost seventy research articles published in recognized journals in the N-linked glycosylation field were shortlisted after a rigorous filtering process. The findings of the studies are reported on multiple aspects: publication channel, feature set construction method, training algorithm, and performance evaluation. Moreover, a taxonomy of N-linked sequence identification has been developed from the literature survey. Our study focuses on the performance evaluation criteria, and the importance of N-linked glycosylation motivates us to discover resources that use computational methods instead of experimental methods, given the limitations of the latter.
© 2022 Akmal et al.

Keywords:  Artificial intelligence; Deep learning; Glycosylation; Machine learning; N-linked; Performance evaluation criteria

Year:  2022        PMID: 36262138      PMCID: PMC9575850          DOI: 10.7717/peerj-cs.1069

Source DB:  PubMed          Journal:  PeerJ Comput Sci        ISSN: 2376-5992


Introduction

Glycosylation is considered one of the most complex types of post-translational modification (PTM) in eukaryotic cells (Akmal, Rasool & Khan, 2017; Yang et al., 2019). A post-translational modification occurs when a protein undergoes chemical changes after synthesis; without these modifications, proteins cannot perform their physiological functions properly (Yang et al., 2019). Nearly 200 different types of PTM have been discovered, and glycosylation is among the most important because it plays a vital role in biological functions such as cell communication, protein folding, and antigen recognition; roughly 50% of human proteins are glycosylated (Akmal, Rasool & Khan, 2017; Akmal et al., 2020; Yang et al., 2019). Glycosylation sites are highly relevant for cancer discovery as well as for drug development (He, Wei & Zou, 2019; Hwang et al., 2020). Glycosylation is classified into five types: N-linked, O-linked, C-linked, glypiation and phosphoglycosylation (Lei, Tang & Du, 2017). Identifying such sites is therefore important. The available techniques can be broadly classified into experimental and computational methods (Audagnotto & Dal Peraro, 2017). The experimental methods require an understanding of cell biology and the functions of cell structure (Hwang et al., 2020). The well-known techniques used for experimental identification are radioactive labeling, chromatin immunoprecipitation (ChIP), mass spectrometry (MS) and liquid chromatography (LC) (Akmal et al., 2020; Hwang et al., 2020; Naseer et al., 2020a). In the computational methods, researchers extract valuable information from the structure of protein sequences and apply artificial intelligence algorithms to predict the relevant glycosylation sites or other PTM sites (Hamby & Hirst, 2008; He, Wei & Zou, 2019; Shek, Kotidis & Betenbaugh, 2021; Naseer et al., 2021b; Murad et al., 2021).
N-linked glycosylation is the primary glycosylation type, as 90% of glycosylated sites belong to the N-linked type (Akmal, Rasool & Khan, 2017). Usually, N-glycans are attached to glycoproteins on asparagine residues within the Asn-X-Ser/Thr sequon, where X can be any amino acid residue except proline (Zhang et al., 2021b; Alkuhlani et al., 2021). N-linked glycans play a vital role in both intrinsic and extrinsic functions (Alkuhlani et al., 2021). Apart from improving the protein's stability, they provide a structural component to the cell surface. N-glycans also mediate cell-to-cell interaction and control the glycoprotein in the cellular environment (Naseer et al., 2020b). N-linked glycans help in the identification of various diseases such as type I diabetes, cancer, rheumatoid arthritis, and Crohn's disease (Alkuhlani et al., 2021; Naseer et al., 2020a; Khan et al., 2020b). It is therefore important to identify such sites, but identification using experimental techniques is both time-consuming and expensive (Coff et al., 2020; Akmal et al., 2020; Qiu et al., 2018). Researchers have consequently developed several computational models based on artificial neural networks (ANN) to predict N-linked sites (Le, Sandag & Ou, 2018; Butt et al., 2016; Alkuhlani et al., 2021). Although a few reviews of N-linked prediction models exist, they mainly focus on the algorithm used to train the model and pay less attention to feature set construction and performance metrics, as shown in Table 1. These reviews only analyzed models developed up to 2019.
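The Asn-X-Ser/Thr sequon rule described above can be illustrated with a short sketch. It only locates candidate asparagine positions in a one-letter protein sequence; a matching sequon does not guarantee glycosylation, which is why the reviewed predictors are needed. The function name and the toy sequence are hypothetical and are not taken from any of the reviewed tools.

```python
import re

# N-X-S/T sequon: Asn (N), then any residue except Pro (P), then Ser (S) or Thr (T).
# A zero-width lookahead is used so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    seq = sequence.upper()
    return [match.start() + 1 for match in SEQUON.finditer(seq)]


# Hypothetical toy sequence, used only for illustration.
example = "MNGTANPSKNNSTL"
print(find_sequons(example))  # [2, 10, 11]; the N at position 6 is skipped because X is proline
```

The positions returned here are the candidates that a trained model would then label as glycosylated or non-glycosylated.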
Table 1

Proposed survey comparison with existing studies.

Article Ref. No. | Focus | Year | Survey approach | Quality assessment | N-linked model (Tool) | Feature construction | Training algorithm | Organism type | Performance metric (ACC, SN, SP) | Target repository
Alkuhlani et al. (2021) | Glycosylation sites prediction tool using AI. | 2021 | Informal
Shek, Kotidis & Betenbaugh (2021) | Experimental and computation method for PTM site prediction | 2021 | Informal
Kuo-Chen (2019) | PTM sites prediction model develop using Chou’s 5 step model. | 2019 | Informal | ✗ (other PTM)
He, Wei & Zou (2019) | Research progress in PTM site prediction. | 2019 | Informal | ✗ (glyco type not specified)
Audagnotto & Dal Peraro (2017) | Tools used for PTM. | 2017 | Informal
This survey | N-linked site prediction tool including training algorithm, and feature approach which helps to construct an efficient model for other PTM. | 2021 | Systematic Review | 5
The glycosylated region of an N-linked site appears at a specific location within the protein sequence. A protein sequence is a chain of amino acids, and each of the 20 known amino acids is represented by a specific alphabetic character (Qiu et al., 2018; Yang et al., 2019; Kumari, Kumar & Kumar, 2018). In the computational approach, useful information must be extracted from these sequences to construct the feature vector (Butt, Rasool & Khan, 2017; Chien et al., 2020; Hamby & Hirst, 2008; Naseer et al., 2021b). The feature vectors of glycosylated and non-glycosylated N-linked sites follow certain patterns, and these patterns are identified through various machine learning techniques (Taherzadeh et al., 2019; Tran, Pham & Ou, 2021; Hayat & Khan, 2011; Park et al., 2019; Xiang, Zou & Zhao, 2021; Dimeglio et al., 2020). Evidence of the organism type also helps in the successful identification of such sites (Huang & Li, 2018). The existing reviews are compared from various perspectives, namely quality assessment scores, availability of an N-linked model, feature set construction method, model training algorithm, species type, performance metric and target repositories, as shown in Table 1. The proposed study only considered review articles accepted in recognized journals, for reasons of reliability (Barukab et al., 2019). This comparison establishes the need for the present survey. The rationale of our work is to provide a comprehensive systematic literature review on the identification of N-linked sites and to bring out the details of existing computational models. Researchers have made numerous efforts to identify such sites computationally in the recent past. Their work has been reviewed by a few authors to assess the effectiveness of the proposed prediction models for identifying N-linked sites (Shek, Kotidis & Betenbaugh, 2021; Alkuhlani et al., 2021; Audagnotto & Dal Peraro, 2017).
These authors primarily focused on the feature set construction algorithm and the training algorithm, with little or no focus on quality assessment criteria, performance metric evaluation and the species type of the reviewed articles used to predict N-linked sites. The proposed systematic review provides novel features such as the targeted channel, quality assessment score, new classification criteria, and performance evaluation based on accuracy, sensitivity, and specificity after evaluating the studies empirically. This SLR will help medical scientists in the targeted identification of cancer and type I diabetic cells for treating patients, and will help pharmacists in effective drug development by choosing an accurate predictor of N-linked sites. Furthermore, it will help researchers develop more accurate and efficient predictive models by analyzing the techniques used in existing work. The article is organized as follows: the methodology adopted to conduct the survey, along with the objectives and research questions, is presented in “Survey methodology”. The analysis of the research questions is described in “Assessment and discussion”. “Discussion and future direction” presents a synthesis of the reviewed literature. Finally, the article is concluded in “Conclusion”.
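The idea of turning a protein sequence into a feature vector, mentioned in the introduction, can be sketched as follows. This is a minimal illustration that assumes a fixed-size window around a candidate asparagine and a plain binary (one-hot) encoding; the window size, the padding symbol `X` and the function names are assumptions made for this example and do not reproduce the feature construction of any specific reviewed study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
PAD = "X"  # assumed padding symbol for window positions beyond the sequence ends


def window_around(sequence: str, position: int, flank: int = 3) -> str:
    """Return the residues within `flank` of a 1-based position, padded with PAD."""
    idx = position - 1
    padded = PAD * flank + sequence.upper() + PAD * flank
    # After left-padding, the centre residue sits at idx + flank,
    # so the window starts at idx and spans 2 * flank + 1 characters.
    return padded[idx : idx + 2 * flank + 1]


def one_hot(window: str) -> list[int]:
    """Encode each residue as 20 binary values; padding encodes as all zeros."""
    vector: list[int] = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Hypothetical toy sequence with a candidate Asn at position 2.
window = window_around("MNGTANPSKNNSTL", position=2, flank=3)
features = one_hot(window)
print(window, len(features))
```

Vectors of this kind, built for known glycosylated and non-glycosylated sites, are what a classifier would be trained on; the reviewed studies use richer encodings such as statistical moments, PseAAC or structural features.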

Survey methodology

The survey methodology consists of three phases: planning the review, conducting the review, and concluding it, as shown in Fig. 1.
Figure 1

Research strategy.

Review plan

The process involved to conduct the review is shown in Fig. 2.
Figure 2

Research strategy.

Review conduct

The steps involved in conducting the review were: (a) searching for relevant primary studies in different search venues; (b) selecting relevant research articles from the search results of the previous step using predefined inclusion/exclusion criteria; (c) assigning each selected article a score based on the defined quality parameters; (d) backward snowballing to include further important articles.

Automated search in digital library

The relevant research articles were retrieved through a systematic search that combined automatic and manual searching. The following digital libraries were used as search venues:
Google Scholar (http://scholar.google.com/)
IEEE Xplore (https://ieeexplore.ieee.org/search/advanced)
Springer Link (https://link.springer.com/)
Bioinformatics (https://academic.oup.com/bioinformatics)
PLOS ONE (https://journals.plos.org/plosone/)
To obtain appropriate and relevant search results, a keyword-based search was applied to each digital venue. Keywords for the primary and secondary terms were selected based on the RQs listed in Table 2. The Boolean operators ‘AND’ and ‘OR’ were used to build the query string. The keywords used for the search query are shown in Fig. 3. The keywords are organized into three groups, each containing similar keywords, to retrieve as many relevant studies as possible, as mentioned in Table 3. In the final search query, the AND operator is applied between groups and the OR operator is applied between the keywords within a group.
Table 2

Research questions and objective.

RQ | Research question | Research objective/motivation
RQ1 | Which are the relevant publishing channels for N-linked glycosylation research? Which channel types and geographical areas target this research? | To identify: high-quality publishing venues; research published from 2017 until October 2021; scientometric analysis based on meta-information including research type, approaches and validation methods.
RQ2 | Which are the existing prediction models (tools) used for the identification of N-linked glycosylation sites, and for which kinds of species are these sites identified? | To help researchers identify diseases (e.g., cancer, type 1 diabetes) and support drug discovery through a cost-effective and time-saving approach.
RQ3 | Which algorithms or methods are used to construct the N-linked feature vector? | To understand the in-depth structure of protein sequences in order to extract useful information for training a model.
RQ4 | Which algorithms or methods are used to train the N-linked model? | To develop an efficient tool to predict N-linked sites through a computational approach.
RQ5 | How effective are the existing models at predicting N-linked sites? | By evaluating: (1) availability of the data set; (2) availability of the tool; (3) the accuracy measures, including accuracy, sensitivity and specificity; (4) comparison of results with existing studies.
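RQ5 evaluates predictors by accuracy (ACC), sensitivity (SN) and specificity (SP). As a reminder of what these quantities measure, the sketch below computes them from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical; they are not results from any reviewed study.

```python
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute ACC, SN and SP from confusion-matrix counts.

    tp/fn: glycosylated sites predicted correctly/incorrectly.
    tn/fp: non-glycosylated sites predicted correctly/incorrectly.
    """
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # share of all sites classified correctly
        "SN": tp / (tp + fn),                    # share of true sites that are recovered
        "SP": tn / (tn + fp),                    # share of non-sites correctly rejected
    }


# Hypothetical counts, for illustration only.
print(evaluate(tp=40, tn=45, fp=5, fn=10))
```

Reporting all three together matters because a data set dominated by non-glycosylated sites can yield high accuracy even when sensitivity is poor.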
Figure 3

Keyword used to develop query string.

Listing 1
[“n linked” OR “Post translation modification”] AND [“Glycosylation sites” OR “Glycan”] AND [“prediction model” OR “Artificial Intelligence” OR “Neural Network” OR “Deep Learning”]
Primary keywords were selected as key identifiers for N-linked prediction models and were combined with secondary and additional keywords. The combinations of keywords and Boolean operators developed for each digital library are listed in Table 3.
Table 3

Search group used for search query.

Digital library | Search query | Applied filter
IEEE Xplore | (“n linked” OR “Post translation modification”) AND (“prediction model” OR “Artificial Intelligence” OR “Neural Network” OR “Deep Learning”) | 2017–2021
Springer Link | (“n linked” OR “Post translation modification”) AND (“Glycosylation sites” OR “Glycan”) AND (“prediction model” OR “Artificial Intelligence” OR “Neural Network” OR “Deep Learning”) | 2017–2021
Bioinformatics | (n linked OR Post translation modification) AND (Glycosylation sites OR Glycan) AND (prediction model OR Artificial Intelligence OR Neural Network OR Deep Learning) | 2017–2021
PLOS ONE | (“n linked”) AND (“Glycosylation”) AND (“Neural Network” OR “Deep Learning”) | 2017–2021
Google Scholar | (“n linked” OR “Post translation modification”) AND (“Glycosylation sites” OR “Glycan”) AND (“prediction model” OR “Artificial Intelligence” OR “Neural Network” OR “Deep Learning”) | 2017–2021
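The query construction described above (OR between keywords within a group, AND between groups) can be sketched in a few lines. The keyword groups are copied from the search string in the text; the function name is hypothetical, and the sketch uses plain ASCII quotes and parentheses, as most of the rows of Table 3 do.

```python
def build_query(groups: list[list[str]]) -> str:
    """Join keywords with OR inside each group, then join the groups with AND."""
    clauses = []
    for keywords in groups:
        quoted = " OR ".join(f'"{keyword}"' for keyword in keywords)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Keyword groups taken from the search string in the text.
GROUPS = [
    ["n linked", "Post translation modification"],
    ["Glycosylation sites", "Glycan"],
    ["prediction model", "Artificial Intelligence", "Neural Network", "Deep Learning"],
]

print(build_query(GROUPS))
```

Dropping or trimming a group, as in the IEEE Xplore and PLOS ONE rows of Table 3, only requires passing a different list of groups.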
1. Inclusion criteria
The article must address the prediction of N-linked glycosylation sites or of glycosylation sites in general.
It must target at least one of the research questions listed in Table 2.
It must have been published in a journal or a preprint repository since 2017.
It must use a computational or semi-computational approach for prediction.
2. Exclusion criteria
Eliminate articles that do not address N-linked glycosylation or glycosylation.
Eliminate articles that identify N-linked sites purely through biological experimentation.
Eliminate books that appear in the results of the search query.

Quality assessment as selection criteria

Quality assessment (QA) is a major step in conducting any systematic review. In this study, a questionnaire was designed to measure the quality of the selected articles. The score is computed on the following criteria:
(a) A study is awarded a score of 1 if an N-linked predictive tool has been developed; otherwise it scores 0.
(b) A study is awarded a score of 2 if the method used to extract features from the data is computational, 1 for a hybrid approach, and 0 for an experimental approach.
(c) A study is awarded a score of 1 if the computational method used for training is provided; otherwise it scores 0.
(d) A score of 1 is awarded if the data set used is available; otherwise the study scores 0.
(e) A score of 1 is awarded if the organism type is given; otherwise the study scores 0.
(f) The studies are rated according to journal and conference ranking lists. The possible scores for the publication source are shown in Table 4.
Table 4

Possible rating for recognized and stable publication score.

Publication source | +4 | +3 | +2 | +1 | 0
Journals | Q1 | Q2 | Q3 | Q4 | No JCR ranking
Conference | CORE A* | CORE A | CORE B | CORE C | Not in CORE ranking
The resulting score for each study is calculated by summing the points awarded for all questions. Articles achieving a minimum score of 5 are included in the review.
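The scoring scheme described in this section can be sketched as a small function. The point values follow the criteria in the text and Table 4; the class, field and constant names are hypothetical, and mapping a venue rank string to points is an assumption about how the rating lists would be applied in practice.

```python
from dataclasses import dataclass

# Points per criterion, as described in the text and Table 4.
FEATURE_POINTS = {"computational": 2, "hybrid": 1, "experimental": 0}
VENUE_POINTS = {
    "Q1": 4, "Q2": 3, "Q3": 2, "Q4": 1,                          # journals (JCR quartile)
    "CORE A*": 4, "CORE A": 3, "CORE B": 2, "CORE C": 1,          # conferences (CORE rank)
}
MIN_SCORE = 5  # inclusion threshold stated in the text


@dataclass
class Study:
    tool_developed: bool
    feature_approach: str       # "computational", "hybrid" or "experimental"
    training_method_given: bool
    dataset_available: bool
    organism_given: bool
    venue_rank: str             # e.g. "Q1" or "CORE B"; unranked venues score 0


def qa_score(study: Study) -> int:
    """Sum the points of all quality assessment questions."""
    return (
        int(study.tool_developed)
        + FEATURE_POINTS[study.feature_approach]
        + int(study.training_method_given)
        + int(study.dataset_available)
        + int(study.organism_given)
        + VENUE_POINTS.get(study.venue_rank, 0)
    )


def is_included(study: Study) -> bool:
    return qa_score(study) >= MIN_SCORE


# Hypothetical study: no tool, computational features, everything else provided, Q1 journal.
example = Study(False, "computational", True, True, True, "Q1")
print(qa_score(example), is_included(example))  # 9 True
```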

Selection based on snowballing

After the quality assessment, backward snowballing was performed to extract relevant articles from the references of the selected articles. The articles by Kumar et al. (2020) and Ilyas et al. (2019) were shortlisted after applying the inclusion/exclusion criteria and the quality assessment.

Review report

The identification of glycosylation sites, especially N-linked sites, is an important domain. In this review, a systematic and empirical method was therefore adopted to extract relevant articles from the digital libraries listed in Table 3, using the query string shown in Listing 1. Almost 800 articles remained after removing those published before 2017. These articles were then filtered based on title, abstract and introduction, and the full article was examined where required. Articles shorter than four pages and irrelevant articles were eliminated. The results of the primary search, filtering and inspection phases, covering five digital libraries, are presented in Table 5.
Table 5

Selection phase and results.

Phase | Selection | Selection criteria | PLOS ONE | Bioinformatics | Springer Link | IEEE Xplore | Google Scholar | Total articles
1 | Search | Keyword (Fig. 2) | 21 | 3 | 47 | 4 | 770 | 845
2 | Filtering | Title | 15 | 3 | 18 | 3 | 212 | 251
3 | Filtering | Abstract | 10 | 3 | 13 | 3 | 160 | 189
4 | Filtering | Introduction and conclusion | 6 | 2 | 7 | 3 | 125 | 143
5 | Inspection | Full article | 1 | 1 | 2 | 2 | 62 | 68
After preprocessing the articles, the inclusion/exclusion test was performed and the quality assessment score was then computed. Articles with a score of at least five were included in this study, giving a total of 70, as listed in Table 6.
Table 6

Classification criteria

Sr. No. | Ref. No. | P.Year | P.Channel | Research type | Empirical type | Species | PTM type | Feature set method | Model training algorithm | Model | QA points (a) to (f) | Score
1 | Akmal, Rasool & Khan (2017) | 2017 | Journal | Solution | Computational | Human | N-linked | Position relative and Statistical Moments | ANN/Back propagation | - | 0, 2, 1, 1, 1, 4 | 9
2 | Chien et al. (2020) | 2020 | Journal | Solution | Computational | Human and Mouse | N-linked | Sequence, Structure and Function feature | XGBoost | N-GlycoGo | 1, 2, 1, 0, 1, 4 | 9
3 | Taherzadeh et al. (2019) | 2019 | Journal | Solution | Computational | Human and Mouse | N-linked and O-linked | Sequence and Structure | Deep ANN and SVM | Sprint-Gly | 1, 2, 1, 1, 1, 4 | 10
4 | Tran, Pham & Ou (2021) | 2021 | Journal | Solution | Computational | Human and Mouse | N-linked | Word embedding Vector Technique | RM, KNN, SVM and XGBoost | - | 0, 2, 1, 1, 0, 4 | 8
5 | Liu et al. (2019) | 2019 | Journal | Solution | Computational | Human | N-linked | Sequence | ANN | NetGlyco (Existing) | 1, 2, 1, 1, 1, 4 | 10
6 | Li et al. (2019) | 2019 | Journal | Solution | Computational | Human | N-linked (and C/O-linked) | Sequence and Structure Feature | PA2DE using AlphaMax | GlycoMine_PU | 1, 2, 1, 1, 1, 4 | 10
7 | Bojar et al. (2021b) | 2021 | Journal | Solution | Hybrid | Eukaryote | Glycosylation | Sequence feature | Recurrent NN (LSTM) | SweetOrigin | 0, 2, 1, 1, 1, 4 | 9
8 | Thomès, Burkholz & Bojar (2021) | 2021 | Journal | Solution | Computational | Animal | N-linked and O-linked | - | - | GlycoWork | 1, 2, 1, 1, 1, 4 | 10
9 | Carpenter et al. (2022) | 2021 | bioRxiv | Solution | Computational | Not Mention | Glycosylation | Fingerprint Encoding | MNN (ADAM) | GlyNet | 1, 2, 1, 0, 1, 0 | 5
10 | Pitti et al. (2019) | 2019 | Journal | Solution | Computational | Human | N-linked | Similarity voting and Gap Peptide | SVM | NGlyDE | 1, 2, 1, 1, 1, 4 | 10
11 | Lundstrøm et al. (2022) | 2021 | bioRxiv | Solution | Computational | Human | Glycosylation | Protein-Glycan Sequence Feature | Graph CNN | LectinOracle | 0, 2, 1, 1, 1, 0 | 5
12 | Burkholz, Quackenbush & Bojar (2021) | 2021 | Journal | Solution | Computational | Human | Glycosylation | Graph and Statistical feature | Graph NN | SweetNet | 1, 2, 1, 1, 1, 4 | 10
13 | Kotidis & Kontoravdi (2020) | 2020 | Journal | Solution | Hybrid | Human | N-linked | - | ANN/Kinetic Model | - | 0, 0, 1, 1, 1, 4 | 7
14 | Lee et al. (2021) | 2021 | Journal | Solution | Experimental | Mammalian | Glycosylation | - | MS | - | 0, 0, 0, 0, 1, 4 | 5
15 | Alkuhlani et al. (2021) | 2021 | Journal | Review | Computational | Human | Glycosylation | Computational | AI | - | 1, 2, 1, 1, 4 | 9
16 | Adolf-Bryfogle et al. (2021) | 2021 | Journal | Solution | Computational | Not Mention | N-linked | KDE | Glycan Tree Modeler | Rosetta Carbohydrate Framework | 1, 2, 1, 1, 0, 0 | 5
17 | Sha et al. (2019) | 2019 | Journal | Solution | Experimental | Human | N-linked | Flux Balance Analysis | Kinetic | - | 0, 1, 0, 1, 1, 2 | 5
18 | Zhang et al. (2021a) | 2021 | Journal | Solution | Experimental | Human | N-linked | MS | - | - | 0, 1, 0, 0, 1, 4 | 6
19 | Park et al. (2019) | 2019 | Journal | Solution | Computational | Human | N-linked and O-linked | Sequence and Structure | Clustering | Glycan Reader and Modeler | 1, 2, 1, 0, 0, 4 | 8
20 | Zhang et al. (2021c) | 2021 | Journal | Solution | Experimental | Human | N-linked | - | - | - | 0, 0, 0, 0, 1, 4 | 5
21 | Xiang, Zou & Zhao (2021) | 2021 | Journal | Solution | Computational | Human | Glycosylation (O-linked) | Feature set selected using SVM then mRMR | SVM, RF and NB | VPTMdb | 1, 2, 1, 1, 1, 4 | 10
22 | Antonakoudis et al. (2021) | 2021 | Journal | Solution | Hybrid | Human | N-linked | Stoichiometric | ANN | - | 0, 1, 1, 1, 1, 4 | 8
23 | Huang et al. (2017) | 2017 | Journal | Solution | Experimental | Mammalian | N-linked | - | - | - | 0, 1, 0, 0, 1, 4 | 6
24 | Naseer et al. (2021b) | 2021 | Journal | Solution | Computational | Not Mention | PTM (Amidation) | PseAAC | CNN | IAmideV-deep | 1, 2, 1, 1, 0, 2 | 7
25 | Hwang et al. (2020) | 2020 | Journal | Solution | Hybrid | Human | N-linked | IQ-GPA human plasma protein | DNN | - | 0, 1, 2, 0, 4 | 7
26 | Coff et al. (2020) | 2020 | Journal | Solution | Computational | Human and Avian | Glycosylation | Frequent Subtree mining and mRMR | Regression Classifier | CCARL | 1, 2, 1, 1, 1, 4 | 10
27 | Le, Sandag & Ou (2018) | 2018 | Journal | Solution | Computational | Human | PTM (including N-linked) | Statistical Moment and F score | RBF Network | PTM Transporter | 1, 2, 1, 1, 1, 2 | 8
28 | He, Wei & Zou (2019) | 2019 | Journal | Review | Computational | Not Mention | N-linked | Provided | Provided | Provided | 1, 2, 1, 0, 0, 4 | 8
29 | Audagnotto & Dal Peraro (2017) | 2017 | Journal | Review | Computational | Not Mention | N-linked | - | Provided | Provided | 1, 0, 1, 1, 0, 4 | 7
30 | Krasnova & Wong (2019) | 2019 | Journal | Review | Experimental | Human | N-linked and O-linked | - | - | - | 0, 0, 1, 4 | 5
31 | Kellman & Lewis (2021) | 2020 | Journal | Solution | Experimental | Human | Glycan (including N) | - | - | - | 1, 1, 4 | 6
32 | Huang et al. (2021) | 2021 | Journal | Solution | Computational | Not Mention | Glycosylation (O-linked) | Sequence feature | RF | OGP-Based | 1, 2, 1, 1, 0, 4 | 9
33 | Shek, Kotidis & Betenbaugh (2021) | 2021 | Journal | Review | Computational | Not Mention | Glycosylation | - | Provided | Provided | 1, 2, 1, 0, 0, 4 | 8
34 | Mondragon-Shem et al. (2020) | 2020 | Journal | Solution | Hybrid | Human | N-linked | MS | - | Existing Tool | 1, 1, 0, 1, 1, 4 | 8
35 | Wilson et al. (2021) | 2021 | Journal | Solution | Experimental | Human | Glycosylation (including N) | - | - | - | 0, 1, 0, 1, 1, 2 | 5
36 | Zhang et al. (2021b) | 2021 | Journal | Solution | Computational | Mammalian | N-linked | Unknown Parameter and Structure | Bayesian Network | - | 0, 1, 1, 0, 1, 4 | 7
37 | Hua11 (2019) | 2019 | Conference | Solution | Computational | Human | Protein Prediction | Frequency Feature of AA and EH Method | SVM and NN | PPSNN | 1, 2, 1, 0, 1, 0 | 5
38 | Zhao et al. (2020) | 2020 | Journal | Solution | Experimental | Human | N-linked | - | MS | - | 0, 0, 1, 0, 1, 4 | 6
39 | Wang et al. (2017) | 2017 | Journal | Solution | Hybrid | Human | N-linked | CfsSubSetEval | SVM | - | 0, 1, 2, 0, 1, 4 | 8
40 | Badgett et al. (2018) | 2018 | Journal | Solution | Experimental | Human | N-linked | - | MS | - | 0, 0, 1, 1, 1, 2 | 5
41 | Suga, Nagae & Yamaguchi (2018) | 2018 | Journal | Solution | Hybrid | Human | N-linked | Structural Feature | Maturation | - | 0, 1, 1, 1, 1, 4 | 8
42 | Bao et al. (2019) | 2019 | Journal | Solution | Computational | Human | Glycosylation and Phosphorylation | Membrane Buried, Conformational and average Flexible Indices | NN+ELM+SVM | CMSENN | 1, 2, 1, 0, 1, 3 | 8
43 | de Souza et al. (2019) | 2019 | Journal | Solution | Experimental | Human | N-linked | - | MS | - | 0, 0, 0, 1, 1, 4 | 6
44 | Jiang et al. (2018) | 2018 | Journal | Solution | Computational | Not Mention | Glycosylation (O-linked) | KPCA and FUS | Rotation Forest | OGLYCPred | 1, 2, 1, 1, 4 | 9
45 | Kuo-Chen (2019) | 2019 | Journal | Review | Computational | Human | Non-Glycosylation | KC Chou’s 5 step | - | Provided | 1, 1, 0, 1, 1, 1 | 5
46 | Dimeglio et al. (2020) | 2020 | Journal | Solution | Hybrid | Human | N-linked | Statistical Moment | ANN | THETA Model | 1, 2, 1, 1, 1, 4 | 10
47 | Dobson, Zeke & Tusnády (2021) | 2021 | Journal | Solution | Hybrid | Human | Protein Traffic membrane (N and O) | Topology and Putative SLiMs | CNN with Adam | PolarportPred | 1, 1, 1, 0, 1, 4 | 8
48 | Kumar et al. (2020) | 2020 | Conference | Solution | Hybrid | Human | PTM | Physicochemical, structural and PTM | ML | - | 0, 2, 1, 1, 1, 0 | 5
49 | Ilyas et al. (2019) | 2019 | Journal | Solution | Computational | Human | PTM | Chou’s 5-steps | ANN | - | 0, 2, 1, 1, 1, 2 | 7
50 | Yang & Han (2017) | 2017 | Journal | Solution | Computational | Mammalian | Glycosylation (O-linked) | Protein factor base Features | KNN | - | 0, 2, 1, 0, 1, 2 | 6
51 | Magaret et al. (2019) | 2019 | Journal | Solution | Hybrid | Not Mention | N-linked | Sequences | RM, Super Learner and Glmnet | - | 0, 1, 1, 1, 0, 4 | 7
52 | Sugár et al. (2021) | 2021 | Journal | Solution | Experimental | Human | N-linked | - | NanoLC-MS | - | 1, 1, 0, 1, 1, 4 | 8
53 | Campbell (2017) | 2017 | Journal | Review | Hybrid | Human | Glycosylation | Partial Mentioned | Partial Mentioned | - | 1, 1, 1, 0, 1, 1 | 5
54 | Jia, Zuo & Zou (2018) | 2018 | Journal | Solution | Computational | Not Mention | Glycosylation (O-linked) | FUS and KPCA | KNN, RM, SVM and NB (SVM outperforms) | O-GlcNAcPRED-II | 1, 2, 1, 1, 0, 4 | 9
55 | Ferreira et al. (2021) | 2021 | Journal | Solution | Experimental | Human | N-linked | - | MS | - | 0, 1, 0, 0, 1, 4 | 6
56 | Ye & Vakhrushev (2021) | 2021 | Journal | Solution | Experimental | Human | Glycosylation | - | MS | - | 0, 0, 0, 0, 1, 4 | 5
57 | Bojar et al. (2021a) | 2021 | bioRxiv | Solution | Hybrid | Human | N-linked | Sequence | ML | - | 0, 2, 1, 1, 1, 0 | 5
58 | Desaire, Patabandige & Hua (2021) | 2021 | Journal | Solution | Hybrid | Not Mention | Glycosylation | MS | SVM | - | 0, 1, 1, 1, 0, 4 | 7
59 | Chen et al. (2021) | 2021 | Journal | Solution | Computational | Human | PTM | Binary Encoding, AAC, EAAC and Dipeptide | Deep Learning | CNNrgb | 1, 2, 1, 1, 1, 4 | 10
60 | Zou et al. (2017) | 2017 | Conference | Solution | Computational | Human | Glycosylation (O-linked) | Vector Word | SVM | GLycoCell | 1, 2, 1, 1, 1, 0 | 6
61 | Perpetuo et al. (2021) | 2021 | Journal | Solution | Computational | Human | PTM | Sequences | AI | - | 0, 2, 1, 0, 1, 3 | 7
62 | Li et al. (2020) | 2020 | Journal | Solution | Computational | Not Mention | Protein | AAC, PseAAC, NC, PseKNC | AdaBoost and random forest | PPAI | 1, 2, 1, 0, 0, 4 | 8
63 | Lei, Tang & Du (2017) | 2017 | Journal | Solution | Hybrid | Not Mention | PTM (S-sulfenylated) | Physicochemical and Clustering Method | Ensemble Classifier | - | 0, 1, 2, 1, 0, 1 | 5
64 | Murad et al. (2021) | 2021 | bioRxiv | Solution | Computational | Not Mention | PTM (Ubiquitination) | Statistical Moment | Random Forest | UBISites-SRF | 1, 2, 1, 0, 1, 0 | 5
65 | Qiu et al. (2018) | 2018 | Journal | Solution | Computational | Not Mention | PTM (Lipoylation) | Biprofile Bayes Encoding | SVM | LipoPred | 1, 1, 1, 0, 0, 3 | 6
66 | Yang et al. (2019) | 2019 | Journal | Solution | Hybrid | Human | PTM | SNP | - | Awesome | 1, 1, 1, 1, 1, 4 | 9
67 | Liu et al. (2021) | 2021 | Journal | Solution | Computational | Human | PTM | UbiSite-XGBoost | Extreme gradient boosting classifier | UbiSite-XGBoost | 1, 2, 1, 1, 1, 3 | 9
68 | Ruiz-Blanco et al. (2017) | 2017 | Journal | Solution | Computational | Human | N-linked | ProDCal | Jrip Classifier | Sequon | 0, 2, 1, 1, 1, 4 | 9
69 | Kumari, Kumar & Kumar (2018) | 2018 | Journal | Solution | Computational | Human | Palmitoylation | PSSM | SVM | RAREPalm | 1, 2, 1, 0, 1, 3 | 8
70 | Huang & Li (2018) | 2018 | Journal | Solution | Computational | Human | PTM | Sequence, Structure | KNN | - | 0, 2, 1, 1, 1, 4 | 9

Assessment and discussion

In this section, the research questions have been analyzed based on 70 primary studies.

Assessment of q1:

Which are the relevant publishing channels for N-linked glycosylation research? Which channel types and geographical areas target this research?

Identifying the relevant publishing channels, channel types, and geographical distribution of N-linked glycosylation research requires meta-information. For this purpose, the channel type, publishing year, and demographic distribution of the selected studies are presented. The importance of the topic can be gauged from the yearly publication count in the domain: 28 of the 70 articles (40%) were published in 2021, as shown in Fig. 4.
Figure 4

Year wise distribution of publication.

Fig. 5 shows that the largest share of the studies was published in recognized journals, followed by international conferences.
Figure 5

Percentage of publication channel.

As shown in Fig. 6, 42 of the 70 studies were published in different regions of Europe.
Figure 6

Demographical distribution of publication.

The quality assessment score awarded to each finalized study, according to the criteria defined in the “Quality assessment as selection criteria” section, is shown in Table 7. Only studies that meet the minimum threshold are listed. Articles published in Q1 journals achieve the highest scores, which will help researchers find relevant publishing venues for N-linked and other glycosylation site prediction studies. About 50% of the studies achieve a score of eight or above, which indicates that the developed query string retrieved relevant studies.
Table 7

Quality assessment score.

Reference | QA score | Total articles
Taherzadeh et al. (2019), Liu et al. (2019), Li et al. (2019), Thomès, Burkholz & Bojar (2021), Pitti et al. (2019), Burkholz, Quackenbush & Bojar (2021), Xiang, Zou & Zhao (2021), Coff et al. (2020), Dimeglio et al. (2020), Chen et al. (2021) | 10 | 10
Akmal, Rasool & Khan (2017), Chien et al. (2020), Bojar et al. (2021b), Alkuhlani et al. (2021), Huang et al. (2021), Jiang et al. (2018), Jia, Zuo & Zou (2018), Yang et al. (2019), Liu et al. (2021), Ruiz-Blanco et al. (2017), Huang & Li (2018) | 9 | 11
Tran, Pham & Ou (2021), Park et al. (2019), Antonakoudis et al. (2021), Le, Sandag & Ou (2018), He, Wei & Zou (2019), Shek, Kotidis & Betenbaugh (2021), Mondragon-Shem et al. (2020), Wang et al. (2017), Suga, Nagae & Yamaguchi (2018), Bao et al. (2019), Dobson, Zeke & Tusnády (2021), Sugár et al. (2021), Li et al. (2020), Kumari, Kumar & Kumar (2018) | 8 | 14
Kotidis & Kontoravdi (2020), Naseer et al. (2021b), Hwang et al. (2020), Audagnotto & Dal Peraro (2017), Zhang et al. (2021b), Ilyas et al. (2019), Magaret et al. (2019), Desaire, Patabandige & Hua (2021), Perpetuo et al. (2021) | 7 | 9
Zhang et al. (2021a), Huang et al. (2017), Kellman & Lewis (2021), Zhao et al. (2020), de Souza et al. (2019), Yang & Han (2017), Ferreira et al. (2021), Zou et al. (2017), Qiu et al. (2018) | 6 | 9
Carpenter et al. (2022), Lundstrøm et al. (2022), Lee et al. (2021), Adolf-Bryfogle et al. (2021), Sha et al. (2019), Zhang et al. (2021c), Krasnova & Wong (2019), Wilson et al. (2021), Hua11 (2019), Badgett et al. (2018), Kuo-Chen (2019), Kumar et al. (2020), Campbell (2017), Ye & Vakhrushev (2021), Bojar et al. (2021a), Lei, Tang & Du (2017), Murad et al. (2021) | 5 | 17
The overall classification results and QA scores of the studies are presented in Table 6. The finalized articles are classified on seven parameters: research type (proposed solution or review article), empirical type (computational approach, experimental approach based on biological studies, or hybrid approach combining computational and biological study), glycosylation type, species type, method (used for feature extraction), algorithm (used to train the predictive model) and tool (developed for prediction). Furthermore, the sources of the finalized studies and the number and percentage of studies per publication source are given in Table 8.
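The share of studies scoring eight or above can be checked directly from the article counts in Table 7. The sketch below is plain arithmetic over those counts; the variable names are hypothetical.

```python
# Number of articles per QA score, copied from Table 7.
QA_DISTRIBUTION = {10: 10, 9: 11, 8: 14, 7: 9, 6: 9, 5: 17}

total = sum(QA_DISTRIBUTION.values())
high_scoring = sum(count for score, count in QA_DISTRIBUTION.items() if score >= 8)
share = high_scoring / total

print(f"{high_scoring} of {total} studies score 8 or above ({share:.0%})")
# prints: 35 of 70 studies score 8 or above (50%)
```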
Table 8

Count and percentage of articles per publication channel.

Publication source | Reference | Count | %age
Amino Acids | Ruiz-Blanco et al. (2017) | 1 | 1
Analytical and Bioanalytical Chemistry | Desaire, Patabandige & Hua (2021) | 1 | 1
Bioinformatics | Taherzadeh et al. (2019), Jiang et al. (2018), Dimeglio et al. (2020), Dobson, Zeke & Tusnády (2021), Jia, Zuo & Zou (2018) | 5 | 7
bioRxiv | Carpenter et al. (2022), Lundstrøm et al. (2022), Adolf-Bryfogle et al. (2021), Bojar et al. (2021a), Murad et al. (2021) | 5 | 7
Biotechnology and Bioengineering | Zhang et al. (2021b) | 1 | 1
BMC Bioinformatics | Li et al. (2019), Coff et al. (2020), Li et al. (2020) | 3 | 4
Briefings in Bioinformatics | Xiang, Zou & Zhao (2021) | 2 | 3
Briefings in Functional Genomics | He, Wei & Zou (2019) | 1 | 1
Cell Host Microbe | Bojar et al. (2021b) | 1 | 1
Cell Reports | Burkholz, Quackenbush & Bojar (2021) | 1 | 1
Chemometrics and Intelligent Laboratory Systems | Bao et al. (2019), Qiu et al. (2018) | 2 | 3
Computational and Structural Biotechnology Journal | Audagnotto & Dal Peraro (2017) | 1 | 1
Computational Biology and Chemistry | Le, Sandag & Ou (2018), Yang & Han (2017) | 2 | 3
Computers Chemical Engineering | Antonakoudis et al. (2021) | 1 | 1
Computers in Biology and Medicine | Tran, Pham & Ou (2021) | 1 | 1
Current Bioinformatics | Alkuhlani et al. (2021), Huang & Li (2018) | 2 | 3
Current Genomics | Ilyas et al. (2019) | 1 | 1
Current Opinion in Chemical Engineering | Shek, Kotidis & Betenbaugh (2021) | 1 | 1
Environmental Microbiology | Zhang et al. (2021c) | 1 | 1
Expert Review of Proteomics | Perpetuo et al. (2021) | 1 | 1
Frontiers in Endocrinology | Zhang et al. (2021c) | 1 | 1
Fuzzy Systems and Data Mining | Hua11 (2019) | 1 | 1
Genomics, Proteomics Bioinformatics | Huang et al. (2021), Ferreira et al. (2021) | 2 | 3
Glycobiology | Thomès, Burkholz & Bojar (2021), Park et al. (2019), Suga, Nagae & Yamaguchi (2018) | 3 | 4
IEEE Access | Chien et al. (2020) | 1 | 1
IEEE International Conference on Machine Learning and Applied Network Technologies (ICMLANT) | Kumar et al. (2020) | 1 | 1
International Conference of Pioneering Computer Scientists, Engineers and Educators (Springer, Singapore) | Zou et al. (2017) | 1 | 1
Journal of Biomolecular Techniques | Badgett et al. (2018) | 1 | 1
Journal of Computational Biology | Kumari, Kumar & Kumar (2018) | 1 | 1
Journal of Molecular Graphics and Modelling | Liu et al. (2021) | 1 | 1
Journal of Proteomics | Zhao et al. (2020), de Souza et al. (2019) | 2 | 3
Journal of the American Chemical Society | Huang et al. (2017), Krasnova & Wong (2019) | 2 | 3
Letters in Organic Chemistry | Lei, Tang & Du (2017) | 1 | 1
Mathematical Bioscience | Liu et al. (2019) | 1 | 1
Metabolic Engineering Communications | Kotidis & Kontoravdi (2020) | 1 | 1
Molecular Cellular Proteomic | Ye & Vakhrushev (2021) | 1 | 1
Nature Communications | Wang et al. (2017) | 1 | 1
Nucleic Acids Research | Yang et al. (2019) | 1 | 1
PLoS Computational Biology | Magaret et al. (2019) | 1 | 1
PLOS ONE | Akmal, Rasool & Khan (2017) | 1 | 1
Processes | Sha et al. (2019) | 1 | 1
Scientific Reports | Pitti et al. (2019), Hwang et al. (2020), Mondragon-Shem et al. (2020), Sugár et al. (2021) | 4 | 6
Symmetry | Naseer et al. (2021b) | 1 | 1
The American Journal of Human Genetics | Wilson et al. (2021) | 1 | 1
Trends Artifi. Intell | Kuo-Chen (2019) | 1 | 1
Trends in Biochemical Science | Kellman & Lewis (2021) | 1 | 1
Trends in Glycoscience and Glycotechnology | Campbell (2017) | 1 | 1
Trends in Microbiology | Lee et al. (2021) | 1 | 1
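
The %age column in Table 8 appears to be each source's article count divided by the 70 selected studies and rounded to the nearest whole percent. The following Python sketch reproduces that arithmetic; the rounding rule and the function name are assumptions made for illustration, not something the review states.

```python
def percentage_share(count: int, total: int = 70) -> int:
    """Return `count` as a whole-number percentage of `total`.

    Assumption: values are rounded half up, which matches the
    count/percentage pairs listed in Table 8 (1->1, 2->3, 3->4, 4->6, 5->7).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return int(count * 100 / total + 0.5)


if __name__ == "__main__":
    # Print the share for each article count that occurs in Table 8.
    for count in (1, 2, 3, 4, 5):
        print(count, percentage_share(count))
```

The counts in Table 8 add up to 70, which is consistent with the number of selected articles used as the default denominator here.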

Assessment of q2:

Which existing prediction models (tools) are available for identifying N-linked glycosylation sites, and for which species are these sites identified?

This question examines the tools available for identifying N-linked glycosylation sites and the species for which each tool can identify the relevant sites. N-linked glycans sit within a hierarchy under PTM: PTMs are classified into various types, glycosylation is one of them, glycosylation is further classified into five groups, and N-linked is one of those groups. The summarized details are presented in Fig. 7. Thirteen studies (Chien et al., 2020; Taherzadeh et al., 2019; Liu et al., 2019; Li et al., 2019; Thomès, Burkholz & Bojar, 2021; Pitti et al., 2019; Adolf-Bryfogle et al., 2021; Park et al., 2019; He, Wei & Zou, 2019; Audagnotto & Dal Peraro, 2017; Mondragon-Shem et al., 2020; Dimeglio et al., 2020; Ruiz-Blanco et al., 2017) developed tools specific to N-linked site identification. A few studies (Bojar et al., 2021b; Carpenter et al., 2022; Lundstrøm et al., 2022; Burkholz, Quackenbush & Bojar, 2021; Coff et al., 2020; Shek, Kotidis & Betenbaugh, 2021) developed tools for glycosylation site identification irrespective of the specific type, and some authors (Le, Sandag & Ou, 2018; Liu et al., 2021; Yang et al., 2019; Campbell, 2017) developed tools without mentioning the type of PTM. All of these tools are listed in Table 9.
Figure 7

Tools available for N-linked site identification.

Table 9

Available tools for N-linked glycosylation.

Ref. | P. Year | Species | Tool | Finding
Bojar et al. (2021b) | 2021 | Eukaryote | SweetOrigin | Model developed to identify glycosylation sites in eukaryotes using a hybrid approach.
Thomès, Burkholz & Bojar (2021) | 2021 | Animal | GlycoWork | Computational model used to identify both N- and O-linked sites in animals.
Carpenter et al. (2022) | 2021 | Not mentioned | GlyNet | Computational model used to identify glycosylation protein sequences.
Lundstrøm et al. (2022) | 2021 | Human | LectinOracle | Computational model used to identify glycosylation protein sequences for human.
Burkholz, Quackenbush & Bojar (2021) | 2021 | Human | SweetNet | Computational model used to identify glycosylation protein sequences for human.
Adolf-Bryfogle et al. (2021) | 2021 | Not mentioned | Rosetta Carbohydrate Framework | Computational model used to identify N-linked sites; species are not mentioned.
Shek, Kotidis & Betenbaugh (2021) | 2021 | Not mentioned | Provided | Computational model used to identify glycosylation sites; species are not mentioned.
Chen et al. (2021) | 2021 | Human | CNNrgb | Computational model used to identify PTM sites for human protein.
Liu et al. (2021) | 2021 | Human | UbiSite-XGBoost | Computational model used to identify PTM sites for human protein.
Chien et al. (2020) | 2020 | Human and mouse | N-GlycoGo | Computational model used to identify N-linked sites for human and mouse protein sequences.
Coff et al. (2020) | 2020 | Human and avian | CCARL | Computational model used to identify glycosylation sites for human and avian protein sequences.
Mondragon-Shem et al. (2020) | 2020 | Human | Existing tool | Hybrid model combining experimental and computational approaches for N-linked site identification on human protein.
Dimeglio et al. (2020) | 2020 | Human | THETA Model | Hybrid model combining experimental and computational approaches for N-linked site identification on human protein.
Taherzadeh et al. (2019) | 2019 | Human and mouse | Sprint-Gly | Computational model used to identify both N- and O-linked sites in human and mouse.
Liu et al. (2019) | 2019 | Human | NetGlyco (existing) | Computational model used to identify N-linked sites in human.
Li et al. (2019) | 2019 | Human | GlycoMine_PU | Computational model used to identify N-, O- and C-linked sites in human.
Pitti et al. (2019) | 2019 | Human | NGlyDE | Computational model used to identify N-linked sites in human.
Park et al. (2019) | 2019 | Human | Glycan Reader and Modeler | Computational model used to identify both N- and O-linked sites in human.
He, Wei & Zou (2019) | 2019 | Not mentioned | Provided | Computational model used to identify N-linked sites; species is not mentioned.
Yang et al. (2019) | 2019 | Human | Awesome | Hybrid approach developed to identify PTM sites for human.
Le, Sandag & Ou (2018) | 2018 | Human | PTM Transporter | Computational approach developed for PTM sites, including N-linked sites, for human.
Audagnotto & Dal Peraro (2017) | 2017 | Not mentioned | Provided | Computational model used to identify N-linked sites; species type is missing.
Ruiz-Blanco et al. (2017) | 2017 | Human | Sequon | Computational method to identify N-linked sites for human.
It is important to specify the species on which each tool operates, so this information was also extracted from the selected studies. Some authors (He, Wei & Zou, 2019; Audagnotto & Dal Peraro, 2017; Shek, Kotidis & Betenbaugh, 2021; Carpenter et al., 2022) did not mention the organism type, while the others did. Most of the latter used human data for site identification, as shown in Table 9.
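
Sequence-based predictors of this kind typically start from candidate sites defined by the canonical N-glycosylation sequon, Asn-X-Ser/Thr, where X is any residue except proline. That motif is standard glycobiology background rather than something stated in the reviewed text, and the sketch below is only an illustration of candidate-site scanning, not the method of any tool in Table 9. The function name `find_sequons` and the example peptide are hypothetical.

```python
import re
from typing import List

# Asn followed by any residue except Pro, then Ser or Thr.
# The lookahead keeps matches zero-width after "N", so overlapping
# sequons (for example "NNSS") are each reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 1-based positions of Asn residues inside an N-X-S/T motif."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical peptide used only to exercise the scan.
    print(find_sequons("MNGTANPSNNSS"))
```

A match only marks a candidate: not every sequon is glycosylated in practice, which is why the reviewed tools train classifiers on top of such candidates.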

Assessment of q3:

Which algorithms or methods are used to construct the N-linked feature vector?

Data is the major component in developing any machine learning model (Mahmood et al., 2020; Naseer et al., 2020a, 2020b; Khan et al., 2020b). In bioinformatics, there are two major sources of data on which a model can be developed: existing repositories such as UniProt (protein repository) and GenBank (nucleotide sequences), and experimental data obtained from specific biological experiments. A dataset obtained from either source needs preprocessing to construct the feature vector. More accurate features help to develop a more efficient model (Barukab et al., 2019; Butt & Khan, 2019; Hussain, Rasool & Khan, 2020; Shah & Khan, 2020). For this reason, the feature method used to predict N-linked sites in the selected articles was taken as a parameter of this study. Most of the authors used a computational feature extraction approach, while a few used experimental data obtained from mass spectrometry, human plasma and physicochemical methods, as shown in Table 10. Most researchers (Akmal, Rasool & Khan, 2017; Chien et al., 2020; Taherzadeh et al., 2019; Liu et al., 2019; Li et al., 2019; Bojar et al., 2021b; Lundstrøm et al., 2022; Park et al., 2019; Le, Sandag & Ou, 2018; Suga, Nagae & Yamaguchi, 2018; Dimeglio et al., 2020; Magaret et al., 2019; Kumar & Gilula, 1986; Perpetuo et al., 2021; Huang & Li, 2018) used the statistical moment method, based on a combination of protein sequence, structure and function along with other parameters such as the position relevance of sequences, to construct the feature matrix from the protein dataset. The other computational methods used to construct features in the selected articles are the word embedding vector technique, UbiSite-XGBoost, similarity voting, CfsSubSetEval, Kernel Density Estimation, correlation subset and graph methods, as shown in Table 10.
Table 10

Feature methods for N-linked site identification.

Ref. | Glycotype | Feature method | Finding
Tran, Pham & Ou (2021) | N-Linked | Word embedding vector technique | Word embedding technique to efficiently predict N-linked glycosylation sites in ion channels.
Adolf-Bryfogle et al. (2021) | N-Linked | KDE | Features extracted using Kernel Density Estimation.
Antonakoudis et al. (2021) | N-Linked | Stoichiometric | Hybrid method that used experimental data with a stoichiometric approach.
Zhang et al. (2021b) | N-Linked | Unknown parameter and structure | Protein structure features and some undefined features used to construct the feature vector.
Bojar et al. (2021a) | N-Linked | Sequence | Sequence-based features were computed.
Chien et al. (2020) | N-Linked | Sequence, structure and function features | Sequence-, structure- and function-based feature set of human and mouse used to predict sites on an imbalanced dataset.
Hwang et al. (2020) | N-Linked | IQ-GPA human plasma protein | IQ-GPA procedure was used to obtain data from human plasma.
Mondragon-Shem et al. (2020) | N-Linked | MS | Hybrid method; mass spectrometry data used for training.
Liu et al. (2019) | N-Linked | Sequence | Sequence-based protein features were computed.
Pitti et al. (2019) | N-Linked | Similarity voting and gap peptide | Similarity voting and gap peptide methods used to construct features.
Magaret et al. (2019) | N-Linked | Sequences | Sequence-based protein features were computed.
Suga, Nagae & Yamaguchi (2018) | N-Linked | Structural feature | Structure-based protein features were computed.
Akmal, Rasool & Khan (2017) | N-Linked | Position relative and statistical moments | Position relative features and statistical moment based features were computed.
Wang et al. (2017) | N-Linked | CfsSubSetEval | Patients with different drug responses.
Ruiz-Blanco et al. (2017) | N-Linked | ProtDCal | ProtDCal method used to obtain protein features.
Dimeglio et al. (2020) | N-Linked | Statistical moment | Statistical moments computed to construct the feature vector.
Li et al. (2019) | N-Linked (and C/O-Linked) | Sequence and structure feature | Sequence- and structure-based protein features were computed.
Taherzadeh et al. (2019) | N-Linked and O-Linked | Sequence and structure | Sequence- and structure-based protein features were computed.
Park et al. (2019) | N-Linked and O-Linked | Sequence and structure | Sequence- and structure-based protein features were computed.
Bojar et al. (2021b) | Glycosylation | Sequence feature | Models for glycans trained on a curated dataset of 19,299 unique glycans, using sequence-based features.
Carpenter et al. (2022) | Glycosylation | Fingerprint encoding | Feature vector based on a fingerprint encoding method for predicting protein-glycan interaction.
Lundstrøm et al. (2022) | Glycosylation | Protein-glycan sequence feature | Sequence features of the combined protein and glycan are used to build the feature vector.
Burkholz, Quackenbush & Bojar (2021) | Glycosylation | Graph and statistical feature | Graph algorithm and statistical moments are used to construct the feature matrix for glycans.
Desaire, Patabandige & Hua (2021) | Glycosylation | MS | Hybrid method; mass spectrometry data used for training.
Coff et al. (2020) | Glycosylation | Frequent subtree mining and mRMR | Frequent subtree mining and mRMR used for feature vector construction.
Perpetuo et al. (2021) | PTM | Sequences | Sequence-based features used for feature vector construction.
Liu et al. (2021) | PTM | UbiSite-XGBoost | Pseudo ACC, K-spaced Acid Pair, Adapted Normal Distribution bi-profile Bayes, AA Index, Encoding Based Group Weight, LASSO, SMOTE and eXtreme Gradient Boosting feature methods are used.
Kumar et al. (2020) | PTM | Physicochemical, structural and PTM | Physicochemical, structural moment of protein and PTM sequence features were used.
Ilyas et al. (2019) | PTM | Chou’s 5-steps | Chou’s 5-steps based feature vector was used.
Yang et al. (2019) | PTM | SNP | Single Nucleotide Polymorphism approach used to compute features.
Huang & Li (2018) | PTM | Sequence, structure | Sequence- and structure-based protein features were computed.
Chen et al. (2021) | PTM | Binary encoding, AAC, EAAC and dipeptide | Various features were extracted, including binary encoding, Amino Acid Composition, Enhanced AAC and dipeptide.
Le, Sandag & Ou (2018) | PTM (including N-Linked) | Statistical moment and F-score | Statistical moments were used and then the F-score was computed.
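
To make the statistical moment idea from Table 10 concrete, the sketch below turns a peptide window into a small feature vector of raw and central moments. It is a deliberately simplified illustration: the integer encoding of residues (alphabetical one-letter order, 1 to 20) and the choice of moment orders are assumptions made here, and the reviewed studies use richer constructions such as position-relative features on top of moments.

```python
from typing import List

# Assumed encoding: each of the 20 standard residues maps to 1..20
# in alphabetical one-letter order. Real predictors may encode differently.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(peptide: str) -> List[int]:
    """Map residues to integers; raises ValueError on non-standard letters."""
    return [AMINO_ACIDS.index(residue) + 1 for residue in peptide.upper()]


def raw_moment(values: List[int], order: int) -> float:
    """k-th raw moment: mean of v**k over the encoded sequence."""
    return sum(v ** order for v in values) / len(values)


def central_moment(values: List[int], order: int) -> float:
    """k-th central moment: mean of (v - mean)**k over the encoded sequence."""
    mean = raw_moment(values, 1)
    return sum((v - mean) ** order for v in values) / len(values)


def moment_features(peptide: str, max_order: int = 3) -> List[float]:
    """Raw moments of order 1..max_order followed by central moments 2..max_order."""
    values = encode(peptide)
    features = [raw_moment(values, k) for k in range(1, max_order + 1)]
    features += [central_moment(values, k) for k in range(2, max_order + 1)]
    return features


if __name__ == "__main__":
    print(moment_features("ACD"))
```

Each peptide yields a fixed-length vector regardless of window size, which is what allows the vectors to be stacked into a feature matrix for training.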

Assessment of q4:

Which algorithms or methods are used to train the N-linked computational model?

The choice of algorithm used to train a predictive model is the most important factor affecting its performance (Butt & Khan, 2019; Hussain, Rasool & Khan, 2020; Malebary & Khan, 2021). Therefore, it is necessary to know which types of algorithm are being used to develop N-linked prediction models. For this purpose, the algorithm used to train the models in the selected articles was recorded as a parameter of this review, as shown in Table 11.
Table 11

Training algorithms (methods) used for N-linked models.

Ref. | Model training algorithm | PTM type | Finding
Akmal, Rasool & Khan (2017) | ANN/Back propagation | N-Linked | Prediction of N-linked glycosylation sites using position relative features and statistical moments through a multilayered ANN with back propagation.
Chien et al. (2020) | XGBoost | N-Linked | Extreme gradient boosting was used to predict sites on an imbalanced dataset.
Tran, Pham & Ou (2021) | RF, KNN, SVM and XGBoost | N-Linked | Various classifiers were used for prediction, including Random Forest, K-Nearest Neighbor, Support Vector Machine and XGBoost; RF performed best.
Liu et al. (2019) | ANN | N-Linked | Artificial Neural Network used to identify N-linked sites in influenza virus using an existing model on the dataset.
Pitti et al. (2019) | SVM | N-Linked | N-GlyDE: a two-stage N-linked glycosylation site prediction incorporating gapped dipeptides and pattern-based encoding, using SVM after collecting the feature vector in two stages.
Kotidis & Kontoravdi (2020) | ANN/Kinetic model | N-Linked | Artificial neural networks and a kinetic model used for predicting protein glycosylation.
Adolf-Bryfogle et al. (2021) | Glycan Tree Modeler | N-Linked | Prediction based on a tree method.
Sha et al. (2019) | Kinetic | N-Linked | A two-component modeling framework integrating FBA and a glycosylation kinetic model was used for prediction.
Antonakoudis et al. (2021) | ANN | N-Linked | Predicts N-linked sites using features computed stoichiometrically, then trains the model using an ANN with forward propagation.
Hwang et al. (2020) | DNN | N-Linked | N-linked sites identified using a DNN, which was later used to classify fucosylation.
Zhang et al. (2021b) | Bayesian network | N-Linked | Probabilistic model by Bayesian network for the prediction of antibody glycosylation in perfusion and fed-batch cell cultures.
Wang et al. (2017) | SVM | N-Linked | Drug responses identified using the SVM method.
Dimeglio et al. (2020) | ANN | N-Linked | New genotypic approach for predicting HIV-1 CRF02-AG using ANN.
Magaret et al. (2019) | RM, Super Learner and Glmnet | N-Linked | Protein sequence and biological data used to identify N-linked sites using the super learner algorithm.
Bojar et al. (2021a) | ML | N-Linked | Guide to Lectin Binding: Machine-Learning Directed Annotation of 57 Unique Lectin Specificities.
Ruiz-Blanco et al. (2017) | Jrip Classifier | N-Linked | Novel “extended sequons” of human N-glycosylation sites improve the precision of qualitative predictions: an alignment-free study of pattern recognition using ProtDCal protein features.
Park et al. (2019) | Clustering | N-O Linked | CHARMM-GUI Glycan Modeler for modeling and simulation of carbohydrates and glycoconjugates.
Dobson, Zeke & Tusnády (2021) | CNN with Adam | N-O Linked | Novel mechanism to collect the dataset using polarization, then training on a CNN model.
Taherzadeh et al. (2019) | Deep ANN and SVM | N-O Linked | Predicting N- and O-linked glycosylation sites of human and mouse proteins by using sequence and predicted structural properties through DNN and SVM.
Li et al. (2019) | PA2DE using AlphaMax | N-C-O Linked | Positive-unlabeled data set used to predict sites using the AlphaMax algorithm.
Bojar et al. (2021b) | Recurrent NN (LSTM) | Glycosylation | Deep-learning recurrent NN models for glycans, trained on a curated dataset of 19,299 unique glycans, that can be used to study and predict glycan functions.
Carpenter et al. (2022) | MNN (ADAM) | Glycosylation | A multi-task neural network using the ADAM algorithm for predicting protein-glycan interaction.
Lundstrøm et al. (2022) | Graph CNN | Glycosylation | LectinOracle, a model combining transformer-based representations for proteins and graph convolutional neural networks for glycans to predict their interaction.
Burkholz, Quackenbush & Bojar (2021) | Graph NN | Glycosylation | Using graph convolutional neural networks to learn a representation for glycans.
Coff et al. (2020) | Regression classifier | Glycosylation | Frequent subtree mining and mRMR used for feature selection, then training on a regression classifier for glycan motifs.
Desaire, Patabandige & Hua (2021) | SVM | Glycosylation | The local-balanced model for improved machine learning outcomes on mass spectrometry data sets and other instrumental data.
Le, Sandag & Ou (2018) | RBF Network | PTM | Prediction of transport proteins (including N-linked) into three classes and six families using an RBF Network.
Chen et al. (2021) | Deep learning | PTM | nhKcr: a new bioinformatics tool for predicting crotonylation sites on human non-histone proteins based on deep learning.
Lei, Tang & Du (2017) | Ensemble classifier | PTM | Predicting S-sulfenylation sites using physicochemical properties difference and ensemble classifier.
Murad et al. (2021) | Random Forest | PTM | Ubiquitination Sites Prediction Using Statistical Moment with Random Forest Approach.
Kumar et al. (2020) | ML | PTM | Machine learning techniques to identify potential drug targets for anti-epileptic drugs.
Perpetuo et al. (2021) | AI | PTM | Artificial intelligence used for peptidomics.
Qiu et al. (2018) | SVM | PTM | Predicting protein lysine methylation sites by incorporating single-residue structural features into Chou’s pseudo components.
Liu et al. (2021) | Extreme gradient boosting classifier | PTM | Prediction of protein ubiquitination sites via multi-view features based on eXtreme gradient boosting classifier.
Huang & Li (2018) | KNN | PTM | Feature extractions for computationally predicting protein post-translational modifications.
The selected articles show that most of the authors (Akmal, Rasool & Khan, 2017; Taherzadeh et al., 2019; Liu et al., 2019; Lundstrøm et al., 2022; Burkholz, Quackenbush & Bojar, 2021; Kotidis & Kontoravdi, 2020; Antonakoudis et al., 2021; Hwang et al., 2020; Dimeglio et al., 2020; Dobson, Zeke & Tusnády, 2021; Ilyas et al., 2019; Chen et al., 2021) used an Artificial Neural Network (ANN) or a variant of it, such as Deep ANN, Graph NN, Convolutional NN or Recurrent NN. The second most used algorithm is the Support Vector Machine (SVM), used by (Tran, Pham & Ou, 2021; Pitti et al., 2019; Wang et al., 2017; Desaire, Patabandige & Hua, 2021; Qiu et al., 2018). The remaining authors used Random Forest, XGBoost, Bayesian network, regression classifier or Radial Basis Function, and some used customized methods, as shown in Table 11.

Assessment of q5:

How effective are the existing models at predicting N-linked sites?

Result comparisons present the performance of the various models, from which conclusions can be drawn along specific dimensions. In this systematic review, the performance of the N-linked models in the selected articles was compared. The parameters used for the comparison are (a) availability of the data set, (b) accuracy, (c) sensitivity, (d) specificity, (e) availability of the developed tool and (f) comparison on independent data, together with the type of glycosylation, as shown in Table 12. Many of the authors (Kotidis & Kontoravdi, 2020; Sha et al., 2019; Park et al., 2019; Antonakoudis et al., 2021; Zhang et al., 2021b; Wang et al., 2017; Kumar et al., 2020; Ilyas et al., 2019; Sugár et al., 2021; Bojar et al., 2021a; Perpetuo et al., 2021; Huang & Li, 2018) did not provide results or did not report these performance metrics in their research. The authors (Akmal, Rasool & Khan, 2017; Chien et al., 2020; Taherzadeh et al., 2019; Tran, Pham & Ou, 2021; Liu et al., 2019; Li et al., 2019; Pitti et al., 2019; Hwang et al., 2020; Le, Sandag & Ou, 2018; Dimeglio et al., 2020; Magaret et al., 2019; Ruiz-Blanco et al., 2017) reported most of the performance metrics specific to N-linked site identification; of these, Chien et al. (2020) and Hwang et al. (2020) did not provide the data set on which the experiments were performed.
Table 12

Performance comparison of N-linked models.

Ref. | Glycosylation type | Result comparison on independent data | Tool | Dataset | ACC (%) | SN (%) | SP (%) | Finding
Akmal, Rasool & Khan (2017) | N-Linked | Yes | No | Yes | 99.9 | 99.8 | 99.9 | Detailed comparison performed and metrics presented, but tool is not available.
Chien et al. (2020) | N-Linked | Yes | Yes | No | 84.7 | 82.8 | 84.8 | Detailed comparison performed and metrics presented, but data set is not available.
Taherzadeh et al. (2019) | N-Linked and O-Linked | Yes | Yes | Yes | 97.5 | 98 | - | Detailed comparison performed and metrics presented.
Tran, Pham & Ou (2021) | N-Linked | Yes | No | Yes | 93.4 | 98.6 | 92.8 | Detailed comparison performed and metrics presented, but tool is not available.
Liu et al. (2019) | N-Linked | No | Yes | Yes | 50 | - | - | Results not compared properly.
Li et al. (2019) | N-Linked (and C/O-Linked) | Yes | Yes | Yes | 88.6 | - | - | Detailed comparison performed, but SN and SP not computed.
Bojar et al. (2021b) | Glycosylation | No | No | Yes | 75 | - | - | Results not compared properly; a few metrics are missing.
Carpenter et al. (2022) | Glycosylation | No | Yes | No | 75 | - | - | Tool is available, but data set is missing and not all performance metrics were reported.
Pitti et al. (2019) | N-Linked | Yes | Yes | Yes | 74 | 49 | - | Detailed comparison performed, but SP is missing.
Lundstrøm et al. (2022) | Glycosylation | No | No | Yes | 72 | - | - | Results not reported properly; metrics and tool are missing.
Burkholz, Quackenbush & Bojar (2021) | Glycosylation | No | Yes | Yes | 85 | - | - | Detailed comparison performed, but a few metrics are missing.
Kotidis & Kontoravdi (2020) | N-Linked | No | No | Yes | - | - | - | Did not specify results.
Sha et al. (2019) | N-Linked | No | No | Yes | - | - | - | Did not specify results.
Park et al. (2019) | N-Linked and O-Linked | No | Yes | No | - | - | - | Did not specify results.
Antonakoudis et al. (2021) | N-Linked | No | No | Yes | - | - | - | Did not specify results.
Hwang et al. (2020) | N-Linked | No | No | No | 99 | 100 | - | Achieved almost full accuracy, but result comparison with an independent data set, the data set and the tool are missing.
Coff et al. (2020) | Glycosylation | No | Yes | Yes | 89 | - | - | Achieved good results, but comparison on an independent data set is missing and glycosylation type is not specified.
Le, Sandag & Ou (2018) | PTM (including N-Linked) | Yes | Yes | Yes | 92 | - | - | Detailed comparison performed and good results achieved, but PTM type is not specified.
Zhang et al. (2021b) | N-Linked | Yes | No | No | - | - | - | Did not specify results.
Wang et al. (2017) | N-Linked | Yes | No | No | - | - | - | Did not specify results.
Dimeglio et al. (2020) | N-Linked | Yes | Yes | Yes | 88 | 86 | 89 | Detailed comparison performed and good results achieved.
Kumar et al. (2020) | PTM | Yes | No | Yes | - | - | - | Did not specify results.
Ilyas et al. (2019) | PTM | Yes | No | Yes | - | - | - | Did not specify results.
Magaret et al. (2019) | N-Linked | Yes | No | Yes | 86 | 97 | 39 | Detailed comparison performed and good results achieved.
Sugár et al. (2021) | N-Linked | No | Yes | Yes | - | - | - | Did not specify results.
Bojar et al. (2021a) | N-Linked | No | No | Yes | - | - | - | Did not specify results.
Desaire, Patabandige & Hua (2021) | Glycosylation | No | No | Yes | 98 | - | - | Achieved good results, but glycosylation type is not specified and a few metrics are missing.
Chen et al. (2021) | PTM | Yes | Yes | Yes | 85 | 62 | 90 | Detailed comparison performed and good results achieved, but it is generic for PTM as the specific type was not mentioned.
Perpetuo et al. (2021) | PTM | No | No | No | - | - | - | Did not specify results.
Liu et al. (2021) | PTM | No | Yes | Yes | 97 | - | - | Detailed comparison performed and good results achieved, but SN and SP are missing.
Ruiz-Blanco et al. (2017) | N-Linked | No | No | Yes | 99 | 82 | - | Detailed comparison performed and good results achieved, but data set is missing.
Huang & Li (2018) | PTM | No | No | Yes | - | - | - | Did not specify results.

Discussion and future direction

This section summarizes and discusses the details of this systematic literature review on the identification of N-linked sites.

Taxonomy hierarchy

The objective of this study was to analyze current progress in identifying N-linked glycosylation sites. To achieve this objective, a taxonomy was built from the coding scheme shown in Table 13, after critically analyzing 70 articles selected through a systematic approach. The coding was developed on the aspects relevant to this study: feature set construction method, machine model training algorithm and performance evaluation. These aspects are further divided into sub-levels showing the depth of each aspect and its role in the efficient identification of N-linked sites. The coding scheme helped to construct the taxonomy shown in Fig. 8, which is used to further investigate the domains and sub-domains it identifies.
Table 13

Taxonomy coding scheme for SLR.

Domain | Code | Subdomain | Reference
Feature set method | SMF | Statistical Moment Feature | Akmal, Rasool & Khan (2017), Le, Sandag & Ou (2018), Murad et al. (2021), Burkholz, Quackenbush & Bojar (2021), Dimeglio et al. (2020)
Feature set method | SEF | Sequence Based Feature | Chien et al. (2020), Taherzadeh et al. (2019), Liu et al. (2019), Li et al. (2019), Bojar et al. (2021b), Park et al. (2019), Huang & Li (2018), Lundstrøm et al. (2022)
Feature set method | SQF | Structure Based Feature | Chien et al. (2020), Taherzadeh et al. (2019), Li et al. (2019), Park et al. (2019), Huang & Li (2018), Zhang et al. (2021b), Suga, Nagae & Yamaguchi (2018), Kumar et al. (2020)
Feature set method | WEF | Word Embedding Feature | Tran, Pham & Ou (2021)
Feature set method | SVF | Similarity Voting Feature | Pitti et al. (2019)
Machine training algorithm | ANN | Artificial Neural Network | Akmal, Rasool & Khan (2017), Liu et al. (2019), Kotidis & Kontoravdi (2020), Antonakoudis et al. (2021), Dimeglio et al. (2020), Ilyas et al. (2019)
Machine training algorithm | SVM | Support Vector Machine | Akmal, Rasool & Khan (2017), Taherzadeh et al. (2019), Tran, Pham & Ou (2021), Desaire, Patabandige & Hua (2021), Qiu et al. (2018), Pitti et al. (2019), Wang et al. (2017)
Machine training algorithm | DNN | Deep Neural Network | Taherzadeh et al. (2019), Hwang et al. (2020), Chen et al. (2021)
Machine training algorithm | GNN | Graph Neural Network | Burkholz, Quackenbush & Bojar (2021), Lundstrøm et al. (2022)
Machine training algorithm | RBF | Radial Basis Function | Le, Sandag & Ou (2018)
Performance metric | ACC | Accuracy | Akmal, Rasool & Khan (2017), Chien et al. (2020), Taherzadeh et al. (2019), Tran, Pham & Ou (2021), Liu et al. (2019), Li et al. (2019), Hwang et al. (2020), Magaret et al. (2019), Pitti et al. (2019), Dimeglio et al. (2020)
Performance metric | SP | Specificity | Akmal, Rasool & Khan (2017), Chien et al. (2020), Magaret et al. (2019), Tran, Pham & Ou (2021), Dimeglio et al. (2020)
Performance metric | SN | Sensitivity | Akmal, Rasool & Khan (2017), Chien et al. (2020), Taherzadeh et al. (2019), Hwang et al. (2020), Magaret et al. (2019), Ruiz-Blanco et al. (2017), Tran, Pham & Ou (2021), Pitti et al. (2019), Dimeglio et al. (2020)
Figure 8

Taxonomy of N-Linked site identification perspective.

General observation and future direction

Several observations can be drawn from the findings of this SLR, based on the taxonomy shown in Fig. 8. Various RQs were developed, each addressing a key factor in the identification of N-linked sites. The trends and findings observed for each factor are described below, along with future directions.

(a) Feature set construction method. The performance of a computational model depends heavily on the quality of the feature set extracted from the data set, which is later used to train the machine learning model (Saeed, Mahmood & Khan, 2018; Khan et al., 2019; Naseer et al., 2021a). Discriminating features help the model learn proficiently and then make the right prediction. Therefore, it is important to find techniques that extract useful information from the dataset. The authors used various methods to construct the feature set; the most widely used are protein sequence features, protein structure features, statistical moments, word embedding and similarity voting. The majority of the authors (Liu et al., 2019; Bojar et al., 2021b; Magaret et al., 2019; Bojar et al., 2021a) used only the sequence-based information of the protein to train the model. It was also observed that the authors (Akmal, Rasool & Khan, 2017; Taherzadeh et al., 2019; Li et al., 2019; Park et al., 2019; Murad et al., 2021) applied a combination of multiple features, such as sequence, structural and statistical features, to construct the feature vector. More than 50% of the research articles selected in this study that scored 10 points in the quality assessment used a combination of the features mentioned above. The new techniques adopted in recent research articles are word embedding vectors and graph statistical features, along with similarity voting and Chou’s five-step method. Researchers can use these feature extraction techniques to improve the performance of N-linked prediction models or of any PTM site identification model.

(b) Machine training algorithm. After feature extraction, the most significant part of a computational model is the method used to train the machine model (Hussain, Rasool & Khan, 2020; Barukab et al., 2022; Khan et al., 2020a). The performance of the model is affected most by the technique used for training. An appropriate learning algorithm, combined with a good feature extraction method, results in a model that predicts independent data with high accuracy. Therefore, the development of an appropriate machine learning method is essential. Researchers have proposed various methods to predict N-linked sites accurately. The most widely used include Artificial Neural Network (ANN), Support Vector Machine (SVM), Deep Neural Network (DNN), Graph Neural Network (GNN) and Radial Basis Function (RBF) Network. The research articles published in Q1 journals according to the JCR widely used ANN (Akmal, Rasool & Khan, 2017; Liu et al., 2019; Dimeglio et al., 2020) along with SVM (Taherzadeh et al., 2019; Pitti et al., 2019). The research articles (Taherzadeh et al., 2019; Le, Sandag & Ou, 2018; Ruiz-Blanco et al., 2017) that provide a web server and report accuracy above 90% used the Jrip classifier, DNN, SVM and RBF algorithms. The authors (Akmal, Rasool & Khan, 2017; Tran, Pham & Ou, 2021; Hwang et al., 2020; Magaret et al., 2019; Desaire, Patabandige & Hua, 2021) who proposed a prediction model without providing a web server, and who also report accuracy above 90%, used the ANN, SVM, DNN and RF algorithms. Researchers can use these algorithms to improve the performance of N-linked prediction models or of any PTM site identification model.

(c) Performance evaluation. Once the model has been trained, it is validated on independent data to evaluate its performance. There are various techniques to measure the validity of a model; the most significant metrics are accuracy, sensitivity and specificity. Sensitivity measures the proportion of actual positive sites that the model correctly identifies (the true positive rate), while specificity measures the proportion of actual negative sites that it correctly rejects (the true negative rate). In this study, performance was evaluated on these metrics. Around 40% of the authors did not validate their model on any of the above performance metrics, and only 20% reported all of them. The predictive models in which the PTM type is specialized to N-linked have better accuracy than those in which the PTM type is not specified or is generalized. The highest accuracy, approximately 99%, was achieved by Akmal, Rasool & Khan (2017) based on these evaluation criteria. That study also reports the sensitivity and specificity of the model, 99.8% and 99.9% respectively, but it did not provide a web server. Hwang et al. (2020) claim an accuracy of 99% along with a sensitivity of 100%, but did not provide a working tool, the dataset, or result comparisons with other predictors. The most efficient predictive models with an available web server are the Sequon model (Ruiz-Blanco et al., 2017) and the Sprint-Gly model (Taherzadeh et al., 2019), with accuracies of 97.5% and 97% respectively. The Sequon model was trained on human protein sequences only, while Sprint-Gly is equally effective for both human and mouse species. Therefore, Sprint-Gly is considered a reliable model among the currently available web servers.
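
The three metrics discussed above follow directly from the confusion matrix of a binary site predictor. The sketch below uses the standard definitions, ACC = (TP + TN) / (TP + TN + FP + FN), SN = TP / (TP + FN) and SP = TN / (TN + FP), expressed as percentages to match Table 12. The function name and the example counts are illustrative assumptions, not values taken from any reviewed study.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Return accuracy, sensitivity and specificity as percentages.

    tp/fn count true glycosylation sites predicted correctly/incorrectly;
    tn/fp count non-sites predicted correctly/incorrectly.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "ACC": 100.0 * (tp + tn) / (positives + negatives),
        "SN": 100.0 * tp / positives,
        "SP": 100.0 * tn / negatives,
    }


if __name__ == "__main__":
    # Illustrative counts only: 100 true sites and 100 non-sites.
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

Reporting all three values matters on imbalanced data sets, where a model can reach high accuracy while missing most positive sites or accepting most negative ones.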

Future direction

Bioinformatics is an emerging field with many problems that need computational rather than experimental solutions. As mentioned earlier, researchers have identified almost 200 types of PTM, which play a key role in various biological functions. Apart from N-linked glycosylation, other types of glycosylation such as O-linked and C-linked also play a vital role in protein functioning and in various drug discovery techniques. Therefore, there is an opportunity for researchers, the pharmaceutical industry and academia to develop efficient computational models for the problems that need better computational solutions. A few of the existing problems that need to be addressed are given below: (a) Identify the O-linked glycosylation sites for threonine and serine using ANN. (b) How can the prediction performance for C-linked glycosylation be enhanced through existing neural network classifiers? (c) Develop a comprehensive predictive model to classify the type of glycosylation. (d) How effective are the existing classifiers at predicting other PTM sites?

Conclusion

The significance of N-linked glycosylation motivates the discovery of such sites using computational methods rather than experimental methods, owing to the limitations of the latter. In this systematic study, existing work on identifying such sites was reviewed, covering the possible challenges and their solutions through a systematic method. Research articles related to the keywords associated with N-linked glycosylation were retrieved from five major digital libraries. The search query applied to these libraries returned more than 800 articles, and after the filtering process 70 articles remained for further analysis. The results show that approximately 75% of the articles were published in recognized journals and the rest belong to top conferences. More than 40% of the articles were published in American journals, followed by the Middle East with 20%. Most of the selected studies focused on the feature construction method and training algorithm, with less focus on performance evaluation criteria and the development of a tool or web server. The major shortcomings of any SLR relate primarily to search strategy, poor classification, and inaccurate data extraction. In this SLR, these deficiencies were addressed by applying the search query to five major digital libraries to reduce bias. The results of the search queries were then filtered through well-defined inclusion/exclusion criteria.
  68 in total

1.  Residues Comprising the Enhanced Aromatic Sequon Influence Protein N-Glycosylation Efficiency.

Authors:  Yen-Wen Huang; Hwai-I Yang; Ying-Ta Wu; Tsui-Ling Hsu; Tzu-Wen Lin; Jeffery W Kelly; Chi-Huey Wong
Journal:  J Am Chem Soc       Date:  2017-09-06       Impact factor: 15.419

2.  iPhosS(Deep)-PseAAC: Identification of Phosphoserine Sites in Proteins Using Deep Learning on General Pseudo Amino Acid Compositions.

Authors:  Sheraz Naseer; Waqar Hussain; Yaser Daanial Khan; Nouman Rasool
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2022-06-03       Impact factor: 3.710

3.  A Useful Guide to Lectin Binding: Machine-Learning Directed Annotation of 57 Unique Lectin Specificities.

Authors:  Daniel Bojar; Lawrence Meche; Guanmin Meng; William Eng; David F Smith; Richard D Cummings; Lara K Mahal
Journal:  ACS Chem Biol       Date:  2022-01-27       Impact factor: 5.100

4.  CHARMM-GUI Glycan Modeler for modeling and simulation of carbohydrates and glycoconjugates.

Authors:  Sang-Jun Park; Jumin Lee; Yifei Qi; Nathan R Kern; Hui Sun Lee; Sunhwan Jo; InSuk Joung; Keehyung Joo; Jooyoung Lee; Wonpil Im
Journal:  Glycobiology       Date:  2019-04-01       Impact factor: 4.313

5.  Probabilistic model by Bayesian network for the prediction of antibody glycosylation in perfusion and fed-batch cell cultures.

Authors:  Liang Zhang; MingLiang Wang; Andreas Castan; Håkan Hjalmarsson; Veronique Chotteau
Journal:  Biotechnol Bioeng       Date:  2021-05-03       Impact factor: 4.530

6.  Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features.

Authors:  Craig A Magaret; David C Benkeser; Brian D Williamson; Bhavesh R Borate; Lindsay N Carpp; Ivelin S Georgiev; Ian Setliff; Adam S Dingens; Noah Simon; Marco Carone; Christopher Simpkins; David Montefiori; Galit Alter; Wen-Han Yu; Michal Juraska; Paul T Edlefsen; Shelly Karuna; Nyaradzo M Mgodi; Srilatha Edugupanti; Peter B Gilbert
Journal:  PLoS Comput Biol       Date:  2019-04-01       Impact factor: 4.475

7.  iMethylK_pseAAC: Improving Accuracy of Lysine Methylation Sites Identification by Incorporating Statistical Moments and Position Relative Features into General PseAAC via Chou's 5-steps Rule.

Authors:  Sarah Ilyas; Waqar Hussain; Adeel Ashraf; Yaser Daanial Khan; Sher Afzal Khan; Kuo-Chen Chou
Journal:  Curr Genomics       Date:  2019-05       Impact factor: 2.236

8.  Evaluating machine learning methodologies for identification of cancer driver genes.

Authors:  Sharaf J Malebary; Yaser Daanial Khan
Journal:  Sci Rep       Date:  2021-06-10       Impact factor: 4.379

9.  Insights into the salivary N-glycome of Lutzomyia longipalpis, vector of visceral leishmaniasis.

Authors:  Karina Mondragon-Shem; Katherine Wongtrakul-Kish; Radoslaw P Kozak; Shi Yan; Iain B H Wilson; Katharina Paschinger; Matthew E Rogers; Daniel I R Spencer; Alvaro Acosta-Serrano
Journal:  Sci Rep       Date:  2020-07-31       Impact factor: 4.379

10.  Identifying glycan motifs using a novel subtree mining approach.

Authors:  Lachlan Coff; Jeffrey Chan; Paul A Ramsland; Andrew J Guy
Journal:  BMC Bioinformatics       Date:  2020-02-04       Impact factor: 3.169
