Yingxi Yang (1), Hui Wang (2), Jun Ding (1), Yan Xu (3).
(1) Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China.
(2) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, China.
(3) Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China; Beijing Key Laboratory for Magneto-photoelectrical Composites and Interface Science, University of Science and Technology Beijing, Beijing, 100083, China. Electronic address: xuyan@ustb.edu.cn.
Abstract
MOTIVATION: Posttranslational modification (PTM) is the enzymatic modification of proteins after their translation by ribosomes. When two or more modifications can occur at one residue, the prediction task can be framed as a multi-label problem. Two or more simultaneous modifications on a residue are more common than single PTMs. Lysine residues in proteins can be subjected to a variety of PTMs, such as ubiquitination, acetylation, sumoylation, methylation, and succinylation. Characterizing uncharacterized protein sequences remains an important and active research problem. In particular, computational methods that predict lysine acetylation and sumoylation are highly desirable as a way to process multi-label sequences of lysine residues.
RESULTS: In this paper, we present an integrated approach, the five-step prediction method (FSPM), which addresses the problem by (1) using one-sided selection (OSS) to deal with imbalanced data, (2) extracting binary features from protein sequences, (3) incorporating binary relevance, classifier chains, and multi-class transformation methods to simplify the multi-label problem, (4) constructing different classifiers, and (5) implementing cross-validation and evaluating these classifiers. In 10-fold cross-validation, FSPM achieved an accuracy of 61.49% and an absolute-true rate of 60.17%. These results indicate that FSPM is effective and could serve as a prediction engine for multi-label systems. We also conducted a variety of statistical analyses of the predicted results to discuss the biological functions of lysine acetylation and sumoylation.
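Steps (2), (3), and (5) of the method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 20-letter alphabet ordering, the function names, the label strings, and the metric definitions are all assumptions made for illustration. In particular, "accuracy" is taken here to be the average overlap between true and predicted label sets, and "absolute-true" the fraction of samples whose label set is predicted exactly, which is a common convention in multi-label PTM prediction but is not spelled out in the abstract.

```python
# Illustrative sketch only; names and definitions below are assumptions,
# not taken from the FSPM paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet ordering


def binary_encode(window):
    """One-hot encode a peptide window: 20 bits per residue.

    Symbols outside the alphabet (e.g. a padding 'X') become all zeros.
    """
    features = []
    for residue in window:
        features.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return features


def to_multiclass(acetylated, sumoylated):
    """Multi-class transformation: map the two binary labels to one class id.

    0 = neither, 1 = sumoylation only, 2 = acetylation only, 3 = both.
    """
    return 2 * int(acetylated) + int(sumoylated)


def absolute_true(y_true, y_pred):
    """Fraction of samples whose predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if set(t) == set(p))
    return exact / len(y_true)


def multilabel_accuracy(y_true, y_pred):
    """Average of |true & predicted| / |true | predicted| over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        union = t | p
        # Assumption: two empty label sets count as a perfect match.
        total += len(t & p) / len(union) if union else 1.0
    return total / len(y_true)


if __name__ == "__main__":
    y_true = [{"acetyl"}, {"acetyl", "sumo"}, {"sumo"}]
    y_pred = [{"acetyl"}, {"acetyl"}, {"acetyl"}]
    print(len(binary_encode("AKC")))
    print(to_multiclass(True, True))
    print(absolute_true(y_true, y_pred))
    print(multilabel_accuracy(y_true, y_pred))
```

Under these assumed definitions, the absolute-true rate can never exceed the accuracy, because an exact match contributes 1 to both while a partial match contributes only to the accuracy. That ordering is consistent with the reported 60.17% and 61.49%.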