Zengchao Mu 1, Ting Yu 2, Xiaoping Liu 3, Hongyu Zheng 4, Leyi Wei 5, Juntao Liu 6
1. School of Mathematics and Statistics, Shandong University, Weihai, 264209, China.
2. Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China.
3. Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Beijing, China.
4. Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China.
5. School of Software, Shandong University, Jinan, China. weileyi@sdu.edu.cn.
6. School of Mathematics and Statistics, Shandong University, Weihai, 264209, China. juntaosdu@126.com.
Abstract
BACKGROUND: Feature extraction from protein sequences is widely used in protein-related research, including protein similarity analysis and the prediction of protein functions or interactions. RESULTS: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model for protein sequences. FEGS combines a new graphical representation of protein sequences, based on the physicochemical properties of amino acids, with statistical features of the sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When applied to phylogenetic analysis on five protein sequence data sets, FEGS performs notably better than all of the other methods compared. CONCLUSION: FEGS is a carefully designed and practically powerful method for extracting features from protein sequences. The current version is user-friendly and is expected to play a crucial role in studies involving protein sequence analysis.
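As an illustration of what "statistical features" of a protein sequence can look like, the following is a minimal sketch, not the FEGS implementation. The abstract does not specify which statistical features are used; this sketch assumes amino acid frequencies (20 values) and dipeptide frequencies (400 values), a common choice. The function name `statistical_features` and the example sequence are hypothetical.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def statistical_features(seq):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies.

    Assumption: this feature set is illustrative only; the abstract does not
    state which statistical features FEGS actually uses.
    """
    seq = seq.upper()
    n = len(seq)

    # Relative frequency of each amino acid in the sequence.
    aa_freq = [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]

    # Count overlapping adjacent residue pairs (dipeptides).
    counts = {}
    for i in range(n - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = max(n - 1, 0)

    # Relative frequency of each of the 20 x 20 possible dipeptides.
    dp_freq = [
        counts.get(a + b, 0) / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]
    return aa_freq + dp_freq


features = statistical_features("MKVLAAGIVK")  # hypothetical toy sequence
print(len(features))
```

A vector like this could then be concatenated with graphical features to form one fixed-length representation per sequence, so that sequences of different lengths become directly comparable.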
Keywords:
Feature extraction; Graphical representation; Physicochemical properties of amino acids; Protein similarity analysis; Statistical features