Jian Feng (1), Daniel Q Naiman, Bret Cooper.
(1) Department of Applied Mathematics and Statistics, The Johns Hopkins University, Baltimore, Maryland, USA.
Abstract
MOTIVATION: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of the patterns that usually challenge database-search software.
RESULTS: We designed an unsupervised pattern recognition algorithm for detecting patterns of various lengths in large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases with a Monte Carlo sampling algorithm. Searching these decoy databases yielded predictions of false positive rates for spectrum/peptide sequence matches. We show examples where this method, which is independent of instrumentation, database-search software and samples, estimates false positive identification rates better than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology.
AVAILABILITY: On request from the authors.
SUPPLEMENTARY INFORMATION: http://bioinformatics.psb.ugent.be/.
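The contrast between a reversed decoy database and a decoy database sampled from sequence patterns can be illustrated with a minimal Python sketch. The abstract does not specify the pattern detection or sampling procedure, so this sketch substitutes a plain fixed-length k-mer Markov chain for the authors' variable-length pattern model. All function names (`reverse_decoys`, `kmer_transitions`, `sample_decoy`, `estimated_false_positive_rate`) are hypothetical, and the decoy-to-target ratio is a common convention assumed here, not a formula taken from the paper.

```python
import random
from collections import Counter, defaultdict


def reverse_decoys(proteins):
    """Reverse each sequence: the conventional decoy construction."""
    return [seq[::-1] for seq in proteins]


def kmer_transitions(proteins, k=2):
    """Count which residue follows each k-residue context in the target database.

    Assumption: a fixed-length k-mer model stands in for the paper's
    variable-length pattern detection, which the abstract does not describe.
    """
    table = defaultdict(Counter)
    for seq in proteins:
        for i in range(len(seq) - k):
            table[seq[i:i + k]][seq[i + k]] += 1
    return table


def sample_decoy(length, table, k, rng):
    """Monte Carlo sample one decoy that follows the observed k-mer statistics.

    Requires a non-empty table (at least one sequence longer than k residues).
    """
    contexts = list(table)
    seq = rng.choice(contexts)
    while len(seq) < length:
        followers = table.get(seq[-k:])
        if not followers:
            # Dead end: this context never had a successor, so restart
            # from a randomly chosen observed context.
            seq += rng.choice(contexts)
            continue
        residues, weights = zip(*followers.items())
        seq += rng.choices(residues, weights=weights)[0]
    return seq[:length]


def estimated_false_positive_rate(target_hits, decoy_hits):
    """Ratio of decoy matches to target matches at the same score threshold.

    Assumption: a common target-decoy convention, not the paper's formula.
    """
    return decoy_hits / target_hits if target_hits else 0.0
```

In use, one would build `kmer_transitions` from the target protein database, call `sample_decoy` once per target protein with a seeded `random.Random` to obtain a decoy of matching length, search the spectra against both databases, and pass the two match counts to `estimated_false_positive_rate`. Unlike `reverse_decoys`, the sampled sequences retain short-range residue patterns present in the target database.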
Authors: Nazrul Islam; Attila Nagy; Wesley M Garrett; Dan Shelton; Bret Cooper; Xiangwu Nou. Journal: Appl Environ Microbiol. Date: 2016-06-30. Impact factor: 4.792.