Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Maxwell W Libbrecht, Jeffrey A Bilmes, William Stafford Noble.

Abstract

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
© 2018 Wiley Periodicals, Inc.
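
To make the idea of submodular representative selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes a facility-location objective, f(S) = sum over all sequences i of the maximum similarity between i and any member of S, and maximizes it with the standard greedy algorithm, which for monotone submodular functions under a size constraint is known to achieve at least a (1 - 1/e) fraction of the optimum. The abstract does not state which objectives or similarity measure the paper uses, so the function names `facility_location` and `greedy_select` and the toy similarity matrix are hypothetical.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: List[int]) -> float:
    """f(S): for every item, take its best similarity to any member of S, then sum."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick up to k representatives, each time adding the item with the
    largest marginal gain in the facility-location objective.
    Assumes all similarities are non-negative."""
    n = len(sim)
    selected: List[int] = []
    best_cover = [0.0] * n  # best similarity of each item to the current subset
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves each item's coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical pairwise similarities for five sequences forming two groups:
# {0, 1, 2} and {3, 4}.
toy_sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]

representatives = greedy_select(toy_sim, k=2)
print(representatives, facility_location(toy_sim, representatives))
```

On this toy input the greedy procedure takes one representative from each group rather than two near-duplicates from the larger group, which is the non-redundancy behavior the abstract describes. Unlike a threshold-based method, the caller fixes the subset size `k` directly rather than a similarity cutoff.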

Keywords:  discrete optimization; diversity; protein sequence analysis; redundancy; representative subsets; submodular maximization

Year:  2018        PMID: 29345009      PMCID: PMC5835207          DOI: 10.1002/prot.25461

Source DB:  PubMed          Journal:  Proteins        ISSN: 0887-3585


