Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Maxwell W Libbrecht, Jeffrey A Bilmes, William Stafford Noble.

Abstract

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
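
To make the idea of representative-set selection by submodular optimization concrete, here is a minimal sketch. The abstract does not name a specific objective or optimizer, so the choices below are assumptions for illustration: a facility-location objective (each sequence is scored by its similarity to its closest chosen representative) maximized with the standard greedy algorithm. For monotone submodular functions under a size constraint, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value, which is one common reading of "best possible in polynomial time". The function names and the toy similarity matrix `TOY_SIM` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical pairwise similarities for five sequences:
# items 0-2 form one tight group, items 3-4 form another.
TOY_SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]


def facility_location(sim: Sequence[Sequence[float]], subset: List[int]) -> float:
    """Sum, over all items, of the similarity to the closest chosen representative."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Choose up to k representatives, each time adding the item with the largest marginal gain."""
    n = len(sim)
    chosen: List[int] = []
    # best_cover[i] = highest similarity of item i to any representative chosen so far
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_item = -1.0, -1
        for cand in range(n):
            if cand in chosen:
                continue
            # Marginal gain: how much adding cand improves each item's coverage.
            gain = sum(max(sim[i][cand] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_item = gain, cand
        chosen.append(best_item)
        best_cover = [max(best_cover[i], sim[i][best_item]) for i in range(n)]
    return chosen


if __name__ == "__main__":
    reps = greedy_select(TOY_SIM, 2)
    print(reps, facility_location(TOY_SIM, reps))
```

On the toy matrix, the second pick comes from the group that the first representative does not cover, which is the non-redundancy behavior the abstract describes. A mixture objective of the kind the abstract mentions could be sketched the same way, since a non-negative weighted sum of submodular functions is itself submodular; the particular components and weights would again be assumptions.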

Keywords: discrete optimization; diversity; protein sequence analysis; redundancy; representative subsets; submodular maximization

Substances: Proteins

Year: 2018; PMID: 29345009; PMCID: PMC5835207; DOI: 10.1002/prot.25461

Source DB: PubMed; Journal: Proteins; ISSN: 0887-3585
