Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Samantha Petti, Sean R Eddy.

Abstract

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
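The splitting constraint described in the abstract can be illustrated with a small sketch. This is not the authors' pair of algorithms, whose details the abstract does not give; it is a simple greedy heuristic written only to show the idea. It assumes the sequences are already aligned to equal length, that percent identity is the fraction of matching columns, and that two sequences are "linked" when their identity is at least p. The names `percent_identity` and `greedy_split` are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of matching columns between two pre-aligned, equal-length sequences.

    Assumption: a simple column-wise identity; real pipelines may define
    identity differently (for example, ignoring gap columns).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_split(seqs, p):
    """Greedy illustrative split into (train, test, discarded) index sets.

    Two sequences are linked when they are at least p% identical. The split
    never places linked sequences on opposite sides, so every test sequence
    is less than p% identical to every training sequence.
    """
    n = len(seqs)
    # Build the similarity graph: an edge means "too similar to separate".
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if percent_identity(seqs[i], seqs[j]) >= p:
            neighbors[i].add(j)
            neighbors[j].add(i)

    train, test, discarded = set(), set(), set()
    for v in range(n):
        near_train = bool(neighbors[v] & train)
        near_test = bool(neighbors[v] & test)
        if near_train and near_test:
            # Linked to both sides: it cannot join either without a violation.
            discarded.add(v)
        elif near_test:
            test.add(v)
        elif near_train:
            train.add(v)
        elif len(test) < len(train):
            # Unconstrained: place it on the currently smaller side.
            test.add(v)
        else:
            train.add(v)
    return train, test, discarded
```

On a toy family such as `["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "CCCCCCCCCG", "AAAAACCCCC"]` with `p = 50`, the last sequence is linked to members of both groups and is discarded, which is the price of enforcing the constraint. A heuristic like this depends on input order and makes no attempt to maximize the number of sequences kept.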

Year: 2022; PMID: 35255082; PMCID: PMC8929697; DOI: 10.1371/journal.pcbi.1009492

Source DB: PubMed; Journal: PLoS Comput Biol; ISSN: 1553-734X; Impact factor: 4.475

