PSI-Search: iterative HOE-reduced profile SSEARCH searching.

Weizhong Li, Hamish McWilliam, Mickael Goujon, Andrew Cowley, Rodrigo Lopez, William R Pearson.

Abstract

UNLABELLED: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains (homologous over-extension, HOE). PSI-Search combines an optimal Smith-Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search also produces 2- to 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive.
AVAILABILITY AND IMPLEMENTATION: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).

Year:  2012        PMID: 22539666      PMCID: PMC3371869          DOI: 10.1093/bioinformatics/bts240

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

PSI-BLAST (Altschul ) uses an iterative strategy to construct a protein profile, in the form of a position-specific score matrix (PSSM), which dramatically improves homology detection in diverse protein families. Improved versions of PSI-BLAST have more accurate statistics and more sensitive consensus profiles (Agrawal ; Altschul , 2009; Bhadra ; Li ; Przybylski and Rost, 2008; Stojmirović ), but the most common cause of PSI-BLAST errors is contamination of the PSSM by extension of an homologous domain into a non-homologous region (homologous over-extension, HOE) (Gonzalez and Pearson, 2010a). Even searches with a single well-defined domain do not guarantee uncontaminated profiles (Kim ). Some HOE errors can be reduced by ‘profile cleaning’; HangOut (Kim ) focuses on long insertions, but requires insertion boundaries to be specified by the user, thus assuming a priori knowledge of the domain structure of the query protein. Here we present PSI-Search, an iterated profile search application for identifying distantly related protein sequences. PSI-Search is similar to PSI-BLAST, but substitutes a rigorous Smith–Waterman local alignment (Smith and Waterman, 1981) search strategy (SSEARCH, Pearson, 1991) to produce optimal local alignment scores from the profile PSSM. PSI-Search includes an optional alignment boundary-masking procedure that reduces HOE errors in the PSSM profile. SCANPS (Walsh ) implements a similar iterative search strategy using Smith–Waterman alignments; however, it does not currently scale to large protein databases and does not include boundary masking.

2 METHODS

In PSI-Search, library searches are performed with ssearch, selected hit sequences from the result are processed with an automated sequence boundary-masking procedure, and PSSM profiles are built using blastpgp. The PSI-Search iteration workflow (Fig. 1a) iterates through search and alignment/PSSM construction steps:
Fig. 1. (a) HOE-reduced PSI-Search iteration workflow. (b) Fraction of true-positives versus false-positives found by PSI-BLAST, PSI-BLAST HOE-reduced, PSI-Search, PSI-Search HOE-reduced, and JackHMMER. Weighted true-positives and false-positives are calculated as $\frac{1}{500}\sum_{f=1}^{500} \mathrm{tp}_f/\mathrm{total}_f$ (or with $\mathrm{fp}_f$ in place of $\mathrm{tp}_f$), where $\mathrm{tp}_f$ (or $\mathrm{fp}_f$) is the number of true positives (or false positives) for query $f$ at iteration 5 and $\mathrm{total}_f$ is the total number of homologs for query $f$ in the RefProtDom benchmark database. Alignments containing HOEs with >50% of the alignment outside the homologous boundary are counted as both true and false positives.

(1) The initial iteration is a normal ssearch run with a sequence input.
(2) During the second iteration, aligned sequences with statistically significant scores from the previous search are retrieved using fastacmd; details of the alignment boundaries are stored; sequence regions outside the boundaries are masked with ‘X’s to remove potential HOE regions; masked sequences are formatted into BLAST indexes using formatdb with an additional 10 000 random protein sequences created by makeprotseq (Rice ); a PSSM checkpoint is constructed with a blastpgp search; finally, ssearch is run with the input sequence, using the generated PSSM, to complete the second iteration and output alignments.
(3) Further iterations repeat Step (2).

To avoid HOEs, PSI-Search always uses the alignment boundary information from the first significant alignment in which a library sequence appears. Thus, if the first significant alignment with a library sequence aligns residues 25–125 at iteration i, later alignment boundaries at iteration i+1 and beyond are ignored; only the initially aligned region (25–125) is used to form the PSSM.
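The boundary-masking rule described above can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation: the names `mask_outside` and `BoundaryStore`, the 1-based inclusive coordinates, and the toy sequence are assumptions made for illustration only.

```python
from typing import Dict, Tuple


def mask_outside(seq: str, start: int, end: int) -> str:
    """Replace residues outside the 1-based inclusive range [start, end] with 'X'."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("boundaries must satisfy 1 <= start <= end <= len(seq)")
    return "X" * (start - 1) + seq[start - 1:end] + "X" * (len(seq) - end)


class BoundaryStore:
    """Keeps the boundaries of the FIRST significant alignment per library sequence.

    Hypothetical helper: later alignments of the same sequence never
    overwrite the stored boundaries, mirroring the rule in the text.
    """

    def __init__(self) -> None:
        self._first: Dict[str, Tuple[int, int]] = {}

    def record(self, seq_id: str, start: int, end: int) -> Tuple[int, int]:
        # setdefault stores the pair only if seq_id has not been seen before,
        # and always returns the boundaries that will be used for masking.
        return self._first.setdefault(seq_id, (start, end))


if __name__ == "__main__":
    store = BoundaryStore()
    print(store.record("lib_seq_1", 25, 125))   # first significant alignment
    print(store.record("lib_seq_1", 10, 200))   # later, longer alignment is ignored
    print(mask_outside("ACDEFGHIKL", 3, 6))     # toy sequence, not benchmark data
```

Storing the boundaries once and reusing them at every later iteration is what keeps an alignment that grows in a later round from pulling non-homologous residues into the PSSM.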

3 RESULTS

Five iterative search strategies, PSI-BLAST (standard and HOE-reduced), PSI-Search (standard and HOE-reduced) and JackHMMER (Eddy, 2011), were evaluated on the RefProtDom (Gonzalez and Pearson, 2010b) benchmark queries (500 sampled domain-embedded sequences) against the RefProtDom benchmark database using an E-value threshold of 0.001. JackHMMER is another iterative search tool that uses Hidden Markov Models (HMMs) (Johnson ) rather than a PSSM. The output alignments from the fifth iteration were classified into true positives (TPs) and false positives (FPs, Fig. 1b). At 50% family coverage, PSI-Search reduces the weighted fraction of errors from 4.5% (PSI-BLAST) to 2.9% (PSI-Search). Reducing HOE improves selectivity even more, lowering the weighted error fraction to 1.7% for HOE-reduced PSI-BLAST and 0.5% for HOE-reduced PSI-Search. At 50% coverage, JackHMMER performs very well using its statistical alignment envelope, producing only 1% weighted FPs, but its selectivity is worse than PSI-Search or HOE-reduced PSI-Search at 60% and 75% coverage. Overall, HOE-reduced PSI-Search is 9-fold more selective than PSI-BLAST. At the end of iteration 5, 78.3, 79.5, 77.3, 78.8 and 82.5% of weighted homologs are found by PSI-BLAST, PSI-Search, HOE-reduced PSI-BLAST, HOE-reduced PSI-Search and JackHMMER, respectively. Thus, (i) HOE-reduction greatly improves search selectivity with a small cost in sensitivity in both PSI-BLAST and PSI-Search; (ii) both PSI-Search and JackHMMER are more sensitive and selective than PSI-BLAST; (iii) HOE-reduced PSI-Search is more selective, but slightly less sensitive, than JackHMMER. JackHMMER is the most sensitive tool, but HOE-reduced PSI-Search is the most selective iterative tool.
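As a rough check on the arithmetic, the sketch below computes a per-query weighted fraction of the kind described in the Fig. 1 caption and the fold changes implied by the error fractions quoted at 50% family coverage. The function names and the toy per-query counts are assumptions for illustration; only the four percentages come from the text.

```python
from typing import Dict, Sequence, Tuple


def weighted_fraction(per_query: Sequence[Tuple[int, int]]) -> float:
    """Mean over queries of count / total, so every query counts equally."""
    if not per_query:
        raise ValueError("at least one query is required")
    return sum(count / total for count, total in per_query) / len(per_query)


def fold_change(baseline_pct: float, improved_pct: float) -> float:
    """How many times smaller the improved error fraction is than the baseline."""
    return baseline_pct / improved_pct


# Weighted error fractions (%) at 50% family coverage, as quoted in the text.
ERRORS_AT_50: Dict[str, float] = {
    "PSI-BLAST": 4.5,
    "PSI-Search": 2.9,
    "HOE-reduced PSI-BLAST": 1.7,
    "HOE-reduced PSI-Search": 0.5,
}

# Hypothetical (hits found, total homologs) pairs; NOT benchmark data.
TOY_QUERIES = [(8, 10), (45, 50), (1, 2)]

if __name__ == "__main__":
    print(f"toy weighted fraction: {weighted_fraction(TOY_QUERIES):.3f}")
    baseline = ERRORS_AT_50["PSI-BLAST"]
    for name, pct in ERRORS_AT_50.items():
        print(f"{name}: {fold_change(baseline, pct):.2f}-fold vs PSI-BLAST")
```

Dividing by each query's own total before averaging prevents large families from dominating the score; the ratio 4.5/0.5 reproduces the 9-fold figure for HOE-reduced PSI-Search.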
REFERENCES (10 of 17 shown)

1.  EMBOSS: the European Molecular Biology Open Software Suite.

Authors:  P Rice; I Longden; A Bleasby
Journal:  Trends Genet       Date:  2000-06       Impact factor: 11.639

2.  Protein database searches using compositionally adjusted substitution matrices.

Authors:  Stephen F Altschul; John C Wootton; E Michael Gertz; Richa Agarwala; Aleksandr Morgulis; Alejandro A Schäffer; Yi-Kuo Yu
Journal:  FEBS J       Date:  2005-10       Impact factor: 5.542

3.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

4.  Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.

Authors:  W R Pearson
Journal:  Genomics       Date:  1991-11       Impact factor: 5.736

5.  Accelerated Profile HMM Searches.

Authors:  Sean R Eddy
Journal:  PLoS Comput Biol       Date:  2011-10-20       Impact factor: 4.475

6.  Powerful fusion: PSI-BLAST and consensus sequences.

Authors:  Dariusz Przybylski; Burkhard Rost
Journal:  Bioinformatics       Date:  2008-08-04       Impact factor: 6.937

7.  Cascade PSI-BLAST web server: a remote homology search tool for relating protein domains.

Authors:  R Bhadra; S Sandhya; K R Abhinandan; S Chakrabarti; R Sowdhamini; N Srinivasan
Journal:  Nucleic Acids Res       Date:  2006-07-01       Impact factor: 16.971

8.  SCANPS: a web server for iterative protein sequence database searching by dynamic programing, with display in a hierarchical SCOP browser.

Authors:  Thomas P Walsh; Caleb Webber; Stephen Searle; Shane S Sturrock; Geoffrey J Barton
Journal:  Nucleic Acids Res       Date:  2008-05-24       Impact factor: 16.971

9.  The effectiveness of position- and composition-specific gap costs for protein similarity searches.

Authors:  Aleksandar Stojmirović; E Michael Gertz; Stephen F Altschul; Yi-Kuo Yu
Journal:  Bioinformatics       Date:  2008-07-01       Impact factor: 6.937

10.  PSI-BLAST pseudocounts and the minimum description length principle.

Authors:  Stephen F Altschul; E Michael Gertz; Richa Agarwala; Alejandro A Schäffer; Yi-Kuo Yu
Journal:  Nucleic Acids Res       Date:  2008-12-16       Impact factor: 16.971
