Thierry Schuepbach, Marco Pagni, Alan Bridge, Lydie Bougueleret, Ioannis Xenarios, Lorenzo Cerutti.
Abstract
SUMMARY: The PROSITE resource provides a rich and well-annotated source of signatures in the form of generalized profiles that allow protein domain detection and functional annotation. One of the major limiting factors in the application of PROSITE in genome and metagenome annotation pipelines is the time required to search protein sequence databases for putative matches. We describe an improved and optimized implementation of the PROSITE search tool pfsearch that, combined with a newly developed heuristic, addresses this limitation. On a modern x86_64 hyper-threaded quad-core desktop computer, the new pfsearchV3 is two orders of magnitude faster than the original algorithm.
Year: 2013 PMID: 23505298 PMCID: PMC3634184 DOI: 10.1093/bioinformatics/btt129
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Execution times to search the PROSITE profile PS50255 (CYTOCHROME_B5_2) against 16 544 936 UniProtKB sequences (5 358 014 649 residues)
| Program | Heuristic off (−), SSE2 | Heuristic off (−), SSE4.1 | Heuristic on (+), SSE2 | Heuristic on (+), SSE4.1 |
|---|---|---|---|---|
| pfsearch (v2.4) | 51m32s | n.a. | n.a. | n.a. |
| pfsearchV3 (1 core*) | 33m02s | 20m17s | 1m55s | 1m44s |
| pfsearchV3 (2 cores*) | 16m54s | 10m23s | 0m58s | 0m53s |
| pfsearchV3 (4 cores*) | 9m14s | 5m40s | 0m31s | 0m28s |
| pfsearchV3 (8 cores+) | 9m04s | 5m28s | 0m28s | 0m27s |
The pfsearch and pfsearchV3 programs were compiled on Gentoo Linux (-mtune=corei7 -march=corei7 -fomit-frame-pointer -O2) with gcc (4.6.3) and glibc (2.15) using the following compilation options: -O3 --enable-mmap --enable-thread-affinity, CFLAGS='-mtune=corei7 -march=corei7 -ffast-math -mfpmath=sse', FFLAGS='-mtune=corei7 -march=corei7 -ffast-math -mfpmath=sse'. The static executable is available at the provided web address. All run times were measured on a quad-core Intel® Core™ i7-3770 CPU @ 3.40 GHz with 8 GB RAM running Linux 3.2.0-4-amd64. The number of cores, the SSE version and the use or otherwise of the heuristic were specified at runtime with the pfsearchV3 options -t, -s and -C, respectively. Both pfsearch and pfsearchV3 were run so as to produce the same output alignment (options -fxzl and -o 2, respectively). (*) Physical cores, obtained with options -k and -t of pfsearchV3. (+) The default mode of pfsearchV3, which uses all available cores with hyper-threading, for a total of eight cores on our test machine (options -t and -k are not used). NB: pfsearchV3 was run using an indexed sequence database (option -i); selecting this option reduces the execution time by 7 s in all experiments using the specified set of protein sequences.
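The "two orders of magnitude" claim in the summary can be checked against the run times in the table. The following is a minimal sketch that converts a few of the tabulated times to seconds and divides the pfsearch (v2.4) baseline by each pfsearchV3 time. The helper names (`to_seconds`, `speedup`) and the choice of rows are illustrative assumptions and are not part of pfsearchV3.

```python
import re

def to_seconds(stamp):
    """Convert a run time written like '51m32s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised run time: {stamp!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds

# Baseline: the only time reported for pfsearch (v2.4) in the table.
BASELINE = to_seconds("51m32s")

# A selection of pfsearchV3 run times copied from the table.
# Key: (cores, heuristic used?, SSE version)
RUNS = {
    (1, False, "SSE2"): "33m02s",
    (1, False, "SSE4.1"): "20m17s",
    (1, True, "SSE4.1"): "1m44s",
    (4, True, "SSE4.1"): "0m28s",
    (8, True, "SSE4.1"): "0m27s",
}

def speedup(stamp, baseline=BASELINE):
    """Ratio of the baseline run time to the given run time."""
    return baseline / to_seconds(stamp)

for (cores, heuristic, sse), stamp in RUNS.items():
    label = "with" if heuristic else "without"
    print(f"{cores} core(s), {sse}, {label} heuristic: {speedup(stamp):.1f}x")
```

With these figures, the default eight-thread run with the heuristic and SSE4.1 (27 s against 3092 s) comes out at a little over 100-fold, whereas a single core without the heuristic gains less than a factor of three. Most of the improvement therefore comes from the heuristic combined with multi-threading.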
Fig. 1. Estimation of the heuristic score cut-off for the PROSITE profile PS50255 (CYTOCHROME_B5_2). The profile scores and heuristic scores are plotted for the matched sequences: (closed circle) sequences from the seed alignment; (multi symbol) shuffled UniProtKB/Swiss-Prot sequences; (open circle) simulated sequences derived from the seed alignment mutated at various PAM distances (see text for explanatory notes). The heuristic search scores and profile search scores of the simulated sequences (open circle) exhibit a strong positive correlation (R² = 0.9). These scores are used to estimate the linear regression for the lower 5% quantile (black line), which is used to map the profile search scores to heuristic search scores. The standard linear regression is also plotted (dashed line).
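To illustrate the idea of a lower 5% quantile regression line, here is a minimal sketch that fits such a line by brute force and uses it to map a profile score cut-off to a heuristic score cut-off. It is not the authors' implementation: the fitting procedure, the function names and the score values are all assumptions made for illustration, and the data points are invented.

```python
from itertools import combinations

def pinball_loss(residual, q):
    """Quantile (pinball) loss for one residual = observed - predicted."""
    return q * residual if residual >= 0 else (q - 1) * residual

def quantile_line(points, q=0.05):
    """Brute-force quantile regression of y on x; returns (intercept, slope).

    An optimal quantile-regression line can always be found among the lines
    passing through two data points, so trying every pair is exact, although
    it costs O(n^3) and only suits small data sets.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot express y as a function of x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(pinball_loss(y - (intercept + slope * x), q)
                   for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]

# Invented (profile score, heuristic score) pairs, for illustration only.
simulated = [(8.0, 410.0), (9.5, 520.0), (11.0, 505.0), (12.5, 640.0),
             (14.0, 700.0), (15.5, 690.0), (17.0, 820.0), (18.5, 905.0)]
intercept, slope = quantile_line(simulated, q=0.05)

def heuristic_cutoff(profile_cutoff):
    """Map a profile score cut-off onto the heuristic score scale."""
    return intercept + slope * profile_cutoff

print(f"heuristic cut-off for profile score 10: {heuristic_cutoff(10.0):.1f}")
```

Fitting the lower quantile rather than the mean keeps the heuristic cut-off conservative: nearly all simulated sequences that reach a given profile score lie on or above the line, so filtering on the mapped heuristic score should discard few true matches.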