László Kaján, Thomas A Hopf, Matúš Kalaš, Debora S Marks, Burkhard Rost.
Abstract
BACKGROUND: Twenty years of improved technology and a growing number of sequences now render residue-residue contact constraints, inferred from correlated mutations in large protein families, accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods (EVfold-mfDCA and PSICOV) are publicly available, but both require too much CPU time for interactive applications. In addition, EVfold-mfDCA depends on proprietary software.
Year: 2014 PMID: 24669753 PMCID: PMC3987048 DOI: 10.1186/1471-2105-15-85
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
FreeContact command-line parameters
| Parameter | Description [range] |
|---|---|
| --clustpc | BLOSUM-style sequence clustering percentage [0–100] |
| --cov20 | when true, one amino acid is left out when forming the covariance matrix, making it non-overdetermined [ |
| --density | target precision matrix density [0–1] |
| --estimate-ivcov | perform inverse covariance matrix estimation instead of matrix inversion [Boolean] |
| --apply-gapth | exclude alignment columns with a weighted gap frequency greater than --gapth from the covariance matrix [Boolean] |
| --gapth | weighted gap frequency threshold (0–1] |
| --icme-timeout | inverse covariance matrix estimation timeout in seconds [0, ∞) |
| --mincontsep | minimum sequence separation (j - i ≥ arg) for reporting contacts [1, ∞) |
| --pseudocnt | pseudo-count for sequence weighting [0, ∞) |
| --pscount-weight | pseudo-count weight for sequence weighting [0–1] |
| --rho | initial value of GLASSO regularization parameter [0, ∞) |
| --parprof | parameter profile selection [evfold, psicov, psicov-sd] |
FreeContact command-line parameters controlling contact prediction.
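The value ranges in the table can be checked before a run is launched. The following is a minimal sketch, not part of FreeContact: the helper `check_param` and its data structures are hypothetical, the ranges are copied from the table, and `--cov20` is left out because its range is truncated there.

```python
from typing import Optional, Union

# Ranges copied from the parameter table above.
# Tuple layout: (low, high, low_inclusive, high_inclusive); high=None means unbounded.
NUMERIC_RANGES = {
    "clustpc": (0.0, 100.0, True, True),
    "density": (0.0, 1.0, True, True),
    "gapth": (0.0, 1.0, False, True),
    "icme-timeout": (0.0, None, True, False),
    "mincontsep": (1.0, None, True, False),
    "pseudocnt": (0.0, None, True, False),
    "pscount-weight": (0.0, 1.0, True, True),
    "rho": (0.0, None, True, False),
}
BOOLEAN_FLAGS = ("estimate-ivcov", "apply-gapth")
PROFILES = ("evfold", "psicov", "psicov-sd")


def check_param(name: str, value: Union[bool, int, float, str]) -> Optional[str]:
    """Return None if `value` fits the documented range of --name, else a message."""
    if name in BOOLEAN_FLAGS:
        return None if isinstance(value, bool) else f"--{name} expects a Boolean"
    if name == "parprof":
        return None if value in PROFILES else f"--parprof must be one of {PROFILES}"
    if name not in NUMERIC_RANGES:
        return f"--{name} is not covered by this sketch"
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"--{name} expects a number"
    low, high, low_inc, high_inc = NUMERIC_RANGES[name]
    above = value >= low if low_inc else value > low
    below = high is None or (value <= high if high_inc else value < high)
    if above and below:
        return None
    return f"--{name}={value} is outside the documented range"
```

For example, `check_param("gapth", 0.0)` returns a message because the threshold interval is open at 0, while `check_param("gapth", 1.0)` returns `None`.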
Figure 1. Runtimes for FreeContact. We measured the runtime (logarithmic y-axis) of each program component (x-axis) on a single thread. The program components were: “seqw” – sequence weighting; “pairfreq” – pairwise residue frequencies; “shrink” – shrinking of covariance matrix; “inv” – sparse inverse covariance estimation/covariance matrix inversion. The colors distinguish the original PSICOV implementation (blue), our acceleration of PSICOV (FC.psicov, yellow), our acceleration of the faster PSICOV version “sensible default” (FC.psicov-fast, green), and our implementation of EVfold-mfDCA (FC.evfold, red). The whiskers on the box plots extend to the most extreme data point that is less than 1.5 times the interquartile range from the box. Outliers are not shown. The total runtime of all methods tested is dominated by the sparse inverse covariance estimation/covariance matrix inversion component.
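The whisker convention described in the caption can be sketched as follows. The function `box_whiskers` is a hypothetical helper, and the quartile method (`inclusive` interpolation from the Python standard library) is an assumption; plotting tools differ in how they compute quartiles, so whisker positions may shift slightly.

```python
import statistics


def box_whiskers(values, coef=1.5):
    """Return (low_whisker, high_whisker, outliers) for a box plot.

    A point belongs to the whisker range if its distance from the box
    [Q1, Q3] is less than coef times the interquartile range.
    """
    data = sorted(values)
    # Assumption: quartiles by linear interpolation including both end points.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    reach = coef * (q3 - q1)

    def distance_from_box(x):
        # Zero for points inside the box, otherwise the gap to the nearer edge.
        return max(q1 - x, x - q3, 0.0)

    inside = [x for x in data if distance_from_box(x) == 0 or distance_from_box(x) < reach]
    outliers = [x for x in data if x not in inside]
    return inside[0], inside[-1], outliers
```

With runtimes such as `[1, 2, 3, 4, 5, 6, 7, 8, 9, 100]`, the value 100 lies outside the whisker range and would not be drawn in a plot that hides outliers.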
Figure 2. Speedup using multiple threads. A: Sequence weighting. Speed is calculated as (proteins in alignment)² × (length of target protein) / runtime. B: Pairwise residue frequency calculation. Speed is calculated as (proteins in alignment) × (length of target protein)² / runtime. Dashed lines indicate linear correlation, extrapolated from one thread. The whiskers extend to the most extreme data point that is less than 1.5 times the interquartile range from the box. The surprisingly clear correlation between the number of threads and speed demonstrates how well our implementation scales with multi-threading.
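The speed measures in the Figure 2 caption can be written out as a short sketch. It assumes the digit that follows "alignment" and "protein" in the caption is a lost superscript, i.e. speed A is N² × L / runtime and speed B is N × L² / runtime, with N the number of proteins in the alignment and L the target length. All function names are hypothetical.

```python
def seqw_speed(n_seqs: int, length: int, runtime_s: float) -> float:
    """Panel A: sequence weighting compares all sequence pairs, so work ~ N^2 * L."""
    return n_seqs ** 2 * length / runtime_s


def pairfreq_speed(n_seqs: int, length: int, runtime_s: float) -> float:
    """Panel B: pairwise residue frequencies cover all column pairs, so work ~ N * L^2."""
    return n_seqs * length ** 2 / runtime_s


def linear_extrapolation(single_thread_speed: float, threads: int) -> float:
    """Dashed line: speed expected if throughput grew linearly with thread count."""
    return single_thread_speed * threads


def parallel_efficiency(speed: float, single_thread_speed: float, threads: int) -> float:
    """Fraction of the linear extrapolation actually reached (1.0 = perfect scaling)."""
    return speed / linear_extrapolation(single_thread_speed, threads)
```

Normalizing by the amount of work makes alignments of different size comparable, so a measured speed close to `linear_extrapolation` indicates that the component parallelizes well.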
Mean precision values [%]
| | [j - i] > 4 | | | | [j - i] > 8 | | | |
|---|---|---|---|---|---|---|---|---|
| | L | L/2 | L/5 | L/10 | L | L/2 | L/5 | L/10 |
| PSICOV | 46 | 60 | 73 | 78 | 42 | 58 | 71 | 77 |
| FC.psicov | 46 | 60 | 73 | 77 | 42 | 57 | 71 | 77 |
| FC.psicov-fast | 44 | 58 | 72 | 77 | 41 | 55 | 70 | 76 |
| FC.evfold | 45 | 57 | 67 | 73 | 57 | 69 | 75 | |
| | [j - i] > 11 | | | | [j - i] > 23 | | | |
| | L | L/2 | L/5 | L/10 | L | L/2 | L/5 | L/10 |
| PSICOV | 40 | 55 | 70 | 77 | 33 | 47 | 65 | 73 |
| FC.psicov | 40 | 55 | 70 | 76 | 33 | 47 | 65 | 73 |
| FC.psicov-fast | 39 | 53 | 68 | 76 | 32 | 45 | 63 | 71 |
| FC.evfold | 69 | 75 | 64 | 72 | ||||
Mean precision values [%] for the top-L/n predicted contacts (L = length of target protein; n = 1, 2, 5, 10), grouped by sequence separation range [j - i] > sep (i, j = residue positions; sep = 4, 8, 11, 23). A residue pair counts as a true contact when its Cβ-Cβ distance (Cα-Cα for glycine) is less than 8 Å.
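The precision measure described in the caption can be sketched for a single protein as below. This is an illustration under stated assumptions, not the evaluation code used for the table: the function name, the dictionary-based inputs, and rounding L/n down to an integer are all hypothetical choices.

```python
def top_ln_precision(scores, distances, length, n, sep, cutoff=8.0):
    """Precision [%] of the top-L/n predicted contacts with |j - i| > sep.

    scores:    {(i, j): predicted contact score}, higher means more confident
    distances: {(i, j): observed distance in Å (Cβ-Cβ, Cα for glycine)}
    length:    L, the length of the target protein
    """
    # Keep only residue pairs beyond the required sequence separation.
    eligible = [(score, pair) for pair, score in scores.items()
                if abs(pair[1] - pair[0]) > sep]
    eligible.sort(key=lambda item: item[0], reverse=True)

    # Assumption: L/n is rounded down, with at least one contact evaluated.
    k = max(1, length // n)
    top = eligible[:k]
    if not top:
        return None  # no pair satisfies the separation filter

    hits = sum(1 for _score, pair in top if distances[pair] < cutoff)
    return 100.0 * hits / len(top)
```

A table entry would then be the mean of this value over all proteins in the test set, computed separately for each combination of n and sep.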