Tristan Bitard-Feildel, Carsten Kemena, Jenny M Greenwood, Erich Bornberg-Bauer.
Abstract
BACKGROUND: Orthologous protein detection software mostly uses pairwise comparisons of amino-acid sequences to assess whether two proteins are orthologous. Accordingly, as the number of sequences to compare increases, the number of comparisons grows quadratically. A current challenge in bioinformatics, especially given the increasing number of sequenced organisms, is to make this ever-growing number of comparisons computationally feasible in a reasonable amount of time. We propose to speed up the detection of orthologous proteins by using strings of domains to characterize the proteins.
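To make the quadratic growth concrete, the following minimal sketch counts the unordered pairs among n sequences, n(n - 1)/2. The function name `pairwise_comparisons` is hypothetical, and the count assumes a plain all-against-all scheme; actual tools may compare in both directions or prune pairs before alignment.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Return the number of unordered sequence pairs, n * (n - 1) / 2."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return n_sequences * (n_sequences - 1) // 2


if __name__ == "__main__":
    # A tenfold increase in sequences gives roughly a hundredfold
    # increase in the number of pairs to compare.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} sequences -> {pairwise_comparisons(n):>13,} pairs")
```

Under this assumption, going from 1,000 to 10,000 sequences raises the number of pairs from 499,500 to 49,995,000.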
Year: 2015 PMID: 25968113 PMCID: PMC4443542 DOI: 10.1186/s12859-015-0570-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. ROC curves of the developed COS and MWM measures, and of the NC method, against the SD− dataset (panel a), the SD+ dataset (panel b) and the OB dataset (panel c). For each panel, the left plots show the full ROC curves and the right plots a zoomed-in subsection along the x axis. The two COS and two MWM variants are evaluated with weighting (w) or without. The influence of the kinase family in the SD+ dataset on the sequence-similarity-based method (NC) is clearly visible in panel b.
Figure 2. Dotplot with domain visualisation of two proteins belonging to the PLUNC family (ENSRNOP00000052209 and ENSRNOP00000052216). The shadowed areas correspond to the sequence identity between the two sequences. Although the proteins share exactly the same DA, their sequence similarity is very low (20.8%). The alignment was run with the needle program of the EMBOSS package [35]. The dotplot was produced with the DoMosaics software [36].
AUC scores for all methods against the SD−, the SD+ and the OB datasets
| Method | SD− | SD+ | OB |
|---|---|---|---|
| NC | 0.993 | 0.844 | 0.919 |
| | 0.979 | 0.987 | 0.994 |
| | 0.971 | 0.978 | 0.992 |
| | 0.98 | 0.987 | 0.996 |
| | 0.971 | 0.973 | 0.996 |
| | 0.98 | 0.982 | 0.996 |
| | 0.972 | 0.974 | 0.996 |
| | 0.98 | 0.981 | 0.996 |
| | 0.972 | 0.969 | 0.997 |
The AUC scores are computed from the true positive rate (TPR) and false positive rate (FPR) of the different measures. The scores reflect the quality of the COS, MWM and NC measures for protein family classification. An AUC score of 1 corresponds to a perfect classification of the dataset. All methods produce a very good AUC score; a small general advantage can be observed for the methods using an order 1 parameter. Cosine methods perform better on the SD+ dataset, and the MWM methods generally perform better on the SD− dataset and on the OB dataset. Using the weighted version of the COS or MWM measure only improves the performance on the OB dataset.
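As an illustration of how an AUC can be obtained from TPR and FPR values, here is a minimal sketch using the trapezoidal rule over ROC points. The function name `roc_auc` and the example points are assumptions made for illustration; the source does not state which integration scheme or software was used.

```python
from typing import Sequence


def roc_auc(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Trapezoidal area under a ROC curve from matching FPR/TPR points."""
    if len(fpr) != len(tpr) or len(fpr) < 2:
        raise ValueError("need at least two matching (FPR, TPR) points")
    # Sort by FPR (then TPR) so the curve is traversed left to right.
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical ROC points, including the (0, 0) and (1, 1) end points.
    print(roc_auc([0.0, 0.5, 1.0], [0.0, 0.8, 1.0]))
```

A curve that rises straight to TPR = 1 at FPR = 0 gives an area of 1, matching the perfect classification described above, while the diagonal gives 0.5.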
Figure 3. Results of comparisons between porthoDom or proteinortho and the OrthoDB database. Different parameters are used for the domain content similarity step of porthoDom, and the default parameters of proteinortho are used for both methods. The parameters are: a domain content similarity cut-off of 0.5; a domain content similarity order of O1, corresponding to single-domain comparisons, or O2, corresponding to comparisons of pairs of domains; and an option to collapse tandem domain repeats or not. The different parameters have little influence on porthoDom because of the robustness of the domain content similarity method.
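The parameters in this caption can be illustrated with a small sketch of a cosine similarity over domain counts. This is an assumption-laden illustration, not porthoDom's implementation: the function names, the unweighted counting, and the treatment of O2 as consecutive domain pairs are all hypothetical choices, and the domain labels "A" and "B" are placeholders.

```python
from collections import Counter
from itertools import groupby
from math import sqrt
from typing import List, Sequence


def collapse_repeats(domains: Sequence[str]) -> List[str]:
    """Collapse tandem repeats, e.g. A A B A -> A B A."""
    return [name for name, _ in groupby(domains)]


def domain_ngrams(domains: Sequence[str], order: int) -> Counter:
    """Count single domains (order 1) or consecutive pairs (order 2)."""
    return Counter(
        tuple(domains[i:i + order]) for i in range(len(domains) - order + 1)
    )


def cosine_similarity(a: Sequence[str], b: Sequence[str],
                      order: int = 1, collapse: bool = False) -> float:
    """Unweighted cosine similarity between two domain strings."""
    if collapse:
        a, b = collapse_repeats(a), collapse_repeats(b)
    ca, cb = domain_ngrams(a, order), domain_ngrams(b, order)
    dot = sum(ca[key] * cb[key] for key in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    # Proteins too short to yield any n-gram are treated as dissimilar.
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    p1, p2 = ["A", "B", "B"], ["A", "B"]
    cutoff = 0.5  # cut-off value taken from the caption
    for order in (1, 2):
        for collapse in (False, True):
            score = cosine_similarity(p1, p2, order, collapse)
            print(order, collapse, round(score, 3), score >= cutoff)
```

In this toy case every parameter combination keeps the pair above the 0.5 cut-off, which is in the spirit of the limited parameter sensitivity noted in the caption, although a two-protein example cannot demonstrate it.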
Number and percentage of clusters in the different evaluation groups for proteinortho and porthoDom
| Group | proteinortho | porthoDom | porthoDom | porthoDom | porthoDom |
|---|---|---|---|---|---|
| Superset (%) | 5442 (26.13) | 5311 (28.53) | 5274 (28.26) | 5172 (27.58) | 5161 (27.4) |
| Subset (%) | 2639 (12.67) | 2054 (11.03) | 2077 (11.13) | 2104 (11.22) | 2171 (11.52) |
| Identical (%) | 1945 (9.34) | 1311 (7.04) | 1333 (7.14) | 1351 (7.21) | 1328 (7.05) |
| New (%) | 7419 (35.62) | 7396 (39.73) | 7445 (39.89) | 7580 (40.43) | 7622 (40.45) |
| Absent (%) | 3383 (16.24) | 2545 (13.67) | 2536 (13.59) | 2543 (13.56) | 2559 (13.58) |
The domain content similarity cut-off of porthoDom was set to 0.5, and different combinations of the parameters affecting order (O1, O2) and repeats (with or without) were tested.
Running time in minutes of proteinortho, porthoDom with pfam_scan.pl and porthoDom without pfam_scan.pl for the domain annotation
| | proteinortho | porthoDom with pfam_scan.pl | porthoDom without pfam_scan.pl |
|---|---|---|---|
| Run 1 | 1587 | 627 | 279 |
| Run 2 | 1588 | 649 | 272 |
| Run 3 | 1588 | 623 | 269 |
| Mean | 1587.6 | 633 | 273.3 |
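
The relative speed-up implied by these running times can be worked out with a short sketch. It assumes that the three value columns follow the order given in the table title (proteinortho, porthoDom with pfam_scan.pl, porthoDom without pfam_scan.pl); the function name `speedups` is hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Per-run times in minutes, copied from the table; the column-to-method
# assignment is an assumption based on the order in the table title.
RUNTIMES_MIN: Dict[str, List[int]] = {
    "proteinortho": [1587, 1588, 1588],
    "porthoDom with pfam_scan.pl": [627, 649, 623],
    "porthoDom without pfam_scan.pl": [279, 272, 269],
}


def speedups(runtimes: Dict[str, List[int]], baseline: str) -> Dict[str, float]:
    """Ratio of the baseline mean running time to each method's mean."""
    base = mean(runtimes[baseline])
    return {name: base / mean(times) for name, times in runtimes.items()}


if __name__ == "__main__":
    for name, factor in speedups(RUNTIMES_MIN, "proteinortho").items():
        print(f"{name}: {factor:.1f}x")
```

Under that column assignment, porthoDom is about 2.5 times faster than proteinortho when the domain annotation is included and about 5.8 times faster when it is not.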