Yingying Wang1, Hongyan Wu2, Yunpeng Cai3.
Abstract
BACKGROUND: Protein sequence alignment has become a crucial step in many bioinformatics studies over the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are the two major approaches to sequence alignment. Previous benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also affect protein sequence alignment, we propose a new benchmark framework for protein clustering based on cluster validity. The framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and it evaluates alignment quality by how well the biological goal is achieved rather than by sequence-level comparison alone, which avoids the biases introduced by alignment scores or manual alignment templates. Unlike previous studies, we calculate the cluster validity score from sequence distances instead of clustering results. This strategy avoids the influence of the choice of clustering method and thus makes the results more dependable.
Keywords: Benchmark; Multiple sequence alignment; Pair-wise sequence alignment
Year: 2018 PMID: 30598070 PMCID: PMC6311937 DOI: 10.1186/s12859-018-2524-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Framework of this benchmark study. The study follows four main steps: data generation, alignment, evaluation calculation, and significance analysis
Benchmark datasets list
| Reference name | Dataset ID^a | Number of sequences^b | Number of classes^c | Average length |
|---|---|---|---|---|
| Reference1 | RV11 | 236 | 38 | 301.178 |
| Reference1 | RV12 | 382 | 44 | 392.6885 |
| Reference2 | RV20 | 1706 | 41 | 384.3581 |
| Reference3 | RV30 | 1723 | 30 | 387.9745 |
| Reference4 | RV40 | 1113 | 49 | 480.0952 |
| Reference5 | RV50 | 443 | 16 | 516.6546 |
| Reference9 | RV911 | 423 | 29 | 701.5792 |
| Reference9 | RV912 | 228 | 28 | 454.0351 |

^a Dataset IDs are abbreviations for the datasets and are used to refer to the corresponding dataset in this paper. ^b Number of sequences is the number of sequences with only one class label in the raw datasets. ^c Number of classes is the number of pre-defined protein clusters in each benchmark dataset.
Fig. 2 Cluster validation results based on SW score. a The SW scores of the benchmark datasets. b The SW scores of the re-sampled benchmark datasets
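
The SW score in Fig. 2 is computed from sequence distances and the reference class labels rather than from the output of a clustering algorithm. A minimal sketch of that idea follows, assuming SW denotes the mean silhouette width (the excerpt does not spell out the abbreviation) and that alignment-derived pairwise distances are already available as a square matrix. The function name `mean_silhouette` and the toy distances are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mean_silhouette(dist, labels):
    """Mean silhouette width from a square distance matrix and reference labels.

    dist   -- list of lists, dist[i][j] is the distance between sequences i and j
    labels -- reference class label of each sequence (the ground truth)
    """
    n = len(labels)
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)
    if len(members) < 2:
        raise ValueError("need at least two classes")

    total = 0.0
    for i in range(n):
        own = members[labels[i]]
        if len(own) == 1:
            # Common convention: a singleton class contributes a silhouette of 0.
            continue
        # a: mean distance to the other members of the same class.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other class.
        b = min(
            sum(dist[i][j] for j in idxs) / len(idxs)
            for lab, idxs in members.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / n


# Toy example (made-up distances): four sequences in two reference classes.
toy_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
toy_labels = ["A", "A", "B", "B"]
toy_score = mean_silhouette(toy_dist, toy_labels)
```

Because the labels come from the benchmark and not from a clustering run, a higher score here means the alignment-derived distances separate the reference classes more cleanly, independent of any clustering method.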
Average SW and RS scores of Esprit and MUSCLE (default) in re-sampled benchmark datasets
| Re-sampled dataset | Average SW score of Esprit | Average SW score of MUSCLE (default) | p-value | Average RS score of Esprit | Average RS score of MUSCLE (default) | p-value |
|---|---|---|---|---|---|---|
| RV11 | 0.014383 | 0.007808 | 0.008627 | 0.91387 | 0.933349 | 0.5101 |
| RV12 | 0.108044 | 0.003982 | < 2.2e-16 | 0.705558 | 0.707488 | 0.8436 |
| RV20 | 0.193411 | 0.009224 | < 2.2e-16 | 0.403657 | 0.365046 | 0.07816 |
| RV30 | 0.125547 | 0.007049 | < 2.2e-16 | 0.407991 | 0.357121 | 0.001885 |
| RV40 | 0.072819 | 0.000606 | < 2.2e-16 | 0.705149 | 0.708882 | 0.7688 |
| RV50 | 0.086507 | 0.006049 | < 2.2e-16 | 0.322656 | 0.331129 | 0.5612 |
| RV911 | 0.01329 | −0.00131 | 3.512e-12 | 0.839489 | 0.826965 | 0.1271 |
| RV912 | 0.167038 | 0.02381 | < 2.2e-16 | 0.487026 | 0.37473 | 4.733e-07 |
Fig. 3 Cluster validation results based on RS score. a The RS scores of the benchmark datasets. b The RS scores of the re-sampled benchmark datasets