Eleftheria Polychronidou, Ilias Kalamaras, Andreas Agathangelidis, Lesley-Ann Sutton, Xiao-Jie Yan, Vasilis Bikos, Anna Vardi, Konstantinos Mochament, Nicholas Chiorazzi, Chrysoula Belessi, Richard Rosenquist, Paolo Ghia, Kostas Stamatopoulos, Panayiotis Vlamos, Anna Chailyan, Nanna Overby, Paolo Marcatili, Anastasia Hatzidimitriou, Dimitrios Tzovaras.
Abstract
BACKGROUND: Although the etiology of chronic lymphocytic leukemia (CLL), the most common type of adult leukemia, is still unclear, strong evidence implicates antigen involvement in disease ontogeny and evolution. Both primary and 3D structure analyses have been used to look for indications of antigenic pressure. The 3D analyses have mostly been based on 3D models of the clonotypic B cell receptor immunoglobulin (BcR IG) amino acid sequences. Their accuracy therefore depends directly on the quality of the model construction algorithms and on the specific methods used to compare the resulting models. Thus far, reliable and robust methods that can group IG 3D models based on their structural characteristics are missing.
Keywords: 3D protein descriptors; CLL protein clustering; descriptor fusion
Year: 2018 PMID: 30453883 PMCID: PMC6245605 DOI: 10.1186/s12859-018-2381-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Distance metrics that measure the average distance between the atoms of superimposed proteins
| Similarity metric | Method or software |
|---|---|
| RMSD | MAMMOTH |
| SAS score & GSAS score | |
| TM-score | TM-align, Fr-TM-align |
| S score | MatAlign |
| STRUCTAL score | LOVOalign |
| Q-score | SSM |
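As an illustration of the simplest metric in the table, the following is a minimal sketch of RMSD for two structures that are assumed to be already superimposed, with atoms paired by index. The function name `rmsd` and the toy coordinates are hypothetical and are not taken from the paper or from any of the listed tools, which also perform the alignment step that is omitted here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: both structures are already superimposed and the i-th
    atom of coords_a corresponds to the i-th atom of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared_sum = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        squared_sum += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: every atom of structure_b is shifted by one unit along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
example_rmsd = rmsd(structure_a, structure_b)
```

A lower value indicates closer agreement between the paired atoms; identical coordinates give zero.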
Fig. 1. Block diagram illustrating the proposed methodology
Datasets description
| Dataset | Patients | Predefined subsets |
|---|---|---|
| D1 | 137 | 6 (D1.A ∼ D1.F) |
| D2 | 925 | N/A |
Subset size distribution in the annotated dataset
| Subset | Type | Size |
|---|---|---|
| 1 | IGHV clan I/IGKV1(D)-39 | 38 |
| 2 | IGHV3-21/IGLV3-21 | 42 |
| 4 | IGHV4-34/IGKV2-30 | 22 |
| 6 | IGHV1-69/IGKV3-20 | 12 |
| 7 | IGHV1-69/IGLV3-9 | 12 |
| 8 | IGHV4-39/IGKV1(D)-39 | 11 |
Comparison of clustering accuracy (Rand index) between TM-score and the various 3D descriptors (6 clusters) for the 137 protein structures
| Method | K-medoids | Agglomerative | DBSCAN |
|---|---|---|---|
| TM-score | 85.40% | 58.25% | 71.23% |
| FPFH | 89.10% | 86.59% | 88.40% |
| 3DSC | 88.00% | 78.60% | 86.20% |
| RSD | 89.5% | 77.32% | 84.67% |
| VFH | 83.20% | 65.62% | 76.31% |
| Combined Silhouette Weights | 88.67% | | |
| Combined Equal Weights | 89.00% | 85.51% |
The highest accuracy is highlighted
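The Rand index reported in these tables measures agreement between a clustering and a reference partition as the fraction of item pairs on which the two agree (both place the pair together, or both place it apart). The sketch below is a plain pair-counting version; the function name `rand_index` and the toy labels are illustrative and not drawn from the paper.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two partitions agree."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label lists must have the same length")
    if n < 2:
        raise ValueError("at least two items are needed to form a pair")
    agreeing = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_in_true = labels_true[i] == labels_true[j]
        same_in_pred = labels_pred[i] == labels_pred[j]
        if same_in_true == same_in_pred:
            agreeing += 1
        total += 1
    return agreeing / total


# Toy example: the predicted clustering splits the second reference subset.
reference = [1, 1, 2, 2]
predicted = [0, 0, 1, 2]
example_ri = rand_index(reference, predicted)  # 5 of the 6 pairs agree
```

Because only co-membership is compared, the index does not depend on how the clusters are numbered.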
Fig. 2. Determination of optimal number of clusters for the FPFH descriptor
Comparison of clustering accuracy between TM-score and the various 3D descriptors (optimal number of clusters) for the 137 protein structures
| Method | Num. clusters | Rand index |
|---|---|---|
| TM-score | 8 | 89.7% |
| FPFH | 9 | 89.3% |
| 3DSC | 9 | 89.5% |
| RSD | 7 | 92.0% |
| VFH | 8 | 85.3% |
| Combined silhouette weights | 7 | |
| Combined equal weights | 7 | 90.2% |
The highest accuracy is highlighted
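The "combined" rows fuse several descriptors before clustering. This record does not state how the fusion is computed, so the following is only a sketch under a stated assumption: each descriptor yields a pairwise distance matrix, each matrix is scaled by its maximum, and the scaled matrices are summed with weights that are either equal or supplied by the caller (for instance, one quality score per descriptor). The function name `combine_distance_matrices` and the toy matrices are hypothetical.

```python
def combine_distance_matrices(matrices, weights=None):
    """Weighted sum of max-normalised square distance matrices.

    Assumption (not from the source): fusion is a convex combination of
    per-descriptor distance matrices. With weights=None, equal weights
    are used.
    """
    k = len(matrices)
    if k == 0:
        raise ValueError("at least one distance matrix is required")
    if weights is None:
        weights = [1.0] * k
    if len(weights) != k:
        raise ValueError("one weight per matrix is required")
    weight_sum = float(sum(weights))
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for matrix, weight in zip(matrices, weights):
        # Scale each matrix to [0, 1] so no descriptor dominates by units.
        peak = max(max(row) for row in matrix) or 1.0
        for i in range(n):
            for j in range(n):
                combined[i][j] += (weight / weight_sum) * matrix[i][j] / peak
    return combined


# Toy example with two descriptors over three structures.
dist_a = [[0, 2, 4], [2, 0, 2], [4, 2, 0]]
dist_b = [[0, 10, 5], [10, 0, 10], [5, 10, 0]]
equal_fused = combine_distance_matrices([dist_a, dist_b])
weighted_fused = combine_distance_matrices([dist_a, dist_b], weights=[3, 1])
```

The fused matrix can then be passed to any clustering method that accepts precomputed distances, such as K-medoids or agglomerative clustering.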
Fig. 3. Clustering of the annotated protein dataset, using the combined descriptors method
Comparison of clustering accuracy between TM-score and the various 3D descriptors (optimal number of clusters) for the 925 protein structures
| Method | Num. clusters | Avg. silhouette width | Rand index |
|---|---|---|---|
| TM-score | 4 | 0.001 | 60.0% |
| FPFH | 14 | 0.070 | 88.9% |
| 3DSC | 13 | 0.057 | 89.3% |
| RSD | 9 | 0.056 | 83.9% |
| VFH | 7 | 0.006 | 76.3% |
| Combined silhouette weights | 15 | 0.071 | 90.2% |
| Combined equal weights | 14 | 0.069 | |
The highest accuracy is highlighted
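The average silhouette width in the table summarises how well separated a clustering is, given only pairwise distances and cluster labels. The sketch below follows the standard silhouette definition; the function name `average_silhouette_width`, the convention of scoring singleton clusters as zero, and the toy data are choices made for this illustration rather than details taken from the paper.

```python
def average_silhouette_width(dist, labels):
    """Mean silhouette value over all items, from a precomputed distance matrix."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: four points on a line forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
good_width = average_silhouette_width(toy_dist, [0, 0, 1, 1])
poor_width = average_silhouette_width(toy_dist, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters. One common way to choose the number of clusters is to evaluate this quantity for several candidate values and keep the one with the largest width.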
Fig. 4. Clustering of both the combined annotated and unannotated protein dataset, using the combined descriptors method