Sandeep K Dhanda, Kerrie Vaughan, Veronique Schulten, Alba Grifoni, Daniela Weiskopf, John Sidney, Bjoern Peters, Alessandro Sette.
Abstract
Epitopes identified in large-scale screens of overlapping peptides often share significant levels of sequence identity, complicating the analysis of epitope-related data. Clustering algorithms are often used to facilitate these analyses, but available methods are generally insufficient in their capacity to define biologically meaningful epitope clusters in the context of the immune response. To fulfil this need we developed an algorithm that generates epitope clusters based on representative or consensus sequences. This tool allows the user to cluster peptide sequences on the basis of a specified level of identity by selecting among three different method options. These include the 'clique method', in which all members of the cluster must share the same minimal level of identity with each other, and the 'connected graph method', in which all members of a cluster must share a defined level of identity with at least one other member of the cluster. In cases where it is not possible to define a clear consensus sequence with the connected graph method, a third option provides a novel 'cluster-breaking algorithm' for consensus sequence driven sub-clustering. Herein we demonstrate the tool's clustering performance and applicability using (i) a selection of dengue virus epitopes for the 'clique method', (ii) sets of allergen-derived peptides from related species for the 'connected graph method' and (iii) large data sets of eluted ligand, major histocompatibility complex binding and T-cell recognition data captured within the Immune Epitope Database (IEDB) with the newly developed 'cluster-breaking algorithm'. This novel clustering tool is accessible at http://tools.iedb.org/cluster2/.
Keywords: Allergy; Antigens/Peptides/Epitopes; Bioinformatics; MHC/HLA; Viral
Year: 2018 PMID: 30014462 PMCID: PMC6187223 DOI: 10.1111/imm.12984
Source DB: PubMed Journal: Immunology ISSN: 0019-2805 Impact factor: 7.397
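The difference between the 'connected graph method' and the 'clique method' described in the abstract can be illustrated with a small Python sketch. This is not the IEDB implementation: the `identity` definition (best ungapped overlap, matches divided by the length of the shorter peptide), the function names and the toy peptides are all assumptions chosen for illustration.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Assumed identity measure: best ungapped overlap, matches / shorter length."""
    short, long_ = sorted((a, b), key=len)
    best = 0
    # Slide the shorter peptide along the longer one, allowing overhangs.
    for offset in range(-len(short) + 1, len(long_)):
        matches = sum(
            1
            for i, residue in enumerate(short)
            if 0 <= offset + i < len(long_) and long_[offset + i] == residue
        )
        best = max(best, matches)
    return best / len(short)


def build_graph(peptides, threshold):
    """Link every pair of peptides whose identity reaches the threshold."""
    graph = {p: set() for p in peptides}
    for a, b in combinations(peptides, 2):
        if identity(a, b) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_clusters(graph):
    """Connected graph method: each member is linked to at least one other member."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def maximal_cliques(graph, r=frozenset(), p=None, x=frozenset()):
    """Clique method: every member is linked to every other (Bron-Kerbosch, no pivoting)."""
    p = frozenset(graph) if p is None else p
    if not p and not x:
        yield r
    for v in list(p):
        yield from maximal_cliques(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}
        x = x | {v}


# Hypothetical toy peptides, not taken from the paper.
peptides = ["AAAAAAAAAA", "AAAAACCCCC", "CCCCCCCCCC", "WWWWWWWWWW"]
graph = build_graph(peptides, threshold=0.5)
clusters = connected_clusters(graph)
cliques = set(maximal_cliques(graph))
```

In this toy input the first three peptides form a single connected cluster even though the first and third share no residues, while the clique view splits them into two overlapping cliques that both contain the middle peptide.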
Summary of the dengue virus data clustering using the clique approach
| Feature | Counts |
|---|---|
| Total peptides | 363 |
| Unique peptides | 305 |
| Total cliques | 198 |
| Peptides covered in cliques with two or more peptides | 225 |
| Cliques with two or more peptides | 118 |
| Unique peptides selected in cliques with two or more peptides | 83 |
| Singleton cliques | 80 |
| Total peptides selected | 163 |
Clustering tool output for the mouse and rat allergen data
| Cluster number | Peptide number | Alignment | Position | Description | Peptide |
|---|---|---|---|---|---|
| 1 | Consensus | | – | – | – |
| 1 | 1 | | 1 | Rat Pep17 | TFQLMVLYGRTKDLSSDIKE |
| 1 | 2 | | 6 | Mus Pep3 | GLYGREPDLSSDIKERFA |
| 1 | 3 | | 8 | Mus Pep7 | YGREPDLSLDIKEK |
| 1 | 4 | | 9 | Rat Pep9 | GRTKDLSSDIKEKFAKLCEA |
| 2 | Consensus | | – | – | – |
| 2 | 1 | | 1 | Rat Pep19 | YDRYVMFHLINFKNGETFQL |
| 2 | 2 | | 7 | Mus Pep9 | AHLINEKDGETFQLM |
| 2 | 3 | | 9 | Rat Pep12 | LINFKNGETFQLMVLYGRTK |
| 2 | 4 | | 11 | Mus Pep6 | NEKDGETFQLMGLY |
| 3 | Consensus | | – | – | – |
| 3 | 1 | | 1 | Rat Pep4 | EENGSMRVFMQHIDVLENSL |
| 3 | 2 | | 4 | Mus Pep16 | GSMRVFVEHIHVLEN |
| 4 | Consensus | | – | – | – |
| 4 | 1 | | 1 | Rat Pep6 | FMQHIDVLENSLGFKFRIKE |
| 4 | 2 | | 1 | Mus Pep2 | FVEHIHVLENSLAFK |
| 5 | Consensus | | – | – | – |
| 5 | 1 | | 1 | Rat Pep14 | RDNIIDLTKTDRCLQARG |
| 5 | 2 | | 2 | Mus Pep17 | ENIIDLTKTNRCLKA |
| 6 | Consensus | | – | – | – |
| 6 | 1 | | 1 | Rat Pep8 | GDWFSIVVASNKREKIEENG |
| 6 | 2 | | 2 | Mus Pep4 | EWFSILLASDKREKI |
| 7 | Consensus | | – | – | – |
| 7 | 1 | | 1 | Mus Pep10 | EEASSTGRNFNVQKINGEWHTIIL |
| 7 | 2 | | 11 | Mus Pep13 | NVEKINGEWHTIIL |
| 8 | Consensus | | – | – | – |
| 8 | 1 | | 1 | Rat Pep7 | FVEYDGGNTFTILKTDYDRY |
| 8 | 2 | | 5 | Mus Pep5 | DGFNTFTILKTDYDN |
| 9 | Singleton | | – | Rat Pep18 | TFTILKTDYDRYVMFHLINF |
| 10 | Singleton | | – | Mus Pep14 | GIYYLNYDGFNTFTI |
| 11 | Singleton | | – | Rat Pep10 | KTPEDGEYFVEYDGGNTFTI |
| 12 | Singleton | | – | Mus Pep8 | LENSLVLKFHTVRDE |
| 13 | Singleton | | – | Mus Pep21 | LQSGFYSLSSLVTVP |
| 14 | Singleton | | – | Rat Pep5 | ENSLGFKFRIKENGECRELY |
| 15 | Singleton | | – | Mus Pep11 | EKALVSSVRQRMKCS |
| 16 | Singleton | | – | Mus Pep1 | LEQIHVLENSLVL |
| 17 | Singleton | | – | Mus Pep15 | DDVVASEALNSVWSGF |
| 18 | Singleton | | – | Mus Pep12 | SRPFIFQEVIDLGGE |
| 19 | Singleton | | – | Mus Pep20 | DKETLSLEELKALLL |
| 20 | Singleton | | – | Mus Pep19 | IGGPDDGVITPWQSSF |
| 21 | Singleton | | – | Rat Pep2 | DIKEKFAKLCEAHGITRDNI |
| 22 | Singleton | | – | Rat Pep15 | RELYLVAYKTPEDGEYFVEY |
| 23 | Singleton | | – | Mus Pep18 | ILGKLVKDYHLQFHR |
| 24 | Singleton | | – | Mus Pep23 | TIFISLFLLSVCYSA |
| 25 | Singleton | | – | Mus Pep22 | EELRRLAPITSDPTE |
| 26 | Singleton | | – | Rat Pep13 | NLDVAKLNGDWFSIVVASNK |
| 27 | Singleton | | – | Rat Pep11 | LCEAHGITRDNIIDLTKTDR |
| 28 | Singleton | | – | Rat Pep16 | RIKENGECRELYLVAYKTPE |
| 29 | Singleton | | – | Rat Pep1 | ASNKREKIEENGSMRVFMQH |
| 30 | Singleton | | – | Rat Pep3 | EEASSTRGNLDVAKLNGDWF |
Statistics of peptides from the MHCLE data sets and their clustering at a 70% sequence identity threshold
| Features | MHCLE class I | MHCLE class II |
|---|---|---|
| No. of peptides | 105 642 | 33 757 |
| No. of peptides clustered | 57 455 | 28 523 |
| No. of singletons | 48 187 | 5234 |
| Singletons (% peptides) | 46 | 16 |
| No. of clusters | 11 932 | 4683 |
| Average size of cluster | 4.82 | 6.09 |
| No. of cliques | 36 732 | 25 941 |
| Average size of cliques | 2.78 | 57.36 |
MHCLE, major histocompatibility complex ligand elution.
Number of peptides/cliques (one peptide can be present in several cliques).
‘#’ denotes the count of a particular feature.
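The rows of the MHCLE table are related by simple arithmetic, which the short Python check below makes explicit. The interpretation is an assumption inferred from the numbers rather than stated in the source: clustered peptides plus singletons give the total, and the average cluster size appears to be the number of clustered peptides divided by the number of clusters (singletons excluded). The dictionary keys are hypothetical labels.

```python
# Values copied from the MHCLE table above; key names are illustrative only.
mhcle = {
    "class I": {"peptides": 105642, "clustered": 57455, "singletons": 48187, "clusters": 11932},
    "class II": {"peptides": 33757, "clustered": 28523, "singletons": 5234, "clusters": 4683},
}

summary = {}
for name, row in mhcle.items():
    # Every peptide is either clustered or a singleton.
    assert row["clustered"] + row["singletons"] == row["peptides"]
    summary[name] = {
        # Singletons as a percentage of all peptides, rounded as in the table.
        "singleton_pct": round(100 * row["singletons"] / row["peptides"]),
        # Assumed definition: clustered peptides per cluster, singletons excluded.
        "avg_cluster_size": round(row["clustered"] / row["clusters"], 2),
    }
```

Under these assumptions the computed values agree with the reported singleton percentages (46 and 16) and average cluster sizes (4.82 and 6.09).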
Figure 1. Plot representing the number of peptides in top 10 clusters from major histocompatibility complex class II ligand elution data.
Distribution of cluster size for MHCLE data set for class II with different approaches
| Cluster size | Number of clusters: raw data | Number of clusters: length filtered data | Number of clusters: length filtered data + cluster‐break algorithm |
|---|---|---|---|
| ≥ 1000 | 1 | 0 | 0 |
| 100–999 | 10 | 11 | 10 |
| 50–99 | 21 | 17 | 26 |
| 30–49 | 55 | 50 | 49 |
| 10–29 | 439 | 452 | 492 |
| < 10 | 4157 | 4196 | 4335 |
| Total | 4683 | 4726 | 4912 |
MHCLE, major histocompatibility complex ligand elution.
Length filtered data: to obtain a final list of peptide data sets where short (< 8 amino acids) and long (> 25 residues) peptides have been removed.
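The length filter described in the footnote amounts to keeping peptides of 8 to 25 residues inclusive. A minimal Python sketch follows; the function name `length_filter` and the example peptides are hypothetical, and the actual preprocessing used for the IEDB data sets may differ in detail.

```python
def length_filter(peptides, min_len=8, max_len=25):
    """Keep peptides whose length lies in [min_len, max_len] (inclusive)."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Hypothetical inputs: a 7-mer, an 8-mer, a 25-mer and a 26-mer.
example = ["A" * 7, "A" * 8, "A" * 25, "A" * 26]
kept = length_filter(example)
```

Only the 8-mer and the 25-mer survive, matching the footnote's removal of peptides shorter than 8 or longer than 25 residues.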
Distribution of peptide length in both the classes of MHCLE, MHC binding and T‐cell assay data sets
| Length | MHCLE I (%) | MHCLE II (%) | MHC binding I (%) | MHC binding II (%) | CD8 T cell (%) | CD4 T cell (%) |
|---|---|---|---|---|---|---|
| < 8 | 0·63 | 0·20 | 0·16 | 0·60 | 0·09 | 0·24 |
| 8 | 7·04 | 1·10 | 6·33 | 0·25 | 8·24 | 0·49 |
| 9 | 50·92 | 1·37 | 61·51 | 2·59 | 44·30 | 1·13 |
| 10 | 18·38 | 2·00 | 24·85 | 3·54 | 20·70 | 2·33 |
| 11 | 11·74 | 3·35 | 4·24 | 2·56 | 4·73 | 1·22 |
| 12 | 4·74 | 5·85 | 0·44 | 2·91 | 0·64 | 6·57 |
| 13 | 2·72 | 10·60 | 0·19 | 5·50 | 0·33 | 2·76 |
| 14 | 1·36 | 15·08 | 0·16 | 3·01 | 1·28 | 2·86 |
| 15 | 0·87 | 16·49 | 1·01 | 56·55 | 14·61 | 40·69 |
| 16 | 0·38 | 14·54 | 0·18 | 4·07 | 0·45 | 6·11 |
| 17 | 0·30 | 10·07 | 0·06 | 3·43 | 0·34 | 3·89 |
| 18 | 0·22 | 6·33 | 0·06 | 3·13 | 1·09 | 4·52 |
| 19 | 0·17 | 3·91 | 0·02 | 1·56 | 0·11 | 2·12 |
| 20 | 0·12 | 2·68 | 0·72 | 7·62 | 2·54 | 18·42 |
| 21 | 0·09 | 1·84 | 0·05 | 1·20 | 0·05 | 1·64 |
| 22 | 0·07 | 1·15 | 0·01 | 0·30 | 0·02 | 0·56 |
| 23 | 0·05 | 0·94 | 0·00 | 0·14 | 0·08 | 0·44 |
| 24 | 0·04 | 0·65 | 0·01 | 0·21 | 0·05 | 0·46 |
| 25 | 0·02 | 0·45 | 0·00 | 0·30 | 0·06 | 1·05 |
| > 25 | 0·14 | 1·40 | 0·00 | 0·53 | 0·25 | 2·52 |
| ≥ 8 and ≤ 25 | 99·22 | 98·41 | 99·83 | 98·87 | 99·65 | 97·24 |
MHC, major histocompatibility complex; MHC I, MHC class I; MHC II, MHC class II; MHCLE, MHC ligand elution; CD8, T‐cell data recognized by MHC I; CD4, T‐cell data recognized by MHC II.
Figure 2. Example cluster visualization before (a) and after (b) cluster‐breaking algorithms.
Distribution of cluster size for different data sets (number of clusters; length filtered data + cluster‐break algorithm)
| Cluster size | MHCLE II | MHCLE I | MHC binding I | MHC binding II | CD8 T cell | CD4 T cell |
|---|---|---|---|---|---|---|
| ≥ 1000 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100–999 | 10 | 0 | 6 | 9 | 0 | 6 |
| 50–99 | 26 | 8 | 24 | 14 | 12 | 23 |
| 30–49 | 49 | 9 | 60 | 32 | 58 | 77 |
| 10–29 | 492 | 277 | 507 | 202 | 431 | 431 |
| < 10 | 4335 | 15 842 | 5525 | 1933 | 4081 | 3421 |
| Total | 4912 | 16 136 | 6122 | 2190 | 4582 | 3958 |
MHC, major histocompatibility complex; MHC I, MHC class I; MHC II, MHC class II; MHCLE, MHC ligand elution; CD8, T‐cell data recognized by MHC I; CD4, T‐cell data recognized by MHC II.
Figure 3. Plot representing the top 10 clusters in different data sets after applying cluster‐break algorithm.
Figure 4. Analysis of overlapping clusters in major histocompatibility complex (MHC) binding, T‐cell and MHC ligand elution data. (a) H‐chart for overlapping clusters between MHC class I binding and CD8 T‐cell assays. (b) H‐chart for overlapping clusters between MHC class II binding and CD4 T‐cell assays. (c) Pie‐chart of overlapping clusters in MHC class I ligand elution data and CD8 T‐cell assays. (d) Pie‐chart of overlapping clusters in MHC class II ligand elution data and CD4 T‐cell assays. (e) Pie‐chart of overlapping clusters in MHC class I ligand elution and binding assays data. (f) Pie‐chart of overlapping clusters in MHC class II ligand elution and binding assays data.
Figure 5. Screenshots from online tool. (a) Specify Sequence (Step 1), (b) Select clustering parameters (Step 2), (c) Choose clustering algorithm (Step 3), (d) Tabular output (Result page), (e) Graphical Output (Result page).
Features and applications of different clustering approaches implemented in cluster2
| Features | Clique | Connected clusters | Cluster‐break |
|---|---|---|---|
| Clear consensus sequence | Mostly clear, as peptides are fully interconnected | May have many ‘X’ at different ambiguous positions | Clusters are broken down to get maximum possible clear consensus |
| Large membership issue | No | Yes | Resolved |
| Redundancy in clusters | Yes | No | No |
| Same sequence identity threshold between peptides in clusters | Yes | No | No |
| Application | Generating mega pools | Cross‐reactivity of small data sets | Cross‐reactivity of large data sets |
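The 'X' at ambiguous positions mentioned in the table can be illustrated with a simple column-wise consensus. The sketch below is an assumption, not the tool's documented rule: it places each peptide at its 1-based start position, emits the residue when all peptides covering a column agree, and emits 'X' otherwise. The function name `consensus` is hypothetical; the two peptides and their positions are cluster 5 from the mouse and rat output table.

```python
def consensus(aligned):
    """aligned: list of (start_position, peptide) pairs, start positions 1-based.

    Assumed rule: a column shows its residue if every peptide covering it
    agrees, and 'X' if the peptides disagree (or no peptide covers it).
    """
    width = max(start - 1 + len(seq) for start, seq in aligned)
    columns = [set() for _ in range(width)]
    for start, seq in aligned:
        for i, residue in enumerate(seq):
            columns[start - 1 + i].add(residue)
    return "".join(next(iter(col)) if len(col) == 1 else "X" for col in columns)


# Cluster 5 from the mouse and rat table: Rat Pep14 at position 1, Mus Pep17 at position 2.
cluster5 = [(1, "RDNIIDLTKTDRCLQARG"), (2, "ENIIDLTKTNRCLKA")]
cluster5_consensus = consensus(cluster5)
```

With more members and looser links, as in large connected clusters, more columns disagree and the consensus fills with 'X', which is the situation the cluster-break option is meant to address.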
Feature‐based comparison of different clustering algorithms
| Feature | GibbsCluster | PepServe | Hammock | UCLUST | Cluster1.0 | Cluster2.0 |
|---|---|---|---|---|---|---|
| Input sequences | Amino acid sequence | Amino acids subjected to retrieval | Amino acid sequence | Amino acid sequence | Amino acid sequence | Amino acid sequence |
| Freely available online tool | Yes | Yes | Yes (Galaxy) | No | Yes | Yes |
| Graphical visualization | Yes | Yes | No | No | No | Yes |
| Provides connectivity in a cluster | No | No | No | No | No | Yes |
| Cluster representative sequence | Motif | No | Main sequence | No | No | Consensus sequence |
| Large membership issue | NA | NA | NA | Yes | Yes | Resolved |
| Overhang sequences identity calculation | NA | NA | NA | Yes, but consider only aligned region | No | Yes |
| Clustering basis | Supervised | Unsupervised | Unsupervised | Unsupervised | Unsupervised | Unsupervised |
NA: feature cannot be compared.