Zhiliang Chen, Andrew M Collins, Yan Wang, Bruno A Gaëta.
Abstract
BACKGROUND: Clonal expansion of B lymphocytes, coupled with somatic mutation and antigen selection, allows the mammalian humoral immune system to generate highly specific immunoglobulins (IG), or antibodies, against invading bacteria, viruses and toxins. The availability of high-throughput DNA sequencing methods is providing new avenues for studying this clonal expansion and identifying the factors guiding the generation of antibodies. The identification of groups of rearranged immunoglobulin gene sequences descended from the same rearrangement (clonally-related sets) in very large sets of sequences is facilitated by immunoglobulin gene sequence alignment and partitioning software that can accurately predict the component germline genes, but has required painstaking visual inspection and analysis of sequences.
Year: 2010 PMID: 20875155 PMCID: PMC2946782 DOI: 10.1186/1745-7580-6-S1-S4
Source DB: PubMed Journal: Immunome Res ISSN: 1745-7580
PNG benchmark dataset
| Number of clusters | Number of sequences per cluster |
|---|---|
| 1 | 16 |
| 1 | 7 |
| 1 | 6 |
| 2 | 5 |
| 3 | 4 |
| 16 | 3 |
| 42 | 2 |
Comparison of distance measures for clustering immunoglobulin gene variable sequences
| Clustering method | Number of clusters below the threshold | Number of sequences in clusters below threshold | Number of clusters different from benchmark set | Number of incorrectly assigned sequences | Correctly clustered sequences (%) |
|---|---|---|---|---|---|
| (a) Expert inspection | 67 | 184 | 4 | 16 | 95.1 |
| (b) LD | 117 | 364 | 71 | 182 | 50.0 |
| (c) PNED | 93 | 258 | 36 | 76 | 70.5 |
| (d) NED | 78 | 211 | 15 | 29 | 85.9 |
| (e) NED_VJ | 70 | 190 | 4 | 8 | 95.8 |
Sequences in the benchmark PNG dataset were clustered using the following methods. (a) Expert inspection: visual inspection of the partitioned gene segments without automated clustering. (b) LD: automated clustering based on the pairwise Levenshtein distance between CDR3 sequences. (c) PNED: automated clustering based on the post-normalized edit distance, in which the Levenshtein distance of each sequence pair is normalized by the square root of the length of the longer sequence in the comparison. (d) NED: automated clustering based on the Normalized Edit Distance. (e) NED_VJ: automated clustering based on the Normalized Edit Distance, incorporating germline gene identity. A gap penalty of 3 was applied in each automated method. The resulting clusterings were evaluated relative to the “benchmark” clustering obtained by a combination of automated clustering and visual inspection.
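The LD and PNED measures described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `weighted_levenshtein` and `pned` are hypothetical, the gap penalty of 3 follows the table note, and the substitution cost of 1 is an assumption, since the source does not state it. NED and NED_VJ are not covered here.

```python
import math


def weighted_levenshtein(a: str, b: str, gap: int = 3, mismatch: int = 1) -> int:
    """Edit distance where each insertion/deletion costs `gap` and each
    substitution costs `mismatch` (the latter value is an assumption)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else mismatch)
            delete = prev[j] + gap
            insert = curr[j - 1] + gap
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def pned(a: str, b: str, gap: int = 3) -> float:
    """Post-normalized edit distance: the edit distance divided by the
    square root of the length of the longer sequence."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # two empty sequences are treated as identical
    return weighted_levenshtein(a, b, gap=gap) / math.sqrt(longer)


if __name__ == "__main__":
    # Toy CDR3-like strings, invented for illustration only.
    print(weighted_levenshtein("ACGT", "AGGT"))
    print(pned("ACGT", "AGGT"))
```

Dividing by the square root of the longer length, rather than by the length itself, reduces the penalty on long sequences less aggressively, so a fixed number of differences still counts for somewhat more in short CDR3 sequences than in long ones.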
Figure 1. Maximum distance (NED_VJ) within a cluster versus number of clusters processed for the PNG, PW99 and PW57 datasets. The red line corresponds to the distance threshold below which sequences are considered to be clonally-related.