Xiujia Yang1,2,3,4, Yan Zhu1, Sen Chen3,4, Huikun Zeng1,2, Junjie Guan3,4, Qilong Wang1,2, Chunhong Lan4, Deqiang Sun5, Xueqing Yu2,6, Zhenhai Zhang1,2,3,4,7.
Abstract
Detailed knowledge of the diverse immunoglobulin germline genes is critical for the study of humoral immunity. Hundreds of alleles have been discovered by analyzing antibody repertoire sequencing (Rep-seq or Ig-seq) data via multiple novel allele detection tools (NADTs). However, the performance of these NADTs on antibody sequences with intrinsic somatic hypermutations (SHMs) is unclear. Here, we developed a tool to simulate repertoires by integrating the full spectrum of features of an antibody repertoire, such as germline gene usage, junctional modification, position-specific SHM and clonal expansion, based on 2152 high-quality datasets. We then systematically evaluated these NADTs using both simulated and genuine Ig-seq datasets. Finally, we applied these NADTs to 687 Ig-seq datasets and identified 43 novel allele candidates (NACs) using defined criteria. Twenty-five alleles were validated through findings from other sources. In addition to the NACs detected, our simulation tool, the results of our comparison, and the streamlined workflow may benefit further humoral immunity studies via Ig-seq.
Keywords: Ig-seq; antibody repertoire; high-throughput sequencing; novel allele; tools benchmarking
Year: 2021 PMID: 34764956 PMCID: PMC8576399 DOI: 10.3389/fimmu.2021.739179
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
The basic information for 5 NADTs.
| NADTs | Year | # Citations* | Programming language(s) | Supported receptor type(s) | Supported chain type(s) | Supported gene type(s) | Nonhuman species supported | Comparison with other tools | | Algorithm | Authors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TIgGER | 2015 | 104 | R | BCR | IGH, IGK, IGL | V | Yes | No | Yes | Mutation accumulation models | Gadala-Maria et al. |
| IMPre | 2016 | 20 | C, Perl | BCR, TCR | IGH, IGK, IGL, TRB, TRA | V, J | Yes | No | Yes | Seed_Clust | Zhang et al. |
| IgDiscover | 2016 | 81 | Python | BCR | IGH, IGK, IGL | V, D, J | Yes | No | No | Windowed cluster analysis, Linkage cluster analysis | Corcoran et al. |
| LymAnalyzer | 2016 | 41 | Java | BCR, TCR | IGH, IGK, IGL, TRB, TRA | V, J | Yes | No | No | Mismatch quality control | Yu et al. |
| Partis | 2019 | 12 | C, C++, Perl, Python | BCR | IGH, IGK, IGL | V | Yes | Yes | Yes | Mutation accumulation models | Ralph et al. |
*Citation counts were obtained on 2020/4/13 from Google Scholar (https://scholar.google.com/).
Figure 1. Schematic overview of the study design. In this study, both in silico simulated and genuine Ig-seq datasets were employed as benchmark datasets and served as input to all five NADTs independently. The performance of these NADTs was then summarized, integrated, and translated into filtration criteria for evaluating NACs. Among all NACs reported from the collected bulk sequencing datasets, we retained only those credible NACs that passed the defined filtration criteria.
Figure 2. IMPlAnts, an Integrated and Modular Pipeline for Antibody Repertoire Simulation. IMPlAnts consists of three consecutive steps: i) individual rearrangement simulation; ii) SHM and clonal expansion simulation; and iii) next generation sequencing simulation. In steps i and ii, V(D)J gene usage, junctional modification and position-specific SHM were learned from a previous large-scale study encompassing 2152 high-quality Ig-seq datasets. After SHMs were simulated, a power law was used to simulate the clonal size distribution. Finally, an NGS read simulator, ART, was used to produce sequencing reads (Illumina MiSeq, PE250).
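The power-law step in the caption above can be illustrated with a minimal sketch. It draws clone sizes from a truncated discrete power law, P(k) ∝ k^(-alpha). The function name and the `alpha` and `max_size` values are hypothetical choices for illustration only; they are not parameters taken from IMPlAnts.

```python
import random


def sample_clone_sizes(n_clones, alpha=2.0, max_size=1000, seed=None):
    """Draw clone sizes from a truncated discrete power law P(k) ~ k**(-alpha).

    alpha and max_size are illustrative assumptions, not IMPlAnts settings.
    """
    rng = random.Random(seed)
    sizes = range(1, max_size + 1)
    # Unnormalized weights; random.choices normalizes them internally.
    weights = [k ** -alpha for k in sizes]
    return rng.choices(sizes, weights=weights, k=n_clones)


clone_sizes = sample_clone_sizes(1000, seed=0)
```

Under such a distribution most clones are singletons while a few are large, which is the qualitative shape a power law is meant to capture.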
Characterization of four simulated datasets.
| Dataset | Studied variable | # Repertoires | # Reads (million) | Gene expression | Minor allele frequency | # SNPs | SHM frequency |
|---|---|---|---|---|---|---|---|
| DEXPR | gene expression | 20 (5, 5, 5, 5) | 1, 1, 1, 1 | ~5%, ~1%, ~0.1%, ~0.01% | – | 1 | 0 |
| DALLELE | minor allele frequency | 20 (5, 5, 5, 5) | 0.1, 0.16, 0.5, 1 | ~5% | 50%, 30%, 10%, 5% | 1 | 0 |
| DSNP | # SNPs | 20 (5, 5, 5, 5) | 1, 1, 1, 1 | ~5% | – | 1, 3, 5, 7 | 0 |
| DSHM | SHM | 10 (5, 5) | 1, 1 | ~5% | – | 1 | 0, ~6% |
Each of the first three datasets above consists of 20 simulated repertoires, corresponding to four groups of equal size (n=5) that differ from each other in the studied variable, whereas DSHM contains 10 repertoires from two groups of equal size (n=5). Repertoires from DEXPR, DSNP and DSHM do not contain allelic diversity, so the ‘minor allele frequency’ column does not apply to them. Comma-separated percentages or numbers in the last four columns describe the features of simulated novel alleles in repertoires of different groups within a given dataset (see also Results).
Sensitivity and specificity of novel allele detection for 5 NADTs based on four simulated datasets.
| Type | Measurement | Tool | DEXPR ~5% | DEXPR ~1% | DEXPR ~0.1% | DEXPR ~0.01% | DALLELE 50% | DALLELE 30% | DALLELE 10% | DALLELE 5% | DSNP 1 | DSNP 3 | DSNP 5 | DSNP 7 | DSHM 0% | DSHM 6% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Sensitivity | TIgGER | 1.00 | 0.76 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.96 | 1.00 | 0.96 | 1.00 | 0.00 |
| | Sensitivity | IMPre | 0.28 | 0.92 | 0.44 | 0.00 | 0.52 | 0.56 | 0.40 | 0.52 | 0.40 | 0.44 | 0.16 | 0.28 | 0.20 | 0.52 |
| | Sensitivity | IgDiscover | 0.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.72 | 0.68 | 0.88 | 0.64 | 0.80 | 0.00 |
| | Sensitivity | LymAnalyzer | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| | Sensitivity | Partis | 1.00 | 1.00 | 0.32 | 0.00 | 0.48 | 0.20 | 0.20 | 0.20 | 1.00 | 0.28 | 0.04 | 0.00 | 1.00 | 0.92 |
| | Specificity | TIgGER | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.96 | 1.00 | 0.75 | 1.00 | 0.00 |
| | Specificity | IMPre | 0.17 | 0.44 | 0.24 | 0.00 | 0.63 | 0.70 | 0.33 | 0.28 | 0.25 | 0.30 | 0.09 | 0.21 | 0.13 | 0.56 |
| | Specificity | IgDiscover | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.97 | 1.00 | 0.00 |
| | Specificity | LymAnalyzer | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| | Specificity | Partis | 1.00 | 1.00 | 0.80 | 0.00 | 0.82 | 0.90 | 0.90 | 1.00 | 0.97 | 0.30 | 0.05 | 0.00 | 1.00 | 0.67 |
| | Sensitivity | TIgGER | 1.00 | 0.76 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.00 |
| | Sensitivity | IMPre | 0.80 | 0.92 | 0.44 | 0.00 | 0.60 | 0.56 | 0.40 | 0.52 | 0.76 | 0.73 | 0.74 | 0.67 | 0.76 | 0.56 |
| | Sensitivity | IgDiscover | 0.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.72 | 0.68 | 0.88 | 0.64 | 0.80 | 0.00 |
| | Sensitivity | LymAnalyzer | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | Sensitivity | Partis | 1.00 | 1.00 | 0.32 | 0.00 | 0.48 | 0.20 | 0.20 | 0.20 | 1.00 | 0.44 | 0.30 | 0.04 | 1.00 | 0.92 |
| | Specificity | TIgGER | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.70 | 1.00 | 0.00 |
| | Specificity | IMPre | 0.23 | 0.31 | 0.15 | 0.00 | 0.40 | 0.39 | 0.16 | 0.17 | 0.31 | 0.59 | 0.64 | 0.75 | 0.25 | 0.45 |
| | Specificity | IgDiscover | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.94 | 1.00 | 0.00 |
| | Specificity | LymAnalyzer | 0.09 | 0.09 | 0.09 | 0.00 | 0.11 | 0.10 | 0.09 | 0.08 | 0.09 | 0.23 | 0.34 | 0.31 | 0.09 | 0.00 |
| | Specificity | Partis | 1.00 | 1.00 | 0.80 | 0.00 | 0.78 | 0.85 | 0.85 | 1.00 | 0.97 | 0.78 | 0.68 | 0.10 | 1.00 | 0.39 |
Figure 3. Heatmap of two-tailed p-values from paired t-tests between different subgroups with regard to sensitivity and specificity for different tools in the four simulated datasets. Each row and column in the heatmap represents a subgroup. P-values below 0.05 are shown in red and p-values above 0.05 in blue.
NACs identified based on single naïve B cell sequencing dataset from 3 donors.
| Nearest known allele (a) | Known allele | # Supportive contigs (b) | Length (bp) | Start | End | SNP loci (c) | Individual |
|---|---|---|---|---|---|---|---|
| IGHV7-4-1*02 | IGHV7-4-1*02 | 44 (136, 0.32) | 296 | 1 | 296 | G92A | Donor1 |
| IGHV3-30*18 (T) | IGHV3-30*18 | 126 (492, 0.26) | 296 | 1 | 296 | C72G | Donor2 |
| IGHV3-7*03 (T) | IGHV3-7*03 | 96 (108, 0.89) | 296 | 1 | 296 | G46A | Donor3 |
| IGHV3-53*04 (T, G) | IGHV3-53*01 | 24 (126, 0.19) | 293 | 1 | 293 | T261C | |
a, NACs identified by TIgGER using bulk sequencing of IgM sequences are marked with “T”, and those identified by IgDiscover with “G”. b, The numbers in parentheses denote the number of contigs supporting the known germline variant in the second column and the ratio of the two germline variants. c, SNP positions are 1-based. IGHV7-4-1*02_G92A is not included in the collected germline sequences (see ).
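The SNP notation and the contig ratio in the table above can be made concrete with a small sketch. It assumes that a label such as G92A means reference base G at 1-based position 92 replaced by A, and that the ratio is the number of contigs supporting the NAC divided by the number supporting the known allele. The helper names are hypothetical, and the short sequence in the example is a toy string, not a germline sequence.

```python
import re


def parse_snp(label):
    """Split a label like 'G92A' into (reference base, 1-based position, new base)."""
    match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", label)
    if match is None:
        raise ValueError(f"unrecognized SNP label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def apply_snp(seq, label):
    """Return seq with the SNP applied, checking the reference base first."""
    ref, pos, alt = parse_snp(label)
    if not 1 <= pos <= len(seq) or seq[pos - 1] != ref:
        raise ValueError(f"{label} does not match the sequence")
    return seq[: pos - 1] + alt + seq[pos:]


def support_ratio(novel_contigs, known_contigs):
    """Ratio of contigs supporting the NAC to contigs supporting the known allele."""
    return round(novel_contigs / known_contigs, 2)


print(apply_snp("ACGT", "G3A"))   # toy sequence, position 3 is G
print(support_ratio(44, 136))     # first row of the table
```

With the counts in the table, this reproduces the reported ratios (for example, 44 and 136 give 0.32).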
Sensitivity and specificity of novel allele detection for 5 NADTs based on genuine Ig-seq dataset.
| Type | Measurement | Tool | GD-EXPR ~5% | GD-EXPR ~1% | GD-EXPR ~0.1% | GD-SNP 1 | GD-SNP 3 | GD-SNP 5 | GD-SNP 7 | GD-SHM IgM | GD-SHM IgG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Sensitivity | TIgGER | 0.80 | 0.60 | 0.20 | 0.80 | 0.76 | 0.80 | 0.60 | 0.80 | 0.00 |
| | Sensitivity | IMPre | 0.40 | 0.64 | 0.04 | 0.40 | 0.36 | 0.08 | 0.40 | 0.40 | 0.00 |
| | Sensitivity | IgDiscover | 1.00 | 1.00 | 0.40 | 1.00 | 1.00 | 1.00 | 0.84 | 1.00 | 0.80 |
| | Sensitivity | LymAnalyzer | – | – | – | – | – | – | – | – | – |
| | Sensitivity | Partis | 0.56 | 0.48 | 0.00 | 0.60 | 0.00 | 0.00 | 0.00 | 0.60 | 0.32 |
| | Specificity | TIgGER | 0.57 | 0.75 | 0.14 | 0.57 | 0.41 | 0.52 | 0.31 | 0.57 | 0.00 |
| | Specificity | IMPre | 0.27 | 0.33 | 0.03 | 0.28 | 0.21 | 0.04 | 0.23 | 0.27 | 0.00 |
| | Specificity | IgDiscover | 1.00 | 1.00 | 0.33 | 1.00 | 0.63 | 0.81 | 0.70 | 1.00 | 1.00 |
| | Specificity | LymAnalyzer | – | – | – | – | – | – | – | – | – |
| | Specificity | Partis | 0.17 | 0.12 | 0.00 | 0.18 | 0.00 | 0.00 | 0.00 | 0.17 | 0.43 |
| | Sensitivity | TIgGER | 0.80 | 0.60 | 0.40 | 0.80 | 0.79 | 0.80 | 0.63 | 0.80 | 0.00 |
| | Sensitivity | IMPre | 0.40 | 0.76 | 0.04 | 0.48 | 0.71 | 0.18 | 0.73 | 0.40 | 0.00 |
| | Sensitivity | IgDiscover | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 1.00 | 0.84 | 1.00 | 0.80 |
| | Sensitivity | LymAnalyzer | 1.00 | 1.00 | 0.72 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 |
| | Sensitivity | Partis | 0.56 | 0.48 | 0.00 | 0.60 | 0.17 | 0.11 | 0.07 | 0.60 | 0.32 |
| | Specificity | TIgGER | 0.57 | 0.75 | 0.18 | 0.57 | 0.49 | 0.67 | 0.53 | 0.57 | 0.00 |
| | Specificity | IMPre | 0.07 | 0.03 | 0.00 | 0.02 | 0.10 | 0.02 | 0.12 | 0.05 | 0.00 |
| | Specificity | IgDiscover | 1.00 | 1.00 | 0.43 | 1.00 | 0.60 | 0.86 | 0.79 | 1.00 | 1.00 |
| | Specificity | LymAnalyzer | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.02 | 0.00 | 0.00 |
| | Specificity | Partis | 0.08 | 0.08 | 0.00 | 0.09 | 0.08 | 0.11 | 0.07 | 0.08 | 0.34 |
Figure 4. Quantitative characterization of Ig-seq datasets and NACs identified by 4 NADTs. (A) Composition of IgM (SHM-sparse) and IgG (SHM-enriched) datasets. (B) Density of Ig-seq datasets with different numbers of reads. (C) Correlation between the number of NACs for each dataset and the number of reads. Note that only datasets for which a given tool reported NACs are included.
Quantitative summary of NACs identified from Ig-seq datasets by 4 NADTs.
| NADTs | # Datasets (IgG) | # Datasets (IgM) | # Unique novels (IgG) | # Unique novels (IgM) | # Datasets (total) | # Unique novels (total) |
|---|---|---|---|---|---|---|
| TIgGER | 1 (0.3) | 68 (77.3) | 6 (0.8) | 57 (4.8) | 69 (16.3) | 57 (4.8) |
| IMPre | 213 (63.4) | 88 (100.0) | 740 (96.1) | 1033 (86.5) | 301 (71.0) | 1033 (86.5) |
| IgDiscover | 15 (4.5) | 62 (70.5) | 16 (2.1) | 50 (4.2) | 77 (18.2) | 50 (4.2) |
| Partis | 4 (1.2) | 65 (73.9) | 12 (1.6) | 101 (8.5) | 69 (16.3) | 101 (8.5) |
Numbers in parentheses indicate the corresponding percentage (%) of each item. For columns giving the number of datasets, the percentages were calculated relative to the total number of datasets (or of a specific type, see ).
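The dataset percentages can be checked with a short calculation. This sketch assumes the denominators are 424 datasets in total (the count given in the Figure 5 caption) and 88 IgM datasets (implied by IMPre's 88 corresponding to 100.0%). Both denominators are inferences from the numbers shown here, not values stated with the table.

```python
def pct(count, total):
    """Percentage of total, rounded to one decimal place as in the table."""
    return round(100 * count / total, 1)


TOTAL_DATASETS = 424  # assumption: taken from the Figure 5 caption
IGM_DATASETS = 88     # assumption: IMPre's 88 IgM datasets are listed as 100.0%

print(pct(69, TOTAL_DATASETS))  # TIgGER, all datasets
print(pct(68, IGM_DATASETS))    # TIgGER, IgM datasets
```

Under these assumptions the results agree with the tabulated 16.3 and 77.3 for TIgGER.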
Figure 5. Twenty-four NACs identified from 424 Ig-seq datasets amplified using the RACE protocol. The top bar graph shows the number of supportive samples and donors. The bottom scatter plot shows the set of tools that identified each NAC. Numbers in the x-axis labels are 1-based positions of SNPs (refer also to ).
Figure 6. Characterization of 43 unique NACs identified from Ig-seq datasets. (A) Overlap between genes identified with NACs and 52 core genes defined in a previous study (Yang et al., 2021). (B) Correlation between the number of NACs and the number of known alleles (from IMGT/GENE-DB) for each gene. The Pearson correlation coefficient is 0.43. Note that only 48 of the 52 core genes were included in the germline reference sequences. (C) Comparison of the R/S (Replacement/Silent) ratio of SNPs in the framework regions (FRs), complementarity-determining regions (CDRs) and both kinds of regions (all). Numbers at the top of the doughnut chart denote R/S ratios.
Scheme 1. Schematic diagram of true positives and false positives in novel allele detection. The top sequence in bold represents the genuine novel sequence, while the bottom sequences represent the partial or full-length sequences discovered by NADTs. The nucleotides marked in green represent the genuine SNPs, while those in red are mismatches with the genuine novel sequence at either SNP loci or non-SNP loci. An identified sequence is accepted as a true positive only when it covers all the genuine SNPs and contains no mismatch with the genuine novel sequence at any other locus.
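The acceptance rule in Scheme 1 can be written as a short check. This is a sketch under simplifying assumptions: the reported sequence is taken to align to the genuine novel sequence without gaps, starting at a known 1-based position. The function name and arguments are hypothetical and do not come from the study's code.

```python
def is_true_positive(candidate, start, genuine, snp_positions):
    """Apply the Scheme 1 rule to one reported sequence.

    candidate      reported (partial or full-length) sequence
    start          1-based position in `genuine` where `candidate` begins
                   (assumes an ungapped alignment)
    genuine        the genuine novel allele sequence, SNPs included
    snp_positions  1-based positions of the genuine SNPs
    """
    end = start + len(candidate) - 1
    if start < 1 or end > len(genuine):
        return False
    # The candidate must cover every genuine SNP position.
    if any(pos < start or pos > end for pos in snp_positions):
        return False
    # No mismatch is allowed anywhere in the covered span, SNP loci included.
    return candidate == genuine[start - 1:end]


genuine = "ACGTACGTAC"  # toy sequence with SNPs at positions 3 and 8
print(is_true_positive("GTACGT", 3, genuine, [3, 8]))  # covers both SNPs, no mismatch
print(is_true_positive("GTACG", 3, genuine, [3, 8]))   # misses the SNP at position 8
```

A partial sequence can therefore still count as a true positive, provided it spans all SNP positions and matches the genuine sequence exactly over its length.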