Qilong Wang1,2, Huikun Zeng1,2, Yan Zhu1,2, Minhui Wang3, Yanfang Zhang3,4, Xiujia Yang3,4, Haipei Tang1,2, Hongliang Li5, Yuan Chen1,2, Cuiyu Ma4, Chunhong Lan1,2,3,4, Bin Liu5, Wei Yang6, Xueqing Yu2,7, Zhenhai Zhang1,2,3,4,8.
Abstract
Antibody repertoire sequencing (Rep-seq) has been widely used to reveal repertoire dynamics and to interrogate antibodies of interest at single-nucleotide resolution. However, polymerase chain reaction (PCR) amplification introduces extensive artifacts, including chimeras and nucleotide errors, leading to false discovery of antibodies and incorrect assessment of somatic hypermutations (SHMs), which subsequently mislead downstream investigations. Here, a novel approach named DUMPArts is developed. It improves the accuracy of antibody repertoires by labeling each sample with dual barcodes and each molecule with dual unique molecular identifiers (UMIs) via minimal PCR amplification, so that artifacts can be removed. Tested on ultra-deep Rep-seq data, DUMPArts removed inter-sample chimeras, which cause artifactual shared clones and constitute approximately 15% of reads in the library, as well as intra-sample chimeras, which carry erroneous SHMs and constitute approximately 20% of the reads, and corrected base errors and amplification biases by consensus building. The removal of these artifacts will provide an accurate assessment of antibody repertoires and benefit related studies, especially mAb discovery and antibody-guided vaccine design.
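The consensus-building step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_consensus`, `consensus_by_umi_pair`), the `(umi5, umi3, sequence)` record layout, and the assumption that reads sharing a UMI pair are already aligned to equal length are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_consensus(reads):
    """Column-wise majority vote over reads that share one UMI pair.

    Assumption of this sketch: reads are pre-aligned and of equal length.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("this sketch assumes pre-aligned, equal-length reads")
    # For each alignment column, keep the most frequent base.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def consensus_by_umi_pair(records):
    """Group (umi5, umi3, sequence) records by UMI pair and build one
    consensus sequence per pair (hypothetical record layout)."""
    groups = defaultdict(list)
    for umi5, umi3, seq in records:
        groups[(umi5, umi3)].append(seq)
    return {pair: build_consensus(seqs) for pair, seqs in groups.items()}
```

Because each UMI pair collapses to a single sequence, a molecule that was over-amplified contributes one consensus rather than many reads, which is one plausible way consensus building could reduce both base errors and amplification bias.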
Keywords: antibody repertoire; chimera; rep-seq; sequencing error; unique molecular identifier (UMI)
Year: 2021 PMID: 35003093 PMCID: PMC8727365 DOI: 10.3389/fimmu.2021.778298
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
Figure 1. Characterization of intra- and inter-project “shared clones”. (A) Proportion distribution of between-donor “shared clones” in 14 published Rep-seq projects, where 83 donors were grouped by project (marked by different colors on the top and left) and arranged in the same order on the X and Y axes. The color represents the proportion of clones shared between the corresponding donors on the X-axis and Y-axis relative to the total clones in the donor on the X-axis. The right panel shows that the proportion of intra-project shared clones (n = 690) is much greater than that of their inter-project counterparts (n = 3996). ***P < 0.001 (unpaired t-test, mean ± s.e.m.). (B) Linear fitting of the number of “shared clones” as a function of the product of the clone numbers of the corresponding sample pair. (C) Composition of the shared clones. (D) Proportion of “shared clones” consisting of singletons (clones with only one read). **P < 0.01 (unpaired t-test, mean ± s.e.m.). (E) SHM rates of the intra-project (n = 54447) and inter-project (n = 4858) shared clones. ***P < 0.001 (unpaired t-test). (F) Distribution of Rep-seq reads with correct and incorrect barcode pairs after library construction using six cycles of PCR amplification on 14 and 15 pooled samples labeled with dual barcodes at both ends (mean ± s.e.m.). (G) Proportion of singletons among inter-sample chimeras and sequences with correct barcode pairs in 14 and 15 pooled samples. **P < 0.01 (paired t-test).
Figure 2. A large proportion of intra-sample chimeras are generated by Rep-seq. (A) Schematic diagram of the experimental design for simulating library preparation of an antibody repertoire using 100 synthetic antibody sequences (Experimental Section). (B) Distribution of reads with correct and incorrect barcode pairs in Rep-seq reads amplified from 100 synthetic antibody sequences (n = 10). (C) Distribution of Rep-seq reads by number of mismatches relative to the best assigned input sequences, for reads with correct and incorrect barcode pairs (n = 10). (D) Frequency of chimera formation at each position of V genes. The average and total frequencies of chimera formation in different regions are shown at the top (red) and bottom (black), respectively. (E) Changes in clone rank caused by PCR amplification. (F) Relationship between the proportions of intra-sample chimeras (red) and the sequence identities (blue) of V gene pairs, analyzed using linear and nonlinear models. Forty-six functional V genes are shown in the same order from top to bottom on the Y-axis and from left to right on the X-axis.
Figure 3. DUMPArts labels each molecule with a unique UMI pair. (A) Schematic diagram of DUMPArts for labeling each sample with identical dual barcodes and each molecule with unique dual UMIs. (B) Bioinformatics pipeline of DUMPArts for removing chimeras and building consensus sequences. The inter-sample chimeras were removed using dual barcodes during sample splitting, and the intra-sample chimeras were removed using the number of reads per UMI pair (RPUP). (C) Number of variable regions, unique variable regions, CDR3s, and unique CDR3s in each UMI pair of Rep-seq data from naïve B cells of 4 donors (D1, D2, D3, and D4). Three replicates from each donor were amplified and analyzed after DUMPArts correction. (D) Representative schematic diagram of a multiple sequence alignment of the antibody reads in the same UMI pair. The colored dots represent the various types of mismatches. The numbers on the left indicate the abundance of each unique read, and the numbers at the bottom indicate the base position in the variable region. (E) Distribution of the ratio of Levenshtein distance to read length by rank of sequence abundance within UMI pairs in D2_2. Sequences with lower rank are more abundant in the group. The Levenshtein distance to the most abundant read was calculated for each unique read.
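The two chimera-removal steps named in panel (B) can be sketched as follows. This is an illustrative outline under stated assumptions, not the published pipeline: the `Read` record, the `remove_chimeras` function, the `barcode_to_sample` mapping, and in particular the `min_rpup` cutoff are hypothetical, since the caption does not give the RPUP threshold or the data layout actually used.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical read record: barcodes and UMIs parsed from both ends.
Read = namedtuple("Read", "bc5 bc3 umi5 umi3 seq")


def remove_chimeras(reads, barcode_to_sample, min_rpup=2):
    """Split reads by sample and drop likely chimeras.

    Step 1: a read whose two barcodes differ (or are unknown) is counted
            as an inter-sample chimera and discarded.
    Step 2: among reads with a correct barcode pair, UMI pairs supported
            by fewer than `min_rpup` reads are discarded (assumed cutoff).
    Returns (reads kept per sample, n inter-sample, n low-RPUP).
    """
    correct, n_inter = [], 0
    for r in reads:
        if r.bc5 == r.bc3 and r.bc5 in barcode_to_sample:
            correct.append(r)
        else:
            n_inter += 1

    # Reads per UMI pair (RPUP), counted within each sample barcode.
    rpup = Counter((r.bc5, r.umi5, r.umi3) for r in correct)

    kept, n_low = defaultdict(list), 0
    for r in correct:
        if rpup[(r.bc5, r.umi5, r.umi3)] >= min_rpup:
            kept[barcode_to_sample[r.bc5]].append(r)
        else:
            n_low += 1
    return dict(kept), n_inter, n_low
```

The rationale for the RPUP filter in this sketch is that a template switch during PCR creates a new combination of 5′ and 3′ UMIs, which would usually be supported by fewer reads than the original molecule; the appropriate cutoff depends on sequencing depth and is left as a parameter.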
Figure 4. Characterization of inter- and intra-sample chimeras via DUMPArts. (A) Proportion of inter-sample chimeras in two Rep-seq libraries (Experimental Section). (B) Proportion of “shared clones” among samples (n = 12) and inter-sample (n = 12) chimeras. The two separate libraries of four donors (D1, D2, D3, and D4) are shown with red and blue lines on the X- and Y-axes, respectively. The intra-library, inter-sample chimeras were pooled together. ***P < 0.001 (unpaired t-test). (C) Distribution of RPUP in both inter-sample chimeras and sequences with correct barcode pairs. (D) Proportion distribution of RPUP in sequences with correct barcode pairs. The dotted line represents the trend line of this distribution. (E) Proportion of intra-sample chimeras in sequences with correct barcode pairs (n = 12). (F) Number of mutations calculated using total reads, intra-sample chimeras, and non-chimeras. (G) Mutation rates in the CDR1, FR2, CDR2, and FR3 regions calculated using total reads, intra-sample chimeras, and non-chimeras. ***P < 0.001 (unpaired t-test, mean ± s.e.m.). (H) Proportion of V gene replacement in the total reads (n = 12) vs. non-chimeras (n = 12). ***P < 0.001 (paired t-test, mean ± s.e.m.). (I) Numbers of clones in total reads (n = 9) and non-chimeras (n = 9). ***P < 0.001 (paired t-test).
Figure 5. Accurate antibody repertoires are acquired with DUMPArts. (A) Jensen-Shannon divergence (JSD) of the top 10 clone compositions in multiple biological replicates, calculated using total reads (n = 9) and consensus sequences (n = 9) after DUMPArts correction. ***P < 0.001 (paired t-test). (B) Numbers of mutations calculated using total reads and consensus sequences. ***P < 0.001 (unpaired t-test, mean ± s.e.m.). (C) Mutation profiling (left) and average mutation rate (right) of IGHV4-39*01 of D3_1 using total reads, non-chimeras, and consensus sequences. The X-axis represents the position in the V gene segment, the Y-axis represents the mutation rate of each base, and the curves were smoothed with a Gaussian filter (left). ***P < 0.001 (unpaired t-test, mean ± s.e.m.). (D) Heatmap of the percentages of different mutation patterns from germline V genes (X-axis) to Rep-seq sequences (Y-axis) in D1_1. (E) Distribution of the number of unique sequences (lower panel) and their fold-change (upper panel) before and after DUMPArts correction.
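For readers unfamiliar with the metric in panel (A), the Jensen-Shannon divergence between two clone compositions can be computed as in the sketch below. The helper names (`top_clones`, `jsd`), the use of raw clone counts as input, and the base-2 logarithm are assumptions for illustration; the caption does not state how the top 10 clones were selected or normalized.

```python
import math


def top_clones(counts, n=10):
    """Keep the n most abundant clones from a {clone_id: count} dict."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence (log base 2) between two clone-count dicts.

    Counts are normalized to frequencies; clones missing from one
    replicate are treated as frequency 0. The result lies in [0, 1].
    """
    sp, sq = sum(p_counts.values()), sum(q_counts.values())
    div = 0.0
    for clone in set(p_counts) | set(q_counts):
        p = p_counts.get(clone, 0) / sp
        q = q_counts.get(clone, 0) / sq
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div
```

With base-2 logarithms, identical compositions give a divergence of 0 and compositions with no clones in common give 1, so lower values between biological replicates indicate more reproducible clone compositions.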