Literature DB >> 35832623

Reconstruction of full antibody sequences in NGS datasets and accurate V_L:V_H coupling by cluster coordinate matching of non-overlapping reads.

Jorge Moura-Sampaio^1,2, André F Faustino¹, Remi Boeuf³, Miguel A Antunes¹, Stefan Ewert³, Ana P Batista¹.

Abstract

Next-generation sequencing (NGS) is an indispensable tool in antibody discovery projects. However, the limits on NGS read length make it difficult to reconstruct full antibody sequences from the sequencing runs, especially if the six CDRs are randomized. To overcome that, we took advantage of Illumina's cluster mapping capabilities to pair non-overlapping reads and reconstruct full Fab sequences with accurate VL:VH pairings. The method relies on in silico cluster coordinate information, and not on extensive in vitro manipulation, making the protocol easily deployable and less prone to PCR-derived errors. This work maintains the throughput necessary for antibody discovery campaigns, and a high degree of fidelity, which potentiates not only phage-display and synthetic library-based discovery methods, but also the NGS-driven analysis of naïve and immune libraries.

Entities: Chemical

Keywords: CDR; Diversity; Fab; Next-generation Sequencing; Phage-display; Randomization; Synthetic Libraries

Year: 2022 PMID： 35832623 PMCID： PMC9168528 DOI： 10.1016/j.csbj.2022.05.054

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 6.155

Introduction

Antibody discovery has been potentiated using in vitro display methods – such as yeast- and phage-display – which answer to the limitations of in vivo antibody discovery and provide great experimental control over antigen presentation. Whereas previously the immunoglobulin repertoires of selection outputs were analyzed by colony picking and Sanger sequencing, we can now employ Next-generation Sequencing (NGS) to analyze the outcome of in vitro selection outputs.[1], [2], [3], [4] The former allowed the identification of the most dominant clones, but provided a small snapshot of the selection output and did not guarantee the discovery of the best candidates. The latter allows a much deeper inspection of candidate pools after the final round of panning, by retrieving up to 107 sequences.[1], [5], [6] The bigger depth of NGS analysis allows for the selection of a more diverse candidate dataset by providing alternatives to the dominant “top clones”, which arise from careful inspection of patterns across conditions and controls. Additionally, NGS has also proven to be a useful tool for the quality control of antibody libraries, of both natural and synthetic origin, in terms of their CDR length distribution, germline frequencies and clone redundancy.[7], [8], [9] The number of CDRs diversified on a library will impact the NGS strategy employed. Although HCDR3 is widely recognized as the most important CDR for antigen binding, the interplay of all six CDRs is necessary for the antibody-antigen interaction.[10], [11], [12] Primary library strategies have increasingly relied on the randomization of other CDRs to discover antibodies against antigens. Ylanthia (randomized in LCDR3 and HCDR3) and HuCAL PLATINUM® (all six CDRs) constitute clear examples of state-of-the-art synthetic libraries that employ such strategies. However, while all NGS platforms can sequence the HCDR3 in its entirety (12–105 bp), most wońt be able to capture the full CDR diversity of some libraries. This is especially true if diversity is simultaneously located on the separate variable heavy (VH) and variable light (VL) chains and will be exacerbated if a Fab library is used, due to the additional CL domain in between. PacBio has the longest read lengths available and can effectively capture VL and VH diversity, but can only do so at the expense of throughput, which may turn it unsuitable for most library-based methods. Currently, the longest read length available with reasonable throughput is provided by Illumina, which allows the paired-end sequencing of 600 bp reads yielding around 107 sequences. Still, 600 bp does not allow to recover information on full VL and VH segments simultaneously on a single read, nor does it allow the overlapping of paired-end reads from those segments in Fab amplicons. The lack of clear alternatives to sequence VL and VH fragments means that researchers need to opt between randomizing less CDRs, or to perform analysis of VL and VH separately and match them afterwards based on frequency alone. On this work, we take advantage of Illumina’s cluster mapping capabilities to match non-overlapping reads from Fab amplicons and effectively capture diversity from far-apart CDRs that would not otherwise be sequenced simultaneously using most NGS applications. Here, paired-end sequencing of VL (forward read) and VH (reverse read) is performed after amplifying the whole region of interest. Then, information on cluster coordinates is used to match the forward and reverse reads in silico, providing a trustworthy VL:VH pairing methodology that is less prone to mismatches than clone frequency-based approaches.

Results

Cluster coordinates allow the interrogation of VL and VH simultaneously

Currently, the longest reads produced on an Illumina platform is accomplished by paired-end sequencing on the MiSeq Reagent Kit v3 (600-cycle), which produces a forward read (R1) and a reverse read (R2) whose lengths sum up to a total of 600 bp. 300 bp is enough to cover the distance between the first residue of CDR1 and the last residue of CDR3, for both VL and VH, even when considering an extra 12–15 bp in both 5′ and 3′ ends (used for primer annealing during the amplification step and for query purposes during the data analysis step). Thus, if R1 and R2 are correctly assigned to each other, a full Fab fragment can be reconstructed with the information on all six CDRs. Illumina systems make use of Bridge-PCR to replicate the desired amplicons into a cluster of identical amplicons, as a way to increase the signal. If conditions are optimized correctly, each cluster should be perfectly distinguished from each other as to provide a confident signal to the sequencing equipment. On that regard, we tested two different DNA loading concentrations, and saw that 7.2 pM generated an adequate cluster quality without compromising the total number of reads (Table S1). On each image taken, all clusters will have a specific ID that indicate their lane, tile and XY coordinates inside the composite image generated by the imaging software. As such, we have developed a simple methodology which confidently pairs and concatenates non-overlapping R1 and R2 reads that share the same cluster ID (Fig. 1).

Fig. 1

Schematic representation of paired-end reading and sequence coordinate matching to retrieve correlated V. Read Generation: Big Fab amplicons comprise both VL and VH, with CL in between. In the case of scFv sequences, the total length of the amplicon decreases to about 900 bp. The R1 and R2 reads add up to a maximum of 600 bp (using Illumina’s MiSeq system), with R1 shorter than R2 in this case due to the bigger HCDR2 and HCDR3 loops than LCDRs. Read Matching: During a sequencing run, clusters are identified on the MiSeq flow cell images and their coordinates are stored on the first row of each information block on the FASTQ raw data files, as follows: :::. The second line of the information block identifies the nucleotide sequence and the third line (after “+”) indicates the Q-score of each sequenced nucleotide. Final Read: The raw outputs from R1 and R2 runs are matched according to their < lane>::: coordinates and used to generate the R1 + R2 reads, which are composed by R1 and the reverse complement of R2. These are united by a string of 30 N nucleotides, to provide a clear separation between the last nucleotides of R1 and the first nucleotides of R2.

Coordinate matching reveals hidden VL:VH pairs and avoids mispairings that arise from clone frequency-based analysis.

We applied the coordinate matching method to analyze the outputs of 23 different affinity maturation projects, diversified in LCDR1, LCDR3, HCDR1 and HCDR2. Because these libraries did not have diversity in the HCDR3, we opted for the shorter MiSeq V2 500 bp PE kit instead of the longer MiSeq V3 600 bp PE kit, since the method itself is directly transferable in any Illumina’s method that retains cluster coordinates. We compared the results against the traditional frequency-based analysis, which tries to infer VL:VH pairs by matching sequences from the R1 and R2 reads (which retrieves information on VL and VH, respectively), in descending order, based on their occurrences (i.e. the most frequent clone of R1 is matched with the most frequent clone of R2 and so forth. Table S3). The poor similarity between both approaches highlights the relevance of correctly matching VL:VH pairs (Fig. 2, Tables S2 and S3). Only 5% of the Top 100 frequency-inferred clones corresponded to the concatenated sequences, with the remaining clones being artificial sequences that are not found originally in the sample. This effect improved only slightly when the Top50, Top25 and Top10 were compared (Fig. 2, Table S2). The mispairing effects can also be seen when looking exclusively at the top clone of each dataset. We searched the top hits of each dataset within the frequency-inferred dataset and found that in 13 of the projects, the Top1 candidate could have not been found if sequence coordinate matching had not been performed. In the remaining projects, there was a match between the top clones in 9 of the datasets (Top1) and, in another project, there was a match of the top clone of the concatenated dataset with the fifth clone of the inferred dataset (Table S4).

Fig. 2

Intersecting sequences between concatenated and frequency-inferred datasets. Each of the 23 affinity maturation projects was analyzed and their VL and VH reads matched by the sequence coordinate matching method and by the frequency-based method. Independently analyzed R1 and R2 were combined to generate VL:VH pairs based on the frequency of each clone (inferred dataset – grey circles). The concatenated and frequency-based datasets were ordered by the occurrences of each sequence and, compared to discover identical sequences on the Top10, 25, 50 and 100, for each affinity maturation project (see also Tables S2 and S3).

Discussion

Capturing information on VL and VH fragments is instrumental for reconstructing full antibody sequences after an antibody discovery campaign, in both in vivo and in vitro settings. While the widespread use of single-cell technologies has allowed VL:VH pairings to be retrieved from in vivo campaigns,[16], [17] the issue has only been partially tackled for the commonly-used phage-display technique. Here, researchers are required to chose between read length and sequencing depth. On one hand, a long-read NGS technology such as PacBio will allow the sequencing of both variable chains, but with a relatively low number of reads (up to ∼ 8 × 104). Despite being sufficient to sample high frequency (and likely, high-affinity) clones near the top of the dataset, low-depth sequencing will not allow the capturing of less-frequent rarer clones, which often show equivalent or sometimes higher affinities. This is also incompatible with the deep inspection of motifs and patterns within the dataset. On the other hand, the high-depth NGS platforms available do not have the read length necessary to cover both variable chains, which means the analysis of phage-displayed Fab/scFv libraries or selection outputs is usually restricted to VL or VH in those cases. In this regard, Barreto et al. achieved high depths and VL:VH pairing by successfully coupling Ion Torrent sequencing with Kunkel mutagenesis to physically concatenate CDRs in vitro and remove unwanted segments between CDRs. While effective, this approach may increase the accumulation of errors that stem from extensive PCR-based manipulations and increases the experimental load as higher amounts of DNA are required to fulfill the PCR, electrophoresis and purification steps. Both these drawbacks are exacerbated when working in antibody library and phage-display settings, where diversity, throughput and quality-control are adamant.[20], [21] It is also likely that this strategy works best in defined-single frameworks, and that more complex samples with multiple V-families may difficult the site-directed mutagenesis process even more. Inversely, our approach leverages existing data related to cluster mapping from Illumina’s sequencing files to match forward and reverse reads and concatenate VL and VH in silico, which does not require further processing of output DNA. Additionally, our approach is easily compatible with multiple frameworks via primer-barcoding. Special attention must be given nonetheless to the total amplicon length. Cluster quality will tend to decrease with increasing amplicon lengths, since longer amplicons will lead to clusters with larger diameters, with a higher probability to overlap. Our approach generates 1200 bp amplicons which are near the upper limit of MiSeq’s capabilities (≈1500 bp) and as such, an optimization of the concentration of loaded DNA was warranted. We found that a loading concentration of 7.2 pM was sufficient to yield good overall cluster quality without compromising the total number of reads and their fidelity. A total of 1.68 × 107 reads were obtained, with 81.7% having a Phred score (hereinafter referred to as Q-score) above 30 (Table S1). This method allowed us to extract information on diversified positions (in this case, 8 different amino acid positions across 4 CDRs), with a Q-score > 30 for each nucleotide inspected (24 nt), for at least 100.000 reads per library. In this paper we have also shown how frequency-based VL:VH matching fails to reconstruct trustworthy sequences, regardless of their relative rank within the dataset (Fig. 2, Table S3/S4), which highlights the relevance of our approach to analyze both high- and low-frequency clones. While very high frequency clones may have a chance to have their VL and VH correctly matched by frequency-based inference, it is very likely that such a method does not retrieve correct sequences throughout the entire dataset. This is especially relevant considering that the more sophisticated NGS studies require a deep inspection of datasets, such as when looking for motifs and clusters within the dataset, when looking for mutations of interest that were predicted by in silico tools2, or when looking for rare clones that were enriched throughout the selection rounds. However, it is also important to note that this analysis and its results are inextricably linked to the types of libraries/sequences assessed, which have 3 randomized positions in R1 and 5 randomized positions in R2. Strong enrichment of certain aminoacids in low diversity reads (e.g. R1) can reduce the ability for a given read to be informative about a clone’s identity – i.e. a very enriched sequence in R1 will be shared among many clones in R2 and correctly assigned to only one clone in R2, and incorrectly for the remainder of the clones. It is likely that sequences with more diversified positions will lead to a better outcome when matched based on frequency. In that line of thought, frequency-based matching is expected to decrease as selections become more stringent and certain sequences start dominating the datasets, which further advocates for the use of alternative VL:VH matching methods. Despite being ideal for synthetic Fab and scFv libraries, this work also provides a suitable VL:VH pairing methodology for other extensively diversified immunoglobulin repertoires, as long as the aforementioned amplicon length limitation is respected and the correct primer-barcoding strategy is employed. Importantly, the proposed approach tackles several challenges of previous VL:VH pairing approaches, on the most widely used NGS platform – Illumina’s MiSeq –, with high sequencing depth (>107 reads), high read fidelity, and without increasing the experimental burden.

Methods

Plasmid DNA was isolated directly from the phage-infected cells from the selection round of interest using the GeneJET Plasmid Miniprep Kit (Thermo Scientific™, K0502). Isolated dsDNA was quantified on the Qubit 3.0 fluorometer using the Qubit® dsDNA HS kit (Invitrogen™ Q32851). The generation of VL:VH amplicons for sequencing was performed through two PCRs. To amplify the region of interest and to insert the adapter regions for the NGS, the initial PCR utilized a forward primer specific to the vector leader sequence prior to LCDR1 and, since we did not need HCDR3 information in our affinity maturation setting, a reverse primer downstream of HCDR2 was employed. The second PCR inserted the TruSeq universal adapter and the indexes, which are used to distinguish between different samples (i.e. libraries). Samples were quantified in Qubit 3.0, pooled in equimolar proportions, and ran on an electrophoresis gel. Bands with the appropriate size were excised, purified using the Wizard SV Gel and PCR Clean Up System (Promega, A9281) and quantified on Qubit 3.0. The pool was diluted to a final concentration of 4 nM, spiked with 20% PhiX (Illumina; FC-110–3001), denatured for 5 min in 0.1 N of NaOH (5 μL of DNA + PhiX at 4 nM mixed with 5 μL 0.2 N of NaOH), diluted in HT buffer (provided on the NGS kit; kit details, ahead) to 7.2 pM and sequenced on the Illumina MiSeq platform using the 500 cycle v2 kit (Illumina; MS-102–2003). The forward read was 270 bp in length while the reverse read was 230 bp. R1 retrieves information on LCDR1, LCDR2 (non-diversified), and LCDR3. R2 retrieves information on HCDR1 and HCDR2. Note: we performed a sequencing between LCDR1 and HCDR2 with a 500 cycle v2 kit, but the procedure is directly transferrable (to our best knowledge) to a sequencing between LCDR1 and HCDR3 with a 600 cycle v3 kit (Illumina, MS-102–3003) by sequencing 270 + 330 bp (as explained on Fig. 1). An R2 of 330 bp starting at the end of the VH domain will allow to sequence the entirety of HFR4 and up to the last residues of HFR1, with the coverage on HFR1 depending on framework and HCDR3 length essentially. Longer HCDR3 lengths will require that R2 covers less residues of HFR1 or that HFR4 is less sequenced, which can be achieved by designing primers for that specific purpose. : The data analysis of the NGS FastQ output files was performed as described previously. For the panning output of each library, 105 sequences were analyzed using the fixed-by-design flanking sequences on the boundary of the diversified positions as template to locate and segment out mutations. We queried a total of 8 amino acid positions, specifically: 1aa in LCDR1, 2aa in LCDR3, 2aa in HCDR1 and 3aa in HCDR2. Sequences that contained at least one nucleotide with a Q-score < 30 on the positions of interest were discarded, allowing for 105 sequences for each library with a high degree of quality and confidence. Full CDR sequences were reconstructed by coupling the regions fixed-by-design with the information of diversified regions. To reconstruct a full Fab sequence with VL and VH information, we compared two methodologies (as explained on the second section of Results). The first methodology generates a dataset containing correlated VL:VH information after concatenation of R1 and R2 using their sequence’s coordinates (i.e. the new method proposed here), as described in the Results section. The second methodology tries to infer VL:VH pairs from sequence’s frequency counts and relative position within the dataset (i.e. the most frequent clone of R1 is matched with the most frequent clone of R2 and so forth). Sequences in both datasets that only had one occurrence were removed from the analysis. Three additional Methods’ sections can be found on the Supplementary Information, which are important for the generation of the libraries, but non-essential for the purpose of this manuscript.

Additional information

Data Availability:

The data used for this work is generated after the paired-end sequencing of amplicons, produced from the amplification of Fab sequences rescued from phage-display selections. In total, 23 fastq files (1 per library) were generated for the forward R1 reads – which sequenced VL regions –, plus 23 fastq files (1 per library) were generated for the reverse R2 reads – which sequenced VH regions. Additional 23 fastq files were originated from the merging of forward and reverse reads. The data from these files were analyzed in 23 different Excel spreadsheets to generate the data found on Fig. 2 and Table S2 and Table S3. All of this data are available upon request, including the VL:VH pair matching script.

Author contributions

JMS and AFF conceived and designed the experiments. JMS performed the experiments, analyzed the results, and drafted the manuscript. AFF critically discussed the data and helped to draft the manuscript. RB devised the in-house NGS analysis tool used for CDR determination. MAA programmed the sequence matching protocol. APB and SE coordinated the study, critically discussed the data, and helped to draft the manuscript. All authors read and approved the final manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

22 in total

1. A new clustering of antibody CDR loop conformations.

Authors: Benjamin North; Andreas Lehmann; Roland L Dunbrack
Journal: J Mol Biol Date: 2010-10-28 Impact factor: 5.469

2. Bias and artifacts in multitemplate polymerase chain reactions (PCR).

Authors: Takahiro Kanagawa
Journal: J Biosci Bioeng Date: 2003 Impact factor: 2.894

Review 3. Overview of Next-Generation Sequencing Technologies.

Authors: Barton E Slatko; Andrew F Gardner; Frederick M Ausubel
Journal: Curr Protoc Mol Biol Date: 2018-04

4. Synthetic antibodies designed on natural sequence landscapes.

Authors: Wenwu Zhai; Jacob Glanville; Markus Fuhrmann; Li Mei; Irene Ni; Purnima D Sundar; Thomas Van Blarcom; Yasmina Abdiche; Kevin Lindquist; Ralf Strohner; Dilduz Telman; Guido Cappuccilli; William J J Finlay; Jan Van den Brulle; David R Cox; Jaume Pons; Arvind Rajpal
Journal: J Mol Biol Date: 2011-07-23 Impact factor: 5.469

5. High-throughput retrieval of physical DNA for NGS-identifiable clones in phage display library.

Authors: Jinsung Noh; Okju Kim; Yushin Jung; Haejun Han; Jung-Eun Kim; Soohyun Kim; Sanghyub Lee; Jaeseong Park; Rae Hyuck Jung; Sang Il Kim; Jaejun Park; Jerome Han; Hyunho Lee; Duck Kyun Yoo; Amos C Lee; Euijin Kwon; Taehoon Ryu; Junho Chung; Sunghoon Kwon
Journal: MAbs Date: 2019-02-12 Impact factor: 5.857

6. A fully synthetic human Fab antibody library based on fixed VH/VL framework pairings with favorable biophysical properties.

Authors: Thomas Tiller; Ingrid Schuster; Dorothée Deppe; Katja Siegers; Ralf Strohner; Tanja Herrmann; Marion Berenguer; Dominique Poujol; Jennifer Stehle; Yvonne Stark; Martin Heßling; Daniela Daubert; Karin Felderer; Stefan Kaden; Johanna Kölln; Markus Enzelberger; Stefanie Urlinger
Journal: MAbs Date: 2013-04-09 Impact factor: 5.857

7. HuCAL PLATINUM, a synthetic Fab library optimized for sequence diversity and superior performance in mammalian expression systems.

Authors: Josef Prassler; Stefanie Thiel; Catrin Pracht; Andrea Polzer; Solveig Peters; Marion Bauer; Stephanie Nörenberg; Yvonne Stark; Johanna Kölln; Andreas Popp; Stefanie Urlinger; Markus Enzelberger
Journal: J Mol Biol Date: 2011-08-12 Impact factor: 5.469

8. Accuracy of Next Generation Sequencing Platforms.

Authors: Edward J Fox; Kate S Reid-Bayliss; Mary J Emond; Lawrence A Loeb
Journal: Next Gener Seq Appl Date: 2014

9. Next-generation sequencing enables the discovery of more diverse positive clones from a phage-displayed antibody library.

Authors: Wonjun Yang; Aerin Yoon; Sanghoon Lee; Soohyun Kim; Jungwon Han; Junho Chung
Journal: Exp Mol Med Date: 2017-03-24 Impact factor: 8.718

Review 10. Next-Generation Sequencing of Antibody Display Repertoires.

Authors: Romain Rouet; Katherine J L Jackson; David B Langley; Daniel Christ
Journal: Front Immunol Date: 2018-02-02 Impact factor: 7.561