CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching.

Torbjørn Rognes, Lonneke Scheffer, Victor Greiff, Geir Kjetil Sandve.

Abstract

MOTIVATION: Adaptive immune receptor (AIR) repertoires (AIRRs) record past immune encounters with exquisite specificity. Therefore, identifying identical or similar AIR sequences across individuals is a key step in AIRR analysis for revealing convergent immune response patterns that may be exploited for diagnostics and therapy. Existing methods for quantifying AIRR overlap scale poorly with increasing dataset numbers and sizes. To address this limitation, we developed CompAIRR, which enables ultra-fast computation of AIRR overlap, based on either exact or approximate sequence matching.
RESULTS: CompAIRR improves computational speed 1000-fold relative to the state of the art and uses only one-third of the memory: on the same machine, the exact pairwise AIRR overlap of 10^4 AIRRs with 10^5 sequences each is found in ∼17 minutes, while the fastest alternative tool requires 10 days. CompAIRR has been integrated with the machine learning ecosystem immuneML to speed up commonly used AIRR-based machine learning applications.
AVAILABILITY: CompAIRR code and documentation are available at https://github.com/uio-bmi/compairr. Docker images are available at https://hub.docker.com/r/torognes/compairr. The code to replicate the synthetic datasets, scripts for benchmarking and creating figures, and all raw data underlying the figures are available at https://github.com/uio-bmi/compairr-benchmarking.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35852318      PMCID: PMC9438946          DOI: 10.1093/bioinformatics/btac505

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

Adaptive immune receptor (AIR) repertoires (AIRRs) record past immune encounters. High-throughput sequencing now enables millions of AIR sequences to be determined at a cost that facilitates adaptive immunity-based association studies on large patient cohorts (Emerson et al.; Liu et al.). It has previously been shown that shared immune states give rise to identical or similar AIR sequences across individuals, enabling the use of AIRR-seq for diagnostics and therapeutic research (Arnaout et al.; Greiff et al.). Computation of cross-individual AIRR intersections, i.e. the number of matching AIR sequences across AIRRs, is thus a foundational computational task performed in nearly all AIRR analyses. However, the number of pairwise AIRR comparisons grows quadratically with the number of AIRRs considered, and each pairwise comparison typically involves millions of individual AIRs, so computational efficiency is crucial for performing AIR sequence matching at scale. We here present CompAIRR, a tool that computes AIRR intersections up to 1000-fold faster than current implementations (Nazarov et al.; Shugay et al.; Weber et al.). In contrast to existing tools, CompAIRR supports both exact and approximate sequence matching between AIRs when determining AIRR overlap. The CompAIRR implementation is available both as a stand-alone command-line tool and as a component integrated with the machine learning ecosystem immuneML (Pavlović et al.) (from immuneML version 2.1.0 onward), where it accelerates the computation of AIRR similarity matrices and an AIRR-based immune state classifier (Emerson et al.) implemented in the immuneML system (Supplementary Fig. S1).

2 CompAIRR description

CompAIRR is based on a sequence comparison strategy developed for the nucleotide sequence clustering tool Swarm (Mahé et al.). A Bloom filter (Putze et al.) and a hash table are used to quickly look up similar AIR sequences across AIRR sets. For each AIR sequence (nucleotide or amino acid), a 64-bit hash value is generated using a Zobrist hash function (Zobrist, 1970), a form of tabulation hashing that can be computed very efficiently and updated incrementally. When approximate matching is enabled, the hashes of all possible variants of a query sequence (with 1–2 substitutions or indels) are also generated. This search strategy identifies all matching sequences without compromising on accuracy. CompAIRR version 1.7.0 or later also supports a larger number of substitutions by using a simpler all-versus-all algorithm. Matches are optionally restricted by V and J gene. Multi-threading may be enabled to further speed up comparisons (see Fig. 1d). For the comparison of n AIRRs, CompAIRR produces an n × n matrix in which each cell contains either the sum over matching AIRs of a configurable summary statistic (product, min, max, mean or ratio of the two compared AIR frequencies), or the Morisita-Horn or Jaccard index between the two AIRRs. Alternatively, CompAIRR can query n AIRRs against m reference AIRs and produce an n × m sequence presence table. While AIR matching is only supported at the single-chain level, two n × m sequence presence tables for complementary (paired) AIR chains (single-cell data) can easily be merged. For the analysis of a single AIRR, CompAIRR can perform single-linkage clustering of AIRs. CompAIRR can optionally output the list of (approximately) matching AIRs as an AIRR-compliant TSV file, and adheres to the AIRR standard for software tools (Vander Heiden et al.).
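The hashing strategy described above can be illustrated with a minimal Python sketch. It is not CompAIRR's implementation: the key table, the names `KEYS`, `zobrist_hash`, `build_index` and `overlap`, and the toy repertoires are all hypothetical, a plain dictionary stands in for the Bloom filter and hash table, and only single substitutions are covered (no indels, no V/J gene restriction). It shows how a Zobrist hash can be updated incrementally to enumerate variant hashes without rehashing each variant, and how the product statistic is summed over matches.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_LEN = 64  # assumed upper bound on sequence length (illustrative only)

# One random 64-bit key per (position, residue) pair; fixed seed for repeatability.
_rng = random.Random(0)
KEYS = [{aa: _rng.getrandbits(64) for aa in AMINO_ACIDS} for _ in range(MAX_LEN)]


def zobrist_hash(seq):
    """XOR together the keys of every (position, residue) in the sequence."""
    h = 0
    for i, aa in enumerate(seq):
        h ^= KEYS[i][aa]
    return h


def build_index(repertoire):
    """Map hash value -> list of (sequence, count) for one repertoire."""
    index = defaultdict(list)
    for seq, count in repertoire.items():
        index[zobrist_hash(seq)].append((seq, count))
    return index


def candidate_hashes(seq, allow_substitution):
    """Yield (hash, position, residue) for seq and, optionally, its 1-substitution variants."""
    h = zobrist_hash(seq)
    yield h, None, None
    if allow_substitution:
        for i, old in enumerate(seq):
            for new in AMINO_ACIDS:
                if new != old:
                    # Incremental update: XOR the old key out and the new key in.
                    yield h ^ KEYS[i][old] ^ KEYS[i][new], i, new


def overlap(query, index, allow_substitution=False):
    """Sum count products over all matching (query, reference) sequence pairs."""
    total = 0
    for seq, count in query.items():
        for h, i, new in candidate_hashes(seq, allow_substitution):
            for ref_seq, ref_count in index.get(h, ()):
                # Build the variant only on a hash hit, then compare the
                # actual strings to guard against hash collisions.
                variant = seq if i is None else seq[:i] + new + seq[i + 1:]
                if ref_seq == variant:
                    total += count * ref_count
    return total


# Toy repertoires: sequence -> count (hypothetical data).
rep_a = {"CASSL": 2, "CARDY": 1}
rep_b = {"CASSL": 3, "CASSV": 1, "CTRDY": 4}
index_b = build_index(rep_b)

exact = overlap(rep_a, index_b)                            # CASSL/CASSL only
approx = overlap(rep_a, index_b, allow_substitution=True)  # adds CASSL/CASSV and CARDY/CTRDY
print(exact, approx)  # 6 12
```

Because each variant hash costs two XOR operations, the number of lookups per query sequence grows with sequence length times alphabet size rather than with the size of the reference repertoire, which is the property that makes this kind of lookup attractive for large AIRR sets.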
Fig. 1.

Overview of CompAIRR features and performance. (a) CompAIRR has configurable AIR matching criteria and output formats. (b) CompAIRR calculates pairwise AIRR overlap up to 1000-fold faster than currently available tools. (c) The maximum RAM usage of CompAIRR is below one-third of the most memory-efficient alternative. (d) The CompAIRR running time increases when allowing more AIR sequence mismatches, but multithreading helps reduce this running time. (b–d) Data shown are mean with error bars showing min/max values across three replicate runs. For the largest dataset, only CompAIRR was run three times, and VDJtools failed to run due to memory limitations. Unless otherwise specified, datasets consist of 1000 AIRRs containing 10^5 OLGA-generated sequences (Sethna et al.) (default human IgH CDR3 model).


3 CompAIRR performance benchmarking

CompAIRR (1.3.1) was benchmarked against VDJtools (1.2.1) (Shugay et al.), immunarch (0.6.5) (Nazarov et al.) and immuneREF (0.5.0) (Weber et al.) by calculating the pairwise AIRR overlap of datasets ranging from 10 to 10^4 AIRRs. Each AIRR consisted of 10^5 amino acid AIR sequences generated using OLGA (1.2.2) (Sethna et al.) with the default human IgH CDR3 model. Figure 1b and c show the running time and maximum RAM usage of each tool, respectively. CompAIRR is consistently faster, particularly for large datasets: with 10^4 AIRRs of 10^5 sequences each, CompAIRR ran in 17 min while immunarch took 10 days, immuneREF took 23 days and VDJtools failed to complete due to memory constraints. The computational complexity appears to have been reduced from approximately quadratic to almost linear. Furthermore, the maximum RAM usage of CompAIRR is below one-third of that of competing tools. The running time and memory usage as a function of the AIRR size (10^4–10^6 sequences) are shown in Supplementary Figure S2. In addition, Figure 1d shows how the CompAIRR running time is affected by approximate sequence matching, which the existing tools do not support at all. The benefit of multi-threading becomes more apparent when the degree of sequence mismatching is increased, since with exact matching the running time is dominated by disk access (Supplementary Fig. S3).
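As a rough check on the scale of the largest benchmark, the short Python sketch below works out the number of distinct AIRR pairs and the speed-up ratios implied by the running times quoted above. The variable names are illustrative, and since the quoted times are rounded (17 min, 10 days, 23 days), the ratios should be read as order-of-magnitude estimates rather than exact measurements.

```python
# Largest benchmark dataset described in the text.
n_airrs = 10**4

# Distinct unordered pairs of AIRRs: n * (n - 1) / 2.
unordered_pairs = n_airrs * (n_airrs - 1) // 2

# Running times as quoted (rounded), converted to minutes.
MIN_PER_DAY = 24 * 60
compairr_min = 17
immunarch_min = 10 * MIN_PER_DAY
immuneref_min = 23 * MIN_PER_DAY

speedup_immunarch = immunarch_min / compairr_min
speedup_immuneref = immuneref_min / compairr_min

print(unordered_pairs)                                      # 49995000
print(round(speedup_immunarch), round(speedup_immuneref))   # 847 1948
```

Roughly 5 × 10^7 pairwise comparisons is what makes a per-pair algorithm expensive at this scale, and both ratios fall in the range of the reported speed-up of up to 1000-fold.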

4 Conclusion

The identification of shared AIRs across AIRRs from different individuals is a core computational task in AIRR analysis. We have here presented CompAIRR, which calculates AIRR overlap up to 1000-fold faster than currently available tools while keeping its peak memory usage below one-third of theirs. We validated that CompAIRR easily scales to datasets of 10^4 AIRRs of 10^5 sequences each, which surpass the largest available experimental datasets (Liu et al.; Nolan et al.). Furthermore, a novel feature of CompAIRR is efficient identification of approximately matching AIR sequences across AIRRs or to reference databases, which may be a biologically meaningful way to increase the number of matches between AIRRs when the exact overlap is low (Supplementary Fig. S4). Complementary to the sequence-level clustering tools ClusTCR (Valkiers et al.) and GIANA (Zhang et al.), or comparison of AIRR subsets (Yohannes et al.), CompAIRR can be used for ultrafast similarity-based comparison of complete AIRRs. Due to flexible specification of summary statistics and output, CompAIRR is easily integrated with any tool capable of reading in either (i) a pairwise distance matrix containing cross-AIRR matches, (ii) a matrix showing individual AIR presence in one or more AIRRs or (iii) an AIRR-compliant TSV file containing (approximately) matching AIRs between AIRRs. This allows accelerating a variety of analyses where AIRR comparison is a core computational component, including AIRR similarity (Weber et al.) and clustering (Rempała and Seweryn, 2013; Shugay et al.), phylogenetic clustering (Hoehn et al.), graph analysis (Madi et al.; Miho et al.; Pogorelyy et al.) and immune state classification (Emerson et al.).
  16 in total

1.  Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire.

Authors:  Ryan O Emerson; William S DeWitt; Marissa Vignali; Jenna Gravley; Joyce K Hu; Edward J Osborne; Cindy Desmarais; Mark Klinger; Christopher S Carlson; John A Hansen; Mark Rieder; Harlan S Robins
Journal:  Nat Genet       Date:  2017-04-03       Impact factor: 38.330

2.  T cell receptor β repertoires as novel diagnostic markers for systemic lupus erythematosus and rheumatoid arthritis.

Authors:  Xiao Liu; Wei Zhang; Ming Zhao; Longfei Fu; Limin Liu; Jinghua Wu; Shuangyan Luo; Longlong Wang; Zijun Wang; Liya Lin; Yan Liu; Shiyu Wang; Yang Yang; Lihua Luo; Juqing Jiang; Xie Wang; Yixin Tan; Tao Li; Bochen Zhu; Yi Zhao; Xiaofei Gao; Ziyun Wan; Cancan Huang; Mingyan Fang; Qianwen Li; Huanhuan Peng; Xiangping Liao; Jinwei Chen; Fen Li; Guanghui Ling; Hongjun Zhao; Hui Luo; Zhongyuan Xiang; Jieyue Liao; Yu Liu; Heng Yin; Hai Long; Haijing Wu; Huanming Yang; Jian Wang; Qianjin Lu
Journal:  Ann Rheum Dis       Date:  2019-05-17       Impact factor: 19.103

3.  ClusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity.

Authors:  Sebastiaan Valkiers; Max Van Houcke; Kris Laukens; Pieter Meysman
Journal:  Bioinformatics       Date:  2021-06-16       Impact factor: 6.937

4.  Methods for diversity and overlap analysis in T-cell receptor populations.

Authors:  Grzegorz A Rempala; Michal Seweryn
Journal:  J Math Biol       Date:  2012-09-25       Impact factor: 2.259

5.  T cell receptor repertoires of mice and humans are clustered in similarity networks around conserved public CDR3 sequences.

Authors:  Asaf Madi; Asaf Poran; Eric Shifrut; Shlomit Reich-Zeliger; Erez Greenstein; Irena Zaretsky; Tomer Arnon; Francois Van Laethem; Alfred Singer; Jinghua Lu; Peter D Sun; Irun R Cohen; Nir Friedman
Journal:  Elife       Date:  2017-07-21       Impact factor: 8.140

6.  AIRR Community Standardized Representations for Annotated Immune Repertoires.

Authors:  Jason Anthony Vander Heiden; Susanna Marquez; Nishanth Marthandan; Syed Ahmad Chan Bukhari; Christian E Busse; Brian Corrie; Uri Hershberg; Steven H Kleinstein; Frederick A Matsen Iv; Duncan K Ralph; Aaron M Rosenfeld; Chaim A Schramm; Scott Christley; Uri Laserson
Journal:  Front Immunol       Date:  2018-09-28       Impact factor: 7.561

7.  Detecting T cell receptors involved in immune responses from single repertoire snapshots.

Authors:  Mikhail V Pogorelyy; Anastasia A Minervina; Mikhail Shugay; Dmitriy M Chudakov; Yuri B Lebedev; Thierry Mora; Aleksandra M Walczak
Journal:  PLoS Biol       Date:  2019-06-13       Impact factor: 8.029

8.  OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs.

Authors:  Zachary Sethna; Yuval Elhanati; Curtis G Callan; Aleksandra M Walczak; Thierry Mora
Journal:  Bioinformatics       Date:  2019-09-01       Impact factor: 6.937

9.  Phylogenetic analysis of migration, differentiation, and class switching in B cells.

Authors:  Kenneth B Hoehn; Oliver G Pybus; Steven H Kleinstein
Journal:  PLoS Comput Biol       Date:  2022-04-25       Impact factor: 4.779

10.  Swarm v3: towards tera-scale amplicon clustering.

Authors:  Frédéric Mahé; Lucas Czech; Alexandros Stamatakis; Christopher Quince; Colomban de Vargas; Micah Dunthorn; Torbjørn Rognes
Journal:  Bioinformatics       Date:  2021-07-09       Impact factor: 6.937

  2 in total

1.  Reference-based comparison of adaptive immune receptor repertoires.

Authors:  Cédric R Weber; Teresa Rubio; Longlong Wang; Wei Zhang; Philippe A Robert; Rahmad Akbar; Igor Snapkov; Jinghua Wu; Marieke L Kuijjer; Sonia Tarazona; Ana Conesa; Geir K Sandve; Xiao Liu; Sai T Reddy; Victor Greiff
Journal:  Cell Rep Methods       Date:  2022-08-22

2.  AIRRscape: An interactive tool for exploring B-cell receptor repertoires and antibody responses.

Authors:  Eric Waltari; Saba Nafees; Krista M McCutcheon; Joan Wong; John E Pak
Journal:  PLoS Comput Biol       Date:  2022-09-20       Impact factor: 4.779
