Decombinator V4: an improved AIRR compliant-software package for T-cell receptor sequence annotation?

Thomas Peacock^1,2, James M Heather^3, Tahel Ronel^1,4, Benny Chain^1,2.

Abstract

MOTIVATION: Analysis of the T-cell receptor repertoire is rapidly entering the general toolbox used by researchers interested in cellular immunity. The annotation of T-cell receptors (TCRs) from raw sequence data poses specific challenges, which arise because TCRs are not germline encoded and because the process that generates them is stochastic.
RESULTS: In this study, we report the release of Decombinator V4, a tool for the accurate and fast annotation of large sets of TCR sequences. Decombinator was one of the early Python software packages released to analyse the rapidly increasing flow of T-cell receptor repertoire sequence data. The Decombinator package now provides Python 3 compatibility, incorporates improved sequencing error and PCR bias correction algorithms, and provides output which conforms to the international standards proposed by the Adaptive Immune Receptor Repertoire Community.
AVAILABILITY AND IMPLEMENTATION: The entire Decombinator suite is freely available at: https://github.com/innate2adaptive/Decombinator.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 32853330 PMCID: PMC8098023 DOI: 10.1093/bioinformatics/btaa758

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The analysis of T-cell and B-cell antigen-receptor sequences is challenging because the final somatic DNA sequences from which the receptors are transcribed and then translated are produced by a series of cell-intrinsic recombination events which irreversibly change the somatic genome. Decombinator was one of the early Python software packages released to process the rapidly increasing flow of new T-cell receptor repertoire sequence data, and to infer the precise set of recombination events which give rise to each sequence read (Thomas et al., 2013). The strategy underlying Decombinator is to use a finite state machine to classify individual TCR sequences using a set of molecular tags uniquely matching individual V and J regions of the T-cell locus. It is based on the Aho-Corasick algorithm (Aho and Corasick, 1975), which remains the fastest way to execute exact multiple keyword searches. The algorithm uses a pre-constructed keyword trie to match a set of queries to targets. The speed scales linearly with the number of keywords and the length of the target string, thus substantially outperforming commonly used local alignment algorithms, especially when the number of keywords becomes large.
Decombinator has been freely available on GitHub since its original publication, and has been frequently extended and modified. The name Decombinator originally referred specifically to the Python module which inferred the underlying recombination events giving rise to each read. However, the name now refers to a whole package of modules, of which the original Decombinator is one. The package now incorporates additional modules providing multiple sample demultiplexing based on sample-specific dual barcoding (Demultiplexor), unique molecular identifier (UMI)-based functions to correct for sequencing error and PCR bias (Collapsinator), and translation modules to provide CDR and full-length TCR amino acid sequences (CDR3Translator).
Although the package has been used in several published studies (e.g. Gkazi et al., 2018; Heather et al., 2016; Joshi et al., 2019; Oakes et al., 2017; Wong et al., 2018), none of the changes to the original Decombinator have been formally published. In this short report, we document some major changes to the Decombinator package, which we have released on GitHub as Decombinator V4 (https://github.com/innate2adaptive/Decombinator). Decombinator V4 has been ported to Python 3, since support for Python 2 ended on 1 January 2020. We have redesigned the error correction algorithms in Collapsinator, making the UMI clustering-based correction of PCR and sequencing errors significantly more robust. This is described in more detail below. Importantly, we have completely reconfigured the output of the CDR3Translator module, so the final output is now fully compliant with the International Adaptive Immune Receptor Repertoire (AIRR) Community recommendations (https://docs.airr-community.org/en/latest/). This greatly improves the utility of Decombinator, as it ensures the output is compatible with the growing number of secondary repertoire analysis tools which are being released.
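
The tag-matching strategy described above can be illustrated with a minimal, pure-Python sketch of Aho-Corasick keyword search. This is not the Decombinator implementation: the function names (`build_automaton`, `find_tags`), the tag sequences and the tag labels are all hypothetical and chosen only to show how a keyword trie with failure links reports every tag in a read in a single pass.

```python
from collections import deque

# Hypothetical tags for illustration only; these are NOT real Decombinator tags.
TAGS = {"GATTACA": "V_tag_A", "TTACAG": "V_tag_B", "CCGGTT": "J_tag_A"}
READ = "AAGATTACAGTCCCGGTTAA"


def build_automaton(tags):
    """Build a keyword trie with failure links from {tag_sequence: label}."""
    goto = [{}]   # goto[state][base] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # (label, tag_length) pairs recognised at each state
    for seq, label in tags.items():
        state = 0
        for base in seq:
            if base not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][base] = len(goto) - 1
            state = goto[state][base]
        out[state].append((label, len(seq)))
    # Breadth-first pass: each failure link points to the longest proper
    # suffix of the current path that is also a prefix of some tag.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for base, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and base not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(base, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_tags(read, automaton):
    """Return (start_position, label) for every tag occurrence in the read."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, base in enumerate(read):
        while state and base not in goto[state]:
            state = fail[state]
        state = goto[state].get(base, 0)
        for label, length in out[state]:
            hits.append((pos - length + 1, label))
    return hits


automaton = build_automaton(TAGS)
print(find_tags(READ, automaton))
```

The read is scanned once regardless of how many tags are loaded, and overlapping tags (here the two hypothetical V tags) are both reported because the failure links carry partial matches forward.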

2 Results and discussion

Full details describing installation and use of the four Decombinator modules are available on GitHub (https://github.com/innate2adaptive/Decombinator), together with some test datasets. Here, we focus instead on evaluating the most significant development since our original publication, the UMI-based PCR bias and sequencing error correction function of the Collapsinator module, and we briefly outline the output fields produced by CDR3Translator.
We first evaluated Decombinator on a simulated ‘ground truth’ set of TCRs generated using IGoR (Marcou et al., 2018) as described in Supplementary Methods. Decombinator V4 correctly annotated 92.7±0.5% (mean±standard deviation) of the sequences. CDR3 sequences were correctly identified in 98.1±0.5% of the sequences. The majority of misannotations involved very similar sub-families of V region (e.g. TRBV6-1, TRBV6-2, TRBV6-3; or TRBV12-1, TRBV12-2), between which the Decombinator tags do not distinguish.
A pseudocode description of the Collapsinator module is shown in Figure 1, and is discussed in more detail in Supplementary Methods. We evaluated the error correction performance of Collapsinator by simulating PCR amplification and Illumina sequencing of the ‘ground truth’ TCR repertoires, and then measuring how well Collapsinator recovered the sequences, and their abundances, that were present before PCR (see Supplementary Methods). From 10,000 sequences present before PCR/sequencing, all but 2±1 (mean±standard deviation) were recovered by Collapsinator. A small number of sequences (109±15) were introduced by PCR or sequencing error and were erroneously retained by Collapsinator even after error correction. The correlation coefficient between the abundances of the original set of sequences and the output of Collapsinator was 0.97±0.7. By comparison, Collapsinator V3 (the previous version) introduced 5855±335 new ‘error’ sequences, and the correlation coefficient was 0.86±0.006.
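
The barcode-based grouping and merging that the Figure 1 pseudocode outlines can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Collapsinator code: reads are represented as plain (barcode, sequence) pairs rather than Decombinator classifiers, groups are merged greedily into the largest compatible cluster instead of by exhaustive pairwise comparison, and the default thresholds `th_bc=2` and `th_tcr=10.0` are hypothetical values, not the package defaults.

```python
from collections import Counter, defaultdict


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def percent_distance(a, b):
    """Edit distance as a percentage of the longer sequence length."""
    return 100.0 * levenshtein(a, b) / max(len(a), len(b), 1)


def collapse(reads, th_bc=2, th_tcr=10.0):
    """Cluster (barcode, tcr) reads; return {tcr: (tcr_count, mean_cluster_size)}."""
    groups = defaultdict(Counter)
    for barcode, tcr in reads:          # initial grouping by exact barcode
        groups[barcode][tcr] += 1

    clusters = []                       # each entry: [barcode, Counter of TCRs]
    largest_first = sorted(groups.items(), key=lambda kv: -sum(kv[1].values()))
    for barcode, counts in largest_first:
        top = counts.most_common(1)[0][0]
        for cluster in clusters:
            rep_bc, rep_counts = cluster
            rep_top = rep_counts.most_common(1)[0][0]
            if (levenshtein(barcode, rep_bc) <= th_bc
                    and percent_distance(top, rep_top) <= th_tcr):
                rep_counts.update(counts)   # merge group into this cluster
                break
        else:
            clusters.append([barcode, counts])

    sizes = defaultdict(list)
    for _, counts in clusters:
        consensus = counts.most_common(1)[0][0]
        sizes[consensus].append(sum(counts.values()))
    return {tcr: (len(s), sum(s) / len(s)) for tcr, s in sizes.items()}


# Toy data: sequences and barcodes are invented for illustration.
TCR1 = "TGTGCCAGCAGTTTAGGG"
TCR1_ERR = "TGTGCCAGCAGTTTAGGC"   # one sequencing error relative to TCR1
TCR2 = "TGTGCCTGGAGTCCCGAA"
reads = ([("AAAAAA", TCR1)] * 3 + [("AAAAAT", TCR1), ("AAAAAA", TCR1_ERR)]
         + [("CCCCCC", TCR1)] * 2 + [("GGGGGG", TCR2)] * 2 + [("AAAAAC", TCR2)])
result = collapse(reads)
print(result)
```

In this toy input the erroneous barcode AAAAAT and the erroneous read TCR1_ERR are absorbed into the AAAAAA cluster, whereas AAAAAC stays separate because its sequence differs from TCR1 by more than the sequence threshold.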

Fig. 1.

Pseudocode outlining the main functionality of the Collapsinator script. TCR data in the Decombinator format is read into the program and initially grouped by barcode. Each of these groups undergoes pairwise comparison, whereby the barcode (bc_i) and the most frequent TCR sequence (TCR_i) of group i are compared to the barcode (bc_j) and the most frequent TCR sequence (TCR_j) of group j. If barcodes bc_i and bc_j are similar relative to the barcode threshold (th_bc), and sequences TCR_i and TCR_j are similar relative to the sequence threshold (th_tcr), then groups i and j are merged. The merged groups are here referred to as clusters. Similarity measures are taken as the Levenshtein distance for barcodes, and a percentage-based Levenshtein distance for TCR sequences (Levenshtein distance weighted by length of sequence). The two thresholds are user-configurable. Once every group has been clustered, the TCR identifying classifier (V gene, J gene, no. of V deletions, no. of J deletions, insert sequence) of each TCR in the biological sample is output to file, accompanied by the number of times that TCR was found in the sample (TCR count) and the mean cluster size (BC count) associated with that TCR.

The CDR3Translator module has also been substantially rewritten. The output is a tab-separated file, in the AIRR Community format (Vander Heiden et al., 2018), in which each row represents a unique DNA sequence defined by Decombinator. Since the same amino acid sequence can be encoded by different DNA sequences, the amino acid sequences encoded by each row are not necessarily unique. Mandatory AIRR columns include the V and J genes, using IMGT nomenclature (but note that the current Decombinator version does not distinguish alleles), the CDR3 sequence (more precisely the ‘junction’, which includes the conserved bracketing C and F residues; Lefranc, 2014), and the number of times the TCR was recorded in the initial dataset.
Additional columns include CDR1 and CDR2 sequences, as defined by IMGT, and the complete DNA and protein sequence excluding the leader sequences. A True/False flag marks non-productive sequences, which are included in the output file by default. The format permits additional non-required fields, which we use to output information such as the traditional five-part Decombinator classifier, facilitating comparisons with the output of previous versions. Finally, each TCR is associated with a mean cluster size (BC count, see Fig. 1), which can be used to estimate the robustness of the data for that particular sequence.
In conclusion, we report the release of Decombinator V4 for the rapid and accurate annotation of TCR sequence data. Although designed to work optimally on data obtained by the experimental library preparation protocol we developed (Oakes et al., 2017; Uddin et al., 2019), Decombinator is broadly applicable to a variety of TCR sequencing protocols. Furthermore, compliance with AIRR Community output standards will ensure that the data produced by Decombinator can be readily used by the growing number of TCR analysis tools now available to the immunological community.
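
As a sketch of how a downstream tool might consume the AIRR-format output described above, the following Python reads a tab-separated table, keeps the productive rows, and sums abundances over rows that share the same amino acid junction. The column names (`v_call`, `j_call`, `junction`, `junction_aa`, `productive`, `duplicate_count`) follow the AIRR Rearrangement schema; the example rows, the function name and the accepted spellings of the boolean flag are assumptions made for illustration and are not taken from actual Decombinator output.

```python
import csv
import io
from collections import Counter

# Invented example rows; two distinct DNA junctions encode the same CASSLF.
HEADER = ["sequence_id", "v_call", "j_call", "junction", "junction_aa",
          "productive", "duplicate_count"]
ROWS = [
    ["tcr1", "TRBV20-1", "TRBJ2-7", "TGTGCCAGCAGTTTATTC", "CASSLF", "T", "5"],
    ["tcr2", "TRBV20-1", "TRBJ2-7", "TGTGCCAGCAGCCTGTTT", "CASSLF", "T", "3"],
    ["tcr3", "TRBV20-1", "TRBJ2-7", "TGTGCCAGCTGATTATTC", "CAS*LF", "F", "2"],
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"

# Assumption: the flag may be written as T/F or True/False.
TRUE_VALUES = {"T", "TRUE", "True", "true"}


def productive_junction_counts(handle):
    """Sum duplicate_count per (V gene, J gene, amino acid junction)."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["productive"] not in TRUE_VALUES:
            continue  # skip non-productive rearrangements
        key = (row["v_call"], row["j_call"], row["junction_aa"])
        counts[key] += int(row["duplicate_count"])
    return counts


junction_counts = productive_junction_counts(io.StringIO(EXAMPLE_TSV))
print(junction_counts)
```

Collapsing on the amino acid junction in this way reflects the point made above that rows are unique at the DNA level but not necessarily at the protein level.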

References

1. Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine.

Authors: Niclas Thomas; James Heather; Wilfred Ndifon; John Shawe-Taylor; Benjamin Chain
Journal: Bioinformatics Date: 2013-01-09 Impact factor: 6.937

2. Quantitative analysis of the T cell receptor repertoire.

Authors: Imran Uddin; Annemarie Woolston; Thomas Peacock; Kroopa Joshi; Mazlina Ismail; Tahel Ronel; Connor Husovsky; Benny Chain
Journal: Methods Enzymol Date: 2019-06-20 Impact factor: 1.600

3. Spatial heterogeneity of the T cell receptor repertoire reflects the mutational landscape in lung cancer.

Authors: Kroopa Joshi; Marc Robert de Massy; Mazlina Ismail; James L Reading; Imran Uddin; Annemarie Woolston; Emine Hatipoglu; Theres Oakes; Rachel Rosenthal; Thomas Peacock; Tahel Ronel; Mahdad Noursadeghi; Virginia Turati; Andrew J S Furness; Andrew Georgiou; Yien Ning Sophia Wong; Assma Ben Aissa; Mariana Werner Sunderland; Mariam Jamal-Hanjani; Selvaraju Veeriah; Nicolai J Birkbak; Gareth A Wilson; Crispin T Hiley; Ehsan Ghorani; José Afonso Guerra-Assunção; Javier Herrero; Tariq Enver; Sine R Hadrup; Allan Hackshaw; Karl S Peggs; Nicholas McGranahan; Charles Swanton; Sergio A Quezada; Benny Chain
Journal: Nat Med Date: 2019-10-07 Impact factor: 53.440

4. Immunoglobulin and T Cell Receptor Genes: IMGT(®) and the Birth and Rise of Immunoinformatics.

Authors: Marie-Paule Lefranc
Journal: Front Immunol Date: 2014-02-05 Impact factor: 7.561

5. High-throughput immune repertoire analysis with IGoR.

Authors: Quentin Marcou; Thierry Mora; Aleksandra M Walczak
Journal: Nat Commun Date: 2018-02-08 Impact factor: 14.919

6. AIRR Community Standardized Representations for Annotated Immune Repertoires.

Authors: Jason Anthony Vander Heiden; Susanna Marquez; Nishanth Marthandan; Syed Ahmad Chan Bukhari; Christian E Busse; Brian Corrie; Uri Hershberg; Steven H Kleinstein; Frederick A Matsen Iv; Duncan K Ralph; Aaron M Rosenfeld; Chaim A Schramm; Scott Christley; Uri Laserson
Journal: Front Immunol Date: 2018-09-28 Impact factor: 7.561

7. Dynamic Perturbations of the T-Cell Receptor Repertoire in Chronic HIV Infection and following Antiretroviral Therapy.

Authors: James M Heather; Katharine Best; Theres Oakes; Eleanor R Gray; Jennifer K Roe; Niclas Thomas; Nir Friedman; Mahdad Noursadeghi; Benjamin Chain
Journal: Front Immunol Date: 2016-01-11 Impact factor: 7.561

8. Quantitative Characterization of the T Cell Receptor Repertoire of Naïve and Memory Subsets Using an Integrated Experimental and Computational Pipeline Which Is Robust, Economical, and Versatile.

Authors: Theres Oakes; James M Heather; Katharine Best; Rachel Byng-Maddick; Connor Husovsky; Mazlina Ismail; Kroopa Joshi; Gavin Maxwell; Mahdad Noursadeghi; Natalie Riddell; Tabea Ruehl; Carolin T Turner; Imran Uddin; Benny Chain
Journal: Front Immunol Date: 2017-10-12 Impact factor: 7.561

9. Clinical T Cell Receptor Repertoire Deep Sequencing and Analysis: An Application to Monitor Immune Reconstitution Following Cord Blood Transplantation.

Authors: Athina Soragia Gkazi; Ben K Margetts; Teresa Attenborough; Lana Mhaldien; Joseph F Standing; Theres Oakes; James M Heather; John Booth; Marlene Pasquet; Robert Chiesa; Paul Veys; Nigel Klein; Benny Chain; Robin Callard; Stuart P Adams
Journal: Front Immunol Date: 2018-11-05 Impact factor: 7.561

10. Urine-derived lymphocytes as a non-invasive measure of the bladder tumor immune microenvironment.

Authors: Yien Ning Sophia Wong; Kroopa Joshi; Pramit Khetrapal; Mazlina Ismail; James L Reading; Mariana Werner Sunderland; Andrew Georgiou; Andrew J S Furness; Assma Ben Aissa; Ehsan Ghorani; Theres Oakes; Imran Uddin; Wei Shen Tan; Andrew Feber; Ursula McGovern; Charles Swanton; Alex Freeman; Teresa Marafioti; Timothy P Briggs; John D Kelly; Thomas Powles; Karl S Peggs; Benjamin M Chain; Mark D Linch; Sergio A Quezada
Journal: J Exp Med Date: 2018-09-26 Impact factor: 14.307
