TCR-L: an analysis tool for evaluating the association between the T-cell receptor repertoire and clinical phenotypes.

Meiling Liu¹, Juna Goo², Yang Liu³, Wei Sun¹, Michael C Wu¹, Li Hsu¹, Qianchuan He⁴.

Abstract

BACKGROUND: T cell receptors (TCRs) play critical roles in adaptive immune responses, and recent advances in genome technology have made it possible to examine the T cell receptor (TCR) repertoire at the individual sequence level. The analysis of the TCR repertoire with respect to clinical phenotypes can yield novel insights into the etiology and progression of immune-mediated diseases. However, methods for association analysis of the TCR repertoire have not been well developed.
METHODS: We introduce an analysis tool, TCR-L, for evaluating the association between the TCR repertoire and disease outcomes. Our approach is developed under a mixed effect modeling framework, where the fixed effect represents features that can be explicitly extracted from TCR sequences while the random effect represents features that are hidden in TCR sequences and are difficult to extract. Statistical tests are developed to examine the two types of effects independently, and then the p values are combined.
RESULTS: Simulation studies demonstrate that (1) the proposed approach controls the type I error well; and (2) the power of the proposed approach is greater than that of approaches that consider the fixed effect only or the random effect only. The analysis of real data from a skin cutaneous melanoma study identifies an association between the TCR repertoire and the short/long-term survival of patients.
CONCLUSION: TCR-L can accommodate features that can be extracted as well as features that are hidden in TCR sequences. TCR-L provides a powerful approach for identifying associations between the TCR repertoire and disease outcomes.

Keywords: Association test; CDR3; Clinical phenotypes; T cell receptors; TCR homology; TCR repertoire

Year: 2022 PMID: 35484495 PMCID: PMC9052542 DOI: 10.1186/s12859-022-04690-2

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307

Introduction

With rapid progress in sequencing technologies, TCR repertoire profiling is emerging as one of the most powerful tools for studying the immune system and its functions [1]. TCRs are a group of highly polymorphic receptors expressed on the surface of T cells and play a key role in recognizing antigens presented by the major histocompatibility complex [2]. The high polymorphism of TCRs enables the recognition of a virtually infinite number of antigens and hence is critical to the flexibility of the adaptive immune system. The sequencing of TCRs allows dissection of the TCR repertoire at single-nucleotide resolution and provides unprecedented opportunities to study immune-mediated diseases [3]. For example, the profiling of the TCR repertoire has yielded novel insights into tumor biology and has the potential to answer the most pressing current questions in cancer immunotherapy [4]. TCRs consist of two chains (typically the alpha and beta chains), and the highly diverse TCR repertoire is the result of the somatic V(D)J recombination mechanism, where V(D)J represents the variable (V), diversity (D), and joining (J) genes. The beta chain is more diverse than the alpha chain due to the involvement of the D gene. The antigen specificity of TCRs is mainly determined by the complementarity-determining region 3 (CDR3), and the analysis of TCRs is primarily focused on the beta chain’s CDR3 sequences. A snapshot of the TCR data is shown in Table 1. A number of bioinformatic tools have been developed in recent years for processing and analyzing TCR data, such as miTCR [5], GLIPH [6], TCR-dist [7], ImmunoMap [8], TCRMatch [9], TRUST4 [10], and AutoCAT [11]. These tools are tremendously important to the analysis of TCR data and can handle a wide range of tasks, such as the retrieval of TCR sequences from raw data, calculation of V(D)J gene usage within each repertoire, TCR clustering, and epitope prediction.
On the other hand, few methods have been developed for regression analysis of TCR data, i.e., linking the TCR repertoire with clinical outcomes to interrogate the potential genetic associations. Currently, perhaps the most common approach is to calculate a diversity score for the TCR repertoire, such as the Shannon entropy, and then test this score for association with clinical outcomes [12, 13]. However, the diversity score captures only the frequencies of TCR sequences, while a large amount of information, such as the composition of the amino acids and the homology of TCR sequences, is simply ignored. This inefficient use of information can lead to a loss of statistical power and an inability to identify associations between the TCR repertoire and disease outcomes.
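To make the diversity-score approach concrete, the following is a minimal sketch of a Shannon entropy computed from the abundances of one repertoire. The function name and the use of the natural logarithm are assumptions made for illustration; the cited studies may use a different base or normalization.

```python
import math

def shannon_entropy(abundances):
    """Shannon entropy (natural log) of the sequence frequencies in one repertoire."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive number")
    entropy = 0.0
    for count in abundances:
        if count > 0:
            freq = count / total
            entropy -= freq * math.log(freq)
    return entropy

# Abundances of the five subject-1 sequences listed in the Table 1 snapshot
subject1_abundances = [1, 1, 9, 4, 2]
print(shannon_entropy(subject1_abundances))
```

Only the abundances enter the calculation: two repertoires with entirely different sequences but the same abundance profile receive the same score, which is the limitation described above.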

Table 1

Data structure of the TCR chain’s CDR3 region

Subj. ID	Nucleotide sequence	Amino acid seq.	Abundance	V segment	⋯
1	TGTGCCAGCAGCTTAGGTCGGGGCAAAGCTTTCTTT	CASSLGRGKAFF	1	TRBV7-9, …, TRBV11-3	⋯
1	TGTGCCAGCAGTTGGTTAATTGGCTACACCTTC	CASSWLIGYTF	1	TRBV6-4, …, TRBV6-1	⋯
1	TGTGCCAGCAGCTTAGGACGGGCTGAAGCTTTCTTT	CASSLGRAEAFF	9	TRBV7-9, …, TRBV11-3	⋯
1	TGTGCCAGCAGCTTGGGTCGATCACCCCTCCACTTT	CASSLGRSPLHF	4	TRBV11-2, …, TRBV11-3	⋯
1	TGTGCCAGCAGCCACGGACGAGCTGAAGCTTTCTTT	CASSHGRAEAFF	2	TRBV4-2	⋯
2	TGTGCCAGCAGGGACAGGCAAGAGACCCAGTACTTC	CASRDRQETQYF	1	TRBV6-4, …, TRBV6-1	⋯
2	TGTACCTGGAAGGTATTTTTT	CTWKVFF	1	TRBV7-2	⋯
3	TGCAGTGCTAGAGAGCGAGGCGAGCAGTACTTC	CSARERGEQYF	3	TRBV20-1	⋯
4	TGTGCTGTGAGTCAAAACGGTGCCAGACTCATGTTT	CAVSQNGARLMF	1	TRAV8-4, TRAV8-6	⋯
⋮	⋮	⋮	⋮	⋮	⋯

Association analysis of the TCR repertoire is challenging for several reasons. First, there is a large number of TCR sequences in the data, and many of the sequences are of low-to-moderate abundance. Hence the data are inherently high-dimensional and sparse. Second, different individuals usually carry a different number of TCR sequences, and many of the sequences differ from each other. That is, there are virtually no common features among different subjects. As such, there is no structured n × p data matrix for analysis (where n is the sample size and p is the dimension), and traditional statistical methods often do not apply. Third, akin to natural language, TCR sequences contain many potential features that are difficult to extract, and it is not clear how to utilize the embedded information effectively. In addition to the TCR sequences, other biological information, such as biochemical properties of the amino acids, may also facilitate the association analysis with disease outcomes. It is desirable to incorporate external biological information to enhance the power of the TCR repertoire analysis.
In this article, we develop a new analysis tool, TCR-L, for evaluating the association between the TCR repertoire and clinical phenotypes. Our approach is developed under a mixed effect modeling framework, where the fixed effect represents features that can be explicitly extracted from TCR sequences (such as the amino acid composition) while the random effect represents features that are hidden in TCR sequences and are difficult to extract. One advantage of such modeling is that prior biological information can be incorporated into the fixed effect to facilitate the analysis. Another advantage is that the sequence information of the TCR repertoire can be utilized.
To harness the sequence information, we conduct amino acid sequence alignment for each pair of TCR repertoires and then develop a metric, TCR homology (TCRhom), to characterize the variance matrix of the random effect. Then we adapt the MiST framework [14] to test the fixed effect and the random effect independently. Finally, the p values from the two types of effects are combined to yield an overall p value for assessing the association between the TCR repertoire and clinical phenotypes. Our approach is motivated by The Cancer Genome Atlas (TCGA) study on the immune landscape of cancer [15], where the TCR CDR3 sequences were recovered from RNA-sequencing data of tumor tissues. We give a detailed description of the proposed approach in the next section. In the Simulation section, we conduct simulations to evaluate the performance of the TCR-L method and compare it to possible alternative approaches. In the Real data analysis section, we apply our proposed approach to a skin cancer dataset from TCGA and show that TCR-L is able to identify an association between the TCR repertoire and patients’ survival.

Methods

Feature extraction

Assume that the total number of subjects is n. For i = 1, …, n, let be the ith subject’s TCR repertoire, and be the number of unique amino acid sequences in . For let denote the kth unique amino acid sequence, and denote the abundance of . The total number of amino acid sequences for is thus . To extract features from the TCR data, it is natural to consider the proportions of amino acids, as these proportions represent the basic composition of the TCR repertoire and are potentially related to protein structures. Let p denote the total number of amino acids that are present in the entire TCR dataset. In general, p = 20, since there are 20 amino acids and they all tend to show up in the dataset. Given , let be the jth amino acid’s frequency within the ith subject’s TCR repertoire, for . Then, represents the proportion of the jth amino acid for . Subsequently, for we define the following feature vector for the ith subject: In other words, is a feature vector where each element represents an amino acid’s proportion. Figure 1 gives a demonstration of the calculation of . Here, the subject i contains three unique sequences, . The amino acid A (Alanine) occurs once in sequence , thrice in , and twice in . Accounting for the abundance for , one can calculate and accordingly. Following the depiction in Fig. 1, we can extract p features for n subjects and further construct a feature matrix where each row represents the distribution of p amino acids’ proportions corresponding to the subject’s TCR repertoire. Besides the amino acid composition, other potential features, such as the V and J gene usage, can be extracted in a similar manner.
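The feature extraction described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' implementation: the function name, the input layout (a list of sequence and abundance pairs), and the fixed residue ordering are assumptions introduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in an assumed order

def amino_acid_proportions(repertoire):
    """Abundance-weighted amino acid proportions for one subject.

    repertoire: list of (amino_acid_sequence, abundance) pairs.
    Returns a list of 20 proportions in AMINO_ACIDS order.
    """
    counts = Counter()
    for sequence, abundance in repertoire:
        for residue in sequence:
            # a residue in a sequence seen `abundance` times counts that many times
            counts[residue] += abundance
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("repertoire contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

# Subject 2 from the Table 1 snapshot
subject2 = [("CASRDRQETQYF", 1), ("CTWKVFF", 1)]
features = amino_acid_proportions(subject2)
print(dict(zip(AMINO_ACIDS, features)))
```

Stacking one such vector per subject gives the n × p feature matrix, with every row summing to one.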

Fig. 1

Example of when and

Amino acid sequence homology

Now suppose that we have one amino acid sequence, , from the ith subject’s TCR repertoire, and another sequence, , from the jth subject’s TCR repertoire. We wish to align the two sequences and measure the homology between them. The simplest way of aligning sequences and computing the sequence homology is based on the identity matrix which assigns 1 when two amino acids are identical at the aligned position and 0 otherwise. However, this approach does not reflect related, but not identical, amino acids that can be possibly aligned. Alternatively, an amino acid substitution matrix, such as the widely used BLOSUM62 or PAM250 [16], can be used to align a pair of amino acid sequences. The amino acid substitution matrix scores matches and/or mismatches between aligned amino acids more dynamically than the identity matrix. We perform pairwise alignment via the Needleman-Wunsch algorithm. Default gap penalties were chosen for the alignment algorithm. Let be the sum of values on the substitution matrix minus the affine gap penalty for the aligned and sequences. Alternatively, the can be calculated by other approaches, such as the TCR-dist [7]. Once is obtained, the homology between the aligned and sequences can be computed asThe maximum value of is 1 when and are identical sequences. can fall below zero when , when the sum of negative substitution values and gap penalties dominate the sum of positive substitution values. Here, a positive value of the substitution matrix means that the aligned amino acid pair is observed more than expected by chance, whereas a negative one means the opposite [16].
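A simplified sketch of the alignment step follows. It departs from the description above in three labeled ways: it uses a linear gap penalty rather than an affine one, a toy match/mismatch score rather than BLOSUM62 or PAM250, and it normalizes the pairwise score by the geometric mean of the two self-alignment scores. That normalization is an assumption, chosen only because it reproduces the stated properties (a value of 1 for identical sequences, negative values when the alignment score is negative). All function names are hypothetical.

```python
import math

def nw_score(a, b, score, gap=-4):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align the two residues
                prev[j] + gap,                             # gap inserted in b
                curr[j - 1] + gap,                         # gap inserted in a
            ))
        prev = curr
    return prev[-1]

def toy_score(x, y):
    """Stand-in for a substitution matrix lookup: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1

def homology(a, b, score=toy_score, gap=-4):
    """Alignment score scaled by the self-alignment scores (assumed normalization).

    Requires non-empty sequences with positive self-alignment scores.
    """
    self_a = nw_score(a, a, score, gap)
    self_b = nw_score(b, b, score, gap)
    return nw_score(a, b, score, gap) / math.sqrt(self_a * self_b)

# Two subject-1 sequences from Table 1 that differ at two positions
print(homology("CASSLGRGKAFF", "CASSLGRAEAFF"))
```

A real substitution matrix can be supplied through the `score` argument, for example as a lookup into a dictionary keyed by residue pairs.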

TCR repertoire homology

Next, we propose a metric, TCR repertoire homology (TCRhom), to measure the homology between two subjects’ TCR repertoires based on . The algorithm of the TCRhom is illustrated in Algorithm 1. Starting with the first amino acid sequence of the ith subject’s TCR repertoire, i.e., , pairwise alignments are performed with sequences of the jth subject’s TCR repertoire. We compute sequence homology for given the substitution matrix and gap penalties. The maximal value of these is identified. Then we multiply the maximal sequence homology by the abundance that corresponds to . Next, we shift our focus to the second amino acid sequence of the ith subject’s TCR repertoire, i.e., . Similarly, pairwise alignments are conducted between and , sequences of the jth subject’s TCR repertoire. Among the alignments, the maximal sequence homology is identified and then is multiplied by the abundance accordingly. We repeat this process until every sequence of the ith subject’s repertoire has found its maximal homology in the jth subject’s repertoire. Once this is completed, we now switch to the jth subject’s repertoire . Given a sequence in , we find ’s maximal homology in and multiply the maximal homology by the abundance . We repeat this procedure until every sequence of has found its maximal homology in . A detailed example of the TCRhom calculation is shown in the Additional file 1: Section 1. Regardless of which amino acid substitution matrix is used for calculating the homology, the TCRhom preserves two useful properties. First, the TCRhom is symmetric, i.e., for Second, when (i.e., for the same person), we have . Thus, we have for , indicating a perfect match between and . To ensure that the TCRhom matrix S is positive semi-definite, we use a low-rank approximation via the eigen-decomposition of S: where are the non-negative eigenvalues of S, and are the corresponding eigenvectors. In what follows, S denotes the symmetric and positive semi-definite TCRhom matrix for n subjects.
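The procedure above can be sketched as follows, under stated assumptions. The division by the combined total abundance of the two repertoires is assumed here because it yields the two properties mentioned in the text (symmetry, and a value of 1 for a repertoire compared with itself); the source formula may differ. The `toy_homology` function is a placeholder for the alignment-based sequence homology, and all names are hypothetical.

```python
import numpy as np

def tcrhom(rep_i, rep_j, homology):
    """Abundance-weighted best-match homology between two repertoires.

    rep_i, rep_j: lists of (sequence, abundance) pairs.
    homology: symmetric function with homology(s, s) == 1.
    """
    def directed(src, dst):
        # every source sequence contributes its best match in dst, times its abundance
        return sum(c * max(homology(s, t) for t, _ in dst) for s, c in src)

    total = sum(c for _, c in rep_i) + sum(c for _, c in rep_j)  # assumed normalizer
    return (directed(rep_i, rep_j) + directed(rep_j, rep_i)) / total

def project_psd(S):
    """Discard negative eigenvalues so the result is positive semi-definite."""
    S = (S + S.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T

def toy_homology(a, b):
    """Placeholder: fraction of matching positions, not an alignment score."""
    if a == b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

repertoires = [
    [("CASSLGRGKAFF", 1), ("CASSLGRAEAFF", 9)],
    [("CASRDRQETQYF", 1), ("CTWKVFF", 1)],
    [("CSARERGEQYF", 3)],
]
n = len(repertoires)
S = np.array([[tcrhom(repertoires[i], repertoires[j], toy_homology)
               for j in range(n)] for i in range(n)])
S_psd = project_psd(S)
print(np.round(S_psd, 3))
```

Each entry requires all pairwise sequence comparisons between two repertoires, so the cost grows with the product of the repertoire sizes; caching homology values for repeated sequence pairs is a natural optimization.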

Test the association between the TCR repertoire and clinical phenotypes

We adapt the Mixed effects Score Test (MiST) framework [14] to test whether the TCR repertoire is associated with a clinical phenotype. We first introduce some notations. For the ith subject, , let be a vector that consists of an intercept and q confounding variables, e.g., patient age, sex, etc. Let be a link function. For continuous traits, is the identity function, and for binary traits, . We consider a random vector which is assumed to follow , where is a scale parameter and S can be estimated by the TCRhom matrix. Recall that denotes the ith subject’s TCR repertoire, and represents the extracted features of the repertoire, such as the proportions of the 20 amino acids. Assume a generalized linear mixed model for a clinical phenotype with its meanwhere is a vector of regression coefficients, and is a vector of regression coefficients. Here, is a matrix (with ) and is introduced to accommodate r known biological characteristics for . For example, hydrophobicity is an important biochemical property of amino acids, and is critical for determining proteins’ structures. To accommodate this property in the proposed model, W can be specified as a vector whose elements represent the hydrophobicity scores of the 20 amino acids. We are interested in testing the null hypothesis and that is, neither nor has any influence on the clinical phenotype. Under , the score for can be derived aswhere for a continuous trait and for a binary trait, and denotes the maximum likelihood estimate (MLE) of under Let be a diagonal matrix where the ith diagonal element is for the continuous trait and for the binary trait. Let be a matrix and . The asymptotic covariance of can be estimated bywhere is the (i, j) element of . Then asymptotically follows a chi-square distribution with r degrees of freedom. One may also obtain the score for under , however, because and are correlated, it is difficult to derive the covariance between the two scores. 
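For orientation, the model and null hypothesis described in this paragraph can be written out as below. The notation is assumed for illustration and may not match the symbols of the original article: \mu_i is the mean of the phenotype, X_i the covariate vector with intercept, Z_i the extracted feature vector of length p, W the p × r matrix of known biological characteristics, \alpha and \pi the regression coefficients, h the random effect, and \tau its scale parameter.

```latex
\begin{align*}
g(\mu_i) &= X_i^{\top}\alpha + Z_i^{\top} W \pi + h_i, \qquad i = 1, \dots, n, \\
h = (h_1, \dots, h_n)^{\top} &\sim N(0, \tau S), \\
H_0 &: \pi = 0 \ \text{and} \ \tau = 0 .
\end{align*}
```

Under this reading, \pi has length r, which agrees with the r degrees of freedom of the chi-square reference distribution for the fixed-effect score statistic, and g is the identity for continuous traits and the logit for binary traits.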
In line with the MiST approach [14], we consider the null hypothesis without constraining . Let denote MLE of under . Let , and let denote a matrix, where . For continuous trait, the score statistic for under can be derived aswhere and for . Let . Then it can be shown that . An estimate of is . Furthermore, we can derive thatThen we define the testing statistic to beIt can be shown that asymptotically follows a standard normal distribution under . For binary trait, it is known that the maximization of the log-likelihood is equivalent to iteratively reweighted least squares [17], thus we derive a similar statistic for logistic regression under such a linearization framework. Let be a diagonal matrix where the ith diagonal element is , where . Let denote the working response for . Let and , then the score statistic for under can be derived aswhere . Let . Following the argument in the linear regression, we have . An estimate for is . Note thatLet . It can be shown that asymptotically follows a standard normal distribution under . One can show that is independent of , and the proof of the independence is provided in the Additional file 1: Section 2. Thus, the p value based on and the p value based on are independent. Then one can use Fisher’s test statistic to combine the two p values. If the combined p value is less than the significance level of , then we conclude that the TCR repertoire is associated with the clinical phenotype being studied. We name the proposed approach the TCR-L. Since the S matrix can be constructed based on BLOSUM62 or PAM250, this yields two variations for the TCR-L approach, TCRL-B62 and TCRL-P250.
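The final combination step can be illustrated with a short sketch. Fisher's statistic for two independent p values is -2(log p1 + log p2), which follows a chi-square distribution with 4 degrees of freedom under the null hypothesis; for 4 degrees of freedom the upper tail probability has the closed form exp(-x/2)(1 + x/2), so no statistics library is needed. The function and argument names are hypothetical.

```python
import math

def fisher_combine_two(p_fixed, p_random):
    """Combine two independent p values with Fisher's method."""
    for p in (p_fixed, p_random):
        if not 0.0 < p <= 1.0:
            raise ValueError("p values must lie in (0, 1]")
    # Fisher's statistic; chi-square with 2 * 2 = 4 degrees of freedom under the null
    x = -2.0 * (math.log(p_fixed) + math.log(p_random))
    # upper tail probability of a chi-square variable with 4 degrees of freedom
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

print(fisher_combine_two(0.05, 0.05))
```

The validity of this reference distribution rests on the independence of the two p values, which is why the independence result for the two test statistics matters.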

Simulation

In simulations, we generated TCR sequences of lengths from 10 to 18. We simulated the first four amino acids (i.e., the head segment) of a TCR sequence to be CASS, CASR, CSAR, or CAST, with the probability of 0.7, 0.1, 0.1, or 0.1. We simulated the last two amino acids (i.e., the tail segment) of a TCR sequence to be YF, FF, HF, or TF, with the probability 0.4, 0.4, 0.1, or 0.1. These patterns and probabilities were chosen based on the real data analyzed in the next section. The middle segment was generated by randomly sampling 4–10 amino acids from the 20 amino acids with replacement. Finally, the three segments were concatenated to yield one TCR sequence. The diversity of the TCR repertoire is partly due to the P/N nucleotides addition mechanism [18]. Following this mechanism, we randomly introduced 0, 1, or 2 additional amino acids into the middle segment to increase the diversity. An example of the simulation process is given in Fig. 2. To generate the TCR repertoire for the ith subject, we randomly drew 2–25 unique TCR sequences for a given subject, and the abundance of each unique sequence was randomly chosen to be 1–5.
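The generation scheme above can be sketched as follows. Where the text only says that a quantity was chosen randomly (the middle-segment length, the number and positions of added amino acids, the number of unique sequences, and the abundances), the sketch assumes uniform sampling; the function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HEADS, HEAD_PROBS = ["CASS", "CASR", "CSAR", "CAST"], [0.7, 0.1, 0.1, 0.1]
TAILS, TAIL_PROBS = ["YF", "FF", "HF", "TF"], [0.4, 0.4, 0.1, 0.1]

def simulate_sequence(rng):
    """One CDR3-like sequence: head + middle (with 0-2 added residues) + tail."""
    head = rng.choices(HEADS, weights=HEAD_PROBS)[0]
    tail = rng.choices(TAILS, weights=TAIL_PROBS)[0]
    # middle segment: 4-10 residues sampled with replacement
    middle = rng.choices(AMINO_ACIDS, k=rng.randint(4, 10))
    # mimic P/N addition: insert 0, 1, or 2 extra residues at random positions
    for _ in range(rng.randint(0, 2)):
        middle.insert(rng.randint(0, len(middle)), rng.choice(AMINO_ACIDS))
    return head + "".join(middle) + tail

def simulate_repertoire(rng):
    """2-25 unique sequences, each with an abundance of 1-5."""
    n_unique = rng.randint(2, 25)
    sequences = set()
    while len(sequences) < n_unique:
        sequences.add(simulate_sequence(rng))
    return [(seq, rng.randint(1, 5)) for seq in sorted(sequences)]

rng = random.Random(2022)  # fixed seed for reproducibility
repertoire = simulate_repertoire(rng)
print(len(repertoire), repertoire[:3])
```

With a 4-residue head, a 2-residue tail, and a middle segment of 4 to 12 residues after additions, the total length ranges from 10 to 18, matching the range stated above.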

Fig. 2

Simulation scheme for generating TCR CDR3 sequences. The yellow boxes represent added amino acids. The blue boxes represent the head segment while the green boxes represent the tail segment of the CDR3 sequence.

Besides the proposed TCR-L approach, we considered two alternative analysis strategies as follows: (I) only the fixed effect (i.e., the effect of the extracted features) was considered; (II) only the random effect (i.e., the effect of the ) was considered. For (I), we fit the following model and then used a score statistic that is similar to to test the nullity of ; we call this approach the Ext. features (i.e., extracted features). For (II), we fit the following model and then used a statistic similar to and to test the effect of for the continuous and the binary traits, respectively. Depending on whether the S matrix was derived from BLOSUM62 or PAM250, this random-effect-based approach was named Seq.-B62 or Seq.-P250. We simulated two confounding variables: followed and followed N(0, 1) for . We considered the Kyte hydrophobicity of the 20 amino acids for W. Then represented the weighted sum of the 20 amino acids’ hydrophobic scores for the ith subject. The random effect was generated from , where the variance-covariance matrix S was specified based on either BLOSUM62 or PAM250. We considered . A total of 10,000 and 1,000 replicates were generated for examining type I error and power, respectively. In all simulation studies, the type I error and power of the test were evaluated at the significance level of .
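The hydrophobicity-weighted feature described above can be sketched as a weighted sum of amino acid proportions. The sketch assumes that "Kyte hydrophobicity" refers to the Kyte-Doolittle hydropathy index and uses its published per-residue values; the function name and the dictionary-based input are hypothetical.

```python
# Kyte-Doolittle hydropathy index (assumed to be the scale used for W)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity_feature(proportions):
    """Weighted sum of hydropathy scores.

    proportions: dict mapping one-letter amino acid codes to their
    proportion in a subject's repertoire (values should sum to 1).
    """
    return sum(KYTE_DOOLITTLE[residue] * share
               for residue, share in proportions.items())

# A toy repertoire made of 50% alanine and 50% arginine
print(hydrophobicity_feature({"A": 0.5, "R": 0.5}))
```

With W taken as a single column of hydropathy scores, each subject's 20 proportions collapse into one scalar covariate, so the fixed-effect test has one degree of freedom.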

Type I error

To evaluate the type I error for each of the methods, we generated a binary trait bywhere we set , , and . The random noise was generated from N(0, 1). Table 2 shows that the proposed TCRL-P250 and TCRL-B62 were able to control type I error at the nominal level.

Table 2

Empirical type I errors () of the TCRL (TCRL-B62 and TCRL-P250) and alternative methods for binary or continuous outcomes

Approach	Binary trait			Continuous trait		
	n = 350	n = 400	n = 450	n = 350	n = 400	n = 450
Ext. features	1.118	1.050	1.008	1.022	1.042	0.956
Seq.-B62	0.980	0.988	0.958	0.938	0.922	0.916
Seq.-P250	0.934	0.944	0.964	0.908	0.890	0.920
TCRL-B62	1.056	1.030	1.050	1.040	1.052	0.980
TCRL-P250	1.014	1.034	1.066	1.028	1.008	0.932

Power

In all power studies, we fixed , , and . For the binary trait, we simulated the data such that the sample size per group was no less than 10% of n to mitigate the class imbalance problem. In Table 3, we considered the situation that the model contained only fixed effects, that is, we simulated the data under the model (2). For the binary trait, we set , and for the continuous trait, we set . Under this situation, because there was no contribution of the random effect for generating the trait, both the Seq.-P250 and the Seq.-B62 had low power as expected. The power for the TCR-L models was reasonably high, but lower than that of the Ext. features.

Table 3

Power comparison of the TCRL and alternative methods for binary or continuous outcomes when only the fixed effect was considered

Approach	Binary trait			Continuous trait		
	n = 350	n = 400	n = 450	n = 350	n = 400	n = 450
Ext. features	0.906	0.920	0.938	0.937	0.950	0.980
Seq.-B62	0.120	0.099	0.114	0.138	0.125	0.159
Seq.-P250	0.112	0.104	0.106	0.126	0.123	0.154
TCRL-B62	0.811	0.842	0.895	0.875	0.911	0.953
TCRL-P250	0.810	0.845	0.892	0.881	0.911	0.951

In Table 4, the binary and continuous traits were simulated under the model (3), where the random effect was generated based on BLOSUM62. We set for the binary trait, and for the continuous trait. As expected, the power for the Seq.-B62 was the highest at each sample size. The power for the Ext. features was the lowest since the fixed effect was not used to generate the trait in this situation. Moreover, we found that the TCRL-P250 maintained good power even though the true random effect was generated based on BLOSUM62. This indicates that the proposed approaches are fairly robust to the choice of the amino acid substitution matrix for S. Similar results were observed in Table 5, where the random effect was generated based on PAM250.

Table 4

Power comparison of the TCRL and alternative methods for binary or continuous outcomes when only the random effect was considered (S was based on BLOSUM62)

Approach	Binary trait			Continuous trait		
	n = 350	n = 400	n = 450	n = 350	n = 400	n = 450
Ext. features	0.094	0.115	0.109	0.092	0.131	0.118
Seq.-B62	0.858	0.919	0.960	0.821	0.891	0.932
Seq.-P250	0.792	0.857	0.917	0.761	0.819	0.884
TCRL-B62	0.822	0.887	0.925	0.780	0.864	0.899
TCRL-P250	0.746	0.818	0.878	0.722	0.783	0.856

Table 5

Power comparison of the TCRL and alternative methods for binary or continuous outcomes when only the random effect was considered (S was based on PAM250)

Approach	Binary trait			Continuous trait
Approach	n=350	n=400	n=450	n=350	n=400	n=450
Ext.features	0.123	0.119	0.127	0.099	0.121	0.107
Seq.-B62	0.804	0.881	0.919	0.778	0.861	0.908
Seq.-P250	0.866	0.929	0.962	0.866	0.917	0.949
TCRL-B62	0.765	0.851	0.886	0.744	0.824	0.860
TCRL-P250	0.824	0.907	0.936	0.816	0.890	0.926

Next, we simulated the traits from the model (1), where both the fixed and random effects contributed to the disease outcome. For the binary trait, we set and , and for the continuous trait, we set and . In Table 6, the random effect was generated based on BLOSUM62, and in Table 7, the random effect was based on PAM250. Under this setting, the overall effect was driven by both the fixed effect and the random effect. The power of the TCR-L approaches was among the highest at each sample size in Tables 6 and 7. As in Tables 4 and 5, the TCR-L approaches were quite robust to the choice of the amino acid substitution matrix for S. In contrast, Ext. features, Seq.-B62, and Seq.-P250 had reduced power because each accounted for only one of the two effects.

Table 6

Power comparison of the TCRL and alternative methods for binary or continuous outcomes when both fixed and random effects were considered (S was based on BLOSUM62)

Approach	Binary trait			Continuous trait
Approach	n=350	n=400	n=450	n=350	n=400	n=450
Ext.features	0.281	0.285	0.324	0.465	0.481	0.528
Seq.-B62	0.809	0.883	0.933	0.751	0.823	0.878
Seq.-P250	0.735	0.801	0.882	0.697	0.758	0.818
TCRL-B62	0.820	0.861	0.922	0.807	0.870	0.916
TCRL-P250	0.743	0.812	0.888	0.777	0.836	0.874

Table 7

Power comparison of the TCRL and alternative methods for binary or continuous outcomes when both fixed and random effects were considered (S was based on PAM250)

Approach	Binary trait			Continuous trait
Approach	n=350	n=400	n=450	n=350	n=400	n=450
Ext.features	0.301	0.326	0.321	0.474	0.524	0.541
Seq.-B62	0.761	0.833	0.892	0.711	0.781	0.847
Seq.-P250	0.838	0.906	0.946	0.780	0.845	0.914
TCRL-B62	0.772	0.844	0.892	0.798	0.859	0.895
TCRL-P250	0.832	0.902	0.939	0.843	0.899	0.942

Additional simulation studies are provided in the Supplementary Material. In Additional file 1: Section 3, we considered two characteristics and for the extracted features. In Additional file 1: Section 4, we conducted simulation studies based on a real dataset of skin cancer. The results of those experiments showed a pattern similar to that in this section.

Real data analysis

TCR -chain’s CDR3 sequences of the skin cutaneous melanoma (SKCM) patients were obtained from TCGA [15]. Patients who had only one unique sequence in their TCR data were excluded. Schadendorf et al. [19] observed that the survival curve of melanoma patients reached a plateau around 3 years of follow-up regardless of prior therapy, ipilimumab dose, or treatment regimen in their clinical trials. Thus, we analyzed a binary trait based on whether the ith subject survived at least 3 years () or not (). Those who were censored before 3 years of follow-up were excluded from our analysis. We adjusted for age, gender, tumor purity and ploidy of cancer cells. After removing subjects with missing values, 248 melanoma patients were kept in our analysis, with 67 patients having short-term survival and 181 patients having long-term survival. The total number of unique amino acid sequences across the 248 patients was 6660. The number of unique sequences per patient ranged from 2 to 182, with a median of 14.5. The length of the amino acid sequences ranged from 6 to 15, with a median of 13.

Prior to fitting our models, we performed an age-adjusted logistic regression to assess the association of the Shannon entropy with the survival status. The Shannon entropy for the ith subject was computed as $H_i = -\sum_{k} p_{ik} \log p_{ik}$, where $p_{ik} = a_{ik} / \sum_{k'} a_{ik'}$ and $a_{ik}$ was the abundance corresponding to the kth unique amino acid sequence. A higher Shannon entropy translates to a greater diversity of an individual’s TCR repertoire. In our logistic regression model, the odds ratio for age (year) was 1.03 (95% confidence interval: 1.01–1.05; p value: 0.002), suggesting that older patients were more likely to have short-term survival. The association between the Shannon entropy and the survival status after controlling for age had a p value of 0.074 (odds ratio: 0.78; 95% confidence interval: 0.60–1.02).
This suggests that a higher diversity of the TCR repertoire may be beneficial for longer survival, but the p value does not reach the significance level of 0.05.

Next, we applied the proposed approach along with the compared approaches to assess the association between the TCR repertoire and the survival status. We used the proportions of the 20 amino acids as the extracted feature vector . To incorporate biochemical properties of amino acids into the analysis, we used the Kyte hydrophobicity score as the W. The hydrophobicity score is positive for hydrophobic amino acids and negative for hydrophilic amino acids. The product of W and , , can be seen as the hydrophobicity score weighted by the proportions of amino acids for the ith subject’s repertoire. The p values for the five considered approaches are listed in Table 8. The approach based on the extracted features yielded a p value of 0.097, indicating that the fixed effect (i.e., the hydrophobicity score) did not have a significant association with the survival status. The p values for TCRL-B62 and TCRL-P250 are 2.75E−03 and 3.52E−03, respectively, both of which are significant at the nominal level of 0.05. The p values for Seq.-B62 and Seq.-P250 are 2.77E−03 and 3.90E−03, respectively, indicating that the random effect (i.e., the hidden features of the TCR repertoire) made the major contribution to the observed association signal. Overall, our analysis established an association between the TCR repertoire and the survival status. Our findings also suggest that, besides features that can be extracted, hidden features should be considered, as they can be important factors for evaluating the association between the TCR repertoire and disease outcomes.
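The two repertoire summaries used above, the Shannon entropy and the hydrophobicity score weighted by amino acid proportions, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the entropy uses the natural logarithm, the amino acid proportions are optionally weighted by clone abundance, and the hydropathy values are the published Kyte-Doolittle scale.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values: positive = hydrophobic, negative = hydrophilic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def shannon_entropy(abundances):
    """Shannon entropy of one repertoire from its clone abundances.

    Assumption: natural logarithm; the log base is not stated in the text.
    """
    total = sum(abundances)
    probs = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in probs)

def amino_acid_proportions(sequences, abundances=None):
    """Proportions of the 20 amino acids over all CDR3 sequences of a subject.

    Assumption: each sequence is weighted by its abundance when given,
    otherwise every unique sequence counts once.
    """
    if abundances is None:
        abundances = [1] * len(sequences)
    counts = Counter()
    for seq, weight in zip(sequences, abundances):
        for aa in seq:
            counts[aa] += weight
    total = sum(counts[aa] for aa in KYTE_DOOLITTLE)
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in KYTE_DOOLITTLE}

def hydrophobicity_score(proportions):
    """Hydropathy values weighted by the amino acid proportions."""
    return sum(KYTE_DOOLITTLE[aa] * p for aa, p in proportions.items())

# Toy repertoire (hypothetical data, for illustration only).
toy_sequences = ["CASSLGQ", "CASSPDR"]
toy_abundances = [3, 1]
print(shannon_entropy(toy_abundances))
print(hydrophobicity_score(amino_acid_proportions(toy_sequences, toy_abundances)))
```

A repertoire dominated by a single clone has entropy close to zero, whereas evenly distributed clones maximize it, which matches the interpretation of entropy as repertoire diversity.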

Table 8

p Values for assessing the association between melanoma patients’ TCR repertoires and their survival statuses

Ext. features	Seq.-B62	Seq.-P250	TCRL-B62	TCRL-P250
9.71E−02	2.77E−03	3.90E−03	2.75E−03	3.52E−03
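The abstract states that the fixed and random effects are tested independently and that the resulting p values are then combined; the combination rule is not restated in this section. As a hedged illustration only, the sketch below uses Fisher's method, a standard way to combine independent p values. Its use here is an assumption rather than a confirmed description of TCR-L, and the function name fisher_combine is hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent p values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    so no external statistics library is needed.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, series = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        series += term
    return math.exp(-half_stat) * series

# Illustrative inputs: the Ext. features and Seq.-B62 p values from Table 8.
print(fisher_combine([9.71e-2, 2.77e-3]))
```

Applied to these two stand-alone p values, the sketch gives roughly 2.5E−03. It should not be expected to reproduce the TCRL-B62 entry of Table 8 exactly, since the component tests inside TCR-L may not coincide with the stand-alone tests.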

Discussion

We have proposed an analysis tool for testing the association between the TCR repertoire and clinical phenotypes. Our approach accounts for both the features that can be extracted and the features that are difficult to extract from the sequences of the TCR repertoire, and it is able to incorporate prior biological knowledge into the analysis.

In our analysis, we considered the amino acid compositions as the extracted features. Other types of features, such as amino acid k-mers, may be considered as well. A k-mer is a string of k consecutive amino acids, and a similar strategy has been used in text mining for extracting feature vectors. A major challenge of this strategy is that the number of k-mers may be relatively large; for example, there are 400 possible 2-mers of amino acids. To handle high dimensions, one simple way is to focus on the relatively frequent k-mers, though more sophisticated approaches should be studied further. Besides k-mers, new methods have recently been developed to infer cognate targets or antigen specificities of TCRs [20, 21], and these targets/antigens provide potential features for data analysis. With the rapid growth of biological knowledge and databases, the prediction of TCR targets is likely to expand dramatically in the coming years. Future development is needed to utilize these predicted targets for feature extraction.

As to the sequence homology, the TCRhom was proposed to harness the sequence information embedded in the TCR repertoire, and this approach is similar in spirit to related methods in quantitative genetics, where the genotype-trait association is modeled through a set of random effects [22, 23]. In quantitative genetics, the variance-covariance matrix of the random effect, S, is specified by either the marker-based genetic relationship or the kinship among the studied subjects [24, 25]. In our case, the genetic relationship is characterized by the TCR repertoire, whose composition and abundance vary across subjects. We used pairwise alignment for aligning amino acid sequences when calculating the TCRhom matrix. Alternatively, multiple sequence alignment (MSA) approaches, such as MAFFT [26], MUSCLE [27], and Clustal Omega [28], can potentially be applied. MSA methods are usually used to infer conserved regions or evolutionary relationships between sequences, and generally require more memory than pairwise alignment methods. Which method to use is likely to depend on the specific task at hand and the computing environment available to investigators.

The proposed approach has limitations. Since the random effect for the TCR repertoire has no explicit features, we cannot estimate the coefficients of the features. As such, the size of the association (i.e., the regression coefficients) cannot be evaluated with the current model. To obtain the regression coefficients for the association, one would need to extract features from the and evaluate these features under the alternative hypothesis, but how to efficiently extract features from the TCR repertoire remains to be investigated. Another limitation concerns the prediction of the outcome. TCRL is designed to test the genetic associations for the TCR repertoire and cannot yet be used for predicting the outcome. Prediction usually requires a set of features as well as the estimated effect sizes of these features. TCRL does not have these components, and we hope that future research on feature extraction will help to address this important issue.

Challenges remain in the analysis of the TCR repertoire. For example, how to combine the and chains of the TCR to conduct a more comprehensive analysis is largely unknown. Moreover, many neoantigens in cancers are caused by somatic mutations, and how to harness such information in the TCR analysis is still unclear. One potential strategy is to infer the neoantigens from the somatic mutations of the patients, and then investigate whether there are specific TCRs that target these antigens. Thus, somatic mutations and the TCR repertoire are inherently related to each other, and their joint analysis can potentially provide important information for immunotherapy. Finally, while our simulation studies and real data analysis focus on the TCR, the proposed approach can also be applied to the analysis of the B cell receptor repertoire, which has been shown to play important roles in immune-mediated diseases [29]. Overall, our approach provides a new tool to examine the relationship between the immune repertoire and disease outcomes.
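The k-mer strategy mentioned in the Discussion, including the simple remedy of keeping only the relatively frequent k-mers, can be sketched as follows. This is an illustrative sketch rather than part of TCR-L: the function names, the pooled-frequency ranking, and the parameter top are assumptions introduced here.

```python
from collections import Counter

def kmer_counts(sequences, k=2):
    """Count all strings of k consecutive amino acids in a list of CDR3 sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

def frequent_kmer_features(repertoires, k=2, top=50):
    """Build per-subject feature vectors from the most frequent k-mers.

    repertoires: one list of CDR3 strings per subject.
    Returns (vocab, features), where vocab holds the `top` k-mers with the
    largest pooled counts and features[i][j] is the proportion of subject i's
    k-mers that equal vocab[j].
    """
    per_subject = [kmer_counts(rep, k) for rep in repertoires]
    pooled = Counter()
    for counts in per_subject:
        pooled.update(counts)
    vocab = [kmer for kmer, _ in pooled.most_common(top)]
    features = []
    for counts in per_subject:
        total = sum(counts.values()) or 1  # guard against empty repertoires
        features.append([counts[kmer] / total for kmer in vocab])
    return vocab, features

# Toy repertoires (hypothetical sequences, for illustration only).
vocab, features = frequent_kmer_features(
    [["CASSLGQ", "CASSPDR"], ["CASRTGE"]], k=2, top=5
)
print(vocab)
print(features)
```

With 20 amino acids there are 20**k possible k-mers (400 for k = 2 and 8000 for k = 3), so restricting the vocabulary keeps the fixed-effect design matrix at a manageable width.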

Conclusions

To conclude, we have developed an approach, TCR-L, that is able to utilize both extracted features and hidden features to test the association between the TCR repertoire and clinical outcomes. Through simulations, we showed that our proposed approach controls the type I error well and is more powerful than possible alternative approaches. In real data analysis, our approach successfully identified an association between the TCR repertoire and the survival of melanoma patients, demonstrating the power of the proposed approach. Overall, the TCR-L provides a timely tool for examining the association between the TCR repertoire and clinical outcomes, which has emerged as an important topic in clinical investigations.

Additional file 1: Supplementary material, including a detailed example of the TCRhom calculation, the proof of independence of score statistics and more simulation studies.

27 in total

1. GCTA: a tool for genome-wide complex trait analysis.

Authors: Jian Yang; S Hong Lee; Michael E Goddard; Peter M Visscher
Journal: Am J Hum Genet Date: 2010-12-17 Impact factor: 11.025

2. Identifying specificity groups in the T cell receptor repertoire.

Authors: Jacob Glanville; Huang Huang; Allison Nau; Olivia Hatton; Lisa E Wagar; Florian Rubelt; Xuhuai Ji; Arnold Han; Sheri M Krams; Christina Pettus; Nikhil Haas; Cecilia S Lindestam Arlehamn; Alessandro Sette; Scott D Boyd; Thomas J Scriba; Olivia M Martinez; Mark M Davis
Journal: Nature Date: 2017-06-21 Impact factor: 49.962

3. Quantifiable predictive features define epitope-specific T cell receptor repertoires.

Authors: Pradyot Dash; Andrew J Fiore-Gartland; Tomer Hertz; George C Wang; Shalini Sharma; Aisha Souquette; Jeremy Chase Crawford; E Bridie Clemens; Thi H O Nguyen; Katherine Kedzierska; Nicole L La Gruta; Philip Bradley; Paul G Thomas
Journal: Nature Date: 2017-06-21 Impact factor: 49.962

4. A resource-efficient tool for mixed model association analysis of large-scale data.

Authors: Longda Jiang; Zhili Zheng; Ting Qi; Kathryn E Kemper; Naomi R Wray; Peter M Visscher; Jian Yang
Journal: Nat Genet Date: 2019-11-25 Impact factor: 38.330

5. ImmunoMap: A Bioinformatics Tool for T-cell Repertoire Analysis.

Authors: John-William Sidhom; Catherine A Bessell; Jonathan J Havel; Alyssa Kosmides; Timothy A Chan; Jonathan P Schneck
Journal: Cancer Immunol Res Date: 2017-12-20 Impact factor: 11.151

6. Quantitative assessment of T cell repertoire recovery after hematopoietic stem cell transplantation.

Authors: Jeroen W J van Heijst; Izaskun Ceberio; Lauren B Lipuma; Dane W Samilo; Gloria D Wasilewski; Anne Marie R Gonzales; Jimmy L Nieves; Marcel R M van den Brink; Miguel A Perales; Eric G Pamer
Journal: Nat Med Date: 2013-02-24 Impact factor: 53.440