Yotaro Katayama1, Ryo Yokota2, Taishin Akiyama3,4, Tetsuya J Kobayashi5,1.
Abstract
Sparked by the development of genome sequencing technology, the quantity and quality of data handled in immunological research have been changing dramatically. Various data and database platforms are now driving the rapid progress of machine learning for immunological data analysis. Among the various topics in immunology, T cell receptor repertoire analysis is one of the most important targets of machine learning for assessing the state and abnormalities of immune systems. In this paper, we review recent repertoire analysis methods based on machine learning and deep learning and discuss their prospects.
Keywords: T cell; T cell receptor; deep learning; immunoinformatics; machine learning
Year: 2022 PMID: 35911778 PMCID: PMC9334875 DOI: 10.3389/fimmu.2022.858057
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 8.786
Figure 1(A) Schematic illustration of the TCR repertoire analysis pipeline. 1) First, T cells in samples, typically collected from peripheral blood, are processed to extract the DNA or RNA encoding their TCRs. 2,3) PCR is conducted to amplify the signal. 4,5) The amplified DNA or cDNA is then sequenced by NGS to obtain TCR sequences. 6,7) Finally, these sequences are mapped to the reference genes by the software pipelines introduced in the main text and analyzed further. (B) A typical experimental flow for applying ML methods to repertoire datasets. 1-3) Samples are collected from multiple groups of donors who have different immunological and physiological conditions. 4,5) Through the pipeline illustrated in (A), a dataset is obtained for each sample, typically in the format of a table or matrix. 6) Datasets are encoded into ML-friendly formats (feature vectors) using feature extraction methods. In bioinformatics, it is common to analyze gene expression matrices, which summarize the expression level of each gene for each sample. In repertoire analysis, each sample yields a matrix, each row of which represents the sequence of one TCR, its observation count, its gene usage, and other properties of the TCR. Note that typically 10^4 to 10^5 different sequences are observed per sample and that only a limited number of overlapping sequences are usually detected among samples. Therefore, a relatively large sparse matrix must be handled for repertoire analysis. 7) ML algorithms are applied to the encoded datasets.
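Step 6 of this flow, encoding per-sample clonotype tables into a shared feature space, can be sketched in a few lines. This is an illustrative toy, not code from any reviewed pipeline; the donor names and CDR3 amino-acid sequences are invented for the example.

```python
from collections import Counter

# Toy per-sample repertoires: clonotype (CDR3 sequence) -> observation count.
# Real samples contain 10^4 to 10^5 distinct sequences with little overlap.
samples = {
    "donor_A": Counter({"CASSLGQYF": 120, "CASSPGTDTQYF": 5, "CASRDRNEQFF": 2}),
    "donor_B": Counter({"CASSLGQYF": 3, "CASSIRSSYEQYF": 80}),
}

# The union of all observed clonotypes defines the feature axis shared by
# every sample; because overlap between samples is small, this axis grows
# quickly and the resulting matrix is sparse.
clonotypes = sorted(set().union(*samples.values()))

# Encode each sample as a count vector over the shared clonotype axis;
# clonotypes not observed in a sample contribute zeros.
matrix = {
    name: [counts.get(c, 0) for c in clonotypes]
    for name, counts in samples.items()
}
```

In practice a sparse-matrix representation (rather than dense lists) would be used, since most entries are zero.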
Figure 2Challenges in TCR repertoire analysis. (A) Only limited observations are possible compared with the massive diversity of TCRs in a body or that of possible TCR sequences. (B) Various factors alter the repertoire, which results in large individual differences. (C) As we cannot observe all the antigens that a TCR recognizes, we cannot directly evaluate the similarity between TCRs with different sequences. (D) Experimental procedures including PCR and NGS inevitably introduce errors and batch effects.
Figure 3Graphical summary of the development in TCR analysis methods. Early analysis was based on the TCR clonotype abundance (frequency) distribution (left panel). Recently, sequence information has started to be utilized in various ways (right panel) by employing statistical, ML, and DL methods.
Qualitative comparison of the methods reviewed in this article. In practice, a feature encoding method is combined with an ML algorithm for a specific task such as classification or regression. As the choice of ML algorithm is usually arbitrary, this table is organized from the viewpoint of feature extraction.
| Category | Method | Core Idea | TCR-level encoding | Repertoire-level encoding | ML methods combined with | Strength | Weakness | Notable Examples | Relationship with other methods |
|---|---|---|---|---|---|---|---|---|---|
| Distribution-based models | Statistics (diversity) | TCR diversity is related to the health and abnormality of immunological states. Diversity indices, such as a rarity-weighted count of TCR clonotypes, can be used as basic parameters of the immunological state. | NA | A diversity index (a scalar value) | NA | Applicable to data with small sample size and/or a small number of sequences. | Too simple; ignores sequence information. | Greiff et al. | NA |
| Distribution-based models | Distribution shape | The distribution of clonotype frequencies is used to analyze the structure of clonotype diversity. By fitting the sample distribution with probabilistic models, characteristic parameters of the distribution are estimated. | NA | Model parameters of distributions | Probabilistic model | Applicable to data with small sample size and/or a small number of sequences. Flexibility of modeling. | Arbitrariness of modeling; ignores sequence information. | Guidani et al. | NA |
| Sequence-information-based methods | Hypothesis test | TCRs shared among the samples of one condition, compared with others, may be correlated with that condition. Such TCRs can be identified by hypothesis tests. | Significance of presence or absence of specific TCRs in a condition | A Boolean vector of the existence of the condition-related TCRs found by the hypothesis tests | Various classifiers | Each TCR can be characterized by its relatedness to the conditions. | Ignores most of the sequences. | Emerson et al. | To include similarity, hypothesis tests are combined with dissimilarity-based methods (e.g., Glanville et al.) |
| Sequence-information-based methods | Dissimilarity | Similar TCRs may play a similar role in the body. Distances between TCRs can be used to detect and cluster similar TCRs. | Relative distance from other sequences. Manifold learning is sometimes used to calculate an absolute position of the TCR in the latent space. | Density distribution on the latent space | Clustering algorithms and manifold learning | Utilizes all sequences to characterize samples. Each TCR is characterized by its relative distance from the other TCRs. | Computational cost of pairwise alignment. | Dash et al. | NA |
| Sequence-information-based methods | Motif | Local patterns such as k-mer motifs in a TCR may be related to its function. Encoding TCRs by a vector of local features may be a good representation of TCRs. | Bag of k-mers. The Atchley vector is also used to encode the TCR into a denser vector. | Bag of k-mers or aggregation of TCR-level encodings | Various classifiers/regressors | Utilizes all sequences to characterize samples. Applicable to data with small sample size and/or a small number of sequences. Each TCR is directly characterized as a feature vector. | Low flexibility in modeling. | Sun et al. | Motifs are sometimes used for calculating dissimilarity (e.g., Mayer-Blackwell et al.) |
| Sequence-information-based methods | Generative models | The mechanisms of generation and selection of TCRs are the determinants of the TCR repertoire. Modeling them provides information beyond the observed repertoires. | NA | Model parameters of the generative models | Probabilistic model and manifold learning | Utilizes all sequences to characterize samples. Applicable to data with small sample size and/or a small number of sequences. Generation of pseudo-data (for simulation, data augmentation, etc.). | Validity of the assumptions in the models. | Murugan et al. | NA |
| Sequence-information-based methods | Deep learning (DL) | Good representations of repertoires may be obtained by deep learning and may improve the performance of various repertoire analyses. | Various encodings based on VAEs or language models (see embedding methods) | Inferred parameters of DL-based models | Generative models and embedding methods | High flexibility in modeling. High performance if a sufficient amount of data is provided. | Models are not explainable and are data-expensive. | Davidsen et al. | Embedding methods are closely related to DL. |
| Sequence-information-based methods | Embedding methods | Because TCR sequences are a collection of strings, encoding TCRs into fixed-length dense vectors using NLP may lead to efficient algorithms. | Sentence embedding | NA | Various algorithms, incl. DL | High flexibility in modeling. Applicable to data with small sample size and/or a small number of sequences (after pre-training). | Models are not explainable. | Cheng et al. | NA |
NA stands for not applicable.