Ying Xu, Xinyang Qian, Xuanping Zhang, Xin Lai, Yuqian Liu, Jiayin Wang.
Abstract
Recent studies highlight the potential of T cell receptor (TCR) repertoires for accurately detecting cancers via noninvasive sampling. However, because the associations between cancer antigens and the T cell responses they may induce are complicated, the practical strategy for identifying cancer-associated TCRs is currently computational prediction based on TCR repertoire data. Several state-of-the-art methods have been proposed in the past year or two; however, their prediction algorithms are still weakened by two major issues. First, to simplify computation, the algorithms usually decompose the original TCR sequences into fixed-length amino acid fragments, even though the lengths of cancer-associated motifs are suggested to vary. Second, the correlations among TCRs in the same repertoire, which existing methods often ignore, should be taken into account. Here we developed a deep multi-instance learning method, named DeepLION, to improve the prediction of cancer-associated TCRs by addressing these issues. First, DeepLION introduced a deep learning framework with convolution filters of different sizes and 1-max pooling operations to handle amino acid fragments of different lengths. Then, the multi-instance learning framework modeled the TCR correlations and assigned adjusted weights to each TCR sequence during prediction. To validate the performance of DeepLION, we conducted a series of experiments on several cohorts of patients from nine cancer types. Compared with the existing methods, DeepLION achieved higher prediction accuracies, sensitivities, specificities, and areas under the curve (AUCs) on most of the cohorts; notably, the AUC reached 0.97 and 0.90 for the thyroid and lung cancer cohorts, respectively. Thus, DeepLION may further support the detection of cancers from TCR repertoire data.
DeepLION is publicly available on GitHub at https://github.com/Bioinformatics7181/DeepLION for academic use only.
Keywords: T cell receptor; TCR repertoire data analysis; cancer-associated TCR; deep learning framework; machine learning approach; multi-instance learning
Year: 2022 PMID: 35601486 PMCID: PMC9121378 DOI: 10.3389/fgene.2022.860510
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. DeepLION for accurate TCR repertoire prediction. (A) The workflow of DeepLION is divided into three parts: data preprocessing, the CNN for TCRs, and MIL. During data preprocessing, unqualified sequences were removed, the top k most abundant TCR sequences were extracted from each repertoire, and these sequences were encoded into matrices using the Beshnova matrix. The CNN for TCRs consisted of 14 convolution filters covering six different region sizes, 1-max pooling operations, and a one-layer linear classifier L. The TCR matrices were input to the CNN, which output a score for each TCR. In the MIL part, DeepLION employed another one-layer linear classifier L′ to aggregate the k TCR scores into a prediction for the repertoire. (B) The details of the convolution and pooling operations of the CNN in DeepLION. When a 2 × d convolution filter (the red box) performed a complete convolution operation on the TCR matrix from top to bottom, it could be regarded as extracting the biochemical features of the 2-mers such as "CA", "AS", etc., which generated a 10 × 1 feature map, a feature set of all 2-mers. The other filters performed similar convolution operations, giving 14 feature maps in total. The maximum value of each map (marked with a blue box) was selected by a 1-max pooling operation and could be viewed as the feature of the z-mer most likely to be the cancer-specific motif. These features were concatenated to generate a 14 × 1 TCR feature vector.
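The pipeline described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the weights are random stand-ins for trained parameters, the encoding width `D = 15`, the repertoire size `K = 5`, and the sigmoid on the repertoire output are assumptions, and the TCR matrices are random placeholders rather than Beshnova-encoded sequences. Only the filter layout (14 filters over six region sizes), the 1-max pooling, and the two linear classifiers come from the caption.

```python
import numpy as np

# Filter design from the caption/table: region size -> number of filters (14 total).
FILTERS_PER_SIZE = {2: 3, 3: 3, 4: 3, 5: 2, 6: 2, 7: 1}

rng = np.random.default_rng(0)
D = 15  # ASSUMED encoding width; the real value is set by the Beshnova matrix


def make_filters(d):
    """Create random h x d filters (stand-ins for trained weights)."""
    return [rng.normal(scale=0.1, size=(h, d))
            for h, n in FILTERS_PER_SIZE.items() for _ in range(n)]


def feature_map(tcr, filt):
    """Slide an h x d filter down the L x d TCR matrix; yields L - h + 1 values."""
    h = filt.shape[0]
    return np.array([np.sum(tcr[i:i + h] * filt)
                     for i in range(tcr.shape[0] - h + 1)])


def tcr_features(tcr, filters):
    """1-max pooling: keep the largest activation of every feature map."""
    return np.array([feature_map(tcr, f).max() for f in filters])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def repertoire_probability(tcrs, filters, w_L, b_L, w_mil, b_mil):
    """Score each TCR with classifier L, then aggregate the k scores with L'."""
    scores = np.array([tcr_features(t, filters) @ w_L + b_L for t in tcrs])
    return sigmoid(scores @ w_mil + b_mil)  # ASSUMED sigmoid output


K = 5  # hypothetical number of top TCRs kept per repertoire
filters = make_filters(D)
tcrs = [rng.normal(size=(11, D)) for _ in range(K)]  # placeholder 11-residue TCRs
w_L, b_L = rng.normal(scale=0.1, size=len(filters)), 0.0
w_mil, b_mil = rng.normal(scale=0.1, size=K), 0.0
prob = repertoire_probability(tcrs, filters, w_L, b_L, w_mil, b_mil)
```

With an 11-row input, a 2 × d filter produces a 10 × 1 feature map, which matches the dimensions quoted in panel (B).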
The situation of continuous CDR3 residue regions and convolution filter design.
| Size of region | Number of region | Frequency of region | Size of filter | Number of filter |
|---|---|---|---|---|
| 2 | 12 | 0.207 | 2 × d | 3 |
| 3 | 12 | 0.207 | 3 × d | 3 |
| 4 | 13 | 0.224 | 4 × d | 3 |
| 5 | 8 | 0.138 | 5 × d | 2 |
| 6 | 7 | 0.121 | 6 × d | 2 |
| 7 | 5 | 0.086 | 7 × d | 1 |
| 8 | 1 | 0.017 | — | — |
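As a quick check of the table above, the frequency column can be reproduced by dividing each region count by the total number of regions, and the filter counts can be summed to confirm the 14 filters mentioned in the Figure 1 caption. The variable names are illustrative only; the rule the authors used to allocate filters to sizes is not stated in this record, so the sketch does not attempt to derive it.

```python
# Region size -> number of continuous CDR3 residue regions (from the table).
region_counts = {2: 12, 3: 12, 4: 13, 5: 8, 6: 7, 7: 5, 8: 1}
# Region size -> number of convolution filters (size 8 has no filter).
filter_counts = {2: 3, 3: 3, 4: 3, 5: 2, 6: 2, 7: 1}

total_regions = sum(region_counts.values())
frequencies = {size: round(n / total_regions, 3)
               for size, n in region_counts.items()}
total_filters = sum(filter_counts.values())

# Share of observed regions whose size is covered by at least one filter.
covered = sum(region_counts[s] for s in filter_counts) / total_regions
```

The six filter sizes cover 57 of the 58 observed regions; only the single region of size 8 has no matching filter.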
The specifics of the datasets.
| Source | Disease | Cell type | Data type | Sample size |
|---|---|---|---|---|
| IA | Melanoma | PBMCs | TCR-seq | 21 |
| IA | BRCA | PBMCs | TCR-seq | 16 |
| IA | Ovarian | PBMCs | TCR-seq | 4 |
| IA | Pancreatic | PBMCs | TCR-seq | 7 |
| IA | Bladder | PBMCs | TCR-seq | 30 |
| IA | GBM | PBMCs | TCR-seq | 15 |
| IA | Lung | PBMCs | TCR-seq | 29 |
| IA | CRC | PBMCs | TCR-seq | 3 |
| IA | Non-cancer | PBMCs | TCR-seq | 786 |
| Geneplus | THCA | PBMCs and TILs | TCR-seq | 170 |
| Geneplus | Lung | PBMCs and TILs | TCR-seq | 184 |
| Geneplus | Non-cancer | PBMCs | TCR-seq | 260 |
IA, Adaptive Biotechnologies immuneACCESS online database; BRCA, breast cancer; GBM, glioblastoma multiforme; CRC, colorectal cancer; THCA, thyroid cancer; PBMCs, peripheral blood mononuclear cells; TILs, tumor-infiltrating T lymphocytes; TCR-seq, T cell receptor-sequencing.
The training and test data in experiments.
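| Section | Data type | Data source | Disease | Cell type | Sample size |
|---|---|---|---|---|---|
| 3.2 | Training data | TCGA | Multiple cancers | — | 30,000 |
| 3.2 | Training data | IA | Non-cancer (H2) | — | ∼60,000 |
| 3.2 | Test data | IA (T1) | Melanoma | PBMCs | 21 |
| 3.2 | Test data | IA (T1) | BRCA | PBMCs | 16 |
| 3.2 | Test data | IA (T1) | Ovarian | PBMCs | 4 |
| 3.2 | Test data | IA (T1) | Pancreatic | PBMCs | 7 |
| 3.2 | Test data | IA (T1) | Bladder | PBMCs | 30 |
| 3.2 | Test data | IA (T1) | GBM | PBMCs | 15 |
| 3.2 | Test data | IA (T1) | Lung | PBMCs | 29 |
| 3.2 | Test data | IA (T1) | CRC | PBMCs | 3 |
| 3.2 | Test data | IA (T1) | Non-cancer (H1) | PBMCs | 666 |
| 3.2 | Test data | Geneplus (T2) | THCA | PBMCs and TILs | 170 |
| 3.2 | Test data | Geneplus (T2) | Lung | PBMCs and TILs | 184 |
| 3.2 | Test data | Geneplus (T2) | Non-cancer (H3) | PBMCs | 260 |
| 3.3 | Training and test data | Geneplus (T2) | THCA | PBMCs and TILs | 170 |
| 3.3 | Training and test data | Geneplus (T2) | Lung | PBMCs and TILs | 184 |
| 3.3 | Training and test data | Geneplus (T2) | Non-cancer (H3) | PBMCs | 260 |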
TCGA, The Cancer Genome Atlas; IA, Adaptive Biotechnologies immuneACCESS online database; BRCA, breast cancer; GBM, glioblastoma multiforme; CRC, colorectal cancer; THCA, thyroid cancer; PBMCs, peripheral blood mononuclear cells; TILs, tumor-infiltrating T lymphocytes.
The performance of models on different cancer samples.
| Dataset | Cancer (n) | SEN, MCAT | SEN, MCNN | AUC, MCAT | AUC, MCNN |
|---|---|---|---|---|---|
| T1 | Melanoma (21) | 0.762 | 0.762 | 0.912 | 0.900 |
| T1 | BRCA (16) | 0.438 | 0.750 | 0.854 | 0.892 |
| T1 | Ovarian (4) | 1 | 1 | 0.988 | 0.989 |
| T1 | Pancreatic (7) | 0.714 | 1 | 0.945 | 0.962 |
| T1 | Bladder (30) | 0.733 | 0.767 | 0.881 | 0.913 |
| T1 | GBM (15) | 0.133 | 0.133 | 0.665 | 0.690 |
| T1 | Lung (29) | 0.241 | 0.310 | 0.535 | 0.663 |
| T1 | CRC (3) | 1 | 1 | 1 | 0.995 |
| T2 | THCA (170) | 0.353 | 0.453 | 0.692 | 0.724 |
| T2 | Lung (184) | 0.326 | 0.473 | 0.736 | 0.753 |
SEN, sensitivity; AUC, area under the receiver operating characteristic curve; BRCA, breast cancer; GBM, glioblastoma multiforme; CRC, colorectal cancer; THCA, thyroid cancer.
Each group of samples was a mix of cancer and control samples (n = 666 for groups of T1 and n = 260 for groups of T2).
The thresholds of the two models were set to give a specificity of 0.9 (MCAT: 0.277 for T1 and 0.351 for T2; MCNN: 0.392 for T1 and 0.423 for T2).
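Fixing the threshold at a target specificity, as described in the footnote above, can be sketched as follows. This is an illustrative procedure and not necessarily the one the authors used: it assumes a sample is called cancer when its score is at least the threshold, and the scores below are made-up values, not data from the paper.

```python
def specificity(control_scores, threshold):
    """Fraction of control samples scored below the threshold."""
    return sum(s < threshold for s in control_scores) / len(control_scores)


def sensitivity(cancer_scores, threshold):
    """Fraction of cancer samples scored at or above the threshold."""
    return sum(s >= threshold for s in cancer_scores) / len(cancer_scores)


def threshold_at_specificity(control_scores, target=0.9):
    """Smallest observed control score that reaches the target specificity."""
    for t in sorted(set(control_scores)):
        if specificity(control_scores, t) >= target:
            return t
    return float("inf")  # no observed score suffices; nothing is called cancer


# Hypothetical scores for illustration only.
controls = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.31, 0.35, 0.60]
cancers = [0.15, 0.45, 0.62, 0.70, 0.90]

cutoff = threshold_at_specificity(controls, target=0.9)
spe = specificity(controls, cutoff)
sen = sensitivity(cancers, cutoff)
```

Because the cutoff is chosen from the control scores alone, the sensitivities reported at that cutoff are comparable across cancer types that share the same control group.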
FIGURE 2. The ROC curves of models on combined cancer samples. (A) The ROC curves on T1. The AUC of MCAT is 0.78, whereas the AUC of MCNN is 0.83. (B) The ROC curves on T2. The AUC of MCAT is 0.71, whereas the AUC of MCNN is 0.73.
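For readers less familiar with the metric, the AUC quoted in the caption equals the probability that a randomly chosen cancer sample receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the scores are illustrative and are not taken from the paper.

```python
def auc(cancer_scores, control_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for c in cancer_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cancer_scores) * len(control_scores))


# Hypothetical scores for illustration only.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4, 0.1])
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of this size; a rank-based computation gives the same value more efficiently.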
The performances of models on THCA and lung cancer samples.
| Cohort | Model | ACC | SEN | SPE | AUC |
|---|---|---|---|---|---|
| THCA (170) | MLOG | 0.651 | 0.444 | 0.800 | 0.600 |
| THCA (170) | MCAT | 0.693 | 0.488 | 0.827 | 0.692 |
| THCA (170) | MCNN | 0.753 | 0.418 | 0.973 | 0.724 |
| THCA (170) | MLION | 0.872 | 0.775 | 0.957 | 0.974 |
| Lung (184) | MLOG | 0.659 | 0.552 | 0.712 | 0.680 |
| Lung (184) | MCAT | 0.698 | 0.625 | 0.750 | 0.736 |
| Lung (184) | MCNN | 0.732 | 0.538 | 0.869 | 0.753 |
| Lung (184) | MLION | 0.818 | 0.730 | 0.882 | 0.899 |
ACC, accuracy; SEN, sensitivity; SPE, specificity; AUC, area under the receiver operating characteristic curve; THCA, thyroid cancer.
Each group of samples was a mix of cancer and control samples (n = 260).
The threshold of both MLOG and MLION was 0.5 for both cohorts, and the thresholds of MCAT and MCNN were set by the Youden index (MCAT: 0.336 for THCA and 0.321 for lung cancer; MCNN: 0.433 for THCA and 0.419 for lung cancer).
FIGURE 3. The ROC curves of models on T2. (A) The ROC curves on THCA samples. The AUCs of MLOG, MCAT, MCNN, and MLION are 0.60, 0.69, 0.72, and 0.97, respectively. (B) The ROC curves on lung cancer samples. The AUCs of MLOG, MCAT, MCNN, and MLION are 0.68, 0.74, 0.75, and 0.90, respectively.
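The Youden index mentioned in the footnote above is J = sensitivity + specificity − 1, and the chosen threshold is the one that maximizes J. A minimal sketch follows, assuming a sample is called cancer when its score is at least the threshold and that ties in J are broken in favor of the lowest threshold; both conventions are assumptions, and the scores are made-up values.

```python
def youden_threshold(cancer_scores, control_scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(cancer_scores) | set(control_scores)):
        sen = sum(s >= t for s in cancer_scores) / len(cancer_scores)
        spe = sum(s < t for s in control_scores) / len(control_scores)
        j = sen + spe - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical scores for illustration only.
best_t, best_j = youden_threshold([0.2, 0.5, 0.6, 0.8], [0.1, 0.3, 0.4, 0.55])
```

Unlike a fixed 0.5 cutoff, this threshold depends on the score distribution of each model, which is why the footnote reports a different value per model and cohort.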
The performances of cross-validations on THCA and lung cancer samples.
| Metric | THCA (170), K-fold | THCA (170), nested | Lung (184), K-fold | Lung (184), nested |
|---|---|---|---|---|
| ACC | 0.843 ± 0.017 | 0.817 ± 0.010 | 0.786 ± 0.020 | 0.741 ± 0.010 |
| SEN | 0.773 ± 0.025 | 0.706 ± 0.021 | 0.697 ± 0.035 | 0.679 ± 0.026 |
| SPE | 0.892 ± 0.035 | 0.910 ± 0.016 | 0.848 ± 0.025 | 0.783 ± 0.019 |
| AUC | 0.925 ± 0.010 | 0.909 ± 0.007 | 0.841 ± 0.014 | 0.806 ± 0.011 |
ACC, accuracy; SEN, sensitivity; SPE, specificity; AUC, area under the receiver operating characteristic curve; THCA, thyroid cancer.
Each group of samples was a mix of cancer and control samples (n = 260).
The results are reported with 95% confidence intervals over all validations (50 validations in total for each cross-validation).
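One common way to obtain a "mean ± half-width" summary like those in the table is a normal-approximation 95% confidence interval over the per-validation scores. This record does not state how the authors computed their intervals, so the formula below (1.96 times the standard error of the mean) is an assumption, and the input values are invented for illustration.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half-width) of a normal-approximation 95% CI."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    return mean, 1.96 * sem


# Hypothetical AUCs from five validation runs (illustration only).
mean_auc, half_width = mean_ci95([0.90, 0.92, 0.94, 0.92, 0.92])
summary = f"{mean_auc:.3f} ± {half_width:.3f}"
```

With 50 validations per setting, the normal approximation is usually reasonable; for very few runs, a Student's t quantile would give a wider and more appropriate interval.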