Jared Ostmeyer, Elena Lucas, Scott Christley, Jayanthi Lea, Nancy Monson, Jasmin Tiro, Lindsay G Cowell.
Abstract
We previously showed, in a pilot study with publicly available data, that T cell receptor (TCR) repertoires from tumor infiltrating lymphocytes (TILs) could be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions. We hypothesized that such motifs might allow development of a novel approach to cancer detection. The motifs were cancer specific and achieved high classification accuracy: we found distinct motifs for breast versus colorectal cancer-associated repertoires, and the colorectal cancer motif achieved 93% accuracy, while the breast cancer motif achieved 94% accuracy. In the current study, we sought to determine whether such motifs exist for ovarian cancer, a cancer type for which detection methods are urgently needed. We made two significant advances over the prior work. First, the prior study used patient-matched TILs and healthy repertoires, collecting healthy tissue adjacent to the tumors. The current study collected TILs from patients with high-grade serous ovarian carcinoma (HGSOC) and healthy ovary repertoires from cancer-free women undergoing hysterectomy/salpingo-oophorectomy for benign disease. Thus, the classification task is distinguishing women with cancer from women without cancer. Second, in the prior study, classification accuracy was measured by patient-hold-out cross-validation on the training data. In the current study, classification accuracy was additionally assessed on an independent cohort not used during model development to establish the generalizability of the motif to unseen data. Classification accuracy was 95% by patient-hold-out cross-validation on the training set and 80% when the model was applied to the blinded test set. 
The results on the blinded test set demonstrate a biophysicochemical TCR motif found overwhelmingly in women with HGSOC but rarely in women with healthy ovaries, strengthening the proposal that cancer detection approaches might benefit from incorporation of TCR motif-based biomarkers. Furthermore, these results call for studies on large cohorts to establish higher classification accuracies, as well as for studies in other cancer types.
Year: 2020 PMID: 32134923 PMCID: PMC7058380 DOI: 10.1371/journal.pone.0229569
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Patient characteristics.
Age, stage, patient diagnosis, and the number of unique TCRB sequences for each sample in the training and validation cohorts.
| Age | FIGO Stage | Diagnosis | Unique TCRBs | ||
|---|---|---|---|---|---|
| 52 | IVB | High-grade serous carcinoma | 8353 | ||
| 55 | IIIC | High-grade serous carcinoma | 1343 | ||
| 58 | IVB | High-grade serous carcinoma | 3249 | ||
| 52 | IIIC | High-grade serous carcinoma | 2692 | ||
| 50 | IIIC | High-grade serous carcinoma with endometrioid component | 719 | ||
| 53 | IIIC | High-grade serous carcinoma | 2225 | ||
| 53 | IIIC | High-grade serous carcinoma | 7363 | ||
| 55 | IIIC | High-grade serous carcinoma | 1667 | ||
| 59 | IIIC | High-grade serous carcinoma | 190 | ||
| 52 | IIIC | High-grade serous carcinoma | 227 | ||
| 52 | - | Cervix with LSIL | 695 | ||
| 51 | - | Cervix with LSIL; uterus with LM, AM | 603 | ||
| 55 | - | Uterus with LM, AM | 1870 | ||
| 55 | - | Uterus with LM, AM | 780 | ||
| 53 | - | Uterus with DPE, LM, AM | 3788 | ||
| 51 | - | Uterus with LM, AM; contralateral ovary with EM | 2896 | ||
| 58 | - | Contralateral ovary with MCT | 1101 | ||
| 55 | - | Uterus with LM, AM | 337 | ||
| 51 | - | Uterus with AM | 1409 | ||
| 52 | - | Uterus with LM, AM | 423 | ||
| 51 | IIIC | High-grade serous carcinoma | 467 | ||
| 56 | IVB | High-grade serous carcinoma | 1562 | ||
| 57 | IIIA1(i) | High-grade serous carcinoma | 572 | ||
| 57 | IIIC | High-grade serous carcinoma | 2414 | ||
| 51 | IIIC | High-grade serous carcinoma | 1134 | ||
| 56 | IIA | High-grade serous carcinoma | 1036 | ||
| 55 | IIIC | High-grade serous carcinoma | 2532 | ||
| 54 | IIB | High-grade serous carcinoma | 398 | ||
| 57 | IIIC | High-grade serous carcinoma | 2287 | ||
| 51 | IIIC | High-grade serous carcinoma | 332 | ||
| 51 | - | Uterus with LM | 803 | ||
| 50 | - | Uterus with LM, AM | 1290 | ||
| 50 | - | Uterus with LM | 1285 | ||
| 50 | - | Uterus with LM | 807 | ||
| 55 | - | Uterus with LM | 439 | ||
| 52 | - | Uterus with LM, AM | 685 | ||
| 53 | - | Uterus with LM | 1708 | ||
| 50 | - | Uterus with LM | 152 | ||
| 50 | - | Uterus with LM | 1405 | ||
| 57 | - | Uterus with LM | 202 |
LM: leiomyoma; AM: adenomyosis; DPE: disordered proliferative endometrium; MCT: mature cystic teratoma; EM: endometriosis; LSIL: low-grade squamous intraepithelial lesion.
Fig 1. Study overview.
(a) Ovarian samples are collected from patients with and without HGSOC. High-throughput immune receptor sequencing reveals the TCRβ CDR3 sequences found in each tissue sample. (b) The CDR3 sequences are cut into motifs. In this example, a motif is assembled from three amino acid residues. At most one residue from the CDR3 may be skipped, allowing for a single gap. Otherwise, the amino acid residues are contiguous neighbors. (c) Each amino acid residue is converted into a set of five chemical features using Atchley factors, for a total of fifteen features describing the motif. The relative abundance of each motif is included as an additional sixteenth feature. (d) Each feature is multiplied by a weight (β1 through β16) that determines its relative importance, and a bias value (β0) is added to calculate a logit. The logit can be converted into a probability value for that motif. (e) The weights and bias value are chosen so that at least one motif in each HGSOC sample has a probability value close to 1 and all motifs in each healthy ovary sample have a probability close to 0.
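Panels (b) through (e) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the single gap is interpreted as one skipped interior residue within a four-residue window, the sample-level score is taken as the maximum motif probability, and `factors` is a lookup table that the reader would fill with the published Atchley factor values (none are hard-coded here).

```python
import math
from itertools import combinations


def extract_motifs(cdr3, size=3, max_gaps=1):
    """Enumerate motifs of `size` residues, skipping at most `max_gaps` residues.

    Assumption: a skipped residue is always interior to its window, so the
    first and last residues of each window are kept.
    """
    motifs = []
    for gaps in range(max_gaps + 1):
        span = size + gaps  # window length within the CDR3
        for start in range(len(cdr3) - span + 1):
            window = cdr3[start:start + span]
            for skipped in combinations(range(1, span - 1), gaps):
                motifs.append("".join(
                    aa for i, aa in enumerate(window) if i not in skipped))
    return motifs


def motif_features(motif, factors, abundance):
    """Five factors per residue (15 values) plus relative abundance (16th)."""
    features = []
    for residue in motif:
        features.extend(factors[residue])
    features.append(abundance)
    return features


def motif_probability(features, weights, bias):
    """Logit = bias + sum(weight * feature), mapped through a sigmoid."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # numerically stable branch for negative logits
    return e / (1.0 + e)


def sample_score(motif_abundance, factors, weights, bias):
    """Assumed sample-level score: the highest probability among its motifs."""
    return max(
        motif_probability(motif_features(m, factors, a), weights, bias)
        for m, a in motif_abundance.items())
```

With these definitions, `extract_motifs("CASSL")` returns seven motifs (three contiguous and four gapped, including repeats), and a sample is called positive when at least one of its motifs scores close to 1.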
Different model configurations evaluated on Cohort I.
Each row represents a different model, and the columns describe the configuration of each model. The first row (bold font) corresponds to the model configuration with the best performance for the breast and colorectal cancer datasets [25]. The second row (bold underlined font) corresponds to the best performing model configuration presented here. The first column indicates the number of amino acid residues in the motif. The second column indicates the number of CDR3 amino acid residues that could be skipped when assembling a motif. For example, if the value is 2, then 2 CDR3 amino acid residues could be skipped. The third column indicates whether binary indicators marking the positions of skipped CDR3 residues were included as features. For example, if a CDR3 residue was skipped but would have been in the third position of a motif if it had been included, then the 3rd indicator would have a value of 1. The fourth column indicates the motif position, if any, at which a CDR3 amino acid was skipped. The fifth column indicates whether the expected frequency of the motif in blood was included as a feature. The expected frequency was estimated using publicly available data from 786 presumed healthy individuals [31]. The sixth column indicates whether the log of the motif relative abundance was used for the relative abundance term. Column 7 indicates whether each feature is squared and used as an additional feature, resulting in 2nd order terms in the model. Column 8 indicates whether batch normalization was used. Column 9 (fourth from last) is the log-loss averaged across the one-holdout cross-validations. Column 10 (third from last) is the accuracy computed over the one-holdout cross-validations. Column 11 (second from last) is the number of gradient steps used to fit the model, as determined by early stopping. Column 12 (last) is the number of fits to the training data; the best of these fits is applied to the holdout sample.
| FEATURES | CROSS-VAL LOG-LOSS | CROSS-VAL ACCURACY | EARLY STOPPING | NUM FITS TO TRAIN | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Motif Size | # of Gap Positions | One-Hot Indicator of Gap Position | Restricting Gap to Position X | Expected Frequency in Blood | Log Frequency Instead | 2nd Order Terms | Batch Norm. | ||||
| 4 | 0 | √ | √ | 0.680 | 75% | 9 | 131072 | ||||
| 3 | 1 | √ | 0.400 | 95% | 1687 | 131072 | |||||
| 4 | 0 | √ | 0.887 | 65% | 1506 | 524288 | |||||
| 3 | 1 | √ | √ | 0.477 | 90% | 1411 | 786432 | ||||
| 3 | 2 | √ | √ | 0.963 | 65% | 467 | 131072 | ||||
| 4 | 1 | √ | √ | 1.004 | 55% | 692 | 65536 | ||||
| 4 | 2 | √ | √ | 0.639 | 80% | 3222 | 131072 | ||||
| 3 | 0 | 1.083 | 50% | 3 | 131072 | ||||||
| 4 | 0 | 1.037 | 50% | 4 | 131072 | ||||||
| 3 | 1 | x = 1 | 1.043 | 50% | 1 | 131072 | |||||
| 3 | 1 | x = 2 | 1.089 | 50% | 1 | 131072 | |||||
| 3 | 1 | x = 3 | 1.072 | 50% | 4 | 131072 | |||||
| 3 | 1 | √ | 0.378 | 90% | 2499 | 786432 | |||||
| 3 | 2 | √ | 1.016 | 75% | 1145 | 131072 | |||||
| 4 | 0 | √ | √ | 1.083 | 50% | 5 | 131072 | ||||
| 3 | 1 | √ | √ | √ | 1.049 | 50% | 5 | 131072 | |||
| 3 | 0 | √ | 0.823 | 85% | 1036 | 131072 | |||||
| 4 | 0 | √ | 1.108 | 50% | 1 | 131072 | |||||
| 4 | 3 | √ | 0.447 | 85% | 2499 | 131072 | |||||
To ensure each model can run in a reasonable amount of time, only the 65,536 most abundant motifs in a biopsy are used.
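The abundance cutoff described above could be applied as in the following sketch. The function name `top_motifs` is hypothetical, and renormalizing the relative abundance over the retained motifs only (rather than over all motifs in the biopsy) is an assumption; the source does not say which denominator was used.

```python
from collections import Counter


def top_motifs(motif_counts, limit=65536):
    """Keep the `limit` most abundant motifs and return relative abundances.

    Assumption: abundances are renormalized over the retained motifs.
    """
    kept = Counter(motif_counts).most_common(limit)
    total = sum(count for _, count in kept)
    return {motif: count / total for motif, count in kept}
```

For example, `top_motifs({"CAS": 6, "ASS": 3, "SSL": 1}, limit=2)` keeps only `CAS` and `ASS`.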
Fig 2. Workflow for model selection and parameter fitting.
(a) Left Panel: The diagram shows how Cohort I is used to train and evaluate each model. Model performance is evaluated by an exhaustive 1-holdout cross-validation using only Cohort I. (b) Right Panel: The diagram shows how the best performing model is evaluated with unseen test data. The best performing model is refitted to all the samples in Cohort I, and then used to score the test samples from Cohort II. The same random initial β-coefficients are reused from (a) when refitting the best performing model in (b).
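The two-stage workflow can be sketched as follows. This is an illustration only: `fit` and `predict` are hypothetical callbacks standing in for the motif model, and the midpoint classifier is a toy stand-in so that the example runs. Reusing the same initial β-coefficients between the two stages is not modeled here.

```python
def leave_one_out(samples, labels, fit, predict):
    """Score each sample with a model fitted to all the other samples."""
    scores = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        scores.append(predict(fit(train_x, train_y), samples[i]))
    return scores


def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(labels)


def refit_and_test(train_x, train_y, test_x, fit, predict):
    """Refit on the whole training cohort, then score the unseen cohort."""
    model = fit(train_x, train_y)
    return [predict(model, x) for x in test_x]


# Toy stand-ins (not the motif model): classify a number by comparing it
# with the midpoint between the two class means.
def fit_midpoint(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(midpoint, x):
    return 1.0 if x > midpoint else 0.0
```

Model selection uses only the scores from `leave_one_out`; `refit_and_test` is called once, on the selected configuration, so the second cohort stays unseen until the final evaluation.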
Fig 3. Results.
(a) Classification results obtained by leave-one-out cross-validation for each patient in Cohort I. (b) Illustration of the classifier weights averaged across all 20 cross-validation runs (error bars for the standard deviation are omitted because the range was too small to plot relative to the size of each arrow). For each of the five Atchley factors, the weights are shown for the three residue positions. The weight for the log-frequency of the receptor is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. (c) All motifs with a score above 0.5 (middle column) are shown for the 20 patient samples. Each motif is shown in the context of its respective CDR3. The leftmost column indicates the patient and the rightmost column indicates the number of times the motif is observed in the sample. (d) Classification results obtained on Cohort II test samples. (e) The ROC curve shows true and false positive rates for different thresholds of a positive diagnosis based on the model applied to Cohort II. The area under the curve is 0.79. (f) All motifs with a score above 0.5 (middle column) are shown for the 20 patient samples in Cohort II. Each motif is shown in the context of its respective CDR3. The leftmost column indicates the patient and the rightmost column indicates the number of times the motif is observed in the sample.
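For readers unfamiliar with the area under the ROC curve reported in panel (e), the sketch below computes it as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is hypothetical and the per-sample scores in the usage note are invented for illustration; the study's scores are not reproduced here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

As a toy check, `roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])` gives 0.75, because three of the four positive-negative pairs are ranked correctly.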
Permutation results.
Each row corresponds to a single permutation of the Cohort I data set, indicated in column 1. The second column shows the loss averaged over all patient-hold-out cross-validations. The third column shows the classification accuracy over all patient-hold-out cross-validations. The fourth column shows the fitting step, out of 2500, at which the lowest average loss was observed.
| Run | Average Loss | Classification Accuracy | Early Stopping Step |
|---|---|---|---|
| 1 | 0.972 | 55% | 132 |
| 2 | 1.07 | 50% | 3 |
| 3 | 1.021 | 50% | 3 |
| 4 | 1.06 | 50% | 1 |
| 5 | 1.055 | 50% | 5 |
| 6 | 1.038 | 60% | 471 |
| 7 | 0.77 | 85% | 503 |
| 8 | 1.03 | 50% | 209 |
| 9 | 0.964 | 65% | 295 |
| 10 | 1.041 | 30% | 245 |
| 11 | 1.011 | 50% | 73 |
| 12 | 1.008 | 55% | 158 |
| 13 | 1.076 | 50% | 4 |
| 14 | 0.63 | 85% | 2497 |
| 15 | 1.012 | 50% | 34 |
| 16 | 1.042 | 50% | 4 |
| 17 | 1.044 | 50% | 4 |
| 18 | 1.043 | 50% | 10 |
| 19 | 1.089 | 50% | 44 |
| 20 | 0.891 | 70% | 1005 |
| Mean | 0.99335 | 55% | |
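The permutation table can be summarized with a short script. The values below are copied from the table, and 95% is the cross-validation accuracy reported for the unpermuted Cohort I data. The helper names and the add-one form of the empirical p-value are assumptions made for illustration; the source does not state that a p-value was computed this way.

```python
import random

# Values copied from the permutation table (runs 1 to 20).
PERMUTED_LOSS = [0.972, 1.07, 1.021, 1.06, 1.055, 1.038, 0.77, 1.03, 0.964,
                 1.041, 1.011, 1.008, 1.076, 0.63, 1.012, 1.042, 1.044, 1.043,
                 1.089, 0.891]
PERMUTED_ACCURACY = [55, 50, 50, 50, 50, 60, 85, 50, 65, 30,
                     50, 55, 50, 85, 50, 50, 50, 50, 50, 70]


def permute_labels(labels, seed):
    """Return a shuffled copy of the diagnosis labels (one permutation)."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def empirical_p_value(observed, permuted):
    """Share of permutations at least as good as the observed result.

    Assumption: the add-one correction, so the estimate is never zero.
    """
    at_least = sum(1 for value in permuted if value >= observed)
    return (at_least + 1) / (len(permuted) + 1)


mean_loss = sum(PERMUTED_LOSS) / len(PERMUTED_LOSS)
mean_accuracy = sum(PERMUTED_ACCURACY) / len(PERMUTED_ACCURACY)
p_value = empirical_p_value(95, PERMUTED_ACCURACY)
```

None of the 20 permutations reaches the observed 95% accuracy (the best reach 85%), so under this convention the empirical p-value is 1/21, and the mean loss reproduces the 0.99335 shown in the last row of the table.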