Biophysicochemical motifs in T cell receptor sequences as a potential biomarker for high-grade serous ovarian carcinoma.

Jared Ostmeyer¹, Elena Lucas², Scott Christley¹, Jayanthi Lea³, Nancy Monson⁴, Jasmin Tiro¹, Lindsay G Cowell⁵.

Abstract

We previously showed, in a pilot study with publicly available data, that T cell receptor (TCR) repertoires from tumor infiltrating lymphocytes (TILs) could be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions. We hypothesized that such motifs might allow development of a novel approach to cancer detection. The motifs were cancer specific and achieved high classification accuracy: we found distinct motifs for breast versus colorectal cancer-associated repertoires, and the colorectal cancer motif achieved 93% accuracy, while the breast cancer motif achieved 94% accuracy. In the current study, we sought to determine whether such motifs exist for ovarian cancer, a cancer type for which detection methods are urgently needed. We made two significant advances over the prior work. First, the prior study used patient-matched TILs and healthy repertoires, collecting healthy tissue adjacent to the tumors. The current study collected TILs from patients with high-grade serous ovarian carcinoma (HGSOC) and healthy ovary repertoires from cancer-free women undergoing hysterectomy/salpingo-oophorectomy for benign disease. Thus, the classification task is distinguishing women with cancer from women without cancer. Second, in the prior study, classification accuracy was measured by patient-hold-out cross-validation on the training data. In the current study, classification accuracy was additionally assessed on an independent cohort not used during model development to establish the generalizability of the motif to unseen data. Classification accuracy was 95% by patient-hold-out cross-validation on the training set and 80% when the model was applied to the blinded test set. 
The results on the blinded test set demonstrate a biophysicochemical TCR motif found overwhelmingly in women with HGSOC but rarely in women with healthy ovaries, strengthening the proposal that cancer detection approaches might benefit from incorporation of TCR motif-based biomarkers. Furthermore, these results call for studies on large cohorts to establish higher classification accuracies, as well as for studies in other cancer types.

Year: 2020 PMID: 32134923 PMCID: PMC7058380 DOI: 10.1371/journal.pone.0229569

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Despite the tremendous genomic heterogeneity between cancers, there is evidence that cancer patients mount T cell responses against antigens they have in common, including tumor antigens. Shared tumor antigens can be generally classified into three categories: (1) self-antigens with dysregulated expression or increased copy numbers, such as MelanA, HER2, SOX2, and NY-ESO-1 [1-5], (2) altered self-antigens, such as recurrent oncogenic mutations, including BRAF V600E and CDK4 R24C [6] and TGF-βRII frameshift mutations [7], and (3) non-self-antigens, i.e., viral epitopes expressed by virus-induced cancers, such as those derived from Human Papilloma Virus [8, 9], Hepatitis B Virus [10], and Epstein Barr Virus [11]. Ovarian cancer is considered rich in the first category of shared tumor antigens, with relatively large percentages of ovarian cancers expressing MAGE-A1, MAGE-A3, NY-ESO-1, and others [12, 13]. In the case of the alpha folate receptor, 97% of ovarian cancers were found to express it, with the vast majority having moderate or strong expression levels, while only 63% of healthy ovaries were found to express it, and in all cases the expression was weak [14]. Evidence for T cell responses against shared tumor antigens comes from studies demonstrating the presence of T cells with binding capacity for, and reactivity to, the shared antigens [1, 3, 4, 15–20]. Indeed, responses against shared tumor antigens may outnumber those against mutated neoantigens, including for highly mutated cancers such as melanoma [21, 22]. In addition to effector T cells responding to tumor antigens, a significant portion of the tumor-infiltrating lymphocyte (TIL) population is expected to be regulatory T cells that are reactive to tissue-restricted self-antigens associated with the organ of cancer origin, as these T cells are highly enriched in cancer lesions [21].
Thus, on balance, we expect much of a TIL population to be composed of T cells with specificity for antigens shared across cancer patients and not present, or present at significantly reduced levels, in cancer-free individuals. We hypothesized that the above-described T cell responses could serve as the basis for cancer early detection biomarkers and sought to develop a method for detecting them that didn’t require knowledge of the target antigens and didn’t rely on the assumption that T cells responding to a common target would express T cell receptors with the same amino acid sequence. Utilizing publicly available TCR deep sequencing data, we applied multiple instance learning (MIL) after converting the TCR amino acid sequences to a biophysicochemical representation using Atchley factors [23-25]. We found that TCR repertoires from breast or colorectal cancer TILs could be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions [25]. The motifs were different between the two cancer types, and both achieved high classification accuracy. The colorectal cancer motif achieved 93% accuracy, while the breast cancer motif achieved 94% accuracy. In the current study, we sought to establish the plausibility of using TCR motifs for ovarian cancer detection and applied our method to locally collected patient samples. We made two significant advances over the prior work. First, the prior study used patient-matched TILs and healthy repertoires, collecting healthy tissue adjacent to the tumors. Thus, the classification task was to distinguish two repertoires that had both been collected from an organ affected by cancer, one repertoire from within the cancerous lesion and one repertoire from a lesion-free region.
The current study collected TILs from patients with high-grade serous ovarian carcinoma (HGSOC) and collected healthy ovary repertoires from cancer-free women undergoing hysterectomy with salpingo-oophorectomy for benign disease. Thus, the current classification task is distinguishing repertoires from women with HGSOC versus repertoires from women with healthy ovaries. The second advance comes from the opportunity to assess the motif on a blinded test data set. In the prior study, only a training data set was available, and classification accuracy was measured by patient-hold-out cross-validation. In the current study, both a training and a test data set were available. Thus, in addition to assessing classification accuracy of the motif by patient-hold-out cross-validation, the ability of the motif to generalize to a new, independent cohort of data not used for motif discovery was assessed. The current study revealed a TCR biophysicochemical motif present overwhelmingly in HGSOC TIL repertoires but rarely in healthy ovary repertoires. The motif is specific to HGSOC, i.e., it is different from the motifs previously identified for colorectal and breast cancer. The classification accuracy assessed by cross-validation on the training data was 95% (19/20). Applying the same model selection and cross-validation procedure to data with permuted labels resulted in an average classification accuracy of 55%, and the accuracies of all 20 permutations were < 95%. Application of the best model to the unseen test set resulted in a classification accuracy of 80% (16/20), indicating that the motif has some capacity to generalize. These results strengthen the proposal that cancer detection approaches might benefit from incorporation of TCR motif-based biomarkers and call for studies assessing the approach on large training and testing data sets and on additional cancer types.

Materials and methods

Datasets

We obtained 40 archived tissue blocks from the Pathology Laboratories of Parkland Health and Hospital System and two university hospitals associated with UT Southwestern Medical Center (St. Paul Hospital and Clements University Hospital): 20 HGSOC specimens and 20 normal ovary specimens. The study was approved by the UT Southwestern Medical Center IRB, study number STU-2018-0239, with a waiver of consent because anonymized, archived, FFPE tissue blocks were used. We divided the blocks into two cohorts, one training cohort (Cohort I) and one test cohort (Cohort II), each with 10 HGSOC specimens and 10 normal ovary specimens (Table 1). In both cohorts, all donors were between 50 and 59 years of age. In the training cohort, eight of the HGSOC samples were stage IIIC, and two were stage IVB. The test cohort was more heterogeneous with respect to the HGSOC stage. Six were stage IIIC; one was stage IVB. The remaining three samples were stages IIA, IIB, and IIIA1(i). Of the 20 control samples, in addition to normal ovarian tissue, 12 had fallopian tube tissue within the block from which our samples were cut. In seven of the remaining eight cases, ovarian sections demonstrated serous (tubal-type) epithelial inclusions. Thus, these controls are representative of the tissue from which ovarian cancer is believed to arise. Tissue curls were sent to Adaptive Biotechnologies for sequencing of the TCR β (TCRB) locus at survey depth. Sequencing was based on genomic DNA, and, for the tissue blocks with fallopian tube tissue or epithelial inclusions, the DNA from the various tissue types was combined for sequencing. We did not quantify the number of TILs present in the tissue by immunohistochemistry prior to sequencing, but the number of unique TCRs in Table 1 estimates the number of T cell clones present in the sequencing sample [26, 27]. The study design is shown in Fig 1A.

Table 1

Patient characteristics.

Age, stage, patient diagnosis, and the number of unique TCRB sequences for each sample in the training and test cohorts.

		Age	FIGO Stage	Diagnosis	Unique TCRBs
Cohort I, Training Cohort	HGSOC Cases	52	IVB	High-grade serous carcinoma	8353
		55	IIIC	High-grade serous carcinoma	1343
		58	IVB	High-grade serous carcinoma	3249
		52	IIIC	High-grade serous carcinoma	2692
		50	IIIC	High-grade serous carcinoma with endometrioid component	719
		53	IIIC	High-grade serous carcinoma	2225
		53	IIIC	High-grade serous carcinoma	7363
		55	IIIC	High-grade serous carcinoma	1667
		59	IIIC	High-grade serous carcinoma	190
		52	IIIC	High-grade serous carcinoma	227
	Normal Ovary Cases	52	-	Cervix with LSIL	695
		51	-	Cervix with LSIL; uterus with LM, AM	603
		55	-	Uterus with LM, AM	1870
		55	-	Uterus with LM, AM	780
		53	-	Uterus with DPE, LM, AM	3788
		51	-	Uterus with LM, AM; contralateral ovary with EM	2896
		58	-	Contralateral ovary with MCT	1101
		55	-	Uterus with LM, AM	337
		51	-	Uterus with AM	1409
		52	-	Uterus with LM, AM	423
Cohort II, Test Cohort	HGSOC Cases	51	IIIC	High-grade serous carcinoma	467
		56	IVB	High-grade serous carcinoma	1562
		57	IIIA1(i)	High-grade serous carcinoma	572
		57	IIIC	High-grade serous carcinoma	2414
		51	IIIC	High-grade serous carcinoma	1134
		56	IIA	High-grade serous carcinoma	1036
		55	IIIC	High-grade serous carcinoma	2532
		54	IIB	High-grade serous carcinoma	398
		57	IIIC	High-grade serous carcinoma	2287
		51	IIIC	High-grade serous carcinoma	332
	Normal Ovary Cases	51	-	Uterus with LM	803
		50	-	Uterus with LM, AM	1290
		50	-	Uterus with LM	1285
		50	-	Uterus with LM	807
		55	-	Uterus with LM	439
		52	-	Uterus with LM, AM	685
		53	-	Uterus with LM	1708
		50	-	Uterus with LM	152
		50	-	Uterus with LM	1405
		57	-	Uterus with LM	202

LM: Leiomyoma; AM: Adenomyosis; DPE: disordered proliferative endometrium; MCT: mature cystic teratoma; EM: endometriosis; LSIL: low-grade squamous intraepithelial lesion.

Fig 1

Study overview.

(a) Ovarian samples are collected from patients with and without HGSOC. High-throughput immune receptor sequencing reveals the TCRβ CDR3 sequences found in each tissue sample. (b) The CDR3 sequences are cut into motifs. In this example, a motif is assembled from three amino acid residues. Only a single residue from the CDR3 may be skipped, allowing for a single gap. Otherwise, the amino acid residues are contiguous neighbors. (c) Each amino acid residue is converted into a set of five chemical features using Atchley factors, for a total of fifteen features describing the motif. The relative abundance of each motif is included as an additional sixteenth feature. (d) Each feature is multiplied by a weight (β1 through β16) that determines its relative importance, and a bias value (β0) is added to calculate a logit. The logit can be converted into a probability value for that motif. (e) The weights and bias value are picked such that there is at least one motif with a probability value close to 1 in each HGSOC sample and all motifs in each healthy ovary sample have a probability close to 0.

The data are freely available from the VDJServer Community Data Portal (CDP) (vdjserver.org) under the project accession 3276777473314001386-242ac116-0001-012 [28]. The sequences are available in FASTA format in the “Browse Project Data” section. Annotated alignments are available in the tab-separated-values format recommended by the Adaptive Immune Receptor Repertoire Community in the “View Analyses and Results” section as output from IgBlast [29, 30].

Representing TCRs

As previously described [25], we analyzed X-ray crystallographic structures of human TCRs bound to peptide–MHC complex obtained from the Protein Data Bank in order to determine how to represent TCRB sequence in a way that would capture the antigen binding capabilities of the corresponding TCRB chain. We focused on complementarity determining region 3 (CDR3), because it is the somatically generated portion of the gene and the primary determinant of the chain’s antigen-binding specificity. We also focused on residues that directly contact peptide in a peptide–MHC complex. The crystal structure analysis revealed that TCRB CDR3 residues in contact with peptide tend to lie near each other, forming a local neighborhood of contact residues. The size and relative location of this neighborhood varied, but it rarely included any of the first or last three CDR3 residues, its average length was four, and in ~25% of cases, a non-contact residue was interspersed between the contact residues. Thus, to capture CDR3 contact residues, we excluded the first and last three CDR3 residues and partitioned the remaining sequence into every possible contiguous strip of three amino acid residues, referred to as a motif (Fig 1B). We also allowed one residue in the CDR3 sequence to be skipped when assembling a motif (Fig 1B). Such skipped residues are referred to as a gap. Our expectation is that, for each TCRB CDR3, at least one of its motifs contains residues that contact the peptide component of the receptor's cognate antigen. Alternative models were considered but exhibited reduced performance (Table 2).
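The motif construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `extract_motifs` and its parameters are hypothetical, and it assumes that a gap is one skipped interior residue within a window of four consecutive residues taken from the trimmed CDR3.

```python
def extract_motifs(cdr3, size=3, trim=3):
    """Return the set of motifs found in one CDR3 amino acid sequence.

    Assumption: the first and last `trim` residues are dropped, and a gap
    means skipping one interior residue of a window of `size + 1` residues.
    """
    core = cdr3[trim:len(cdr3) - trim]  # exclude first and last three residues
    motifs = set()
    # Contiguous strips of `size` residues.
    for i in range(len(core) - size + 1):
        motifs.add(core[i:i + size])
    # Strips with a single skipped residue (one gap).
    for i in range(len(core) - size):
        window = core[i:i + size + 1]
        for skip in range(1, size):  # interior positions only
            motifs.add(window[:skip] + window[skip + 1:])
    return motifs


example = extract_motifs("CASSLGQGYEQYF")  # made-up CDR3 used for illustration
```

Skipping the first or last residue of a window would only reproduce a contiguous strip, which is why the sketch skips interior positions only.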

Table 2

Different model configurations evaluated on Cohort I.

Motif Size	# of Gap Positions	One-Hot Indicator of Gap Position	Restricting Gap to Position X	Expected Frequency in Blood	Log Frequency Instead	2nd Order Terms	Batch Norm.	Cross-Val Log-Loss	Cross-Val Accuracy	Early Stopping Step	Num Fits to Train
4	0				√			0.666	90%	1211	131072
3	1							0.332	95%	2499	131072
4	0			√			√	0.680	75%	9	131072
3	1			√				0.400	95%	1687	131072
4	0			√				0.887	65%	1506	524288
3	1	√		√				0.477	90%	1411	786432
3	2	√		√				0.963	65%	467	131072
4	1	√		√				1.004	55%	692	65536
4	2	√		√				0.639	80%	3222	131072
3	0							1.083	50%	3	131072
4	0							1.037	50%	4	131072
3	1		x = 1					1.043	50%	1	131072
3	1		x = 2					1.089	50%	1	131072
3	1		x = 3					1.072	50%	4	131072
3	1	√						0.378	90%	2499	786432
3	2	√						1.016	75%	1145	131072
4	0			√		√		1.083	50%	5	131072
3	1	√		√		√		1.049	50%	5	131072
3	0					√		0.823	85%	1036	131072
4	0					√		1.108	50%	1	131072
4	3				√			0.447	85%	2499	131072

To ensure each model can run in a reasonable amount of time, only the top 65,536 most abundant motifs in a biopsy are used.

Each row represents a different model, and the columns describe the configuration of each model. The first row (bold font) corresponds to the model configuration with the best performance for the breast and colorectal cancer datasets [25]. The second row (bold underlined font) corresponds to the best performing model configuration presented here. The first column indicates the number of amino acid residues in the motif. The second column indicates the number of CDR3 amino acid residues that could be skipped when assembling a motif. For example, if the value is 2, then 2 CDR3 amino acid residues could be skipped. The third column indicates whether binary indicators marking the position of a skipped CDR3 residue were used as features. For example, if a CDR3 residue was skipped but would have been in the third position of a motif if it had been included, then the third indicator would have a value of 1. The fourth column indicates whether the gap was restricted to the given position x in the motif. The fifth column indicates whether the expected frequency of the motif in blood was included as a feature. The expected frequency was estimated using publicly available data from 786 presumed healthy individuals [31]. The sixth column indicates whether the log of the motif relative abundance was used for the relative abundance term. Column 7 indicates whether each feature is squared and used as an additional feature, resulting in 2nd order terms in the model. Column 8 indicates whether batch normalization was used. Column 9 (fourth from last) is the log-loss averaged across the one-holdout cross-validations. Column 10 (third from last) is the accuracy computed over the one-holdout cross-validations. Column 11 (second from last) is the number of gradient steps used to fit the model as determined by early stopping. Column 12 is the number of fits to the training data, of which the best fit to the training data is applied to the holdout sample.

When different TCRs bind the same peptide, the TCRB CDR3 contact residues may be different amino acids across the different TCRs. Thus, to identify motifs with different amino acid sequences but similar antigen-binding capabilities, we represented each motif using numerical values for the biophysicochemical properties of its component amino acids. We used Atchley factors as the biophysicochemical descriptors [24]. Atchley factors were derived from a set of over 50 amino acid properties by identifying clusters of properties that co-vary. The five Atchley factor values for each amino acid residue correspond loosely to its polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. For input into our model, each amino acid residue was represented by a vector of its five Atchley factor values (Fig 1C). With three amino acid residues in a motif, there are a total of 15 Atchley factor values that we represent using the symbols f1 through f15.

T cells undergo clonal expansion in response to antigen stimulation, creating copies of the T cell and its receptor. Thus, the quantity of a motif can indicate whether receptors containing it have encountered their cognate antigen. We therefore included an estimator of motif quantity as a feature in the model. When calculating the relative abundance of a motif in each sample, we identified every TCRB sequence containing the motif in its CDR3 and summed over the sequences’ template counts, $C_{\text{CDR3}}$. This provided the motif count, $C_{\text{motif}} = \sum C_{\text{CDR3}}$. We then divided by the total count of all motifs in the sample, $T$, to get the motif’s relative abundance, $f = C_{\text{motif}} / T$.

As with most statistical classifiers, it is important to normalize the input into the model, i.e., the model features. For each feature, we calculated a weighted mean and variance of the feature values over all motifs, where the weights were the relative abundances of each motif, f. Thus, motifs that appear more frequently exerted a greater influence on feature mean and variance than motifs that appeared only once or a few times. We then subtracted the mean from each feature value and divided by the square root of the variance to obtain a normalized value for model input.
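To make the abundance and normalization steps concrete, here is a minimal pure-Python sketch. The function names and input layout (`motifs_by_cdr3`, `template_counts`) are hypothetical and chosen for illustration; the sketch assumes each CDR3 contributes its template count once to every distinct motif it contains, and it normalizes one feature at a time.

```python
import math
from collections import defaultdict


def motif_relative_abundance(motifs_by_cdr3, template_counts):
    """Relative abundance f of each motif in one sample.

    motifs_by_cdr3: {cdr3: set of motifs it contains} (assumed layout)
    template_counts: {cdr3: template count of that sequence}
    """
    c_motif = defaultdict(float)
    for cdr3, motifs in motifs_by_cdr3.items():
        for motif in motifs:
            c_motif[motif] += template_counts[cdr3]  # sum of C_CDR3
    total = sum(c_motif.values())  # T, total count of all motifs
    return {motif: count / total for motif, count in c_motif.items()}


def weighted_normalize(values, weights):
    """Standardize one feature with an abundance-weighted mean and variance."""
    w_sum = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / w_sum
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / w_sum
    return [(v - mean) / math.sqrt(var) for v in values]
```

After normalization, the abundance-weighted mean of the feature is 0 and its abundance-weighted variance is 1, so frequent motifs dominate the scaling.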

Logistic model

The 15 Atchley factor values for the three motif residues along with the motif’s relative abundance were combined into a single vector [f1, f2, …, f15, f] representing the features of a motif. Every motif was scored on the basis of these features using a logistic function that calculates the probability that a motif was derived from a HGSOC-associated repertoire. We used the logistic function because of its widespread use and simplicity, and because it models the outcome of a two-category process. The first step was to compute a biased, weighted sum of the features, referred to as a logit. The logit for the ith motif is represented as $l_i = \beta_0 + \sum_{k=1}^{15} \beta_k f_k + \beta_{16} f$, with the feature values taken from the ith motif. The bias term β0 along with the weights β1 through β16 are the parameters of the model and were fit using gradient descent optimization techniques as described below. Every motif was scored using the same values for the weights and bias term. Once the logit was computed, the value was passed through the sigmoid function, $p_i = 1 / (1 + e^{-l_i})$, to obtain a probability value between 0 and 1 for the ith motif.
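As a sketch of the scoring step, the logit and sigmoid can be written as follows. The names `sigmoid` and `motif_probability` are illustrative rather than taken from the authors' code, and the branch on the sign of the input is an implementation detail added here to avoid floating-point overflow.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def motif_probability(features, weights, bias):
    """Score one motif.

    features: the 16 normalized values [f1, ..., f15, f]
    weights:  [beta_1, ..., beta_16]; bias: beta_0
    """
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(logit)
```

Because the same weights and bias are used for every motif, the function can be mapped over all motifs in a repertoire.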

Aggregating motif probabilities (multiple instance learning)

To predict whether a repertoire was derived from HGSOC or healthy ovarian tissue, the probabilities assigned to each motif must be aggregated into a single value that predicts the repertoire-level label. This problem of predicting a label for a set of objects from the scores for the individual objects in the set can be formally described as MIL [23]. According to the standard assumption of MIL, at least one motif from HGSOC-derived repertoires must have a high probability, while none of the motifs from healthy tissue-derived repertoires should have a high probability. This assumption can be implemented by simply taking as the repertoire score the maximum score over all motif scores in the repertoire. Thus, the probability that repertoire j is tumor-derived given the individual motif probabilities was computed as $P_j = \max_i p_{ij}$ (Eq 3), where $p_{ij}$ is the probability assigned to the ith motif in repertoire j. Using 0.5 as the threshold score, a repertoire is predicted to be tumor-derived when at least one motif is scored with a probability ≥ 0.5. The predictions from Eq (3) were used to fit the model’s parameters using Cohort I.
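The max-based aggregation is simple enough to state directly in code. The following sketch uses hypothetical function names and takes the per-motif probabilities of one repertoire as a plain list.

```python
def repertoire_probability(motif_probs):
    """Repertoire score under the standard MIL assumption: the maximum motif score."""
    return max(motif_probs)


def predict_tumor(motif_probs, threshold=0.5):
    """Predict tumor-derived when at least one motif scores at or above the threshold."""
    return repertoire_probability(motif_probs) >= threshold
```

One consequence of this design is that a single high-scoring motif decides the call, regardless of how many low-scoring motifs the repertoire contains.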

Parameter fitting

Values for the bias term β0 and weights β1 through β16 were selected to maximize the probability that each prediction from Eq (3) is correct, assigning tumor-derived repertoires a probability close to 1 and healthy ovary-derived repertoires a probability close to 0, using the following log-likelihood objective function: $\sum_j \left[ y_j \log P_j + (1 - y_j) \log (1 - P_j) \right]$, where $P_j$ is the probability from Eq (3) that repertoire j is tumor-derived and $y_j$ is 1 for a tumor-derived repertoire and 0 for a healthy ovary-derived repertoire. The maximum in Eq (3) introduces a nonlinear operation into the model, resulting in a model that cannot be fitted using standard optimization techniques for logistic regression. Thus, gradient optimization techniques were used, such that each parameter is iteratively adjusted along the gradient in a direction that maximizes the log-likelihood, which in turn maximizes the likelihood that each prediction is correct (Fig 2).
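For readers who want to relate the objective to the loss values reported later in bits, the sketch below computes an average negative log-likelihood over repertoires. It assumes base-2 logarithms (suggested by the unit "bits") and averaging over samples; both are assumptions for illustration, as is the clipping constant `eps`.

```python
import math


def average_log_loss_bits(repertoire_probs, labels, eps=1e-12):
    """Average negative log2-likelihood of the labels (1 = tumor, 0 = healthy)."""
    total = 0.0
    for p, y in zip(repertoire_probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log2 finite at p = 0 or 1
        total -= y * math.log2(p) + (1 - y) * math.log2(1.0 - p)
    return total / len(labels)
```

Under these assumptions, a classifier that always outputs 0.5 scores exactly 1 bit per sample, and lower values indicate a better fit.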

Fig 2

Workflow for model selection and parameter fitting.

(a) Left Panel: The diagram shows how Cohort I is used to train and evaluate each model. Model performance is evaluated by an exhaustive 1-holdout cross-validation using only Cohort I. (b) Right Panel: The diagram shows how the best performing model is evaluated with unseen test data. The best performing model is refitted to all the samples in Cohort I and then used to score the test samples from Cohort II. The same random initial β-coefficients are reused from (a) when refitting the best performing model in (b).

Because gradient optimization techniques are sensitive to the initial values of the bias and weight terms, the initial values must be carefully selected. As is typical, the bias term β0 was initialized to 0. The weight values β1 through β16 were initialized using two different distributions, depending on the feature: one distribution for β1 through β15, the weights for the Atchley factors, and another for β16, the weight for the relative abundance of each motif. This initialization scheme ensures that the contribution from the 15 Atchley factors has the same expected magnitude as the contribution from the motifs’ relative abundances. Next, the Adam optimizer, a gradient descent-based optimizer, was run for 2,500 iterations with a step size of 0.01 [32]. Default values for the other Adam optimizer settings were used (b1 = 0.9, b2 = 0.999, ε = 10^−8).

A limitation of gradient descent-based methods is that there is no guarantee of finding the globally optimal solution. Although the logistic model is a linear model, the motif probabilities were aggregated together in a non-linear fashion, and multiple local minima could exist. To address this, 2^17 = 131,072 runs of Adam optimization, each starting from different initial values as described in the previous paragraph, were used, and the best fit solution over all runs was used to classify new samples. By identifying the best fit to the training data across a huge number of runs, we were attempting to find the globally optimal solution.
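The multi-start strategy can be illustrated on a toy problem. The sketch below is not the model from this study: it minimizes a hypothetical one-dimensional loss with two local minima, uses a hand-written Adam update with the settings quoted in the text (step size 0.01, b1 = 0.9, b2 = 0.999, ε = 10^−8, 2,500 steps), and runs 16 restarts instead of 131,072. The uniform starting distribution is also an assumption made for the example.

```python
import math
import random


def adam_minimize(grad, x0, steps=2500, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g        # first-moment estimate
        v = b2 * v + (1 - b2) * g * g    # second-moment estimate
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def loss(x):
    """Toy loss with a local minimum near x = 0.96 and a global one near x = -1.04."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def loss_grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def best_of_restarts(n_restarts=16, seed=0):
    """Run Adam from several random starts and keep the best fit."""
    rng = random.Random(seed)
    fits = [adam_minimize(loss_grad, rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    return min(fits, key=loss), fits


best_x, all_fits = best_of_restarts()
```

Individual runs end in whichever basin they started in; only the selection across runs recovers the lower minimum, which is the rationale for using many restarts.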

Overfitting

Overfitting is always a concern with any statistical classifier. We previously found that L1/L2 regularization and dropout worsened the performance of our approach, perhaps because of the highly non-linear characteristics of our model [33]. Thus, we did not apply them in this study, but we did apply early stopping, as described below, which we previously found to significantly improve the model’s performance.

Model selection and testing

We used Cohort I for model selection and validation by patient-hold-out cross-validation (Fig 2, left panel). The same initial values for the weights β1 through β16 were reused with each cross-validation, ensuring the only variation between runs was due to the patient sample being held out, and not because each cross-validation used different initial values for the weights β1 through β16. We evaluated multiple models (Table 2) and selected as the best model the one with the lowest average negative log-likelihood on patient hold-out cross-validation. Briefly, we considered motifs of either three or four amino acid residues. We also considered gaps, where we allowed either one or two amino acid residues from the CDR3 to be skipped when assembling each possible motif. Additional features and modifications to the model were considered, as indicated in Table 2. For each model, the optimal number of gradient optimization steps was determined by examining the average log-likelihood of each model at each training step in the patient-holdout cross-validation. To account for model selection bias, the phenomenon whereby we identified a model that performs well on Cohort I without having discovered a generalizable signal, we evaluated the selected model on the test cohort, Cohort II (Fig 2, right panel). The weights and bias term for the best model identified via cross-validation on Cohort I were refit using all 20 Cohort I samples, and then Cohort II samples were scored. The code is available here: https://github.com/jostmey/MaxSnippetModelOvarian.
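The early-stopping rule, choosing the gradient step with the best average held-out loss across cross-validation folds, can be sketched as follows. The function name and the nested-list layout of `heldout_losses` are hypothetical; the sketch assumes the held-out loss of every fold was recorded at every step.

```python
def select_early_stopping_step(heldout_losses):
    """Pick the training step with the lowest average held-out loss.

    heldout_losses[k][t] is the loss on held-out sample k after step t + 1.
    Returns (best step, 1-indexed; average loss at that step).
    """
    n_folds = len(heldout_losses)
    n_steps = len(heldout_losses[0])
    averages = [
        sum(fold[t] for fold in heldout_losses) / n_folds
        for t in range(n_steps)
    ]
    best_t = min(range(n_steps), key=averages.__getitem__)
    return best_t + 1, averages[best_t]
```

Competing model configurations can then be ranked by the average loss returned at their respective best steps.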

Results

Cohort I

The best performing model by patient-holdout cross-validation on Cohort I used a motif of three amino acid residues and allowed for a single gap (Table 2). Under that model, the average number of motifs per tumor sample was 7,683.3, and the average number of motifs per healthy sample was 6,154.2. The largest number of motifs in any sample was 13,277. The best average log-likelihood was observed at the last (2,500th) gradient optimization step. The model correctly classified 95% (19/20) of held-out samples with an average log-likelihood of 0.332 bits (Fig 3A). The model correctly classified all healthy ovarian samples, giving a specificity of 100%, although one was quite close to the threshold score of 0.5. The model correctly classified all but one tumor sample, giving a sensitivity of 90%. To estimate the probability of correctly classifying 19 of 20 samples by chance, we performed a permutation analysis with 20 permutation runs. For each permutation, the sample labels were permuted and then patient-holdout cross-validation was performed. Early stopping was applied. The classification accuracies of all 20 permutations were < 95%, allowing us to assign p < 0.05 to the observed accuracy (Table 3). The average log-likelihood over all permutations was 0.993 bits, and the average accuracy was 55%.
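The permutation analysis can be checked with a short calculation using the accuracies listed in Table 3. The sketch assumes the common convention of adding one to both numerator and denominator so that the estimate is never exactly zero; the function name is illustrative.

```python
def permutation_p_value(observed, permuted):
    """Fraction of permutations at least as good as the observed value (+1 convention)."""
    at_least_as_good = sum(1 for value in permuted if value >= observed)
    return (at_least_as_good + 1) / (len(permuted) + 1)


# Classification accuracies (%) of the 20 permutation runs from Table 3.
permuted_accuracy = [55, 50, 50, 50, 50, 60, 85, 50, 65, 30,
                     50, 55, 50, 85, 50, 50, 50, 50, 50, 70]

p_value = permutation_p_value(95, permuted_accuracy)
mean_accuracy = sum(permuted_accuracy) / len(permuted_accuracy)
```

With none of the 20 permutations reaching 95%, this convention gives 1/21, which is consistent with the reported p < 0.05, and the mean permuted accuracy is 55.25%.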

Fig 3

Results.

(a) Classification results obtained by leave-one-out cross-validation for each patient in Cohort I. (b) Illustration of the classifier weights averaged across all 20 cross-validation runs (error bars for the standard deviation are omitted because the range was too small to plot relative to the size of each arrow). For each of the five Atchley factors, the weights are shown for the three residue positions. The weight for the log-frequency of the receptor is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. (c) All motifs with a score above 0.5 (middle column) are shown for the 20 patient samples. Each motif is shown in the context of its respective CDR3. The leftmost column indicates the patient and the rightmost column indicates the number of times the motif is observed in the sample. (d) Classification results obtained on Cohort II test samples. (e) The ROC curve shows true and false positive rates for different thresholds of a positive diagnosis based on the model applied to Cohort II. The area under the curve is 0.79. (f) All motifs with a score above 0.5 (middle column) are shown for the 20 patient samples in Cohort II. Each motif is shown in the context of its respective CDR3. The leftmost column indicates the patient and the rightmost column indicates the number of times the motif is observed in the sample.

Table 3

Permutation results.

Run	Average Loss	Classification Accuracy	Early Stopping Step
1	0.972	55%	132
2	1.07	50%	3
3	1.021	50%	3
4	1.06	50%	1
5	1.055	50%	5
6	1.038	60%	471
7	0.77	85%	503
8	1.03	50%	209
9	0.964	65%	295
10	1.041	30%	245
11	1.011	50%	73
12	1.008	55%	158
13	1.076	50%	4
14	0.63	85%	2497
15	1.012	50%	34
16	1.042	50%	4
17	1.044	50%	4
18	1.043	50%	10
19	1.089	50%	44
20	0.891	70%	1005
Average	0.99335	55%

Each row corresponds to a single permutation of the Cohort I data set, indicated in column 1. The second column shows the loss averaged over all patient-hold-out cross-validations. The third column shows the classification accuracy over all patient-hold-out cross-validations. The fourth column shows the fitting step, out of 2500, at which the lowest average loss was observed.

To discern the features that increase the probability of a HGSOC categorization, we examined the model weights across all 20 cross-validation runs (Fig 3B). The weights reveal how each Atchley factor contributes to the score and the relative importance of each position in the motif. Motifs with a positively charged, hydrophilic residue that tends to participate in alpha-helices in position 1, followed by a small residue that tends to participate in bends and coils in position 2, followed by a large, positively charged residue in position 3 will be scored by the model with a high probability of deriving from a HGSOC-associated repertoire. The weight for the relative abundance of the motif is positive, indicating that more abundant motifs would have a higher probability than less abundant motifs.

We aligned the high scoring motifs from each holdout sample and present them within the context of the CDR3 sequences from which they originated (Fig 3C). The motifs varied in terms of their component residues, but a restricted set of amino acids was observed at each position. Amino acids Glutamic acid, Lysine, and Arginine were common in position 1, Tryptophan and Tyrosine were common in position 2, and Histidine and Tryptophan were common in position 3. We also determined the number of times each CDR3 appeared in each sample and noted that most of them appear only once. None of the CDR3 sequences are shared across patients.

Cohort II

Given the potential for overfitting and model selection bias, we assessed the model’s performance on samples not used for model selection or parameter fitting, i.e., on Cohort II. After selecting the best performing model using cross-validation on Cohort I, as described above, we refit the parameters of the selected model on all 20 Cohort I samples with 2,500 gradient optimization steps, the number of steps determined to be optimal in the cross-validation (Table 2). The resulting weights β1 through β16 appear indistinguishable from those in Fig 3C. The newly fitted model was then applied to Cohort II and correctly classified 80% (16/20) of the samples with an average log-likelihood fit of 0.821 bits. The model correctly classified all but one healthy ovarian sample (specificity 90%) and misclassified three tumor samples (sensitivity 70%) (Fig 3D). The area under the Receiver Operating Characteristic (ROC) curve was 0.79 (Fig 3E).

We aligned the high-scoring motifs from the Cohort II samples and present them within the context of the CDR3 sequences from which they originated (Fig 3F). As with the Cohort I motifs, the amino acid residues present at each position vary, but the variability is restricted to a subset. As with the Cohort I samples, glutamic acid and lysine are common in position 1, tryptophan and tyrosine are common in position 2, and histidine and tryptophan are common in position 3. In contrast, arginine was common in position 1 of Cohort I motifs but is found in position 1 of only one Cohort II motif, and aspartic acid is common in position 3 of Cohort II motifs but was not observed in position 3 of Cohort I motifs. As with Cohort I, we found that the majority of CDR3s containing high-scoring motifs were present only one time in their sample.
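The Cohort II figures can be checked with simple arithmetic. The sketch below assumes the cohort contains 10 tumor and 10 healthy samples, which is what the reported numbers imply (one healthy sample misclassified at 90% specificity, three tumor samples misclassified at 70% sensitivity, 16 of 20 correct overall). The function name `classification_metrics` is illustrative only.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    sensitivity = tp / (tp + fn)          # fraction of tumor samples detected
    specificity = tn / (tn + fp)          # fraction of healthy samples cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Counts implied by the text: 7 of 10 tumor and 9 of 10 healthy samples correct.
accuracy, sensitivity, specificity = classification_metrics(tp=7, fn=3, tn=9, fp=1)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

With only 10 samples per class, each misclassified sample moves sensitivity or specificity by 10 percentage points, so these estimates carry wide uncertainty.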

Discussion

We previously hypothesized that T cell responses against antigens shared among cancer patients might enable development of a new approach to cancer detection [25]. Shared tumor antigens are not favored for antigen-targeted immunotherapy where the goal is to elicit such a high degree of tumor-cell killing that the tumor is eradicated. In that case, antigens with expression patterns highly restricted to the tumor and that are targeted by high-affinity TCRs are needed. For cancer detection, however, it is only necessary that the corresponding TCRs be present in patients with the cancer and not in those without or that they be present with an elevated abundance in those with cancer relative to those without.

To determine whether such T cell responses might enable cancer detection, we first sought to develop a method for identifying the corresponding TCRs that did not require knowledge of the target antigens and did not rely on the assumption that T cells responding to a common target would express TCRs with the same amino acid sequence. To accomplish this, we developed the method described here, converting amino acid sequences into numerical vectors whose components correspond to amino acid biophysicochemical values, such as charge, and applying multiple instance learning. In all cases in which the method has been applied, it has identified a motif that can distinguish the tissue or patient groups of interest with solid performance [25, 33]. We hypothesize that TCRs bearing these motifs have overlapping antigen binding profiles and are concentrated in cancer tissue due to the presence of a common antigen there. This is a hypothesis that will have to be tested experimentally, but the strong classification performance of the motifs warrants further study, despite uncertainty regarding any shared antigen specificity.

In our first application of this method to TCRs, we considered motifs of four residues and did not allow gaps [25].
Additionally, we took the natural logarithm of the motif relative abundance term. Taking that same model and fitting the weight values on Cohort I, we obtained a classification accuracy of 90% with a likelihood error of 0.666 (Table 2). To determine whether we could improve the performance, we explored additional models not considered in our prior work (Table 2). The best performing model used a three-residue motif allowing for one gap and achieved a classification accuracy of 95% with a likelihood error of 0.332 (Table 2). Thus, while the approach has produced good results across multiple cancer types, each one has required optimization of the motif representation to obtain the best performance. Additional innovation to the modeling approach is required to produce a method that works across multiple cancer types without this customization.

Whenever multiple models are evaluated on the same data and the best performing model is selected, model selection bias can occur. To determine the extent of model selection bias in our Cohort I result, we evaluated the selected model’s performance on Cohort II, which is wholly unseen (i.e., not used for parameter fitting or model selection). The classification accuracy on Cohort II is 80% with a likelihood error of 0.821. Reduced performance on test data is expected, and these results indicate that the model has identified a signal that is expected to generalize to new samples with 80% accuracy.

We have applied the method to three cancer types and in each case identified a distinct biophysicochemical motif. For breast cancer, all receptors bearing the motif were of high abundance, and in some cases were the most abundant clone, whereas for colorectal cancer, all but a few of the motif-bearing clones were of low abundance [25].
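The likelihood errors quoted in this section (0.666 and 0.332 on Cohort I, 0.821 on Cohort II) are easier to read against a baseline. The sketch below assumes the reported quantity is the mean binary cross-entropy expressed in bits, which is consistent with the Results reporting the Cohort II fit "in bits" but is not spelled out here. Under that assumption, a classifier that always outputs a probability of 0.5 scores exactly 1 bit, which matches the permutation losses clustering around 1. The function name `mean_log_loss_bits` and the toy labels and probabilities are illustrative.

```python
import math


def mean_log_loss_bits(labels, probs):
    """Mean binary cross-entropy in bits.

    labels: iterable of 0/1 class labels.
    probs:  predicted probability of class 1 for each sample (0 < p < 1).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        # Probability the model assigned to the true class.
        p_true = p if y == 1 else 1.0 - p
        total += -math.log2(p_true)
    return total / len(labels)


labels = [1, 0, 1, 0]

# An uninformative model: every sample gets probability 0.5.
chance = mean_log_loss_bits(labels, [0.5, 0.5, 0.5, 0.5])

# A model that leans the right way on every sample (made-up probabilities).
informative = mean_log_loss_bits(labels, [0.8, 0.3, 0.7, 0.2])

print(chance, informative)
```

Values below 1 bit therefore indicate that the predicted probabilities carry information about the true class, and values above 1 bit indicate confident mistakes.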
In the case of ovarian cancer, we again observed that motif-bearing clones are of low abundance, and in fact, in all but a few cases, the corresponding CDR3 sequences were observed in the sample only a single time. While this is perhaps surprising, we note that frozen tissue was used for the colorectal samples in our prior study, while the ovarian samples in this study were all formalin-fixed paraffin-embedded samples that had been collected between 2009 and 2016. The samples are therefore likely to have sustained significant DNA damage and to have significantly reduced sequence coverage of target regions [34]. It seems unlikely that the motif identified by our approach is purely an artifact given that it correctly classified 80% of the Cohort II samples. Taking the data at face value, it appears the motifs that mark repertoires as being HGSOC-associated are found in low frequency clones.

While our previous results demonstrate that TCR repertoires from TILs can be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions, our current results go further by demonstrating that TIL repertoires from women with HGSOC can be distinguished from ovarian tissue-associated repertoires from women with healthy ovaries. Thus, in this case, we are distinguishing women with cancer from women without cancer, which is the classification task that is directly relevant to cancer detection.

Despite this significant advance over the prior work, however, there are still several limitations that must be addressed. First, the HGSOC samples used in this study were primarily from women with stage III or IV disease. It is critical to determine whether this or another signature can be detected at early stages of disease, particularly before the appearance of invasive disease.
Second, to have any potential utility for cancer detection, the signature must be detectable in tissue collected by minimally invasive means. That typically means blood. While the overlap between TIL repertoires and the peripheral T cell repertoire has been shown to be relatively low, it is much higher, with as much as ~50–60% overlap, when the CD8+PD-1+ subset of peripheral T cells is sorted [35-40]. Furthermore, the specific antigens recognized by this subset were similar to those recognized by the TIL population [40]. Thus, it is reasonable to expect that a TCR signature found in the tissue can be detected in this or another peripheral T cell subset.

An additional potential utility of our approach is in the diagnosis of women who present with an ovarian mass. Thus, it will be essential to assess the signature on benign ovarian tumors, as well as on ovarian cancers of other types, to determine whether the signature presented here is present in those cases or whether these have their own unique signature.

Taken together, our current and prior results indicate that TCR-based biomarkers have potential utility for cancer detection. They justify further studies on larger patient cohorts designed to improve the generalizability of the signature with a particular focus on blood samples from patients with early stage disease. Additionally, they justify application of this method in other cancer types, such as pancreatic cancer, where, like ovarian, the need for early detection methods is particularly critical.

16 Jan 2020

PONE-D-19-32179

Biophysicochemical Motifs in T-cell Receptor Sequences as a Potential Biomarker for High-Grade Serous Ovarian Carcinoma

PLOS ONE

Dear Dr. Cowell,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands.
Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. We would appreciate receiving your revised manuscript by 30 days. We look forward to receiving your revised manuscript. Kind regards, David Wai Chan, Ph.D. Academic Editor PLOS ONE Journal requirements: When submitting your revision, we need you to address these additional requirements. 1.
Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. To comply with PLOS ONE submissions requirements, please include your ethics statement ('This study was approved by the UTSW IRB, study number STU-2018-0239, with a waiver of consent because anonymized, archived, FFPE tissue was used') in the Methods section of your manuscript. 3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. 4. Thank you for stating the following financial disclosure: "This project was supported by funding to LGC from UT Southwestern Medical Center, Be the Difference Foundation, Commercial Real Estate Women of Dallas (CREW Dallas), and an anonymous donor." Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf. 5. Thank you for stating the following in the Competing Interests section: "NO authors have competing interests." 
We note that you received funding from a commercial source: 'Commercial Real Estate Women of Dallas (CREW Dallas)' Please provide an amended Competing Interests Statement that explicitly states this commercial funder, along with any other relevant declarations relating to employment, consultancy, patents, products in development, marketed products, etc. Within this Competing Interests Statement, please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. Please include your amended Competing Interests Statement within your cover letter. We will change the online submission form on your behalf. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests 6. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. 
Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ Additional Editor Comments (if provided): This study is interesting and there are some minor points needed for verification. As the reviewer suggested, it's nice to extend the approach to distinguishing women with or without HGSC by using ovarian tissues from non-cancer patients in their training set. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: I Don't Know Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. 
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This study builds on previous work by this group that T cell receptor repertories (TCR) can be used to identify breast and colon cancer. This study used methods previously established to identify TCR motif in T cells that can distinguish between normal and HGSOC. The authors highlighted an important limitation that the study only examined advanced stage disease. Early stage disease and TILs from blood also need to be examined. The authors should also include in the discussion that performance of the TCR motifs identified in the HGSOC samples should also be compared to those present in benign tumor tissues. Please provide further information on the TILs sequencing method used. Tissue used from paraffin blocks but is not clear if sequencing is performed at the DNA or amino acid level? is not clear if ovarian and Fallopian tissue was analysed separately or combined in the cases where both were available. 
Is any information known about the number if TILS present in the tissues. Minor correction Please include abbreviation for LSIL in Table 1 Reviewer #2: In their recent manuscript titled “Biophysicochemical Motifs in T-cell Receptor Sequences as a Potential Biomarker for High-Grade Serous Ovarian Carcinoma” Ostmeyer et al. further developed the concept of disease classification based on T-cell receptor (TCR) repertories analysis they put forth in their 2019 Cancer Research paper. This time they attempted to demonstrate specific biophysiochemical motifs in the TCR of tumour infiltrating lymphocytes (TIL) in ovarian tumour exist. Those motifs can help distinguish women with and without high grade serous carcinoma (HGSC). By incorporating healthy women in the training set the motifs identified should be able to be generalized which may help cancer detection in the future. This is an excellent study probing an interesting possibility of cancer detection. Although the road to application may still be long ahead, the authors managed to demonstrate the feasibility of the principle. Major concerns 1. Differentiating ovarian cancer tissue form normal ovary is generally not a problem. This study extended their approach to distinguishing women with or without HGSC by using ovarian tissues from non-cancer patients in their training set. This is perhaps the most significant difference from their previous papers and is one step further towards the goal of using TCR repertoire for cancer detection. However, the team still need to demonstrate this technology can help to detect cancer (1) before invasive disease arise; and (2) in samples convenient taken e.g. blood. Until then the technique is no better than current markers such as IHC. 2. The authors focused on HGSC but when a woman is diagnosed with ovarian cancer perhaps it more important to distinguish whether it’s HGSC vs other type I e.g. clear cell which affect management options and prognosis. 
Currently for equivocal cases IHC markers are used but there are some exceptions. I wonder TCR repertoire can help on this problem. 3. As mentioned in the discussion, TILs repertories and peripheral T cells repertoires are rather different. Although sorting CD8+PD-1+ cells may improve the representation of repertories within the tumour, I expect the accuracy of classification when using peripheral blood will be further reduced. Would increasing sequencing depth help on this? How will be the cost of applying this technology? 4. Can the learning process discover associations between certain motifs with clinical parameters such as chemoresistance, responsiveness towards immunotherapy? Minor concerns 1. “Additionally, they justify application of this method in other cancer types, such as pancreatic cancer, where the need for early detection methods are particularly critical.” I think ovarian cancer is also a type that early detection is very much desired. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. 
PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

20 Jan 2020

POINT-BY-POINT RESPONSE

Journal Requirements – the item numbers correspond to those in request email

1. Changes have been made according to the style requirements. These do not show in track changes.
2. The ethics statement has been added to the first paragraph of the Materials and Methods section, in the Datasets subsection (lines 122-124).
3. The data are freely available from the VDJServer Community Data Portal as described in the Materials and Methods section on lines 158-162. The code is available from GitHub per the link now provided on line 331.
4. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
5. No change has been made, because CREW Dallas is NOT a commercial entity. It is a 501c3. Please let me know if some change is still required.
6. The ORCID iD has been updated.
7. The figure tiff files have been converted using PACE.

Reviewer Comments – in the order they appear in the emailed reviews. Reviewer comments in italics; our response in normal font.

Reviewer #1

The authors should also include in the discussion that performance of the TCR motifs identified in the HGSOC samples should also be compared to those present in benign tumor tissues.

This has been added in the Discussion section, lines 490-493.

Please provide further information on the TILs sequencing method used. Tissue used from paraffin blocks but is not clear if sequencing is performed at the DNA or amino acid level?
is not clear if ovarian and Fallopian tissue was analysed separately or combined in the cases where both were available. Is any information known about the number of TILS present in the tissues.

All of these questions are now answered in the Materials and Methods section, lines 134-138.

Please include the abbreviation for LSIL in Table 1.

Clarification of this abbreviation has been added to the Table 1 footnote, along with the other abbreviations.

Reviewer #2

… the team still need to demonstrate this technology can help to detect cancer (1) before invasive disease arise; and (2) in samples convenient taken e.g. blood.

We previously mentioned that the use of primarily stage III and IV samples is a limitation of our study, and that “It is critical to determine whether this or another signature can be detected at early stages of disease” (Discussion section, lines 481-482). We have now added the phrase “particularly before the appearance of invasive disease” to emphasize the point made by the reviewer, which is a very important one. We had previously included a discussion of the need to demonstrate the performance of the technology in blood (Discussion section, lines 482-488). We believe this already addresses the reviewer comment but would be happy to make any suggested changes or additions to this text.

… perhaps it more important to distinguish whether it’s HGSC vs other types

We have added text to the Discussion section (lines 490-493) to clarify that in women who present with a pelvic mass, determining the nature of the mass is an important clinical question and that the utility of our approach for this task needs to be assessed.

Additional clarification: we focused on HGSOC because it is the most common type of ovarian cancer. We agree with the reviewer that, when a woman is diagnosed with ovarian cancer, it is important to determine what type of cancer it is, or whether it is even cancer (versus a benign tumor).
This is currently done by means requiring a biopsy. The ability to make this determination with blood biomarkers would have clinical benefit. However, we are focused on early detection in the context of screening the general population, a clinical application that places different requirements on a molecular biomarker than diagnosis. It is for that reason that we focused on the most common ovarian cancer type and whether its presence could be distinguished from cancer-free women. However, we agree that the reviewer’s suggested application is an important one that we hope to pursue in future studies.

I expect the accuracy of classification when using peripheral blood will be further reduced. Would increasing the sequencing depth help on this? How will be the cost of applying this technology?

We agree that a straight application of our current method to blood would be expected to have reduced accuracy. We anticipate the need to refit the model parameter that corresponds to motif relative abundance. Additional changes may also be necessary. We are in the process of applying for funding that would allow us to test the model on blood and optimize a blood-specific model. We strongly agree that deep, and perhaps ultra-deep, sequencing will be required, at least in the motif discovery stage.

Regarding costs, we don’t know the answer to that. We note, however, that Adaptive Biotechnologies has an FDA-approved assay for detecting measurable residual disease for leukemias that is based on this technology, and the test has received approval to be covered by Medicare. This suggests to us, that, if this approach proves to have clinical utility (which will require many many follow up studies on larger sample sizes, using blood, with more heterogeneous control groups, etc), it will be possible to develop a clinical assay that has been optimized to make it cost-effective.
Can the learning process discover associations between certain motifs with clinical parameters such as …

We believe the answer is yes, if TCR specificity plays a role in the chemoresistance or response to immunotherapy. We are currently in the process of assessing this in the context of response to immune checkpoint blockade but do not yet have an answer.

I think ovarian cancer is also a type that early detection is very much desired.

This is correct, and we have modified the sentence that originally only mentioned pancreatic cancer to read “Additionally, they justify application of this method in other cancer types, such as pancreatic cancer, where, like ovarian, the need for early detection methods are particularly critical.”

11 Feb 2020

Biophysicochemical Motifs in T-cell Receptor Sequences as a Potential Biomarker for High-Grade Serous Ovarian Carcinoma

PONE-D-19-32179R1

Dear Dr. Cowell,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements. Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication. Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. With kind regards, David Wai Chan, Ph.D. Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). 
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed my concerns in the revised manuscript. Revised manuscript is now suitable for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

19 Feb 2020

PONE-D-19-32179R1

Biophysicochemical Motifs in T cell Receptor Sequences as a Potential Biomarker for High-Grade Serous Ovarian Carcinoma

Dear Dr. Cowell:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations!
Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. David Wai Chan
Academic Editor
PLOS ONE
