Mary L Stackpole1,2, Weihua Zeng1, Shuo Li1,2, Chun-Chi Liu2, Yonggang Zhou1, Shanshan He1, Angela Yeh1, Ziye Wang1, Fengzhu Sun3, Qingjiao Li4, Zuyang Yuan1, Asli Yildirim5, Pin-Jung Chen6, Paul Winograd6, Benjamin Tran6, Yi-Te Lee7, Paul Shize Li8, Zorawar Noor9, Megumi Yokomizo10, Preeti Ahuja10,11, Yazhen Zhu7,11, Hsian-Rong Tseng7,11, James S Tomlinson6,9,11,12, Edward Garon9,11, Samuel French1,11, Clara E Magyar1,11, Sarah Dry1,11, Clara Lajonchere11,13, Daniel Geschwind13, Gina Choi9, Sammy Saab9, Frank Alber5,14, Wing Hung Wong15,16, Steven M Dubinett1,7,9,11,12, Denise R Aberle10,11, Vatche Agopian17,18, Steven-Huy B Han19, Xiaohui Ni20, Wenyuan Li21, Xianghong Jasmine Zhou22,23,24.
Abstract
Early cancer detection by cell-free DNA faces multiple challenges: low fraction of tumor cell-free DNA, molecular heterogeneity of cancer, and sample sizes that are not sufficient to reflect diverse patient populations. Here, we develop a cancer detection approach to address these challenges. It consists of an assay, cfMethyl-Seq, for cost-effective sequencing of the cell-free DNA methylome (with >12-fold enrichment over whole genome bisulfite sequencing in CpG islands), and a computational method to extract methylation information and diagnose patients. Applying our approach to 408 colon, liver, lung, and stomach cancer patients and controls, at 97.9% specificity we achieve 80.7% and 74.5% sensitivity in detecting all-stage and early-stage cancer, and 89.1% and 85.0% accuracy for locating tissue-of-origin of all-stage and early-stage cancer, respectively. Our approach cost-effectively retains methylome profiles of cancer abnormalities, allowing us to learn new features and expand to other cancer types as training cohorts grow.
Year: 2022 PMID: 36175411 PMCID: PMC9522828 DOI: 10.1038/s41467-022-32995-6
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1 cfMethyl-Seq assay.
a Diagram of the cfMethyl-Seq protocol. b Typical TBE-UREA PAGE image of cfMethyl-Seq libraries made from cfDNA, compared with conventional RRBS with cfDNA or intact genomic DNA as input material. The non-specific ligation product from cfDNA fragments with the conventional RRBS protocol is indicated by an arrow. This technical validation experiment was repeated independently twice and showed similar results. For the cfMethyl-Seq assays that generated data for analysis, one library was constructed per sample, without replicates. c The percentage of reads with MspI sites on both ends, on only one end, and on neither end for the WGBS assay, our cfMethyl-Seq assay, and the RRBS assay on cfDNA. Source data are provided as a Source Data file. d The percentage of mapped fragments that fall in CpG islands, CpG island shores, CpG island shelves, and open sea regions is shown for cfMethyl-Seq libraries, RRBS libraries, and WGBS libraries on cfDNA. Source data are provided as a Source Data file. e Methylation concordance between a genomic DNA sample sequenced with RRBS and the same sample sheared and sequenced with cfMethyl-Seq increases with depth of coverage. Pearson correlation (y-axis) of the methylation rate (beta value) in the two datasets was calculated on the CpG sites that are covered by both datasets at the minimum depth of coverage specified on the x-axis. Source data are provided as a Source Data file. Abbreviations: RRBS, reduced representation bisulfite sequencing.
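The concordance calculation in panel e can be illustrated with a minimal sketch. It assumes each dataset is available as a mapping from CpG site to a `(methylated_reads, total_reads)` pair; that data layout and the function names `pearson` and `concordance_at_depth` are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def concordance_at_depth(dataset_a, dataset_b, min_depth):
    """Correlate beta values over CpG sites covered by both datasets.

    Each dataset maps a CpG site to (methylated_reads, total_reads).
    This layout is an assumption made for illustration only.
    """
    shared = [
        site for site in dataset_a
        if site in dataset_b
        and dataset_a[site][1] >= min_depth
        and dataset_b[site][1] >= min_depth
    ]
    if len(shared) < 2:
        return None  # too few sites to compute a correlation
    beta_a = [dataset_a[s][0] / dataset_a[s][1] for s in shared]
    beta_b = [dataset_b[s][0] / dataset_b[s][1] for s in shared]
    return pearson(beta_a, beta_b)
```

Raising `min_depth` discards sites whose beta values rest on few reads, which is one plausible reason the correlation in panel e rises with coverage.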
Fig. 2 Study design and overview of the computational method.
a Overview of the sample usage for marker discovery, model training, and validation. All tissue samples are used for marker discovery, and all plasma samples are randomly split into three sets, used for marker discovery, model training, and model validation, respectively. The plasma sample split is repeated 10 times and the prediction performance is averaged over the 10 runs. b Details of sample usage for marker discovery. Different types of methylation markers were discovered by using different samples. Note that 30 reference noncancer plasma samples (in blue boxes) correspond to “marker filtration” in a. Abbreviations: TOO, tissue of origin.
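The repeated three-way split described in panel a might be sketched as follows. The split fractions, the seeding scheme, and the names `three_way_split` and `repeated_evaluation` are assumptions made for illustration; the caption states only that the plasma samples are split at random into three sets and that performance is averaged over 10 runs.

```python
import random


def three_way_split(sample_ids, fractions=(0.3, 0.4, 0.3), seed=0):
    """Randomly partition samples into discovery, training, and validation sets.

    The fractions are hypothetical; the caption does not give the proportions.
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_discovery = int(len(ids) * fractions[0])
    n_training = int(len(ids) * fractions[1])
    discovery = ids[:n_discovery]
    training = ids[n_discovery:n_discovery + n_training]
    validation = ids[n_discovery + n_training:]  # remainder of the samples
    return discovery, training, validation


def repeated_evaluation(sample_ids, evaluate, n_runs=10):
    """Average a performance metric over independent random splits.

    `evaluate` is a caller-supplied function taking the three sets and
    returning a single number (for example, an AUROC).
    """
    scores = []
    for run in range(n_runs):
        discovery, training, validation = three_way_split(sample_ids, seed=run)
        scores.append(evaluate(discovery, training, validation))
    return sum(scores) / len(scores)
```

Using a different seed per run keeps each split reproducible while still varying which samples land in each set.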
Fig. 3 Performance of the stacked ensemble model for cancer detection.
a ROC curve of the stacked ensemble method for detecting all four cancer types. Source data are provided as a Source Data file. b Sensitivity breakdown in each cancer stage and cancer type. Sensitivity is shown at 1 false positive (97.9% specificity). The average number of test cancer patients in each cancer type and stage over 10 runs is indicated in the label of the horizontal axis. Sensitivity is not computed if the average number of cancer patients in a cancer stage/type over 10 runs is <4. The points and error bars represent the average sensitivity over 10 runs and 95% confidence intervals. Source data are provided as a Source Data file. c Performance (AUROC) of using all marker types and each individual marker type (102 samples). The points and error bars represent the average AUROC over 10 runs and 95% confidence intervals. Source data are provided as a Source Data file.
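Reporting sensitivity at a fixed number of false positives, as in panel b, can be illustrated with a minimal sketch. It assumes that a higher score means "more likely cancer" and that a sample is called positive when its score is strictly above the threshold; both conventions and the function name `sensitivity_at_false_positives` are hypothetical.

```python
def sensitivity_at_false_positives(control_scores, cancer_scores, max_fp=1):
    """Sensitivity and specificity when at most `max_fp` controls are called positive.

    Assumes higher scores indicate cancer (an illustrative convention).
    """
    ranked_controls = sorted(control_scores, reverse=True)
    # Place the threshold at the (max_fp + 1)-th highest control score, so
    # that no more than max_fp controls lie strictly above it.
    if max_fp < len(ranked_controls):
        threshold = ranked_controls[max_fp]
    else:
        threshold = float("-inf")
    false_positives = sum(score > threshold for score in control_scores)
    true_positives = sum(score > threshold for score in cancer_scores)
    specificity = 1 - false_positives / len(control_scores)
    sensitivity = true_positives / len(cancer_scores)
    return sensitivity, specificity
```

As a worked check of the arithmetic, one false positive among 48 controls gives a specificity of 47/48 ≈ 97.9%. The caption does not state the number of controls per test set, so the figure of 48 is only an inference from the reported specificity.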
Fig. 4 Performance of the stacked ensemble model for cancer tissue-of-origin prediction.
a Confusion matrix for all-stage cancer samples. Source data are provided as a Source Data file. b Confusion matrix for early-stage (i.e., stage I/II) cancer samples. Source data are provided as a Source Data file. c The accuracy of using all marker types and each individual marker type (35 samples in the test set of each run). The points and error bars represent the average accuracy over 10 runs and 95% confidence intervals. Source data are provided as a Source Data file.
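For readers unfamiliar with how a tissue-of-origin confusion matrix and its accuracy relate, here is a minimal sketch. The label values in the usage note are toy data, not results from the paper, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true tissue, columns index the predicted tissue."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. with the correct tissue predicted."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total
```

With `classes = ["colon", "liver", "lung", "stomach"]`, calling `confusion_matrix(true, predicted, classes)` on toy labels yields a 4 x 4 table whose diagonal counts the correctly located cancers, and `accuracy` summarizes that table as a single number.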
Fig. 5 Impact of the number of markers and the training sample size on the cancer detection performance.
a Performance of using the union of top M cancer-specific markers of four cancer types. Source data are provided as a Source Data file. b Performance of using the union of top M tissue-specific markers of each tissue pair. Source data are provided as a Source Data file. c Performance of the ensemble model for cancer detection increases with increasing training sample size (using 30% to 100% of the training samples). Source data are provided as a Source Data file. In all figures, the points and error bars represent the average AUROC over 10 runs and 95% confidence intervals (102 test samples per run).
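The "union of top M markers" selection in panels a and b can be sketched as below. It assumes each cancer type (or tissue pair) comes with a list of marker identifiers already sorted from best to worst; how the paper ranks markers is not described in this caption, so the ranking input and the name `union_of_top_markers` are hypothetical.

```python
def union_of_top_markers(ranked_markers_by_group, m):
    """Union of the top-m markers across groups.

    `ranked_markers_by_group` maps a group (a cancer type or a tissue pair)
    to its marker identifiers, best first. The ranking is assumed to exist.
    """
    selected = set()
    for markers in ranked_markers_by_group.values():
        selected.update(markers[:m])  # keep the m best markers of this group
    return selected
```

Because groups can share markers, the union may contain fewer than M times the number of groups, so the x-axis value M in such a plot is a per-group cap and not the total marker count.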