Shuli Kang, Qingjiao Li, Quan Chen, Yonggang Zhou, Stacy Park, Gina Lee, Brandon Grimes, Kostyantyn Krysan, Min Yu, Wei Wang, Frank Alber, Fengzhu Sun, Steven M Dubinett, Wenyuan Li, Xianghong Jasmine Zhou.
Abstract
We propose a probabilistic method, CancerLocator, which exploits the diagnostic potential of cell-free DNA by determining not only the presence but also the location of tumors. CancerLocator simultaneously infers the proportion and the tissue of origin of tumor-derived cell-free DNA in a blood sample using genome-wide DNA methylation data. CancerLocator outperforms two established multi-class classification methods on simulated and real data, even when tumor-derived DNA makes up only a low proportion of the cell-free DNA. CancerLocator also achieves promising results on patient plasma samples with low DNA methylation sequencing coverage.
Keywords: Cancer diagnosis; Cell-free DNA; DNA methylation; Liquid biopsy; Next-generation sequencing
Year: 2017 PMID: 28335812 PMCID: PMC5364586 DOI: 10.1186/s13059-017-1191-5
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1. Flowchart of CancerLocator. Step 1: A set of solid tumor samples and healthy plasma samples collected from public databases and the literature is used to select the informative features (CpG clusters) that can differentiate tumor types or healthy plasma samples. The beta distributions of the methylation levels of these selected features are then learnt for each tumor type and for healthy plasma samples. Step 2: Given a plasma sample, the methylation profile of its cfDNA is measured by whole-genome bisulfite sequencing and used as input for cancer location prediction by CancerLocator.
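The caption says that a beta distribution is learnt per feature and per class, but not how. As a minimal sketch, assuming a method-of-moments fit (an assumption, not necessarily the estimator used by CancerLocator), the parameters for one CpG cluster could be obtained as follows. The function name `fit_beta_moments` and the simulated `tumor_levels` are hypothetical.

```python
import numpy as np


def fit_beta_moments(levels):
    """Method-of-moments estimate of Beta(alpha, beta) from values in (0, 1).

    Assumption: this estimator is illustrative only; the source does not
    state how the beta distributions are learnt.
    """
    levels = np.asarray(levels, dtype=float)
    mean = levels.mean()
    var = levels.var(ddof=1)
    # A beta distribution requires 0 < var < mean * (1 - mean).
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("sample variance incompatible with a beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


rng = np.random.default_rng(0)
# Hypothetical methylation levels of one CpG cluster across tumor samples.
tumor_levels = rng.beta(8.0, 2.0, size=500)
alpha_hat, beta_hat = fit_beta_moments(tumor_levels)
print(alpha_hat, beta_hat)
```

By construction, the fitted distribution reproduces the sample mean, since `alpha_hat / (alpha_hat + beta_hat)` equals `tumor_levels.mean()`.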
Fig. 2. The mixture model of methylation level (x) in a patient’s plasma cfDNA for different burdens of ctDNA from tumor type t. Note that x, u, and v are the methylation levels of a single CpG cluster k in cfDNA, solid tumor, and normal plasma, respectively.
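The relationship between x, u, and v can be illustrated with a short sketch. It assumes a linear mixture, x = θ·u + (1 − θ)·v, where θ stands for the ctDNA burden; the caption names x, u, and v but does not spell out this formula, so the form, the symbol `theta`, and the example values below are assumptions.

```python
import numpy as np


def mix_methylation(theta, u, v):
    """Methylation level of cfDNA under an assumed linear mixture.

    theta: ctDNA burden in [0, 1] (hypothetical parameter name)
    u: methylation levels of the CpG clusters in solid tumor
    v: methylation levels of the same clusters in normal plasma
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return theta * u + (1.0 - theta) * v


# Hypothetical levels for three CpG clusters.
u = np.array([0.9, 0.2, 0.6])
v = np.array([0.1, 0.8, 0.6])
for theta in (0.0, 0.1, 0.5, 1.0):
    print(theta, mix_methylation(theta, u, v))
```

With a burden of zero the mixed profile equals the normal plasma profile, and a cluster whose tumor and normal levels coincide (the third one here) carries no information about the burden.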
Fig. 3. The predicted ctDNA burden for simulated normal and cancer plasma samples. (a) Predicted ctDNA burdens for normal samples, whose true ctDNA burden should be zero. (b) Predicted and true ctDNA burdens for cancer samples. Each dot represents a prediction with the true (x-axis) and predicted (y-axis) ctDNA burdens. Correct and incorrect predictions are shown in cyan and red, respectively, in both (a) and (b).
Fig. 4. Classification performance of three methods (CancerLocator, RF, and SVM) on the ten subsets of simulation data. Each subset includes plasma cfDNA samples at a certain cancer stage (represented as a ctDNA burden range).
Confusion matrix of prediction results on the real plasma samples
| Method | True class | Predicted: Breast | Predicted: Colon | Predicted: Kidney | Predicted: Liver | Predicted: Lung | Predicted: Non-cancer |
|---|---|---|---|---|---|---|---|
| CancerLocator | Breast | | 0 | 0 | 0 | 0 | 30 |
| CancerLocator | Liver | 0 | 0 | 20 | | 33 | 4 |
| CancerLocator | Lung | 14 | 0 | 0 | 10 | | 28 |
| CancerLocator | Non-cancer | 0 | 0 | 10 | 17 | 1 | |
| Random forest | Breast | **0** | 0 | 1 | 0 | 1 | 48 |
| Random forest | Liver | 3 | 3 | 10 | | 7 | 214 |
| Random forest | Lung | 4 | 0 | 1 | 0 | | 114 |
| Random forest | Non-cancer | 0 | 0 | 0 | 1 | 0 | |
| SVM | Breast | **0** | 0 | 0 | 0 | 15 | 35 |
| SVM | Liver | 0 | 0 | 13 | | 34 | 177 |
| SVM | Lung | 0 | 0 | 1 | 0 | | 93 |
| SVM | Non-cancer | 0 | 0 | 1 | 0 | 12 | |

Numbers in bold are correct predictions. Empty diagonal cells mark bold values that are missing from the source text.
Fig. 5. The relationship between ctDNA burden and tumor tissue prediction for each plasma sample of the real data. Each point represents a real plasma sample. The plot shows the average estimated tumor burden (y-axis) and the most frequently predicted tumor type (dot color) among ten runs for each plasma sample.
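The per-sample summary described in this caption (average burden and most frequent predicted type over ten runs) could be computed along these lines. This is a sketch only: the function name `summarize_runs`, the tuple layout, and the example values are hypothetical and do not come from the paper's data.

```python
from collections import Counter
from statistics import mean


def summarize_runs(runs):
    """Summarize repeated predictions for one plasma sample.

    runs: list of (predicted_type, estimated_burden) tuples, one per run.
    Returns (most frequent predicted type, average estimated burden).
    Ties are resolved in favor of the type seen first.
    """
    if not runs:
        raise ValueError("need at least one run")
    types = [predicted for predicted, _ in runs]
    burdens = [burden for _, burden in runs]
    top_type, _count = Counter(types).most_common(1)[0]
    return top_type, mean(burdens)


# Hypothetical ten runs for a single sample.
runs = [("liver", 0.10)] * 6 + [("lung", 0.05)] * 3 + [("non-cancer", 0.0)]
top_type, avg_burden = summarize_runs(runs)
print(top_type, avg_burden)
```

Averaging the burden over all runs, rather than only over the runs that agree with the majority type, is one possible reading of the caption.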
Fig. 6. Illustration of the data partition for learning discriminating features, in both simulation and real data experiments. Note that the simulation and real data experiments share the same subset (25%) of normal plasma samples.