Abstract
Advances in detecting circulating tumor cells (CTCs) and cell-free DNAs (cfDNAs) in blood, serum, and plasma have made non-invasive diagnosis of cancer promising. A few studies have reported good correlations between signals from tumor tissues and those from CTCs or cfDNAs, making it possible to detect cancers using CTCs and cfDNAs. However, such detection cannot tell which cancer type a person has. To meet this challenge, we developed an algorithm, eTumorType, to identify cancer types based on copy number variations (CNVs) of the cancer founding clone. eTumorType integrates cancer hallmark concepts with several computational techniques: stochastic gradient boosting, voting, centroids, and leading patterns. eTumorType was trained and validated on a large dataset comprising 18 common cancer types and 5327 tumor samples. It produced high accuracies (0.86-0.96) and high recall rates (0.79-0.92) for predicting colon, brain, prostate, and kidney cancers. In addition, relatively high accuracies (0.78-0.92) and recall rates (0.58-0.95) were achieved for predicting ovarian, breast luminal, lung, endometrial, stomach, head and neck, leukemia, and skin cancers. These results suggest that eTumorType could be used for non-invasive diagnosis to determine cancer types based on CNVs of CTCs and cfDNAs.
Keywords: Cancer; Copy number variation; Founding clone; Non-invasive detection; eTumorType
Year: 2017 PMID: 28389380 PMCID: PMC5414714 DOI: 10.1016/j.gpb.2017.01.004
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Table 1. Cancer types and sample sizes in the somatic founding clone CNV profile
| Cancer type | Abbreviation | No. of samples |
|---|---|---|
| Ovarian serous cystadenocarcinoma | OV | 538 |
| Breast invasive carcinoma (luminal subtype) | LUMINAL | 531 |
| Colon adenocarcinoma/rectum adenocarcinoma | COAD/READ | 513 |
| Glioblastoma multiforme | GBM | 467 |
| Kidney renal clear cell carcinoma | KIRC | 415 |
| Lung squamous cell carcinoma | LUSC | 403 |
| Uterine corpus endometrial carcinoma | UCEC | 401 |
| Lung adenocarcinoma | LUAD | 395 |
| Head and neck squamous cell carcinoma | HNSC | 335 |
| Brain lower grade glioma | LGG | 244 |
| Thyroid carcinoma | THCA | 203 |
| Stomach adenocarcinoma | STAD | 177 |
| Bladder urothelial carcinoma | BLCA | 151 |
| Prostate adenocarcinoma | PRAD | 149 |
| Cervical squamous cell carcinoma and endocervical adenocarcinoma | CESC | 149 |
| Breast invasive carcinoma (basal subtype) | BASAL | 91 |
| Skin cutaneous melanoma | SKCM | 83 |
| Acute myeloid leukemia | LAML | 82 |
Note: CNV, copy number variation.
Figure 1. Scheme of the eTumorType algorithm
Pair-wise GO-ada model: the CNV profiles (rows for genes and columns for samples) of cancer type 1 (cancer 1) and cancer type 2 (cancer 2) were used to select significant DAGs. DAGs associated with six cancer hallmarks, as annotated with GO terms, were retained, and 12 GO sets were selected (see Method) and input into the ada R package to build GO-ada models. Average number of GO-ada models for a sample: for a given sample, the numbers of GO-ada models favoring each cancer type prediction were assembled into a matrix based on all the 12 × models, and the vector of the average number of GO-ada models was then created. Next, the cancer-type centroid matrix was built from the training set by collecting, for each cancer type (the columns of the matrix), the centroid vector of the average number of GO-ada models over all 18 possible cancer types (the rows of the matrix). Centroid-based prediction: for a given new sample, the vector of the average number of GO-ada models favoring each cancer type prediction was calculated and then correlated with the centroid vector of each cancer type (each column of the cancer-type centroid matrix). The correlation coefficients were ranked and the three top-ranked cancer types were selected as the final prediction. Leading pattern-weighted prediction: the same procedure as the centroid-based prediction was performed, except that a weighted correlation replaced the simple correlation (see Method). GO, Gene Ontology; CNV, copy number variation; DAG, differentially-amplified gene.
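The centroid-based prediction step described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_centroids` and `predict_top_k`, the use of the per-type mean as the centroid, the use of Pearson correlation for the "simple correlation", and the four-type toy vote vectors are all assumptions made for illustration. The real method uses 18 cancer types and vote vectors derived from the GO-ada models.

```python
import numpy as np

def build_centroids(vote_vectors, labels, cancer_types):
    """Build a centroid matrix: column j is the mean vote vector of the
    training samples labelled cancer_types[j] (mean is an assumption)."""
    vote_vectors = np.asarray(vote_vectors, dtype=float)
    labels = np.asarray(labels)
    columns = [vote_vectors[labels == t].mean(axis=0) for t in cancer_types]
    return np.column_stack(columns)

def predict_top_k(sample_vector, centroids, cancer_types, k=3):
    """Rank cancer types by Pearson correlation (an assumption) between the
    sample's vote vector and each centroid column; return the k best."""
    sample_vector = np.asarray(sample_vector, dtype=float)
    corr = np.array([
        np.corrcoef(sample_vector, centroids[:, j])[0, 1]
        for j in range(centroids.shape[1])
    ])
    order = np.argsort(corr)[::-1][:k]  # highest correlation first
    return [cancer_types[i] for i in order]

# Hypothetical toy data: average numbers of models voting for each type.
cancer_types = ["OV", "GBM", "KIRC", "LUSC"]
train_vectors = [
    [10, 2, 3, 4], [9, 3, 2, 5],    # OV samples
    [2, 11, 3, 1], [3, 10, 2, 2],   # GBM samples
    [1, 2, 12, 3], [2, 3, 11, 2],   # KIRC samples
    [4, 1, 2, 10], [5, 2, 3, 11],   # LUSC samples
]
train_labels = ["OV", "OV", "GBM", "GBM", "KIRC", "KIRC", "LUSC", "LUSC"]

centroids = build_centroids(train_vectors, train_labels, cancer_types)
top3 = predict_top_k([8, 2, 3, 6], centroids, cancer_types, k=3)
print(top3)
```

Under these assumptions, the leading pattern-weighted variant would differ only in replacing the plain correlation with a weighted one; the weights are defined in the Method section and are not reproduced here.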
Figure 2. The average number of GO-ada models voting for possible cancer types for LUMINAL, LAML, LUSC, and HNSC datasets
The average numbers of GO-ada models voting for possible cancer types are shown as boxplots. Each box spans the 25th to 75th percentiles. The circles represent values below the 10th percentile or above the 90th percentile. The abbreviations of cancers are explained in Table 1.
Correlation coefficients of centroids among the training, validation, and test sets for various cancer types
| LUSC | 1 | 1 | 1 | 1 |
| LUMINAL | 0.99 | 0.99 | 0.99 | 0.99 |
| GBM | 0.99 | 1 | 1 | 0.99 |
| KIRC | 1 | 0.99 | 0.99 | 0.99 |
| LGG | 0.99 | 0.99 | 1 | 0.99 |
| OV | 1 | 0.99 | 0.99 | 0.99 |
| THCA | 0.99 | 0.99 | 0.99 | 0.99 |
| COAD/READ | 0.99 | 0.99 | 0.98 | 0.98 |
| HNSC | 0.98 | 0.99 | 0.98 | 0.98 |
| UCEC | 0.99 | 0.98 | 0.99 | 0.98 |
| CESC | 0.97 | 0.97 | 0.99 | 0.97 |
| LUAD | 0.99 | 0.99 | 0.97 | 0.97 |
| PRAD | 0.97 | 0.98 | 0.97 | 0.97 |
| BLCA | 0.97 | 0.96 | 0.98 | 0.96 |
| BASAL | 0.96 | 0.96 | 0.98 | 0.96 |
| STAD | 0.96 | 0.97 | 0.97 | 0.96 |
| LAML | 0.97 | 0.93 | 0.97 | 0.93 |
| SKCM | 0.92 | 0.89 | 0.96 | 0.89 |
Note: The abbreviations of cancers are explained in Table 1.
Accuracy and power of centroid-based cancer type prediction using top-1 selections
| Cancer type | Accuracy (training) | Power (training) | Accuracy (validation) | Power (validation) | Accuracy (test) | Power (test) |
|---|---|---|---|---|---|---|
| OV | 0.76 | 0.76 | 0.82 | 0.89 | 0.72 | 0.72 |
| LUMINAL | 0.73 | 0.51 | 0.81 | 0.55 | 0.60 | 0.42 |
| LUAD | 0.66 | 0.56 | 0.75 | 0.68 | 0.65 | 0.42 |
| LUSC | 0.59 | 0.82 | 0.67 | 0.91 | 0.51 | 0.73 |
| COAD/READ | 0.79 | 0.59 | 0.89 | 0.70 | 0.66 | 0.46 |
| GBM | 0.77 | 0.81 | 0.86 | 0.89 | 0.74 | 0.81 |
| UCEC | 0.46 | 0.10 | 0.70 | 0.35 | 0.40 | 0.13 |
| THCA | 0.29 | 0.86 | 0.35 | 0.85 | 0.27 | 0.93 |
| STAD | 0.53 | 0.43 | 0.62 | 0.51 | 0.38 | 0.39 |
| LGG | 0.71 | 0.64 | 0.73 | 0.71 | 0.46 | 0.47 |
| PRAD | 0.62 | 0.71 | 0.69 | 0.73 | 0.39 | 0.50 |
| KIRC | 0.74 | 0.49 | 0.80 | 0.64 | 0.73 | 0.39 |
| HNSC | 0.57 | 0.49 | 0.61 | 0.64 | 0.36 | 0.36 |
| CESC | 0.50 | 0.73 | 0.51 | 0.63 | 0.28 | 0.43 |
| LAML | 0.66 | 0.98 | 0.71 | 0.75 | 0.44 | 0.41 |
| BLCA | 0.62 | 0.56 | 0.54 | 0.47 | 0.39 | 0.23 |
| BASAL | 0.33 | 0.93 | 0.34 | 0.72 | 0.30 | 0.72 |
| SKCM | 0.52 | 0.74 | 0.47 | 0.41 | 0.36 | 0.50 |
| Average | 0.60 | 0.65 | 0.66 | 0.67 | 0.48 | 0.50 |
Note: The abbreviations of cancers are explained in Table 1.
Accuracy and power of centroid-based cancer type prediction using top-3 selections
| Cancer type | Accuracy (training) | Power (training) | Accuracy (validation) | Power (validation) | Accuracy (test) | Power (test) |
|---|---|---|---|---|---|---|
| OV | 0.84 | 0.93 | 0.86 | 0.94 | 0.78 | 0.92 |
| LUMINAL | 0.93 | 0.80 | 0.93 | 0.73 | 0.88 | 0.66 |
| LUAD | 0.89 | 0.76 | 0.81 | 0.81 | 0.90 | 0.59 |
| LUSC | 0.82 | 0.98 | 0.76 | 0.98 | 0.77 | 0.95 |
| COAD/READ | 0.93 | 0.83 | 0.92 | 0.79 | 0.83 | 0.65 |
| GBM | 0.94 | 0.90 | 0.90 | 0.91 | 0.89 | 0.87 |
| UCEC | 0.92 | 0.35 | 0.85 | 0.43 | 0.78 | 0.50 |
| THCA | 0.48 | 0.93 | 0.46 | 0.88 | 0.46 | 0.98 |
| STAD | 0.90 | 0.72 | 0.81 | 0.63 | 0.66 | 0.64 |
| LGG | 0.91 | 0.94 | 0.78 | 0.82 | 0.77 | 0.88 |
| PRAD | 0.78 | 0.93 | 0.76 | 0.83 | 0.68 | 0.93 |
| KIRC | 0.91 | 0.92 | 0.87 | 0.80 | 0.86 | 0.77 |
| HNSC | 0.86 | 0.85 | 0.71 | 0.78 | 0.68 | 0.75 |
| CESC | 0.74 | 0.91 | 0.65 | 0.80 | 0.63 | 0.73 |
| LAML | 0.83 | 0.98 | 0.81 | 0.81 | 0.81 | 0.76 |
| BLCA | 0.80 | 0.81 | 0.74 | 0.57 | 0.65 | 0.43 |
| BASAL | 0.59 | 1.00 | 0.48 | 0.83 | 0.52 | 0.89 |
| SKCM | 0.85 | 0.94 | 0.67 | 0.59 | 0.67 | 0.75 |
| Average | 0.83 | 0.86 | 0.77 | 0.77 | 0.73 | 0.76 |
Note: The abbreviations of cancers are explained in Table 1.
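To make the difference between top-1 and top-3 selections concrete, the following Python sketch computes a per-type recall (power) when a prediction counts as correct if the true cancer type appears among the first k ranked candidates. This scoring rule, the function name `top_k_recall`, and the toy labels are assumptions for illustration; the exact definitions used for the tables are given in the paper's Method section, not here.

```python
def top_k_recall(true_labels, ranked_predictions, cancer_type, k=3):
    """Fraction of samples of `cancer_type` whose true label appears among
    the first k ranked predictions (assumed definition of top-k recall)."""
    hits = 0
    total = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        if truth != cancer_type:
            continue  # only samples of the type of interest count
        total += 1
        if truth in ranked[:k]:
            hits += 1
    return hits / total if total else float("nan")

# Hypothetical ranked outputs for four samples (best candidate first).
true_labels = ["OV", "OV", "GBM", "OV"]
ranked_predictions = [
    ["OV", "LUSC", "GBM"],
    ["LUSC", "OV", "KIRC"],
    ["GBM", "OV", "LUSC"],
    ["KIRC", "LUSC", "GBM"],
]

recall_top1 = top_k_recall(true_labels, ranked_predictions, "OV", k=1)
recall_top3 = top_k_recall(true_labels, ranked_predictions, "OV", k=3)
print(recall_top1, recall_top3)
```

Because the top-3 candidates always include the top-1 candidate, top-3 recall under this rule can never be lower than top-1 recall for the same samples.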
Accuracy and power of leading pattern-weighted correlation prediction of cancer types using top-3 selections
| Cancer type | Accuracy (training) | Power (training) | Accuracy (validation) | Power (validation) | Accuracy (test) | Power (test) |
|---|---|---|---|---|---|---|
| OV | 0.90 | 0.94 | 0.83 | 0.93 | 0.80 | 0.93 |
| LUMINAL | 0.93 | 0.97 | 0.85 | 0.88 | 0.83 | 0.85 |
| LUAD | 0.96 | 0.86 | 0.91 | 0.87 | 0.82 | 0.70 |
| LUSC | 0.85 | 0.98 | 0.80 | 0.96 | 0.78 | 0.95 |
| COAD/READ | 0.96 | 0.95 | 0.94 | 0.88 | 0.87 | 0.79 |
| GBM | 0.97 | 0.91 | 0.92 | 0.92 | 0.92 | 0.89 |
| UCEC | 0.94 | 0.73 | 0.81 | 0.58 | 0.84 | 0.66 |
| THCA | 0.80 | 0.98 | 0.62 | 0.90 | 0.56 | 0.98 |
| STAD | 0.98 | 0.89 | 0.81 | 0.71 | 0.80 | 0.67 |
| LGG | 0.97 | 0.97 | 0.91 | 0.82 | 0.90 | 0.88 |
| PRAD | 0.94 | 0.94 | 0.93 | 0.87 | 0.96 | 0.87 |
| KIRC | 0.96 | 0.98 | 0.91 | 0.84 | 0.86 | 0.88 |
| HNSC | 0.93 | 0.90 | 0.81 | 0.91 | 0.80 | 0.76 |
| CESC | 0.78 | 0.93 | 0.67 | 0.73 | 0.65 | 0.73 |
| LAML | 1.00 | 0.96 | 0.92 | 0.69 | 0.91 | 0.59 |
| BLCA | 0.93 | 0.89 | 0.76 | 0.63 | 0.71 | 0.40 |
| BASAL | 0.90 | 0.96 | 0.67 | 0.67 | 0.70 | 0.78 |
| SKCM | 0.94 | 0.94 | 0.83 | 0.59 | 0.92 | 0.69 |
| Average | 0.92 | 0.93 | 0.83 | 0.80 | 0.81 | 0.78 |
Note: The abbreviations of cancers are explained in Table 1.