Irina Kalatskaya, Quang M Trinh, Melanie Spears, John D McPherson, John M S Bartlett, Lincoln Stein.
Abstract
BACKGROUND: A key step in cancer genome analysis is the identification of somatic mutations in the tumor. This is typically done by comparing the genome of the tumor to the reference genome sequence derived from a normal tissue taken from the same donor. However, there are a variety of common scenarios in which matched normal tissue is not available for comparison.
Keywords: Matching normal tissue; Next-generation sequencing; Somatic mutation; Variant classification
Year: 2017 PMID: 28659176 PMCID: PMC5490163 DOI: 10.1186/s13073-017-0446-9
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Characteristics of cancer datasets used for training and/or validation
| Dataset (source) | Number of samples | Number of samples used in testing | Mean read depth after filtering [95% CI] | Mutation calling pipeline | Total number of somatic/germline SNVs in all samples (a) | Mean somatic SNVs per sample [95% CI] (a) | Mean germline SNVs per sample [95% CI] (a) | Ratio somatic to germline (after collapsing) |
|---|---|---|---|---|---|---|---|---|
| UCEC (TCGA) | 251 | 151 | 88.84 [88.77, 88.92] | bambam_v1.4 | 38,012/504,241 | 147.015 [88.36, 205.66] | 2,008.92 [1,972.23, 2,045.62] | 2:1 |
| BRCA (TCGA) | 500 | 400 | 85.92 [85.87, 85.97] | bambam_v1.4 | 5,556/1,037,432 | 10.77 [9.05, 12.48] | 2,074.86 [2,051.26, 2,098.46] | 1:6 |
| COAD (TCGA) | 215 | 115 | 122.17 [122.06, 122.28] | carnac_v1.0 | 60,624/1,932,510 | 276.68 [191.78, 361.58] | 8,988.41 [8,826.01, 9,150.82] | 1:1 |
| KIRC (TCGA) | 304 | 204 | 177.59 [177.46, 177.73] | carnac_v1.0 | 10,489/2,416,155 | 33.56 [31.60, 35.51] | 7,947.87 [7,792.68, 8,103.07] | 1:7 |
| PAAD (TCGA) | 146 | 46 | 363.09 [362.80, 363.37] | carnac_v1.0 | 5,593/1,263,918 | 37.08 [33.59, 40.58] | 8,656.48 [8,587.71, 8,725.25] | 1:10.5 |
| ESO (dbGaP) | 145 | 45 | 58.39 [58.33, 58.44] | MuTect2 | 26,098/790,051 | 181.85 [150.65, 213.05] | 5,451.51 [5,307.16, 5,595.85] | 1:2.5 |
All datasets were sequenced using Illumina technology
(a) Only non-silent variants in coding regions with read depth >10 and PASS somatic mutation caller filtering were taken into account
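The inclusion criteria in the table footnote can be sketched as a simple predicate. This is a minimal illustration only: the `Variant` record and its field names are hypothetical and are not taken from the ISOWN code; only the four criteria (coding, non-silent, read depth >10, PASS) come from the footnote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical record; field names are illustrative assumptions.
    in_coding_region: bool
    is_silent: bool
    read_depth: int
    caller_filter: str  # value of the caller's filter field, e.g. "PASS"


def passes_table_filter(v: Variant, min_depth: int = 10) -> bool:
    """Keep coding, non-silent variants with depth > min_depth and a PASS flag."""
    return (
        v.in_coding_region
        and not v.is_silent
        and v.read_depth > min_depth  # strictly greater than 10, per the footnote
        and v.caller_filter == "PASS"
    )
```

Note that the depth comparison is strict, so a variant with a read depth of exactly 10 is excluded.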
List of features used in the classifiers, types of their values, and source of data
| Features | Type of value | Internal or external | Number of distinct values |
|---|---|---|---|
| COSMIC_CNT | Integer | External database | Numeric |
| ExAC | Boolean | External database | 2 |
| dbSNP | Boolean | External database | 2 |
| Mutation assessor | Categorical | External database | 5 |
| PolyPhen-2 | Categorical | External database | 3 |
| Sequence context | Categorical | Human genome | 64 |
| Sample frequency (SF) | Double | Internal data | Numeric |
| Variant allele frequency | Double | Internal data | Numeric |
| Flanking regions | Double | Internal data | Numeric |
| Substitution pattern | Categorical | Internal data | 6 |
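The table lists 64 distinct values for sequence context and 6 for substitution pattern but does not spell out the encodings. One encoding consistent with those counts, shown here purely as an assumption, is a trinucleotide (the variant base plus one flanking base on each side, 4^3 = 64) and the six strand-collapsed substitution classes (C>A, C>G, C>T, T>A, T>C, T>G). The function names below are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_pattern(ref: str, alt: str) -> str:
    """Collapse an SNV onto a pyrimidine reference base, giving 6 classes.

    Assumed encoding: purine-reference changes (A or G) are reported as
    the equivalent change on the complementary strand.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ for an SNV")
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def sequence_context(upstream: str, ref: str, downstream: str) -> str:
    """Assumed encoding: the reference base with one flanking base per side."""
    return f"{upstream}{ref}{downstream}"
```

Enumerating every possible reference/alternate pair with `substitution_pattern` yields exactly six labels, and enumerating every base triple with `sequence_context` yields 64, matching the counts in the table.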
Fig. 1 ISOWN framework for somatic mutation prediction. Variants, retrieved either directly from the TCGA portal as VCF files or through the GATK/MuTect2 pipeline (see the “Implementation” section for details), were annotated using a series of external databases. Low-quality calls were removed by applying a standard set of filters. Only coding, non-silent variants were taken into account (unless otherwise indicated). After flanking regions and variant allele frequencies were calculated for each variant and the data were collapsed into a unique set of variants (see the “Implementation” section), some variants were pre-labeled as germline based on their presence in dbSNP/common_all but not in COSMIC, or as somatic because more than one hundred samples carrying that mutation had been submitted to COSMIC (CNT > 100). The best machine learning algorithm was selected using tenfold cross-validation. One hundred randomly selected samples from each dataset were used for classifier training, and final accuracies were calculated on the remaining samples.
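The pre-labeling rule described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and arguments are hypothetical, and "not in COSMIC" is represented here as a COSMIC count of zero, which the source does not confirm.

```python
from typing import Optional


def prelabel(
    in_dbsnp_common: bool,
    cosmic_cnt: int,
    somatic_cnt_threshold: int = 100,
) -> Optional[str]:
    """Return "somatic", "germline", or None when the variant is left to the classifier."""
    # Somatic: more than 100 samples with this mutation submitted to COSMIC.
    if cosmic_cnt > somatic_cnt_threshold:
        return "somatic"
    # Germline: present in dbSNP/common_all and absent from COSMIC
    # (absence is modeled as cosmic_cnt == 0, an assumption).
    if in_dbsnp_common and cosmic_cnt == 0:
        return "germline"
    return None
```

Variants that satisfy neither rule return `None` and would be passed on to the trained classifier.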
Fig. 2 Tenfold cross-validation. We generated 1000 training sets, each containing 700 randomly selected somatic and 700 germline variants from each cancer set. ISOWN was validated using different machine learners (shown in different colors). The plot shows the average F1-measure (upper panel), false positive rate (middle panel), and AUC (lower panel) across the 1000 training sets.
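For reference, the F1-measure and false positive rate reported in the figure follow the standard confusion-matrix definitions. The sketch below assumes that somatic is the positive class, which the caption does not state; the function names are illustrative.

```python
def f1_measure(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): the share of true negatives called positive."""
    denom = fp + tn
    return fp / denom if denom else 0.0
```

Under the stated assumption, a false positive is a germline variant misclassified as somatic, so the false positive rate measures how often germline variants leak into the predicted somatic set.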
Fig. 3 ISOWN validation using different machine learning algorithms for six whole-exome sequencing datasets. NBC (green), LADTree (red), and random forest (blue) were trained on a gradually increasing number of samples (x-axis). The F1-measure was calculated on a held-out independent sample set across the six cancer datasets.
Fig. 4 Cross-cancer validation. NBC (upper panel) and LADTree (lower panel) classifiers were trained using variants from 100 samples of the cancer type indicated on the x-axis and validated on the cancer set indicated on the y-axis.
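The train-on-one, validate-on-another layout of this figure can be sketched generically. Everything here is a hypothetical illustration: `train` and `score` are placeholders supplied by the caller, not the NBC or LADTree implementations used in the study, and each dataset is assumed to be already split into a training part and a held-out part.

```python
from typing import Any, Callable, Dict, Tuple


def cross_cancer_grid(
    datasets: Dict[str, Tuple[Any, Any]],
    train: Callable[[Any], Any],
    score: Callable[[Any, Any], float],
) -> Dict[Tuple[str, str], float]:
    """Train on each cancer's training part and score on every held-out part.

    `datasets` maps a cancer name to (training_part, held_out_part).
    The result is keyed by (training cancer, validation cancer).
    """
    grid: Dict[Tuple[str, str], float] = {}
    for train_name, (train_part, _) in datasets.items():
        model = train(train_part)  # one model per training cancer
        for test_name, (_, held_out) in datasets.items():
            grid[(train_name, test_name)] = score(model, held_out)
    return grid
```

With six cancer sets this produces a 6 x 6 grid, where the diagonal entries correspond to training and validating within the same cancer type on disjoint samples.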