Thomas C A Smith, Antony M Carr, Adam C Eyre-Walker.
Abstract
Across independent cancer genomes it has been observed that some sites have been recurrently hit by single nucleotide variants (SNVs). Such recurrently hit sites might be either (i) drivers of cancer that are positively selected during oncogenesis, (ii) due to mutation rate variation, or (iii) due to sequencing and assembly errors. We have investigated the cause of recurrently hit sites in a dataset of >3 million SNVs from 507 complete cancer genome sequences. We find evidence that many sites have been hit significantly more often than one would expect by chance, even taking into account the effect of the adjacent nucleotides on the rate of mutation. We find that the density of these recurrently hit sites is higher in non-coding than coding DNA and hence conclude that most of them are unlikely to be drivers. We also find that most of them are found in parts of the genome that are not uniquely mappable and hence are likely to be due to mapping errors. In support of the error hypothesis, we find that recurrently hit sites are not randomly distributed across sequences from different laboratories. We fit a model to the data in which the rate of mutation is constant across sites but the rate of error varies. This model suggests that ∼4% of all SNVs are errors in this dataset, but that the rate of error varies by thousands-of-fold between sites.
Keywords: Cancer; Mutation; Mutation rate variation; Sequencing error; Somatic; Variation
Year: 2016 PMID: 27688957 PMCID: PMC5036107 DOI: 10.7717/peerj.2391
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Observed and expected values for the distribution of SNVs for sites hit from 0–7 times.
(A) shows data for the whole interrogable human genome, excluding simple sequence repeats. (B) shows data for all bases in the genome that are uniquely mappable at 100 base pairs. (C) the same as B but for 20 base pairs. P < 0.001 for observing >7 sites with 3 SNVs in (A), (B) and (C) if SNVs were randomly distributed throughout the genome.
| Site Type | 0 hits | 1 hit | 2 hits | 3 hits | 4 hits | 5 hits | 6 hits | 7 hits |
|---|---|---|---|---|---|---|---|---|
| (A) Whole interrogable genome, excluding simple sequence repeats | | | | | | | | |
| Non-Exon TE obs (TE) | 1344972042 | 1649680 | 7034 | 762 | 130 | 26 | 9 | 3 |
| Non-Exon TE exp (TE) | 1344964359 | 1663896 | 1430 | 1.14 | 9E−04 | 7E−07 | 5E−10 | 4E−13 |
| Non-Exon Non-TE obs (NTE) | 1321454397 | 1527967 | 3171 | 188 | 35 | 6 | 2 | 2 |
| Non-Exon Non-TE exp (NTE) | 1321451907 | 1532655 | 1206 | 0.86 | 6E−04 | 4E−07 | 3E−10 | 2E−13 |
| Exon obs (EX) | 119708384 | 97488 | 245 | 23 | 0 | 0 | 1 | 0 |
| Exon exp (EX) | 119708145 | 97939 | 57 | 0.03 | 2E−05 | 7E−09 | 3E−12 | 1E−15 |
| Total obs | 2786134823 | 3275135 | 10450 | 973 | 165 | 32 | 12 | 5 |
| Total exp | 2786124411 | 3294490 | 2692 | 2.04 | 2E−03 | 1E−06 | 8E−10 | 5E−13 |
| (B) Uniquely mappable at 100 base pairs | | | | | | | | |
| Non-Exon TE obs (TE) | 1223239922 | 1517676 | 3927 | 266 | 25 | 11 | 5 | 1 |
| Non-Exon TE exp (TE) | 1223236637 | 1523873 | 1322 | 1.07 | 9E−04 | 7E−07 | 5E−10 | 4E−13 |
| Non-Exon Non-TE obs (NTE) | 1276165087 | 1499761 | 2698 | 97 | 16 | 2 | 0 | 1 |
| Non-Exon Non-TE exp (NTE) | 1276163336 | 1503124 | 1201 | 0.88 | 6E−04 | 5E−07 | 3E−10 | 2E−13 |
| Exon obs (EX) | 112360615 | 93084 | 185 | 16 | 0 | 0 | 0 | 0 |
| Exon exp (EX) | 112360453 | 93392 | 55 | 0.03 | 2E−05 | 7E−09 | 3E−12 | 1E−15 |
| Total obs | 2611765624 | 3110521 | 6810 | 379 | 41 | 13 | 5 | 2 |
| Total exp | 2611760426 | 3120389 | 2578 | 2 | 2E−03 | 1E−06 | 8E−10 | 6E−13 |
| (C) Uniquely mappable at 20 base pairs | | | | | | | | |
| Non-Exon TE obs (TE) | 388613299 | 480820 | 741 | 9 | 0 | 0 | 0 | 0 |
| Non-Exon TE exp (TE) | 388612958 | 481494 | 417 | 0.34 | 3E−04 | 2E−07 | 2E−10 | 1E−13 |
| Non-Exon Non-TE obs (NTE) | 892370709 | 1061716 | 1621 | 31 | 4 | 1 | 0 | 1 |
| Non-Exon Non-TE exp (NTE) | 892369874 | 1063340 | 868 | 0.65 | 5E−04 | 3E−07 | 2E−10 | 2E−13 |
| Exon obs (EX) | 74735962 | 61034 | 103 | 6 | 0 | 0 | 0 | 0 |
| Exon exp (EX) | 74735883 | 61187 | 36 | 0.02 | 9E−06 | 4E−09 | 2E−12 | 7E−16 |
| Total obs | 1355719970 | 1603570 | 2465 | 46 | 4 | 1 | 0 | 1 |
| Total exp | 1355718714 | 1606021 | 1321 | 1 | 8E−04 | 6E−07 | 4E−10 | 3E−13 |
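The claim that recurrently hit sites are denser in non-coding than in coding DNA can be checked roughly against the observed counts for the whole interrogable genome (the first set of TE, NTE and EX rows above). The sketch below assumes, for illustration only, that a "recurrently hit" site is any site with two or more SNVs; the threshold `min_hits` and the function name `recurrent_density` are not taken from the source.

```python
# Observed site counts for the whole interrogable genome (panel A),
# indexed by the number of SNVs per site (0..7), copied from the table.
observed = {
    "TE":  [1344972042, 1649680, 7034, 762, 130, 26, 9, 3],
    "NTE": [1321454397, 1527967, 3171, 188, 35, 6, 2, 2],
    "EX":  [119708384, 97488, 245, 23, 0, 0, 1, 0],
}


def recurrent_density(counts, min_hits=2):
    """Fraction of sites hit at least `min_hits` times.

    The >= 2 hits threshold is an assumption made for this sketch,
    not necessarily the definition used in the paper.
    """
    total_sites = sum(counts)
    recurrent_sites = sum(counts[min_hits:])
    return recurrent_sites / total_sites


densities = {name: recurrent_density(c) for name, c in observed.items()}
for name, density in densities.items():
    print(f"{name}: {density:.2e} recurrently hit sites per site")
```

Under this simple definition the exon fraction has the lowest density of the three, which is in line with the direction of the effect described in the abstract.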
Figure 1. The number of sites with 0–7 SNVs per site for: Main = all data; M100 = sites that are uniquely mappable at 100 base-pairs; M20 = sites that are uniquely mappable at 20 base-pairs; Expected = the expected number of SNVs per site drawn from a Poisson distribution using all data.
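As a rough illustration of the Poisson expectation mentioned in the caption, the sketch below fits a single genome-wide rate to the "Total obs" row for the whole interrogable genome and computes the expected number of sites with each hit count. It assumes one homogeneous rate for every site; the abstract states that the authors take the effect of adjacent nucleotides into account, so the "exp" rows in the table are not expected to match this simplified calculation exactly. The function name `poisson_expected` is introduced here for illustration.

```python
import math

# "Total obs" row for the whole interrogable genome, indexed by hits per site.
total_obs = [2786134823, 3275135, 10450, 973, 165, 32, 12, 5]

n_sites = sum(total_obs)
n_snvs = sum(k * n for k, n in enumerate(total_obs))
lam = n_snvs / n_sites  # mean SNVs per site under one homogeneous rate (assumption)


def poisson_expected(k, lam, n_sites):
    """Expected number of sites with exactly k hits under Poisson(lam)."""
    log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return n_sites * math.exp(log_p)


expected = [poisson_expected(k, lam, n_sites) for k in range(len(total_obs))]
for k, (obs, exp) in enumerate(zip(total_obs, expected)):
    print(f"{k} hits: observed {obs}, expected {exp:.3g}")
```

Even under this simplified model, the observed number of sites with three or more hits exceeds the expectation by orders of magnitude, which is the pattern the table reports.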
The fit of 4 models to the observed distribution of recurrent SNVs in the three different genomic fractions (A) TE, (B) NTE and (C) EX.
Median shape parameters are given for models 1b and 2b, and the median eta is given for model 2b.
| Model | Number of parameters | Log-likelihood | Shape | ε |
|---|---|---|---|---|
| Non-Exon TE (TE) | | | | |
| 1a | 2 | −269283.00 | 0.13 | |
| 1b | 64 | −2935.80 | 0.12 | |
| 2a | 3 | −266889.00 | 0.00021 | 0.956 |
| Non-Exon Non-TE (NTE) | | | | |
| 1a | 2 | −227728.00 | 0.31 | |
| 1b | 64 | −1206.53 | 0.37 | |
| 2a | 3 | −227026.00 | 0.00039 | 0.963 |
| Exon (EX) | | | | |
| 1a | 2 | −13877.9 | 0.18 | |
| 1b | 64 | −270.47 | 0.22 | |
| 2a | 3 | −13843.30 | 0.00019 | 0.966 |
Notes.
Italics indicate the best fit as determined by a likelihood ratio test.
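The likelihood ratio test mentioned in the notes can be sketched for one pair of models from the table. The example below reads the integer after each model name as its number of parameters and assumes, without confirmation from the source, that model 1a is nested within model 2a and differs by exactly one free parameter, so that twice the log-likelihood difference is compared against a chi-squared distribution with one degree of freedom. The helper name `lrt_one_df` is hypothetical.

```python
import math

# (number of parameters, log-likelihood) for models 1a and 2a, from the table.
fits = {
    "TE":  {"1a": (2, -269283.00), "2a": (3, -266889.00)},
    "NTE": {"1a": (2, -227728.00), "2a": (3, -227026.00)},
    "EX":  {"1a": (2, -13877.9),   "2a": (3, -13843.30)},
}


def lrt_one_df(loglik_null, loglik_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic and the chi-squared (df = 1) p-value.
    For df = 1 the survival function is erfc(sqrt(x / 2)).
    Assumes loglik_alt >= loglik_null, as holds for nested models.
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


results = {}
for fraction, models in fits.items():
    (_, ll_null), (_, ll_alt) = models["1a"], models["2a"]
    results[fraction] = lrt_one_df(ll_null, ll_alt)
    stat, p = results[fraction]
    print(f"{fraction}: 2*dlnL = {stat:.1f}, p = {p:.3g}")
```

Comparisons involving the 64-parameter model 1b would need a chi-squared distribution with more degrees of freedom (for example via a statistics library), and only make sense if the models are in fact nested.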
Figure 2. The fit of the observed recurrent SNV distribution to the expected distribution under the favoured model, 2b, for (A) TE, (B) NTE and (C) EX genomic fractions.