Rebecca F Halperin1, John D Carpten2, Zarko Manojlovic3, Jessica Aldrich4, Jonathan Keats4, Sara Byron1, Winnie S Liang5, Megan Russell4, Daniel Enriquez5, Ana Claasen5, Irene Cherni4, Baffour Awuah6, Joseph Oppong6, Max S Wicha7, Lisa A Newman8, Evelyn Jaigge7, Seungchan Kim4, David W Craig9,10.
Abstract
BACKGROUND: Significant clinical and research applications are driving large-scale adoption of individualized tumor sequencing in cancer in order to identify tumor-specific mutations. When a matched germline sample is available, somatic mutations may be identified using comparative callers. However, matched germline samples are frequently not available, such as with archival tissues, which makes it difficult to distinguish somatic from germline variants. While population databases may be used to filter out known germline variants, recent studies have shown that private germline variants result in an inflated false positive rate in unmatched tumor samples, and the number of germline false positives in an individual may be related to ancestry.
Keywords: Cancer; Copy number alterations; Germline variant; Next generation sequencing; Precision medicine; Somatic mutation; Tumor purity
Year: 2017 PMID: 29052513 PMCID: PMC5649057 DOI: 10.1186/s12920-017-0296-8
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Samples and sequencing statistics
| Sample name | Cancer type | Self reported ancestry | Reads (Millions) | Mean target coverage | % Covered at 100X |
|---|---|---|---|---|---|
| GBM6-HA | Glioblastoma | Hispanic | 897 | 553X | 95.7 |
| GBM1-EA | Glioblastoma | Caucasian | 2768 | 2101X | 98.3 |
| TNBC3-AA | Breast | African American | 905 | 599X | 96.1 |
| TNBC4-AA | Breast | African American | 906 | 546X | 95.4 |
| TNBC6-AA | Breast | African American | 1040 | 658X | 95.4 |
| TNBC7-AA | Breast | African American | 899 | 588X | 95.6 |
| TNBC11-EA | Breast | Caucasian | 1066 | 645X | 95.8 |
| TNBC14-EA | Breast | Caucasian | 875 | 454X | 94.5 |
| TNBC15-GH | Breast | Ghanaian | 598 | 1012X | 97.4 |
Description of the patient samples and sequencing metrics of the patients used in the evaluation dataset. Note that GBM1-EA was sequenced to higher mean target coverage for in-silico dilution and down-sampling experiments. Blood samples from each patient were also sequenced to establish true variant classification
Model input parameters
| Parameter | Default value or source | Description |
|---|---|---|
| K | 3 | Number of clones |
| | 0.5 | Mode of prior distribution of f |
| | 1.5 | Determines shape of prior distribution of f |
| | 0.1, 0.15, 0.5, 0.15, 0.1 | Copy number priors |
| | [0.25; 0.5; 0.25] | Minor allele copy number priors |
| | 1E-5 | Segmentation significance cutoff |
| | COSMIC | Number of cancer variants observed at the position |
| | 1000 Genomes | Population allele frequencies |
| | 1E-5, 1E-6 | Constant for calculating the prior probability of a somatic variant |
| | 7.14E-5, 1.43E-5 | Population allele frequencies assigned to alleles not seen in input population |
| Fmax-somatic | 1E-3 | Maximum population allele frequency to be considered a possible somatic variant |
| | 10 | Minimum mapping quality to count read |
| | 5 | Minimum base quality to count base |
| Τ | 0.99 | Minimum posterior probability of belonging to the PASS group to be called pass |
| Τ | 0.8 | Minimum posterior probability that the variant is somatic to be called somatic |
| Τ | 0.8 | Minimum posterior probability that the variant is germline to be called germline |
Where applicable, default values were determined empirically on an independent data set
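The three posterior thresholds in the table can be read as a simple decision rule. The following is a minimal sketch of that reading, not the published implementation: the order in which the thresholds are checked, the class and function names, and the "filtered" and "uncertain" labels are all assumptions made for illustration, and germline heterozygous and homozygous calls are collapsed into one label.

```python
from dataclasses import dataclass

# Default thresholds taken from the parameter table; the constant names are illustrative.
PASS_THRESHOLD = 0.99      # minimum posterior of belonging to the PASS group
SOMATIC_THRESHOLD = 0.8    # minimum posterior of being somatic
GERMLINE_THRESHOLD = 0.8   # minimum posterior of being germline


@dataclass
class Posteriors:
    """Posterior probabilities for one candidate variant (hypothetical container)."""
    p_pass: float
    p_somatic: float
    p_germline: float


def call_variant(post: Posteriors) -> str:
    """Assign a label by checking the quality threshold first, then the genotype thresholds."""
    if post.p_pass < PASS_THRESHOLD:
        return "filtered"    # assumed label for positions failing the PASS threshold
    if post.p_somatic >= SOMATIC_THRESHOLD:
        return "somatic"
    if post.p_germline >= GERMLINE_THRESHOLD:
        return "germline"
    return "uncertain"       # assumed label when neither posterior is high enough


print(call_variant(Posteriors(p_pass=0.995, p_somatic=0.9, p_germline=0.05)))
```

Under this reading, a variant whose somatic and germline posteriors both fall below 0.8 receives neither call, which is one way the allele fraction overlap described in the figure captions would translate into missed somatic variants.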
Other model parameters and variables
| Variable | Description |
|---|---|
| RT, RB | Total read depth, B allele read depth |
| πS, πAB, πAA | Prior probability of somatic, germline heterozygous, germline homozygous variant |
| | Mean mapping quality of reads supporting the A or B allele |
| | Mean base quality of bases supporting B allele |
| X | Total number of exons |
| Y | Number of heterozygous germline variants |
| Z | Number of somatic variants |
| G | Number of segments |
| fi | Fraction of cells in the sample with the variants in clone i |
| C | Centering parameter |
| W | Controls the spread of the allelic fraction distributions |
| N | Total copy number |
| M | Minor allele copy number |
| ϕS, ϕG | Expected allele fraction of somatic or germline variant |
| IS, Ij | Index of clonal subset containing somatic variant or copy number variant |
| Q* | Number of copy number altered exons |
| GAA, GAB | Germline homozygous or heterozygous genotype |
| O | Other genotype besides somatic, germline homozygous AA, or germline heterozygous AB |
| U | Unknown genotype due to poor mapping |
| i | Index of clonal subset {1, 2, …, K} |
| j | Index of segment {1, 2, …, G} |
| s | Index of somatic variant {1, 2, …, Z} |
| h | Index of heterozygous variant {1, 2, …, Y} |
| n | Index of exon {1, 2, …, X} |
Key to notation used in describing model
Fig. 1 Correlation between ancestry and the effectiveness of using database filters to identify somatic variants. a The distribution and number of variants unique to an individual across 2503 individuals from Phase 3 of 1000 Genomes, plotted as a violin plot for each of 26 different populations (indicated by their 3-letter code) and colored by ancestral super population. b The number of private variants, after filtering through 1000 Genomes and ExAC, for 150 individuals not previously sequenced, shown as a color-metric bubble chart positioned by their principal components of common variation (>1%). c The distribution of private variants within the groups from the PCA plot in b, for individuals clustering near those of European, Asian, and African ancestry
Fig. 2 Overview of Variant Calling Strategy. After filtering candidate variant positions by quality, an EM approach is used to fit a model of clonal allelic copy number. The plots on the left show example copy number plots for three conditions: the top panels show high tumor content and moderate coverage, the middle panels high tumor content and high coverage, and the bottom panels moderate tumor content and moderate coverage. A one copy loss is detected in the segment indicated by the blue line in the left-most column. Next, the expected somatic and germline allelic fractions are modeled in the subsequent columns. The center two columns plot the expected allelic fractions for germline variants (grey), the somatic main clone (blue), and somatic sub-clones (green and red) for diploid regions (left) and one copy loss regions (right). We can see that with high tumor content and moderate coverage, the main clone distribution overlaps with the germline and is difficult to detect in the diploid region, while the red sub-clone is more difficult to detect in the one copy loss region. Increasing the coverage increases the sharpness of the distributions, making the somatic variants easier to detect. In the moderate tumor content sample, all clones are easy to differentiate from germline in the diploid region, but the main clone is hard to detect in the one copy loss region. Using these distributions to calculate conditional probabilities, as well as using 1000 Genomes population frequencies and COSMIC mutation counts to calculate prior probabilities, somatic and germline variants can be called. The right-most columns show plots of the allelic fractions of germline (grey) and somatic variants colored by clone. In these, an encircled ‘+’ indicates the variant was detected and an empty ‘o’ indicates a false negative. As expected, in the high tumor content, moderate coverage condition, variants in the main clone are detected better in the deleted region, and the number of variants detected increases in the high coverage condition
Fig. 3 Allele Frequencies of Somatic and Germline Variants and Required Coverage for Somatic Variant Detection by Simulation. The top half of each graph shows the expected allele frequency of somatic (blue) and germline (red) variants by tumor content (x-axis) for different copy number states (plot titles; N indicates total copy number, M indicates minor allele copy number). The bottom half of each graph shows the coverage required (indicated by the color) to reach the power indicated by the y-label. Black squares indicate that the detection power was not achieved even at the highest coverage evaluated. We can see that the closer the somatic and germline allele frequencies, the more difficult it is to detect somatic variants
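The dependence of expected allele frequency on tumor content and copy number state can be illustrated with a standard mixture calculation. This is a sketch under stated assumptions rather than the model's exact equations: normal cells are taken to be diploid, the germline variant is heterozygous and sits on the minor allele, the somatic variant is clonal, and `mutated_copies` (the number of tumor copies carrying the somatic variant) is a parameter introduced here for illustration. The symbols f, N, and M follow the notation table.

```python
def expected_allele_fractions(f, n_total, m_minor, mutated_copies=1):
    """Return (phi_somatic, phi_germline) for tumor fraction f and tumor copy state (N, M).

    Assumes a mixture of tumor cells (fraction f, total copy number N, minor
    allele copy number M) and diploid normal cells (fraction 1 - f).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("tumor fraction f must be between 0 and 1")
    if m_minor < 0 or m_minor > n_total - m_minor:
        raise ValueError("minor allele copy number must be between 0 and N/2")
    # Average number of copies of the locus per cell in the mixed sample.
    total_copies = f * n_total + 2.0 * (1.0 - f)
    if total_copies == 0:
        raise ValueError("locus is absent from the sample (f = 1 and N = 0)")
    # Germline het variant: M copies in tumor cells, 1 copy in normal cells.
    phi_germline = (f * m_minor + (1.0 - f)) / total_copies
    # Somatic variant: present only in tumor cells.
    phi_somatic = f * mutated_copies / total_copies
    return phi_somatic, phi_germline


for f in (0.25, 0.5, 0.75, 1.0):
    print(f, expected_allele_fractions(f, n_total=2, m_minor=1))
```

Under these assumptions, a diploid heterozygous region (N = 2, M = 1) at 50% tumor content gives a somatic fraction of 0.25 against a germline fraction of 0.5, whereas a one copy loss (N = 1, M = 0) at the same tumor content gives 1/3 for both, which is the kind of overlap that makes somatic variants hard to separate from germline ones.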
Fig. 4 Comparison of Calls of True Somatic Variants and True Values of Variants Called Somatic. The graphs on the left show the calls of LumosVar (bottom bar in pair) compared to the filtering approach (top bar in pair) in calling true somatic variants. The size of the yellow portions of the bars indicates the number of true somatic variants falsely called germline heterozygous or homozygous, the grey represents true somatic variants that were filtered on quality or not detected as variants, and the blue represents true positive somatic calls. We can see that the filtering approach has better sensitivity (mean TPR 87%, range 78%–96%) compared to the tumor only caller (mean TPR 52%, range 27%–62%). The graphs on the right show the number of somatic calls by LumosVar (bottom bar in pair) compared to the filtering approach (top bar in pair) that are truly germline private heterozygous (red), germline heterozygous database variants (pink), homozygous (grey), or truly somatic (blue). We can see that the tumor only caller has better precision (mean PPV 75%, range 56%–89%) compared to the filtering approach (mean PPV 35%, range 19%–55%). The top pair of panels shows the comparison for eight of the nine evaluation samples. The middle pair of panels shows the comparison for an in-silico dilution series performed using the ninth evaluation sample (GBM1-EA), while the bottom pair of panels shows a down-sampling experiment on the same sample
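The TPR and PPV values quoted in this caption follow the usual definitions, sketched below for reference. The function names and the counts in the example are hypothetical and are not taken from the study; they only show how a caller can lose sensitivity while gaining precision.

```python
def true_positive_rate(true_positives, false_negatives):
    """Sensitivity: fraction of true somatic variants that were called somatic."""
    total = true_positives + false_negatives
    return true_positives / total if total else float("nan")


def positive_predictive_value(true_positives, false_positives):
    """Precision: fraction of somatic calls that are truly somatic."""
    total = true_positives + false_positives
    return true_positives / total if total else float("nan")


# Hypothetical counts for one sample with 100 true somatic variants.
filter_tp, filter_fn, filter_fp = 87, 13, 160   # database-filtering approach
caller_tp, caller_fn, caller_fp = 52, 48, 17    # tumor-only caller

print(true_positive_rate(filter_tp, filter_fn), positive_predictive_value(filter_tp, filter_fp))
print(true_positive_rate(caller_tp, caller_fn), positive_predictive_value(caller_tp, caller_fp))
```

In the left-hand graphs, false negatives correspond to the yellow and grey portions of the bars; in the right-hand graphs, false positives correspond to the red, pink, and grey portions.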
Fig. 5 Simulations were used to predict the power to detect each true somatic variant assuming the sample fraction and copy number were correctly called. For each clone and each sample, the true positive rate is plotted against the power predicted from the simulations. The size of the bubble is proportional to the number of true positive variants in each clone, the color of the points represents the sample fraction of the clone, and the number indicates the sample number. As expected, the highest sample fraction clone has the worst predicted and observed sensitivity. The graph on the left includes all of the true somatic variants, and the graph on the right includes only those that pass the quality filters. We can see that the predicted power correlates well with the measured sensitivity, particularly when the low quality variants are excluded