Nicholas Hutson, Fenglin Zhan, James Graham, Mitsuko Murakami, Han Zhang, Sujana Ganaparti, Qiang Hu, Li Yan, Changxing Ma, Song Liu, Jun Xie, Lei Wei.
Abstract
BACKGROUND: Multi-sample comparison is commonly used in cancer genomics studies. With next-generation sequencing (NGS), a mutation's status in a specific sample can be measured by the number of reads supporting the mutant or wildtype allele. When no mutant reads are detected, this could represent either a true negative mutation status or a false negative caused by an insufficient number of reads at that position (insufficient "coverage"). To minimize the chance of a false negative, we should consider the mutation status as "unknown" instead of "negative" when the coverage is inadequate. There is no established method for determining the coverage threshold between negative and unknown statuses. A common solution is to apply a universal minimum coverage (UMC). However, this method relies on an arbitrarily chosen threshold and does not take into account the mutations' relative abundances, which can vary dramatically by mutation type. The result can be misclassification between negative and unknown statuses.
Keywords: Genetic testing; Liquid biopsy; Negative status; Next-generation sequencing; Personalized medicine; Tumor heterogeneity
Year: 2021 PMID: 34856988 PMCID: PMC8638096 DOI: 10.1186/s12920-021-00880-8
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
A hypothetical example of the negative data problem
| Mutation | Biopsy #1: read counts (mutant/total) | Biopsy #1: VAF (%) | Biopsy #1: status | Biopsy #2: read counts (mutant/total) | Biopsy #2: VAF (%) | Biopsy #2: status #1 (UMC 20X) | Biopsy #2: status #2 (MSN) |
|---|---|---|---|---|---|---|---|
| | 40/72 | 56 | Positive | 0/15 | 0 | Unknown | Negative |
| | 5/100 | 5 | Positive | 0/30 | 0 | Negative | Unknown |
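The UMC column of the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function names `vaf_percent` and `umc_status` are hypothetical, and treating the 20X threshold as inclusive (coverage of at least 20 reads counts as sufficient) is an assumption.

```python
def vaf_percent(mutant_reads, total_reads):
    """Variant allele frequency (VAF) as a rounded percentage."""
    return round(100 * mutant_reads / total_reads)


def umc_status(total_reads, min_coverage=20):
    """UMC rule for a non-positive sample.

    Assumption: the threshold is inclusive, so coverage >= min_coverage
    is called Negative and anything lower is called Unknown.
    """
    return "Negative" if total_reads >= min_coverage else "Unknown"


# Biopsy #2 read counts (mutant, total) from the two rows of the table.
biopsy2 = [(0, 15), (0, 30)]
statuses = [umc_status(total) for _, total in biopsy2]
print(statuses)  # ['Unknown', 'Negative']
```

Because the rule looks only at coverage, it cannot see that the first mutation was abundant in Biopsy #1 (VAF 56%) while the second was rare (VAF 5%), which is why its calls differ from the MSN column.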
Fig. 1 Flowchart illustrating the process of defining a mutation’s status in multiple related samples. Beginning with an input data matrix containing the numbers of mutant and wildtype reads for every mutation in all related samples (top), the mutation's status in each sample is classified in two steps. First, positive sample/mutation pairs are identified. We assume that users have completed this step using their preferred method before running MSN. The MSN method does not create or remove positive statuses but reports them directly to the output (left). Second, for every mutation, each “non-positive” sample is compared with every positive sample to test the null hypothesis that the two samples contain the same frequency of mutant reads. If and only if this null hypothesis is rejected against all positive samples, the non-positive sample is considered negative (right); otherwise, it is classified as unknown due to low coverage (middle). The output is a data matrix containing all updated mutation statuses (bottom)
A step-by-step example of differentiating “unknown” from “negative” status using the MSN method
| Sample | Mutant reads | Total reads (coverage) | Status |
|---|---|---|---|
| A | 10 | 20 | Positive |
| B | 0 | 20 | TBD* |
| C | 4 | 9 | Positive |
| D | 0 | 8 | TBD* |
| E | 0 | 5 | TBD* |
*TBD: to be determined
**By Fisher's exact test
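The two-step procedure can be sketched in Python using the read counts from the table above. This is an illustrative sketch under stated assumptions, not the published MSN implementation: the names `fisher_two_sided` and `msn_classify` are hypothetical, the test is assumed to be two-sided, the default cutoff of 0.05 is one of the two thresholds mentioned in the figure captions, and leaving non-positive samples as Unknown when a mutation has no positive sample is an assumption.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one. Integer
    weights are compared to avoid floating-point ties.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(k):
        return comb(col1, k) * comb(n - col1, row1 - k)

    observed = weight(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return extreme / comb(n, row1)


def msn_classify(samples, alpha=0.05):
    """Classify each non-positive sample as Negative or Unknown.

    samples maps a sample name to (mutant_reads, total_reads, status).
    Positive statuses are passed through unchanged.
    """
    positives = [(m, t) for m, t, s in samples.values() if s == "Positive"]
    result = {}
    for name, (m, t, status) in samples.items():
        if status == "Positive":
            result[name] = "Positive"
        elif not positives:
            # Assumption: with no positive sample there is nothing to compare to.
            result[name] = "Unknown"
        else:
            rejected_all = all(
                fisher_two_sided(pm, pt - pm, m, t - m) < alpha
                for pm, pt in positives
            )
            result[name] = "Negative" if rejected_all else "Unknown"
    return result


samples = {
    "A": (10, 20, "Positive"),
    "B": (0, 20, "TBD"),
    "C": (4, 9, "Positive"),
    "D": (0, 8, "TBD"),
    "E": (0, 5, "TBD"),
}
statuses = msn_classify(samples)
print(statuses)
```

Under these assumptions, sample B is called Negative because the null hypothesis is rejected against both A and C, while D and E remain Unknown because at least one comparison with a positive sample is not rejected at the 0.05 level.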
Fig. 2 Evaluation of the performance of negative-defining methods in a simulated dataset. From bottom to top: we simulated four scenarios with tumor cell fractions of 90%, 20%, 5% and 1%. Each scenario was independently simulated three times (referred to as measurements) to mimic multiple sampling. Only mutations that were positive in at least one of the three measurements after simulation were included. X-axis: negative-defining methods, including MSN with two thresholds (p < 0.01 and p < 0.05) and UMC with four thresholds (minimum coverage for a non-positive sample to be considered negative: 20X, 50X, 200X and 300X). Y-axis: percentage of defined mutation statuses by type (Unknown: non-positive, but the coverage was too low to be considered negative; FN: false negative; TN: true negative; FP: false positive; TP: true positive). Note that the negative-defining methods do not affect positive mutation statuses (TP and FP)
Fig. 3 Evaluation of the performance of negative-defining methods using dual-platform single-cell sequencing data. Two negative-defining methods, UMC and MSN, were tested in a single-cell dual-platform whole-exome sequencing dataset using varying thresholds, including MSN (p < 0.01, p < 0.05) and UMC (minimum coverage of 10X, 20X, 50X, 100X, 300X and 100X), as indicated under each dot. The overall performance of each method with a specific threshold was evaluated by (1) Y-axis: the total number of informative data points after excluding unknown statuses, with “informative data points” defined as the single-cell/mutation pairs where both platforms (NXT and AGL) yielded an informative mutation status, i.e., either positive or negative, but not unknown; (2) X-axis: the concordance of mutation statuses between the two platforms (NXT and AGL), defined as the percentage of informative data points where the two platforms yielded the same mutation status, either both positive or both negative, for the same single cell
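The two axes defined in the Fig. 3 caption can be computed as follows. This is a minimal sketch with made-up status calls; the function name `platform_concordance` and the toy data are hypothetical and do not come from the study's dataset.

```python
def platform_concordance(nxt_calls, agl_calls):
    """Return (informative_count, concordance_percent) for paired calls.

    Each list holds one status per single-cell/mutation pair, in the same
    order. A pair is informative only if neither platform reports Unknown.
    Concordance is None when there are no informative pairs.
    """
    informative = [
        (a, b)
        for a, b in zip(nxt_calls, agl_calls)
        if a != "Unknown" and b != "Unknown"
    ]
    if not informative:
        return 0, None
    agreeing = sum(a == b for a, b in informative)
    return len(informative), 100.0 * agreeing / len(informative)


# Hypothetical calls for five single-cell/mutation pairs.
nxt = ["Positive", "Negative", "Unknown", "Positive", "Negative"]
agl = ["Positive", "Positive", "Negative", "Unknown", "Negative"]
count, percent = platform_concordance(nxt, agl)
print(count, round(percent, 1))  # 3 66.7
```

The two numbers pull against each other: a stricter threshold turns more calls into Unknown, which lowers the informative count, and the caption's plot shows how concordance changes in return.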