Dollina D Dodani, Matthew H Nguyen, Ryan D Morin, Marco A Marra, Richard D Corbett.
Abstract
Formalin fixation of paraffin-embedded tissue samples is a well-established method for preserving tissue and is routinely used in clinical settings. Although formalin-fixed, paraffin-embedded (FFPE) tissues are deemed crucial for research and clinical applications, the fixation process results in molecular damage to nucleic acids, thus confounding their use in genome sequence analysis. Methods to improve genomic data quality from FFPE tissues have emerged, but there remains significant room for improvement. Here, we use whole-genome sequencing (WGS) data from matched fresh frozen (FF) and FFPE tissue samples to optimize a sensitive and precise FFPE single nucleotide variant (SNV) calling approach. We present methods to reduce the prevalence of false-positive SNVs by applying combinatorial techniques to five publicly available variant callers. We also introduce FFPolish, a novel variant classification method that efficiently classifies FFPE-specific false-positive variants. Our combinatorial and statistical techniques improve precision and F1 scores compared to the results of publicly available tools when tested individually.
Keywords: FFPE (formalin fixed paraffin-embedded); combinatorics; machine learning; somatic variant calling; whole genome
Year: 2022 PMID: 35571031 PMCID: PMC9092826 DOI: 10.3389/fgene.2022.834764
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
Summary of the BLGSP and HTMCP samples. Fold redundancy of genome sequencing coverage (X) is indicated.
| | HTMCP | BLGSP |
|---|---|---|
| Genome Reference | hg19 | hg38 |
| FFPE Tumour | N = 39; HiSeq 2500; 50.3X | N = 34; HiSeq X; 68.9X |
| FF Tumour | N = 39; HiSeq 2500; 82.5X | N = 34; HiSeq 2500; 82.4X |
| FF Normal | N = 39; HiSeq 2500; 42.8X | N = 34; HiSeq 2500; 41.0X |
FIGURE 1. FFPE somatic variants were identified using five callers. The ground truth used to evaluate the FFPE variants was generated from the Strelka2 and Mutect2 variants called in FF tumours (A). Recall_est and precision_est are calculated by comparing against the intersection and the union of the Strelka2 and Mutect2 FF results, respectively. The FFPE results from the five callers are collated into groups of three and intersected in a Venn-like fashion. Each of the 127 possible combinations of the seven Venn intersection parts is compared against the ground truth (B). The results reported in (B) are from sample BLGSP-71-06-00001-01B-01E.
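The combinatorial evaluation described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy variant identifiers are hypothetical, and the scoring assumes that estimated recall is the fraction of the FF intersection recovered and that estimated precision is the fraction of FFPE calls found in the FF union, which is one reading of the caption. Three call sets split into seven disjoint Venn parts, and choosing any non-empty subset of those parts gives 2^7 - 1 = 127 combinations.

```python
from itertools import combinations, product

def venn_regions(a, b, c):
    """Split three call sets into the 7 disjoint parts of a Venn diagram.

    Keys are (in_a, in_b, in_c) membership tuples; empty parts are kept so
    that the number of part combinations is always 2**7 - 1 = 127.
    """
    regions = {key: set() for key in product([False, True], repeat=3) if any(key)}
    for variant in a | b | c:
        regions[(variant in a, variant in b, variant in c)].add(variant)
    return regions

def score(calls, truth_intersection, truth_union):
    """Assumed definitions: recall vs. the FF intersection, precision vs. the FF union."""
    recall = len(calls & truth_intersection) / len(truth_intersection) if truth_intersection else 0.0
    precision = len(calls & truth_union) / len(calls) if calls else 0.0
    return recall, precision

def evaluate_combinations(a, b, c, truth_intersection, truth_union):
    """Score every non-empty combination of Venn parts against the ground truth."""
    regions = venn_regions(a, b, c)
    keys = sorted(regions)
    results = []
    for size in range(1, len(keys) + 1):
        for chosen in combinations(keys, size):
            calls = set().union(*(regions[k] for k in chosen))
            recall, precision = score(calls, truth_intersection, truth_union)
            results.append((chosen, recall, precision))
    return results

# Toy data (hypothetical variant IDs, not from the study).
ff_strelka2 = {"v1", "v2", "v3", "v8"}
ff_mutect2 = {"v1", "v2", "v4", "v9"}
caller_a = {"v1", "v2", "v3", "v5"}
caller_b = {"v1", "v2", "v4", "v6"}
caller_c = {"v1", "v3", "v4", "v7"}

results = evaluate_combinations(
    caller_a, caller_b, caller_c,
    truth_intersection=ff_strelka2 & ff_mutect2,
    truth_union=ff_strelka2 | ff_mutect2,
)
best_precision = max(results, key=lambda r: r[2])
best_recall = max(results, key=lambda r: r[1])
```

Selecting only the `(True, True, True)` part corresponds to the three-way intersection, and selecting all seven parts corresponds to the union of the trio, which are the two cases highlighted in the results.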
FIGURE 2. Description of the FFPolish workflow. The training data were generated using Strelka2 FFPE VCFs and the intersection of Strelka2 and Mutect2 FF VCFs. Users may use any somatic variant callers of choice in place of those listed in parentheses. The model is trained on features extracted from FFPE BAM files and Strelka2 FFPE VCF files, and is built by hyperparameter optimization of logistic regression using grid search and 10-fold cross-validation. The generated model can be applied to any new FFPE VCF and BAM file to obtain a filtered FFPE VCF. Users can train a new model if more FFPE data with matched FF results become available in the future.
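The model-building step in this caption (logistic regression tuned by grid search with 10-fold cross-validation) can be illustrated with a small, dependency-free sketch. It is not the FFPolish implementation: the two features, the synthetic data, the choice of the L2 penalty as the tuned hyperparameter, and the use of F1 as the selection metric are all assumptions made for illustration. In practice a library such as scikit-learn (`LogisticRegression` with `GridSearchCV`) would usually replace the hand-written fitting loop.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg(X, y, l2, epochs=100, lr=0.5):
    """Fit L2-penalised logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, X):
    """Label a variant 1 (keep) when the predicted probability is at least 0.5."""
    w, b = model
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5) for xi in X]

def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def grid_search_cv(X, y, l2_grid, k=10, seed=0):
    """Return the L2 penalty with the best mean F1 over k folds, plus all mean scores."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    mean_f1 = {}
    for l2 in l2_grid:
        scores = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            model = fit_logreg([X[i] for i in train], [y[i] for i in train], l2)
            scores.append(f1_score([y[i] for i in fold], predict(model, [X[i] for i in fold])))
        mean_f1[l2] = sum(scores) / k
    return max(mean_f1, key=mean_f1.get), mean_f1

# Synthetic stand-in for features extracted from FFPE BAM/VCF files.
# Label 1 = variant also present in the FF truth set, 0 = FFPE-specific call.
rng = random.Random(1)
X, y = [], []
for _ in range(100):
    label = rng.randint(0, 1)
    X.append([rng.gauss(0.4 if label else 0.1, 0.1),   # placeholder feature 1
              rng.gauss(1.0 if label else 0.0, 1.0)])  # placeholder feature 2
    y.append(label)

best_l2, mean_f1 = grid_search_cv(X, y, l2_grid=[0.0, 0.01, 0.1, 1.0])
```

The selected penalty would then be used to refit the model on all training variants before applying it to a new FFPE call set.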
FIGURE 3. Recall_est, precision_est, and F1_est of tools tested individually (A) compared to the combinations and intersections of tools and FFPolish that generated the top three results (B). Where data points overlap, they have been merged and represented by a single point. For the combinatorial method, a union of the three tools was used for the highest recall_est and an intersection for maximum precision_est and F1_est, as shown in the legend. Special Venn cases are indicated by * and described in the legend. The regions of the Venn diagram used are shaded in black. The intersection of LoFreq, Shimmer, and Mutect2 resulted in the best precision_est for the BLGSP (96.89%) and HTMCP (97.78%) cohorts. The intersected trios of (LoFreq, Strelka2, Mutect2) and (LoFreq, Shimmer, Mutect2) also obtained the best F1_est of 0.770 and 0.751 for BLGSP and HTMCP, respectively. The union of Strelka2, Shimmer, and Mutect2 generated the highest recall_est (87.76%) for BLGSP, while the union of Strelka2, Virmid, and Mutect2 returned the highest recall_est (81.39%) for HTMCP.
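For readers comparing the three metrics in this caption, the estimated F1 score is presumably the usual harmonic mean applied to the estimated precision and recall. This is an assumption based on the standard definition, not a formula stated in the text above. Both inputs are taken as fractions, so a percentage such as 96.89% enters as 0.9689.

$$
F1_{est} = \frac{2 \cdot \mathrm{precision}_{est} \cdot \mathrm{recall}_{est}}{\mathrm{precision}_{est} + \mathrm{recall}_{est}}
$$

Because the harmonic mean is pulled toward the smaller of its two inputs, a three-way intersection with very high precision can still score a lower F1 than a combination with more balanced precision and recall.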