Raunaq Malhotra, Manjari Jha, Mary Poss, Raj Acharya.
Abstract
We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or as rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that have traditionally been used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that the k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes produces 4 to 500 times fewer false positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available from GitHub at https://github.com/raunaq-m/MultiRes.
Keywords: Multi-resolution frames; Next-generation sequencing; Random forest classifier; Reference free methods; Sequencing error detection; Viral populations
Year: 2017 PMID: 28819548 PMCID: PMC5548337 DOI: 10.1016/j.csbj.2017.07.001
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
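As an illustration of the k-mer counting that underlies the abstract's description, a minimal sketch in Python; the function name, read set, and k value here are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def kmer_counts(reads, k=35):
    """Count every k-mer of length k across a set of reads.

    Illustrative only: MultiRes trains on counts like these, but this
    routine is a hypothetical stand-in, not the tool's k-mer counter.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k over each read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy example with short reads and k=4 for readability.
print(kmer_counts(["ACGTACGT", "CGTACGTA"], k=4).most_common(2))
```

In a real pipeline the counts would be computed by a dedicated k-mer counter over millions of reads; the sketch only shows the definition being counted.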
Fig. 1. Performance of classification algorithms for erroneous versus rare-variant k-mer classification. The performance of the listed classification algorithms for classifying 35-mers is compared over two sets of features: the 35-mers are projected onto (a) 23-mers, 13-mers, and combined 13 + 23-mers, and (b) 15-mers, 15 + 20-mers, 15 + 20 + 25-mers, and 15 + 20 + 25 + 30-mers. The accuracy reported is over fivefold cross-validation on 35-mers extracted from an HIV viral population. Accuracy improves when 35-mers are projected onto smaller k′-mers and as the number of projections increases. The random forest classifier has the best accuracy across the classification algorithms.
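The projection of a 35-mer onto smaller k′-mers described in the caption can be sketched as follows. The `project` helper and the background-count dictionary are illustrative assumptions, a simplified stand-in for the frame projections MultiRes actually computes:

```python
def project(kmer, sub_k, background):
    """Feature vector for one k-mer: the global count of each of its
    constituent sub_k-mers. `background` maps sub_k-mers to their counts
    over all reads (assumed precomputed). Concatenating projections for
    several sub_k sizes yields feature sets like those compared in
    panels (a) and (b)."""
    subs = [kmer[i:i + sub_k] for i in range(len(kmer) - sub_k + 1)]
    return [background.get(s, 0) for s in subs]

# A 5-mer projected onto its 3-mers under a toy background count table.
features = project("ACGTA", 3, {"ACG": 5, "CGT": 2, "GTA": 1})
```

The resulting vectors would be fed to a classifier (a random forest, in the paper's setting) to label each k-mer as erroneous or a rare variant.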
Comparison of performance metrics of error detection on simulated HIV datasets. The FP/TP ratio is the ratio of false positive to true positive predictions, Recall is the percentage of all true k-mers that an algorithm predicts, and Precision is the percentage of an algorithm's predicted k-mers that are true k-mers.
| Algorithm | FP/TP (HIV 100x) | FP/TP (HIV 400x) | Recall (HIV 100x) | Recall (HIV 400x) | Precision (HIV 100x) | Precision (HIV 400x) |
|---|---|---|---|---|---|---|
| Uncorrected | 53 | 121 | 98.91 | 99.67 | 1.85 | 0.82 |
| Quake | 9.26 | 29.5 | 94.84 | | 9.74 | 3.27 |
| BLESS | 0.71 | 76.7 | 98.38 | 99.36 | 58.48 | 1.28 |
| Musket | 0.46 | 121 | 98.46 | 68.48 | 0.82 | |
| BFC | 2.12 | 112 | 98.47 | 99.57 | 32.01 | 0.89 |
| BayesHammer | 0.37 | 69.1 | 98.47 | 98.59 | 73.04 | 1.42 |
| Seecer | 12.1 | 110 | 98.49 | 98.31 | 7.65 | 0.90 |
| MultiRes | | | 95.01 | 98.17 | | |
The False positive/True Positive ratios (FP/TP ratios), Recall, and Precision are compared on two HIV datasets for the methods: Quake, BLESS, Musket, BFC, BayesHammer, Seecer, and the proposed method MultiRes. The error corrected reads from each method are broken into k-mers and compared to the true k-mers in the HIV-1 viral populations. Uncorrected denotes the statistics when no error correction is performed. Bold in each column indicates the best method for the dataset and the metric evaluated.
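The FP/TP ratio, recall, and precision reported in the tables can be computed by comparing a predicted k-mer set against the ground-truth k-mer set. A minimal sketch, assuming both sets fit in memory (the function name is hypothetical):

```python
def kmer_metrics(predicted, truth):
    """FP/TP ratio, recall (%), and precision (%) from a predicted k-mer
    set and the ground-truth k-mer set, as reported in the tables."""
    tp = len(predicted & truth)   # predicted k-mers that are true
    fp = len(predicted - truth)   # predicted k-mers that are false
    return {
        "FP/TP": fp / tp if tp else float("inf"),
        "Recall": 100.0 * tp / len(truth) if truth else 0.0,
        "Precision": 100.0 * tp / len(predicted) if predicted else 0.0,
    }
```

Note the arithmetic link visible in the tables: precision = 100 / (1 + FP/TP), since both are determined by the same TP and FP counts.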
Comparison of performance metrics of different methods on HCV population datasets.
| Algorithm | FP/TP (HCV1P) | FP/TP (HCV2P) | Recall (HCV1P) | Recall (HCV2P) | Precision (HCV1P) | Precision (HCV2P) |
|---|---|---|---|---|---|---|
| Uncorrected | 1201 | 571 | 99.51 | 99.88 | 0.08 | 0.17 |
| Quake | 303.3 | 149 | 96.41 | 97.23 | 0.32 | 0.66 |
| BLESS | 202 | 112 | 98.35 | 97.18 | 0.49 | 0.88 |
| Musket | 938 | 463 | 93.53 | 89.17 | 0.10 | 0.21 |
| BFC | 352 | 161 | 99.32 | 99.84 | 0.28 | 0.61 |
| BayesHammer | 699 | 340 | 98.12 | 97.1 | 0.14 | 0.29 |
| Seecer | 1095 | 528 | | | 0.09 | 0.19 |
| MultiRes | | | 96.5 | 94.25 | | |
The false positive to true positive ratios, Recall, and Precision of error correction methods on the two simulated HCV datasets are shown. Uncorrected refers to the statistics when no error correction is performed. Bold font in each column indicates the best method for each dataset on the evaluated measure.
Fig. 2. Performance of MultiRes on HCV datasets under power-law distributions of viral haplotypes, as a function of k-mer count. 35-mer multiplicity plots for the HCV1P and HCV2P datasets are shown: the x-axis indicates the number of times a 35-mer was observed, while the y-axis indicates the number of 35-mers with that count. (a) The predicted true 35-mers from MultiRes (HCV1P red, HCV2P pink) compared to the uncorrected data (HCV1P blue, HCV2P green), and (b) the true positive rare-variant 35-mers from MultiRes versus the ground-truth 35-mers. MultiRes predicts rare-variant k-mers at all counts greater than 3, with its accuracy improving as the k-mer count increases.
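The multiplicity plot described in the caption is a count-of-counts over the k-mer table. A minimal sketch, assuming a dictionary of k-mer counts is already available:

```python
from collections import Counter

def multiplicity_spectrum(kmer_counts):
    """Count-of-counts: for each multiplicity c, the number of distinct
    k-mers observed exactly c times (the x and y axes of the plot)."""
    return Counter(kmer_counts.values())

# Toy table: two k-mers seen once, one k-mer seen three times.
spectrum = multiplicity_spectrum({"AAA": 1, "AAT": 1, "ATG": 3})
```

Plotting `spectrum` (count on the x-axis, number of distinct k-mers on the y-axis) reproduces the kind of multiplicity curve shown in Fig. 2.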
Comparison of performance metrics on 5-viral mix HIV-1 dataset.
| Algorithm | Recall | Precision | FP/TP ratio | # of unique 35-mers |
|---|---|---|---|---|
| Uncorrected | 98.01 | 0.2 | 439 | 11.4 M |
| BLESS | 97.31 | 0.4 | 227 | 5.89 M |
| Musket | | 0.3 | 366 | 11.2 M |
| BFC | 97.55 | 0.3 | 316 | 9.6 M |
| BayesHammer | 97.49 | 0.8 | 122 | 6.3 M |
| Seecer | 97.84 | 0.5 | 220 | 11.3 M |
| MultiRes | 96.64 | 7.1 | | |
The recall, precision, and FP/TP ratio of each method are evaluated on the 5-viral mix HIV-1 dataset. The last column gives the number of unique 35-mers predicted by each method; there are 53 thousand true unique 35-mers in the consensus sequences of the 5 viral strains. Bold indicates the best method for the measure in each column.
Fig. 3. Runtime comparison on the 5-viral mix dataset. Running times of the different algorithms are compared on 8 GB memory nodes of 2X Dual-Core AMD Opteron 2216 systems from Dell. The time noted for BayesHammer is only that of the BayesHammer error correction step in SPAdes (version 3.6.2). The time reported for MultiRes is the combined time for k-mer counting, classifying k-mers as erroneous or rare variants, and generating the final output.
Comparison with Variant Calling methods on all datasets.
| Dataset | Method | Recall (%) | FP/TP ratio | Precision (%) | # of False negatives | Mapped reads (%) |
|---|---|---|---|---|---|---|
| HIV 100x | LoFreq | 97.33 | 0.004 | 99.60 | 444 | 89.51 |
| | Vphaser | 98.90 | 0.007 | 99.26 | 183 | 89.51 |
| | ShoRAH | 55.21 | | | 7746 | |
| | MultiRes | | 0.011 | 98.88 | | 97.89 |
| HIV 400x | LoFreq | 84.83 | 0 | 99.99 | 2522 | 99.55 |
| | Vphaser | | 0.292 | 77.37 | | 99.55 |
| | ShoRAH | 55.21 | | | 7746 | |
| | MultiRes | 95.57 | 0.007 | 99.33 | 736 | 97.34 |
| HCV1P | LoFreq | | 1.282 | 43.82 | | |
| | Vphaser | 93.51 | 1.628 | 38.05 | 118 | |
| | ShoRAH | 91.92 | | | 147 | |
| | MultiRes | 98.24 | 0.597 | 62.64 | 32 | 97.32 |
| HCV2P | LoFreq | 97.10 | 1.046 | 48.87 | 60 | |
| | Vphaser | 95.65 | 1.492 | 40.13 | 90 | |
| | ShoRAH | 83.73 | | | 337 | 99.95 |
| | MultiRes | | 0.201 | 83.27 | | 85.14 |
| 5-viral mix | LoFreq | 99.06 | 0.085 | 92.15 | 101 | 98.59 |
| | Vphaser | 92.68 | 0.039 | 96.25 | 789 | 98.59 |
| | ShoRAH | 98.66 | | | 109 | |
| | MultiRes | | 0.077 | 92.82 | | 96.29 |
The recall, false positive to true positive (FP/TP) ratios, precision, number of false negatives, and percentage of mapped reads are computed for LoFreq, VPhaser-2, ShoRAH, and MultiRes on the listed datasets. For LoFreq and VPhaser-2, all reads from a sample were aligned with the bwa-mem tool under default settings. ShoRAH uses its own aligner for read alignment and variant calling, while the k-mers detected by MultiRes were aligned using bwa-mem. Outputs from LoFreq (version 2.1.2), VPhaser-2 (last downloaded October 2015), and ShoRAH (last downloaded November 2013) are compared against the known variants for the simulated datasets. For the 5-viral mix, the consensus reference provided by [35] was used to determine the ground-truth variants. MultiRes variants are determined by aligning its 35-mers to a reference sequence and calling bases occurring at a frequency greater than 0.01 as variants. Bold for each dataset indicates the best method for each performance measure.
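The frequency-threshold variant calling described above (non-reference bases above 0.01 frequency) can be sketched as follows. The pileup structure and the function name are assumptions for illustration, not the MultiRes interface:

```python
from collections import Counter

def call_variants(pileup, reference, min_freq=0.01):
    """Call SNPs from per-position base counts: report any non-reference
    base whose frequency exceeds min_freq, mirroring the 0.01 threshold
    applied to aligned 35-mers. `pileup[i]` is a Counter of the bases
    observed at reference position i (assumed built from the alignments,
    which this sketch does not perform)."""
    variants = []
    for pos, bases in enumerate(pileup):
        depth = sum(bases.values())
        if depth == 0:
            continue
        for base, n in bases.items():
            if base != reference[pos] and n / depth > min_freq:
                variants.append((pos, reference[pos], base, n / depth))
    return variants

# Toy pileup over a 2-base reference "AC": a 3% G minority at position 0.
calls = call_variants([Counter({"A": 97, "G": 3}), Counter({"C": 100})], "AC")
```

In the toy example only position 0 yields a call, since the G minority (0.03) exceeds the threshold while position 1 matches the reference exactly.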