| Literature DB >> 33092085 |
David J King1,2, Graham Freimanis1, Lidia Lasecka-Dykes1, Amin Asfor1,3, Paolo Ribeca4, Ryan Waters1, Donald P King1, Emma Laing2.
Abstract
High-throughput sequencing such as those provided by Illumina are an efficient way to understand sequence variation within viral populations. However, challenges exist in distinguishing process-introduced error from biological variance, which significantly impacts our ability to identify sub-consensus single-nucleotide variants (SNVs). Here we have taken a systematic approach to evaluate laboratory and bioinformatic pipelines to accurately identify low-frequency SNVs in viral populations. Artificial DNA and RNA "populations" were created by introducing known SNVs at predetermined frequencies into template nucleic acid before being sequenced on an Illumina MiSeq platform. These were used to assess the effects of abundance and starting input material type, technical replicates, read length and quality, short-read aligner, and percentage frequency thresholds on the ability to accurately call variants. Analyses revealed that the abundance and type of input nucleic acid had the greatest impact on the accuracy of SNV calling as measured by a micro-averaged Matthews correlation coefficient score, with DNA and high RNA inputs (107 copies) allowing for variants to be called at a 0.2% frequency. Reduced input RNA (105 copies) required more technical replicates to maintain accuracy, while low RNA inputs (103 copies) suffered from consensus-level errors. Base errors identified at specific motifs identified in all technical replicates were also identified which can be excluded to further increase SNV calling accuracy. These findings indicate that samples with low RNA inputs should be excluded for SNV calling and reinforce the importance of optimising the technical and bioinformatics steps in pipelines that are used to accurately identify sequence variants.Entities:
Keywords: high-throughput sequencing; sequencing error; sub-consensus variants; viral populations
Year: 2020 PMID: 33092085 PMCID: PMC7594041 DOI: 10.3390/v12101187
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Figure 1An overview of the high-throughput sequencing (HTS) pipelines evaluated in this study. Each pipeline comprised different combinations of laboratory and computational approaches.
The location of the site-directed mutations introduced in plasmids pT7S3 A–D compared with the original pT7S3 wild type. The relative abundance of each used to create the artificial populations is provided along with the resulting single-nucleotide variant (SNV) percentage frequency*. Amplicon position is defined as the number of bases along the polymerase chain reaction (PCR) amplicon.
| Original | Site-Directed Mutagenesis | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| pT7S3 Wild Type | pT7S3 A | pT7S3 B | pT7S3 C | pT7S3 D | Nucleotide Frequency | |||||
| Plasmid Relative Abundance | 0.01% | 1.00% | 10.00% | 88.89% | 0.10% | A | T | C | G | |
| Amplicon position* | 1754 | C | T | T | T | C | 99.89% | 0.11% | ||
| 1932 | G | G | G | G | A | 0.10% | 99.90% | |||
| 2149 | G | A | A | G | G | 11.00% | 89.00% | |||
| 2297 | T | G | G | G | G | 0.01% | 99.99% | |||
| 2323 | A | G | G | G | G | 0.01% | 99.99% | |||
| 2505 | A | A | G | G | A | 1.11% | 98.89% | |||
| 2507 | T | T | G | G | T | 1.11% | 98.89% | |||
| 2755 | G | A | A | A | A | 99.99% | 0.01% | |||
| 2761 | A | T | T | T | A | 0.11% | 99.89% | |||
| 2767 | G | A | A | A | A | 99.99% | 0.01% | |||
| 2791 | C | T | T | T | T | 99.99% | 0.01% | |||
| 2843 | A | C | C | C | C | 0.01% | 99.99% | |||
| 2955 | G | A | A | A | A | 99.99% | 0.01% | |||
| 3106 | G | A | A | A | A | 99.99% | 0.01% | |||
| 3376 | C | A | C | A | C | 89.89% | 10.11% | |||
| 3645 | G | G | G | G | A | 0.10% | 99.90% | |||
| 3661 | G | A | A | A | G | 99.89% | 0.11% | |||
| 3691 | T | G | G | G | T | 0.11% | 99.89% | |||
| 3695 | G | T | T | T | G | 99.89% | 0.11% | |||
| 3697 | T | C | C | C | T | 0.11% | 99.89% | |||
Figure 2The effect of short read aligner choice on SNV calling accuracy. The range of micro-averaged Matthews correlation coefficient (MuMCC) scores of each of the tested aligners (including all parameters) used for all singlet population types. The mean of the MuMCC range is indicated by the solid black dot within each boxplot. Minimum and maximum whiskers on each bar plot indicate the highest and lowest MuMCC scores. RNA High = Red, RNA Medium = Orange, RNA Low = Light orange, DNA High = Blue, DNA Low = Light blue.
Figure 3The effect of qScore and number of replicates on SNV calling accuracy. The range of MuMCC scores of each population input for each qScore parameter and technical replicate combinations tested. Singlet, duplicate, triplicate and quadruplicate technical replicates are represented by a different shade of colour. The solid black dot within each boxplot indicates the mean of the MuMCC distribution. Minimum and maximum whiskers on each bar plot indicate the highest and lowest MuMCC scores.
The optimized computational parameters from all populations for low-frequency SNV characterisation. A list of chosen aligner and qScore, read length and frequency thresholds which produced the highest SNV detection accuracy for all population types and replicate numbers is given. The RNA Low input was removed due to its inability to accurately call low-frequency SNVs.
| Input | Replicate Combinations | Aligner | qScore | Read Length (bp) | Suggested Frequency Cut-Off | MuMCC | |
|---|---|---|---|---|---|---|---|
| RNA |
| Singlet | GEM3 | 38 | 70 | 0.20% | 0.756 |
| Duplicate | 0.20% | 0.816 | |||||
| Triplicate | 0.20% | 0.816 | |||||
| Quadruplicate | 0.20% | 0.816 | |||||
|
| Singlet | GEM3 | 38 | 70 | 0.80% | 0.707 | |
| Duplicate | 0.50% | 0.707 | |||||
| Triplicate | 0.30% | 0.707 | |||||
| Quadruplicate | 0.20% | 0.707 | |||||
| DNA |
| Singlet | GEM3 | 35 | 70 | 0.20% | 1.000 |
| Duplicate | 0.20% | 0.933 | |||||
| Triplicate | 0.04% | 0.891 | |||||
| Quadruplicate | 0.04% | 0.913 | |||||
|
| Singlet | GEM3 | 35 | 70 | 0.20% | 0.707 | |
| Duplicate | 0.20% | 0.707 | |||||
| Triplicate | 0.20% | 0.707 | |||||
| Quadruplicate | 0.20% | 0.707 |
Figure 4The MuMCC range for each population input for each percentage frequency cut-off parameter tested. Data obtained following alignment using GEM3 and fixed qScore and read length parameters. The data are represented is stacked, with singlet, duplicate, triplicate and quadruplicate replicate combinations for each population type represented by a different shade of colour. Darker colours represent a greater number of replicates with increased MuMCC score benefit.
Figure 5Identification of nucleotide substitutions which arose from processed-introduced error. All error base changes (from all frequency cut-offs tested) from each sequenced technical replicate (represent by each of the 4 bar plots in each base change) for each population type were analysed. Regardless of genomic type and abundance, four main False Positive (FP) base changes (represented >10% of all FP types) were characterized in each of the four technical replicates.
The percentage frequency and base change of each FP identified above the recommended frequency threshold for each population input and replicate. For the RNA Low input, a suggested frequency cut-off was not identified.
| Input Replicate | |||||||
|---|---|---|---|---|---|---|---|
| Input | Suggested Frequency Cut-Off | Amplicon Position | Base Change | 1 | 2 | 3 | 4 |
| DNA | 0.2% | 605 | G > T | 0.22% | |||
| RNA | 0.2% | 1135 | T > C | 0.39% | 0.35% | 0.41% | 0.30% |
| 0.2% | 1915 | T > C | 0.21% | 0.28% | |||
| 0.2% | 2300 | T > C | 0.21% | ||||
| 0.2% | 3056 | T > C | 1.06% | 0.81% | 0.62% | 0.87% | |
| RNA | 0.8% | 1135 | T > C | 1.14% | |||
| RNA | 1609 | T > C | 53.76% | ||||
| 2199 | G > T | 18.49% | |||||
| 2744 | G > C | 99.94% | |||||
| 2933 | T > C | 9.06% | |||||
| 3648 | T > C | 45.06% | |||||