Abstract
Whole-genome sequencing is a powerful tool for analyzing genetic variation on a global scale. One particularly useful application is the identification of mutations obtained by classical phenotypic screens in model species. Sequence data from the mutant strain is aligned to the reference genome, and then variants are called to generate a list of candidate alleles. A number of software pipelines for mutation identification have been targeted to C. elegans, with particular emphasis on ease of use, incorporation of mapping strain data, subtraction of background variants, and similar criteria. Although success is predicated upon the sensitive and accurate detection of candidate alleles, relatively little effort has been invested in evaluating the underlying software components that are required for mutation identification. Therefore, we have benchmarked a number of commonly used tools for sequence alignment and variant calling, in all pair-wise combinations, against both simulated and actual datasets. We compared the accuracy of those pipelines for mutation identification in C. elegans, and found that the combination of BBMap for alignment plus FreeBayes for variant calling offers the most robust performance.
Year: 2017 PMID: 28333980 PMCID: PMC5363872 DOI: 10.1371/journal.pone.0174446
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Perfect-match SE-50bp 20-fold genomes, mapping and genome coverage.
| Aligner | Unmapped/low map quality (a) | Uncovered genome (b) | Uncovered CDS (number of genes) (c) |
|---|---|---|---|
| BBMap | 7.04% | 5.29% | 3.00% (2,089) |
| BFAST | 7.20% | 5.40% | 3.06% (2,134) |
| Bowtie | 6.49% | 4.86% | 2.96% (2,067) |
| BWA | 6.50% | 4.87% | 2.96% (2,067) |
| Novoalign | 6.49% | 4.86% | 2.96% (2,067) |
(a) The percentage of reads with map quality scores ≤ 3.
(b) The percentage of nucleotides with read depth coverage < 3.
(c) The percentage of coding sequence in the genome that is uncovered, and the number of genes that contain uncovered coding sequence. Total coding sequence, 25,460,976 bases. Total number of genes, 20,538.
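The two thresholds in the footnotes above (read depth < 3 for uncovered positions, map quality ≤ 3 for poorly mapped reads) can be illustrated with a minimal sketch. The function names and the toy inputs are hypothetical and are not taken from the study; the sketch also assumes that per-base depths and per-read map quality scores have already been extracted from an alignment, and that unmapped reads carry a map quality of 0.

```python
def uncovered_fraction(depths, min_depth=3):
    """Fraction of positions whose read depth is below min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for d in depths if d < min_depth) / len(depths)


def low_quality_fraction(mapqs, max_mapq=3):
    """Fraction of reads whose map quality is at or below max_mapq.

    Assumption: unmapped reads are represented with a map quality of 0.
    """
    if not mapqs:
        raise ValueError("mapqs must not be empty")
    return sum(1 for q in mapqs if q <= max_mapq) / len(mapqs)


# Toy inputs, invented for illustration only.
toy_depths = [0, 2, 3, 20, 19, 1, 25, 30]
toy_mapqs = [0, 3, 4, 40, 42, 60, 2, 37]

print(f"uncovered positions: {uncovered_fraction(toy_depths):.2%}")
print(f"unmapped/low map quality reads: {low_quality_fraction(toy_mapqs):.2%}")
```

Restricting `depths` to positions inside coding sequence would give the corresponding CDS figure under the same assumptions.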
Percentage of uncovered sequence using various perfect-match data sets.
| Aligner | SE-50bp, 50X: Unc | SE-50bp, 50X: CDS | SE-50bp, 50X: genes | SE-150bp, 20X: Unc | SE-150bp, 20X: CDS | SE-150bp, 20X: genes | PE-50bp, 20X: Unc | PE-50bp, 20X: CDS | PE-50bp, 20X: genes |
|---|---|---|---|---|---|---|---|---|---|
| BBMap | 4.94% | 2.87% | 1,982 | 2.22% | 1.85% | 897 | 2.55% | 1.73% | 855 |
| BFAST | 5.10% | 2.91% | 2,016 | 3.05% | 2.01% | 989 | 5.40% | 3.07% | 2,124 |
| Bowtie | 4.56% | 2.84% | 1,960 | 2.22% | 1.85% | 897 | 2.35% | 1.61% | 781 |
| BWA | 4.57% | 2.84% | 1,960 | 2.22% | 1.85% | 897 | 2.17% | 1.61% | 785 |
| Novoalign | 4.56% | 2.84% | 1,960 | 2.22% | 1.85% | 897 | 2.41% | 1.65% | 863 |
(a) The percentage of nucleotides with read depth coverage < 3.
(b) Types of sequence data (SE, single-end; PE, paired-end), read length, and fold genome coverage.
(c) Values for the percentage of the uncovered genome (Unc), percentage of uncovered coding sequences (CDS), and the number of genes that contain uncovered coding sequence.
Fig 1. Sensitivity of variant-calling pipelines for EMS-type mutations.
The percentage of true-positive (TP) mutation calls is indicated for each combination of aligner plus variant caller (F, FreeBayes; G, GATK HaplotypeCaller; S, SAMtools/BCFtools; V, VarScan2) and plotted separately for different categories of variants. Homozygous (blue) and heterozygous (red) mutation calls are indicated by color. Asterisks (*) indicate the best-performing pipelines in each category. (A) Homozygous SNPs. (B) Heterozygous SNPs. (C) Insertions. (D) Deletions.
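The sensitivity measure in this caption can be read as the percentage of known variants in each category that a pipeline recovers. The following is a minimal sketch of that calculation, not the authors' evaluation code: the `(chromosome, position, ref, alt)` key, the category labels, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def sensitivity_by_category(truth, calls):
    """Percentage of true variants recovered, per category.

    truth: dict mapping (chrom, pos, ref, alt) -> category label
    calls: set of (chrom, pos, ref, alt) tuples reported by one pipeline
    """
    totals = Counter(truth.values())
    hits = Counter(cat for variant, cat in truth.items() if variant in calls)
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


# Toy data, invented for illustration only.
toy_truth = {
    ("III", 100, "G", "A"): "hom SNP",
    ("III", 200, "C", "T"): "hom SNP",
    ("III", 300, "G", "A"): "het SNP",
    ("III", 400, "GA", "G"): "deletion",
}
toy_calls = {
    ("III", 100, "G", "A"),
    ("III", 300, "G", "A"),
    ("III", 400, "GA", "G"),
    ("III", 999, "C", "T"),  # not in the truth set, so it cannot raise sensitivity
}

for category, percent in sensitivity_by_category(toy_truth, toy_calls).items():
    print(f"{category}: {percent:.1f}% TP")
```

Running the same function once per aligner and variant caller combination would give one value per pipeline and category, which is the layout the figure panels describe.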
Fig 2. Error rates of variant-calling pipelines.
The fraction of (A) false-positive (FP) and (B) mismatch (MM) mutation calls as a percentage of the total number of variants called by each pipeline. Variant callers are indicated as in Fig 1. Color codes for different categories of variants are indicated by the key (inset). Abbreviations: struct, structural variant; hom, homozygous; het, heterozygous; del, deletion; ins, insertion; SNP, single-nucleotide polymorphism.
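Both error measures are expressed relative to the total number of variants a pipeline calls. The sketch below shows one way to compute them under an assumed definition, which the text here does not spell out: a false positive is a call at a site with no known variant, and a mismatch is a call at a known site whose allele or zygosity differs from the known variant. The data layout and names are hypothetical.

```python
def error_rates(known, called):
    """Return (false-positive %, mismatch %) of all calls made.

    known, called: dicts mapping (chrom, pos) -> (ref, alt, zygosity)

    Assumed definitions (not taken from the source):
      false positive: the called site has no known variant
      mismatch: the site is known, but allele or zygosity differs
    """
    if not called:
        raise ValueError("called must not be empty")
    false_positives = 0
    mismatches = 0
    for site, call in called.items():
        expected = known.get(site)
        if expected is None:
            false_positives += 1
        elif expected != call:
            mismatches += 1
    total = len(called)
    return 100.0 * false_positives / total, 100.0 * mismatches / total


# Toy data, invented for illustration only.
toy_known = {
    ("III", 100): ("G", "A", "hom"),
    ("III", 200): ("C", "T", "het"),
    ("III", 300): ("G", "A", "hom"),
}
toy_called = {
    ("III", 100): ("G", "A", "hom"),  # true positive
    ("III", 200): ("C", "T", "hom"),  # mismatch: wrong zygosity
    ("III", 300): ("G", "A", "hom"),  # true positive
    ("III", 999): ("C", "T", "het"),  # false positive: unknown site
}

fp_percent, mm_percent = error_rates(toy_known, toy_called)
print(f"FP: {fp_percent:.1f}%  MM: {mm_percent:.1f}%")
```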
Fig 3. Sensitivity of variant-calling pipelines for Hawaiian SNPs.
(A) Plot of Hawaiian (Haw) SNP fraction vs. physical map position using the default threshold for variant calling. Shown is a representative example used to map lin-9(n112), located on chromosome III at position 8.9 Mb (red arrow), with BBMap+FreeBayes for variant calling. Green line, LOESS regression of the SNP fraction. (B) The same data as A with a minimum threshold of 1% variant call and one supporting read for FreeBayes. Mapping data from [26]. (C) Sensitivity for 50% (blue) and 5% (red) Hawaiian SNPs. The percentage of true-positive (TP) Hawaiian SNP calls is indicated for each pipeline. Variant callers are indicated as in Fig 1.
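The quantity plotted in panels A and B is, at each SNP site, the share of reads carrying the Hawaiian allele. The following minimal sketch computes that fraction and then smooths it along the chromosome. Note the substitution: a plain sliding-window mean stands in for the LOESS regression named in the caption, to keep the example free of external dependencies. The input layout, function names, window size, and toy read counts are all assumptions.

```python
def snp_fractions(sites):
    """Hawaiian allele fraction per site, sorted by position.

    sites: iterable of (position_bp, hawaiian_reads, total_reads);
    sites with no reads are skipped.
    """
    return sorted(
        (pos, haw / total) for pos, haw, total in sites if total > 0
    )


def windowed_mean(points, window_bp):
    """Mean fraction of all sites within window_bp / 2 of each site.

    A simple stand-in for LOESS; it is not the regression used in the figure.
    """
    half = window_bp / 2
    smoothed = []
    for pos, _ in points:
        nearby = [frac for p, frac in points if abs(p - pos) <= half]
        smoothed.append((pos, sum(nearby) / len(nearby)))
    return smoothed


# Toy read counts, invented for illustration only.
toy_sites = [
    (1_000_000, 10, 20),
    (2_000_000, 5, 20),
    (3_000_000, 0, 20),
    (4_000_000, 6, 20),
    (5_000_000, 0, 0),  # no coverage, skipped
]

toy_points = snp_fractions(toy_sites)
for pos, frac in windowed_mean(toy_points, window_bp=2_000_000):
    print(f"{pos / 1e6:.1f} Mb: smoothed Hawaiian fraction {frac:.3f}")
```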