Abstract
Whole-genome sequencing with short-read technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.
Year: 2019 PMID: 31604920 PMCID: PMC6788989 DOI: 10.1038/s41467-019-12493-y
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of the Longshot algorithm. (a) Candidate variants are identified using the pileup of the original alignments, and a standard genotype likelihood calculation is used to determine whether the site is a potential variant. (b) To determine the allele for each read at each potential SNV site it overlaps, a window is formed around the variant and the probability of the observed read sequence given each allele is calculated using the forward algorithm on a Pair-Hidden Markov Model. The most likely allele and its quality score are chosen based on the relative likelihoods of the two alleles. (c) Using the alleles and quality values for each read at variant sites, phased genotypes for all variants are determined jointly by iterating between haplotype assembly with HapCUT2 (on heterozygous variants) and local updates of the phased genotypes
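The "standard genotype likelihood calculation" in panel (a) can be illustrated with a minimal Python sketch. This is a generic textbook formulation, not Longshot's actual implementation: the function names, the uniform per-base error model, and the decision rule (no genotype priors) are all assumptions made for illustration.

```python
import math

def genotype_log_likelihoods(bases, error_probs, ref, alt):
    """Return log10 P(pileup | genotype) for a biallelic site.

    Assumptions (illustrative only): reads are independent, each read
    samples either allele of the genotype with probability 0.5, and a
    sequencing error turns the true base into each of the three other
    bases with equal probability. Error probabilities must lie strictly
    between 0 and 1.
    """
    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    result = {}
    for name, (allele1, allele2) in genotypes.items():
        total = 0.0
        for base, err in zip(bases, error_probs):
            # P(observed base | true allele)
            p1 = 1.0 - err if base == allele1 else err / 3.0
            p2 = 1.0 - err if base == allele2 else err / 3.0
            total += math.log10(0.5 * p1 + 0.5 * p2)
        result[name] = total
    return result

def is_candidate_variant(bases, error_probs, ref, alt):
    """Hypothetical rule: flag the site if a non-reference genotype is
    more likely than ref/ref (priors are ignored in this sketch)."""
    ll = genotype_log_likelihoods(bases, error_probs, ref, alt)
    return max(ll["ref/alt"], ll["alt/alt"]) > ll["ref/ref"]

if __name__ == "__main__":
    pileup = list("AAAAGGGGAG")      # 5 reads support A, 5 support G
    errors = [0.1] * len(pileup)     # high per-base error, as in SMS reads
    for genotype, loglik in genotype_log_likelihoods(pileup, errors, "A", "G").items():
        print(genotype, round(loglik, 2))
```

A production caller would combine these likelihoods with genotype priors and, as the caption describes, re-estimate per-read alleles by realignment before final genotyping.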
Fig. 2 Accuracy and completeness of Longshot SNV calls on whole-genome SMS data. Longshot was used to call single-nucleotide variants (SNVs) using SMS data from the GIAB project for four human genomes: NA12878 (two coverage levels), NA24385 (four coverage levels), NA24149, and NA24143. For each individual, variants were also called using FreeBayes applied to Illumina short reads. (a) Precision of the SNV calls calculated using the GIAB high-confidence variant call set. (b) Recall of the SNV calls. (c) The combined switch error rate (total rate of switch errors and mismatch errors) of the Longshot and Illumina short-read-based haplotypes. (d) N50 length of the haplotypes. (e) The fraction of heterozygous variants phased in each dataset
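For readers unfamiliar with the N50 statistic in panel (d), the sketch below computes it for a list of haplotype block lengths. It uses the conventional definition (the length of the block at which the cumulative sum of sorted lengths first reaches half of the total); the paper may normalize differently, for example by genome size, so treat the function `n50` as an illustrative assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    Conventional definition: sort blocks from longest to shortest and
    return the length of the block at which the running total first
    reaches at least half of the summed length of all blocks.
    """
    if not lengths:
        raise ValueError("n50 requires at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

if __name__ == "__main__":
    # Hypothetical haplotype block lengths in base pairs
    blocks = [2, 3, 4, 5, 6]
    print(n50(blocks))  # prints 5
```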
Comparison of accuracy for variant calling methods on whole-genome SMS data
| Genome | Read coverage | Method | Precision | Recall | Runtime (h) |
|---|---|---|---|---|---|
| NA12878 | 44 | Longshot | | | 23:31 |
| NA12878 | 44 | WhatsHap | | | 27:47 |
| NA12878 | 44 | Clairvoyante | | | 21:44 (×4) |
| NA24385 | 62 | Longshot | | | 41:55 |
| NA24385 | 62 | WhatsHap | | | 32:09 |
| NA24385 | 62 | Clairvoyante | | | 22:25 (×4) |
| NA24385 | 27 | Longshot | | | 20:03 |
| NA24385 | 27 | WhatsHap | | | 22:54 |
| NA24385 | 27 | Clairvoyante | | | 21:09 (×4) |
| NA24143 | 27 | Longshot | | | 18:51 |
| NA24143 | 27 | WhatsHap | | | 22:06 |
| NA24143 | 27 | Clairvoyante | | | 21:42 (×4) |
| NA24149 | 23 | Longshot | | | 16:59 |
| NA24149 | 23 | WhatsHap | | | 20:30 |
| NA24149 | 23 | Clairvoyante | | | 23:59 (×4) |
All methods were run on BAM files generated using the NGMLR aligner, and precision and recall values were calculated using the GIAB high-confidence variant calls. The runtime listed is the total walltime to process all chromosomes individually. Clairvoyante supports multi-threading and was run using four threads per chromosome
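As a rough illustration of how precision and recall against a truth set are defined, the Python sketch below compares two sets of SNVs keyed by `(chrom, pos, ref, alt)`. It is a simplification under stated assumptions: the function name and tuple layout are hypothetical, genotypes are ignored, and a real benchmark against GIAB calls would also restrict the comparison to high-confidence regions and normalize variant representation.

```python
def precision_recall(called, truth):
    """Compute (precision, recall) for two collections of variant keys.

    Each variant is assumed to be a hashable key such as
    (chrom, pos, ref, alt). A call counts as a true positive only if
    the identical key is present in the truth set.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper
    called = {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "T"),
        ("chr1", 300, "G", "A"),
        ("chr2", 50, "T", "C"),
    }
    truth = {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "T"),
        ("chr1", 400, "T", "A"),
    }
    precision, recall = precision_recall(called, truth)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```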
Comparison of PacBio and Illumina SNV calls for NA12878
| Call set | Measure | Genome (1–22) | Inside GIAB Confident | Outside GIAB Confident | Segmental Dup. (≥95% similar) | Segmental Dup. (≥99% similar) |
|---|---|---|---|---|---|---|
| | Region size | 2.8 Gb | 2.4 Gb | 330.7 Mb | 102.8 Mb | 47.5 Mb |
| PacBio | # SNVs | 3,518,530 | 3,002,660 | 515,870 | 180,889 | 78,851 |
| PacBio | Ts/Tv | 2.08 | 2.14 | 1.75 | 1.95 | 1.99 |
| Illumina | # SNVs | 3,563,787 | 3,065,573 | 498,214 | 116,649 | 18,684 |
| Illumina | Ts/Tv | 2.03 | 2.1 | 1.66 | 1.84 | 1.79 |
| Unique to PacBio | # SNVs | 254,428 | 63,848 | 190,580 | 103,621 | 69,705 |
| Unique to PacBio | Ts/Tv | 1.63 | 1.83 | 1.57 | 1.9 | 1.99 |
| Unique to Illumina | # SNVs | 299,733 | 126,763 | 172,970 | 39,409 | 9538 |
| Unique to Illumina | Ts/Tv | 1.3 | 1.26 | 1.33 | 1.53 | 1.55 |
| Shared Illumina & PacBio | # SNVs | 3,264,078 | 2,938,812 | 325,266 | 77,241 | 9146 |
| Shared Illumina & PacBio | Ts/Tv | 2.12 | 2.15 | 1.85 | 2.01 | 2.04 |
Variants were called using short reads (33× coverage) with FreeBayes and using SMS long reads (30× coverage) with Longshot. The number of variants called by each technology, the number of variants shared between the two technologies, and the corresponding transition/transversion (Ts/Tv) ratios are shown for the whole genome and various subsets of the genome including GIAB high-confidence regions and segmental duplications with high sequence identity
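The Ts/Tv ratio reported in the table can be computed from a list of SNVs as sketched below. Transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) substitutions; every other substitution is a transversion. The function name and input layout are assumptions for illustration and are not taken from the paper's pipeline.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snvs):
    """Return the transition/transversion ratio for (ref, alt) pairs.

    Each SNV is assumed to be a pair of single bases from A, C, G, T.
    Raises ValueError if a pair is not a substitution or if no
    transversions are present (the ratio would be undefined).
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            raise ValueError(f"not a substitution: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("Ts/Tv is undefined without transversions")
    return transitions / transversions

if __name__ == "__main__":
    # Hypothetical toy call set: three transitions, one transversion
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(ts_tv_ratio(calls))  # prints 3.0
```

Because false-positive SNV calls tend to pull the ratio down toward the value expected for random substitutions, a lower Ts/Tv in a subset of calls is commonly read as a sign of lower precision in that subset.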
Fig. 3 Accurate variant calling using SMS reads and Longshot in the duplicated gene STRC. An Integrative Genomics Viewer (IGV) view of mapped reads shows that a long segment of the gene (circled in black) has low coverage of uniquely mapped Illumina reads because a long segmental duplication with high sequence similarity spans the entire gene. PacBio reads (separated by haplotype using Longshot phased SNVs) cover the entire gene consistently, allowing Longshot to call 42 SNVs, of which 20 are shared with short reads and 22 are unique to Longshot