Maurilio Monsu, Matteo Comin.
Abstract
Sequencing technologies have provided the basis of most modern genome sequencing studies due to their high base-level accuracy and relatively low cost. One of the most demanding steps is mapping reads to the human reference genome. The reliance on a single human reference genome could introduce substantial biases in downstream analyses. Pangenomic graph reference representations offer an attractive approach for storing genetic variations. Moreover, it is possible to include known variants in the reference in order to make read mapping, variant calling, and genotyping variant-aware. Only recently has a framework for variation graphs, vg [Garrison E, Adam MN, Siren J, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9], improved variation-aware alignment and variant calling in general. The major bottleneck of vg is the high cost of mapping reads to a variation graph. In this paper we study the problem of SNP calling on a variation graph and we present a fast read alignment tool, named VG SNP-Aware. VG SNP-Aware is able to align reads exactly to a variation graph and detect SNPs based on these aligned reads. The results show that VG SNP-Aware can efficiently map reads to a variation graph with a speedup of 40× with respect to vg and similar accuracy on SNP detection.
Keywords: SNP detection; reads alignment; variation graph
Year: 2021 PMID: 34783230 PMCID: PMC8709736 DOI: 10.1515/jib-2021-0032
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
Figure 1: Example of a variation graph with a biallelic SNP. The two alternative bases are included in two distinct paths.
Figure 2: (a) Mapping of the read “ATGTT”, which includes an alternative allele of an SNP, on the variation graph in the forward direction; (b) mapping of the read “AAAAT”, which maps to the reference, on the complemented variation graph in the reverse direction.
Figure 3: The vg pipeline: graph construction, graph indexing, read mapping, variant calling.
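
As a rough illustration of the ideas in Figures 1 and 2, the sketch below builds a toy variation graph with one biallelic SNP and spells out its two paths. The node sequences, node ids and helper names (`NODES`, `EDGES`, `paths`, `spell`, `reverse_complement`) are assumptions chosen so that the two reads quoted in Figure 2 fit; they are not taken from vg or from the figures themselves.

```python
# Toy variation graph (assumed for illustration, not the paper's figure):
# node 2 carries the reference allele, node 3 the alternative allele.
NODES = {1: "AT", 2: "T", 3: "G", 4: "TT"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: []}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def paths(node, end):
    """Yield every path (list of node ids) from `node` to `end`."""
    if node == end:
        yield [node]
        return
    for nxt in EDGES[node]:
        for rest in paths(nxt, end):
            yield [node] + rest


def spell(path):
    """Concatenate the node sequences along a path."""
    return "".join(NODES[n] for n in path)


def reverse_complement(seq):
    """Complement each base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


# Map each spelled sequence to the path that produces it.
haplotypes = {spell(p): p for p in paths(1, 4)}
print(haplotypes)
print(reverse_complement("AAAAT"))
```

Under these assumed sequences, the read “ATGTT” spells the alternative path 1→3→4, while “AAAAT” is the reverse complement of the reference path 1→2→4, which is why it can only be placed when the graph is traversed on the opposite strand.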
Resources and datasets used for testing.
| Dataset | Description |
|---|---|
| hg19 | Reference genome |
| dbSNP | Biallelic SNPs dataset |
| NA12878 | Gold standard from GIAB |
| SRR622461 (low coverage) | Dataset of reads with coverage 6× |
| SRR622457 (high coverage) | Dataset of reads with coverage 10× |
Time and memory comparison for reads alignment and genotyping for various tools.
| Dataset | Software | Mapping time, min (speedup) | Genotyping time, min | Total time, min (speedup) | Memory (GB) |
|---|---|---|---|---|---|
| Low coverage | | 8685 (38.6×) | 120 | 8804 (28.7×) | 47.74 |
| Low coverage | | 1003 (4.45×) | 100 | 1103 (3.1×) | 47.74 |
| Low coverage | VG SNP-Aware | 225 | 129 | 354 | 45.25 |
| High coverage | | 12,307 (40.6×) | 165 | 12,472 (28.6×) | 48.73 |
| High coverage | | 1542 (5.1×) | 126 | 1668 (3.8×) | 48.73 |
| High coverage | VG SNP-Aware | 303 | 138 | 436 | 51.13 |
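
The speedups in parentheses appear to be plain ratios between a tool's running time and that of VG SNP-Aware on the same dataset. The short sketch below checks this reading on the mapping times of the first row of each dataset; the function name `speedup` is hypothetical and the interpretation is an assumption, since the source does not spell out the formula.

```python
def speedup(other_minutes, vg_snp_aware_minutes):
    """Ratio of another tool's running time to VG SNP-Aware's (assumed definition)."""
    return other_minutes / vg_snp_aware_minutes


# Mapping times in minutes, copied from the table.
low = speedup(8685, 225)      # low coverage, first row vs. VG SNP-Aware
high = speedup(12307, 303)    # high coverage, first row vs. VG SNP-Aware

print(f"low coverage: {low:.1f}x, high coverage: {high:.1f}x")
```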
Size of the output alignment file for all datasets and tools.
| Dataset | Software | JSON (GB) | GAM (GB) |
|---|---|---|---|
| Low coverage | | 183 | 17 |
| Low coverage | | 135 | 14 |
| Low coverage | VG SNP-Aware | 46 | 4.2 |
| High coverage | | 291 | 27 |
| High coverage | | 211 | 22 |
| High coverage | VG SNP-Aware | 71 | 6.3 |
Genotyping results for all tools and datasets: true positives (TP), false positives (FP) and false negatives (FN).
| Dataset | Software | TP | FP | FN |
|---|---|---|---|---|
| Low coverage | | 1,773,962 | 95,198 | 1,917,199 |
| Low coverage | | 1,503,409 | 73,185 | 2,187,744 |
| Low coverage | VG SNP-Aware | 1,511,408 | 127,387 | 2,179,745 |
| High coverage | | 2,141,772 | 84,743 | 1,549,384 |
| High coverage | | 1,928,400 | 57,667 | 1,762,753 |
| High coverage | VG SNP-Aware | 1,935,783 | 120,288 | 1,755,370 |
Genotyping results for all tools and datasets: precision, sensitivity and F-measure.
| Dataset | Software | Precision | Sensitivity | F-measure |
|---|---|---|---|---|
| Low coverage | | 0.9491 | 0.4806 | 0.6381 |
| Low coverage | | 0.9536 | 0.4073 | 0.5708 |
| Low coverage | VG SNP-Aware | 0.9223 | 0.4095 | 0.5671 |
| High coverage | | 0.9619 | 0.5802 | 0.7239 |
| High coverage | | 0.9710 | 0.5224 | 0.6793 |
| High coverage | VG SNP-Aware | 0.9415 | 0.5244 | 0.6736 |
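
The source does not state how precision, sensitivity and F-measure are derived, but the usual definitions (precision = TP / (TP + FP), sensitivity = TP / (TP + FN), F-measure as their harmonic mean) reproduce the reported figures. The sketch below applies them to the low-coverage VG SNP-Aware counts; the function name `genotyping_metrics` is hypothetical.

```python
def genotyping_metrics(tp, fp, fn):
    """Standard precision, sensitivity (recall) and F-measure from raw counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Low coverage, VG SNP-Aware: TP, FP and FN as reported in the counts table.
p, s, f = genotyping_metrics(tp=1_511_408, fp=127_387, fn=2_179_745)
print(f"precision={p:.4f} sensitivity={s:.4f} F-measure={f:.4f}")
```

Rounded to four decimals, this gives 0.9223, 0.4095 and 0.5671, matching the corresponding table row.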

Read alignment procedure (pseudocode; … marks a condition not present in the source):

    for all reads R do
        initial_nodes = find_match(R, graph)
        for all n in initial_nodes do
            current_node = n
            max_node_visited = 2 * length(R)
            string_to_map = extract_substring(R, n)
            mapping_direction = find_direction(R, n)
            neighbour = 0
            aligned_path = current_node
            while … do
                next_node = find_next_node(current_node, mapping_direction)
                match = find_string(string_to_map, next_node)
                if match then
                    update(string_to_map, match)
                    current_node = next_node
                    neighbour = 0
                    add_node(aligned_path, current_node)
                else
                    neighbour++
                    if … then
                        exit while
                    end if
                end if
                max_node_visited = max_node_visited - 1
            end while
            if … then
                Store_Alignment(R, aligned_path)
                exit for all loop and process next read
            end if
        end for
    end for
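
The walk described by the pseudocode above can be sketched in Python as follows. This is a simplified reading, not the VG SNP-Aware implementation: it maps in the forward direction only, treats a seed as any node whose sequence matches the start of the read, tries successors greedily without backtracking, and fills the missing loop conditions with assumptions (stop when the read is consumed, when the node budget of twice the read length runs out, or when every neighbour has been tried). The graph, `NODES`, `EDGES` and `align_exact` are hypothetical names.

```python
# Toy graph with one biallelic SNP (assumed): node 2 = reference, node 3 = alternative.
NODES = {1: "AT", 2: "T", 3: "G", 4: "TT"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: []}


def align_exact(read, nodes, edges):
    """Return the node path that spells `read` exactly, or None if there is none."""
    for seed, seed_seq in nodes.items():            # candidate initial nodes
        if not (read.startswith(seed_seq) or seed_seq.startswith(read)):
            continue
        remaining = read[len(seed_seq):]            # part of the read still to map
        current, path = seed, [seed]
        neighbour = 0                               # index of the successor being tried
        budget = 2 * len(read)                      # cap on visited nodes
        while remaining and budget > 0:
            successors = edges.get(current, [])
            if neighbour >= len(successors):        # assumed exit: all neighbours failed
                break
            nxt = successors[neighbour]
            seq = nodes[nxt]
            if remaining.startswith(seq) or seq.startswith(remaining):
                remaining = remaining[len(seq):]    # consume the matched bases
                current = nxt
                neighbour = 0
                path.append(nxt)
            else:
                neighbour += 1                      # try the next successor
            budget -= 1
        if not remaining:                           # whole read mapped exactly
            return path
    return None


print(align_exact("ATGTT", NODES, EDGES))   # read carrying the alternative allele
print(align_exact("ATTTT", NODES, EDGES))   # read matching the reference allele
print(align_exact("ATCTT", NODES, EDGES))   # read with a base absent from the graph
```

On this toy graph the first read follows the alternative node, the second follows the reference node, and the third is rejected because exact matching leaves no room for an unknown base at the SNP position.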