Daniel Valenzuela, Tuukka Norri, Niko Välimäki, Esa Pitkänen, Veli Mäkinen.
Abstract
BACKGROUND: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, which consist of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants are typically carried out on short-read data aligned to a single reference, disregarding the underlying variation.
Keywords: Pan-genome reference; Read alignment; Variation calling
Year: 2018 PMID: 29764365 PMCID: PMC5954285 DOI: 10.1186/s12864-018-4465-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1. Schematic view of our PanVC workflow for variation calling, including a conceptual example. The pan-genomic reference comprises the sequences GATTATTC, GATGGCAAATC, GTTTACTTC and GATTTTC, represented as a multiple sequence alignment. The set of reads from the donor individual is GTTT, TTAA, AAAT and AATC. The CHIC aligner is used to find the best alignment of each read. In the example, all the alignments are exact matches, starting at the first base of the third sequence, the third base of the first sequence, the seventh base of the second sequence, and the eighth base of the second sequence. After all the reads are aligned, the score matrix is computed by incrementing the value at each position where a read aligns. From those values, the heaviest path algorithm extracts a recombination that takes the bases with the highest scores. This is the ad hoc genome, which is then used as a reference for variant calling with GATK. Finally, the variants are normalized so that they refer to the standard reference instead of the ad hoc reference.
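The score-matrix and heaviest-path steps described in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PanVC implementation: the function names, the `(row, start, length)` alignment format, the toy alignment and the `jump_penalty` parameter are all hypothetical, and the reads are assumed to be already aligned (the CHIC aligner is not invoked).

```python
from typing import List, Tuple


def score_matrix(msa: List[str], alignments: List[Tuple[int, int, int]]) -> List[List[int]]:
    """Count read coverage per MSA cell.

    Each alignment is (row, start, length), with start given in the
    0-based ungapped coordinates of that row (an assumed input format).
    """
    scores = [[0] * len(msa[0]) for _ in msa]
    for row, start, length in alignments:
        # Map ungapped positions of this row to MSA columns.
        cols = [c for c, ch in enumerate(msa[row]) if ch != "-"]
        for c in cols[start:start + length]:
            scores[row][c] += 1
    return scores


def heaviest_path(scores: List[List[int]], jump_penalty: int = 0) -> List[int]:
    """Pick one row per column so that the summed score is maximal.

    jump_penalty (hypothetical, not from the source) is subtracted
    whenever the path switches rows between adjacent columns.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    best = [scores[r][0] for r in range(n_rows)]
    back = []  # back[c - 1][r] = row used in column c - 1 when column c uses row r
    for c in range(1, n_cols):
        top = max(range(n_rows), key=lambda r: best[r])
        new_best, choice = [], []
        for r in range(n_rows):
            if best[r] >= best[top] - jump_penalty:
                prev, val = r, best[r]  # staying in the same row is at least as good
            else:
                prev, val = top, best[top] - jump_penalty  # jump from the best row
            new_best.append(val + scores[r][c])
            choice.append(prev)
        best = new_best
        back.append(choice)
    # Trace the chosen rows back from the last column.
    r = max(range(n_rows), key=lambda r: best[r])
    path = [r]
    for choice in reversed(back):
        r = choice[r]
        path.append(r)
    path.reverse()
    return path


def ad_hoc_genome(msa: List[str], path: List[int]) -> str:
    """Spell the recombination along the path, dropping gap symbols."""
    return "".join(msa[r][c] for c, r in enumerate(path)).replace("-", "")


# Toy two-row alignment (hypothetical, not the example from the figure).
toy_msa = ["ACGT-A", "ACCTTA"]
toy_alignments = [(0, 0, 3), (1, 3, 3)]  # reads "ACG" on row 0, "TTA" on row 1
toy_scores = score_matrix(toy_msa, toy_alignments)
toy_path = heaviest_path(toy_scores)
print(ad_hoc_genome(toy_msa, toy_path))  # ACGTTA
```

With `jump_penalty=0` the path simply follows the best-supported row in each column; a positive penalty discourages switching between rows, which is one plausible way to limit the number of recombinations.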
Edit distance from the predicted donor sequence to the true donor, by pan-genome reference size (columns). The average distance between the true donors and the reference is 95193.9.
| Method | 1 | 20 | 50 | 100 |
|---|---|---|---|---|
| GATK | 74695.9 | - | - | - |
| MSA | - | 2885.5 | 1956.9 | 1204.7 |
| MSA | - | 1349.3 | 1117.4 | 1099.3 |
| Graph +GATK | - | 3230.4 | 3336.8 | 2706.9 |
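For readers who want to reproduce this kind of measurement on small inputs, the sketch below computes a unit-cost (Levenshtein) edit distance between two sequences. The cost model is an assumption, since the table does not state which edit distance was used, and this quadratic-time routine is only practical for short strings, not for whole chromosomes.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, keeping only one DP row in memory."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


# Two of the pan-genome sequences from Fig. 1 differ by a single deleted base.
print(edit_distance("GATTATTC", "GATTTTC"))  # 1
```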
Precision and recall of our method MSA, at pan-genome reference sizes 20, 50 and 100, compared to GATK
| Measure | GATK | MSA, 20 | MSA, 50 | MSA, 100 |
|---|---|---|---|---|
| SNV precision | 0.992161 | 0.998585 | 0.998863 | 0.998773 |
| SNV recall | 0.904897 | 0.997098 | 0.998695 | 0.999072 |
| Indel precision | 0.364853 | 0.996514 | 0.99731 | 0.997778 |
| Indel recall | 0.0624981 | 0.982659 | 0.985723 | 0.985958 |
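As a reminder of how these two measures are defined, the following sketch computes precision and recall from a set of called variants and a set of true variants. It assumes variants are compared by exact match of a normalized `(position, ref, alt)` tuple; that representation and the function name are illustrative assumptions, and the evaluation behind the table may have matched variants differently.

```python
from typing import Iterable, Tuple

Variant = Tuple[int, str, str]  # (position, ref allele, alt allele), assumed format


def precision_recall(called: Iterable[Variant], truth: Iterable[Variant]) -> Tuple[float, float]:
    """Precision = TP / calls made; recall = TP / true variants."""
    called_set, truth_set = set(called), set(truth)
    true_positives = len(called_set & truth_set)
    precision = true_positives / len(called_set) if called_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical toy data: 3 calls, 2 of them correct, out of 4 true variants.
truth = [(10, "A", "G"), (25, "C", "T"), (40, "G", "GA"), (57, "TT", "T")]
called = [(10, "A", "G"), (40, "G", "GA"), (99, "C", "A")]
precision, recall = precision_recall(called, truth)
```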
Fig. 2. Four different representations of a pan-genome that corresponds to the same set of individuals. Top left: a reference sequence plus a set of variants that specify the other individuals. Top right: a (directed acyclic) graph representation. Bottom left: a multiple sequence alignment representation. Bottom right: a set-of-sequences representation