Marten Jäger, Max Schubach, Tomasz Zemojtel, Knut Reinert, Deanna M Church, Peter N Robinson.
Abstract
BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS).
Keywords: GRCh38; Genome sequencing; NGS; WGS
Year: 2016 PMID: 27964746 PMCID: PMC5155401 DOI: 10.1186/s13073-016-0383-z
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Genomic regions with alternate locus scaffolds (alternate loci). The GRCh38.p2 genome assembly contains 178 genomic regions with one or more alternate loci. The figure was produced using PhenoGram [49].
Fig. 2 Frequency of ASDPs. Alignments contain stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus, ranging from regions that are nearly identical to those with a substantial number of differences. ASDPs were defined to be positions of the alignment that differ between REF-HAP and ALT-HAP and are located in a sliding window in which at most 10 of 50 nucleotides are discrepant (green check marks). The red crosses show discrepancies that are excluded by this definition. In a and c, no ASDP was filtered out by the sliding window, whereas in b, stretches of low sequence identity lead to the removal of several positions shown as red crosses. In d, large inserts in the ALT-HAP lead to a larger number of discrepant positions, which are discarded by the above criteria. e The effects of applying different thresholds of allowed discrepancies and window sizes to call ASDPs. The dotted lines mark the mismatch frequency (ten mismatches in 50 bases) used in this work. f Number of ASDPs that overlap with dbSNP variants according to the different thresholds. ASDP alignable scaffold-discrepant position
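The sliding-window criterion in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `call_asdps`, the use of `-` as the gap character, and the reading that a discrepant column is kept if at least one 50-column window covering it contains at most 10 discrepancies are all assumptions made here.

```python
def call_asdps(ref_aln, alt_aln, window=50, max_mismatches=10):
    """Return 0-based alignment columns that qualify as ASDPs.

    Assumptions (not from the source): both inputs are rows of one pairwise
    alignment with '-' for gaps, a gap opposite a base counts as a
    discrepancy, and a discrepant column is kept if ANY window of length
    `window` covering it has at most `max_mismatches` discrepancies.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    n = len(ref_aln)
    mismatch = [int(r != a) for r, a in zip(ref_aln, alt_aln)]

    # Prefix sums give the number of discrepancies in any window in O(1).
    prefix = [0]
    for m in mismatch:
        prefix.append(prefix[-1] + m)

    w = min(window, n)
    in_good_window = [False] * n
    for start in range(n - w + 1):
        if prefix[start + w] - prefix[start] <= max_mismatches:
            for i in range(start, start + w):
                in_good_window[i] = True

    return [i for i in range(n) if mismatch[i] and in_good_window[i]]


# An isolated difference in an otherwise identical stretch is kept.
ref = "A" * 100
alt = "A" * 50 + "C" + "A" * 49
isolated = call_asdps(ref, alt)

# A fully divergent 50-column stretch yields no ASDPs.
divergent = call_asdps("A" * 50, "C" * 50)
```

Under the "any covering window" reading, discrepancies at the edge of a divergent stretch can still be kept; a stricter variant would require every covering window to pass.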
Distribution of ASDPs
| ASDP category | Count | Percentage |
|---|---|---|
| SNV | 187,080 | 80.5% |
| Deletion | 15,955 | 6.9% |
| Deletion (1 nt) | 6,368 | 2.7% |
| Deletion (2 nt) | 2,413 | 1.0% |
| Deletion (3–50 nt) | 7,174 | 3.1% |
| Insertion | 15,286 | 6.6% |
| Insertion (1 nt) | 6,423 | 2.8% |
| Insertion (2 nt) | 2,224 | 1.0% |
| Insertion (3–50 nt) | 6,639 | 2.9% |
| Block substitution | 14,012 | 6.0% |
| Block substitution (2 nt) | 11,659 | 5.0% |
| Block substitution (3 nt) | 1,653 | 0.7% |
| Block substitution (4–50 nt) | 700 | 0.3% |
A total of 232,333 high-quality ASDPs were characterized by our algorithm, of which 80.5% corresponded to SNVs when comparing the alternate locus with the corresponding primary assembly. About 7% each were deletions and insertions, and 6% were block substitutions, in which a stretch of nucleotides is replaced by an equal number of different nucleotides.
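The four top-level categories in the table can be illustrated with a small classifier, followed by a check that the reported counts add up to the stated total and percentages. This is a sketch: `classify_asdp` is a hypothetical helper, and it assumes VCF-style alleles in which an insertion or deletion is expressed as a length difference between the reference and alternate strings.

```python
def classify_asdp(ref_allele, alt_allele):
    """Assign a discrepancy to one of the four table categories.

    Assumption: VCF-style, non-empty, non-identical alleles.
    """
    if not ref_allele or not alt_allele or ref_allele == alt_allele:
        raise ValueError("alleles must be non-empty and different")
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        return "SNV"
    if len(ref_allele) == len(alt_allele):
        return "Block substitution"  # equal numbers of nucleotides
    if len(ref_allele) > len(alt_allele):
        return "Deletion"
    return "Insertion"


# Counts copied from the table above (top-level categories only).
counts = {
    "SNV": 187_080,
    "Deletion": 15_955,
    "Insertion": 15_286,
    "Block substitution": 14_012,
}
total = sum(counts.values())
percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(total, percentages)
```

The sub-rows of the table (for example, 1 nt, 2 nt, and 3–50 nt deletions) also sum to their parent category, so they should not be added to the total a second time.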
ASDP alignable scaffold-discrepant position, SNV single nucleotide variant
Fig. 3 Region 148. IGV screenshots [50] are shown with variant calls for in-house sample P. a The presence of numerous ASDP-associated variants, as well as a structural variant associated with the alternate locus KI270808.1, clearly suggests that the sample carries the KI270808.1 sequence rather than the REF-HAP sequence for region 148. Most of the variants that correspond to ASDPs are homozygous, suggesting that KI270808.1 is present in the homozygous state. An additional non-ASDP variant is present. Variants corresponding to 50 of the 52 ASDPs shown are listed in dbSNP. b The corresponding region on the alternate locus KI270808.1 aligned well. Only the single non-ASDP-associated variant is called. IGV shows supplemental reads in blue (i.e., reads that map to the primary assembly as well as to an alternate locus). ASDP alignable scaffold-discrepant position, SNV single nucleotide variant, SV structural variant, IGV Integrative Genomics Viewer
Fig. 4 Overview of the ASDPex algorithm. a ASDPex compares the set V of all variants called against REF-HAP with the set A of ASDPs associated with ALT-HAP. In this example, |A| (the number of ASDPs associated with ALT-HAP) is 6, and |V| (the total number of variants called against REF-HAP) is 8. ASDPex defines the set of residual variants RV as the symmetric set difference between V and A, i.e., RV = V △ A. Therefore, |RV| = 4, and because |RV| < |V|, our algorithm infers that ALT-HAP is present. b The pattern of variant calls obtained for ASDPs differs according to whether the sequenced proband is homozygous for one of the two alternate loci or is heterozygous. Our algorithm exploits this pattern across the entire length of the alternate locus to infer the most likely genotype. ASDP alignable scaffold-discrepant position
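The residual-variant comparison in panel a can be sketched with Python sets. The names `Variant`, `residual_variants`, and `alt_locus_inferred`, the `(pos, ref, alt)` encoding, and the decision rule "fewer residual variants than called variants" are assumptions for illustration. The caption itself gives only the counts (6 ASDPs, 8 called variants, 4 residual variants), and ASDPex may apply further criteria.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    pos: int
    ref: str
    alt: str


def residual_variants(called, asdps):
    """Symmetric set difference: variants in exactly one of the two sets."""
    return set(called) ^ set(asdps)


def alt_locus_inferred(called, asdps):
    """Assumed rule: ALT-HAP is inferred when explaining the calls via the
    alternate locus leaves fewer residual variants than calls against REF-HAP."""
    return len(residual_variants(called, asdps)) < len(set(called))


# Toy data matching the caption's counts: 6 ASDPs, 8 called variants.
asdps = {Variant(100 + 10 * i, "A", "G") for i in range(6)}
matched = set(sorted(asdps)[:5])                      # 5 ASDPs were called
extra = {Variant(500 + i, "C", "T") for i in range(3)}  # 3 non-ASDP calls
called = matched | extra

rv = residual_variants(called, asdps)
print(len(asdps), len(called), len(rv), alt_locus_inferred(called, asdps))
```

In this toy example the residual set consists of the one ASDP that was not called plus the three calls that are not ASDPs.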
Population-specific alternate loci
| Alternate locus | FIN | LWK | CHB | PEL |
|---|---|---|---|---|
| chr4_KI270787v1_alt | | | | |
| chr5_GL383531v1_alt | | | | |
| chr5_GL949742v1_alt | | | | |
| chr6_GL383533v1_alt | | | | |
| chr6_KI270801v1_alt | | | | |
| chr9_GL383542v1_alt | | | | |
| chr11_JH159136v1_alt | | | | |
| chr13_KI270839v1_alt | | | | |
| chr14_KI270844v1_alt | | | | |
| chr15_GL383555v2_alt | | | | |
| chr18_GL383570v1_alt | | | | |
Shown are all the alternate loci that were inferred to be present in at least 90% of the individuals of a population. Alternate loci present in all investigated individuals of the population are marked with an asterisk (*)
CHB: Asian, Han Chinese in Beijing, China; FIN: European, Finnish in Finland; LWK: African, Luhya in Webuye, Kenya; PEL: South American, Peruvians from Lima, Peru
Variant statistics for both genome builds
| Chromosome | GRCh37 All | GRCh37 Common | GRCh37 Rare | GRCh37 Phred | GRCh38 All | GRCh38 Common | GRCh38 Rare | GRCh38 Phred |
|---|---|---|---|---|---|---|---|---|
| 1 | 344158 | 299500 | 44659 | 503.42 | 359530 | 291704 | 67825 | 473.83 |
| 2 | 354113 | 247585 | 106528 | 506.54 | 361469 | 243400 | 118069 | 492.91 |
| 3 | 295447 | 268021 | 27426 | 503.76 | 301985 | 263993 | 37993 | 492.99 |
| 4 | 319988 | 290080 | 29908 | 515.41 | 324134 | 285405 | 38729 | 507.33 |
| 5 | 266077 | 235462 | 30615 | 498.58 | 272482 | 231747 | 40734 | 485.57 |
| 6 | 280789 | 252705 | 28084 | 495.02 | 279132 | 246545 | 32588 | 487.11 |
| 7 | 249980 | 220543 | 29437 | 488.95 | 257917 | 216669 | 41248 | 475.69 |
| 8 | 229332 | 204823 | 24509 | 499.70 | 229541 | 200845 | 28696 | 490.15 |
| 9 | 192615 | 162202 | 30413 | 475.09 | 200034 | 159119 | 40916 | 466.46 |
| 10 | 217957 | 194385 | 23572 | 508.74 | 229352 | 190658 | 38694 | 494.18 |
| 11 | 219134 | 197412 | 21722 | 522.84 | 228324 | 194132 | 34192 | 498.04 |
| 12 | 205085 | 184477 | 20608 | 502.78 | 212789 | 175990 | 36799 | 483.94 |
| 13 | 166128 | 151271 | 14856 | 530.31 | 180521 | 148870 | 31651 | 494.65 |
| 14 | 141971 | 124790 | 17181 | 503.15 | 140443 | 122524 | 17919 | 495.75 |
| 15 | 130324 | 112085 | 18239 | 505.95 | 131389 | 109741 | 21648 | 493.46 |
| 16 | 134293 | 116224 | 18069 | 487.40 | 136799 | 113589 | 23210 | 473.36 |
| 17 | 118096 | 102300 | 15796 | 479.64 | 130637 | 99074 | 31563 | 452.99 |
| 18 | 124509 | 111958 | 12552 | 516.80 | 132628 | 110349 | 22279 | 485.89 |
| 19 | 98104 | 84416 | 13688 | 456.51 | 99625 | 82875 | 16750 | 455.35 |
| 20 | 90490 | 79709 | 10781 | 486.09 | 112562 | 78562 | 33999 | 475.40 |
| 21 | 69511 | 55211 | 14300 | 525.23 | 73027 | 53052 | 19975 | 513.27 |
| 22 | 59660 | 50242 | 9418 | 455.99 | 71112 | 48961 | 22151 | 445.27 |
| Total | 4307761 | | | | 4465432 | | | |
The mean counts of autosomal variants and the median Phred scores per chromosome are shown for GRCh37 and GRCh38.
Columns: All: all detected variants; Common: listed in dbSNP common_all_*; Rare: variants that are not common.
The mean variant counts for chromosome X were 127,914 (GRCh37) and 132,177 (GRCh38). For chromosome Y, the mean counts could not be estimated since gender information was not available for all of the 121 in-house genomes. Both genome releases include the identical mitochondrial reference (NC_012920.1) with 27 variants
Fig. 5 Distribution of ASDP-associated variants called against the primary assembly. Significantly and substantially more ASDP-associated variants are called against the primary assembly when the ASDPex algorithm infers the region to be ALT-HAP than when it infers REF-HAP. The data appear to fall into two well-separated clusters. The figure shows the counts of Ref/Alt ASDP-associated variants per megabase for seven selected regions for the 121 in-house genomes. * p < 1×10⁻⁸; ** p < 1×10⁻¹⁰ (Mann–Whitney test). ASDP alignable scaffold-discrepant position
Reduction in called variants by ASDPex
| Variant calling pipeline | Total variants | Variants per Mb |
|---|---|---|
| GRCh37 canonical | 114,023 ± 4,983 | 2198.3 ± 207.6 |
| GRCh38 canonical | 120,807 ± 4,069 | 1975.2 ± 66.5 |
The variant counts are shown for 121 in-house whole-genome sequencing samples in the ALT-LOCI-containing regions. For GRCh37, a liftover of the regions was performed, and region 116 was removed from both datasets because it has no alignable region in GRCh37. Since the size of the regions differs between GRCh37 and GRCh38, average variant counts per megabase (Mb) are also shown. On average, using ASDPex reduced the number of variants called in the ALT-LOCI-containing regions by 7,863 ± 2,675 (6.5%), corresponding to a reduction from 1975.2 ± 66.5 to 1846.7 ± 71.6 variants per Mb.
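The stated 6.5% reduction can be checked directly from the mean values quoted in the table and its legend. The variable names below are illustrative only, and the calculation uses means without propagating the reported standard deviations.

```python
# Mean values copied from the table and legend above.
grch38_total = 120_807        # GRCh38 canonical, total variants
mean_reduction = 7_863        # variants removed by ASDPex
per_mb_before = 1975.2        # GRCh38 canonical, variants per Mb
per_mb_after = 1846.7         # after ASDPex, variants per Mb

pct_of_total = 100 * mean_reduction / grch38_total
pct_per_mb = 100 * (per_mb_before - per_mb_after) / per_mb_before

print(f"{pct_of_total:.1f}% of total calls, {pct_per_mb:.1f}% per Mb")
```

Both ratios round to the same percentage, which is expected because the per-Mb figures are the totals normalized by the same region size.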
Fig. 6 rs2049805. The GWAS hit rs2049805 corresponds to an ASDP defined by an alignment between chromosome 1 of the primary assembly (region MTX1) and the alternate locus GL383519.1, which is identical over a stretch of 49 nucleotides except for the middle position. rs2049805 is significantly associated with blood urea nitrogen levels in east Asian populations [51]. ASDP alignable scaffold-discrepant position