Cristian Groza, Tony Kwan, Nicole Soranzo, Tomi Pastinen, Guillaume Bourque.
Abstract
BACKGROUND: Epigenomic studies that use next-generation sequencing experiments typically rely on the alignment of reads to a reference sequence. However, because of genetic diversity and the diploid nature of the human genome, we hypothesize that using a generic reference could lead to incorrectly mapped reads and bias downstream results.
Keywords: ChIP-seq; De novo assembly; Epigenomics; Genome graphs; Modified reference; Personalized genomes; Reference bias
Year: 2020 PMID: 32450900 PMCID: PMC7249353 DOI: 10.1186/s13059-020-02038-8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1. (a) Two instances of reference bias that could be corrected by a personalized genome. One read is mapped to the incorrect location in the reference genome. The other read is unmapped in the reference genome, but becomes mapped in the personalized genome. (b) Phased personalized genomes can be implemented in several ways. The reference can be patched with called variants to create a pair of modified personal genomes (MPGs). Alternatively, a sequence graph genome could be augmented with an individual’s alleles (GPG). Finally, the entire personal genomic sequence can be assembled de novo (DPG).
Number of altered peak calls in MPGs, DPGs, and GPGs for the NA12878 H3K4me1 and H3K27ac marks
| Version | Mark | Common | Personal-only | Ref-only |
|---|---|---|---|---|
| MPG, paternal | H3K4me1 | 146,520 | 1636 (1.1%) | 854 (0.6%) |
| MPG, maternal | H3K4me1 | 146,570 | 1622 (1.1%) | 808 (0.6%) |
| MPG, downsampled | H3K4me1 | 146,688 | 1051 (0.7%) | 550 (0.4%) |
| DPG, Hap1 | H3K4me1 | 141,444 | 7176 (4.8%) | 6755 (4.6%) |
| DPG, Hap2 | H3K4me1 | 141,442 | 7130 (4.8%) | 6774 (4.6%) |
| DPG, Pendleton | H3K4me1 | 142,347 | 16,245 (10.2%) | 8912 (5.8%) |
| GPG | H3K4me1 | 132,668 | 3068 (2.3%) | 1178 (0.9%) |
| MPG, paternal | H3K27ac | 68,888 | 660 (1.0%) | 351 (0.5%) |
| MPG, maternal | H3K27ac | 68,909 | 688 (1.0%) | 335 (0.5%) |
| MPG, downsampled | H3K27ac | 68,953 | 438 (0.6%) | 218 (0.3%) |
| DPG, Hap1 | H3K27ac | 63,419 | 2078 (3.2%) | 9901 (13.5%) |
| DPG, Hap2 | H3K27ac | 63,441 | 2091 (3.2%) | 9899 (13.5%) |
| DPG, Pendleton | H3K27ac | 66,811 | 5208 (7.2%) | 4980 (6.9%) |
| GPG | H3K27ac | 75,538 | 1847 (2.4%) | 1206 (1.6%) |
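The percentages in the table appear consistent with each altered count divided by the sum of that count and the common count. This relation is inferred from the tabulated numbers and is not stated in the text above, so treat it as an assumption; the function name `altered_fraction` is hypothetical. A minimal Python sketch:

```python
def altered_fraction(common: int, altered: int) -> float:
    """Percentage of the peaks called in one genome that are altered.

    Assumption (inferred from the table, not stated in the source):
    percentage = altered / (common + altered) * 100.
    """
    total = common + altered
    if total == 0:
        raise ValueError("no peaks called")
    return 100.0 * altered / total


# Counts taken from the "MPG, paternal" H3K4me1 row of the table.
personal_pct = altered_fraction(146_520, 1636)
ref_pct = altered_fraction(146_520, 854)
print(f"{personal_pct:.1f}% personal-only, {ref_pct:.1f}% ref-only")
```

Under this assumption, the personal-only and ref-only percentages use different denominators, because each is expressed relative to the peak set called in its own genome.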
Fig. 2. (a) A comparison of the coverage of H3K4me1 peak-called regions in hg19 and the maternal MPG. (b) Identification of peak-called regions that have a significant difference in coverage. (c) Q value distributions of the same H3K4me1 peaks. (d) NA12878 MPG estimate of the probability that each combination of variation calls present in a region may cause a personal-only peak call compared to their average widths.
Fig. 3. (a) Proportion of peaks that are called only in personalized MPGs. (b) Number of peaks with higher coverage in the personalized MPG than in the reference. (c) Blueprint MPG estimates of the probability that each combination of variation calls present in a region may cause a personal-only peak call compared to their relative average widths. (d) The probability that a variant affects a peak called on full reads is lower compared to trimmed reads.
Fig. 4. (a) A comparison of the coverage of peak-called regions in the reference and the Hap1 DPG. The smear represents ref-only peaks with no coverage in Hap1. (b) Identification of peak-called regions that have a significant difference in coverage. (c) Summary of the overlap between altered peaks, confident peaks, repeats, and segmental duplications [58]. (d) The repeats that overlap altered peaks are enriched in Alu elements relative to their frequency in the RepeatMasker. The categories are chosen by grouping repeats by name prefix, summing their frequencies per group, and taking the largest groups. Remaining groups are labeled as “other.” The control regions are random genomic intervals with a width distribution identical to altered peaks.
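The grouping and control-region steps described for panel d of Fig. 4 can be sketched as follows. This is an illustrative sketch, not the authors' code: the prefix rule (first three characters of the repeat name), the function names `group_repeats` and `sample_controls`, and the size-weighted choice of chromosome are all assumptions made for the example.

```python
import random
from collections import Counter


def group_repeats(names, top_k=5, prefix_len=3):
    """Group repeat names by prefix, sum counts, keep the largest groups.

    Assumption: the prefix is the first `prefix_len` characters of the
    name (e.g. "AluSx1" -> "Alu"). Groups outside the top_k are pooled
    under the label "other".
    """
    counts = Counter(name[:prefix_len] for name in names)
    top = dict(counts.most_common(top_k))
    other = sum(counts.values()) - sum(top.values())
    if other:
        top["other"] = other
    return top


def sample_controls(widths, chrom_sizes, seed=0):
    """Draw one random interval per altered-peak width.

    Reusing the observed widths gives the controls a width distribution
    identical to the altered peaks. Assumption: chromosomes are picked
    with probability proportional to their length.
    """
    rng = random.Random(seed)
    controls = []
    for width in widths:
        # Only chromosomes long enough to hold the interval are eligible.
        eligible = {c: n for c, n in chrom_sizes.items() if n >= width}
        if not eligible:
            raise ValueError(f"no chromosome can hold width {width}")
        chrom = rng.choices(list(eligible), weights=list(eligible.values()))[0]
        start = rng.randint(0, eligible[chrom] - width)
        controls.append((chrom, start, start + width))
    return controls


repeats = ["AluSx1", "AluY", "AluJb", "L1PA3", "L1PA4", "MIRb", "MER5A"]
print(group_repeats(repeats, top_k=2))
```

Comparing the grouped counts for altered peaks against the same counts for the sampled controls is one way to judge enrichment of a repeat family.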
Fig. 5. (a) A comparison of the coverage of H3K4me1 peak-called regions in the reference and the graph genome. Pairwise overlaps between MPG, DPG, and GPG H3K4me1 peak tracks. (b) Identification of peak-called regions that have a significant difference in coverage. (c) Overlap of all peak calls. (d) Overlap of altered personal-only peak calls. (e) Overlap of ref-only peak calls. (f) Empirical null distributions for the overlap of personal-only peaks between personal genome implementations.
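The caption for panel f of Fig. 5 does not say how the empirical null distributions were built. One common approach, shown here purely as an assumption, is to draw random peak subsets of the observed sizes from the full peak set and record how often they overlap by chance. The names `null_overlap` and `empirical_p` are hypothetical, and peaks are represented by simple identifiers rather than genomic intervals.

```python
import random


def null_overlap(all_peaks, n_a, n_b, n_iter=1000, seed=0):
    """Null distribution of the overlap between two random peak subsets.

    Assumption: under the null, each implementation's personal-only
    peaks are a uniform random sample (sizes n_a and n_b) of all peaks.
    """
    rng = random.Random(seed)
    peaks = list(all_peaks)
    null = []
    for _ in range(n_iter):
        a = set(rng.sample(peaks, n_a))
        b = set(rng.sample(peaks, n_b))
        null.append(len(a & b))
    return null


def empirical_p(observed, null):
    """One-sided empirical p value with the usual add-one correction."""
    hits = sum(1 for x in null if x >= observed)
    return (1 + hits) / (1 + len(null))


# Toy example: 1000 peak identifiers, two subsets of 50 peaks each.
null = null_overlap(range(1000), 50, 50, n_iter=200)
print(empirical_p(10, null))
```

An observed overlap that lies far in the upper tail of this distribution would suggest that two implementations agree on personal-only peaks more often than chance alone would predict.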
Fig. 6. (a) Comparison of altered peak q values between MPG, GPG, and DPG implementations by rank. The top n peak subset was increased in increments of 5 peaks. (b) Distribution of gene-relative positions of personal-only peaks among all genomes. Personal-only and common peaks replicated in at least two genomes are also featured. (c) The pileup of a GPG-only peak projected to the hg19 linear reference. (d) The true graph rendering of the above altered peak (AP) in the NA12878 GPG and reference genome graph.