Lance D Hentges, Martin J Sergeant, Christopher B Cole, Damien J Downes, Jim R Hughes, Stephen Taylor.
Abstract
MOTIVATION: Genome sequencing experiments have revolutionized molecular biology by allowing researchers to identify important DNA-encoded elements genome-wide. Regions where these elements are found appear as peaks in the analog signal of an assay's coverage track. Humans can visually categorize these patterns with ease, but the size of many genomes necessitates algorithmic implementations. Commonly used methods rely on statistical tests to classify peaks, which discounts the fact that background signal does not completely follow any known probability distribution and reduces information-dense peak shapes to maximum height alone. Deep learning has been shown to be highly accurate for many pattern recognition tasks, on par with or even exceeding human capabilities, providing an opportunity to reimagine and improve peak calling.
Year: 2022 PMID: 35866989 PMCID: PMC9477537 DOI: 10.1093/bioinformatics/btac525
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. LanceOtron model overview. For an indicated region, local enrichment is calculated against background areas of 10 to 100 kb in 10 kb increments, plus the whole chromosome (left). The enrichment values are used as inputs for a logistic regression model. Signal from the central 2 kb is fed into a CNN (right). The outputs of the CNN and the logistic regression model, together with the local enrichment values, are all input into a multilayer perceptron (bottom), which produces the overall peak score for a given region.
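The enrichment inputs described in the Fig. 1 caption can be illustrated with a minimal Python sketch. It assumes that "enrichment" means the ratio of mean signal in the region to mean signal in a background window centered on that region; this record does not give the exact formula LanceOtron uses, and the function name `local_enrichments` and its parameters are hypothetical.

```python
from statistics import fmean

def local_enrichments(coverage, start, end,
                      window_sizes=range(10_000, 100_001, 10_000)):
    """Return region-vs-background enrichment ratios for one candidate region.

    coverage     : per-base signal values for a single chromosome (list of floats)
    start, end   : half-open region coordinates [start, end)
    window_sizes : background window widths in bp (10 kb to 100 kb by default)

    ASSUMPTION: enrichment = mean(region) / mean(background window).
    The final value uses the whole chromosome as background.
    """
    region_mean = fmean(coverage[start:end])
    midpoint = (start + end) // 2
    ratios = []
    for size in window_sizes:
        # Center the window on the region and clip it to the chromosome ends.
        lo = max(0, midpoint - size // 2)
        hi = min(len(coverage), midpoint + size // 2)
        background = fmean(coverage[lo:hi])
        ratios.append(region_mean / background if background > 0 else 0.0)
    chrom_mean = fmean(coverage)
    ratios.append(region_mean / chrom_mean if chrom_mean > 0 else 0.0)
    return ratios
```

With the default window sizes the function returns 11 values (ten local windows plus the chromosome-wide ratio). In the architecture described in the caption, values like these would feed the logistic regression model and the multilayer perceptron; the CNN branch is not sketched here.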
Fig. 2. Assessing peak calls made from other tools with LanceOtron. Peak calls retrieved from ENCODE are visualized using an interactive t-SNE plot in LanceOtron’s web tool (LanceOtron 22Rv1 H3K27ac project). Screen captures from the image panel display thumbnails of high (top right) and low (bottom right) quality peaks as measured by LanceOtron’s Peak Score metric. In this experiment, roughly 33% (15 162 of 46 030) of the ENCODE-identified peaks had a <10% probability of arising from a biological event according to LanceOtron’s model.
Fig. 3. Benchmarking LanceOtron against MACS2 for peak calling transcription factor ChIP-seq. (A) Model performance metrics using labeled genomic regions of an ENCODE CTCF ChIP-seq dataset. (B) Comparing the number of motifs contained in peak calls generated from LanceOtron and MACS2. (C) Venn diagram of peak calls from LanceOtron and MACS2; regions that did not intersect were assessed for overlap with promoters or enhancers. (D) Thumbnail images from the most highly enriched regions called exclusively by either LanceOtron (top) or MACS2 (bottom). (E) Average coverage of the regions called exclusively by either LanceOtron (top) or MACS2 (bottom) for the CTCF experimental track, control track and DNase-seq open chromatin track.
Fig. 4. Benchmarking LanceOtron against MACS2 for peak calling histone ChIP-seq. (A) Model performance metrics using 10 Mb of labeled genomic regions of ENCODE ChIP-seq datasets for H3K27ac in the HAP-1 cell line, and (B) H3K4me3 in the MG63 cell line. (C) ChIP-seq dataset labeled by Oh et al. for H3K27ac in the GM12878 cell line and (D) H3K4me3 in the K562 cell line.
LanceOtron and MACS2 peak call comparison for TSSs and for active regions in open chromatin
| | LanceOtron | MACS2 | LanceOtron with input | MACS2 with input |
|---|---|---|---|---|
| % top H3K27ac ChIP-seq in HAP-1 peaks overlapping TSSs (count) | 56.1% (2806/5000) | 48.6% (2428/5000) | 56.2% (2812/5000) | 52.1% (2607/5000) |
| % top H3K4me3 ChIP-seq in MG63 peaks overlapping TSSs (count) | 69.4% (3472/5000) | 66.4% (3318/5000) | 70.0% (3501/5000) | 49.8% (2491/5000) |
| % top ATAC-seq in MCF-7 peaks overlapping TSSs (count) | 44.4% (2218/5000) | 21.9% (1096/5000) | | |
| % top DNase-seq in A549 peaks overlapping TSSs (count) | 43.3% (2164/5000) | 22.7% (1133/5000) | | |
| % ATAC-seq peaks in active regions (count) | 12.6% (628/5000) | 7.5% (377/5000) | | |
| % DNase-seq peaks in active regions (count) | 12.1% (607/5000) | 9.5% (477/5000) | | |
Percentages and counts of peaks intersecting TSSs are given for 5000 regions of LanceOtron and MACS2 peak calls, selected for being most enriched (highest peak score or q-value for LanceOtron and MACS2, respectively). Percentages and counts are also shown for open chromatin peaks found in active areas of the genome.
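The counts in the table follow a simple recipe: rank peaks by enrichment, keep the top 5000, and count how many intersect a TSS. A minimal Python sketch of that recipe is given below. It assumes half-open peak intervals, single-base TSS positions, and a score where higher means more enriched; the exact overlap criterion used in the paper (for example, any window placed around each TSS) is not stated in this record, and the name `count_tss_overlaps` is hypothetical.

```python
from bisect import bisect_left

def count_tss_overlaps(peaks, tss_by_chrom, top_n=5000):
    """Count how many of the top-scoring peaks contain at least one TSS.

    peaks        : iterable of (chrom, start, end, score), half-open [start, end)
    tss_by_chrom : dict mapping chrom -> sorted list of TSS positions
    top_n        : number of highest-scoring peaks to keep

    ASSUMPTION: a peak "overlaps" a TSS if start <= tss < end.
    Returns (hits, total) so a percentage can be derived by the caller.
    """
    ranked = sorted(peaks, key=lambda p: p[3], reverse=True)[:top_n]
    hits = 0
    for chrom, start, end, _score in ranked:
        sites = tss_by_chrom.get(chrom, [])
        # Index of the first TSS at or after the peak start.
        i = bisect_left(sites, start)
        if i < len(sites) and sites[i] < end:
            hits += 1
    return hits, len(ranked)
```

For example, the first table cell corresponds to `hits, total = 2806, 5000`, and `100 * hits / total` gives 56.12, which the table rounds to 56.1%.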
Fig. 5. Benchmarking LanceOtron against MACS2 for open chromatin. (A) Model performance metrics using labeled genomic regions of an ENCODE ATAC-seq dataset in the MCF-7 cell line and (B) DNase-seq in the A549 cell line.