Maria Kondili, Annika Fust, Jens Preussner, Carsten Kuenne, Thomas Braun, Mario Looso.
Abstract
The annotation of genomic ranges of interest is a recurring task in bioinformatics analyses. These ranges can originate from various sources, including peaks called for transcription factor binding sites (TFBS) or histone modification ChIP-seq experiments, chromatin structure and accessibility experiments (such as ATAC-seq), but also from other types of predictions that result in genomic ranges. While peak annotation, primarily driven by ChIP-seq, has been extensively explored, many approaches remain simplistic ("most closely located TSS"), rely on fixed pre-built references, or require complex scripting on the part of the user. An adaptable, fast, and universal tool capable of annotating genomic ranges in their respective biological context is critically missing. UROPA (Universal RObust Peak Annotator) is a command-line tool intended for universal genomic range annotation. Based on a configuration file, different target features can be prioritized with multiple integrated queries. These can be sensitive to feature type, distance, strand specificity, feature attributes (e.g. protein_coding) or anchor position relative to the feature. UROPA can incorporate reference annotation files (GTF) from different sources (Gencode, Ensembl, RefSeq), as well as custom reference annotation files. Statistics and plots transparently summarize the annotation process. UROPA is implemented in Python and R.
Year: 2017 PMID: 28572580 PMCID: PMC5453960 DOI: 10.1038/s41598-017-02464-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. (A) Example of a complex annotation situation: the region of interest (peak, black bar) overlaps multiple candidate features (blue). These include protein coding genes (PLOD1, MFN2, MIIP, TNFRSF8) with exon (block) and intron (line) structure and non-coding genes (Y-RNA, RN7SL649P, RNU6-777P). Depending on the origin of the peak region, the optimal annotation will vary. (B) Example of a JSON-formatted configuration file with two queries: I) begin and end of the query section (purple); II) first query targeting gene features with multiple conditions and output filters (key:value pairs, blue); III) second query relating to UTR features (green); IV) global parameters on input files and prioritization. (C) Query and feature scheme: illustration of an oriented feature (orange) and peaks (light and dark blue) that are filtered according to a query with asymmetrical distances as given on top. According to this query, green indicates the valid region around the queried start anchor of the feature. Dark blue peaks centered outside of the green region are never assigned to the feature (upper row, “invalid hits”). Dark blue as well as light blue peaks centered in the green region are assigned to the feature (lower row, “valid hits”). If the “internals” key of the query is set to TRUE, the light blue peaks given in the upper row are also assigned to the feature.
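
The filtering rule described in panel (C) can be illustrated with a minimal Python sketch. This is not UROPA's actual code: the names `Feature`, `anchor_position` and `is_valid_hit` are hypothetical, and the sketch assumes that "internals" means a peak lying entirely inside the feature, or a feature lying entirely inside the peak.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical oriented feature (e.g. a gene) on one chromosome."""
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"


def anchor_position(feature, anchor="start"):
    """Return the genomic coordinate of the anchor, respecting strand."""
    if anchor == "center":
        return (feature.start + feature.end) // 2
    # On the minus strand the feature "start" is its rightmost coordinate.
    use_left = (anchor == "start") == (feature.strand == "+")
    return feature.start if use_left else feature.end


def is_valid_hit(peak_start, peak_end, feature, upstream, downstream,
                 anchor="start", internals=False):
    """Check whether the peak center falls into the valid region."""
    center = (peak_start + peak_end) // 2
    pos = anchor_position(feature, anchor)
    # Signed offset: negative values lie upstream of the anchor.
    offset = center - pos if feature.strand == "+" else pos - center
    if -upstream <= offset <= downstream:
        return True
    if internals:
        # Assumption: containment in either direction counts as internal.
        peak_in_feature = feature.start <= peak_start and peak_end <= feature.end
        feature_in_peak = peak_start <= feature.start and feature.end <= peak_end
        return peak_in_feature or feature_in_peak
    return False


gene = Feature(start=10000, end=20000, strand="+")
print(is_valid_hit(9400, 9600, gene, upstream=1000, downstream=500))
print(is_valid_hit(15000, 15200, gene, upstream=1000, downstream=500))
print(is_valid_hit(15000, 15200, gene, 1000, 500, internals=True))
```

The three calls correspond to a peak centered in the valid region, a peak inside the gene body but outside the valid region, and the same peak rescued by the internals setting.
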
Figure 2. Outline of the UROPA algorithm. (1) For each peak, all queries are consecutively checked for features satisfying various optional criteria. (2) The resulting candidate features are ranked for each query based on the distance of the peak center to the feature anchor(s) of interest (e.g. start, end, center of the feature). (3) All candidate features resulting from any query are stored in the “all hits” table. (4) The best candidate feature for each query is stored in the “best hits” table. (5) Only the one best feature among all queries is stored in the “final hits” table. This step can optionally include prioritization of queries to ensure a desired precedence (e.g. prefer protein_coding genes even if they are located farther away from the peak). These three output files offer different levels of granularity, depending on the desired outcome.
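
The five steps of this outline can be sketched in a few lines of Python. The function `annotate_peak` and the dictionary keys used here are hypothetical and heavily simplified (one anchor coordinate per feature, one optional biotype filter). In particular, modeling prioritization as "the first query with any hit wins" is an assumption about how precedence could work, not a statement about UROPA's implementation.

```python
def annotate_peak(peak_center, queries, features, priority=False):
    """Return (all_hits, best_hits, final_hit) for a single peak."""
    all_hits, best_hits = [], []
    for q_index, query in enumerate(queries):
        # Step 1: collect the features that satisfy this query's criteria.
        candidates = []
        for feat in features:
            if feat["type"] != query["feature"]:
                continue
            wanted = query.get("biotype")
            if wanted is not None and feat["biotype"] != wanted:
                continue
            distance = abs(peak_center - feat["anchor"])
            if distance <= query["max_distance"]:
                candidates.append(
                    {"query": q_index, "name": feat["name"], "distance": distance}
                )
        # Step 2: rank candidates by distance of peak center to anchor.
        candidates.sort(key=lambda hit: hit["distance"])
        all_hits.extend(candidates)          # step 3: "all hits"
        if candidates:
            best_hits.append(candidates[0])  # step 4: "best hits"
    # Step 5: "final hits", optionally honoring query precedence.
    if not best_hits:
        final_hit = None
    elif priority:
        final_hit = best_hits[0]
    else:
        final_hit = min(best_hits, key=lambda hit: hit["distance"])
    return all_hits, best_hits, final_hit


features = [
    {"name": "geneA", "type": "gene", "biotype": "protein_coding", "anchor": 1000},
    {"name": "geneB", "type": "gene", "biotype": "lincRNA", "anchor": 1900},
]
queries = [
    {"feature": "gene", "biotype": "protein_coding", "max_distance": 5000},
    {"feature": "gene", "max_distance": 5000},
]
for use_priority in (False, True):
    _, _, final = annotate_peak(2000, queries, features, priority=use_priority)
    print(use_priority, final["name"])
```

Without prioritization the closer non-coding gene wins; with prioritization the protein coding gene from the first query is reported even though it lies farther from the peak.
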
Comparison matrix of popular annotation tools: the first column lists features, and columns 2–6 indicate whether the respective tool supports them. Available and unavailable features are coded as Y and N, respectively. Where features are comparable, explanations or details are given as keywords. * indicates “only via additional programming”.
| Feature | Homer | GREAT | ChIPpeakAnno | Goldminer | UROPA |
|---|---|---|---|---|---|
| Annotation database | RefSeq | UCSC (internal database) | Pre-calculated sets, e.g. “EnsDb.Hsapiens.v75” | All genomic range files | All GTF formatted files |
| Helper script to generate annotation file | assignGenomeAnnotation | N | N | makeGRanges() | UROPAtoGTF-tool |
| Target for distance calculation | TSS only | TSS only | Start/Center/End of selected feature | Overlap | Start/Center/End of selected feature |
| Select feature type in annotation file | N | N | Y | (N)* | Y |
| Limitation on characteristics, e.g. “protein_coding” | N | N | N | (N)* | Y |
| Definition of multiple annotation queries | N | N | N | N | Y |
| Prioritization of queries | N | N | (Y) no exclusive ranking (precedence) | (Y) within gene model context, but not globally | Y |
| Limit results to upstream/downstream of selected features | N | N | N | N | Y |
| Granularity of resulting annotations | N | N | Shows all hits, no aggregation to the best hit | Clear result structure, but single best hit is often missing | All hits, best hits per query, and merged best hits among all queries |
| Parallelization | N | N | N | N | Y |
| Simple customizing (no programming) | N | Y (only in web-based version) | N | N | Y |
| Audience | Bioinformatician/Biologist | Biologist | Bioinformatician | Bioinformatician | Bioinformatician/Biologist |
| Definition of distance cutoff | N | N | Y | N | Y |
Figure 3. UROPA graphical summary report. (A) The distance to the feature anchor is displayed as a fraction of the total peaks annotated using a density plot. This information can be useful to determine optimal distance settings for annotation. (B) Relative localization of peaks in relation to the annotated feature (one pie chart per feature). (C) Bar plot of the total occurrence of individual features (one plot per query). (D) All queries are included in a pairwise comparison to show possible overlaps. Assuming multiple concurrent queries, the number of exclusively and commonly annotated peaks can be deduced. (E) Distance histogram in relation to query and feature, where each query is depicted separately. (F) The Chow-Ruskey plot represents an area-proportional Venn diagram. It reveals the distribution of peaks that could be annotated per query and works for up to 5 queries.
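
The pairwise comparison in panel (D) boils down to set arithmetic on the peaks annotated by each query. The following Python sketch shows one way to compute those counts; the function `pairwise_overlap`, the query labels and the peak identifiers are invented for illustration and do not come from the UROPA report code.

```python
from itertools import combinations


def pairwise_overlap(annotated):
    """Count exclusive and shared peaks for every pair of queries.

    `annotated` maps a query label to the set of peak IDs it annotated.
    """
    result = {}
    for first, second in combinations(sorted(annotated), 2):
        peaks_first, peaks_second = annotated[first], annotated[second]
        result[(first, second)] = {
            "only_first": len(peaks_first - peaks_second),
            "only_second": len(peaks_second - peaks_first),
            "both": len(peaks_first & peaks_second),
        }
    return result


# Hypothetical example data: three queries and the peaks each one annotated.
example = {
    "query_0": {"peak_1", "peak_2", "peak_3"},
    "query_1": {"peak_2", "peak_3", "peak_4"},
    "query_2": {"peak_5"},
}
for pair, counts in pairwise_overlap(example).items():
    print(pair, counts)
```
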
Figure 4. Global comparison of UROPA to other peak annotation tools. White circles represent peaks without any annotation, blue circles represent the number of peaks exclusively annotated by the respective tool, red circles represent peaks exclusively annotated by UROPA, and violet circles represent peaks annotated by both tools. (A) Comparison of UROPA and Homer; no tool-specific peaks are reported. (B) Comparison of UROPA and Goldminer. (C) Comparison of UROPA and GREAT. (D) Comparison of UROPA and ChIPpeakAnno.
Figure 5. (A) Megakaryocyte differentiation, from left to right: megakaryocyte progenitor cells (MPC) differentiate to erythrocytes (upper branch) or megakaryocytes (lower branch), while specific transcription factors for each branch drive the differentiation process (figure adapted from [10]). (B) Heatmaps of four in silico predicted transcription factor binding sites in the mm10 mouse genome assembly. Binding sites are located in the center, surrounded by the ATAC read signal from −1 kb to +1 kb. A globally normalized color scale represents the strength of the respective ATAC signal. The binding profile is shown at the top of each heatmap. Heatmaps from left to right represent Gabp-alpha, Fli1 and Klf1 as megakaryocyte progenitor specific transcription factors and Pax7 as a megakaryocyte-unrelated factor.
Gabp-alpha associated, in silico predicted binding sites: binding sites are clustered by ATAC-seq signal into open (c1) and closed (c2) regions. For each query (rows), the percentage of peaks with the respective annotation is listed. Queries 0 to 4 reflect typical promoter definitions; query 5 reflects open intergenic regions. The table is based on the best-per-query result file.
| Query | Feature - filter attribute | Anchor | Distance | Gabpa Cluster 1 (%) | Gabpa Cluster 2 (%) |
|---|---|---|---|---|---|
| 0 | gene - protein coding | start | 1000:500 | 86 | 4 |
| 1 | gene - protein coding | start | 2000:500 | 87 | 6 |
| 2 | gene - protein coding | start | 3000:500 | 88 | 8 |
| 3 | gene - protein coding | start | 5000:500 | 89 | 11 |
| 4 | gene - pseudogene | start | 5000:500 | 1 | 2 |
| 5 | gene - protein coding | start | 100000 | 100 | 78 |
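
To make the six queries in the table above concrete, the following Python sketch turns them into a JSON configuration of the kind described for Figure 1(B). Several things here are assumptions rather than facts from the source: the key names (`feature`, `feature.anchor`, `distance`, `filter.attribute`, `attribute.value`, `priority`, `gtf`, `bed`) may differ between UROPA versions, the attribute name `gene_type` depends on the GTF source, the file names are placeholders, and a distance such as `1000:500` is read as two values whose upstream/downstream meaning should be checked against the tool's documentation.

```python
import json

# (feature, attribute value, anchor, distance) for queries 0 to 5 of the table.
TABLE_ROWS = [
    ("gene", "protein_coding", "start", "1000:500"),
    ("gene", "protein_coding", "start", "2000:500"),
    ("gene", "protein_coding", "start", "3000:500"),
    ("gene", "protein_coding", "start", "5000:500"),
    ("gene", "pseudogene", "start", "5000:500"),
    ("gene", "protein_coding", "start", "100000"),
]


def parse_distance(text):
    """Turn '1000:500' into [1000, 500] and '100000' into [100000]."""
    return [int(part) for part in text.split(":")]


def build_config(rows, gtf, bed):
    """Assemble a configuration dictionary; all key names are assumptions."""
    queries = [
        {
            "feature": feature,
            "filter.attribute": "gene_type",  # assumed GTF attribute name
            "attribute.value": value,
            "feature.anchor": anchor,
            "distance": parse_distance(distance),
        }
        for feature, value, anchor, distance in rows
    ]
    # Prioritization is left off here, since the table reports per-query hits.
    return {"queries": queries, "priority": "False", "gtf": gtf, "bed": bed}


# Placeholder file names, not taken from the paper.
config = build_config(TABLE_ROWS, gtf="annotation.gtf", bed="gabpa_sites.bed")
print(json.dumps(config, indent=2))
```
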