| Literature DB >> 34123358 |
Leandro Murgas1,2, Sebastian Contreras-Riquelme1,3, J Eduardo Martínez-Hernandez1,4,2, Camilo Villaman1,2, Rodrigo Santibáñez1, Alberto J M Martin1.
Abstract
The regulation of gene expression is a key factor in the development and maintenance of life in all organisms. Even so, little is known at whole genome scale for most genes and contexts. We propose a method, Tool for Weighted Epigenomic Networks in Drosophila melanogaster (Fly T-WEoN), to generate context-specific gene regulatory networks starting from a reference network that contains all known gene regulations in the fly. Unlikely regulations are removed by applying a series of knowledge-based filters. Each of these filters is implemented as an independent module that considers a type of experimental evidence, including DNA methylation, chromatin accessibility, histone modifications and gene expression. Fly T-WEoN is based on heuristic rules that reflect current knowledge on gene regulation in D. melanogaster obtained from the literature. Experimental data files can be generated with several standard procedures and used solely when and if available. Fly T-WEoN is available as a Cytoscape application that permits integration with other tools and facilitates downstream network analysis. In this work, we first demonstrate the reliability of our method to then provide a relevant application case of our tool: early development of D. melanogaster. Fly T-WEoN together with its step-by-step guide is available at https://weon.readthedocs.io.Entities:
Keywords: Cytoscape; condition-specific networks; data integration; gene regulation; systems biology
Year: 2021 PMID: 34123358 PMCID: PMC8193463 DOI: 10.1098/rsfs.2020.0076
Source DB: PubMed Journal: Interface Focus ISSN: 2042-8898 Impact factor: 3.906
Figure 1Flowchart describing Fly T-WEoN. The TF–gene reference network is filtered by DNA methylation, then by chromatin accessibility of regulatory sites and third by histone marks. TF–gene and miRNA–gene networks are then filtered according to RNA-seq expression of the regulators and edges in the resulting networks are then scored according to the number of filters passed and provided as edge weights in the context-specific GRN.
Histone modifications considered in Fly T-WEoN and their default effect.
Effect of the histone marks on the binding of TFs to chromatin. ‘+’ symbols indicate marks that allow TF binding and ‘−’ indicate non-active TFBSs.
| modification | effect | references |
|---|---|---|
| H3K27me3 | − | [ |
| H3K36me2 | + | [ |
| H3K36me3 | + | [ |
| H3K4me1 | + | [ |
| H3K4me2 | + | [ |
| H3K4me3 | + | [ |
| H3K79me2 | + | [ |
| H3K9ac | + | [ |
| H3K9me2 | − | [ |
| H3K9me3 | − | [ |
| H3S10ph | + | [ |
| H4K16ac | + | [ |
| H4K20me3 | − | [ |
Description of the reference networks employed in Fly T-WEoN. Reference networks were created by assigning TFs to the regulation of specific genes based on a distance threshold between the TFBS and the gene. All three networks described in the table include the same 350 TFs.
| threshold (kb) | genes | edges |
|---|---|---|
| 1.5 | 15 576 | 1 094 130 |
| 2 | 15 899 | 1 190 168 |
| 5 | 16 665 | 1 679 173 |
Gold standard networks used to validate Fly T-WEoN.
Networks made with the 32 TFs at different distance thresholds between TFBSs and the TSS of each gene. Number of different genes and edges present in each of the networks made by assigning a TF to the regulation of a gene if the TF is bound within the distance and the gene TSS. Percentages indicate the ratio of edges and genes present in these networks compared to the subnetworks made with all TFBS for the same 32 TFs.
| threshold (kb) | genes | edges |
|---|---|---|
| 1.5 | 10 096 (83.96%) | 82 919 (80.30%) |
| 2 | 10 620 (84.75%) | 89 880 (80.56%) |
| 5 | 12 822 (89.91%) | 127 222 (82.19%) |
Reliability of L3 gene regulatory networks: single edges.
Performance of Fly T-WEoN measured by its ability to recover edges present in the gold-standard network for different scores. The table displays the number of true positive edges (TP), edges in the gold-standard network also present in the predicted network; false positive edges (FP) or present in the predicted network but absent in the gold-standard network; and false negative edges, those edges that are only present in the gold-standard network and are not present in the predicted network. TP, FP and FN edges were used to calculate precision (P), recall (R) and F1 (italic numbers indicate their highest values).
| score | TP | FP | FN | |||
|---|---|---|---|---|---|---|
| 2 | 81 094 | 19 475 | 1825 | 0.806 | ||
| 3 | 78 807 | 18 846 | 4112 | 0.950 | 0.807 | 0.873 |
| 4 | 76 017 | 17 998 | 6902 | 0.917 | 0.809 | 0.859 |
| 5 | 72 802 | 17 109 | 10 117 | 0.878 | 0.810 | 0.843 |
| 6 | 68 848 | 16 147 | 14 071 | 0.830 | 0.820 | |
| 7 | 61 477 | 14 874 | 21 442 | 0.741 | 0.805 | 0.772 |
| 8 | 50 071 | 12 908 | 32 848 | 0.604 | 0.795 | 0.686 |
| 9 | 5666 | 2225 | 77 253 | 0.068 | 0.718 | 0.125 |
| 10 | 1512 | 664 | 81 407 | 0.018 | 0.695 | 0.036 |
| 11 | 62 | 34 | 82 857 | 0.001 | 0.646 | 0.001 |
Reliability of L3 gene regulatory networks: graphlets.
Performance of Fly T-WEoN measured by its ability to recover graphlets present in the gold-standard network for different edge scores. The table displays the number of true positive graphlets (TP), graphlets present in the gold-standard network also found in the predicted network; false positive graphlets (FP), present in the predicted network but absent in the gold-standard network; and false negative graphlets, those that are only present in the gold-standard network and were not present in the predicted network. TP, FP and FN graphlets were used to calculate precision (P), recall (R) and F1 (italic numbers indicate their highest values).
| score | TP | FP | FN | |||
|---|---|---|---|---|---|---|
| 2 | 143 177 569 | 73 274 608 | 6 516 537 | 0.662 | ||
| 3 | 135 281 974 | 69 127 626 | 14 412 132 | 0.904 | 0.662 | 0.764 |
| 4 | 125 830 206 | 63 634 265 | 23 863 900 | 0.841 | 0.664 | 0.742 |
| 5 | 115 565 165 | 58 106 890 | 34 128 941 | 0.772 | 0.715 | |
| 6 | 103 941 870 | 52 479 975 | 45 752 236 | 0.694 | 0.665 | 0.679 |
| 7 | 84 304 099 | 45 038 593 | 65 390 007 | 0.563 | 0.652 | 0.604 |
| 8 | 57 733 273 | 34 839 940 | 91 960 833 | 0.386 | 0.624 | 0.477 |
| 9 | 757 834 | 1 231 084 | 148 936 272 | 0.005 | 0.381 | 0.01 |
| 10 | 59 937 | 123 182 | 149 634 169 | 0 | 0.327 | 0 |
| 11 | 229 | 328 | 149 693 877 | 0 | 0.411 | 0 |
Characterization of embryo development networks.
The table shows the number of edges and regulatory nodes for each of the networks created for the six time intervals during early development of the fruit fly. Regulatory nodes indicate the number of TFs in each network and the total number of genes and edges in the networks are also displayed. These networks were obtained by removing unlikely edges from a reference network were TFBSs located at most at 1.5 kb upstream the TSS are used to assign the TFs that bind to that TFBS to the regulation of each gene.
| node type | 0–4 h | 4–8 h | 8–12 h | 12–16 h | 16–20 h | 20–24 h |
|---|---|---|---|---|---|---|
| total edges | 718 583 | 733 863 | 803 613 | 888 537 | 928 599 | 840 567 |
| TF nodes | 305 | 324 | 340 | 335 | 345 | 339 |
| total nodes | 8811 | 7886 | 8554 | 10 528 | 10 993 | 11 146 |
Comparisons of embryo development networks using gaphlets.
The table displays the number of true positive graphlets (TP), graphlets in the first network (belonging to the earlier time interval) that are present in the later network; false positive graphlets (FP), those present in the later network but absent in the earlier one; and false negative graphlets (FN), those that are only present in the earlier network and not in the later network. TP, FP and FN graphlets were used to calculate precision (P), recall (R) and F1 metrics.
| comparison | TP | FP | FN | |||
|---|---|---|---|---|---|---|
| 0–4 to 4–8 h | 960 856 575 | 154 840 150 | 170 821 683 | 0.849 | 0.861 | 0.855 |
| 4–8 to 8–12 h | 1 015 103 337 | 237 272 519 | 100 593 375 | 0.910 | 0.811 | 0.857 |
| 8–12 to 12–16 h | 1 162 138 649 | 358 797 508 | 90 237 194 | 0.928 | 0.764 | 0.838 |
| 12–16 to 16–20 h | 1 347 047 992 | 264 431 780 | 173 888 152 | 0.886 | 0.836 | 0.860 |
| 16–20 to 20–24 h | 1 297 280 051 | 74 185 782 | 314 199 708 | 0.805 | 0.946 | 0.870 |
Total number of genes by type and F1 interval in each of the comparisons of embryo development consecutive networks using graphlets.
The table displays the number of genes in each of the four F1 intervals [0.0, 0.5), [0.5, 0.7), [0.7, 0.9) and [0.9, 1.0] in each of the five comparisons performed between GRNs depicting gene regulation at each time lapse. F1 values closer to 0 indicate larger local topological variation, while closer to 1 indicate fewer variations in the graphlets in which a gene participates.
| comparison | 0–4 to 4–8 h | 4–8 to 8–12 h | 8–12 to 12–16 h | 12–16 to 16–20 h | 16–20 to 20–24 h | |
|---|---|---|---|---|---|---|
| all genes | [0.0, 0.5) | 1459 | 1398 | 2595 | 2962 | 2511 |
| [0.5, 0.7) | 187 | 178 | 215 | 319 | 397 | |
| [0.7, 0.9) | 3192 | 1732 | 2023 | 1766 | 1546 | |
| [0.9, 1.0] | 4097 | 5445 | 5820 | 6857 | 7513 | |
| TFs | [0.0, 0.5) | 27 | 25 | 10 | 5 | 10 |
| [0.5, 0.7) | 7 | 9 | 12 | 7 | 4 | |
| [0.7, 0.9) | 212 | 260 | 302 | 273 | 265 | |
| [0.9, 1.0] | 70 | 47 | 16 | 53 | 66 | |
| coding genes | [0.0, 0.5) | 953 | 922 | 1787 | 2114 | 1802 |
| [0.5, 0.7) | 136 | 136 | 151 | 236 | 292 | |
| [0.7, 0.9) | 2602 | 1242 | 1382 | 1126 | 973 | |
| [0.9, 1.0] | 3598 | 4815 | 5077 | 5925 | 6527 | |
| non-coding genes | [0.0, 0.5) | 479 | 451 | 798 | 843 | 699 |
| [0.5, 0.7) | 42 | 33 | 52 | 76 | 101 | |
| [0.7, 0.9) | 378 | 230 | 339 | 367 | 308 | |
| [0.9, 1.0] | 429 | 583 | 727 | 879 | 920 |
GO Slim Biological Process terms associated with genes with the largest topological variation.
The table displays the GO Slim Biological Process obtained with PANTHER for genes with F1 values in the range [0.0–0.5). The fold enrichment value indicates the rate between the percentage of genes with the annotation and the percentage of genes with the same annotation in whole genome. If it is greater than 1, it indicates that the category is overrepresented in the data. These results were filtered by a p-value threshold of 0.01.
| comparison | GO term | GO ID | fold enrichment | |
|---|---|---|---|---|
| 0–4 to 4–8 h | cell differentiation | GO:0030154 | 2.29 | 1.14 × 10−4 |
| developmental process | GO:0032502 | 2.02 | 1.49 × 10−2 | |
| cellular developmental process | GO:0048869 | 2.15 | 3.12 × 10−4 | |
| sulfur compound metabolic process | GO:0006790 | 2.53 | 1.06 × 10−3 | |
| anatomical structure development | GO:0048856 | 1.92 | 1.41 × 10−3 | |
| cellular modified amino acid metabolic process | GO:0006575 | 2.94 | 1.85 × 10−3 | |
| glutathione metabolic process | GO:0006749 | 3.52 | 2.40 × 10−3 | |
| cell fate commitment | GO:0045165 | 4.82 | 3.54 × 10−3 | |
| neurogenesis | GO:0022008 | 2.38 | 4.14 × 10−3 | |
| generation of neurons | GO:0048699 | 2.38 | 5.81 × 10−3 | |
| multicellular organismal process | GO:0032501 | 1.58 | 9.06 × 10−3 | |
| 4–8 to 8–12 h | developmental process | GO:0032502 | 1.87 | 1.15 × 10−3 |
| cell differentiation | GO:0030154 | 2.04 | 2.07 × 10−3 | |
| chaperone-mediated protein folding | GO:0061077 | 4.18 | 3.16 × 10−3 | |
| cellular developmental process | GO:0048869 | 1.91 | 4.46 × 10−3 | |
| anatomical structure development | GO:0048856 | 1.79 | 6.33 × 10−3 | |
| 8–12 to 12–16 h | cellular modified amino acid metabolic process | GO:0006575 | 4.18 | 2.61 × 10−9 |
| glutathione metabolic process | GO:0006749 | 5.00 | 2.04 × 10−8 | |
| cofactor metabolic process | GO:0051186 | 2.24 | 3.25 × 10−5 | |
| sulfur compound metabolic process | GO:0006790 | 2.32 | 1.10 × 10−4 | |
| response to drug | GO:0042493 | 2.92 | 1.39 × 10−3 | |
| organic acid metabolic process | GO:0006082 | 1.66 | 2.81 × 10−3 | |
| small molecule metabolic process | GO:0044281 | 1.46 | 3.85 × 10−3 | |
| transmembrane transport | GO:0055085 | 1.55 | 6.24 × 10−3 | |
| carboxylic acid metabolic process | GO:0019752 | 1.62 | 6.36 × 10−3 | |
| organic anion transport | GO:0015711 | 2.05 | 6.50 × 10−3 | |
| oxoacid metabolic process | GO:0043436 | 1.60 | 6.65 × 10−3 | |
| aminoglycan metabolic process | GO:0006022 | 3.43 | 6.94 × 10−3 | |
| anion transport | GO:0006820 | 1.84 | 8.37 × 10−3 | |
| defence response | GO:0006952 | 3.25 | 8.89 × 10−3 | |
| 12–16 to 16–20 h | aminoglycan metabolic process | GO:0006022 | 3.98 | 7.78 × 10−4 |
| amino sugar catabolic process | GO:0046348 | 3.62 | 2.28 × 10−3 | |
| chemical synaptic transmission | GO:0007268 | 2.09 | 2.73 × 10−3 | |
| anterograde trans-synaptic signalling | GO:0098916 | 2.09 | 2.73 × 10−3 | |
| response to drug | GO:0042493 | 2.64 | 3.05 × 10−3 | |
| synaptic signalling | GO:0099536 | 2.06 | 4.36 × 10−3 | |
| trans-synaptic signalling | GO:0099537 | 2.06 | 4.36 × 10−3 | |
| multicellular organismal process | GO:0032501 | 1.41 | 7.76 × 10−3 | |
| 16–20 to 20–24 h | aminoglycan metabolic process | GO:0006022 | 4.25 | 7.65 × 10−4 |
| amino sugar catabolic process | GO:0046348 | 4.25 | 7.65 × 10−4 | |
| drug metabolic process | GO:0017144 | 1.85 | 5.51 × 10−3 |
Figure 2Comparison of subnetworks composed of all those genes showing larger local variation in the 0–4 h to 4–8 h comparison. The network shown is formed by 594 nodes (44 TFs) and 3107 edges coloured according to their existence in the earlier network, in the later network or in both.
Figure 3Comparison of subnetworks composed of TF coding genes showing larger local variation in the 0–4 h to 4–8 h comparison. The network shown is formed by 44 TFs and 42 edges coloured according to their existence in the earlier network, in the later network or in both.
Figure 4Dorsal–ventral patterning in the 0–4 h network. (a) shows the conservation of the dorsal cascade in the reference networks, while (b) displays the same subnetwork in the 0–4 h time interval using the 5 kb distance threshold. TFs are shown in orange, non-TF nodes are depicted in blue, dark green edges indicate gene regulations present in all three reference networks, light green edges are regulations present only in the reference network of 5 kb, and red edges are gene regulations that were not detected by the procedure employed to generate all reference networks (a) or did not pass all filters in the 0–4 h network.
Qualitative comparison of different methods and Fly T-WEoN.
The table indicates for each tool the language used in its implementation, its purpose, its advantages and disadvantages and general user-friendliness.
| tool | language | purpose | input data | (dis) advantages | GUI |
|---|---|---|---|---|---|
| CENTIPEDE [ | R | infers bound TFBS from Chip-seq of histone modifications and DNAse-seq | matrix of read counts around motif matches based on DNAse-seq or ChIP reads and the following prior information: PWM score for motif matches represented in the genome, conservation score based on evolutionary information of motif and motif distance to TSS | easy to run and very intuitive to generate output data; however it needs many previous steps of data preprocessing to generate the correct input file | no |
| Anchor [ | Python | predicts | genomic coordinates (BED file), DNase-seq data (BAM file and BigWig file), DNA sequence (genome fasta file), TFs motifs and Gencode GFF file | needs various preprocessing steps of all data (long times, computing intensive), then, it is easy to run | no |
| Tepic [ | C++, R, Python | prediction and analysis of TFBS from epigenetic data, supporting more than 30 species | genome sequence in fasta file, genome annotation file (GTF) | easy to run; however the output is not friendly for posterior analysis and requires post-processing | no |
| Mocap [ | R, Python | classification of TFBSs from integration of chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation | DNase-Seq or ATAC-Seq counts, BigWig, motif matrix | low time consuming, but it requires high computing performance. It is easy to run, but it is only available for mouse and human and the output requires post-processing | no |
| Fly T-WEoN | Perl, Java | apply filters from different genomic and epigenomic experiments to a reference network in order to generate context-specific GRNs | BED files from histone PTMs, methylation sequencing, DNA accessibility sequencing and RNA-seq file of counts, RPKM, or FPKM | the major advantage is the possibility to generate a context-specific GRN without further preprocessing of data in a friendly way. However it is only implemented for fly | yes |