Louis Verny, Nadir Sella, Séverine Affeldt, Param Priya Singh, Hervé Isambert.
Abstract
Learning causal networks from large-scale genomic data remains challenging in the absence of time series or controlled perturbation experiments. We report an information-theoretic method that learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, which are commonly found in genomic datasets. Starting from a complete graph, the method iteratively removes dispensable edges by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences by randomizing the available data. The remaining edges are then oriented based on the signature of causality in observational data. The approach and its associated algorithm, miic, outperform earlier methods on a broad range of benchmark networks. Causal network reconstructions are presented at different biological size and time scales, from gene regulation in single cells to whole genome duplication in tumor development, as well as the long-term evolution of vertebrates. Miic is publicly available at https://github.com/miicTeam/MIIC.
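As an illustration of the kind of information-theoretic test the abstract refers to, the following minimal Python sketch estimates mutual information and conditional mutual information from discrete samples. It is not the miic implementation, which the abstract describes as also handling latent variables and edge-specific confidences. The function names `mutual_information` and `simulate`, the toy variables and the noise level of 0.1 are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys, zs=None):
    """Plug-in estimate of I(X;Y|Z) in nats from discrete samples.

    With zs=None this reduces to the unconditional I(X;Y).
    """
    n = len(xs)
    if zs is None:
        zs = [0] * n  # a constant conditioning variable changes nothing
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    total = 0.0
    for (x, y, z), c in n_xyz.items():
        # p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ], written with counts
        total += (c / n) * math.log(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    return total


def simulate(n=5000, seed=0):
    """Toy binary data (hypothetical): a collider and a noisy chain."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    collider = [a ^ b for a, b in zip(x, y)]               # x -> collider <- y
    mediator = [a ^ (rng.random() < 0.1) for a in x]       # x -> mediator
    end = [m ^ (rng.random() < 0.1) for m in mediator]     # mediator -> end
    return {"x": x, "y": y, "collider": collider, "mediator": mediator, "end": end}


data = simulate()
# Collider: x and y are independent, but become dependent given the collider.
mi_collider = mutual_information(data["x"], data["y"])
cmi_collider = mutual_information(data["x"], data["y"], data["collider"])
# Chain: x and end are dependent, but independent given the mediator.
mi_chain = mutual_information(data["x"], data["end"])
cmi_chain = mutual_information(data["x"], data["end"], data["mediator"])
```

In this toy setting, conditioning on the mediator makes the dependence between the chain ends vanish, which is the situation in which an edge can be considered dispensable. Conditioning on the collider instead creates a dependence between two otherwise independent variables, which is the kind of asymmetry that makes orientation possible from observational data.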
Year: 2017 PMID: 28968390 PMCID: PMC5685645 DOI: 10.1371/journal.pcbi.1005662
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Learning causal networks with latent variables.
A v-structure. Bidirected edges in collider paths indicate the presence of latent common cause(s), L, unobserved in the dataset. In the presence of latent variables, establishing conditional independence generally requires conditioning on non-adjacent variables [9, 10]. For example, the pair {Z,T} must be conditioned on X, Y and the non-adjacent W, I(Z; T|X,Y,W) = 0, because one cannot condition on the unobserved latent variables L or L′ to obtain, e.g., I(Z; T|X,L) = 0 or I(Z; T|Y,L′) = 0. Outline of the successive steps of constraint-based approaches (see also Algorithm steps in Materials and methods). F-score (harmonic mean of Precision and Recall, S1, S2 and S3 Figs) of the miic algorithm (warm colors) for 0%, 5%, 10% and 20% of latent variables (top to bottom curves), compared to the RFCI algorithm [10] (cold colors) on benchmark networks of increasing complexity, disregarding (dashed lines) or including (solid lines) edge orientations: Alarm [37 nodes, avg. deg. 2.5, 509 parameters], Insurance [27 nodes, avg. deg. 3.9, 984 parameters] and Barley [48 nodes, avg. deg. 3.5, 114,005 parameters]. Computation times of miic (warm colors) compared to RFCI (cold colors). Insets: computation times in log scale, showing a linear scaling (solid bar) in the limit of large datasets, τ ∼ N^{1±0.1}, with miic, and a close to quadratic scaling (dashed bar), τ ∼ N^{1.8±0.3}, with RFCI.
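The F-score used in this caption can be sketched as follows for a predicted edge set compared against a benchmark network, either disregarding or including edge orientations. This is a minimal sketch under stated assumptions: the function `f_score`, the representation of edges as (source, target) tuples and the rule that a misoriented edge counts as wrong are choices made for this example, and the paper's exact scoring of orientations may differ.

```python
def f_score(true_edges, predicted_edges, use_orientation=True):
    """Harmonic mean of precision and recall over two edge sets.

    Edges are (source, target) tuples. With use_orientation=False the
    direction is ignored and only the skeleton is compared.
    """
    if use_orientation:
        true_set, pred_set = set(true_edges), set(predicted_edges)
    else:
        true_set = {frozenset(e) for e in true_edges}
        pred_set = {frozenset(e) for e in predicted_edges}
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy networks: one misoriented edge and one spurious edge.
true_net = [("X", "Z"), ("Y", "Z"), ("Z", "W")]
predicted_net = [("X", "Z"), ("Z", "Y"), ("Z", "W"), ("X", "W")]

skeleton_score = f_score(true_net, predicted_net, use_orientation=False)
oriented_score = f_score(true_net, predicted_net, use_orientation=True)
```

In this toy case the skeleton score is 6/7 (precision 3/4, recall 1) and the oriented score is 4/7 (precision 1/2, recall 2/3), which shows why curves that include orientations sit below those that disregard them.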
Fig 2. Network reconstruction at cellular level.
Hematopoietic/endothelial differentiation in single cells from mouse embryos [24]. Principal component analysis and K-means clustering of gene expression data [24], with histograms showing the relative proportions of cell populations at each time point (E7.0 to E8.25). Hematopoietic/endothelial differentiation regulatory network between hematopoietic-specific (red), endothelial (violet), common (blue) and unclassified (gray) TFs. Graph predicted with the miic R package and visualized using Cytoscape (blue edges correspond to repressions).
Fig 3. Network reconstruction at tissue level.
Tumor development and drug resistance in the presence of tetraploid tumor cells following whole genome duplication (WGD). Ploidy distribution in the 807 tumor samples and genomic alterations: ploidy, mutations, and normalized under-expression and over-expression changes from the COSMIC database [34]. Genomic alteration network obtained between average ploidy (violet), gene mutations (yellow, lower case) and under- or over-expressions (green, upper case). Graph predicted with the miic R package and visualized using Cytoscape (blue edges correspond to repressions).
Fig 4. Network reconstruction at organismal and phylogenetic levels.
Two rounds of whole genome duplication (WGD) have led to the evolutionary radiation of vertebrates (and similarly with a third 300-MY-old WGD in teleost fish). Biased distributions of genomic properties within ‘non-ohnolog’ and ‘ohnolog’ genes retained from WGDs in early vertebrates [45]. Numbers in brackets indicate the number of genes for which each property is identified (see Materials and Methods and S1 Data). Genomic property network of human genes (see main text). Graph predicted with the miic R package and visualized using Cytoscape (blue edges correspond to repressions).