| Literature DB >> 33286399 |
Pritam Chanda, Eduardo Costa, Jie Hu, Shravan Sukumar, John Van Hemert, Rasna Walia.
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, the computational sciences, and the biological sciences. In this article, we review the basic information-theoretic concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure, and interaction analysis.
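The two quantities at the heart of the review, Shannon entropy and mutual information (MI), can be estimated directly from empirical symbol frequencies. A minimal sketch in Python (the function names and toy sequences are illustrative, not from the paper):

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """H(X) in bits, from the empirical distribution of the symbols in seq."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return (shannon_entropy(xs) + shannon_entropy(ys)
            - shannon_entropy(list(zip(xs, ys))))

print(shannon_entropy("HTHT"))             # 1.0 bit: a fair coin
print(mutual_information("ABAB", "ABAB"))  # 1.0: fully dependent
print(mutual_information("AABB", "ABAB"))  # 0.0: independent
```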
Keywords: computational biology; disease-gene association mapping; entropy; error correction; gene expression; information theory; interaction analysis; metabolic networks; metabolomics; protein structure; sequence comparison; transcriptomics
Year: 2020 PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. Schematic of the gene regulatory network (GRN) reconstruction problem, where undirected edges are inferred using information-theoretic methods.
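The simplest GRN-reconstruction idea of this kind, connecting gene pairs whose expression profiles share MI above a threshold (the relevance-network approach), can be sketched as follows. The toy expression values, the equal-frequency binning, and the 0.5-bit cutoff are all illustrative assumptions:

```python
import math
from collections import Counter
from itertools import combinations

def entropy(seq):
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

def mi(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def discretize(values, bins=2):
    """Equal-frequency binning: map each value to a bin index 0..bins-1."""
    order = sorted(values)
    cuts = [order[len(order) * k // bins] for k in range(1, bins)]
    return [sum(v >= c for c in cuts) for v in values]

def relevance_network(expr, threshold):
    """expr: {gene: expression values}. Keep undirected edges with MI > threshold."""
    disc = {g: discretize(v) for g, v in expr.items()}
    scores = {(a, b): mi(disc[a], disc[b])
              for a, b in combinations(sorted(expr), 2)}
    return {edge: s for edge, s in scores.items() if s > threshold}

expr = {
    "g1": [0.1, 0.2, 0.9, 1.0, 0.15, 0.95],
    "g2": [0.2, 0.1, 1.1, 0.9, 0.05, 1.05],  # co-expressed with g1
    "g3": [0.5, 1.2, 0.3, 0.8, 1.1, 0.4],    # unrelated
}
net = relevance_network(expr, threshold=0.5)
print(net)  # only the (g1, g2) edge survives the threshold
```

Methods such as CLR and ARACNE refine exactly this construction by filtering spurious edges against a background distribution or via the data processing inequality.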
Figure 2. Schematic overview of alignment-free sequence comparison.
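A typical alignment-free comparison reduces each sequence to a normalized k-mer frequency profile and compares profiles with a divergence such as the Jensen-Shannon divergence (JSD), as in FFP. A minimal Python sketch (toy sequences; not the authors' implementation):

```python
import math
from collections import Counter

def kmer_freqs(seq, k):
    """Normalized k-mer frequency profile of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def kl(p, q):
    """KL divergence D(p||q) in bits over the support of p (q must cover it)."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items())

def jsd(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if supports are disjoint."""
    m = {x: (p.get(x, 0.0) + q.get(x, 0.0)) / 2 for x in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

a = kmer_freqs("ACGTACGTACGT", 3)
print(jsd(a, kmer_freqs("ACGTACGTACGA", 3)))  # small: one mismatched base
print(jsd(a, kmer_freqs("GGGGCCCCGGGG", 3)))  # 1.0: no shared 3-mers
```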
Figure 3. Illustration of an information-theoretic communication model used to represent error correction in genomic sequencing, based on the work of Chen et al. [65].
Figure 4. Overview of the genome-disease association problem and some example information-theoretic metrics calculated for identifying GxG and GxE interactions; a binary phenotype and a single environmental variable are shown for simplicity.
Figure 5. Simplified illustration of how mutual information (MI) can be used to capture links between two proteins from their multiple sequence alignments. High MI between two columns of the alignments is associated with interactions in their structures.
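The column-pair MI scoring illustrated here can be sketched for a toy alignment. The sequences below are fabricated so that columns 1 and 3 co-vary; this is not a real protein family:

```python
import math
from collections import Counter
from itertools import combinations

def entropy(seq):
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

def mi(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Fabricated MSA: residues in columns 1 and 3 always change together (K<->D, R<->H)
msa = ["AKLDE", "AKLDE", "ARLHE", "ARLHE", "AKMDE", "ARMHE"]
cols = list(zip(*msa))  # one tuple of residues per alignment column

scores = {(i, j): mi(cols[i], cols[j])
          for i, j in combinations(range(len(cols)), 2)}
best = max(scores, key=scores.get)
print(best, scores[best])  # (1, 3) 1.0 -- the co-varying column pair
```

In practice such raw MI scores are corrected for phylogenetic background and column conservation before being interpreted as contacts or interactions.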
Figure 6. An example of a metabolic network. Nodes represent metabolites, which are transformed into one another through chemical reactions catalyzed by metabolic enzymes (red). Metabolic pathways are formed by a linked series of chemical reactions that collectively perform a biological function. Reconstruction of metabolic networks is typically done by reverse engineering of metabolome data, where information-theoretic methods such as ARACNE and CLR have been applied. Reaction fluxes, depicted by the width of the reaction arrows, can be predicted using flux balance analysis (FBA). Also shown are examples of downstream analysis of a metabolic network using information theory: metabolic diversity characterization, structure-dynamics relationship analysis, single-cell-level simulation of metabolic networks with FBA, and nutrient information sensing.
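FBA itself is a linear program: maximize a target flux subject to the steady-state constraint Sv = 0 and flux bounds. A toy three-reaction example using scipy.optimize.linprog (the network, bounds, and objective are invented for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: v1 imports metabolite A, v2 converts A -> B, v3 drains B (biomass).
# Rows of S are internal metabolites (A, B); columns are reactions (v1, v2, v3).
S = np.array([
    [1, -1,  0],   # A: produced by v1, consumed by v2
    [0,  1, -1],   # B: produced by v2, consumed by v3
])
bounds = [(0, 10), (0, 100), (0, 100)]  # uptake v1 is capped at 10 units
# Maximize v3 subject to S v = 0 (steady state); linprog minimizes, hence -v3.
res = linprog(c=[0, 0, -1], A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print(res.x)  # [10. 10. 10.]: the whole pathway runs at the uptake limit
```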
Figure 7. Biology presents many optimization problems. Shown here is a hypothetical search space that could represent anything from an organism’s traits to experiment designs to visualization parameters. In these cases, the vertical axis could be fitness/MEP, experiment value, and network edge crossovers, respectively.
Figure 8. Independent Component Analysis (ICA) can compute latent factors in data, like Principal Component Analysis (PCA). Unlike PCA, ICA is not limited to orthogonal mixtures, as shown in this simulated dataset: the ICA mixing vectors align with the data better than the PCA rotation vectors.
Table 1. Summary of key information-theoretic methods for the areas discussed in Section 3.1, Section 3.2, Section 3.3 and Section 3.4.
| Area | Information Theoretic Methods | How Information Theory Is Used |
|---|---|---|
| Reconstructing Gene Regulatory Networks | Relevance Networks [ | Used MI larger than a given threshold to construct GRNs |
| | CLR [ | Used MI to construct GRNs; filters spurious edges by estimating their background distribution |
| | ARACNE and its extensions [ | Used MI to construct GRNs; filters edges using a null distribution and DPI, higher-order DPI to improve inference performance, and an adaptive binning strategy to estimate MI efficiently |
| | PIDC [ | GRN constructed using PID to explore dependencies between triplets of genes in single-cell gene expression datasets |
| Alignment-free phylogeny | FFP [ | Calculated JSD as the pairwise distance between two genomes using normalized k-mer frequencies |
| | kWIP [ | Constructed a sketch data structure using all k-mers from a genomic sequence; computed the inner product between two sketches weighted by their Shannon entropy across the given dataset |
| Sequencing and Error Correction | Motahari et al. [ | Used Rényi entropy of order 2 to find the minimum fragment length and coverage depth needed to assemble reads into the original DNA sequence with a given reliability |
| | Chen et al. [ | Analyzed the information redundancy in dual-base degenerate sequencing by comparing the entropy information content of multiple DPL (degenerate polymer length) arrays |
| | Anavy et al. [ | Proposed encoding and decoding for composite-DNA-based storage and error correction by developing composite DNA alphabets and using KLD to select the best alphabet model |
| | Choi et al. [ | Used eleven degenerate bases as encoding characters in addition to ACGT to increase the information capacity limit and reduce the cost of DNA per unit of data |
| Genome-wide disease-gene association mapping | Fan et al. [ | Proposed test statistics based on MI and conditional entropy involving two SNPs and the disease phenotype from case-control studies |
| | Varadan et al. [ | Used synergy to analyze GxG statistical interactions |
| | Chanda et al. [ | Used multivariate information-theoretic metrics and higher-order models (e.g., KWII) to analyze statistical GxG and GxE interactions |
| | Moore et al. [ | Used KWII to represent edges in epistasis networks |
| | Tahmasebi et al. [ | Used entropy to formulate the capacity of recovering the causal subsequence |
| | Chanda et al. [ | Discussion and analysis of information-theoretic methods dealing with quantitative phenotypes and environmental variables |
| | Andrade et al. [ | Developed information-theoretic test statistics for single-group or case-only studies |
| | Tzeng et al. [ | Developed entropy-based tests using haplotypes |
| | Cui et al. [ | Developed an entropy-based association test for gene-centric analysis, considering variants within one gene as testing units |
| | Zhao et al. [ | Designed and discussed entropy-based disease association test metrics for family-based studies |
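Several entries above rely on KWII (k-way interaction information). For two SNPs A, B and a phenotype P it is a signed sum of joint entropies, with positive values indicating synergy; the classic XOR phenotype makes this concrete. The toy genotypes and the function below are an illustrative sketch, not any cited implementation:

```python
import math
from collections import Counter

def H(*variables):
    """Joint Shannon entropy in bits of one or more aligned sample lists."""
    samples = list(zip(*variables))
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

def kwii3(a, b, p):
    """3-way interaction information; positive values indicate synergy."""
    return (-H(a) - H(b) - H(p)
            + H(a, b) + H(a, p) + H(b, p)
            - H(a, b, p))

# XOR phenotype: neither SNP predicts p alone, but together they determine it.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
p = [x ^ y for x, y in zip(a, b)]
print(kwii3(a, b, p))         # 1.0: purely synergistic GxG interaction
print(H(a) + H(p) - H(a, p))  # 0.0: each single SNP shares no MI with p
```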
Table 2. Summary of key information-theoretic methods for the areas discussed in Section 3.5, Section 3.6 and Section 3.7.
| Area | Information Theoretic Methods | How Information Theory Is Used |
|---|---|---|
| Reconstruction and analysis of Metabolic Networks | CLR [ | Uses MI to construct metabolic networks from metabolite concentration data; filters spurious edges by estimating their background distribution [ |
| | Marr et al. [ | Network analysis of metabolic networks using Shannon and word entropy to reveal regularization dynamics encoded in network topology |
| | Nykter et al. [ | Studied network structure-dynamics relationships, using Kolmogorov complexity as a measure of distance between pairs of network structures and between their associated dynamic state trajectories |
| | Grimbs et al. [ | Stoichiometric analysis to parameterize metabolic states; assessed the effect of enzyme-kinetic parameters on the stability properties of a metabolic state using MI and the Kolmogorov-Smirnov test |
| | Fuente et al. [ | Studied properties of dissipative metabolic structures at different organizational levels using entropy |
| | De Martino et al. [ | Introduced a generalization of FBA to the single-cell level based on the maximum entropy principle |
| | Saccenti et al. [ | Investigated the associations and interconnections among different metabolites by means of network modeling using a maximum entropy ensemble null model |
| | Wagner et al. [ | Proposed an information-theoretic way to relate nutrient information to the error in a cell’s measurement of nutrient concentration in its environment |
| Protein interaction analysis | Wollenberg et al. [ | Mutual information combined with evolutionary information and refined with structural information to identify protein interactions |
| Optimization, Dimensionality Reduction | Cannon et al. [ | Used MEP to simulate central metabolism in the fungus Neurospora crassa [ |
| | Hyvärinen et al. [ | Used negentropy and minimization of MI to obtain the components in ICA |
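The negentropy maximized in FastICA is usually replaced by Hyvärinen's approximation J(y) ≈ (E[G(y)] − E[G(ν)])², with G(u) = log cosh u and ν a standard Gaussian: it is near zero for Gaussian data and grows with non-Gaussianity. A sketch (the sample sizes and the Laplace test source are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def negentropy(y, n_ref=50_000):
    """J(y) ~ (E[G(y)] - E[G(nu)])**2 with G(u) = log cosh u and nu ~ N(0, 1).
    Near zero for Gaussian data; larger for more non-Gaussian data."""
    y = (y - y.mean()) / y.std()
    ref = rng.standard_normal(n_ref)
    g = lambda u: np.log(np.cosh(u))
    return (g(y).mean() - g(ref).mean()) ** 2

gauss = rng.standard_normal(50_000)
laplace = rng.laplace(size=50_000)  # super-Gaussian, like many real sources
print(negentropy(gauss))    # ~0: a Gaussian carries no negentropy
print(negentropy(laplace))  # clearly larger: ICA would prefer this direction
```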