Integrating Epigenomics into the Understanding of Biomedical Insight.

Yixing Han1, Ximiao He2.   

Abstract

Epigenetics is one of the most rapidly expanding fields in biomedical research, and the widespread adoption of high-throughput next-generation sequencing (NGS) has accelerated epigenomic discovery over the past decade. Epigenetics studies heritable phenotypes that result from chromatin changes without alteration of the DNA sequence. Epigenetic factors and their interaction networks regulate almost all fundamental biological processes, and incorrect epigenetic information may lead to complex diseases. A comprehensive, genome-wide understanding of epigenetic mechanisms, their interactions, and their alterations in health and disease has become a priority in biological research. Bioinformatics is expected to make a major contribution to this goal, especially in processing and interpreting large-scale NGS datasets. In this review, we first introduce pioneering achievements of epigenetics in health and complex diseases; next, we systematically review epigenomics data generation and summarize public resources and integrative analysis approaches; finally, we outline the challenges and future directions in computational epigenomics.

Keywords:  DNA methylation; NGS; chromatin; computational epigenomics; epigenetics; histone modification; integrative analysis; ncRNAs

Year:  2016        PMID: 27980397      PMCID: PMC5138066          DOI: 10.4137/BBI.S38427

Source DB:  PubMed          Journal:  Bioinform Biol Insights        ISSN: 1177-9322


Introduction

With the advance of next-generation sequencing (NGS) technology, large-scale omics data are accumulating at an exponential rate, making biomedical research and the life sciences increasingly data intensive. Scientific discoveries are based more and more on genome-wide data and systematic data analysis. However, genome research still faces significant challenges, including the shift of the bottleneck from data generation to data analysis and interpretation, and the growing difficulty of integrative analysis across multiple data dimensions. The field of epigenetics and epigenomics is attracting immense interest. Epigenetics is defined as the “stably heritable phenotype resulting from changes in a chromosome without alterations in the DNA sequence”.1 Epigenetic regulation comprises many different pathways, such as DNA methylation, histone modifications, histone variants, nucleosome positioning, and noncoding RNAs (ncRNAs). These factors work at the interface of the environment and the genome and play an essential role in fundamental biological processes, touching upon central problems of biology: How do epigenetic mechanisms act as a driving force in cell specialization during development?2 Which molecular mechanisms contribute to phenotypic inheritance and evolutionary adaptation?3,4 And how do epigenetic factors influence complex diseases?4–6 Different categories of epigenetic regulatory factors are involved in an interactive network and act coordinately within or between chromosomes to shape genomic architecture, regulation, and transcriptional and translational outcomes. Epigenomics extends the study of epigenetics from single loci and single factors to global, multilayered regulatory cues.
It is essential for establishing the landscapes of epigenetic marks under various conditions, which in turn shows how epigenetic profiles are maintained and altered by machinery regulated through cross talk among these layers and through interplay with binding proteins, chromatin accessibility, and 3D conformation.7,8 Because it offers both genome-wide and interaction-network views, NGS was adopted in this field soon after its development and has generated comprehensive genome-wide datasets for all layers of epigenetic regulation.9 Hence, the joint analysis of multilayer epigenomic data, together with genomic, transcriptomic, and proteomic data, through integration methods is critical for understanding how epigenetic information contributes to the control of complex regulatory processes. Here, we review pioneering epigenomic studies and computational analyses that have contributed to biomedical research. In addition, we summarize the data, tools, and resources and outline future challenges in computational epigenetics, which is highly valuable for capturing the full picture of the biological system.

Epigenetic Mechanisms

In eukaryotic cells, genomic DNA is compacted more than 10,000-fold in the nucleus by wrapping around highly conserved proteins termed histones. This highly organized DNA–protein structure is called the nucleosome, which forms the building block of chromatin. In general, the more tightly the DNA is wrapped, the more likely the gene is to be repressed, whereas more accessible (less condensed) chromatin allows the transcription machinery to bind easily and initiate gene transcription. Epigenetic inheritance is encoded in covalent modifications, whereas genetic inheritance is encoded in the DNA sequence. Epigenetic information can propagate faithfully between generations of cells (mitotic inheritance)2 and between generations of organisms (meiotic inheritance),10 but with substantially lower fidelity than genetic information.11,12 There are four types of epigenetic regulators: DNA methylation, histone modification, nonhistone binding proteins, and ncRNAs, which act synergistically to control the chromatin architecture for cellular processes such as transcription, replication, and DNA repair.13 DNA can be methylated at specific regions, which fosters a locally more compact chromatin structure and influences accessibility for transcription factors.14 The nucleosome core consists of four core histones (two copies each of H2A, H2B, H3, and H4) that are subject to a large number of posttranslational modifications on their unstructured N-terminal tails, including lysine and arginine methylation, lysine acetylation, and serine phosphorylation.15 Moreover, nonhistone proteins can affect the chromatin structure by interacting with histones and DNA in a variety of ways.
ATP-dependent chromatin remodeling factors can directly mobilize nucleosomes or work together with enzymes in determining DNA methylation patterns and programming the histone code.16–19 Epigenetic modifiers can dynamically “write”, “read”, and “erase” modifications to program or reprogram chromatin accessibility and thereby regulate gene expression during cell differentiation and disease occurrence.20,21 They work jointly with DNA, histones, and nonhistone proteins to form a complex interaction network that regulates chromatin accessibility for transcription.22 ncRNAs are RNAs that are transcribed from DNA but function as structural, functional, and regulatory molecules rather than serving as templates for proteins; they account for up to 70% of the genome.23–26 Based on their length and biogenesis, the epigenetics-related ncRNAs can be grouped into long noncoding RNAs (lncRNAs; >200 nt); mid-size RNAs (20–300 nt), which include small nucleolar RNAs (snoRNAs),27,28 promoter-associated small RNAs (PASRs), and TSS-associated RNAs (TSSa-RNAs);29 and short ncRNAs (<200 nt), which include microRNAs (miRNAs; 21–23 nt),30,31 short interfering RNAs (siRNAs; 20–30 nt),31 Piwi-interacting RNAs (piRNAs; 27–30 nt),32 and tRNA-derived RNAs (tDRs; 20–35 nt).33,34 The mechanisms by which the vast majority of these ncRNAs function and are processed remain to be discovered; however, well-studied cases show that these ncRNAs can interact with DNA, RNA, and proteins and generally function as cis-acting silencers and as trans-acting mediators of site-specific transcriptional and posttranscriptional processes, nuclear organization, RNA processing, and transposon suppression through sequence complementarity.35–38 The study of mid-size ncRNAs and lncRNAs is still in its infancy, and their biological functions are predicted to be related to transcription but remain to be well defined.
However, several possible mechanisms for lncRNAs have been proposed based on the few relatively well-studied examples. lncRNAs have been shown to form complexes with other factors that act on cis-targets (eg, enhancer-like activities) or trans-targets (eg, Hox transcript antisense intergenic RNA [HOTAIR] binding to the polycomb repressive complex) and to function in both the nucleus and the cytoplasm to regulate transcription and translation.39–46 Thus, epigenetic mechanisms are fundamental to the regulation of many cellular processes, including the spatial and temporal expression of genes and ncRNAs, cell differentiation, embryogenesis, DNA replication, DNA repair, alternative splicing, X-chromosome inactivation, genomic imprinting, and suppression of transposable element mobility.23,47–53 Beyond the epigenetic modifications occurring at linear chromatin domains, higher-order chromatin territories are emerging, with the help of NGS technology, as important regulators of genes; this confirms microscopy findings from decades ago that chromosomes occupy preferential spatial positions in the nucleus to facilitate necessary long-range domain interaction and regulation.54 Interactome studies revealed that the boundaries of topological domains are highly conserved across species and enriched for essential genes, repeat elements, and insulator-binding motifs.55 Histone modification patterns can also be identified at topologically associating domains.56 Active chromatin reorganization occurs in, and regulates, a wide range of biological processes, including cell differentiation and tumorigenesis,57,58 and is being found to play an increasingly instrumental role in the epigenomic network.

Epigenomics and Complex Diseases

The faithful propagation of epigenetic information is as important as that of genetic information, as it ensures the precise regulation of biological processes over multiple cell divisions. Stochastic and environment-induced epigenetic defects are known to play a major role in the occurrence of complex diseases and conditions, including cancer, aging, mental disorders,59 and autoimmune diseases.60 Epigenetic mutations accumulate with age and may result in improper activation of normally downregulated genes,61,62 affecting genome stability.63 These changes underlie the general effects of aging and aging-related diseases, such as cancer and neurodegenerative diseases.59,64 For instance, the DNA methylation pattern, owing to its delicate balance between stability and plasticity, has been suggested to provide a lifetime record of environmental exposures and a valuable biomarker for risk stratification and disease diagnosis. As they age, monozygotic twins exhibit remarkable differences in genome methylation patterns that result in differential gene expression and, ultimately, life span.65 Global demethylation occurs in DNA repeat elements in cancer and aging,66,67 and the cancer epigenome is characterized by a massive global loss of DNA methylation68 together with hypermethylation of certain promoter CpG islands69 that frequently overlap with enhancers and other regulatory elements. The genome-wide DNA methylation changes mediate genome instability, chromosomal translocations, gene mutations, and reactivation of endoparasitic sequences.
Elevated expression of DNA methyltransferases (DNMTs) has been identified during aging and tumorigenesis and has been linked to the hypomethylation feature.70–72 Accumulating evidence supports the notion that DNA methylation constitutes a promising and reliable biomarker in clinical practice for earlier and more reliable cancer diagnosis73,74 and more precise tumor subtype classification.75,76 Global changes in histone modification profiles and in the expression of chromatin-modifying enzymes are also critical in aging62,77 and in cancer initiation and progression.67,78 For example, cancer cells undergo a global reduction of the activation marks H4K16ac79 and H3K4me380 and a gain in the repressive marks H4K20me3,79 H3K9me3,81 and H3K27me3,82 while H4K16ac,83 H3K4me2,62 H3K4me3,84 and H4K20me385 increase with age. Altered distribution of histone modifications is mainly due to the abnormal expression of histone-modifying enzymes, such as histone deacetylases (HDACs) in the sirtuin family,86 SETD2,87 and EZH2,88 which leads to nonadaptive alterations of the epigenetic landscape and thereby to changes in gene expression. Deregulated epigenetic mediators that lead to complex disorders may serve as potential therapeutic targets, an approach termed epigenetic therapy. A prominent example is the sirtuin family of protein deacetylases, which can be targeted to extend health span. Small molecules that increase the nicotinamide adenine dinucleotide (NAD+) level can activate the sirtuins and mimic the effect of caloric restriction on genome-wide gene expression,89,90 which represents an epigenetic interventional path to prevent neurodegeneration,91 type II diabetes,92 cancer,93 and aging.94 ncRNAs are instrumental regulatory elements for cellular homeostasis.
Rapidly growing evidence has consistently shown that deregulation of their precise transcription and maturation, incorrect interaction with target mRNAs, and mutations in the ncRNA-processing machinery are causal factors in neurological diseases, tumorigenesis, and cardiovascular and developmental diseases. Among the variety of ncRNAs, miRNAs are the most thoroughly studied, especially in cancer and neurological disorders. miRNAs can act as both oncogenes and tumor suppressors, and their expression profiles differ between tumor and normal tissues and also among cancer types,95,96 which provides important information for cancer prognosis and classification. For example, dysregulation of the miR-15, miR-16,97 and miR-20098 families is associated with genetic alterations that affect their primary processing, maturation, and interaction with mRNA targets and leads to chronic lymphocytic leukemia (CLL), ovarian cancer, and breast cancer. Impairments in miRNA-processing complexes that cause abnormal maturation are also involved in cancer; for example, mutations in TARBP299 and DICER1,100 which are key processors in primary miRNA maturation, can cause downregulation of miRNAs and subsequent tumorigenesis. It has been documented that approximately 70% of miRNAs are expressed in the brain and function specifically in neural differentiation, maintenance, and synaptic plasticity, and their dysregulation has been found in almost all neurological disorders. For example, miR-29,101 miR-107,102 miR-298, and miR-328103 regulate the beta-amyloid precursor protein-cleaving enzyme I and can accelerate Alzheimer’s disease progression.
Mutations in miRNA-processing machinery factors and RNA-binding proteins (RBPs) are also implicated in disease: mutation of the fragile X mental retardation 1 protein (FMRP) in the RISC complex can cause fragile X syndrome (FXS),104 mutation of leucine-rich repeat serine/threonine-protein kinase 2 (LRRK2) is a cause of Parkinson’s disease,105 and the RBP Musashi1 is associated with many cancers, including breast cancer, colon cancer, glioblastoma, and medulloblastoma, as well as with neurodegenerative diseases.106 Evidence of lncRNA dysregulation in diseases is also rapidly increasing, for example, HOTAIR107 and lincRNA-p21108 in cancer and H19 in Silver–Russell syndrome and Beckwith–Wiedemann syndrome.109 Further understanding of the global patterns of these epigenetic modifications and their corresponding changes in complex diseases has enabled improved diagnosis, discovery of therapeutic targets, and design of better treatment strategies.

Epigenomics Data Generation

Different approaches have been developed to capture the multiple levels of epigenetic signal, with the ultimate aim of disentangling the epigenetic regulation network. Most of these approaches follow a three-phase strategy. First, epigenetic information is converted into genetic information through biochemical methods. Next, standard DNA array technology or high-throughput sequencing is applied. Finally, computational and statistical analyses are used to process the sequence data and interpret the outcomes for biological insight. Combining these experiments with high-throughput sequencing technology has made it possible to acquire data over large genomic regions and even at genome-wide scale. Various experimental methods have been developed to identify DNA methylation patterns. Pretreatments for these methods use endonuclease digestion (such as CHARM110 and MCA111), affinity enrichment (such as MeDIP112 and MIRA113), and bisulfite conversion.114 With the advances in NGS technologies, bisulfite conversion of unmethylated Cs to Ts followed by high-throughput sequencing (BS-seq) has become the gold standard method to study the methylation status of every cytosine in the genome and to produce detailed DNA methylation maps.115 Popular strategies include Whole-Genome Bisulfite Sequencing (WGBS)116 and Reduced Representation Bisulfite Sequencing (RRBS).117 These methods produce huge volumes of DNA methylation data (from hundreds of gigabytes to terabytes), which pose enormous challenges for the computational approaches used to analyze and interpret them.114,115 Histone modification signals and chromatin-binding factors can be captured by chromatin immunoprecipitation (ChIP)-based techniques, such as ChIP-seq118,119 and ChIP-chip,120 in which specific antibodies are used to enrich the DNA fragments at modification sites.
Ultra-high-throughput approaches are becoming more and more popular owing to their high coverage, high resolution, and low cost.118,119,121,122 At specific regions of the genome, chromatin has lost its condensed structure, exposing the DNA and making it accessible to DNA-degrading enzymes such as DNase I and to the transcriptional machinery. DNase-seq123 exploits these dynamic DNase I hypersensitive sites (DHSs) in combination with NGS to understand chromatin packaging under various circumstances. Recently, chromosome conformation capture (3C)-based techniques have been used increasingly to detect genome folding, chromosome spatial conformation, and long-range gene–gene interactions.124 In particular, Hi-C, an advanced 3C derivative, has been developed as a powerful tool for profiling genome-wide intra- and interchromosomal interactions, providing unbiased large-scale information for reconstructing the 3D structure of the chromosome.125 Furthermore, single-cell Hi-C has significantly promoted the discovery of cell-to-cell variability in chromosome structure under normal and disease conditions.126,127 With the application of NGS, which offers an unprecedented opportunity to obtain higher throughput and accuracy with lower experimental complexity, an increasing number of novel ncRNA classes are emerging from the 90% of the genome that is transcribed.25 The discovery and detection of ncRNAs are mostly based on size fractionation of isolated RNAs, which led to the classification of ncRNAs as small, mid-size, and long.
Small RNA-seq is the most widely used approach for small ncRNA identification; its library construction overlaps largely with that of RNA-seq, including ribosomal RNA elimination, cDNA synthesis, 3′-A addition, adaptor ligation, and PCR enrichment.128 However, precise size selection of 18–30 nt or 30–200 nt fragments, instead of the RNA fragmentation step, is critical.129 Owing to their poly(A) tails and mRNA-like features, lncRNAs can be detected in cDNA cloning, tiling array, and polyadenylated transcriptome data. For example, cDNA cloning followed by Sanger sequencing was used in the first large-scale cataloging of lncRNAs (>34,000) by the FANTOM project130,131 and in the lncRNA annotation of the RefSeq and Ensembl projects.132,133 Genome-wide tiling arrays of the transcriptome also contribute efficiently to ncRNA identification,26 and more recently, high-throughput RNA-seq has dramatically improved discovery sensitivity and enables the reconstruction of transcript models with or without a reference genome.134,135 The question of how ncRNAs interact with DNA, mRNA, and proteins is central to studies of ncRNA epigenetic regulation and functional annotation. Experimental approaches that are used for mRNA detection and quantification, such as qPCR, Northern blots,136 fluorescence in situ hybridization (FISH),137 and RNA interference (RNAi),138,139 can also be applied to the characterization of ncRNAs. RNA-binding protein immunoprecipitation (RIP)140 followed by microarray (chip)141 or NGS142 and UV cross-linking and immunoprecipitation (CLIP)143 enable various RNA–RBP interaction studies with lower background and higher affinity.
The application of CLIP-seq is expanding from mRNA to miRNA,144 lncRNA,145 circRNA,146 and mitochondrial RNA.147 Genome-wide CLIP experiments should be designed specifically to accommodate the aims of each study; for example, studies may focus on RBP-binding site identification, RBP interactions with other factors, or RBP function in different biological processes, including transcription, splicing, and translation. Furthermore, chromatin isolation by RNA purification (ChIRP)-seq can be applied to illuminate the interactions of RNA, chromatin, and protein.148 We summarize these approaches in Table 1. These rapidly advancing technologies create ample opportunities for epigenome research; at the same time, however, they pose substantial challenges in terms of the storage and processing of large datasets, statistical analysis, and biological interpretation of observed differences.
Table 1

Main epigenomics data generation methods.

APPLICATION | METHODS | PRINCIPLE | REFS
DNA methylation pattern detection | Methylated DNA immunoprecipitation (MeDIP) | Purified DNA is immunoprecipitated with an antibody against methylated cytosines, giving rise to genomic maps of DNA methylation | 111
DNA methylation pattern detection | Bisulfite sequencing | Bisulfite converts the unmethylated cytosines to uracils | 114
DNA methylation pattern detection | Reduced representation bisulfite sequencing (RRBS) | Combines restriction enzymes and bisulfite sequencing in order to enrich for the areas of the genome that have a high CpG content | 116
Histone modification pattern detection, chromatin-binding protein pattern detection | ChIP-chip | Specific antibodies are used to enrich the DNA fragments at modification sites, followed by array hybridization | 119
Histone modification pattern detection, chromatin-binding protein pattern detection | ChIP-seq | Specific antibodies are used to enrich the DNA fragments at modification sites, followed by high-throughput sequencing | 117, 118
3D structure of chromatin | DNase-seq | At DNase I hypersensitive sites (DHSs), chromatin is sensitive to cleavage by the DNase I enzyme. These accessible chromatin zones are functionally related to transcriptional activity | 122
3D structure of chromatin | Hi-C chromosome conformation capture technique | Chromosome contacts are captured by formaldehyde cross-linking | 124, 125, 127
RNA–protein and RNA–DNA interaction | RIP-chip | Specific antibodies are used to immunoprecipitate the RNA fragments at RNA-binding sites, followed by reverse transcription and microarray | 141
RNA–protein and RNA–DNA interaction | RIP-seq | Specific antibodies are used to immunoprecipitate the RNA fragments at RNA-binding sites, followed by reverse transcription and high-throughput sequencing | 142
RNA–protein and RNA–DNA interaction | CLIP-seq | UV cross-linking with immunoprecipitation to analyze protein interactions with RNA and precisely locate RNA–protein binding sites and RNA modifications. Modified versions include PAR-CLIP (photoactivatable-ribonucleoside-enhanced CLIP), which can improve the signal-to-noise ratio, and iCLIP (individual-nucleotide resolution CLIP), which can achieve a higher efficiency in reverse transcription | 143, 321, 322
RNA–protein and RNA–DNA interaction | ChIRP-seq | Biotin-labeled oligos that are complementary to the RNA of interest are hybridized to cross-linked chromatin fragments to capture biotin-oligo–RNA–DNA–protein complexes; DNA is then isolated from the complexes for high-throughput sequencing to reveal the RNA–DNA interaction | 148
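
The first phase of the strategy described in this section, converting epigenetic information into genetic information, can be illustrated for bisulfite sequencing with a minimal sketch. It assumes an idealized, 100% efficient conversion on a single strand; the function name `bisulfite_convert` and the `methylated_positions` argument are hypothetical and are not taken from any tool cited here.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence expected in reads after bisulfite treatment and PCR.

    Unmethylated C is read as T; a methylated C (0-based positions listed in
    `methylated_positions`) is protected and stays C. A, G, and T are unchanged.
    Assumes complete conversion, which real experiments only approximate.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")  # unmethylated cytosine is converted
        else:
            converted.append(base)  # methylated C and all other bases are kept
    return "".join(converted)


# The C at position 1 is methylated; the Cs at positions 4 and 5 are not.
print(bisulfite_convert("ACGTCCGA", methylated_positions=[1]))
```

After this step the methylation state is readable as an ordinary C/T difference, which is why standard sequencing can be applied in the second phase.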

Epigenomics Data Analysis

DNA methylations

In vertebrates, the most common form of DNA methylation is 5-methylcytosine (5-mC), which mainly occurs in the sequence context of CG dinucleotides. Non-CG methylation in a CHG or CHH context (where H stands for A, C, or T) exists in embryonic stem cells,116 brain,149,150 and plants.151 In mammalian genomes, CG dinucleotides are rare but tend to occur in clusters called CG islands (CGIs), which are often located in the proximal promoters of genes, particularly housekeeping genes,152–154 and are typically not methylated. In the early embryo, there is little CG methylation, but CG dinucleotides outside of CGIs typically become methylated during the blastula stage of development.14 It is mainly CG-rich regions outside of proximal promoters that become demethylated upon cellular differentiation.117,155 However, genomic analyses have identified low-CG promoters that are both methylated and transcriptionally active.156–158 Since the principles, computational methods, and challenges of DNA methylation analysis have been extensively reviewed,114,115,159 this review puts particular emphasis on the computational approaches to BS-seq data (WGBS and RRBS), including the essential steps of mapping BS-seq reads to the reference genome, determining the DNA methylation level, detecting differentially methylated regions (DMRs) between cases and controls, and storing, retrieving, and visualizing DNA methylation data.
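
The CG, CHG, and CHH sequence contexts defined above can be assigned to each cytosine with a few lines of code. The following is a minimal sketch for the forward strand only; the function name `cytosine_context` is hypothetical, and the choice to return `None` for ambiguous or truncated contexts is an assumption made for illustration.

```python
def cytosine_context(seq, pos):
    """Classify the cytosine at 0-based `pos` as 'CG', 'CHG', or 'CHH'.

    H stands for A, C, or T. Returns None if the base at `pos` is not C, or
    if the downstream bases needed for the call are missing or ambiguous.
    """
    seq = seq.upper()
    if seq[pos] != "C":
        return None
    downstream = seq[pos + 1:pos + 3]  # up to two bases after the cytosine
    if downstream[:1] == "G":
        return "CG"
    if len(downstream) < 2 or downstream[0] not in "ACT" or downstream[1] not in "ACGT":
        return None
    return "CHG" if downstream[1] == "G" else "CHH"


seq = "CGCAGCTTC"
contexts = {i: cytosine_context(seq, i) for i, base in enumerate(seq) if base == "C"}
print(contexts)
```

Cytosines on the reverse strand would be classified the same way after taking the reverse complement of the sequence.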

Mapping BS-seq reads

Bisulfite treatment of DNA followed by PCR amplification and sequencing converts the vast majority of unmethylated Cs to Ts in the sequencing reads, without affecting As, Gs, Ts, or methylated Cs. To calculate the absolute DNA methylation level for each C from BS-seq data, the sequencing reads must be aligned to the reference genome to determine the position from which each read was most likely derived. Various alignment tools have been developed to map BS-seq short reads (Table 2). Some general-purpose aligners provide BS-seq modules (such as GSNAP,160 LAST,161 Novoalign,162 RMAP,163 and segemehl164), and dedicated BS-seq aligners have also been developed. Among the dedicated tools, two alternative approaches are widely used. The three-letter aligners (such as Bismark,165 BRAT,166 BS-Seeker,167 and MethylCoder168) simplify the alignment by converting all Cs into Ts in both the BS-seq reads and both strands of the reference genome (so that only the three letters A, G, and T remain in the converted sequences) and then applying a standard aligner. In contrast, the wild-card aligners (such as BSMAP,169 Pash,170–172 and RRBSMAP173) convert only the Cs in the reference genome to the wild-card letter Y (pyrimidine: C or T), which matches both Cs and Ts in the BS-seq reads. The three-letter aligners reduce sequence complexity, which results in a higher percentage of reads being discarded because they align to multiple positions in the reference genome, whereas the wild-card aligners can achieve higher genomic coverage but with some bias toward increased DNA methylation levels.115 After alignment, the DNA methylation level can be determined by comparing the frequency of Cs and Ts that align to each C in the reference genome.
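
The three-letter strategy and the subsequent methylation call can be sketched on a toy example. This is an illustration under strong simplifying assumptions, not how Bismark or any other cited aligner is implemented: only the forward strand is considered, exact string matching stands in for a real short-read aligner, and the function names (`three_letter`, `map_read`, `methylation_levels`) are hypothetical.

```python
from collections import defaultdict


def three_letter(seq):
    """In-silico conversion used by three-letter aligners: every C becomes T."""
    return seq.upper().replace("C", "T")


def map_read(read, reference):
    """Return the unique 0-based start of `read` on the forward strand, or None.

    Read and reference are both C->T converted and compared by exact match.
    Reads with no hit or with several hits are discarded, which mirrors the
    loss of ambiguously mapped reads in the reduced three-letter alphabet.
    """
    conv_read, conv_ref = three_letter(read), three_letter(reference)
    hits = [i for i in range(len(conv_ref) - len(conv_read) + 1)
            if conv_ref[i:i + len(conv_read)] == conv_read]
    return hits[0] if len(hits) == 1 else None


def methylation_levels(reads, reference):
    """Per-cytosine methylation level, C / (C + T), over uniquely mapped reads."""
    reference = reference.upper()
    counts = defaultdict(lambda: [0, 0])  # reference position -> [C count, T count]
    for read in reads:
        start = map_read(read, reference)
        if start is None:
            continue
        for offset, base in enumerate(read.upper()):
            pos = start + offset
            if reference[pos] == "C" and base in "CT":
                counts[pos][0 if base == "C" else 1] += 1
    return {pos: c / (c + t) for pos, (c, t) in sorted(counts.items())}


reference = "GATTACGATCGGA"  # cytosines at positions 5 and 9
reads = ["TTACGATCG", "TTACGATTG", "TTATGATTG", "TTACGATTG"]
print(methylation_levels(reads, reference))
```

The original, unconverted read bases are used for counting, so a C in a read is evidence of methylation and a T at a reference C is evidence of conversion. A wild-card aligner would instead leave the reads untouched and let each reference C match either C or T.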
Table 2

Software and tools for epigenomic data analysis.

SOFTWARE/TOOLDESCRIPTIONURLREFS
1. DNA methylation
1.1. Mapping BS-seq reads
1.1.1. General aligners with a BS-Seq module
GSNAPA wild-card bisulfite aligner included in a general-purpose alignment tool (Genomic Short-read Nucleotide Alignment Program)http://share.gene.com/gmap323
LASTA wild-card bisulfite aligner included in a general-purpose alignment toolhttp://last.cbrc.jp161
RMAPA Wild-card bisulfite aligner included in a general-purpose alignment toolhttp://rulai.cshl.edu/rmap/6
segemehlA wild-card bisulfite aligner included in a general-purpose alignment toolhttp://www.bioinf.uni-leipzig.de/Software/segemehl304
1.1.2 Specific BS-Seq aligner that use a three-letter approach
BismarkA widely used three-letter bisulfite aligner based on Bowtie/Bowtie2http://www.bioinformatics.babraham.ac.uk/projects/bismark165
BRATA bisulfite-treated reads tool using the three-letter alignmenthttp://compbio.cs.ucr.edu/brat166
BS-SeekerA three-letter bisulfite aligner based on Bowtiehttps://github.com/BSSeeker/Bsseeker2324
MethylCoderA three-letter bisulfite aligner based on Bowtie/GSNAPhttps://github.com/brentp/methylcode168
1.1.3 The specific BS-Seq aligner by wild-card approch
BSMAPA widely used wild-card aligner for bisulfite sequencing readshttp://code.google.com/p/bsmap325
PashA wild-card bisulfite aligner using gapped k-mer and multi-positional hash tablehttp://brl.bcm.tmc.edu/pash170172
1.1.4 Other BS-seq aligners
BISMAMapping and clustering of bisulfite sequencing data for individual clones from unique and repetitive sequenceshttp://biochem.jacobs-university.de/BDPC/BISMA/326
BRAT-BWA fast, accurate and memory-efficient BS aligner using the FM-index (Burrows-Wheeler transform)http://compbio.cs.ucr.edu/brat/304
B-SOLANAA aligner for bisulfite-sequencing data of ABI SOLiD sequencershttp://code.google.com/p/bsolana327
RRBSMAPA wild-card aligner for RRBS readshttp://rrbsmap.computational-epigenetics.org328
1.2. Detecting differential methylated regions (DMRs)
1.2.1 Software for DMR calling only
BiSeqAn R package for detect differentially methylated regions (DMRs) for BS datahttps://www.bioconductor.org/packages/release/bioc/html/BiSeq.html175
bumphunterBump hunting to identify differentially methylated regionshttp://bioconductor.org/packages/release/bioc/html/bumphunter.html177
DMRcateAn R package for detecting differentially methylated regions (DMRs) based on tunable kernel smoothingwww.bioconductor.org/packages/release/bioc/html/DMRcate.html178
IMAAn R package for high-throughput analysis of Illumina’s 450K Infinium methylation datahttp://www.rforge.net/IMA329
M3DAn R package for detecting differentially methylated regions (DMRs) using a non-parametric, kernel-based methodhttps://www.bioconductor.org/packages/release/bioc/html/M3D.html330
methylSigAn R package for detecting differentially methylated sites (DMCs) or regions (DMRs) using a beta-binomial modelhttps://github.com/sartorlab/methylSig331
metileneA fast and sensitive tool for detecting DMR by a binary segmentation algorithm combined with a two-dimensional statistical testhttp://www.bioinf.uni-leipzig.de/Software/metilene/185
MOABSA tool for detecting differentially methylated sites (DMCs) or regions (DMRs) based on a Beta-Binomial hierarchical model with relative low CpG coverage (~10X)https://code.google.com/archive/p/moabs/332
NHMMfdrAn R package for detecting differential DNA methylation based on non-homogeneeous hidden Markov model (NHMM) by estimating false discovery rates (FDRs)http://www.ams.sunysb.edu/~pfkuan/NHMMfdr/182
QDMRA tool for detecting DMR based on Shannon entropyhttp://bioinfo.hrbmu.edu.cn/qdmr333
1.2.2 Pipeline for both BS-seq mapping and DMR calling
BsmoothBsmooth is a pipeline for analyzing whole genome bisulfite sequencing (WGBS) data. It includes tools for aligning the data, quality control, and identifying differentially methylated regions (DMRs).http://rafalab.jhsph.edu/bsmooth/304
MethPipeA computational pipeline for analyzing bisulfite sequencing data (WGBS and RRBS), including BS mapping (Wild-Card aligner) and DMR callinghttp://smithlabresearch.org/software/methpipe/334
RefFreeDMAMapping for RRBS reads and DMR calling without a reference genomehttps://github.com/jklughammer/RefFreeDMA335
2. Histone Modifications and DNA-binding Proteins
2.1 Short-read Alignment
BWAA fast and efficientlight-weighted tool that aligns short sequences to a sequence database; based on the Burrows–Wheeler transformhttp://bio-bwa.sourceforge.net233
BowtieUltrafast, memory-efficient short read aligner. Uses a Burrows-Wheeler-Transformed (BWT) indexhttp://bowtie-bio.sourceforge.net232
ELANDEfficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genomehttp://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_ReferenceFiles.htmIllumina
GenomeMapperGenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads either with ungapped or gapped alignmentshttp://1001genomes.org/software/genomemapper.html336
GNUMAPGenomic Next-generation Universal MAPper is a program designed to accurately map sequence data obtained from next-generation sequencing machines back to a genome of any size. It seeks to align reads from nonunique repeats using statisticshttp://dna.cs.byu.edu/gnumap/323
HiCUPA tool for mapping and performing quality control on Hi-C datahttp://www.bioinformatics.babraham.ac.uk/projects/hicup/337
GSNAPConsiders a set of variant allele inputs to better align to heterozygous siteshttp://research-pub.gene.com/gmap160
MAQMapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina with preliminary functions to handle ABI SOLiD datahttp://maq.sourceforge.net/230
SOAPSOAP (Short Oligonucleotide Alignment Program). A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequenceshttp://soap.genomics.org.cn/229
SOAP2SOAP2 used a Burrows Wheeler Transformation (BWT) compression index to substitute the seed strategy for indexing the reference sequence in the main memoryhttp://soap.genomics.org.cn/soapaligner.html234
ZOOMZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads, emerged by next-generation sequencing technology, back to the reference genomes, and carry out post-analysishttp://omictools.com/zoom-tool231
2.2 Peak Detection
2.2.1 Peak Caller
BroadPeakA novel algorithm for identifying broad peaks in diffuse ChIP-seq datasetshttp://jordan.biology.gatech.edu/page/software/broadpeak/237
MACSMACS fits data to a dynamic Poisson distribution; works with and without control datahttp://liulab.dfci.harvard.edu/MACS238
PeakSeqPeakSeq takes into account differences in mappability of genomic regions; enrichment based on FDR calculationhttp://info.gersteinlab.org/PeakSeq338
SICERA clustering approach for identification of enriched domains from histone modification ChIP-Seq datahttp://home.gwu.edu/~wpeng/Software.htm236
SISSRSA novel algorithm for precise identification of binding sites from short reads generated from ChIP-Seq experimentshttp://sissrs.rajajothi.com/239
ZINBAZINBA can incorporate multiple genomic factors, such as mappability and GC content; can work with point-source and broad-source peak datahttp://code.google.com/p/zinba339
2.2.2 Differential Peak Caller
baySeqAn R package that uses empirical Bayes approach to identify significant differences; assumes negative binomial distribution of datahttp://www.bioconductor.org/packages/release/bioc/html/baySeq.html340
ChIPDiffA toolkit for the genome-wide comparison of histone modification sites identified by ChIP-seq, differential histone modification sites (DHMS) identification, uses binomial distribution, Baum-Welch expectation maximization (EM) algorithm, forward-backward algorithmhttp://cmb.gis.a-star.edu.sg/ChIPSeq/paperChIP-Diff.htm341
edgeRAn R package that uses negative binomial distribution to model differences in tag counts; uses replicates to better estimate significant differenceshttp://www.bioconductor.org/packages/2.9/bioc/html/edgeR.html257
DESeqDESeq uses negative binomial distribution, but differs in the calculation of the mean and variance of the distributionhttp://www-huber.embl.de/users/anders/DESeq253
SAMSeqSAMSeq based on the popular SAM software; a non-parametric method that uses resampling to normalize for differences in sequencing depthhttp://www.stanford.edu/~junli07/research.html#SAM342
3. ncRNAs
3.1 ncRNAs detection and quantification
miRDeepmiRDeep was developed to discover active known or novel miRNAs from deep sequencing data after the removal of adapters with a number of scripts to preprocess and score the mapped datahttps://www.mdc-berlin.de/8551903/en/248
miRDeep2miRDeep2 identifies known and novel miRNAs more sensitively and robustly by evaluating the structure and signature of each precursor, quantifying known miRNAs based on the annotation in miRBase, and predicting secondary structure with the RNAfold toolhttps://www.mdc-berlin.de/8551903/en/252
miRDeep*miRDeep* is an integrated standalone miRNA identification application with a user-friendly graphic interface to conduct sequence alignment, pre-miRNA secondary structure calculation, and graphical display with low memory requirementhttp://www.australianprostatecentre.org/research/software/mirdeep-star249
DARIODARIO is a web service for studying short read data from small RNA-seq experiments. It provides a wide range of analysis features, including quality control, read normalization, ncRNA quantification and prediction of putative ncRNA candidateshttp://dario.bioinf.uni-leipzig.de/index.py343
ncPRO-seqncPRO-seq is a tool for annotation and profiling of ncRNAs from small-RNA sequencing data. It aims to interrogate and perform detailed analysis on small RNAs derived from annotated non-coding regions in miRBase, piRBase, Rfam and repeatMasker, and regions defined by users. The ncPRO pipeline also has a module to identify regions significantly enriched with short reads that cannot be classified as known ncRNA familieshttps://sourceforge.net/projects/ncproseq/344
CoRALCoRAL is a machine-learning package that can predict the precursor class of small RNAs present in a high-throughput RNA-sequencing dataset and produces information about the features that are most important for discriminating different populations of small non-coding RNAshttp://wanglab.pcbi.upenn.edu/coral/345
RNA-CODERNA-CODE is designed for ncRNA identification in NGS data that lack quality reference genomes. Given a set of short reads, it classifies the reads into different types of ncRNA families. The classification results can be used to quantify the expression levels of different types of ncRNAs in RNA-seq data and ncRNA composition profiles in metagenomic data, respectivelyhttp://www.cse.msu.edu/~chengy/RNA_CODE/346
CAP-miRSeqA comprehensive analysis pipeline for deep microRNA sequencing that integrates read preprocessing, alignment, mature/precursor/novel miRNA qualification, variant detection in miRNA coding region, and flexible differential expression between experimental conditionshttp://bioinformaticstools.mayo.edu/research/capmirseq/256
iMirA modular pipeline for comprehensive analysis of smallRNA-Seq data, comprising specific tools for adapter trimming, quality filtering, differential expression analysis, biological target prediction and other useful options by integrating multiple open source modules and resources in an automated workflowhttp://www.labmedmolge.unisa.it/inglese/research/imir250
UEA sRNA workbenchUEA sRNA workbench performs complete analysis of single or multiple-sample small RNA datasets to identify novel micro RNA sequences and profiling small RNA expression patterns in genetic datahttp://srna-workbench.cmp.uea.ac.uk/260
omiRasomiRas is a web server for annotation, comparison and visualization of interaction networks of non-coding RNAs derived from small RNA-Sequencinghttp://tools.genxpro.net/omiras/259
sRNAtoolboxsRNAtoolbox provides several tools including sRNAbench for sRNA expression profiling and prediction of novel microRNAs, sRNAde for differential expression analysis, miRNA-consTarget for prediction of miRNAs, sRNAjBrowserDE for visualization of differential expression as a function of read length and sRNAfuncTerms for determination of over-represented functional annotations in the target gene sethttp://bioinfo5.ugr.es/srnatoolbox347
iSeeRNAiSeeRNA is a support vector machine (SVM)-based classifier for the identification of lincRNAshttp://137.189.133.71/software.html261
SebnifSebnif is an Integrated Bioinformatics Pipeline for the Identification of Novel Large Intergenic Noncoding RNAs (lincRNAs) based on iSeeRNAhttp://137.189.133.71/sebnif/262
LncRNA2FunctionLncRNA2Function – a comprehensive resource for functional investigation of human lncRNAs based on RNA-seq datahttp://mlg.hit.edu.cn/lncrna2function/264
3.2 RIP-seq and CLIP-seq
3.2.1 Differential Peak Caller and Binding site detector from CLIP-seq
NovoalignAn accurate NGS short reads aligner for aligning to reference genomehttp://www.novocraft.com/products/novoalign/267
PIPE-CLIPA Galaxy framework-based comprehensive online pipeline for reliable analysis of data generated by three types of CLIP-seq protocolhttp://pipeclip.qbrc.org/270
PARalyzerIt utilizes the nucleotide substitution in a kernel density estimate classifier to generate a high-resolution set of protein-RNA interaction siteshttps://ohlerlab.mdc-berlin.de/software/PARalyzer_85/271
PiranhaPiranha is a peak finding and differential binding detection algorithmhttp://smithlabresearch.org/software/piranha/266
wavClusteRAn integrated pipeline for the analysis of PAR-CLIP datahttps://bioconductor.org/packages/release/bioc/html/wavClusteR.html272
dCLIPdCLIP is designed for quantitative CLIP-seq comparative analysis and is able to effectively identify differential binding regions of RBPs in four CLIP-seq datasetshttp://qbrc.swmed.edu/software/273
3.2.2 Motif Discovery
GraphProtGraphProt is a machine learning computational framework for learning sequence- and structure-binding preferences of RNA-RBPs from high-throughput experimental datahttp://www.bioinf.uni-freiburg.de/Software/GraphProt/280
MEMEPerform motif discovery on DNA, RNA or protein datasetshttp://meme-suite.org/348
cERMITcERMIT is a computationally efficient motif discovery tool based on analyzing genome-wide quantitative regulatory evidencehttps://ohlerlab.mdc-berlin.de/software/cERMIT_82/276
GLAM2 (Gapped Local Alignment of Motifs)GLAM2 is a motif detection tool for discovering motifs allowing indels in a fully general manner from DNA, RNA and protein datasetshttp://bioinformatics.org.au/glam2277
MatrixREDUCEA motif discovery tool for genome-wide ChIP-seq and CLIP-seq data analysishttp://www.bussemakerlab.org/278
RNA Bind-n-SeqA quantitative assessment of the sequence and structural binding specificity349
CapRAn efficient algorithm that calculates the probability that each RNA base position is located within each secondary structural contexthttps://sites.google.com/site/fukunagatsu/software/capr281
RNAcontextAn efficient motif finding method ideally suited for using large-scale RNA-binding affinity datasets to determine the relative binding preferences of RBPs for a wide range of RNA sequences and structureshttp://www.cs.toronto.edu/~hilal/rnacontext/279
ViennaRNA Package 2.0A widely used compilation of RNA secondary structurehttp://www.tbi.univie.ac.at/RNA/279
4. Storing, retrieving and visualizing epigenomics data
4.1 Genome browser for visualizing DNA methylation
EnsemblA widely used Web-based genome browser with various epigenome data setshttp://www.ensembl.org283
IGVA widely used graphical genome browser that is run locally on the user’s computerhttp://www.broadinstitute.org/igv286
UCSC Genome BrowserWidely used Web-based genome browser hosting all ENCODE datahttp://genome.ucsc.edu282
BDPCWeb-based tool for bisulfite sequencing data presentation and compilationhttp://biochem.jacobs-university.de/BDPC350
DaVIEThe database with an intuitive user interface to perform visual comparisons across large DNA methylation data setshttps://github.com/apfejes/epigenetics-software285
EpiExplorerA web server provides an interactive gateway for exploring large-scale epigenetic datasets of the human and mouse genomehttp://epiexplorer.mpi-inf.mpg.de351
EpiGRAPHA user-friendly software for advanced (epi-) genome analysis and prediction by powerful machine learning algorithmshttp://epigraph.mpi-inf.mpg.de352
WashU Epigenome BrowserWeb-based genome browser focusing on the human epigenomehttp://epigenomegateway.wustl.edu353
4.2 Specialized-DNA methylation databases
MethBaseA central reference methylome database created from public BS-seq datasetshttp://smithlabresearch.org/software/methbase/334
MethDBA database for DNA methylation and environmental epigenetic effectshttp://www.methdb.de288
MethyCancerDatabase of cancer DNA methylation datahttp://methycancer.psych.ac.cn354
PubMethDatabase of DNA methylation literaturehttp://www.pubmeth.org290
4.3 Specialized histone modification databases
ChromatinDBA database of genome-wide histone modification patterns for Saccharomyces cerevisiaehttp://integbio.jp/dbcatalog/en/record/nbdc00939?jtpl=56294
CR CistromeA ChIP-Seq database for chromatin regulators and histone modification linkages in human and mousehttp://cistrome.org/cr/293
HistomeA relational knowledgebase of human histone proteins and histone modifying enzymeshttp://www.actrec.gov.in/histome/292
HHMDThe human histone modification databasehttp://202.97.205.78/hhmd/291
4.4 Specialized ncRNA and RBPs interaction database
starBase V2.0starBase is designed for decoding ncRNA and the RNA-protein interaction networks and predicting functions especially in cancer sampleshttp://starbase.sysu.edu.cn/296,297
CLIPZCLIPZ supports the automatic functional annotation and visualization of CLIP-seq identified binding siteshttp://www.clipz.unibas.ch/298
doRiNAA database of RNA interactions in post-transcriptional regulationhttp://dorina.mdc-berlin.de/300
CLIPdbAn integrated resource for characterizing the regulatory networks between RBPs and various RNA transcript classeshttp://lulab.life.tsinghua.edu.cn/clipdb/301

Note:

The descriptions are adapted from the software/tools website descriptions.

Detecting DMRs

After BS-seq read mapping, the next step is typically the detection of DMRs that show significantly different DNA methylation levels between sample groups, such as disease versus normal, or cases versus controls. Depending on the biological question of interest and the computational approach used for identification, these DMRs can range in size from a single C site (differentially methylated C site [DMC]) to an entire gene locus spanning megabase pairs. The most common methods for detecting DMRs test each C individually to identify DMCs by statistical analysis and then merge the significant DMCs into DMRs using various approaches.115 The basic statistical tests for comparing the DNA methylation level of each C with sufficient pooled data between sample groups are the t-test, the Wilcoxon rank-sum test, and linear regression.115,174 More advanced models have been employed to improve DMR detection, including beta regression and hierarchical testing (BiSeq175), a weighted generalized linear model (BSmooth176), bump hunting with batch effect removal and peak detection (bumphunting177), tunable kernel smoothing (DMRcate178), a nonparametric kernel-based method (M3D179), a beta-binomial model (methylSig180), a beta-binomial hierarchical model (MOABS181), a hidden Markov model (NHMMfdr182), a three-state HMM (MethPipe183), Shannon entropy (QDMR184), and a binary segmentation algorithm combined with a two-dimensional statistical test (metilene185). Usually, each new tool is compared with some previous methods and claims the best performance; for example, MOABS is reported to detect DMRs at relatively low coverage (~10×),181 and metilene is reported to identify DMRs with unrivaled specificity and sensitivity.185 However, without a systematic benchmarking study, it is difficult to determine which method will work best for a given DNA methylation dataset. 
To address this issue, a comprehensive comparison of these DMR callers is needed. BS-seq technology also still has limitations: it cannot distinguish between 5-mC and 5-hmC (5-hydroxymethylcytosine), and the standard protocol does not provide single-cell DNA methylation profiles. To overcome these limitations, several technologies have been introduced, including oxidative bisulfite sequencing (oxBS-seq)186 and Tet-assisted bisulfite sequencing (TAB-seq)187 to distinguish 5-hmC from 5-mC, single-cell bisulfite sequencing using RRBS188 or PBAT (post-bisulfite adaptor tagging),189 and technologies enabling direct detection of modified bases (5-mC or 5-hmC) within individual DNA molecules.190–192 With these new technologies, more elaborate and powerful bioinformatics software, as well as web-based tools and resources, will be developed, which is another great opportunity for computational epigenomics.
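
The test-then-merge strategy described above can be illustrated with a minimal sketch. It is not the algorithm of any tool listed here: it assumes pooled methylated/unmethylated read counts per C site for two groups, uses Fisher's exact test as a simple stand-in for the per-site test, and the function names and the thresholds `alpha`, `max_gap`, and `min_sites` are hypothetical choices for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def call_dmrs(sites, alpha=0.01, max_gap=300, min_sites=2):
    """Merge significant C sites (DMCs) into regions (DMRs).

    sites: list of (pos, meth1, unmeth1, meth2, unmeth2), sorted by position
    on one chromosome. Thresholds are illustrative assumptions, and no
    multiple-testing correction is applied in this sketch.
    """
    dmcs = [pos for pos, m1, u1, m2, u2 in sites
            if fisher_exact_two_sided(m1, u1, m2, u2) < alpha]
    dmrs, current = [], []
    for pos in dmcs:
        # Close the current region when the next DMC is too far away.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                dmrs.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        dmrs.append((current[0], current[-1], len(current)))
    return dmrs
```

Each returned tuple is (start, end, number of DMCs). An isolated significant site is reported as a DMC by the first step but is dropped by the merge step when `min_sites` is 2, which mirrors the distinction between DMCs and DMRs made above.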

Major Bioinformatics Challenges in Interpreting DNA Methylation Differences

Several major bioinformatics challenges remain in the downstream interpretation of DNA methylation differences after DMR detection and DNA methylation data visualization. First, as mentioned in the section on detecting DMRs, it is difficult to compare the different methods without knowing the true methylation status of a given biological sample. Second, interpretation becomes more complicated when the variation among biological samples is considered.115 There are four major levels of variation: (1) allele-specific DNA methylation is widespread even within the same cell, and some bioinformatic methods have been introduced to identify DNA methylation differences between alleles193,194; (2) age-related and interindividual differences in DNA methylation are common and may be influenced by genetic differences195–198; (3) cell-specific methylation is observed among different cell types in the same tissue or organ199,200; and (4) the most complicated case is a cancer sample, which is a mixture of tumor and normal cells with increased methylation variation.201 Several bioinformatic tools have been developed to estimate tumor purity.202,203 Third, the most challenging computational analysis is correlating DNA methylation differences with diseases. The challenges include the following: (1) the correlation between promoter DNA methylation and gene expression is modest200,201; (2) methylation changes can occur not only in promoter regions but also in other genic and intergenic regions204; and (3) correlation does not necessarily imply causation. Nevertheless, the epigenome-wide association study (EWAS) has been introduced to identify loci whose DNA methylation variation is associated with common diseases.205 Abnormal DNA methylation status (either in a CpG-rich region or at a single CpG site) has been heavily studied as a potential biomarker for different cancer types,206 such as colon cancer,207–209 prostate cancer,210–212 and lung cancer.213,214

Histone Modifications and DNA-Binding Proteins

Modifications on the unstructured histone tails control the accessibility of chromatin to the transcription machinery, distinguishing actively transcribed euchromatin from transcriptionally inactive heterochromatin. Euchromatin is characterized by high levels of acetylation and trimethylated H3K4, H3K36, and H3K79, while heterochromatin is characterized by low levels of acetylation and high levels of methylation on H3K9, H3K27, and H4K20.215 DNA-binding proteins bind preferentially to certain DNA sequences (termed motifs) and work together with histone marks to carry out cellular functions. Evidence is accumulating that gene expression can be predicted from the binding of key factors, the histone modification levels, and the cross talk among the different modifications that occur on the histones simultaneously.216,217 For example, the active mark H3K4me3 and the repressive mark H3K27me3 together occupy “bivalent domains”, which are pivotal for determining the pluripotent and differentiated states of embryonic stem cells (ESCs).218 ChIP followed by microarray or sequencing has become the widely used technique for identifying histone modification and protein-binding locations and patterns genome-wide.219,220 Moreover, there are various adaptations of the standard ChIP protocol that overcome its limitations for specific applications. For instance, to use fewer cells than the conventional 10 million per ChIP reaction, Nano-ChIP-seq for H3K4me3 in 10,000 cells221 and single-tube linear DNA amplification (LinDA) for ERα in 5,000 cells222 have been successfully applied. 
ChIP-exo, which exploits lambda phage exonuclease digestion, can remarkably enhance binding-site precision to a single base pair and significantly increase the signal-to-noise ratio.223 Sequential ChIP assays (ChIP-reChIP)224 and ChIP followed by bisulfite sequencing (BSChIP-seq)225,226 have been developed to identify multiple binding events and to determine whether these events are simultaneously present or occur on different chromosomes in the same cell or in different cells. Numerous tools have been developed for ChIP-seq data analysis; here, we review the computational processing pipelines, emphasizing the essential steps of aligning the reads to the reference genome and detecting peaks.

Short-read alignment

During the ChIP procedure, the genomic DNA is sonicated or digested by MNase into DNA fragments of a few hundred base pairs, and during the sequencing procedure, 25–50 bp are sequenced at the two ends. Thus, short-read aligners must be fast and precise in locating the original positions of the reads. There are two main strategies to achieve this goal: algorithms based on hash tables and algorithms based on suffix/prefix tries.227 The classical BLAST,228 ELAND (Illumina), SOAP,229 MAQ,230 RMAP,163 and ZOOM231 are hash table-based algorithms with different modifications of the spaced seed and the sensitivity tolerance according to the reference genome. The algorithms based on suffix/prefix tries convert the inexact matching problem into an exact matching problem, which accelerates computation remarkably. Of the published aligners using this strategy, Bowtie,232 BWA,233 and SOAP2234 are gaining increasing popularity. The choice of alignment method and of parameters such as the mismatch allowance can affect the percentage of successfully aligned reads and thus the subsequent peak calling. More tools are summarized in Table 2.
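
To make the suffix/prefix-trie strategy concrete, the following is a minimal sketch of exact matching by backward search over a Burrows-Wheeler-transformed index. It is a toy illustration, not the implementation used by Bowtie, BWA, or SOAP2: the suffix array is built naively, the occurrence table is stored uncompressed, and mismatches are not handled. The function names `build_fm_index` and `find_exact` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy BWT-based index (suffix array, first-column offsets, occurrence table)."""
    text += "$"  # terminator that sorts before A, C, G, T
    # Naive suffix array construction; fine for a toy genome only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to "$").
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ

def find_exact(read, index):
    """Return all 0-based positions where read occurs exactly in the indexed text."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffixes
    for ch in reversed(read):  # extend the match one character to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = build_fm_index("ACGTACGTTACG")
print(find_exact("ACG", index))   # [0, 4, 9]
print(find_exact("TACG", index))  # [3, 8]
```

The search cost depends on the read length rather than the genome length, which is why this family of aligners scales well to millions of short reads.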

Peak detection

The uniquely aligned ChIP-seq reads are usually identified as sets of enriched signals, termed peaks, on certain genomic regions. Data from DNA input control experiments are used as the background level of signal to compute the enrichment that would be expected by chance, thus pinpointing the positions of histone modifications or protein binding sites. Peak detection requires a series of distinct steps before the final peak list is generated: read shifting, background subtraction, peak identification, significance testing, and artifact removal.235 Based on the signal characteristics, a variety of peak calling tools have been developed; the parameters at each step can usually be adjusted, and these choices can dramatically affect the final peak list. Histone modifications, histone variants, and histone-modifying enzymes usually give rise to diffuse signals and form peaks ranging from several nucleosomes to large domains encompassing multiple genes; SICER236 and BroadPeak237 perform well under these circumstances. For the exact binding locations of transcription factors and chromatin remodeling factors, MACS238 and SISSRS239 perform well. Comparative analyses of different peak callers are available and may provide a critical assessment when handling ChIP-seq data.240,241 Comprehensive peak detection tools are listed in Table 2.
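
The enrichment-against-background idea can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of MACS or any other tool named above: it assumes reads have already been counted in fixed-size bins, scales the input control to the ChIP sequencing depth, tests each bin with a Poisson upper-tail probability, and merges adjacent significant bins. The function names, `bin_size`, and `p_cutoff` are hypothetical choices.

```python
from math import exp

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term, cdf = exp(-lam), 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    # Very small tails lose precision here, which is acceptable for a sketch.
    return max(0.0, 1.0 - cdf)

def call_peaks(chip, control, bin_size=200, p_cutoff=1e-5):
    """Call enriched regions from per-bin read counts on one chromosome.

    chip, control: equal-length lists of read counts per bin.
    Returns a list of (start, end) coordinates, end exclusive.
    """
    scale = sum(chip) / max(1, sum(control))  # depth normalization of the control
    genome_lam = sum(chip) / len(chip)        # chromosome-wide background rate
    peaks, start = [], None
    for i, (c, k) in enumerate(zip(chip, control)):
        # Use the larger of the global and the local (control-based) expectation.
        lam = max(genome_lam, k * scale)
        significant = poisson_sf(c, lam) < p_cutoff
        if significant and start is None:
            start = i
        elif not significant and start is not None:
            peaks.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:
        peaks.append((start * bin_size, len(chip) * bin_size))
    return peaks
```

A bin with many ChIP reads is not reported when the control shows a comparable pile-up at the same location, which is the role the input control plays in the description above. Read shifting, multiple-testing correction, and artifact removal are omitted from this sketch.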

ncRNAs

ncRNA discovery and quantification

Transcriptome studies have confirmed that genome sequences are extensively transcribed, and the vast amount of transcribed genetic information indicates that there are still hidden categories of ncRNAs; the functions and biological significance of many ncRNAs remain unclear. Rapidly accumulating evidence suggests that ncRNAs act as regulatory molecules in an epigenetic manner and are associated with almost all biological processes,242–245 and the complexity of the regulatory mechanisms is in line with the complexity of organisms.246 With the wide use of NGS, it is becoming easier to discover new classes of ncRNA and investigate their functions from deep sequencing data. Earlier ncRNA studies relied on prediction by machine learning methods followed by experimental validation, using ncRNA features such as evolutionary sequence conservation, RNA secondary structure, and distinct expression patterns across developmental stages, tissues, and conditions.247 It has been shown that the integrated analysis of RNA sequence, structure, and expression features enables the differentiation of ncRNAs from protein-coding RNAs and regulatory elements, and potentially of different ncRNA categories,248,249 paving the way for detecting novel ncRNAs in unannotated genomic regions by systematic searching. Recently developed tools for high-throughput ncRNA sequencing data analysis are emerging as systematic analysis pipelines, which usually comprise three main aspects: ncRNA identification and quantification, interactions with RBPs and target mRNAs, and functional characterization. The general workflow is first to filter the adapters and align the deep sequencing reads or conduct de novo assembly, next to quantify known ncRNAs and identify novel ncRNAs by consulting annotation databases, and then to analyze functional interactions based on structural features and database annotations. 
iMir250 is such an integrated pipeline with a graphical user interface (GUI) that allows the identification of ncRNAs such as miRNAs and piRNAs by miRAnalyzer251 or miRDeep2,252 differential expression analysis by DESeq,253 and target prediction using TargetScan254 and miRanda.255 Besides alignment and quantification of known ncRNAs, CAP-miRSeq256 can detect and quantify precursor, mature, and novel miRNAs, analyze differential expression with edgeR,257 detect single-nucleotide variants (SNVs) with the Genome Analysis Tool Kit (GATK),258 which is a unique feature among pipelines of this kind, and visualize the results in the IGV genome browser. omiRas259 and UEA sRNA workbench260 can take raw small ncRNA-seq data and, after differential expression and comprehensive analysis, visualize the ncRNA interaction network through a web service that draws on several miRNA–mRNA databases. LncRNA discovery and analysis has also been promoted by deep sequencing technology, but the sensitivity and specificity of detection remain challenging because lncRNAs are expressed at low levels compared with protein-coding RNAs and their annotation is limited, which makes it difficult to uncover their biological functions. iSeeRNA261 is a support vector machine (SVM)-based classifier that uses lncRNA features such as conservation, ORFs, and sequence characteristics to separate lncRNAs precisely from coding genes. Self-estimation-based novel lincRNA filtering (Sebnif)262 detects lincRNAs by filtering candidates on features such as known versus unknown annotation, single-exon versus multi-exon structure, and a size between 200 bp and 10 kb, building on iSeeRNA, and annotates the detected lincRNAs with a weighted gene coexpression network.263 Based on the idea that genes with similar expression patterns across different conditions may share similar functions and biological pathways, LncRNA2Function264 provides an approach to annotate lncRNAs by calculating the Pearson correlation coefficient (PCC) of lncRNA–mRNA pairs for the 10,000 lncRNAs in the GENCODE project.265
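
The coexpression idea behind PCC-based lncRNA annotation can be sketched as below. This is only an illustration of the general principle, not the LncRNA2Function implementation: the gene names, the expression profiles in the usage example, and the cutoff `min_abs_r` are hypothetical, and a real analysis would also assess significance across many samples.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no coexpression signal
    return cov / sqrt(var_x * var_y)

def coexpressed_mrnas(lnc_profile, mrna_profiles, min_abs_r=0.9):
    """Return (gene, r) pairs whose |r| with the lncRNA passes the cutoff.

    mrna_profiles: dict mapping gene name -> expression values measured
    across the same samples, in the same order, as lnc_profile.
    """
    hits = ((gene, pearson(lnc_profile, profile))
            for gene, profile in mrna_profiles.items())
    return sorted(((g, r) for g, r in hits if abs(r) >= min_abs_r),
                  key=lambda item: -abs(item[1]))

# Hypothetical expression values across five conditions.
lnc = [1, 2, 3, 4, 5]
mrnas = {"geneA": [2, 4, 6, 8, 10], "geneB": [5, 4, 3, 2, 1], "geneC": [3, 1, 4, 1, 5]}
print(coexpressed_mrnas(lnc, mrnas))
```

The functions annotated for the coexpressed protein-coding genes would then be transferred to the lncRNA as candidate functions, following the reasoning stated above.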

ncRNA and protein interactions detection

Besides the regulation of gene expression by chromatin modifications, post-transcriptional mechanisms play a crucial role in tuning RNA and protein levels. A principal mechanism under intensive study is RBP binding and action, for which CLIP-seq protocols enable the transcriptome-wide examination of the interaction regions of particular RBPs. Hence, computational data analysis is key to the further understanding of transcriptome-level regulation mechanisms. CLIP-seq generates short reads selected from RBP binding regions, so read alignment or mapping to the genome and transcriptome is usually the first step of the data analysis pipelines. Many software tools developed for mapping genomic sequencing reads can be applied directly, such as Bowtie, RMAP,266 and Novoalign.267 Mapping tools that consider splicing and can detect exon–exon junctions are also commonly used, including TopHat268 and STAR.269 It is worth noting that at least one nucleotide mismatch should be allowed in the alignment, especially for PAR-CLIP sequencing data, since the cross-linking step can induce T-to-C transitions. After read mapping and cluster detection, the following steps are peak calling and binding site detection, which depend greatly on transcript abundance and cluster length. The most commonly used strategy for this step is to find the precise cluster distribution profiles by enhancing the signal-to-noise ratio and decreasing the false-positive rate. Data analysis methods developed for this purpose include PIPE-CLIP,270 PARalyzer,271 Piranha,266 wavClusteR,272 and dCLIP.273 The next steps downstream in the pipeline are motif discovery, higher level structure prediction, and functional characterization. 
Previously developed tools for DNA and protein motif discovery can be applied to RNA datasets and perform well; these include HOMER,274 MEME,275 cERMIT,276 GLAM2,277 MatrixREDUCE,278 and RNAcontext.279 Although there are tools for ncRNA secondary structure prediction and functional annotation, such as GraphProt,280 CapR,281 and LncRNA2Function,264 there are still significant challenges in this field, including increasing the sensitivity and specificity, decreasing the false discovery rate, and expanding the algorithms for global prediction. Once a complete understanding of the ncRNA and RBP regulatory mechanism is achieved, integrative approaches for network-level inference can be explored.
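
The T-to-C transition noted above for PAR-CLIP is also the signal that binding-site detectors exploit. The sketch below simply tallies T-to-C mismatches per reference position from ungapped forward-strand alignments; it is a hedged illustration only and does not reproduce PARalyzer, wavClusteR, or any other tool listed here. The function name, the input layout, and the `min_conversions` threshold are assumptions.

```python
from collections import Counter

def t_to_c_sites(reference, alignments, min_conversions=2):
    """Tally T-to-C conversions at reference T positions.

    reference: reference sequence (forward strand).
    alignments: iterable of (start, read) pairs; start is 0-based and the
    read is aligned without gaps, so mismatches are allowed but indels are not.
    Returns {position: (conversions, coverage)} for positions that reach
    min_conversions.
    """
    conversions, coverage = Counter(), Counter()
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != "T":
                continue
            coverage[pos] += 1
            if base == "C":  # read shows C where the reference has T
                conversions[pos] += 1
    return {pos: (conversions[pos], coverage[pos])
            for pos in sorted(conversions)
            if conversions[pos] >= min_conversions}

reference = "GATTACAGT"
alignments = [(0, "GATCACA"), (1, "ATCACAG"), (2, "TCACAGT"), (2, "TTACAGC")]
print(t_to_c_sites(reference, alignments))  # {3: (3, 4)}
```

An aligner run with zero mismatches allowed would discard exactly the reads that carry this conversion signal, which is why a mismatch allowance matters for PAR-CLIP data.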

Storing, Retrieving, and Visualizing Epigenomics Data

Once the most fundamental analysis of epigenomics data, including read mapping and either DMR or peak calling, has been completed, the next main step is to store, retrieve, and visualize the epigenomic data across the sample groups. A common interest is to inspect or compare the DNA methylation and histone modification levels in a selected genomic region, such as a gene locus or regulatory region, in a genome browser, either a Web-based genome browser (such as UCSC Genome Browser,282 Ensembl,283 WashU Human Epigenome Browser,284 DaVIE285) or a desktop-based local genome browser (such as IGV286). To do so, files in specialized formats such as bigBed or bigWig, converted from BED or WIG files, must be uploaded or imported into the genome browser. Among them, UCSC Genome Browser is widely used because it allows users to upload custom tracks as well as to display tracks publicly. A general user can store large volumes of epigenetic data in Gene Expression Omnibus287 (GEO) from the National Center for Biotechnology Information (NCBI) or in DaVIE.285 Additionally, several large-scale initiatives host their data in public hubs, as described in the next section. Researchers can retrieve datasets either from these public hubs or from specialized databases, such as MethBase,183 MethDB,288 MethyCancer,289 and PubMeth290 for DNA methylation data and HHMD,291 Histome,292 CR Cistrome,293 and ChromatinDB294 for histone modification data. The accumulating generation and analysis of CLIP-seq data improves the annotation of ncRNA categories, secondary structures, and RBP binding regions. Besides GEO and ArrayExpress from the European Bioinformatics Institute (EBI),295 there are currently four databases focusing on ncRNA and RBP binding information: starBase V2.0,296,297 CLIPZ,298 doRiNA,299,300 and CLIPdb.301 These epigenomic data storing and visualizing websites, tools, and databases are summarized in Table 2.
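
As a small illustration of preparing data for a genome browser, the sketch below writes per-site methylation levels as a bedGraph custom track, a plain-text format that browsers such as UCSC accept and that can be converted to bigWig. The function name, the track name, and the input layout (1-based positions with a methylation fraction) are assumptions made for this example.

```python
def to_bedgraph(track_name, records):
    """Format per-site methylation levels as bedGraph text.

    records: iterable of (chrom, pos_1based, methylation_fraction).
    bedGraph intervals are 0-based and half-open, so a single site at
    1-based position p becomes the interval [p - 1, p).
    """
    lines = [f'track type=bedGraph name="{track_name}"']
    # Sort by chromosome name, then position, as conversion tools expect.
    for chrom, pos, frac in sorted(records):
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{frac:.3f}")
    return "\n".join(lines) + "\n"

text = to_bedgraph("demo", [("chr1", 101, 0.8), ("chr1", 11, 0.25)])
print(text, end="")
```

The resulting text can be uploaded as a custom track; for large datasets, converting the sorted file to the indexed binary bigWig format mentioned above lets the browser fetch only the region being viewed.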

Public Resources for Reference Epigenome

Data analysis and algorithm development have been accelerated by several collaborative projects, in addition to data generation, which has been leveraged by experimental pipelines built around NGS technologies. Multiple comprehensive epigenomic projects have been launched with the goal of providing a public resource and delivering a collection of normal epigenomes as a reference framework to catalyze basic biology and disease-oriented research. The International Human Epigenome Consortium (IHEC)302 (http://ihec-epigenomes.org/) was launched with “a goal to understand to what extent the epigenome has shaped the human population genetically and in response to the environment by coordinating the reference maps of human epigenomes for key cellular states in health and diseases status”. Its work is distributed among multiple contributing projects, including the NIH Roadmap, the Encyclopedia of DNA Elements (ENCODE), and the BLUEPRINT projects. The NIH Roadmap Epigenomics Mapping Consortium303 (http://www.roadmapepigenomics.org/) aims to deliver a collection of normal epigenomes to provide a framework or reference for comparison and integration within a broad array of future studies. The Consortium has mapped DNA methylation, histone modifications, chromatin accessibility, and small RNA transcripts in representative tissues and cell lines that are the normal counterparts of tissues and organ systems frequently involved in human disease. The ENCODE304 (https://www.encodeproject.org/) and modENCODE (http://www.modencode.org/) projects are dedicated to cataloging all functional elements for gene expression regulation in the genomes of human and model organisms by integrating epigenomic, transcriptomic, genomic, and proteomic data. They provide extensive epigenome data for cultured cell lines, complementing the IHEC focus on primary cell types. The GENCODE project (https://www.gencodegenes.org)305 is a sub-project of the ENCODE scale-up for the integrated annotation of gene features in human and mouse. The endeavor focuses on the accurate annotation of all evidence-based gene features, including protein-coding loci with alternatively spliced variants, noncoding loci, and pseudogenes. The European BLUEPRINT project (http://www.blueprint-epigenome.eu/) studies a variety of blood cell types and their associated diseases, and the German DEEP project (http://www.deutsches-epigenom-programm.de/) analyzes cell types that are related to metabolic and inflammatory diseases with high socioeconomic impact. The Human Epigenome Project (HEP) (http://www.epigenome.org/) focuses on identifying, cataloging, and interpreting genome-wide DNA methylation patterns of all human genes, deciphering methylation variable positions (MVPs) with the promise of significantly advancing the understanding and diagnosis of human disease. All IHEC data are distributed via the global bioinformatics hubs: the GEO database at the US NCBI and the European Genome–Phenome Archive (EGA) database at the EBI. The FANTOM project (http://fantom.org/)131 produced the first large-scale catalog of ncRNAs, for which over 67,000 cDNAs have been sequenced, 3,652 of them with confident experimental evidence.306 Meanwhile, cancer genomics projects, including The Cancer Genome Atlas (TCGA)307 (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium (ICGC)308 (https://icgc.org/), aim to obtain a comprehensive and multidimensional description of genomic, transcriptomic, and epigenomic changes in different tumor types and to help understand which errors cause cells to grow uncontrollably and how cancer can be prevented, diagnosed early, and treated better. Aggregated data are freely accessible from the TCGA Data Portal and the ICGC Data Portal, but an application is required to access the raw sequencing data and genotype information of individual patients. A more comprehensive list of projects, including institutional and regional initiatives that provide epigenomic data resources, is given in Table 3.
Table 3. Large-scale epigenome projects.

| PROJECTS AND WEBSITES | SUMMARY |
| --- | --- |
| IHEC (International Human Epigenome Consortium) (http://ihec-epigenomes.org) | IHEC launched with a goal to understand to what extent the epigenome has shaped the human population genetically and in response to the environment by coordinating the reference maps of human epigenomes for key cellular states in health and diseases status. It has been distributed to multiple contributing projects including the NIH Roadmap, the ENCODE and the BLUEPRINT projects. |
| NIH Roadmap Epigenomics Mapping Consortium (http://www.roadmapepigenomics.org/) | The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The Consortium expects to deliver a collection of normal epigenomes that will provide a framework or reference for comparison and integration within a broad array of future studies. |
| ENCODE (Encyclopedia of DNA Elements) (https://www.encodeproject.org/) | The ENCODE Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. Although epigenome mapping is not its main goal, the project includes large-scale mapping of DNA methylation, histone modifications and other epigenetic information. |
| BLUEPRINT (http://www.blueprint-epigenome.eu/) | BLUEPRINT is a large-scale research project receiving close to 30 million euro funding from the EU. 39 leading European universities, research institutes and industry entrepreneurs participate in what is one of the two first so-called high impact research initiatives to receive funding from the EU. |
| HEP (Human Epigenome Project) (http://www.epigenome.org/) | The partially EU-funded HEP analyzed DNA methylation in 43 unrelated individuals at single base-pair resolution. Although the analysis was confined to selected regions on three chromosomes, it is the largest high-resolution, multi-individual epigenome dataset published to date. |
| German DEEP project (http://www.deutsches-epigenom-programm.de/) | DEEP focuses on the analysis of cells connected to complex diseases with high socio-economic impact: metabolic diseases such as steatosis and adipositas as well as inflammatory diseases of the joints and the intestine. DEEP's goal is to generate high-end data for comprehensive biomedical interpretation of healthy and diseased cells. With this, DEEP will contribute to discovering new functional epigenetic links useful for clinical diagnosis, therapy and health risk prevention. All data generated will be made publicly available and will be integrated into a sustainable world-wide data structure comprised by the IHEC initiative. |
| HEROIC (High-throughput Epigenetic Regulatory Organisation In Chromatin) (EU) (http://projects.ensembl.org/heroic/) | The HEROIC project is a multi-center EU project that applies ChIP-on-chip, chromosome interaction analysis and whole-genome nuclear localization assays to understanding human genome regulation. |
| AHEAD (Alliance for Human Epigenomics and Disease) Task Force (international) (http://graphy21.blogspot.com) | The goal of the AHEAD is to initiate and coordinate a comprehensive human epigenome-mapping project. Initially, focus is set on developing a suitable bioinformatic infrastructure and on performing epigenome mapping in a selection of normal tissues, which may provide the reference for subsequent mapping in abnormal cells. |
| ICGC (International Cancer Genome Consortium) (https://icgc.org/) | The goal of the ICGC is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe. |
| TCGA (The Cancer Genome Atlas) (http://cancergenome.nih.gov) | The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. |
| FANTOM project (http://fantom.gsc.riken.jp/) | FANTOM is an international research consortium established to assign functional annotations to the full-length cDNAs that were collected during the Mouse Encyclopedia Project at RIKEN. FANTOM developed and expanded over time to encompass the fields of transcriptome analysis. The FANTOM database and the FANTOM full-length cDNA clone bank are resources available worldwide that have already fueled iPS development. |
| GENCODE project (https://www.gencodegenes.org/) | GENCODE, a sub-project of the ENCODE scale-up project, aims at the integrated annotation of gene features. The currently running phase continues to improve the coverage and accuracy of the human and mouse gene sets by enhancing and extending the annotation of all evidence-based gene features at high accuracy, including protein-coding loci with alternatively spliced variants, non-coding loci and pseudogenes. |

Note: *The descriptions are adapted from indicated website sources.

Integration Analysis

The experimental and bioinformatics methods for epigenetic data analysis have undergone a revolution in the past decade along with the advances of NGS technology, especially in throughput and in the number of dimensions that can be detected. Over the coming years, as more epigenomics data become available through public consortia, researchers will be able to investigate biological processes and diseases comprehensively by mapping DNA methylation, histone modifications, transcription factor binding, nucleosome positioning, and chromosomal organization in combination with transcriptomic and proteomic data.309–311 The simplest form of integration analysis is an intersection analysis among features extracted by different approaches, such as histone modification data from ChIP-seq, DNA methylation data from BS-seq, and gene expression data from RNA-seq, exome, or whole-genome sequencing,128 which may facilitate the understanding of developmental events or the study of disease. In addition, combining datasets from multiple projects and studying the features of more subjects increasingly require integrative analysis. Biological systems are being investigated at an unprecedented scale and depth with the rise of novel omics technologies and large-scale consortia projects. However, the heterogeneity and large volume of these datasets remain obstacles to integrative analysis and encourage researchers to develop novel data integration methodologies.
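The intersection analysis described above can be sketched in a few lines of Python. This is an illustrative example rather than a method taken from the cited studies: it intersects two sets of genomic intervals, for instance ChIP-seq peaks and differentially methylated regions, and reports the shared segments. The function names and the toy coordinates are assumptions made for illustration, intervals are treated as 0-based and half-open, and overlapping intervals within each set are merged first so that the sweep stays correct.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def merge_by_chrom(intervals: Iterable[Interval]) -> Dict[str, List[List[int]]]:
    """Group intervals by chromosome and merge overlapping or adjacent ones."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        grouped[chrom].append((start, end))
    merged: Dict[str, List[List[int]]] = {}
    for chrom, spans in grouped.items():
        out: List[List[int]] = []
        for start, end in sorted(spans):
            if out and start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def intersect(a: Iterable[Interval], b: Iterable[Interval]) -> List[Interval]:
    """Return the segments covered by both interval sets."""
    ma, mb = merge_by_chrom(a), merge_by_chrom(b)
    shared: List[Interval] = []
    for chrom in sorted(ma.keys() & mb.keys()):
        xs, ys = ma[chrom], mb[chrom]
        i = j = 0
        # Two-pointer sweep over the sorted, non-overlapping lists.
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                shared.append((chrom, start, end))
            # Advance whichever interval finishes first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return shared


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]  # toy data
    dmrs = [("chr1", 150, 350), ("chr2", 80, 120)]                      # toy data
    for chrom, start, end in intersect(peaks, dmrs):
        print(chrom, start, end)
```

In practice, dedicated interval tools are usually preferred for genome-scale data; the sketch only makes the underlying operation explicit.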

Outlook

The advances of epigenomic studies enabled by NGS have profoundly challenged the long-held traditional view that the genetic code is the key determinant of gene function and that its alteration is the major cause of complex diseases. Advances in the epigenetic field have led to the realization that the packaging of the genome is as important as the genome sequence in regulating fundamental cellular processes and that its alteration is an essential cause of human diseases. A comprehensive understanding of the global patterns and interplay of these epigenetic regulators, and of their changes upon environmental stimuli, has enabled a better understanding of biology and better diagnosis and treatment strategies for diseases. Computational analysis in epigenomics, which has great power to reveal the genome-wide epigenetic landscape and its interplay with the genome, will grow significantly in the coming years. It will increasingly take both the genome sequence and the proteins that interact with the genome into account as regulatory networks for cellular processes. The decreasing cost of NGS technologies will enable quantitative analysis of epigenetic variation from the single-cell level to the level of individuals in a population, which will greatly facilitate precision medicine and the analysis of the effects of environmental factors on the human genome. Computational epigenomics data analysis will also promote the development of theoretical models and powerful tools, which will in turn facilitate further investigations toward a big-picture view of the life sciences.