Literature DB >> 23843771

Human Transcriptome and Chromatin Modifications: An ENCODE Perspective.

Li Shen¹, Inchan Choi, Eric J Nestler, Kyoung-Jae Won.

Abstract

A decade-long project, led by several international research groups, called the Encyclopedia of DNA Elements (ENCODE), recently released an unprecedented amount of data. The ambitious project covers transcriptome, cistrome, epigenome, and interactome data from more than 1,600 sets of experiments in human. To make use of this valuable resource, it is important to understand the information it represents and the techniques that were used to generate these data. In this review, we introduce the data that ENCODE generated, summarize the observations from the data analysis, and revisit a computational approach that ENCODE used to predict gene expression, with a focus on the human transcriptome and its association with chromatin modifications.

Entities: CellLine Disease Gene Species

Keywords: ENCODE; GENCODE; chromatin modification; transcriptome

Year: 2013 PMID： 23843771 PMCID： PMC3704928 DOI： 10.5808/GI.2013.11.2.60

Source DB: PubMed Journal: Genomics Inform ISSN： 1598-866X

Introduction

In September 2012, 30 research papers, including 6 in Nature, were published online (http://www.nature.com/encode/) about genomescale data from a decade-long project, the Encyclopedia of DNA Elements (ENCODE) [1]. Aiming to delineate all functional elements encoded in the human genome [1-4], the ENCODE project examined 1% of the human genome in its pilot phase and scaled up to the whole genome in its second phase. It released data from more than 1,600 sets of experiments from 147 types of tissues [4]. These data include a catalog of human protein-coding and noncoding RNAs as well as protein-DNA interactions, chromatin and DNA accessibility, histone modifications, DNA methylation, and long-range chromosomal interactions. The ENCODE consortium reported that 80.4% of the human genome serves some type of known biochemical function [4]. The unprecedented volume and span of this study will make it an excellent resource for biological analyses. On the other hand, it can be overwhelming for a researcher to access and interpret the data. ENCODE also developed novel (or refined existing) techniques for data generation and computational analyses. These techniques are by no means restricted to ENCODE and can be employed in a wide array of applications. Because of the vast span of ENCODE, we will limit our scope to the human transcriptome and its association with chromatin modifications in this review. We aim to introduce the data that ENCODE generated and the techniques (both experimental and computational) that were used to generate them. We summarize the observations that ENCODE found. We also provide a detailed discussion of a machine learning method that was used to predict gene expression from chromatin modifications with higher accuracy than its predecessors.

Chromatin Modifications Measured by Chromatin Immunoprecipitation Sequencing (ChIP-seq)

DNA sequences are wrapped around octamers of histone proteins to form nucleosomes, the unit of chromatin. The nucleosome core is composed of two copies each of four histone proteins. Each histone has an N-terminal tail that faces outward from the nucleosome and can be chemically modified to influence the accessibility of the chromatin and interactions with other chromatin-binding proteins. These histone modifications are associated with the activation or repression of gene transcription [5] and many other activities of the genome, such as enhancer activities [6] and splicing regulation [7, 8]. There has been great interest in dissecting the interactions between histones and other chromatin modifications and transcriptional regulation in recent years [9]. One of ENCODE's major goals [4] is to use these chromatin modifications to define regulatory elements of the human genome in multiple cell lines and to investigate their interactions with RNA transcription. Chromatin immunoprecipitation (ChIP) is an experimental technique used to investigate protein-DNA interactions in the cell. ChIP, followed by microarray (ChIP-chip) or high-throughput sequencing (ChIP-seq), is widely used to determine genomewide binding locations of proteins, including transcription factors, covalent modifications of histones, and other chromatin regulatory proteins. ChIP-chip was the chosen technology for the pilot phase of ENCODE to dissect the regulatory regions in the 1% selected portion of the human genome [2, 10, 11]. As next-generation sequencing technology advanced [12], it soon became clear that ChIP-seq was a superior approach [13, 14]. ChIP-seq was the chosen technology for the second phase of the ENCODE project [3, 4, 15]. Although ChIP-seq shows clear advantages over ChIP-chip, it is by no means a perfect technology. In an evaluation study performed by Chen et al. [16], multiple factors were found to influence the fidelity of ChIP-seq data. For example, ChIP-seq data were found to be biased toward open chromatin regions, leading to false positives if not corrected; comparison of different algorithms also showed notable variation in sensitivity and specificity. The ENCODE consortium has established a set of guidelines to ensure the quality of ChIP-seq experiments [15]. Because antibodies play a predominant role in the success of ChIP experiments, a significant effort has been made by ENCODE to characterize the specificity and efficiency of a large number of antibodies. The list of the antibodies used and validated by ENCODE can be found at: http://genome.ucsc.edu/ENCODE/antibodies.html. A large-scale assessment of histone modification antibodies (>200) was performed by multiple groups [17] of the ENCODE consortium, with the most up-to-date information available at: http://compbio.med.harvard.edu/antibodies/. In addition, multiple quality metrics have been developed and used by the ENCODE project [15], as listed in Table 1. Having employed these quality metrics, the ENCODE consortium mapped 11 histone modifications plus one histone variant across 46 human cell types, including a complete matrix of the 12 marks across two groups of cell lines, designated as tier 1 and tier 2.

Table 1

Quality metrics of chromatin modification ChIP-seq data employed by ENCODE

ChIP-seq, chromatin immunoprecipitation sequencing; ENCODE, Encyclopedia of DNA Elements; NRF, nonredundancy fraction; PCR, polymerase chain reaction; FRiP, fraction of reads in peaks; CC, cross correlation; NSC, normalized strand cross-correlation; RSC, relative strand cross-correlation; IDR, irreproducible discovery rate; TSS, transcription start site; TES, transcription end site.

Transcriptome in Human Cells

ENCODE releases a reference gene set (GENCODE) and RNA expression catalogs

The ENCODE project has produced a reference gene set, referred to as GENCODE (http://www.gencodegenes.org) [18]. GENCODE (version 7) identified a comprehensive set of 20,687 protein-coding and 9,640 manually curated long noncoding RNA (lncRNA) loci (representing 15,512 noncoding transcripts/isoforms). Currently, in version 15, it has 20,447 coding genes and 13,249 lncRNA loci (representing 22,531 noncoding transcripts/isoforms). For this reference gene set, GENCODE used manual gene annotation from the Human and Vertebrate Analysis and Annotation (HAVANA) group (http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/) and automatic gene annotation from Ensembl [19]. To construct a comprehensive RNA expression catalog, ENCODE sequenced RNA (using RNA-seq) from 15 different cell lines in multiple subcellular fractions [18]. The expression catalog in multiple cell types provides the transcriptome of coding and noncoding genes, short (<200 bp) and long (>200 bp) in length, polyadenylated or nonpolyadenylated, as well as the compartment in cells where RNAs in each type are populated.

Long noncoding RNAs

LncRNAs are nonprotein-coding transcripts longer than 200 nucleotides. LncRNAs have been known to have similar characteristics as coding RNAs. They are polyadenylated, associated with chromatin signatures, and have multiexonic structure [20, 21]. Some lncRNAs use identical or almost identical transcription initiation complexes [22] and sometimes overlap with protein-coding genes and can be transcribed from either strand [22, 23]. GENECODE interrogated the properties of 15,512 lncRNAs. Table 2 summarizes the comparison between lncRNAs and coding RNAs. Expressed lncRNAs have an activating histone modification profile similar to that of protein-coding genes, with slightly excess levels of both silencing (H3K27me3) and activating (H3K36me3) marks in lncRNAs [24]. Although many lncRNAs are polyadenylated, they are significantly enriched in nonpolyadenylated transcripts compared with coding RNAs [24]. As expected, the lncRNAs have significantly lower protein potentials compared with mRNAs. Interestingly, they are biased toward two-exon transcripts and predominately localized in the chromatin and nucleus, while coding RNAs are predominantly observed in the cytosol; only 6% of them have a 2-exon structure.

Table 2

Comparison of lncRNAs and coding RNAs

LncRNA, long noncoding RNA; TSS, transcriptional start site.

In order to identify the function and targets of lncRNAs, Derrien et al. [24] investigated the correlation between lncRNAs and coding RNAs; both positive and negative correlations were found. Overall, the number of positive correlations was larger than the negative correlations. Compared with trans-acting lncRNAs, cis-acting lncRNAs showed more positive correlations. Interestingly, lncRNAs that intersected protein-coding exons in the antisense orientation showed a strong pattern of coexpression [24]. They also checked sequence conservation by BLASTing human lncRNAs against all available mammalian genomes. Around 30% (n = 4,546) of lncRNAs appeared to have arisen in the primate lineage, and 0.7% (n = 101) of them appeared to be human-specific [24].

RNA splicing

Recent studies suggest that many pre-mRNA processing events are cotranscriptional [25-27]. RNA imaging found that splicing can follow the completion of transcription [28]. Tilgner et al. [29] investigated cotranscriptional splicing by interrogating RNA fractions from several cellular compartments in K562 cells. They found that only a tiny fraction of exons were found to be surrounded completely by an unspliced intron in chromatin-associated RNA. This suggests that splicing is already occurring during transcription [18, 29]. The strong enrichment of spliceosomal small nuclear RNAs (snRNAs) in the chromatin-associated fraction compared with other fractions also supports cotranscriptional splicing. These observations confirm the idea that chromatin structure could play a role in splicing [7, 30-34]. However, for alternative exons and lncRNAs, splicing tends to occur later and might remain unspliced in some cases [29].

RNA editing

Li et al. [35] reported 10,210 exonic sites in the human genome where an RNA sequence did not match with the DNA sequence, suggestive of RNA editing. However, there have been debates about whether their observations occurred by sequencing error, gene duplication, mapping error, or read-end misalignment [36-38]. Park et al. [39] developed a pipeline to filter sequencing artifacts in identifying RNA editing. They found that the majority of non-A-to-G variants came from incorrect read mapping across splice junctions. Most of the edits they found were A-to-G(I) variants, which corresponds with recent observations [40] but differs from Li et al.'s report [35] of a substantial number of noncanonical single nucleotide variant edits in the RNA of human lymphoblastoid cells. Most A-to-G(I) edits were located in introns and untranslated regions, with only a fraction of sites reproducibly edited across multiple cell lines.

Pseudogenes

Pseudogenes have been considered nonfunctional sequences of genomic DNA that lost their coding potential due to disruptive mutations, such as frameshifts and premature stop codons [41-44]. Recent studies have shown that pseudogenes can regulate their parent genes [45-49]. Using manual annotation, with the assistance of computational pipelines, GENCODE created a database called Pseudogene Decoration Resource (psiDR). psiDR provides a variety of information on 11,216 peudogenes, including transcription activity, chromatin features, functional genomics, and evolutionary constraints [50]. Transcribed pseudogenes show enhanced chromatin accessibility and enrichment with histone marks, although they are lower than those of coding genes. The majority of pseudogenes contains no or very few transcription factor binding sites (TFBSs), but the differences between the number of TFBSs associated with transcribed and nontranscribed pseudogenes are significant.

Small RNAs

GENCODE also categorized 7,054 small RNAs into 2,756 micro RNAs (miRNAs), snRNAs, small nucleolar RNAs (snoRNAs), and transfer RNAs (tRNAs). They found that miRNAs and tRNAs were abundant in cytosol, snoRNAs were in the nucleus, and snRNAs were in both the nucleus and cytosol. snRNAs were found abundantly in the chromatin-associated RNA fractions, which further supports predominant splicing during transcription.

Enhancer RNAs

RNAs at enhancers (eRNAs) were first characterized by observing transcription activities at the promoter-distal CREB binding protein-binding sites in mouse cortical neurons [51]. The bidirectional property of eRNAs and their association with gene expression were further studied using nascent RNAs from global run-on sequencing (GROseq) data [52, 53]. ENCODE used RNA assays to detect transcription activity. Besides the bidirectional property, ENCODE identified transcriptional initiation using cap analysis gene expression (CAGE) [54] signals. Interestingly, they found polyadenylated eRNAs, although most eRNAs were prevalent in the nonpolyadenylated form. They also observed that histone marks associated with eRNAs were the factors for transcriptional initiation and elongation: H3K27ac, H3K79me2, and RNA polymerase. These lines of evidence suggest regulatory functions of eRNAs.

Predicting Gene Expression by Chromatin Features

Enabled by the unprecedented volume of data generated by the ENCODE project, Dong et al. [57] performed a very interesting study that attempted to predict gene expression from chromatin features using machine learning techniques. As usually practiced, a machine learning study is composed of a series of procedures that typically involve data collection, data representation, model building, and testing. Using machine learning terminology, the response variables here are gene expression patterns that are predicted/modeled, while the predictors or features are various chromatin measures. Chromatin data were collected from the ENCODE project with 11 chromatin modifications, one histone variant, and DNase I hypersensitivity, all mapped in seven cell lines (Table 3). Gene expression data were from different cellular compartments, using two different RNA isolation approaches and sequenced with different technologies (Table 3). With such diversity of RNA sources, Dong et al. [57] answered not only the general question of whether gene expression can be predicted with satisfactory accuracy but also the questions of whether different RNA sources that are sequenced by different technologies can be predicted differently using the same chromatin features.

Table 3

Summary of data collected for modeling gene expression using chromatin features

The gene expression data are relatively easy to represent. They were separated into two classes: transcriptional start site (TSS)-based and transcript-based (Tx-based). TSS-based expression data are read counts within a 101-bp window centered on the TSS, which measures transcriptional initiation. Tx-based expression data are summarized read counts from the whole transcript, which measures transcriptional elongation. However, the representation of chromatin data seems to be tricky and requires further research. Dong et al. [57] used a strategy called "bestbin," which considers the chromatin signals across the entire gene body, including 2-kb flanking regions. It basically segregates each genic region into equal bins of 100 bp and summarizes the chromatin signals within each. A training dataset was used to identify the bin that correlates most with gene expression, and the learned parameter values were applied to testing data. Other strategies [58, 59] are possible, but the "bestbin" strategy was found to be superior [57]. RNA sequencing data are known to contain very little or no background noise, with a large portion of the genes having 0 read counts. Therefore, the response variables become a mixture of a 0 component and a positive counting component. Neither a classifier nor regression method seems to be able to capture the variability of both. To deal with the challenge, a two-step approach was used by [57], so that a classifier first categorized genes as "expressed" or "unexpressed," Then, a regression method was used to predict the expression levels of the expressed genes. The final prediction is the product of the classifier and the regression method. To test the performance of their approach, each dataset was separated into a training set and a testing set. On the training set, the "best bin" and a few other parameters were determined. After that, a 10-fold crossvalidation was performed on the testing set to evaluate the model. AUC [60] was used to represent the accuracy of the classifier. Two criteria were used to represent the accuracy of the regression method. Pearson correlation coefficient (PCC) was used to measure the similarity between the predicted value and experimental value. Root mean square error was used to measure the disparity between the predicted value and experimental value. Overall, the two-step model achieved very satisfactory performance, with a PCC > 0.9 for a number of datasets and a PCC > 0.8 for 71% of the whole data. Looking at the two steps separately, the AUC can be as high as 0.95 for the classifier, and the PCC can be as high as 0.77 for the regression method when predicting CAGE-measured polyA+ cytosolic RNA expression in K562 cells. Similar performance was achieved in other datasets. It was also found that H3K9ac and H3K4me3 are the most important predictors for the classifier, strengthening their roles as activation marks at TSSs. In contrast, H3K79me2 and H3K36me3 are the most important predictors for the regression methods, strengthening their roles as elongation marks at gene bodies. These findings show that the two-step model not only improved the accuracy of prediction but also enabled the identification of the chromatin features that are associated with different transcriptional roles.

Discussion

During the past decade, the ENCODE project has evolved into a genomewide scale, and the dataset it generated has expanded in quantity as well as in scope. The ENCODE project has provided a global view about the human transcriptome and most noticeably found that the transcribed region of the human genome is more abundant than we previously thought. This finding significantly reduced the so-called intergenic regions, as defined in the traditional sense. The quantitative measurement of RNA species in several cellular compartments as well as their polyadenylation provided a comprehensive view of RNA generation. In this review, we have revisited the characteristics of both coding and noncoding transcripts in association with their structures and locations in cells. Besides the human transcriptome and the associated chromatin modification data that we discussed, the ENCODE consortium also mapped transcription factor binding sites and their associated DNA motifs, as well as DNA methylation and long-range chromosomal interactions [4]. In parallel, the Roadmap Epigenomics Project (http://www.roadmapepigenomics.org) and International Human Epigenome Consortium (http://www.ihec-epigenomes.org) have been accumulating data of a similar scale to understand the human genome in other tissues and conditions. It is remarkable that the ENCODE data altogether have associated more than 80% of the human genome with some type of biochemical function so far, and the coverage will continue to increase as we map additional protein-DNA interactions in the near future. It has now become very clear that so-called "junk DNA" is not evolutionarily vestigial but has specific structural or biochemical functions. While data generation has been a major goal of ENCODE, the need to integrate the current datasets is becoming more and more important. Computational approaches have been developed to exploit the ENCODE data at to a fuller potential. For instance, chromatin features were used to model gene expression [57, 61]; integrative methods were developed to annotate genomes [62-64]; visualization tools were developed to investigate epigenomic regulation at a global scale (also see ngs.plot at https://code.google.com/p/ngsplot/) [65, 66]; and large regulatory networks were reconstructed, based on TFs and DNaseI footprinting [67, 68]. The network-based approach as well as the chromosomal interactions [69] provided novel angles in studying gene regulation at higher levels. New approaches to integrate the large amount of data to provide new biological insights are on the horizon.

68 in total

1. CEAS: cis-regulatory element annotation system.

Authors: Hyunjin Shin; Tao Liu; Arjun K Manrai; X Shirley Liu
Journal: Bioinformatics Date: 2009-08-18 Impact factor: 6.937

2. Histone modification levels are predictive for gene expression.

Authors: Rosa Karlić; Ho-Ryun Chung; Julia Lasserre; Kristian Vlahovicek; Martin Vingron
Journal: Proc Natl Acad Sci U S A Date: 2010-02-01 Impact factor: 11.205

3. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

Review 4. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

5. Architecture of the human regulatory network derived from ENCODE data.

Authors: Mark B Gerstein; Anshul Kundaje; Manoj Hariharan; Stephen G Landt; Koon-Kiu Yan; Chao Cheng; Xinmeng Jasmine Mu; Ekta Khurana; Joel Rozowsky; Roger Alexander; Renqiang Min; Pedro Alves; Alexej Abyzov; Nick Addleman; Nitin Bhardwaj; Alan P Boyle; Philip Cayting; Alexandra Charos; David Z Chen; Yong Cheng; Declan Clarke; Catharine Eastman; Ghia Euskirchen; Seth Frietze; Yao Fu; Jason Gertz; Fabian Grubert; Arif Harmanci; Preti Jain; Maya Kasowski; Phil Lacroute; Jing Jane Leng; Jin Lian; Hannah Monahan; Henriette O'Geen; Zhengqing Ouyang; E Christopher Partridge; Dorrelyn Patacsil; Florencia Pauli; Debasish Raha; Lucia Ramirez; Timothy E Reddy; Brian Reed; Minyi Shi; Teri Slifer; Jing Wang; Linfeng Wu; Xinqiong Yang; Kevin Y Yip; Gili Zilberman-Schapira; Serafim Batzoglou; Arend Sidow; Peggy J Farnham; Richard M Myers; Sherman M Weissman; Michael Snyder
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

6. A user's guide to the encyclopedia of DNA elements (ENCODE).

Authors:
Journal: PLoS Biol Date: 2011-04-19 Impact factor: 8.029

7. ChIP-chip versus ChIP-seq: lessons for experimental design and data analysis.

Authors: Joshua W K Ho; Eric Bishop; Peter V Karchenko; Nicolas Nègre; Kevin P White; Peter J Park
Journal: BMC Genomics Date: 2011-02-28 Impact factor: 3.969

8. Integrative annotation of chromatin elements from ENCODE data.

Authors: Michael M Hoffman; Jason Ernst; Steven P Wilder; Anshul Kundaje; Robert S Harris; Max Libbrecht; Belinda Giardine; Paul M Ellenbogen; Jeffrey A Bilmes; Ewan Birney; Ross C Hardison; Ian Dunham; Manolis Kellis; William Stafford Noble
Journal: Nucleic Acids Res Date: 2012-12-05 Impact factor: 16.971

9. An expansive human regulatory lexicon encoded in transcription factor footprints.

Authors: Shane Neph; Jeff Vierstra; Andrew B Stergachis; Alex P Reynolds; Eric Haugen; Benjamin Vernot; Robert E Thurman; Sam John; Richard Sandstrom; Audra K Johnson; Matthew T Maurano; Richard Humbert; Eric Rynes; Hao Wang; Shinny Vong; Kristen Lee; Daniel Bates; Morgan Diegel; Vaughn Roach; Douglas Dunn; Jun Neri; Anthony Schafer; R Scott Hansen; Tanya Kutyavin; Erika Giste; Molly Weaver; Theresa Canfield; Peter Sabo; Miaohua Zhang; Gayathri Balasundaram; Rachel Byron; Michael J MacCoss; Joshua M Akey; M A Bender; Mark Groudine; Rajinder Kaul; John A Stamatoyannopoulos
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

10. The GENCODE pseudogene resource.

Authors: Baikang Pei; Cristina Sisu; Adam Frankish; Cédric Howald; Lukas Habegger; Xinmeng Jasmine Mu; Rachel Harte; Suganthi Balasubramanian; Andrea Tanzer; Mark Diekhans; Alexandre Reymond; Tim J Hubbard; Jennifer Harrow; Mark B Gerstein
Journal: Genome Biol Date: 2012-09-26 Impact factor: 13.583

2 in total

1. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases.

Authors: Li Shen; Ningyi Shao; Xiaochuan Liu; Eric Nestler
Journal: BMC Genomics Date: 2014-04-15 Impact factor: 3.969

2. Histone arginine methylation in cocaine action in the nucleus accumbens.

Authors: Diane M Damez-Werno; HaoSheng Sun; Kimberly N Scobie; Ningyi Shao; Jaclyn Rabkin; Caroline Dias; Erin S Calipari; Ian Maze; Catherine J Pena; Deena M Walker; Michael E Cahill; Ramesh Chandra; Amy Gancarz; Ezekiell Mouzon; Joseph A Landry; Hannah Cates; Mary-Kay Lobo; David Dietz; C David Allis; Ernesto Guccione; Gustavo Turecki; Paola Defilippi; Rachael L Neve; Yasmin L Hurd; Li Shen; Eric J Nestler
Journal: Proc Natl Acad Sci U S A Date: 2016-08-09 Impact factor: 11.205

2 in total