Mark B Gerstein1, Joel Rozowsky2, Koon-Kiu Yan2, Daifeng Wang2, Chao Cheng3, James B Brown4, Carrie A Davis5, LaDeana Hillier6, Cristina Sisu2, Jingyi Jessica Li7, Baikang Pei2, Arif O Harmanci2, Michael O Duff8, Sarah Djebali9, Roger P Alexander10, Burak H Alver11, Raymond Auerbach10, Kimberly Bell12, Peter J Bickel13, Max E Boeck14, Nathan P Boley15, Benjamin W Booth16, Lucy Cherbas17, Peter Cherbas17, Chao Di18, Alex Dobin12, Jorg Drenkow12, Brent Ewing14, Gang Fang10, Megan Fastuca12, Elise A Feingold19, Adam Frankish20, Guanjun Gao18, Peter J Good19, Roderic Guigó21, Ann Hammonds16, Jen Harrow20, Roger A Hoskins16, Cédric Howald22, Long Hu18, Haiyan Huang13, Tim J P Hubbard23, Chau Huynh14, Sonali Jha12, Dionna Kasper24, Masaomi Kato25, Thomas C Kaufman26, Robert R Kitchen10, Erik Ladewig27, Julien Lagarde21, Eric Lai27, Jing Leng10, Zhi Lu18, Michael MacCoss14, Gemma May28, Rebecca McWhirter29, Gennifer Merrihew14, David M Miller29, Ali Mortazavi30, Rabi Murad30, Brian Oliver31, Sara Olson32, Peter J Park11, Michael J Pazin19, Norbert Perrimon33, Dmitri Pervouchine21, Valerie Reinke24, Alexandre Reymond34, Garrett Robinson13, Anastasia Samsonova33, Gary I Saunders35, Felix Schlesinger12, Anurag Sethi10, Frank J Slack25, William C Spencer29, Marcus H Stoiber15, Pnina Strasbourger14, Andrea Tanzer36, Owen A Thompson14, Kenneth H Wan16, Guilin Wang24, Huaien Wang12, Kathie L Watkins29, Jiayu Wen27, Kejia Wen18, Chenghai Xue12, Li Yang37, Kevin Yip38, Chris Zaleski12, Yan Zhang10, Henry Zheng10, Steven E Brenner39, Brenton R Graveley8, Susan E Celniker40, Thomas R Gingeras5, Robert Waterston6. 1. 1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [3] Department of Computer Science, Yale University, 51 Prospect Street, New Haven, Connecticut 06511, USA [4] [5]. 2. 1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [3]. 3. 1] Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire 03755, USA [2] Institute for Quantitative Biomedical Sciences, Norris Cotton Cancer Center, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire 03766, USA [3]. 4. 1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2] Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA [3]. 5. 1] Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA [2]. 6. 1] Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA [2]. 7. 1] Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA [2] Department of Statistics, University of California, Los Angeles, California 90095-1554, USA [3] Department of Human Genetics, University of California, Los Angeles, California 90095-7088, USA [4]. 8. 1] Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA [2]. 9. 1] Centre for Genomic Regulation, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain [2] Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain [3]. 10. 1] Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA [2] Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, Connecticut 06520, USA. 11. Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, Massachusetts 02115, USA. 12. Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA. 13. Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA. 14. Department of Genome Sciences and University of Washington School of Medicine, William H. Foege Building S350D, 1705 Northeast Pacific Street, Box 355065 Seattle, Washington 98195-5065, USA. 15. 1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2] Department of Biostatistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, USA. 16. Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA. 17. 1] Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA [2] Center for Genomics and Bioinformatics, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA. 18. MOE Key Lab of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China. 19. National Human Genome Research Institute, National Institutes of Health, 5635 Fishers Lane, Bethesda, Maryland 20892-9307, USA. 20. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. 21. 1] Centre for Genomic Regulation, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain [2] Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain. 22. 1] Center for Integrative Genomics, University of Lausanne, Genopode building, Lausanne 1015, Switzerland [2] Swiss Institute of Bioinformatics, Genopode building, Lausanne 1015, Switzerland. 23. 1] Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK [2] Medical and Molecular Genetics, King's College London, London WC2R 2LS, UK. 24. Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA. 25. Department of Molecular, Cellular and Developmental Biology, PO Box 208103, Yale University, New Haven, Connecticut 06520, USA. 26. Department of Biology, Indiana University, 1001 East 3rd Street, Bloomington, Indiana 47405-7005, USA. 27. Sloan-Kettering Institute, 1275 York Avenue, Box 252, New York, New York 10065, USA. 28. 1] Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA [2] Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 USA. 29. Department of Cell and Developmental Biology, Vanderbilt University, 465 21st Avenue South, Nashville, Tennessee 37232-8240, USA. 30. 1] Developmental and Cell Biology, University of California, Irvine, California 92697, USA [2] Center for Complex Biological Systems, University of California, Irvine, California 92697, USA. 31. Section of Developmental Genomics, Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA. 32. Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA. 33. 1] Department of Genetics and Drosophila RNAi Screening Center, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA [2] Howard Hughes Medical Institute, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA. 34. Center for Integrative Genomics, University of Lausanne, Genopode building, Lausanne 1015, Switzerland. 35. 1] Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK [2] European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK. 36. 1] Bioinformatics and Genomics Programme, Center for Genomic Regulation, Universitat Pompeu Fabra (CRG-UPF), 08003 Barcelona, Catalonia, Spain [2] Institute for Theoretical Chemistry, Theoretical Biochemistry Group (TBI), University of Vienna, Währingerstrasse 17/3/303, A-1090 Vienna, Austria. 37. 1] Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, 400 Farmington Avenue, Farmington, Connecticut 06030, USA [2] Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China. 38. 1] Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong [2] 5 CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. 39. 1] Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA [2] Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA [3]. 40. 1] Department of Genome Dynamics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA [2].
Abstract
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
The transcriptome is the readout of the genome. Identifying common features in it
across distant species can reveal fundamental principles. To this end, the ENCODE and
modENCODE consortia have generated large amounts of matched RNA-sequencing data for
human, worm and fly. Uniform processing and comprehensive annotation of these data allow
comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome
comparisons and revealing ancient, conserved features[1,2,3,4,5,6].
Specifically, we discovered co-expression modules shared across animals, many of which
are enriched in developmental genes. Moreover, we used expression patterns to align the
stages in worm and fly development, finding a novel pairing between worm embryo and fly
pupae, in addition to the expected embryo-to-embryo and larvae-to-larvae pairings.
Furthermore, we found that the extent of non-canonical, non-coding transcription is
similar in each organism, per base-pair. Finally, we found in all three organisms the
gene-expression levels, both coding and non-coding, can be quantitatively predicted from
chromatin features at the promoter using a “universal model,” based on a
single set of organism-independent parameters.Our comparison used the ENCODE-modENCODE RNA resource (Fig. ED1). This resource comprises: (1) deeply sequenced
RNA-Seq data from many distinct samples from all three organisms; (2) comprehensive
annotation of transcribed elements and (3) uniformly processed, standardized analysis
files, focusing on non-coding transcription and expression patterns. Where practical,
these datasets match comparable samples across organisms and to other types of
functional genomics data. In total, the resource contains 575 different experiments
containing >67B sequence reads. It encompasses many different RNA types,
including poly(A)+, poly(A)- and ribosomal-RNA-depleted RNA and short and long
RNA.
Fig ED1
Overview of the data
(A) Schematic of the RNA-seq data generated for human (red), worm
(green), and fly (blue), showing how it samples developmental stages and
various tissues and cell lines. (B) The number and size of data sets
generated. The amount of new data beyond that in the previous ENCODE
publications[8, 11, 20] is indicated by white bars, with previous ENCODE
data indicated by solid bars. (See Supplement section B.2 for a
detailed description of these data.)
The annotation in the resource represents capstones for the decade-long efforts
in human, worm, and fly. The new annotation sets have numbers, sizes and families of
protein-coding genes similar to previous compilations; however, the number of
pseudogenes and annotated ncRNAs differ (Figs. ED2,
ED3, S1). Also, the number of splicing events is
greatly increased, resulting in a concomitant increase in protein complexity. We find
the proportion of the different types of alternative splicing (e.g., exon skipping or
intron retention) is generally similar across the three organisms; however, skipped
exons predominate in human while retained introns are most common in worm and
fly[7] (Figs. ED4, S1 and Table S1).
Fig ED2
Summary plots for the protein-coding gene annotations
(A) Distributions of key summary statistics - gene span, longest ORF
per gene, CDS exon length, and CDS exons per gene; note that the x axes are
in log scale. Both fly and worm genes span similar genomic lengths while
human genes span larger regions (mostly due to the size of human introns).
(B) Left: Venn diagram of protein domains (from the Pfam database version
26.0) present in annotated protein-coding genes in each species. Right:
Shared domain combinations. (For more information on domain combinations,
see Fig S1h and Supplement
section B.4.1.)
Fig ED3
Summary of annotated ncRNAs, TARs, and ncRNA predictions in each
species
The number of elements, the base pairs covered and the fraction of
the genome for each class (see also Supplement section C). There
are comparable numbers of tRNAs in humans and worms but about half as many
in fly. While the number of lncRNAs in human is more than an order of
magnitude greater than in either worms or flies, the fractional genomic
coverage in all three species is similar. Finally, humans have at least
5-fold more miRNAs, snoRNAs and snRNAs compared to worm or fly. The fraction
of the genome covered by TARs (highlighted squares) for each species is
similar. A large amount of non-canonical transcription occurs in the introns
of annotated genes, presumably representing a mixture of unprocessed mRNAs
and internally initiated transcripts. The remaining non-canonical
transcription (249Mb, 16Mb, and 14Mb in human, worm, and fly) is intergenic
and occurs at low levels, comparable to that observed for introns (Table S2). Overall,
the fraction of the genome transcribed -- including intronic, exonic, and
non-canonical transcription -- is consistent with that previously reported
for human despite the methodological differences in the analysis (Fig. S2, Supplement section
C).
Fig ED4
Analysis of Alternative Splicing
(A) Representative orthologous genes do not share the same
exon/intron structure or alternative splicing across species. (B)
Distribution of the number of isoforms per gene. (C) Comparison of the
fraction of various alternative splicing event classes in human, worm, and
fly -- skipped exons “SE”, retained introns
“RI”, alternative 3' splice sites
“A3SS”, alternative 5' splice sites
“A5SS”, alternative first exons “AFE”,
alternative last exons “ALE”, tandem 3' UTRs
“TandemUTR”, coordinately skipped exons
“CSE”, and mutually exclusive exons “MXE”.
(See Supplement section
B.5 for a further discussion of splicing.)
A fraction of the transcription comes from genomic regions not associated with
standard annotations, representing “non-canonical transcription” (Table S2)[8]. Using a minimum-run/maximum-gap algorithm to
process reads mapping outside of protein-coding transcripts, pseudogenes and annotated
ncRNAs, we identified read clusters, i.e. transcriptionally active regions (TARs).
Across all three genomes we found roughly one third of the bases gives rise to TARs,
representing non-canonical transcription (Fig.
ED3). To determine the extent that this transcription represents an expansion of
the current established classes of ncRNAs, we identified the TARs most similar to known
annotated ncRNAs using a supervised classifier[9] (Fig. S2, Table
S2). We validated the classifier’s predictions using RT-PCR,
demonstrating high accuracy. Overall, the predictions encompass only a small fraction of
all TARs, suggesting that most TARs have features distinct from annotated ncRNAs and
that the majority of ncRNAs of established classes have already been identified. To shed
further light on the possible roles of TARs we intersected them with enhancers and HOT
regions [8,10,11,12,13], finding statistically
significant overlaps (Fig. ED5, Table S2).
Fig ED5
Characterizing Non-canonical Transcription
(A) The overlap of enhancers and distal HOT regions with supervised
ncRNA predictions and TARs in human, worm, and fly. The overlap of enhancers
and distal HOT regions with respect to both supervised ncRNA predictions as
well as TARs are significantly enriched compared to a randomized
expectation. (B) The left side highlights ncRNA/TARs that are highly
correlated with corresponding HOX orthologues in human (HOXB4), worm
(lin-39), and fly (Dfd). The expression of mir-10 correlates strongly with
Dfd in fly (r=0.66, p<6e-4 in fly), as does mir-10a in
human, which correlates strongly with HOXB4 (r=0.88,
p<2e-9). A TAR (chrIII:8871234-2613) strongly correlates with
lin-39 (r=0.91, p<4e-13) in worm. The
right side shows TARs in human (chr19:7698570-7701990), worm
(chrII:11469045-440), and fly (chr2L:2969620-772) that are negatively
correlated with the expression of three orthologous genes: SGCB
(r=-0.91, p<3e-16), sgcb-1 (r=-0.86,
p<2e-7), and Scgb (r=-0.82, p<4e-8), respectively.
(More details on all parts of this figure are in Supplement section C and Table
S2.)
Given the uniformly processed nature of the data and annotations, we were able to
make comparisons across organisms. First, we built co-expression modules, extending
earlier analysis[14](Fig. 1a). To detect modules consistently across the
three species, we combined across-species orthology and within-species co-expression
relationships. In the resulting multilayer network we searched for dense subgraphs
(modules), using simulated annealing[15,16]. We found some modules
dominated by a single species, whereas others contain genes from two or three. As
expected, the modules with genes from multiple species are enriched in orthologs.
Moreover, a phylogenetic analysis shows that the genes in such modules are more
conserved across 56 diverse animal species (Figs.
ED6, S3). To focus
on the cross-species conserved functions, we restricted the clustering to orthologs,
arriving at 16 conserved modules, which are enriched in a variety of functions, ranging
from morphogenesis to chromatin remodeling (Fig.
1a, Table S3). Finally,
we annotated many TARs based on correlating their expression profiles with these modules
(Fig. ED5).
Fig 1
Expression Clustering
(A) Left: Human, worm, and fly gene-gene co-association matrix; darker coloring
reflects the increased likelihood that a pair of genes are assigned to the same
module. A dark block along the diagonal represents a group of genes within a
species. If this is associated with an off-diagonal block then it is a
cross-species module (e.g. a three-species conserved module is shown with a
circle and a worm-fly module, with a star). However, if a diagonal block has no
off-diagonal associations, then it forms a species-specific module (e.g. green
pentagon). Right: The GO functional enrichment of genes within the 16 conserved
modules is shown. (B) Alignment of worm-and-fly developmental stages based on
all worm-fly orthologs. Inset shows worm-fly stage alignment using only
hourglass orthologs is more significant and exhibits a gap (brown) matching the
phylotypic stage. (C) Normalized expression of the conserved modules in fly
shows the smallest intra-organism divergence during the phylotypic stage
(brown). (See Figs. ED 6 and 7 for further details.)
Fig ED6
Details on Expression Clustering
(A) Pie charts showing gene conservation across 56 Ensembl species
for the blocks in the Fig. 1 heatmap
enclosed with the same symbol (i.e. pentagon here matches pentagon in Fig.1a). Overall, species-specific
modules tend to have fewer orthologs across 56 Ensembl species. (B) The
expression levels of a conserved module (Module No. 5) in D.
melanogaster and its orthologous counterparts in other 5
Drosophila species are plotted against time. The x-axis
represents the middle time points of two-hour periods at fly embryo stages.
The boxes represent the log10 modular expression levels from microarray data
of 6 Drosophila species centered by their medians. The
modular expression divergence (inter-quartile region) becomes minimal during
the fly phylotypic stage (brown, 8-10 hours). (C) The modular expression
correlations over a sliding 2-hour window (Pearson correlation per 5 stages,
middle time of two-hour period in x-axis) among 16 modules in worm are
plotted. The modular correlations (median shown as bar height in y-axis) are
highest during the worm phylotypic stages (brown), 6-8 hours. One can, in
fact, directly see this coordination as a local maximum in the
between-module correlation for the worm, which has a more densely sampled
developmental time course. (This figure provides more detail on Fig. 1a and 1c. More details on all parts
of this figure are in Supplement section D and Figure S3.)
Next, we used the expression profiles of orthologous genes to align the
developmental stages in worm and fly (Fig. 1b,
ED7). For every developmental stage, we
identified stage-associated genes, i.e. genes highly expressed at a particular stage but
not across all stages. We then counted the number of orthologous pairs among these
stage-associated genes for each possible worm-and-fly stage correspondence, aligning
stages by the significance of the overlap. Strikingly, worm stages map to two sets of
fly stages. First, they match in the expected one-to-one fashion to the fly (i.e.
embryos-to-embryos, larvae-to-larvae). However, worm late embryonic stages also match
fly pupal stages, suggesting a shared expression program between embryogenesis and
metamorphosis. The ~50 stage-associated genes involved in this dual alignment
are enriched in functions such as ion transport and cation-channel activity (Table S3).
Fig ED7
Details on Stage Alignment
This figure provides further detail beyond Fig. 1b. (A) An alignment of worm and fly
developmental stages based on all worm-fly orthologs (11,403 pairs,
including one-to-one, one-to-many, many-to-many pairs). (B) Alignment of
worm and fly developmental stages based on just worm-fly hourglass
orthologs. Note the prominent gap in the aligned stages coincides with the
worm and fly phylotypic stages (brown band). This make sense: since the
expression values of genes in all hourglass modules converge at the
phylotypic stage, no hourglass genes can be phylotypic-stage specific, and
hence, the gap. (C) Key aligned stages from part (A). The correspondence
between parts (A) and (C) is indicated by the small Greek letters. Worm
“early embryo” and “late embryo” stages are
matched with fly “early embryo” and “late
embryo” respectively in the “lower diagonal” set of
matches, and they are also matched with fly “L1” and
“prepupa-pupa” stages respectively in the “upper
diagonal” set of matches. (More details on all parts of this figure
are in Supplement section
D.4 and Table S3.)
To gain further insight into the stage alignment, we examined our conserved
modules in terms of the “hourglass hypothesis”, which posits that all
animals go through a particular stage in embryonic development (the tight point of the
hourglass or “phylotypic” stage) during which the expression divergence
across species for orthologous genes is smallest[4,5,17]. For genes in 12 of the 16 modules, we observed canonical
hourglass behavior, i.e. inter-organism expression divergence across
closely related fly species during development is minimal[5](Fig. S3). Moreover, we find a subset of TARs also exhibit this
“hourglass” behavior (Fig. S2). Beyond looking at inter-species divergence, we
also investigated the intra-species divergence within just D.
melanogaster and C. elegans. Strikingly, we observed that
divergence of gene expression between modules is minimized during the worm and fly
phylotypic stages (Fig. 1c). This suggests, for an
individual species, the expression patterns of different modules are most tightly
coordinated (low divergence) during the phylotypic stage, but each module has its own
signature before and after this. One can, in fact, directly see this coordination as a
local maximum in between-module correlations for the worm (Fig. ED6). Finally, using genes from just the 12 “hourglass
modules,” we found that the alignment between worm and fly stages becomes
stronger (Fig. 1b, S3). The alignment shows a gap where no
changes are observed, perfectly matching the phylotypic stage.The uniformly processed and matched nature of the transcriptome data also
facilitates integration with upstream factor-binding and chromatin-modification signals.
We investigated the degree to which these upstream signals can quantitatively predict
gene expression and how consistent this prediction is across organisms. Similar to
previous reports[11,18,19], we
found consistent correlations, around the TSS, in each of the three species between
various histone-modification signals and the expression level of the downstream gene:
H3K4me1, H3K4me2, H3K4me3 and H3K27ac are positively correlated, whereas H3K27me3 is
negatively correlated (Figs. 2, ED8, S4). Then for each organism, we integrated these individual correlations
into a multivariate, statistical model, obtaining high accuracy in predicting expression
for protein-coding genes and ncRNAs. The promoter-associated marks, H3K4me2 and H3K4me3,
consistently have the highest contribution to the model.
Fig 2
Histone Models for Gene Expression
Top: Normalized correlations of two representative histone marks with expression.
Left: Relative importance of the histone marks in organism-specific models and
the universal model. Right: Prediction accuracies (Pearson correlations all
significant, p<1e-100) of the organism-specific and universal models.
(See Figs. ED 8 and 9 for further details.)
Fig ED8
Further Detail on Statistical Models for Predicting Gene
Expression
This figure provides further information beyond that in Fig. 2. Binding/expression correlations
of (A) various histone marks and (C) TFs. For instance, H3K36me3 shows
positive correlation in worm and fly, but weak negative correlation in human
at the promoter, with positive correlation over the gene body. (B) The
positional accuracy from the TF and histone-mark models for predicting mRNA
and ncRNA expression about the TSS. (More details on all parts of this
figure are in Supplement
section E and Fig. S4.)
A similar statistical analysis with TFs showed the correlation between gene
expression and transcription-factor (TF) binding to be the greatest at the TSS,
positively for activators and negatively for repressors (Fig. ED8). Integrated TF models in each organism also achieved high accuracy
for protein-coding genes and ncRNAs, with as few as five TFs necessary for accurate
predictions (Fig. ED9). This, perhaps, reflects an
intricate, correlated structure to regulation. The relative importance of the upstream
regions is more peaked for the TF models than for the histone ones, likely reflecting
the fact that histone modifications are spread over broader regions, including the gene
body, whereas most TFs bind near the promoter.
Fig ED9
Average predictive accuracy of models with different number of randomly
selected TFs
We randomly selected n TFs as predictors and examined the predictive
accuracy by cross-validation, where n varied from 2 to 28. The curve shows
the average predictive accuracy (Fig. S4 indicates the standard
deviation of all models with the same number of predictors). Surprisingly,
models with as few as 5 TFs have predictive accuracy. This may reflect an
intricate, correlated structure to regulation. However, it could also be
that open chromatin is characteristic of gene expression and TFs bind
somewhat indiscriminately. (More details on all parts of this figure are in
Supplement section
E.)
Finally, we constructed a “universal model,” containing a single
set of organism-independent parameters (Figs 2,
S4). This achieved accuracy
comparable to the organism-specific models. In the universal model, the consistently
important promoter-associated marks such as H3K4me2 and H3K4me3 are weighted most
highly. In contrast, the enhancer mark H3K4me1 is down-weighted, perhaps reflecting that
signals for most human enhancers are not near the TSS. Using the same set of
organism-independent parameters derived from training on protein-coding genes, the
universal model can also accurately predict ncRNA expression.Overall, our comparison of the transcriptomes of three phylo-genetically distant
metazoans highlights fundamental features of transcription conserved across animal
phyla. First, there are ancient co-expression modules across organisms, many of which
are enriched for developmentally important “hourglass” genes. These
conserved modules have highly coordinated intra-organism expression during the
phylotypic stage, but display diversified expression before and after. The expression
clustering also aligns developmental stages between worm and fly, revealing shared
expression programs between embryogenesis and metamorphosis. Finally, we were able to
build a single model that could predict transcription in all three organisms from
upstream histone marks using a single set of parameters for both protein-coding genes
and ncRNAs. Overall, our results underscore the importance of comparing divergent model
organisms to human to highlight conserved biological principles (and dis-entangle them
from lineage-specific adaptations).
Methods
Detailed methods are in the supplement. (See first section of this
for a guide.) Data sets described here can be obtained from the ENCODE project
website at http://www.encodeproject.org via accession number ENCSR145VDW. More
detail on data availability is in section F of the supplement.
Overview of the data
(A) Schematic of the RNA-seq data generated for human (red), worm
(green), and fly (blue), showing how it samples developmental stages and
various tissues and cell lines. (B) The number and size of data sets
generated. The amount of new data beyond that in the previous ENCODE
publications[8, 11, 20] is indicated by white bars, with previous ENCODE
data indicated by solid bars. (See Supplement section B.2 for a
detailed description of these data.)
Summary plots for the protein-coding gene annotations
(A) Distributions of key summary statistics - gene span, longest ORF
per gene, CDS exon length, and CDS exons per gene; note that the x axes are
in log scale. Both fly and worm genes span similar genomic lengths while
human genes span larger regions (mostly due to the size of human introns).
(B) Left: Venn diagram of protein domains (from the Pfam database version
26.0) present in annotated protein-coding genes in each species. Right:
Shared domain combinations. (For more information on domain combinations,
see Fig S1h and Supplement
section B.4.1.)
Summary of annotated ncRNAs, TARs, and ncRNA predictions in each
species
The number of elements, the base pairs covered and the fraction of
the genome for each class (see also Supplement section C). There
are comparable numbers of tRNAs in humans and worms but about half as many
in fly. While the number of lncRNAs in human is more than an order of
magnitude greater than in either worms or flies, the fractional genomic
coverage in all three species is similar. Finally, humans have at least
5-fold more miRNAs, snoRNAs and snRNAs compared to worm or fly. The fraction
of the genome covered by TARs (highlighted squares) for each species is
similar. A large amount of non-canonical transcription occurs in the introns
of annotated genes, presumably representing a mixture of unprocessed mRNAs
and internally initiated transcripts. The remaining non-canonical
transcription (249Mb, 16Mb, and 14Mb in human, worm, and fly) is intergenic
and occurs at low levels, comparable to that observed for introns (Table S2). Overall,
the fraction of the genome transcribed -- including intronic, exonic, and
non-canonical transcription -- is consistent with that previously reported
for human despite the methodological differences in the analysis (Fig. S2, Supplement section
C).
Analysis of Alternative Splicing
(A) Representative orthologous genes do not share the same
exon/intron structure or alternative splicing across species. (B)
Distribution of the number of isoforms per gene. (C) Comparison of the
fraction of various alternative splicing event classes in human, worm, and
fly -- skipped exons “SE”, retained introns
“RI”, alternative 3' splice sites
“A3SS”, alternative 5' splice sites
“A5SS”, alternative first exons “AFE”,
alternative last exons “ALE”, tandem 3' UTRs
“TandemUTR”, coordinately skipped exons
“CSE”, and mutually exclusive exons “MXE”.
(See Supplement section
B.5 for a further discussion of splicing.)
Characterizing Non-canonical Transcription
(A) The overlap of enhancers and distal HOT regions with supervised
ncRNA predictions and TARs in human, worm, and fly. The overlap of enhancers
and distal HOT regions with respect to both supervised ncRNA predictions as
well as TARs are significantly enriched compared to a randomized
expectation. (B) The left side highlights ncRNA/TARs that are highly
correlated with corresponding HOX orthologues in human (HOXB4), worm
(lin-39), and fly (Dfd). The expression of mir-10 correlates strongly with
Dfd in fly (r=0.66, p<6e-4 in fly), as does mir-10a in
human, which correlates strongly with HOXB4 (r=0.88,
p<2e-9). A TAR (chrIII:8871234-2613) strongly correlates with
lin-39 (r=0.91, p<4e-13) in worm. The
right side shows TARs in human (chr19:7698570-7701990), worm
(chrII:11469045-440), and fly (chr2L:2969620-772) that are negatively
correlated with the expression of three orthologous genes: SGCB
(r=-0.91, p<3e-16), sgcb-1 (r=-0.86,
p<2e-7), and Scgb (r=-0.82, p<4e-8), respectively.
(More details on all parts of this figure are in Supplement section C and Table
S2.)
Details on Expression Clustering
(A) Pie charts showing gene conservation across 56 Ensembl species
for the blocks in the Fig. 1 heatmap
enclosed with the same symbol (i.e. pentagon here matches pentagon in Fig.1a). Overall, species-specific
modules tend to have fewer orthologs across 56 Ensembl species. (B) The
expression levels of a conserved module (Module No. 5) in D.
melanogaster and its orthologous counterparts in other 5
Drosophila species are plotted against time. The x-axis
represents the middle time points of two-hour periods at fly embryo stages.
The boxes represent the log10 modular expression levels from microarray data
of 6 Drosophila species centered by their medians. The
modular expression divergence (inter-quartile region) becomes minimal during
the fly phylotypic stage (brown, 8-10 hours). (C) The modular expression
correlations over a sliding 2-hour window (Pearson correlation per 5 stages,
middle time of two-hour period in x-axis) among 16 modules in worm are
plotted. The modular correlations (median shown as bar height in y-axis) are
highest during the worm phylotypic stages (brown), 6-8 hours. One can, in
fact, directly see this coordination as a local maximum in the
between-module correlation for the worm, which has a more densely sampled
developmental time course. (This figure provides more detail on Fig. 1a and 1c. More details on all parts
of this figure are in Supplement section D and Figure S3.)
Details on Stage Alignment
This figure provides further detail beyond Fig. 1b. (A) An alignment of worm and fly
developmental stages based on all worm-fly orthologs (11,403 pairs,
including one-to-one, one-to-many, many-to-many pairs). (B) Alignment of
worm and fly developmental stages based on just worm-fly hourglass
orthologs. Note the prominent gap in the aligned stages coincides with the
worm and fly phylotypic stages (brown band). This make sense: since the
expression values of genes in all hourglass modules converge at the
phylotypic stage, no hourglass genes can be phylotypic-stage specific, and
hence, the gap. (C) Key aligned stages from part (A). The correspondence
between parts (A) and (C) is indicated by the small Greek letters. Worm
“early embryo” and “late embryo” stages are
matched with fly “early embryo” and “late
embryo” respectively in the “lower diagonal” set of
matches, and they are also matched with fly “L1” and
“prepupa-pupa” stages respectively in the “upper
diagonal” set of matches. (More details on all parts of this figure
are in Supplement section
D.4 and Table S3.)
Further Detail on Statistical Models for Predicting Gene
Expression
This figure provides further information beyond that in Fig. 2. Binding/expression correlations
of (A) various histone marks and (C) TFs. For instance, H3K36me3 shows
positive correlation in worm and fly, but weak negative correlation in human
at the promoter, with positive correlation over the gene body. (B) The
positional accuracy from the TF and histone-mark models for predicting mRNA
and ncRNA expression about the TSS. (More details on all parts of this
figure are in Supplement
section E and Fig. S4.)
Average predictive accuracy of models with different number of randomly
selected TFs
We randomly selected n TFs as predictors and examined the predictive
accuracy by cross-validation, where n varied from 2 to 28. The curve shows
the average predictive accuracy (Fig. S4 indicates the standard
deviation of all models with the same number of predictors). Surprisingly,
models with as few as 5 TFs have predictive accuracy. This may reflect an
intricate, correlated structure to regulation. However, it could also be
that open chromatin is characteristic of gene expression and TFs bind
somewhat indiscriminately. (More details on all parts of this figure are in
Supplement section
E.)
Authors: Zhi John Lu; Kevin Y Yip; Guilin Wang; Chong Shou; Ladeana W Hillier; Ekta Khurana; Ashish Agarwal; Raymond Auerbach; Joel Rozowsky; Chao Cheng; Masaomi Kato; David M Miller; Frank Slack; Michael Snyder; Robert H Waterston; Valerie Reinke; Mark B Gerstein Journal: Genome Res Date: 2010-12-22 Impact factor: 9.043
Authors: Alex T Kalinka; Karolina M Varga; Dave T Gerrard; Stephan Preibisch; David L Corcoran; Julia Jarrells; Uwe Ohler; Casey M Bergman; Pavel Tomancak Journal: Nature Date: 2010-12-09 Impact factor: 49.962
Authors: Rosa Karlić; Ho-Ryun Chung; Julia Lasserre; Kristian Vlahovicek; Martin Vingron Journal: Proc Natl Acad Sci U S A Date: 2010-02-01 Impact factor: 11.205
Authors: Mark B Gerstein; Zhi John Lu; Eric L Van Nostrand; Chao Cheng; Bradley I Arshinoff; Tao Liu; Kevin Y Yip; Rebecca Robilotto; Andreas Rechtsteiner; Kohta Ikegami; Pedro Alves; Aurelien Chateigner; Marc Perry; Mitzi Morris; Raymond K Auerbach; Xin Feng; Jing Leng; Anne Vielle; Wei Niu; Kahn Rhrissorrakrai; Ashish Agarwal; Roger P Alexander; Galt Barber; Cathleen M Brdlik; Jennifer Brennan; Jeremy Jean Brouillet; Adrian Carr; Ming-Sin Cheung; Hiram Clawson; Sergio Contrino; Luke O Dannenberg; Abby F Dernburg; Arshad Desai; Lindsay Dick; Andréa C Dosé; Jiang Du; Thea Egelhofer; Sevinc Ercan; Ghia Euskirchen; Brent Ewing; Elise A Feingold; Reto Gassmann; Peter J Good; Phil Green; Francois Gullier; Michelle Gutwein; Mark S Guyer; Lukas Habegger; Ting Han; Jorja G Henikoff; Stefan R Henz; Angie Hinrichs; Heather Holster; Tony Hyman; A Leo Iniguez; Judith Janette; Morten Jensen; Masaomi Kato; W James Kent; Ellen Kephart; Vishal Khivansara; Ekta Khurana; John K Kim; Paulina Kolasinska-Zwierz; Eric C Lai; Isabel Latorre; Amber Leahey; Suzanna Lewis; Paul Lloyd; Lucas Lochovsky; Rebecca F Lowdon; Yaniv Lubling; Rachel Lyne; Michael MacCoss; Sebastian D Mackowiak; Marco Mangone; Sheldon McKay; Desirea Mecenas; Gennifer Merrihew; David M Miller; Andrew Muroyama; John I Murray; Siew-Loon Ooi; Hoang Pham; Taryn Phippen; Elicia A Preston; Nikolaus Rajewsky; Gunnar Rätsch; Heidi Rosenbaum; Joel Rozowsky; Kim Rutherford; Peter Ruzanov; Mihail Sarov; Rajkumar Sasidharan; Andrea Sboner; Paul Scheid; Eran Segal; Hyunjin Shin; Chong Shou; Frank J Slack; Cindie Slightam; Richard Smith; William C Spencer; E O Stinson; Scott Taing; Teruaki Takasaki; Dionne Vafeados; Ksenia Voronina; Guilin Wang; Nicole L Washington; Christina M Whittle; Beijing Wu; Koon-Kiu Yan; Georg Zeller; Zheng Zha; Mei Zhong; Xingliang Zhou; Julie Ahringer; Susan Strome; Kristin C Gunsalus; Gos Micklem; X Shirley Liu; Valerie Reinke; Stuart K Kim; LaDeana W Hillier; Steven Henikoff; Fabio Piano; Michael Snyder; Lincoln Stein; Jason D Lieb; Robert H Waterston Journal: Science Date: 2010-12-22 Impact factor: 47.728
Authors: Nuno L Barbosa-Morais; Manuel Irimia; Qun Pan; Hui Y Xiong; Serge Gueroussov; Leo J Lee; Valentina Slobodeniuc; Claudia Kutter; Stephen Watt; Recep Colak; TaeHyung Kim; Christine M Misquitta-Ali; Michael D Wilson; Philip M Kim; Duncan T Odom; Brendan J Frey; Benjamin J Blencowe Journal: Science Date: 2012-12-21 Impact factor: 47.728
Authors: Brenton R Graveley; Angela N Brooks; Joseph W Carlson; Michael O Duff; Jane M Landolin; Li Yang; Carlo G Artieri; Marijke J van Baren; Nathan Boley; Benjamin W Booth; James B Brown; Lucy Cherbas; Carrie A Davis; Alex Dobin; Renhua Li; Wei Lin; John H Malone; Nicolas R Mattiuzzo; David Miller; David Sturgill; Brian B Tuch; Chris Zaleski; Dayu Zhang; Marco Blanchette; Sandrine Dudoit; Brian Eads; Richard E Green; Ann Hammonds; Lichun Jiang; Phil Kapranov; Laura Langton; Norbert Perrimon; Jeremy E Sandler; Kenneth H Wan; Aarron Willingham; Yu Zhang; Yi Zou; Justen Andrews; Peter J Bickel; Steven E Brenner; Michael R Brent; Peter Cherbas; Thomas R Gingeras; Roger A Hoskins; Thomas C Kaufman; Brian Oliver; Susan E Celniker Journal: Nature Date: 2010-12-22 Impact factor: 49.962
Authors: Alan P Boyle; Carlos L Araya; Cathleen Brdlik; Philip Cayting; Chao Cheng; Yong Cheng; Kathryn Gardner; LaDeana W Hillier; Judith Janette; Lixia Jiang; Dionna Kasper; Trupti Kawli; Pouya Kheradpour; Anshul Kundaje; Jingyi Jessica Li; Lijia Ma; Wei Niu; E Jay Rehm; Joel Rozowsky; Matthew Slattery; Rebecca Spokony; Robert Terrell; Dionne Vafeados; Daifeng Wang; Peter Weisdepp; Yi-Chieh Wu; Dan Xie; Koon-Kiu Yan; Elise A Feingold; Peter J Good; Michael J Pazin; Haiyan Huang; Peter J Bickel; Steven E Brenner; Valerie Reinke; Robert H Waterston; Mark Gerstein; Kevin P White; Manolis Kellis; Michael Snyder Journal: Nature Date: 2014-08-28 Impact factor: 49.962
Authors: Sarah Djebali; Carrie A Davis; Angelika Merkel; Alex Dobin; Timo Lassmann; Ali Mortazavi; Andrea Tanzer; Julien Lagarde; Wei Lin; Felix Schlesinger; Chenghai Xue; Georgi K Marinov; Jainab Khatun; Brian A Williams; Chris Zaleski; Joel Rozowsky; Maik Röder; Felix Kokocinski; Rehab F Abdelhamid; Tyler Alioto; Igor Antoshechkin; Michael T Baer; Nadav S Bar; Philippe Batut; Kimberly Bell; Ian Bell; Sudipto Chakrabortty; Xian Chen; Jacqueline Chrast; Joao Curado; Thomas Derrien; Jorg Drenkow; Erica Dumais; Jacqueline Dumais; Radha Duttagupta; Emilie Falconnet; Meagan Fastuca; Kata Fejes-Toth; Pedro Ferreira; Sylvain Foissac; Melissa J Fullwood; Hui Gao; David Gonzalez; Assaf Gordon; Harsha Gunawardena; Cedric Howald; Sonali Jha; Rory Johnson; Philipp Kapranov; Brandon King; Colin Kingswood; Oscar J Luo; Eddie Park; Kimberly Persaud; Jonathan B Preall; Paolo Ribeca; Brian Risk; Daniel Robyr; Michael Sammeth; Lorian Schaffer; Lei-Hoon See; Atif Shahab; Jorgen Skancke; Ana Maria Suzuki; Hazuki Takahashi; Hagen Tilgner; Diane Trout; Nathalie Walters; Huaien Wang; John Wrobel; Yanbao Yu; Xiaoan Ruan; Yoshihide Hayashizaki; Jennifer Harrow; Mark Gerstein; Tim Hubbard; Alexandre Reymond; Stylianos E Antonarakis; Gregory Hannon; Morgan C Giddings; Yijun Ruan; Barbara Wold; Piero Carninci; Roderic Guigó; Thomas R Gingeras Journal: Nature Date: 2012-09-06 Impact factor: 49.962
Authors: Yong H Woo; Hifzur Ansari; Thomas D Otto; Christen M Klinger; Martin Kolisko; Jan Michálek; Alka Saxena; Dhanasekaran Shanmugam; Annageldi Tayyrov; Alaguraj Veluchamy; Shahjahan Ali; Axel Bernal; Javier del Campo; Jaromír Cihlář; Pavel Flegontov; Sebastian G Gornik; Eva Hajdušková; Aleš Horák; Jan Janouškovec; Nicholas J Katris; Fred D Mast; Diego Miranda-Saavedra; Tobias Mourier; Raeece Naeem; Mridul Nair; Aswini K Panigrahi; Neil D Rawlings; Eriko Padron-Regalado; Abhinay Ramaprasad; Nadira Samad; Aleš Tomčala; Jon Wilkes; Daniel E Neafsey; Christian Doerig; Chris Bowler; Patrick J Keeling; David S Roos; Joel B Dacks; Thomas J Templeton; Ross F Waller; Julius Lukeš; Miroslav Oborník; Arnab Pain Journal: Elife Date: 2015-07-15 Impact factor: 8.140
Authors: Yan-Bo Sun; Ting-Ting Fu; Jie-Qiong Jin; Robert W Murphy; David M Hillis; Ya-Ping Zhang; Jing Che Journal: Proc Natl Acad Sci U S A Date: 2018-10-22 Impact factor: 11.205
Authors: Cristina Sisu; Baikang Pei; Jing Leng; Adam Frankish; Yan Zhang; Suganthi Balasubramanian; Rachel Harte; Daifeng Wang; Michael Rutenberg-Schoenberg; Wyatt Clark; Mark Diekhans; Joel Rozowsky; Tim Hubbard; Jennifer Harrow; Mark B Gerstein Journal: Proc Natl Acad Sci U S A Date: 2014-08-25 Impact factor: 11.205