MachiBase: a Drosophila melanogaster 5'-end mRNA transcription database.

Budrul Ahsan, Taro L Saito, Shin-ichi Hashimoto, Keigo Muramatsu, Manabu Tsuda, Atsushi Sasaki, Kouji Matsushima, Toshiro Aigaki, Shinichi Morishita.

Abstract

MachiBase (http://machibase.gi.k.u-tokyo.ac.jp/) provides a comprehensive and freely accessible resource regarding Drosophila melanogaster 5'-end mRNA transcription at different developmental states, supporting studies on the variabilities of promoter transcriptional activities and gene-expression profiles in the fruitfly. The data were generated in conjunction with the recently developed high-throughput genome sequencer Illumina/Solexa using a newly developed 5'-end mRNA collection method.

Year:  2008        PMID: 18842623      PMCID: PMC2686457          DOI: 10.1093/nar/gkn694

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Characterization of the complete repertoire of expressed messenger RNA (mRNA) is central to the functional analysis of a genome. To date, several studies have been undertaken to achieve a better understanding of the Drosophila melanogaster genome (1–4). The technical approaches used in these studies included in-depth, full-length cDNA cloning and tiling microarrays. The 5′-end SAGE method (5), however, has proved effective in cataloging large numbers of expressed genes without requiring prior knowledge of gene locations. With the simple modification of adopting the recently developed high-throughput genome sequencer Illumina/Solexa, 5′-end SAGE has become a potent tool for elucidating transcriptional mechanisms. To gain deeper insight into transcriptional activity, we collected approximately 25 million 25–27 nt 5′-end mRNA tags from embryos, larvae, young males, young females, old males, old females and the S2 cultured cell line of D. melanogaster with high technical reproducibility. After aligning these tags to unique positions in the fly genome while allowing up to three mismatches, 2.87–4.05 million uniquely mapped tags were obtained for each of the seven samples. These data constitute the most substantial transcriptional start site (TSS) and gene-expression database for D. melanogaster currently available.
MachiBase is designed to assist fly biologists in their analyses of gene expression and in placing expression data in the context of functional genomics through genomic orientation. Thus, information on differentially expressed genes can be accessed by either inputting the gene name as a keyword or selecting a chromosomal location. Aside from providing information on gene expression, these data constitute a potent resource for analyses of transcriptional regulation. The core promoter, which is the region surrounding the TSS of a gene required for recruitment of the transcription apparatus, warrants analysis. However, TSSs and core promoters have previously been identified on a gene-by-gene basis. With the help of this database, biologists can investigate transcriptional initiation mechanisms by combining the TSS data with additional information on chromatin structure and DNA methylation. In addition, these data allow accurate predictions of gene structures, particularly of the 5′-untranslated region (5′-UTR).

METHODS

The newly developed 5′-end mRNA collection method extends the range of the original 5′-end SAGE technique developed by Hashimoto et al. (5). This method initially profiles 25–27 nt tags using a novel strategy that incorporates the oligo-capping method (6). The 5′-end tags are then ligated directly to the Illumina/Solexa linker, to prepare for sequencing with the Illumina/Solexa system. Prior to construction of the Illumina/Solexa libraries, we confirmed the integrity of the cDNA using the Agilent 2100 Bioanalyser.

Collection of numerous 5′-end tags from seven libraries and testing the reproducibility of the method used

To characterize the transcriptional activity patterns of the D. melanogaster genome, we collected 25–27 nt 5′-end mRNA tags from embryos, larvae, young males, young females, old males, old females and the S2 cell line. Table 1 presents the results of this process. The second column shows that more than five million raw tags were collected from each of the seven libraries. As most of these tags were redundant, they were grouped into non-redundant representative tags, the statistics for which are shown in the third column. Each non-redundant tag stands for all occurrences of the same sequence and is therefore associated with its frequency, i.e. the number of times that it occurs.
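The grouping of redundant raw tags into non-redundant tags with frequencies can be sketched as follows. This is a minimal illustration rather than the pipeline used to build MachiBase; the function name `collapse_tags` and the short example sequences are hypothetical.

```python
from collections import Counter


def collapse_tags(raw_tags):
    """Group identical raw tag sequences; each count is the tag frequency."""
    return Counter(raw_tags)


# Hypothetical raw tags (real tags are 25-27 nt long).
raw_tags = ["ACGTTGCA", "ACGTTGCA", "GGCATTAC", "ACGTTGCA", "GGCATTAC", "TTAGGCAT"]
non_redundant = collapse_tags(raw_tags)

print(len(raw_tags))       # number of raw, redundant tags
print(len(non_redundant))  # number of raw, non-redundant tags
for tag, frequency in non_redundant.most_common():
    print(tag, frequency)
```

The two printed totals correspond to the quantities reported in the second and third columns of Table 1.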
Table 1.

Statistical analysis of collected tags and identified TSSs from the seven libraries

| Library | Number of raw, redundant tags (A) | Number of raw, non-redundant tags | Number of uniquely aligned, redundant tags (B) | B/A (%) |
| --- | --- | --- | --- | --- |
| Embryo | 5 620 821 | 2 123 688 | 3 321 095 | 59.1 |
| Larvae | 6 711 841 | 2 231 078 | 3 556 021 | 53.0 |
| Young male | 11 349 959 | 7 859 645 | 3 721 539 | 32.8 |
| Young female | 6 882 149 | 3 398 754 | 2 875 247 | 41.8 |
| Old male | 7 198 682 | 3 442 886 | 3 873 078 | 53.8 |
| Old female | 6 787 420 | 3 258 523 | 3 683 536 | 54.3 |
| S2 | 7 214 104 | 2 803 878 | 4 052 965 | 56.2 |
| Total | 51 764 976 | | 25 083 481 | 48.5 |
The frequency is expected to be reproducible, in that the frequency of each non-redundant tag is proportional to the total number of tag occurrences in independent experiments. To test for reproducibility, we performed an additional collection of 5′-end tags from the same young female Drosophila library. Figure 1A reveals a strong correlation between the two independent experiments. Furthermore, in a comparison with a quantitative PCR analysis, the employed method has been validated as a means to quantify the expression level of a transcript as the number of 5′-end tags (7).
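One way to quantify the agreement between two independent collections is a correlation coefficient over the per-tag frequencies. The sketch below uses the Pearson coefficient on raw counts, which is an assumption; the text does not state which statistic underlies Figure 1A. The function names and the tag counts are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (both must vary)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlation(run_a, run_b):
    """Correlate tag frequencies; a tag missing from one run counts as 0."""
    tags = sorted(set(run_a) | set(run_b))
    xs = [run_a.get(tag, 0) for tag in tags]
    ys = [run_b.get(tag, 0) for tag in tags]
    return pearson(xs, ys)


# Hypothetical frequencies: the second run is sequenced twice as deeply,
# so every tag frequency scales with the total number of tags.
run_a = {"tag1": 10, "tag2": 20, "tag3": 30}
run_b = {"tag1": 20, "tag2": 40, "tag3": 60}
r = replicate_correlation(run_a, run_b)
print(round(r, 6))
```

Because the coefficient is invariant to scaling, two runs of different depth still correlate perfectly when each tag frequency is proportional to the run total.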
Figure 1.

Statistical analysis of the TSS information. (A) A dot represents one non-redundant 5′-end tag, such that the values on the x-axis and y-axis indicate the frequencies of the focal tags in the respective experiments. (B) A case in which the known representative TSS is not consistent with the most frequent TSS. Note that the most abundant TSSs coincide across the seven different libraries. (C) The distribution of distances between the known representative TSSs and most frequent TSSs. Overall, 1033 (8.8%) of the 11 725 known representative TSSs coincide with the most frequent TSSs.

Identification of transcription start sites by millions of 5′-end tags

For the identification of TSSs, non-redundant tags were aligned to the genome of D. melanogaster (R5.3) in FlyBase (8). We observed that 5′-end tags tended to contain read errors, especially towards their termini. To correct these read errors, the tags were aligned to the genome while allowing, at most, three mismatches. The efficient mapping of millions of tags was an issue that needed to be resolved. We developed and used a parallel version of BLAT (9), which operates on massive parallel clusters. Another major technical issue involved the fact that a single 5′-end tag could be mapped to multiple locations, making it difficult to determine the original location of the tag. To eliminate false-positive positional data, these ambiguous tags were simply excluded from our analysis, so that only uniquely aligned tags were considered. A tag was considered to be uniquely aligned if, for a non-negative number k (⩽3), the tag was mapped to a unique location with ⩽k mismatches, although it could be mapped to multiple positions with more than k mismatches. The number of uniquely aligned and redundant tags in each library, and their ratios to the number of raw redundant tags, are shown in the fourth and fifth columns of Table 1, respectively. A uniquely aligned 5′-end tag identified a TSS in the genome. Distinct tags could be mapped to the same TSS, since the alignment step tolerated mismatches and replaced erroneous nucleotides with the correct nucleotides in the genome. From all seven libraries, a total of 25 083 481 tags were mapped to unique locations, thereby identifying 1 773 851 TSSs; the data breakdown in terms of chromosomes is presented in Table 2.
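The uniqueness rule and the merging of distinct tags onto the same TSS can be sketched as follows. Under the definition above, a tag is uniquely aligned exactly when its best alignment has at most three mismatches and no other location ties with it. The hit tuples `(chromosome, position, strand, mismatches)` are a hypothetical representation; parsing of actual BLAT output is not shown, and each location is assumed to appear at most once per tag.

```python
from collections import Counter

MAX_MISMATCHES = 3


def unique_location(hits, max_mismatches=MAX_MISMATCHES):
    """Return (chromosome, position, strand) if the tag is uniquely aligned, else None."""
    eligible = [hit for hit in hits if hit[3] <= max_mismatches]
    if not eligible:
        return None
    best = min(hit[3] for hit in eligible)
    best_hits = [hit for hit in eligible if hit[3] == best]
    if len(best_hits) != 1:
        return None  # ambiguous tag: excluded from the analysis
    chromosome, position, strand, _ = best_hits[0]
    return (chromosome, position, strand)


def count_tss(tag_frequencies, tag_hits):
    """Sum the frequencies of all uniquely aligned tags per TSS location."""
    tss_counts = Counter()
    for tag, frequency in tag_frequencies.items():
        location = unique_location(tag_hits.get(tag, []))
        if location is not None:
            tss_counts[location] += frequency
    return tss_counts


# Hypothetical alignments for four non-redundant tags.
tag_hits = {
    "tagA": [("2L", 100, "+", 0), ("3R", 500, "+", 2)],  # unique with k = 0
    "tagB": [("2L", 100, "+", 1)],                       # read error, same TSS
    "tagC": [("X", 7, "-", 1), ("X", 90, "-", 1)],       # tie: ambiguous
    "tagD": [("4", 5, "+", 4)],                          # too many mismatches
}
tag_frequencies = {"tagA": 5, "tagB": 2, "tagC": 9, "tagD": 1}
tss_counts = count_tss(tag_frequencies, tag_hits)
print(dict(tss_counts))
```

Here `tagA` and `tagB` differ in sequence but support the same TSS, which illustrates how tolerating mismatches merges tags that contain read errors.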
Table 2.

Breakdown of the uniquely aligned, redundant tags in terms of chromosomes

| Chromosome | Number of TSSs |
| --- | --- |
| 2L | 323 720 |
| 2L heterochromatin | 406 |
| 2R | 369 405 |
| 2R heterochromatin | 1801 |
| 3L | 320 952 |
| 3L heterochromatin | 1930 |
| 3R | 442 109 |
| 3R heterochromatin | 1691 |
| 4 | 10 975 |
| U | 3506 |
| U extra | 9398 |
| X | 286 427 |
| X heterochromatin | 1305 |
| Y heterochromatin | 93 |
| Mitochondria | 133 |
| Total | 1 773 851 |

Discrepancy between the known representative TSS and the most frequent TSS

In genome annotation, it is usual to choose the longest cDNA sequence at a specific locus as the representative cDNA. To examine the level of agreement between the newly collected 5′-end tags and the known representative cDNA sequences, we calculated how many of the uniquely aligned, redundant tags were located in the promoters and 5′-UTRs of the representative sequences, and found that 96.2% of the 5′-end tags fell in these regions (Table 3). Figure 1B illustrates the 5′-end tag expression patterns surrounding a representative TSS in the seven libraries. Notably, the representative TSS was not necessarily the most frequent TSS; instead, another TSS slightly downstream of the representative was the most abundant, which motivated us to examine this discrepancy. We calculated the distances between the representative TSSs and the most frequent TSSs in the promoters and 5′-UTRs of the 11 725 longest cDNA sequences in FlyBase. Figure 1C shows the numbers of representative TSSs in terms of distances, highlighting that only 1033 (8.8%) of the 11 725 known representative TSSs were the most frequent TSSs. Our analysis indicates that the common practice of selecting the longest cDNA sequence as the representative one needs to be revised, and demonstrates the efficacy of 5′-end tag collection for detecting the most abundant TSS as an alternative to the representative TSS.
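The distance calculation can be sketched as follows for a single transcript. This is an illustration under stated assumptions: the 500 bp promoter length is taken from the header of Table 3, its use for the distance calculation is assumed, and the function names, coordinates and tag counts are hypothetical. The counts passed in are assumed to come from the same chromosome and strand as the transcript.

```python
def search_window(representative_tss, utr_end, strand, promoter_len=500):
    """Genomic interval covering the promoter and 5'-UTR of one transcript."""
    if strand == "+":
        return representative_tss - promoter_len, utr_end
    return utr_end, representative_tss + promoter_len


def most_frequent_tss(tss_counts, start, end):
    """Position with the highest tag count in [start, end], or None if empty.

    Ties are broken by the smaller coordinate (an arbitrary choice).
    """
    in_window = {p: c for p, c in tss_counts.items() if start <= p <= end}
    if not in_window:
        return None
    return min(in_window, key=lambda p: (-in_window[p], p))


def signed_distance(representative_tss, frequent_tss, strand):
    """Positive values mean the frequent TSS lies downstream of the representative."""
    offset = frequent_tss - representative_tss
    return offset if strand == "+" else -offset


# Hypothetical plus-strand transcript: representative TSS at 1000, 5'-UTR ends at 1150.
plus_counts = {1000: 3, 1012: 40, 1200: 5}
start, end = search_window(1000, 1150, "+")
plus_peak = most_frequent_tss(plus_counts, start, end)
plus_distance = signed_distance(1000, plus_peak, "+")

# Hypothetical minus-strand transcript: representative TSS at 2000, 5'-UTR ends at 1850.
minus_counts = {2000: 2, 1990: 30}
start, end = search_window(2000, 1850, "-")
minus_peak = most_frequent_tss(minus_counts, start, end)
minus_distance = signed_distance(2000, minus_peak, "-")

print(plus_peak, plus_distance)
print(minus_peak, minus_distance)
```

A distance of zero means that the representative TSS is also the most frequent one, the case reported for 1033 of the 11 725 representative TSSs.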
Table 3.

Ratios of uniquely aligned, redundant tags located in the promoters and 5′-UTR of the representative sequences

| Library | Tags in 5′-UTR + promoters (500 bp upstream) | Ratio (%) | Uniquely aligned, redundant tags |
| --- | --- | --- | --- |
| Embryo | 3 212 489 | 96.7 | 3 321 095 |
| Larvae | 3 350 688 | 94.2 | 3 556 021 |
| Young male | 3 528 146 | 94.8 | 3 721 539 |
| Young female | 2 757 521 | 95.9 | 2 875 247 |
| Old male | 3 750 286 | 96.8 | 3 873 078 |
| Old female | 3 583 360 | 97.3 | 3 683 536 |
| S2 | 3 958 105 | 97.7 | 4 052 965 |
| Total | 24 140 595 | 96.2 | 25 083 481 |

Database features and applications

We visualized the number of 5′-end tags at each position as a vertical bar (Figure 2). This arrangement of 5′-end data provides an insight into fly transcription, in combination with other annotated genomic information. In the MachiBase database server, users can browse the TSSs and frequencies of individual genes by querying the FlyBase gene ID, FlyBase transcript ID, etc. In addition to the gene-specific view, it is also possible to generate an overview of all the expressed transcripts for an assigned position on a chromosome. Furthermore, all these genes are linked with the FlyBase annotation server, which contains Gene Ontology (GO), orthologue information, etc. In addition to revealing differentially expressed genes, genome-wide TSS discovery is a valuable resource for biologists studying flies. This high-throughput study has revealed a surprisingly large number of novel genic (intron–exon regions) and intergenic TSSs, which has prompted a rethink of the relationships between gene transcription and promoter architecture. For example, if we display the location (2L: 2 391 450–2 391 850) by inputting 2L into the ‘Target’ box, 2 391 450 into the ‘Start’ box and 2 391 850 into the ‘End’ box, we can see a new transcript supported by a significant number of 5′-end tags in the un-annotated intergenic region. Thus, the precise locations of the TSSs enable an in-depth analysis of cis-acting elements that are bound by transcription factors. This data resource provides a starting point for elucidating novel molecular details of transcription by reliably integrating TSS location data with related functional data, such as histone methylation and acetylation states (10,11), the positions of nucleosomes (12–14) and the occupancy of transcription factor binding sites (15), each of which, as features, can now be examined on a genome-wide basis.
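The region view described above amounts to selecting the TSS positions that fall inside a chromosomal interval and drawing one bar per position. The sketch below works on a local dictionary of counts and does not call the MachiBase server; the function name, the counts and the choice of base-10 logarithms for the bar heights are assumptions made for illustration.

```python
import math


def region_histogram(tss_counts, target, start, end):
    """Return (position, tag count, log10 bar height) for TSSs in the region.

    tss_counts maps (chromosome, position) to a positive tag count.
    """
    rows = []
    for (chromosome, position), count in sorted(tss_counts.items()):
        if chromosome == target and start <= position <= end:
            rows.append((position, count, math.log10(count)))
    return rows


# Hypothetical tag counts; only the region coordinates come from the text.
tss_counts = {
    ("2L", 2391500): 100,
    ("2L", 2391700): 10,
    ("2L", 2400000): 5,
    ("3R", 2391600): 7,
}
rows = region_histogram(tss_counts, "2L", 2391450, 2391850)
for position, count, height in rows:
    print(position, count, round(height, 2))
```

Running the same selection once per library would give the seven tracks of the browser view.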
Figure 2.

Snapshot of the MachiBase genome browser. The frequencies of the 5′-end mRNA tags mapped to individual positions on the fruitfly genome in the seven libraries are displayed as histograms in the bottom seven tracks. In the histograms, the vertical bars in log scale indicate the numbers of 5′-end mRNA tags aligned to each position on the x-axis. The upper track shows the exon–intron structures of two alternative splice variants. Observe that the peaks for the 5′-end tags are around the 5′-end of the longer splice variant in all of the seven libraries. In addition, note that many 5′-end tags are expressed from the second, second-to-last, and last exons in four adult samples (young/old and male/female).

DISCUSSION

These vast transcriptional datasets have been used to characterize differentially expressed genes, especially in relation to age and sexual development. Using these datasets, we have confirmed that the representative TSSs identified here, namely the abundantly expressed TSSs flanking FlyBase-annotated TSSs, differ from many of the known FlyBase-annotated TSSs. It has become evident that the rules for start site selection are fundamentally different for different promoters, and large-scale studies have given us the tools to partition promoters into functional classes with respect to TSS information in future studies. As a novel and high-quality data resource, MachiBase is a valuable tool for experimental biologists who are working on D. melanogaster. In future, we will extend this database with various annotated data on the fly genome.

FUNDING

Scientific Research on Priority Areas (C) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (in part); Bioinformatics Research and Development (BIRD); the Japan Science and Technology Agency (JST). Funding for open access charge: JST.

Conflict of interest statement. None declared.

  15 in total

1.  5'-end SAGE for the analysis of transcriptional start sites.

Authors:  Shin-ichi Hashimoto; Yutaka Suzuki; Yasuhiro Kasai; Kei Morohoshi; Tomoyuki Yamada; Jun Sese; Shinichi Morishita; Sumio Sugano; Kouji Matsushima
Journal:  Nat Biotechnol       Date:  2004-08-08       Impact factor: 54.908

2.  A gene expression map for the euchromatic genome of Drosophila melanogaster.

Authors:  Viktor Stolc; Zareen Gauhar; Christopher Mason; Gabor Halasz; Marinus F van Batenburg; Scott A Rifkin; Sujun Hua; Tine Herreman; Waraporn Tongprasit; Paolo Emilio Barbano; Harmen J Bussemaker; Kevin P White
Journal:  Science       Date:  2004-10-22       Impact factor: 47.728

3.  Histone H3 acetylation and H3 K4 methylation define distinct chromatin regions permissive for transgene expression.

Authors:  Chunhong Yan; Douglas D Boyd
Journal:  Mol Cell Biol       Date:  2006-09       Impact factor: 4.272

4.  Histone H3 acetylated at lysine 9 in promoter is associated with low nucleosome density in the vicinity of transcription start site in human cell.

Authors:  Hiromi Nishida; Takahiro Suzuki; Shinji Kondo; Hisashi Miura; Yu-ichi Fujimura; Yoshihide Hayashizaki
Journal:  Chromosome Res       Date:  2006-03-17       Impact factor: 5.239

5.  Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides.

Authors:  K Maruyama; S Sugano
Journal:  Gene       Date:  1994-01-28       Impact factor: 3.688

6.  Nucleosome organization in the Drosophila genome.

Authors:  Travis N Mavrich; Cizhong Jiang; Ilya P Ioshikhes; Xiaoyong Li; Bryan J Venters; Sara J Zanton; Lynn P Tomsho; Ji Qi; Robert L Glaser; Stephan C Schuster; David S Gilmour; Istvan Albert; B Franklin Pugh
Journal:  Nature       Date:  2008-04-13       Impact factor: 49.962

7.  FlyBase: genes and gene models.

Authors:  Rachel A Drysdale; Madeline A Crosby
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

8.  High-resolution analysis of the 5'-end transcriptome using a next generation DNA sequencer.

Authors:  Shin-ichi Hashimoto; Wei Qu; Budrul Ahsan; Katsumi Ogoshi; Atsushi Sasaki; Yoichiro Nakatani; Yongjun Lee; Masako Ogawa; Akio Ametani; Yutaka Suzuki; Sumio Sugano; Clarence C Lee; Robert C Nutter; Shinichi Morishita; Kouji Matsushima
Journal:  PLoS One       Date:  2009-01-01       Impact factor: 3.240

9.  Global analysis of patterns of gene expression during Drosophila embryogenesis.

Authors:  Pavel Tomancak; Benjamin P Berman; Amy Beaton; Richard Weiszmann; Elaine Kwan; Volker Hartenstein; Susan E Celniker; Gerald M Rubin
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

10.  Gene expression during the life cycle of Drosophila melanogaster.

Authors:  Michelle N Arbeitman; Eileen E M Furlong; Farhad Imam; Eric Johnson; Brian H Null; Bruce S Baker; Mark A Krasnow; Matthew P Scott; Ronald W Davis; Kevin P White
Journal:  Science       Date:  2002-09-27       Impact factor: 47.728
