
BatchGenAna: a batch platform for large-scale genomic analysis of mammalian small RNAs.

Xiaomin Ying, You Jung Kim, Yiqing Mao, Ming Liu, Yanyan Hou, Hua Li, Xiaolei Wang, Yalin Zhao, Dongsheng Zhao, Jignesh M Patel, Wuju Li.

Abstract

An increasing number of small RNAs have been discovered in mammals. However, their primary transcripts and upstream regulatory networks remain largely undetermined. Genomic analysis of small RNAs facilitates identification of their primary transcripts, and hence contributes to research on their upstream regulatory networks. We here report a batch platform, BatchGenAna, which is specifically designed for large-scale genomic analysis of mammalian small RNAs. It can map and annotate as many as 1000 small RNAs or 10,000 genomic loci of small RNAs at a time. It provides genomic features including RefSeq genes, mRNAs, ESTs and repeat elements in tabular and graphical results. It also allows extraction of flanking sequences of submitted queries, specified genomic regions and host transcripts, which facilitates subsequent analysis such as scanning for transcription factor binding sites in upstream sequences and poly(A) signals in downstream sequences. Beyond small RNAs, BatchGenAna can also be applied to other research fields, e.g. in silico analysis of target genes of transcription factors. AVAILABILITY: The platform is freely available at http://biosrv1.bmi.ac.cn/BatchGenAna.

Keywords: genomic analysis; primary transcript; small RNA

Year: 2009 PMID: 19707298 PMCID: PMC2720670 DOI: 10.6026/97320630003346

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Small RNAs of less than 40 nucleotides (nt) in length, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), scanRNAs (scnRNAs) and piwi-interacting RNAs (piRNAs), constitute a large family of tiny regulatory molecules. Among them, miRNAs, piRNAs and siRNAs have also been discovered in mammals, where they have diverse and important functions. Although our knowledge of mammalian small RNAs has advanced rapidly, the primary transcripts of most mammalian small RNAs remain to be determined. Uncovering the primary transcripts of small RNAs is very important to our understanding of small RNA biogenesis. It facilitates (a) identifying regulatory regions such as transcription factor binding sites (TFBS) and hence discovering upstream regulators in the network, (b) detecting other sequence and structural motifs required in small RNA processing, and (c) providing essential information for small RNA knockout [1]. Genomic analysis is an efficient way to identify primary transcripts before experiments. For example, several studies have attempted to delineate the genomic boundaries of mammalian miRNAs by large-scale genomic analysis [1-3].

To date, there are two approaches to large-scale genomic analysis of mammalian small RNAs. One is to download genome sequences, annotations and other data to local machines and to implement in-house programs that analyze the genomic features. Although this approach is flexible, it requires significant effort and computing skills. Researchers must map sequences to genomes, build a local database of annotations and develop programs for parsing, retrieving and displaying the results, all of which is very challenging for most molecular biology laboratories. The other approach is to use public databases such as UCSC [4], NCBI [5] or Ensembl [6]. These public databases align users' sequences (one, or no more than 30, sequences at a time) against the specified genome, display one genome view at a time and extract the flanking sequence for one sequence interactively. However, these interactive tools require much manual work, and large-scale genomic analysis with them would be exhausting. Moreover, researchers have to record the overlapping ESTs and mRNAs manually if they want to further analyze tissue sources or other features of these transcripts. Recently, miRBase has been updated with genomic features of known miRNAs [7], and piRNABank has been built for human, mouse and rat piRNAs, supporting searches for overlapping genes and repeat elements [8]. However, piRNABank does not provide other genomic features such as EST matches, nor flanking sequence extraction for further analysis, and these two specialized databases do not support genomic analysis of other classes of small RNAs or of newly discovered small RNAs.

To facilitate large-scale genomic analysis of mammalian small RNAs, we have developed a batch platform, BatchGenAna, which provides batch mapping and annotation for as many as 1000 nucleotide sequences or 10,000 genomic loci of small RNAs at a time. BatchGenAna provides genomic features including RefSeq genes, mRNAs, ESTs and repeat elements, and produces both tabular and graphical results containing the selected genomic features. It also supports extraction of flanking sequences of submitted queries, specified genomic regions and host transcripts, which facilitates subsequent analysis such as scanning for TFBS in upstream sequences and poly(A) signals in downstream sequences.

Methodology of development

The genome sequences are downloaded from NCBI [5]. The annotations of mRNAs, ESTs and RefSeq genes are downloaded from the NCBI Map Viewer [5]. The repeat annotations of the human, mouse and rat genomes are downloaded from UCSC [4]. At the time of writing this manuscript, the human, mouse and rat genome assemblies and annotations are NCBI build 36.2, 37.1 and 4.1, respectively. To improve computational efficiency while achieving high sensitivity, BatchGenAna employs miBLAST [9] to search sequences shorter than 40 nt against the genome. In our case, miBLAST is ~30 times faster than BLAST [10] for sequences shorter than 40 nt and has the same sensitivity. For longer sequences, BLAT is employed, since it is more accurate and much faster than popular existing tools for mRNA/DNA alignments when comparing vertebrate sequences [11]. A central MySQL database is used to store the downloaded data. The BatchGenAna interface is written in PHP. The programs for sequence mapping, alignment result parsing and tabular/graphical result generation are implemented in C, Perl and BioPerl [12]. The current web service runs on an Apache web server on a Linux PC with quad Intel Xeon 3.2 GHz processors and 4 GB RAM. The operating system is Red Hat Linux AS3.
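The length-based choice of aligner described above can be illustrated with a minimal Python sketch. This is not the platform's actual code (which is implemented in C, Perl and BioPerl); the function names are hypothetical, and sending sequences of exactly 40 nt to BLAT is an assumption, since the text only states "shorter than 40 nt" for miBLAST.

```python
# Illustrative sketch only: function names are hypothetical, and the handling
# of sequences of exactly 40 nt (sent to BLAT here) is an assumption.
MIBLAST_MAX_LENGTH = 40  # nt; sequences shorter than this go to miBLAST


def choose_aligner(sequence):
    """Return the name of the aligner that would handle this sequence."""
    return "miBLAST" if len(sequence) < MIBLAST_MAX_LENGTH else "BLAT"


def partition_batch(sequences):
    """Split a batch {name: sequence} into per-aligner lists of query names."""
    batches = {"miBLAST": [], "BLAT": []}
    for name, seq in sequences.items():
        batches[choose_aligner(seq)].append(name)
    return batches


if __name__ == "__main__":
    batch = {"small_rna": "U" * 22, "long_query": "A" * 80}
    print(partition_batch(batch))
```

Partitioning the batch first lets each aligner be run once over all of its queries rather than once per query, which is the setting in which a batch-oriented tool such as miBLAST is useful.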

Utility

BatchGenAna provides a friendly interface for users to specify parameters for input type, species, alignment, annotation, display and flanking sequence extraction (Figure 1a). Users may input nucleotide sequences, genomic loci or GenBank accession numbers. The genomic loci input type is provided for users who have already obtained the genomic loci of small RNAs or small RNA clusters. The GenBank accession number input type is provided for users to extract flanking sequences of the overlapping transcripts in a second run. Users can specify the genomic features they wish to focus on, including ESTs, mRNAs, RefSeq genes and repeat elements. Users can also specify the window width of genome views and the upstream/downstream lengths for flanking sequence extraction. Since batch jobs are more time-consuming than immediate jobs, users are required to enter their email addresses so that BatchGenAna can notify them as soon as their jobs are completed (execution times for various batch jobs are listed on the homepage).
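To make the upstream/downstream parameters concrete, the following Python sketch shows one way flanking sequences could be extracted for a locus on either strand. It is an illustration under stated assumptions, not BatchGenAna's implementation: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and flanks on the minus strand are assumed to be reported as reverse complements so that they read 5' to 3' relative to the query.

```python
# Illustrative sketch only. Assumptions: 1-based inclusive coordinates, and
# minus-strand flanks reported as reverse complements of the genomic sequence.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanking_sequences(chrom_seq, start, end, strand, upstream=0, downstream=0):
    """Return (upstream_flank, downstream_flank) for a locus.

    Flanks are truncated at the chromosome ends instead of raising an error.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not (1 <= start <= end <= len(chrom_seq)):
        raise ValueError("locus lies outside the chromosome")

    # On the minus strand, 'upstream' lies to the right in genomic coordinates.
    if strand == "+":
        left_len, right_len = upstream, downstream
    else:
        left_len, right_len = downstream, upstream

    left = chrom_seq[max(0, start - 1 - left_len):start - 1]
    right = chrom_seq[end:end + right_len]

    if strand == "+":
        return left, right
    return reverse_complement(right), reverse_complement(left)


if __name__ == "__main__":
    toy_chromosome = "AAACCCGGGTTT"
    print(flanking_sequences(toy_chromosome, 4, 6, "+", upstream=2, downstream=3))
    print(flanking_sequences(toy_chromosome, 4, 6, "-", upstream=2, downstream=3))
```

The upstream flank returned this way is what would be scanned for TFBS, and the downstream flank for poly(A) signals.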

Figure 1

(a) A snapshot of BatchGenAna. (b) An example tabular result of human piRNAs for mRNAs. The genomic loci and the accession numbers of mRNAs overlapping the submitted piRNAs are listed. (c) An example genome view of a small RNA cluster. Small RNAs are denoted as green boxes (on the plus strand) and red boxes (on the minus strand). ESTs, mRNAs and repeat elements are displayed in different tracks with different colours.

BatchGenAna produces both tabular and graphical results. In tabular results, the GenBank accession numbers of ESTs, mRNAs and RefSeq genes and the names/families of repeat elements are listed if they overlap with the submitted queries. Tabular results help users analyze the tissue distribution and other features of the overlapping transcripts (Figure 1b). In graphical results, BatchGenAna displays genome views centered at the submitted queries with the selected genomic features. If the distance between two consecutive genomic loci of submitted queries is less than half of the display window width, these two genomic loci are plotted in the same view (Figure 1c). Detailed information and guidance are given on the web page.
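The rule for plotting neighbouring loci in the same view can be sketched in Python as follows. This is a hedged illustration rather than the platform's code: the function name is hypothetical, loci are represented as (chromosome, start, end) tuples, and "distance" is assumed to mean the gap between the end of one locus and the start of the next on the same chromosome.

```python
# Illustrative sketch only. Assumptions: loci are (chrom, start, end) tuples and
# "distance" is the gap between the previous locus end and the next locus start.
from collections import defaultdict


def group_loci_into_views(loci, window_width):
    """Group consecutive loci that lie closer than half the window width."""
    by_chrom = defaultdict(list)
    for chrom, start, end in loci:
        by_chrom[chrom].append((start, end))

    views = []
    for chrom in sorted(by_chrom):
        current = []
        for start, end in sorted(by_chrom[chrom]):
            previous_end = current[-1][2] if current else None
            if current and start - previous_end < window_width / 2:
                current.append((chrom, start, end))  # close enough: same view
            else:
                if current:
                    views.append(current)
                current = [(chrom, start, end)]  # start a new view
        if current:
            views.append(current)
    return views


if __name__ == "__main__":
    example = [("chr1", 100, 120), ("chr1", 300, 320), ("chr1", 5000, 5020)]
    for view in group_loci_into_views(example, window_width=1000):
        print(view)
```

Sorting the loci by position first means a single pass is enough to find all groups, so a whole small RNA cluster ends up in one view when its members are spaced closely.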

Future development

Work is under way to incorporate CpG islands, transcription start sites (TSS), poly(A) signals, TFBS and CAGE tags into BatchGenAna. We also plan to integrate annotations from NCBI, UCSC and Ensembl into BatchGenAna, since their genome annotations differ somewhat.

12 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

3. Primary transcripts and expressions of mammal intergenic microRNAs detected by mapping ESTs to their flanking sequences.

Authors: Jin Gu; Tao He; Yunfei Pei; Fei Li; Xiaowo Wang; Jing Zhang; Xuegong Zhang; Yanda Li
Journal: Mamm Genome Date: 2006-10-03 Impact factor: 2.957

4. Identification of mammalian microRNA host genes and transcription units.

Authors: Antony Rodriguez; Sam Griffiths-Jones; Jennifer L Ashurst; Allan Bradley
Journal: Genome Res Date: 2004-09-13 Impact factor: 9.043

5. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs.

Authors: S Sai Lakshmi; Shipra Agrawal
Journal: Nucleic Acids Res Date: 2007-09-18 Impact factor: 16.971

6. The UCSC Genome Browser Database: 2008 update.

Authors: D Karolchik; R M Kuhn; R Baertsch; G P Barber; H Clawson; M Diekhans; B Giardine; R A Harte; A S Hinrichs; F Hsu; K M Kober; W Miller; J S Pedersen; A Pohl; B J Raney; B Rhead; K R Rosenbloom; K E Smith; M Stanke; A Thakkapallayil; H Trumbower; T Wang; A S Zweig; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2007-12-17 Impact factor: 16.971

7. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael Dicuccio; Ron Edgar; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; Vadim Miller; James Ostell; Kim D Pruitt; Gregory D Schuler; Martin Shumway; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971

8. miRBase: tools for microRNA genomics.

Authors: Sam Griffiths-Jones; Harpreet Kaur Saini; Stijn van Dongen; Anton J Enright
Journal: Nucleic Acids Res Date: 2007-11-08 Impact factor: 16.971

9. Ensembl 2008.

Authors: P Flicek; B L Aken; K Beal; B Ballester; M Caccamo; Y Chen; L Clarke; G Coates; F Cunningham; T Cutts; T Down; S C Dyer; T Eyre; S Fitzgerald; J Fernandez-Banet; S Gräf; S Haider; M Hammond; R Holland; K L Howe; K Howe; N Johnson; A Jenkinson; A Kähäri; D Keefe; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; A Prlic; S Rice; D Rios; M Schuster; I Sealy; G Slater; D Smedley; G Spudich; S Trevanion; A J Vilella; J Vogel; S White; M Wood; E Birney; T Cox; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; T J P Hubbard; A Kasprzyk; G Proctor; J Smith; A Ureta-Vidal; S Searle
Journal: Nucleic Acids Res Date: 2007-11-13 Impact factor: 16.971

10. miBLAST: scalable evaluation of a batch of nucleotide sequence queries with BLAST.

Authors: You Jung Kim; Andrew Boyd; Brian D Athey; Jignesh M Patel
Journal: Nucleic Acids Res Date: 2005-08-01 Impact factor: 16.971