i-Genome: a database to summarize oligonucleotide data in genomes.

Feng-Mao Lin, Hsien-Da Huang, Yu-Chung Chang, Jorng-Tzong Horng.

Abstract

BACKGROUND: Information on the occurrence of sequence features in genomes is crucial to comparative genomics, evolutionary analysis, the analyses of regulatory sequences and the quantitative evaluation of sequences. Computing the frequencies and the occurrences of a pattern in complete genomes is time-consuming.
RESULTS: The proposed database provides information about sequence features generated by exhaustively computing the sequences of the complete genome. The repetitive elements in the eukaryotic genomes, such as LINEs, SINEs, Alu and LTR, are obtained from Repbase. The database supports various complete genomes including human, yeast, worm, and 128 microbial genomes.
CONCLUSIONS: This investigation presents and implements an efficient computational approach to accumulate the occurrences of oligonucleotides or patterns in complete genomes. A database is established to maintain information about the sequence features, including the distributions of oligonucleotides, the gene distribution, the distribution of repetitive elements in genomes and the occurrences of the oligonucleotides. The database provides a more effective and efficient way to access the repetitive features in genomes.

Year:  2004        PMID: 15473908      PMCID: PMC526275          DOI: 10.1186/1471-2164-5-78

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

During the last decade, many genomes have been completely sequenced. Summarized information about the oligonucleotides in genomes helps biologists who are interested in the evolution and growth of genomes to work on comparative genomics, oligonucleotide probe design, primer design and the analysis of genomic repetitive features. Computing the occurrences and the frequencies of all oligonucleotides in a complete genome is laborious and time-consuming, especially when the genome is very large, as with the human and mouse genomes. A database that summarizes the occurrences and the frequencies of oligonucleotides in complete genomes can facilitate the biological and statistical analyses of genomes. The contents of the database can be used in many biological applications, such as comparative genomics and evolutionary analyses [1,2], the prediction of regulatory sequences by detecting over-represented oligonucleotides [3-8] and primer/probe design based on the uniqueness of oligonucleotides [9]. Table 1 shows the biological applications of the database entries. The entries in the database are divided into two types, namely, the occurrence positions of oligonucleotides and the frequencies of oligonucleotides. The oligonucleotide occurrences and frequencies are summarized for both the coding regions and the non-coding regions.
For instance, this information can be used in statistical analyses to study the over-representation of regulatory sequences in the upstream promoter regions of genes. van Helden et al. systematically searched the promoter regions of potentially co-regulated genes for over-represented oligonucleotides, which may be transcription factor binding sites [3]. They presented a simple and fast method for isolating DNA binding sites for transcription factors from families of co-regulated genes, illustrating their results using Saccharomyces cerevisiae. Although conceptually simple, the algorithm efficiently extracted the upstream regulatory sequences that had previously been determined experimentally for most of the yeast regulatory families already analyzed. Other studies [4-8,10-12] on the prediction of gene regulatory sequences have also been based on oligonucleotide analysis.
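The idea of detecting over-represented oligonucleotides can be illustrated with a minimal Python sketch. It is not the statistic used by van Helden et al.; it only compares the observed count of each oligonucleotide in a set of upstream sequences with the count expected from a background frequency. The function names and the `background_freq` mapping (oligonucleotide to its relative frequency in a reference region) are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every (overlapping) k-mer in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def over_representation(upstream_seqs, background_freq, k):
    """Return {kmer: (observed, expected, observed/expected)}.

    background_freq is a hypothetical mapping from k-mer to its relative
    frequency in a reference set (for example, all non-coding regions).
    k-mers without a background frequency are skipped.
    """
    observed = kmer_counts(upstream_seqs, k)
    total = sum(observed.values())
    result = {}
    for kmer, obs in observed.items():
        freq = background_freq.get(kmer)
        if not freq:
            continue
        expected = freq * total
        result[kmer] = (obs, expected, obs / expected)
    return result
```

A ratio well above 1 marks a candidate for closer inspection; a real analysis would add a significance test and correct for multiple testing.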
Table 1. Applications and the relevant data in the database.

| Database entries | Entry types | Biological applications |
|---|---|---|
| Oligonucleotide occurrences | Positions | 1. Oligonucleotide analysis for regulatory sequences; 2. Oligonucleotide probe design; 3. Primer design |
| Oligonucleotide frequencies | Counts | 1. Oligonucleotide analysis for regulatory sequences; 2. Evolutionary analysis; 3. Oligonucleotide probe design; 4. Primer design |
| Gene coding regions | Positions | 1. Oligonucleotide analysis for regulatory sequences; 2. Evolutionary analysis; 3. Oligonucleotide probe design; 4. Primer design |
| Repetitive element frequencies (LINE, SINE, Alu, and so on) | Counts | Evolutionary analysis |
| Repetitive element occurrences | Positions | Evolutionary analysis |
| Tandem repeats | Positions | Prediction for genetic disease marker |
Hsieh et al. [1] investigated the oligonucleotide distributions of typical microbial genomes and found that the microbial genomes have the statistical characteristics of a much shorter DNA sequence. This peculiar property supports an evolutionary model in which a genome evolves by random mutation but grows primarily by random segmental duplication. Repetitive elements, including LINEs, SINEs, LTR elements and Alu, can be investigated in evolutionary analysis [2]. It has been estimated that at least 43% of the human genome is occupied by four major classes of interspersed repetitive elements: LINEs, SINEs, LTR elements and DNA transposons [2]. Analysis of these elements has yielded some insights into the evolution of the human genome. The tandem repeats provided in our database can be used for forensic analysis and the study of genetic diseases [13-16]. Another application of the established database is to facilitate the design of primers to amplify specific regions of the genomic sequence. The basic concept is that primer sequences of 15 to 40 bp should be unique in the genome, that is, they should not coincide with the repetitive oligonucleotides recorded in our database. Additionally, our database maintains the repetitive oligonucleotides, which facilitates the design of oligonucleotide probes by allowing the selection of signature oligonucleotides when identifying different organisms using DNA arrays [9]. The user can query oligonucleotides whose lengths exceed a threshold, such as 15 bp, to determine whether they are repetitive. The regions of the target sequences that contain no repetitive oligonucleotides can be used as signatures for the target genomes.
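The uniqueness criterion for primers can be sketched as follows. This is only an illustration under simplifying assumptions: the genome is held in memory as a plain string, only the forward strand is checked, and the names `count_occurrences` and `is_unique_primer` are hypothetical rather than part of the database.

```python
def count_occurrences(genome, oligo):
    """Count exact matches of oligo in genome, including overlapping ones."""
    count = 0
    start = genome.find(oligo)
    while start != -1:
        count += 1
        start = genome.find(oligo, start + 1)
    return count


def is_unique_primer(genome, primer, min_len=15, max_len=40):
    """Return True if the primer has an allowed length and occurs exactly once.

    The 15 to 40 bp defaults follow the range given in the text; they are
    parameters so that short toy sequences can be tested.
    """
    if not (min_len <= len(primer) <= max_len):
        raise ValueError("primer length outside the allowed range")
    return count_occurrences(genome, primer) == 1
```

A candidate that occurs more than once would correspond to a repetitive oligonucleotide and would be rejected; a fuller check would also search the reverse strand.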

Construction and content

Data sources and contents

The proposed database provides information about sequence features generated by exhaustively computing the sequences of the complete genome. The data sources, including the complete genomes and the gene annotation information, are obtained from GenBank [17]. The repetitive elements in the eukaryotic genomes, such as LINEs, SINEs, Alu and LTR, are obtained from Repbase [18]. The database supports a range of complete genomes including human, yeast, worm, and 128 microbial genomes. The Appendix lists the organisms supported in the database [see additional file 1]. The occurrences and the frequencies of oligonucleotides from one to 50 base pairs are generated and accumulated from each complete genome sequence. Inputting the sequence of an oligonucleotide returns its positions. Additionally, both the occurrences and the frequencies of the repetitive elements such as LINEs, SINEs, Alu and LTR are provided by computationally scanning whole genome sequences. The tandem repeats are computationally detected by the Tandem Repeats Finder [19]. The database also provides the gene annotation information. For instance, Table 2 presents the number of occurrences of repetitive oligonucleotides in yeast. The oligonucleotide "ACCCTA" occurs 2,724 times in the yeast genome, 822 times upstream of a gene (-600 to -1 bp, where +1 denotes the translational start position of the gene) and 793 times in the coding regions. The occurrence counts of each oligonucleotide between one and 50 bp in length are stored in the database.
Table 2. Number of occurrences of the repetitive oligonucleotides in yeast genome

| Repetitive oligonucleotide | Genome | Up-streams | Coding region |
|---|---|---|---|
| ACCCTA | 2,724 | 822 | 793 |
| ACCCTC | 2,917 | 881 | 795 |
| AGTACT | 3,073 | 933 | 879 |
| AGTAGA | 6,673 | 1,970 | 1,798 |
| AGTAGC | 4,912 | 1,545 | 1,299 |
| GATACC | 4,829 | 1,638 | 1,005 |
| GATAGA | 7,030 | 2,163 | 1,807 |
| TGGTAA | 10,513 | 3,493 | 2,214 |
| TGTAAA | 11,364 | 3,439 | 3,418 |
| AAGGGGA | 1,172 | 299 | 421 |
| AAGGGGC | 626 | 142 | 256 |
| AGAGTGG | 983 | 310 | 271 |
| AGAGTTA | 1,859 | 610 | 441 |
| CAATCAG | 1,358 | 445 | 320 |
| CAATCCA | 1,895 | 655 | 343 |
| CGTCTCC | 592 | 199 | 148 |
| CGTCTGA | 652 | 196 | 165 |
| ACAAACTA | 594 | 179 | 183 |
| ACAAACTC | 514 | 175 | 112 |
| CACAGAAAC | 146 | 38 | 46 |
| CACAGAAGA | 164 | 57 | 39 |
| ACATATAAAAA | 54 | 9 | 29 |
| ACATATAAAAC | 139 | 34 | 56 |
| ACATATAAAAG | 36 | 7 | 22 |
| ACTTATGTCATC | 57 | 17 | 23 |
| ACTTCTAGTATA | 159 | 44 | 67 |
| ACTTTTTTTTCT | 32 | 5 | 21 |
| ACTTTTTTTTTC | 50 | 6 | 33 |
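Counts of the kind shown in Table 2 (whole genome, upstream and coding region) could be derived from occurrence positions along the following lines. This is a sketch under stated assumptions, not the pipeline used to build the database: genes are given as hypothetical half-open `(start, end)` intervals on the forward strand, the upstream region is the 600 bp preceding the gene start, and an occurrence is assigned to a region only if it lies entirely inside it.

```python
import re


def find_positions(genome, oligo):
    """Return 0-based start positions of all (overlapping) exact matches."""
    pattern = re.compile(f"(?={re.escape(oligo)})")
    return [m.start() for m in pattern.finditer(genome)]


def region_counts(genome, oligo, genes, upstream_len=600):
    """Count occurrences in the genome, upstream of genes and in coding regions.

    genes: list of (start, end) half-open intervals, forward strand only
    (an assumption made for this sketch).
    """
    k = len(oligo)
    positions = find_positions(genome, oligo)

    def inside(pos, start, end):
        return start <= pos and pos + k <= end

    upstream = sum(
        any(inside(p, max(0, s - upstream_len), s) for s, _ in genes)
        for p in positions
    )
    coding = sum(any(inside(p, s, e) for s, e in genes) for p in positions)
    return {"genome": len(positions), "upstream": upstream, "coding": coding}
```

Genes on the reverse strand would need their upstream window placed after the interval instead of before it, which this sketch leaves out.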

Data Generation

Software was implemented to systematically index a complete genome sequence into a suffix array using a perfect-match approach [20]. This index can find only perfect matches for an oligonucleotide. The user can thus use it to find the positions of a designated oligonucleotide in a genome sequence. For each genome, the occurrences of all oligonucleotides shorter than 50 bp can be searched for efficiently. An occurrence is the position of the oligonucleotide in the genome. A frequency is the count of oligonucleotide occurrences in a region. The regions are the complete genome, the coding regions and the non-coding regions. Frequencies of oligonucleotides of different lengths are stored in different flat-files. For example, the two chromosomes of the Vibrio cholerae genome are processed separately to allow the computation of the occurrences of oligonucleotides in each chromosomal sequence.
RepeatMasker [21] and the repetitive element database Repbase [18] are used to search for instances of the repetitive elements in eukaryotic genomes. The Tandem Repeats Finder (TRF) is used to find the tandem repeats in genomes [19]. TRF and RepeatMasker can find instances of repetitive elements with imperfect matches. The settings used for each program are described below. TRF uses seven parameters: the match score, the mismatch score, the indel score, the match probability, the indel probability, the minimum alignment score and the maximum tandem repeat pattern size. The corresponding values used here are 2, 7, 7, 80, 10, 20 and 500. The values are the default suggestions found in the Tandem Repeats Finder documentation. The transposable elements (TEs) are detected by RepeatMasker. TEs in each genome are identified using the complete dataset available from Repbase Update [18]. The sensitivity and the speed of RepeatMasker are left at their default values.
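The perfect-match lookup on a suffix array described in this section can be sketched in Python. The construction below sorts suffixes naively, which is adequate only for short sequences; it illustrates the lookup principle and is not the authors' implementation. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction for illustration; genome-scale indexing needs a
    dedicated linear-time or memory-efficient algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted 0-based positions of exact matches of pattern in text."""
    m = len(pattern)

    # Lower bound: first suffix whose m-length prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-length prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

Because all suffixes sharing a prefix are adjacent in the sorted order, two binary searches delimit every occurrence; the frequency of the oligonucleotide is simply the length of the returned list.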

Utility

Table 3 presents the two output formats: flat-files and the web query interface with a filtering function. In the flat-file format, the fields for each oligonucleotide (in a chromosome) are the sequence, the number of occurrences in the chromosome, the number of occurrences in the coding regions and the number of occurrences in the non-coding regions. Users who require a large amount of such data can download them in this format [1].
Table 3. Output styles of the database.

| Database entries | Entry types | Output formats |
|---|---|---|
| Oligonucleotide occurrences | Positions | Web interface |
| Oligonucleotide frequencies | Counts | Web interface and flat-file |
| Gene coding regions | Positions | Web interface |
| Repetitive element frequencies (LINE, SINE, Alu, and so on) | Counts | Web interface and flat-file |
| Repetitive element occurrences | Positions | Web interface |
| Tandem repeats | Positions | Web interface |
The web interface enables users to query the occurrences of an oligonucleotide in a genome and the number of occurrences in each chromosome. Figure 1 shows this approach. The occurrences of the repetitive elements and the tandem repeats in the established database can also be queried via the web interface, as in the example given in Figure 2. Figure 3 depicts the flat-file format of oligonucleotide frequencies. The first row in the flat-file presents the basic information for the oligonucleotide frequencies; its fields are the chromosome sequence/NCBI accession number, the length of the chromosome sequence, the size of the coding regions, the size of the non-coding regions, the length of the oligonucleotides and the minimum number of copies of the oligonucleotide. The directories labeled C10 hold the files that contain the counts of oligonucleotides with at least ten occurrences in the genome. Each file name includes the sequence/NCBI accession number, the length of the oligonucleotides and the minimum number of occurrences of the oligonucleotides. For example, "NC 000913 L30 C10" holds the oligonucleotides that are 30 nucleotides in length and have at least 10 occurrences in the genome.
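The file-naming convention can be decoded with a short Python sketch. The exact separator used in the real file names is not recoverable from the text, so the pattern below accepts either spaces or underscores between fields; that choice, the `FrequencyFileLabel` type and the `parse_label` function are assumptions made for illustration. The accession is normalized to the underscore form used by NCBI RefSeq.

```python
import re
from typing import NamedTuple


class FrequencyFileLabel(NamedTuple):
    accession: str     # NCBI accession number, e.g. "NC_000913"
    oligo_length: int  # length of the oligonucleotides (the "L" field)
    min_count: int     # minimum number of occurrences (the "C" field)


# Assumed layout: <prefix><sep><digits><sep>L<length><sep>C<count>,
# where <sep> is a space or an underscore.
_LABEL = re.compile(r"^([A-Z]{2})[ _](\d+)[ _]L(\d+)[ _]C(\d+)$")


def parse_label(label):
    """Split a label such as "NC 000913 L30 C10" into its three fields."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    prefix, number, length, count = match.groups()
    return FrequencyFileLabel(f"{prefix}_{number}", int(length), int(count))
```

With the example from the text, `parse_label("NC 000913 L30 C10")` yields the accession `NC_000913`, an oligonucleotide length of 30 and a minimum count of 10.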
Figure 1. Web query interface (1/2).

Figure 2. Web query interface (2/2).

Figure 3. Database entries in flat-file format.

Figure 4 shows the web interface for the occurrences of a specific oligonucleotide. The user submits the query oligonucleotide and selects a particular species; the positions of the oligonucleotide are then shown. The first line of the output shows the data submitted by the user, followed by the positions of the oligonucleotide on the forward strand and then the same information for the reverse strand.
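Reporting positions on both strands can be sketched by searching for the oligonucleotide and for its reverse complement. This is an illustrative Python sketch, not the Oligos Locator implementation; the function names are hypothetical, and reverse-strand hits are reported here as forward-strand coordinates of the leftmost matching base, which is an assumption about the coordinate convention.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Return 0-based start positions of all (overlapping) exact matches."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def locate_on_both_strands(genome, oligo):
    """Return exact-match positions on the forward and reverse strands."""
    genome = genome.upper()
    oligo = oligo.upper()
    return {
        "forward": find_all(genome, oligo),
        "reverse": find_all(genome, reverse_complement(oligo)),
    }
```

For a palindromic oligonucleotide the two lists coincide, since the sequence equals its own reverse complement.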
Figure 4. The occurrence positions of the oligonucleotide are found by Oligos Locator.

Conclusions

We have constructed a database of both the oligonucleotide occurrence locations and their frequencies in the coding and the non-coding regions of complete genomes. The data in flat-file format can be downloaded directly for further analyses in several biological applications. The user may also use the web interface to query and access the database contents. The database also provides a filtering function for retrieving information about oligonucleotides under search conditions specified by the user. Furthermore, the database provides the occurrences and the frequencies of other repetitive elements in genomes, such as LINEs, SINEs, Alu and tandem repeats.

Availability and requirements

The database is now available at

Authors' contributions

FML implemented the software and refined the system. HDH conceived of the study and drafted the manuscript. YCC and JTH participated in the design and coordination. All authors read and approved the final manuscript.

Additional File 1

Appendix listing the organisms supported in the database.

1.  Identification of hallucinogenic fungi from the genera Psilocybe and Panaeolus by amplified fragment length polymorphism.

Authors:  J C Lee; M Cole; A Linacre
Journal:  Electrophoresis       Date:  2000-05       Impact factor: 3.535

2.  A web site for the computational analysis of yeast regulatory sequences.

Authors:  J van Helden; B André; J Collado-Vides
Journal:  Yeast       Date:  2000-01-30       Impact factor: 3.239

3.  Evolutionary analyses of the human genome.

Authors:  W H Li; Z Gu; H Wang; A Nekrutenko
Journal:  Nature       Date:  2001-02-15       Impact factor: 49.962

4.  Identification of members of the genera Panaeolus and Psilocybe by a DNA test. A preliminary test for hallucinogenic fungi.

Authors:  J C Lee; M Cole; A Linacre
Journal:  Forensic Sci Int       Date:  2000-08-14       Impact factor: 2.395

5.  Repbase update: a database and an electronic journal of repetitive elements.

Authors:  J Jurka
Journal:  Trends Genet       Date:  2000-09       Impact factor: 11.639

6.  GenBank.

Authors:  Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; David L Wheeler
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

7.  An untranslated CTG expansion causes a novel form of spinocerebellar ataxia (SCA8)

Authors:  M D Koob; M L Moseley; L J Schut; K A Benzow; T D Bird; J W Day; L P Ranum
Journal:  Nat Genet       Date:  1999-04       Impact factor: 38.330

8.  Selecting signature oligonucleotides to identify organisms using DNA arrays.

Authors:  Lars Kaderali; Alexander Schliep
Journal:  Bioinformatics       Date:  2002-10       Impact factor: 6.937

9.  Enrichment of regulatory signals in conserved non-coding genomic sequence.

Authors:  S Levy; S Hannenhalli; C Workman
Journal:  Bioinformatics       Date:  2001-10       Impact factor: 6.937

10.  The repetitive sequence database and mining putative regulatory elements in gene promoter regions.

Authors:  Jorng-Tzong Horng; Hsien-Da Huang; Ming-Hui Jin; Li-Cheng Wu; Shir-Ly Huang
Journal:  J Comput Biol       Date:  2002       Impact factor: 1.479
