
DDBJ with new system and face.

H Sugawara, O Ogasawara, K Okubo, T Gojobori, Y Tateno.

Abstract

DDBJ (http://www.ddbj.nig.ac.jp) collected and released 1 880 115 entries or 1 134 086 245 bases in the period from July 2006 to June 2007. The released data include, among others, the high-throughput cDNAs of cricket and the high-quality draft genome of medaka. Our computer system was upgraded in March 2007. Another new aspect is an efficient data retrieval tool that has recently been installed and made available at DDBJ. It is called All-round Retrieval for Sequence and Annotation (ARSA), and it enables the user to search for keywords in the Feature/Qualifier of the International Nucleotide Sequence Database Collaboration (http://www.insdc.org/) as well. We will also replace our home page with a more efficient one by the end of 2007.

Year: 2007 PMID: 17962300 PMCID: PMC2238829 DOI: 10.1093/nar/gkm889

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Through our service we have witnessed dramatic advances in biology and related areas over the past 20 or more years. For example, using genome sequence data for eubacteria, archaebacteria and eukaryotes, some authors constructed a tree of life, that is, a phylogenetic tree of the three super-kingdoms (1,2). Others reported a way to predict the number of genes, at least in the bacterial world (3). These advances support our simple idea that the more data we collect and serve, the more people make use of them for various purposes. On the other hand, the recent development of sequencing machines such as 454 (by 454 Life Sciences), Solexa (by Illumina, Inc.) and SOLiD (by Applied Biosystems) also worries us. According to one estimate, 5–10 terabases will be sequenced per month by Solexa at a single sequencing facility in the near future. With further development of sequencing technology, the whole genomes of individual persons may be submitted repeatedly in the near future, as a few examples already suggest (4). To cope with this expected growth in the sequencing of genes and genomes, we have recently upgraded our computer system and installed an efficient keyword search tool. We think that the new computer system and tool will serve our data submitters and users better and make our job more effective and efficient. In this article we report on the data submissions to DDBJ in the past year, the replacement of our computer system with an upgraded one, a new data retrieval tool and a new home page.

DATA SUBMISSIONS TO DDBJ IN THE LAST YEAR

In the period from July 2006 to June 2007, DDBJ collected and released original data amounting to 1 880 115 entries or 1 134 086 245 bases, which were classified into the 19 International Nucleotide Sequence Database Collaboration (INSDC) divisions (5). More than 90% of the submissions came from Japanese researchers, and the rest were mainly from Chinese and Korean researchers. The released data include the high-throughput cDNAs (HTC) of cricket, Gryllus bimaculatus, submitted from Tokushima University (6). These data amount to 32 010 entries, which can be obtained through anonymous FTP under the file name Gryllus_bimaculatus_HTC_070726_1.seq.gz. Also included is 700 Mb of the high-quality draft genome data of medaka, Oryzias latipes, which was submitted from the University of Tokyo and the National Institute of Genetics (7). The data were carefully assembled and upgraded from the WGS data that were reported in our previous paper (8). The assigned accession numbers are BAAF03000000 (Hd-rR, version 0.9), BAAF04000000 (Hd-rR, version 1.0), BAAE01000000 (HNI) and ACAAA0000001-ACAAA0356693 (5′ SAGE tags). Although draft genome sequences of two fugu (blowfish) species are available, the high-quality draft genome of medaka will be quite useful, particularly for the study of vertebrate evolution. The submitters of the genome data reported, for example, that the medaka genome has preserved its ancestral karyotype for more than 300 million years (7). We also note that the complete bacterial genome data repository at DDBJ, the Genome Information Broker (GIB, http://gib.genes.nig.ac.jp/) (9), currently holds 569 bacterial species/strains, and that this number keeps growing rapidly. The species added in the past year include Methanococcus maripaludis (by Joint Genome Institute), Saccharopolyspora erythraea (by University of Cambridge), Francisella tularensis subsp. tularensis (by UT Southwestern Medical Center), Desulfotomaculum reducens (by Joint Genome Institute), Burkholderia vietnamiensis (by Joint Genome Institute), Herminiimonas arsenicoxydans (by Genoscope), Geobacillus thermodenitrificans (by Nankai University), Corynebacterium glutamicum (by RITE) and many others. We also serve a complete virus genome data repository, GIB for Viruses (GIB-V, http://gib-v.genes.nig.ac.jp/), which now contains 31 486 virus genomes and genomic segments.

NEW COMPUTER SYSTEM

In July 2007, we celebrated the 20th anniversary of our first public release of DNA data. That first release, in July 1987, contained only 66 entries or 108 970 bases, which were typed in from published papers. The contrast with the corresponding numbers as of June 2007, 13 371 690 entries or 8 988 178 758 bases, is striking. This tremendous increase perhaps reflects the remarkable advancement of research in biology and related areas in Japan over the past 20 years. The ever-increasing amount of data also makes us worry about our hardware and software facilities. In March 2007, we completely replaced our computer system with an upgraded one. The major improvements are as follows: (i) an increase in the number of entries processed in making the flat files, from 300 000 to 1 000 000 entries/day; (ii) a decrease in the processing time for making a huge flat file, in the case of four rice chromosomes from 110 to 13 min; (iii) a decrease in the processing time, from 120 to 13 min, for updating the live-list, which lists the accession numbers and public release dates of the released entries and is updated weekly to exchange information about the currently released data with the EMBL Bank and GenBank; (iv) an increase in the number of ESTs processed, from 40 000 to 800 000 entries/h; and (v) a 1.5-fold increase in the number of queries accepted at once. We therefore expect to be able to cope with the increase in the number of data submissions for the next several years.
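To put the before/after figures quoted above on a common scale, the following minimal sketch computes an improvement factor for items (i) to (iv). The labels, the `improvement_factor` helper and the "higher is better" flags are illustrative assumptions and not part of the DDBJ system; only the numbers come from the text.

```python
# Before/after figures quoted in the text for the March 2007 upgrade.
# Each tuple: (label, before, after, higher_is_better).
# The labels and the boolean flag are illustrative, not DDBJ terminology.
UPGRADES = [
    ("flat-file entries per day", 300_000, 1_000_000, True),
    ("rice flat-file build (min)", 110, 13, False),
    ("live-list update (min)", 120, 13, False),
    ("ESTs processed per hour", 40_000, 800_000, True),
]


def improvement_factor(before, after, higher_is_better):
    """Return how many times better the new figure is than the old one."""
    return after / before if higher_is_better else before / after


factors = {
    label: improvement_factor(before, after, higher)
    for label, before, after, higher in UPGRADES
}

for label, factor in factors.items():
    print(f"{label}: {factor:.1f}x")
```

Under these assumptions the factors come out at roughly 3.3, 8.5, 9.2 and 20 for items (i) to (iv), alongside the 1.5-fold increase stated directly for item (v).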

NEW KEYWORD SEARCH TOOL

Recently, we have installed a high-speed keyword search tool, All-round Retrieval for Sequence and Annotation (ARSA, http://arsa.ddbj.nig.ac.jp/top-e.html). The search logic behind ARSA is called SIGMA, which was invented by Arikawa and his colleagues (10,11). For a given query, SIGMA makes it possible to retrieve all the matching entries by checking the contents of a database just once, no matter how complicated the query is. This one-time checking makes keyword search fast. SIGMA does not need an index file, which means that a search can be made against the currently available data. SIGMA is implemented on the Shunsaku search engine developed by Fujitsu. The search engine operates in parallel on divided data, which makes the search even faster. ARSA also scales well with an increasing amount of data. In theory, one search can be completed within 10 s irrespective of the data size and the query formula. If the data grow to more than 10 times the current amount, however, we may have to increase the number of units in Shunsaku accordingly to keep the present search speed. ARSA covers 23 databases including DDBJ, UniProt, PFAM, PDB and LENZYME. A special feature of ARSA is that it can also incorporate the terms defined by the Feature/Qualifier of INSDC. This feature is very helpful to us in annotating the submitted data, and it also enables our users to retrieve data by using terms in the Feature/Qualifier. For example, you can search for CDSs (protein coding sequences) located on the human Y chromosome, as shown in Figure 1. In the figure, the query formula is given at the top, and a part of the hit entries is given below with the accession numbers. By clicking one of the numbers you can see the contents of that entry. HUM in the last column stands for the human division.
You can download the search result in Flat File, FASTA or XML format. You can also choose which items of the search results are displayed on the screen and download them directly in tab-delimited format. We also provide a WebAPI (http://xml.nig.ac.jp/) (12) so that you can customize ARSA by writing a program in Perl or Java. We will soon include KEGG (http://www.genome.ad.jp/kegg/) in ARSA and make the 24 databases simultaneously retrievable for common keywords.

Figure 1.

Example of keyword search by ARSA. Keywords used are ‘source’ (Feature), ‘chromosome’ (Qualifier belonging to ‘source’) and CDS (Feature). ‘Chromosome’ has a value attribute to which ‘Y’ is given for specifying chromosome Y.
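The kind of query shown in Figure 1 can be illustrated with a toy sketch: every entry is examined exactly once, no index is built, and the data are divided into parts that are scanned in parallel. This is not SIGMA, Shunsaku or the ARSA WebAPI. The entry layout, the accession numbers, the function names (`matches`, `scan`, `search`) and the use of the division field to select human entries are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up entries. Each feature is (feature_key, {qualifier: value}),
# loosely modelled on the INSDC Feature/Qualifier idea; not real records.
ENTRIES = [
    {"accession": "TOY00001", "division": "HUM",
     "features": [("source", {"chromosome": "Y"}), ("CDS", {})]},
    {"accession": "TOY00002", "division": "HUM",
     "features": [("source", {"chromosome": "X"}), ("CDS", {})]},
    {"accession": "TOY00003", "division": "HUM",
     "features": [("source", {"chromosome": "Y"})]},
    {"accession": "TOY00004", "division": "ROD",
     "features": [("source", {"chromosome": "Y"}), ("CDS", {})]},
]


def matches(entry):
    """Evaluate the whole query in one pass over the entry's features."""
    on_y = has_cds = False
    for key, qualifiers in entry["features"]:
        if key == "source" and qualifiers.get("chromosome") == "Y":
            on_y = True
        elif key == "CDS":
            has_cds = True
    # Assumption: "human" is expressed through the HUM division.
    return on_y and has_cds and entry["division"] == "HUM"


def scan(chunk):
    """Scan one part of the data and return the accessions that match."""
    return [entry["accession"] for entry in chunk if matches(entry)]


def search(entries, n_chunks=2):
    """Divide the data, scan the parts in parallel, and merge the hits."""
    chunks = [entries[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        hits = [acc for part in pool.map(scan, chunks) for acc in part]
    return sorted(hits)


hits = search(ENTRIES)
print(hits)
```

Because nothing is precomputed, newly added entries would be searchable immediately, which mirrors the index-free property described for SIGMA; the price is that every search reads all the data, so speed depends on dividing the data across enough parallel units.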

NEW FACE OF DDBJ

We updated our home page (HP) in 2005 (13). We are again in the process of updating it, rather drastically this time. Since the present HP holds a large amount of content that has been added piecemeal without much consideration for consistency, it is no longer very convenient for our data submitters and users. The main aim of the update is thus to make almost all content reachable within three clicks, which is now common practice in HP design. In the new HP, when you click one of our main services (data submission, data retrieval, ftp/SOAP, statistics and inquiry), you get an overview of all the content for that service at once and can easily go to the item you want. The new HP will replace the present one by the end of 2007. We hope that the new HP, together with the new computer system and tool, will be more attractive to our data submitters and users worldwide.

11 in total

1. Genome Information Broker (GIB): data retrieval and comparative analysis system for completed microbial genomes and more.

Authors: Masaki Fumoto; Satoru Miyazaki; Hideaki Sugawara
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

2. Biological SOAP servers and web services provided by the public sequence data bank.

Authors: H Sugawara; S Miyazaki
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

3. Genome trees and the tree of life.

Authors: Yuri I Wolf; Igor B Rogozin; Nick V Grishin; Eugene V Koonin
Journal: Trends Genet Date: 2002-09 Impact factor: 11.639

4. Exploration and grading of possible genes from 183 bacterial strains by a common protocol to identification of new genes: Gene Trek in Prokaryote Space (GTPS).

Authors: Takehide Kosuge; Takashi Abe; Toshihisa Okido; Naoto Tanaka; Masaki Hirahata; Yutaka Maruyama; Jun Mashima; Aki Tomiki; Motoyoshi Kurokawa; Ryutaro Himeno; Satoshi Fukuchi; Satoru Miyazaki; Takashi Gojobori; Yoshio Tateno; Hideaki Sugawara
Journal: DNA Res Date: 2006-12-13 Impact factor: 4.458

5. A tree of life based on protein domain organizations.

Authors: Kaoru Fukami-Kobayashi; Yoshiaki Minezaki; Yoshio Tateno; Ken Nishikawa
Journal: Mol Biol Evol Date: 2007-03-01 Impact factor: 16.240

6. The medaka draft genome and insights into vertebrate genome evolution.

Authors: Masahiro Kasahara; Kiyoshi Naruse; Shin Sasaki; Yoichiro Nakatani; Wei Qu; Budrul Ahsan; Tomoyuki Yamada; Yukinobu Nagayasu; Koichiro Doi; Yasuhiro Kasai; Tomoko Jindo; Daisuke Kobayashi; Atsuko Shimada; Atsushi Toyoda; Yoko Kuroki; Asao Fujiyama; Takashi Sasaki; Atsushi Shimizu; Shuichi Asakawa; Nobuyoshi Shimizu; Shin-Ichi Hashimoto; Jun Yang; Yongjun Lee; Kouji Matsushima; Sumio Sugano; Mitsuru Sakaizumi; Takanori Narita; Kazuko Ohishi; Shinobu Haga; Fumiko Ohta; Hisayo Nomoto; Keiko Nogata; Tomomi Morishita; Tomoko Endo; Tadasu Shin-I; Hiroyuki Takeda; Shinichi Morishita; Yuji Kohara
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

7. DDBJ in preparation for overview of research activities behind data submissions.

Authors: Kousaku Okubo; Hideaki Sugawara; Takashi Gojobori; Yoshio Tateno
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. DDBJ working on evaluation and classification of bacterial genes in INSDC.

Authors: Hideaki Sugawara; Takashi Abe; Takashi Gojobori; Yoshio Tateno
Journal: Nucleic Acids Res Date: 2006-11-15 Impact factor: 16.971

9. The diploid genome sequence of an individual human.

Authors: Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029

10. DDBJ in collaboration with mass-sequencing teams on annotation.

Authors: Y Tateno; N Saitou; K Okubo; H Sugawara; T Gojobori
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
