
The Diatom EST Database.

Uma Maheswari, Anton Montsant, Johannes Goll, S Krishnasamy, K R Rajyashri, Villoo Morawala Patell, Chris Bowler.

Abstract

The Diatom EST database provides integrated access to expressed sequence tag (EST) data from two eukaryotic microalgae of the class Bacillariophyceae, Phaeodactylum tricornutum and Thalassiosira pseudonana. The database currently contains sequences of close to 30,000 ESTs organized into PtDB, the P.tricornutum EST database, and TpDB, the T.pseudonana EST database. The EST sequences were clustered and assembled into a non-redundant set for each organism, and these non-redundant sequences were then subjected to automated annotation using similarity searches against protein and domain databases. EST sequences, clusters of contiguous sequences, their annotation and analysis with reference to the publicly available databases, and a codon usage table derived from a subset of sequences from PtDB and TpDB can all be accessed in the Diatom EST Database. The underlying RDBMS enables queries over the raw and annotated EST data and retrieval of information through a user-friendly web interface, with options to perform keyword and BLAST searches. The EST data can also be retrieved based on the Pfam domains, Clusters of Orthologous Groups (COG) and Gene Ontology (GO) terms assigned to them by similarity searches. The database is available at http://avesthagen.sznbowler.com.

Year: 2005 PMID: 15608213 PMCID: PMC540075 DOI: 10.1093/nar/gki121

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Diatoms (Bacillariophyceae) are brown algae with a wide distribution and abundance in the world's water bodies, and are thought to be responsible for around one-fifth of global primary productivity. Being such important players in the global ecosystem, their ecology and physiology have been the focus of research for decades. More recently, the intricate siliceous bioarchitecture of diatom cell walls has attracted the interest of nanotechnologists. Understanding the information within diatom genomes is therefore likely to lead to dissection of the molecular mechanisms controlling bioinorganic pattern formation in these organisms and is fundamental for understanding their ecological success (1,2). As part of a general effort to study diatom biology at a molecular level, large-scale sequencing projects are being undertaken (2,3) (http://genomic.jpi-psf.org/thaps1.home.html). This rapidly growing body of sequence information requires accurate gene annotation as well as dedicated platforms for storage, processing and curation, and must be available for immediate data retrieval at any time.

CONSTRUCTION OF THE DATABASE

Raw data and core analyses

PtDB contains expressed sequence tags (ESTs) derived from Phaeodactylum tricornutum Bohlin clone CCMP632 (Provasoli-Guillard National Center for Culture of Marine Phytoplankton, Bigelow, ME). The RNA used for cDNA generation was isolated from exponentially growing cells (2). The cDNA library was created in a Uni-Zap XR vector (Stratagene) using oligo dT primers and directionally inserted into EcoRI–XhoI sites of pBluescript. The 5′ end sequences (12 136 in total) were generated using the T3 primer. PTSS0001–PTSS0997 have been described previously (2); PTAM00001–PTAM01131 were generated by MWG Biotech (Ebersberg, Germany) and PTMM00001–PTMM10008 were obtained from Avesthagen (Bangalore, India). TpDB contains ESTs derived from Thalassiosira pseudonana clone CCMP1335 (Provasoli-Guillard National Center for Culture of Marine Phytoplankton), from an exponentially growing culture in ASW medium. The cDNA library was created in the pZERO-2 vector (Invitrogen) using oligo dT primers and was not directionally inserted. A total of 6500 clones were sequenced from both ends, and the reads were denoted with a .x or .y extension in the clone ID according to the direction of sequencing. Poor-quality runs were repeated in some cases, giving rise to extensions such as .x2 and .x3, until 15 174 sequences had been obtained. Prior to annotation, the sequences were subjected to quality checking and vector clipping using the Trimest, Trimseq and Vectorstrip programs of EMBOSS (European Molecular Biology Open Software Suite). The vector sequences were supplied interactively to Vectorstrip, and vector matches with a mismatch level of up to 10% were detected and removed. Because the T.pseudonana ESTs were generated from both ends, the two reads of a clone were merged whenever they overlapped, which occurred for 1056 pairs of ESTs, and the consensus sequence rather than the individual ESTs was used for assembly. Such complete cDNA sequences are labelled with the same ID as the individual ESTs, but without any extension.
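
The read-naming convention described above can be illustrated with a short Python sketch. Only the .x/.y extensions, the numeric suffixes for repeated runs and the extension-free consensus IDs come from the text; the clone names (`tp0001` and so on) and the helpers `parse_read_id` and `group_by_clone` are hypothetical.

```python
import re
from collections import defaultdict

# "<clone>.x", "<clone>.y", "<clone>.x2", ... ; a bare "<clone>" is a consensus ID.
# The clone part is assumed to contain no dot (an assumption for this sketch).
READ_ID = re.compile(r"^(?P<clone>[^.]+)(?:\.(?P<end>[xy])(?P<run>\d*))?$")


def parse_read_id(read_id):
    """Split an ID into (clone, end, run).

    end is 'x' or 'y' for a single read and None for a consensus sequence;
    run is 1 for the first attempt, n for a '.xn'/'.yn' repeat, None for a consensus.
    """
    match = READ_ID.match(read_id)
    if match is None:
        raise ValueError(f"unrecognised read ID: {read_id!r}")
    end = match.group("end")
    if end is None:
        return match.group("clone"), None, None
    run = int(match.group("run")) if match.group("run") else 1
    return match.group("clone"), end, run


def group_by_clone(read_ids):
    """Collect the (end, run) pairs observed for each clone."""
    groups = defaultdict(list)
    for read_id in read_ids:
        clone, end, run = parse_read_id(read_id)
        groups[clone].append((end, run))
    return dict(groups)
```

Grouping reads this way makes it easy to find clones for which both a .x and a .y read exist, which are the candidates for merging into a consensus sequence.
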
All sequences longer than 50 nt were then subjected to clustering using the Contig Assembly Program (CAP3) (4) to detect sequence redundancy. Sequences with >95% identity over a region longer than 30 nt were clustered, yielding 1243 contig assemblies for P.tricornutum and 832 contigs for T.pseudonana. Contigs were given a unique contig ID consisting of a prefix C and a four-digit number, assigned in descending order of number of ESTs in each contig. This helps to organize the ESTs based on the level of redundancy. The longest sequence from each assembly was then selected and pooled with the singletons (i.e. ESTs that did not fall into any cluster) to form the non-redundant set. PtDB contains 5108 non-redundant sequences and TpDB contains 5444 (Table 1). These sequences were then subjected to automated annotation, which comprised searches against the NCBI (5) non-redundant protein database using BLASTX and against the protein domain databases CDD (6) and COG using RPS-BLAST (Figure 1). The results of all similarity searches were parsed and stored in MySQL tables.
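
The construction of the non-redundant set can be sketched in Python as follows. This is an illustration only, not the pipeline used for the database: the clustering itself (CAP3) is taken as given, and the function names, the input layout, the reading of "longest sequence" as the longest member of each cluster, and the choice to start numbering at C0001 are all assumptions.

```python
def assign_contig_ids(clusters):
    """Label clusters C0001, C0002, ... from most to fewest member ESTs.

    clusters is a list of clusters, each a list of (est_id, sequence) tuples.
    sorted() is stable, so clusters of equal size keep their input order.
    """
    ranked = sorted(clusters, key=len, reverse=True)
    return {f"C{rank:04d}": members for rank, members in enumerate(ranked, start=1)}


def nonredundant_set(clusters, singletons):
    """Return one representative per contig plus all singletons.

    The representative is the longest member sequence of each cluster.
    """
    contigs = assign_contig_ids(clusters)
    representatives = {
        contig_id: max(members, key=lambda record: len(record[1]))
        for contig_id, members in contigs.items()
    }
    return representatives, list(singletons)
```

The size of the resulting set is the number of contigs plus the number of singletons. With the counts reported in Table 1 this gives 1243 + 3865 = 5108 sequences for PtDB and 832 + 4612 = 5444 for TpDB.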

Table 1.

Number of sequences in PtDB and TpDB

Sequence type	Number in PtDB	Number in TpDB
Raw ESTs	12 136	15 174
Contigs	1243	832
Singletons	3865	4612
Non-redundant set	5108	5444

Figure 1

EST analysis overview.

Expressed sequence tag analysis—full-length clones and function assignments

In order to identify putative full-length sequences for function assignment, the EST sequences were grouped into nine alignment classes (Figure 2) based on the subject coverage (CovS) and identities of the BLASTX results. The subject coverage was calculated as follows (7):
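
The formula itself is not reproduced in this copy. From the definitions of Hlen and Slen given with it, CovS is presumably the ratio of the two; the factor of 100 below is an assumption, made because coverage is used alongside percentage identity:

```latex
\mathrm{CovS} = \frac{H_{\mathrm{len}}}{S_{\mathrm{len}}} \times 100
```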

Figure 2

Classification of non-redundant sequences based on BLASTX alignment coverage and identity percentages.

where Hlen is defined as the length of the HSPs (high-scoring segment pairs) and Slen is defined as the subject sequence length. CovS is an indicator of the extent to which the query sequence matches the target protein sequence. For sequences falling in the F-1, F-2, M-1 and M-2 alignment classes (see Figure 2), the protein coding frame and the putative function were assigned based on the BLASTX description. For this subset of ESTs, the start and stop codons and the assigned frame are highlighted in the six-frame translation output so that the user can detect complete open reading frames easily. The F-1 and F-2 categories were considered to comprise putative full-length clones. These sequences were aligned using CLUSTAL W (8) with their corresponding 10 most similar GenBank protein database entries obtained from the BLASTX results. The alignment output is linked to the web interface for quick reference. We used RPS (Reverse PSI) BLAST (6) to identify the COG (Clusters of Orthologous Groups) (9) to which each sequence in the non-redundant collection could be assigned. This allows the sequences to be classified into one of the groups shown in Figure 3.
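
The six-frame translation mentioned above can be sketched in a few lines of Python. The sketch assumes the standard genetic code (NCBI translation table 1) and uses hypothetical function names; it is not the code behind the web interface, and it locates start and stop positions instead of rendering highlights.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases in TCAG order).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string; characters other than ACGT are kept as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from position 0; codons with non-ACGT characters become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations keyed by frame: 1, 2, 3 (forward) and -1, -2, -3 (reverse)."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rev[offset:])
    return frames


def first_orf(protein):
    """Return (start, stop) indices of the first 'M' and the next '*', or None."""
    start = protein.find("M")
    if start == -1:
        return None
    stop = protein.find("*", start)
    return (start, stop) if stop != -1 else None
```

A frame in which `first_orf` finds both a start and a downstream stop is a candidate for a complete open reading frame, which is the case the highlighting is meant to make visible.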

Figure 3

Classification of the non-redundant set into COG functional categories.

Furthermore, to support the functional assignments and classifications, a motif search was made among all non-redundant sequences using RPS BLAST against the Pfam database (10). Motifs were assigned to ESTs in which a Pfam domain was detected with an E-value < 0.05. The corresponding Gene Ontology (GO) description (11) was also assigned to the non-redundant sequences based on the dbxref table in the GO database (MySQL format). Codon usage tables were created for each organism using the subsets of non-redundant ESTs falling in the alignment classes F-1, F-2, M-1 and M-2 (which amount to 859 sequences for P.tricornutum and 465 for T.pseudonana). The coordinates delimiting the coding region of these ESTs were obtained from the BLASTX output and the codon usage table was created using the Cusp program of EMBOSS.
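
The idea behind the codon usage table can be sketched as follows. This is not the Cusp program itself but a minimal Python approximation of what such a tool tabulates; the function names are hypothetical, and the coordinates are assumed to be 1-based, inclusive, on the forward strand and already in frame.

```python
from collections import Counter


def extract_region(seq, start, end):
    """Return seq[start..end] for 1-based, inclusive coordinates."""
    return seq[start - 1:end]


def count_codons(coding_regions):
    """Count complete, unambiguous codons over a collection of coding regions."""
    counts = Counter()
    for cds in coding_regions:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons containing N etc.
                counts[codon] += 1
    return counts


def per_thousand(counts):
    """Convert raw codon counts into frequencies per 1000 codons."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}
```

Pooling the coding regions of all ESTs in the selected alignment classes before counting gives one table per organism, as described above.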

Database architecture

The Diatom EST database runs on Red Hat Linux 9.0 and was developed with MySQL 4.0 as the backend and a web interface written in PHP4. BioPerl and Perl scripts were used to parse the data and load them into the database (Figure 1).

SEARCHING THE DATABASE

The database can be accessed through a web interface, and querying can be done using the View and Search options. The View option lists the raw ESTs, contigs, singletons and non-redundant sequences. The ESTs are also listed based on their COG and GO assignments. Search options include simple searches by organism name, keyword, accession number or sequence ID. BLAST searches (BLASTN, TBLASTN and TBLASTX) can also be performed against the Diatom EST Databases (PtDB, TpDB or both). An Advanced Search option provides additional possibilities such as the use of Boolean terms (AND, OR and NOT) to search for a keyword/organism pair, defined alignment class, subject coverage (CovS), percentage identity and E-value. The search output contains information about the EST and its contig and functional annotations, sorted by E-value and sequence ID.
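
An Advanced Search of the kind described above amounts to a filtered and sorted SQL query. The Python sketch below uses an in-memory SQLite database as a stand-in for the MySQL backend; the table name `annotation`, its columns, the demo sequence IDs and all values in the demo rows are invented for illustration and do not reflect the actual schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a few made-up annotation rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               seq_id TEXT PRIMARY KEY, organism TEXT, description TEXT,
               align_class TEXT, covs REAL, identity REAL, evalue REAL)"""
    )
    rows = [  # hypothetical records, not taken from PtDB or TpDB
        ("demo_003", "P. tricornutum", "hypothetical kinase", "F-1", 95.0, 80.0, 1e-50),
        ("demo_001", "P. tricornutum", "putative kinase", "M-1", 60.0, 55.0, 1e-50),
        ("demo_002", "P. tricornutum", "kinase-like protein", "F-2", 90.0, 40.0, 1e-20),
        ("demo_004", "T. pseudonana", "protein kinase", "F-1", 98.0, 85.0, 1e-80),
    ]
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def advanced_search(conn, keyword, organism, align_classes,
                    min_covs, min_identity, max_evalue):
    """Combine all criteria with AND and sort by E-value, then sequence ID."""
    placeholders = ", ".join("?" for _ in align_classes)
    sql = f"""SELECT seq_id, description, evalue FROM annotation
              WHERE description LIKE ? AND organism = ?
                AND align_class IN ({placeholders})
                AND covs >= ? AND identity >= ? AND evalue <= ?
              ORDER BY evalue, seq_id"""
    params = [f"%{keyword}%", organism, *align_classes,
              min_covs, min_identity, max_evalue]
    return conn.execute(sql, params).fetchall()
```

All user-supplied values are passed as bound parameters and never interpolated into the SQL string; only the number of placeholders for the alignment classes is generated dynamically.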

FUTURE DIRECTIONS

In the future, we hope that the Diatom EST Database will incorporate data from additional species as EST and genome-sequencing projects for diatoms (and other algae) are performed. Orthology will be assigned according to eukaryotic orthologous groups (KOGs) as soon as they are made available at the NCBI. Apart from the sequence and related data within the currently available database structure, gene expression data, including from microarray studies, could also be included. The database could also be integrated with a genome browser, where available, and enhanced functional annotation could be mined from other cluster and pathway databases. The server will be periodically upgraded for faster access to the growing body of data.

AVAILABILITY

The Diatom EST Database is freely available on the web at http://avesthagen.sznbowler.com. The P.tricornutum ESTs have been submitted to the NCBI dbEST (GenBank accession numbers CD374840–CD384835 and BI306757–BI307753). Requests for bulk queries and to house EST data from other diatoms should be addressed to C. Bowler.

11 in total

1. CAP3: A DNA sequence assembly program.

Authors: X Huang; A Madan
Journal: Genome Res Date: 1999-09 Impact factor: 9.043

2. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

3. CDD: a curated Entrez database of conserved domain alignments.

Authors: Aron Marchler-Bauer; John B Anderson; Carol DeWeese-Scott; Natalie D Fedorova; Lewis Y Geer; Siqian He; David I Hurwitz; John D Jackson; Aviva R Jacobs; Christopher J Lanczycki; Cynthia A Liebert; Chunlei Liu; Thomas Madej; Gabriele H Marchler; Raja Mazumder; Anastasia N Nikolskaya; Anna R Panchenko; Bachoti S Rao; Benjamin A Shoemaker; Vahan Simonyan; James S Song; Paul A Thiessen; Sona Vasudevan; Yanli Wang; Roxanne A Yamashita; Jodie J Yin; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

Review 4. Revealing the molecular secrets of marine diatoms.

Authors: Angela Falciatore; Chris Bowler
Journal: Annu Rev Plant Biol Date: 2002 Impact factor: 26.379

5. The genome of the diatom Thalassiosira pseudonana: ecology, evolution, and metabolism.

Authors: E Virginia Armbrust; John A Berges; Chris Bowler; Beverley R Green; Diego Martinez; Nicholas H Putnam; Shiguo Zhou; Andrew E Allen; Kirk E Apt; Michael Bechner; Mark A Brzezinski; Balbir K Chaal; Anthony Chiovitti; Aubrey K Davis; Mark S Demarest; J Chris Detter; Tijana Glavina; David Goodstein; Masood Z Hadi; Uffe Hellsten; Mark Hildebrand; Bethany D Jenkins; Jerzy Jurka; Vladimir V Kapitonov; Nils Kröger; Winnie W Y Lau; Todd W Lane; Frank W Larimer; J Casey Lippmeier; Susan Lucas; Mónica Medina; Anton Montsant; Miroslav Obornik; Micaela Schnitzler Parker; Brian Palenik; Gregory J Pazour; Paul M Richardson; Tatiana A Rynearson; Mak A Saito; David C Schwartz; Kimberlee Thamatrakoln; Klaus Valentin; Assaf Vardi; Frances P Wilkerson; Daniel S Rokhsar
Journal: Science Date: 2004-10-01 Impact factor: 47.728

6. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

7. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors: J D Thompson; D G Higgins; T J Gibson
Journal: Nucleic Acids Res Date: 1994-11-11 Impact factor: 16.971

8. The COG database: new developments in phylogenetic classification of proteins from complete genomes.

Authors: R L Tatusov; D A Natale; I V Garkavtsev; T A Tatusova; U T Shankavaram; B S Rao; B Kiryutin; M Y Galperin; N D Fedorova; E V Koonin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

9. Genome properties of the diatom Phaeodactylum tricornutum.

Authors: Simona Scala; Nicolas Carels; Angela Falciatore; Maria Luisa Chiusano; Chris Bowler
Journal: Plant Physiol Date: 2002-07 Impact factor: 8.340

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
