
TargetIdentifier: a webserver for identifying full-length cDNAs from EST sequences.

Xiang Jia Min, Gregory Butler, Reginald Storms, Adrian Tsang.

Abstract

TargetIdentifier is a webserver that identifies full-length cDNA sequences from expressed sequence tag (EST)-derived contig and singleton data. To accomplish this, TargetIdentifier uses BLASTX alignments as a guide to locate protein coding regions and potential start and stop codons. This information is then used to determine whether the EST-derived sequences include their translation start codons. The algorithm also uses the BLASTX output to assign putative functions to the query sequences. The server is available at https://fungalgenome.concordia.ca/tools/TargetIdentifier.html.

Substances:
Codon
DNA, Complementary

Year: 2005 PMID: 15980559 PMCID: PMC1160197 DOI: 10.1093/nar/gki436

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The generation of expressed sequence tags (ESTs) is a widely recognized gene discovery strategy. Reflecting this, there were 25 556 476 EST entries deposited in GenBank as of dbEST release 020405. Furthermore, The Institute for Genomic Research (TIGR) has initiated the assembly and annotation of virtual transcripts (also called tentative consensus sequences) for 73 species. This TIGR effort relies heavily upon access to the GenBank dbEST database. Two additional efforts are databases of full-length cDNAs for mouse (1) and Arabidopsis (2). EST databases are an important resource for identifying cDNAs that contain complete protein coding regions for studies of gene function. Several computational tools, compared recently by Nadershahi et al. (3), have been developed to identify translation initiation sites and/or coding regions in cDNA-derived sequences. They include NetStart, which uses neural networks (4), ESTScan, which uses a hidden Markov model (5), and ATGpr, which uses a linear discriminant approach (6). These programs can predict the coding regions of cDNAs for which no known orthologues are available. However, since these programs are trained using organism-specific annotated sequences, they have limited value for organisms lacking annotated sequence data. In an attempt to address this issue, ATGpr_sim (7), an updated version of ATGpr, was developed. In addition to relying on annotated data for training, ATGpr_sim also uses similarity information from BLASTX (8). The ATGpr_sim server processes only one sequence per submission; hence it cannot be used to process the large number of sequences produced by EST projects. We developed TargetIdentifier, a webserver that automates the identification of full-length cDNAs within a large number of EST-derived sequences. The TargetIdentifier algorithm uses BLASTX alignments as a guide to identify full-length cDNAs and provide provisional functional assignments (9,10). Hence, TargetIdentifier does not require ‘training’ with previously annotated sequences and is useful in the analysis of sequences encoding proteins for which information about their orthologues is available. We also demonstrated that TargetIdentifier effectively identified start codons and protein coding regions in our own Aspergillus niger EST-derived data and human UniGene data from NCBI.

OVERVIEW OF THE ALGORITHM AND IMPLEMENTATION

Although some polycistronic genes are found in protozoa (11), plants (12) and animals (13), almost all eukaryotic mRNAs are monocistronic. Hence a typical eukaryotic mRNA contains a 5′-untranslated terminal region (5′-UTR), a protein-coding region that begins with a translation start codon (ATG) and ends at a translation stop codon (TAA, TAG or TGA) (14) and a 3′-UTR (Figure 1A).
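The mRNA layout described above can be illustrated with a minimal Python sketch that scans one forward reading frame for the first ATG and the first in-frame stop codon after it. This is an illustration only, not the TargetIdentifier implementation: the function names `codons` and `find_orf` are hypothetical, and the real server uses BLASTX alignments rather than a naive scan to choose the frame and the candidate start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def codons(seq, frame):
    """Yield (offset, codon) pairs for forward reading frame 0, 1 or 2."""
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        yield i, seq[i:i + 3]


def find_orf(seq, frame):
    """Return (start, stop) offsets in the given frame.

    start is the offset of the first ATG; stop is the offset of the first
    in-frame stop codon after that ATG. Either value is None when the
    corresponding codon is absent from the sequenced region.
    """
    start = None
    for i, codon in codons(seq, frame):
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            return start, i
    return start, None
```

For a toy sequence such as `GCCATGAAATTTTAAGGC`, `find_orf(seq, 0)` reports the ATG at offset 3 and the TAA at offset 12; the out-of-frame `TGA` inside `ATGAAA` is ignored because only complete codons in the chosen frame are inspected.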

Figure 1

Categories of algorithm-predicted cDNA clones. (A) A full-length sequence that includes one or more stop codons in the predicted 5′-UTR, a completely sequenced protein coding region and a 3′-UTR. (B) A sequence similar to those described in (A) except that the 3′ end of the ORF region is not sequenced. (C) A sequence having a start codon but lacking a stop codon in the 5′-UTR. Whether the start codon is a potential translation start codon is determined by comparing the BLASTX alignment between the predicted protein and the subject. (D) A sequence having a stop codon in the 5′-UTR but lacking an in-frame start codon. This is an ambiguous sequence. (E) A sequence that includes a coding region but neither a stop codon nor a start codon in the sequenced portion. The length of the low quality sequence removed by Lucy (15) is taken into consideration when predicting whether or not it is a ‘possible full-length’ sequence. Asterisk: stop codon upstream of the start codon (5′ end stop codon); solid circle: predicted translation initiation codon; solid triangle: a stop codon downstream from the start codon (3′ end stop codon); question mark: indicates checking whether a 3′ stop codon exists; (X): the first amino acid in the alignment of the HSP in BLASTX; (M): methionine; (d1): the length of the predicted peptide from a predicted start codon to X; (d2): the length from M to X in the subject sequence of the HSP in BLASTX; (d3): the length of EST sequence trimmed by Lucy, which can include a portion of a vector, an adaptor and a low quality region of a cDNA sequence; thick solid line: sequences retained after processing by Lucy; thin solid line: the low quality sequence removed from the 5′ end by Lucy; dashed line: amino acid sequence of the subject in BLASTX.

Since cDNA clones constructed using oligo-dT primers for first-strand synthesis are expected to have intact 3′ regions, clones that contain the translation initiation codon should have intact coding regions. TargetIdentifier therefore predicts whether the entire coding region is included in a cDNA clone by determining whether derived singleton and/or contig sequences include translation start codons. To accomplish this, the TargetIdentifier algorithm classifies EST-derived sequences as full-length, short full-length, possible full-length, ambiguous, partial or 3′-sequenced partial based on the decision tree presented in Figure 2 and the following definitions.

Figure 2

A decision tree for EST-derived sequence classification. The definitions of each category of EST-derived sequences are described in detail in the text. Start codon: ATG; 5′ stop codon: stop codon (TAA, TAG, or TGA) in the 5′-UTR; d1: the predicted length of the peptide that extends from the start codon encoded methionine to the first amino acid of the query in the HSP alignment in the output of BLASTX; d2: the subject's beginning position in the HSP alignment in the output of BLASTX; d3: the estimated length of the low quality sequence removed by Lucy (15).

Full-length. A sequence is considered to include the translation start codon when it satisfies one of the following two criteria. (1) The sequence has a 5′ stop codon followed by a start codon (Figure 1A and B). (2) The sequence does not have a 5′ stop codon but has an in-frame start codon encoding a methionine that aligns to the BLASTX subject prior to the 10th amino acid (Figure 1C).

Short full-length. The sequence has an in-frame start codon encoding a methionine that aligns to a position between the 10th and the 100th amino acid of the subject sequence (Figure 1C). The program determines the location of the potential start codon relative to the start codon for the BLASTX subject sequence. An upper limit of 100 is selected, because BLAST alignments of closely related cellulases and aldehyde oxidases revealed that the length of the amino terminal region extending from the aligned core sequences rarely varies by >100 amino acids.

Possible full-length. If sequence quality at the 5′ end of an EST sequence is poor, the DNA sequence removed by the quality control program may have included the start codon. The corresponding cDNA clone is therefore categorized as ‘possible full-length’ if the low quality sequence removed is long enough to include the missing amino terminal portion of the translated query.

Ambiguous. The sequence has a 5′ stop codon but does not have a start codon (Figure 1D). This type of anomaly probably arises because of sequencing errors. This can occur in EST-derived sequences as they can often include sequence information derived from a single sequencing read.

Partial. A sequence that is not assigned to one of the above categories (Figure 1E).

3′-sequenced partial. TargetIdentifier initially processes the sequence data assuming they were obtained by sequencing from the 5′ end of the cDNA inserts. In the BLASTX report, these sequences should align with the subject sequences in a positive reading frame. Query sequences are therefore classified as ‘3′-sequenced partial’ when they align to the subject sequence in a negative reading frame (−1, −2 or −3) and are not categorized as full-length, short full-length or ambiguous.
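The category definitions can be summarized as a small decision function. The Python sketch below is one reading of the definitions in the text, not a reproduction of the decision tree in Figure 2. The `Hit` record and all of its field names are hypothetical. In particular, `met_subject_pos` (the subject position to which the predicted methionine aligns) and `missing_aa` (the number of amino terminal residues missing from the translated query) are assumed to have been computed beforehand from d1 and d2, and the handling of a methionine that aligns beyond position 100 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Hypothetical summary of the top BLASTX HSP for one query sequence."""
    frame: int                              # BLASTX reading frame, e.g. +1 or -2
    has_5p_stop: bool                       # in-frame stop codon in the 5' region
    has_start: bool                         # in-frame ATG upstream of the HSP
    met_subject_pos: Optional[int] = None   # subject position aligned to the Met
    missing_aa: int = 0                     # residues missing at the amino terminus
    trimmed_nt: int = 0                     # d3: nucleotides trimmed from the 5' end


def classify(h: Hit) -> str:
    """Assign one of the six categories described in the text."""
    if h.has_5p_stop:
        # A 5' stop followed by a start codon is full-length; without a
        # start codon the sequence is ambiguous.
        category = "full-length" if h.has_start else "ambiguous"
    elif h.has_start and h.met_subject_pos is not None and h.met_subject_pos < 10:
        category = "full-length"
    elif h.has_start and h.met_subject_pos is not None and h.met_subject_pos <= 100:
        category = "short full-length"
    elif h.trimmed_nt > 0 and h.trimmed_nt >= 3 * h.missing_aa:
        # The trimmed region is long enough to have held the missing codons.
        category = "possible full-length"
    else:
        category = "partial"
    # Negative-frame hits not already placed in the full-length,
    # short full-length or ambiguous categories are 3'-sequenced partial.
    if h.frame < 0 and category in ("possible full-length", "partial"):
        category = "3'-sequenced partial"
    return category
```

For example, `classify(Hit(frame=1, has_5p_stop=False, has_start=True, met_subject_pos=40))` returns `"short full-length"`, while the same hit with `met_subject_pos=5` returns `"full-length"`.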

Input

1. A data file containing a set of ESTs or sequences assembled from ESTs in FASTA format.

2. A pre-run BLASTX output for each sequence contained in the input sequence file described in 1. This can be produced by searching against a database, such as the NCBI non-redundant protein database, Swiss-Prot database or a user generated protein database. A cutoff E-value can be chosen at the time of running BLASTX. For users without access to the NCBI-blastall package for processing a batch of sequences, our server provides BLASTX searches against the UniProt/Swiss-Prot database with a limit of 1000 sequences per submission. If >1000 sequences are submitted, only the first 1000 sequences will be processed.

3. Two optional input files that can be included are an ace file generated by an assembler, such as Phrap, and a file generated by a quality trimming program, such as Lucy (15). The ace file provides assembly information regarding the individual ESTs in a contig, and the quality file contains EST identifiers, EST length and the length of any low quality sequence removed from the 5′ end of each EST sequence in tab-delimited format.

4. A cutoff E-value that is set by the user to define what is a valid hit in BLASTX. If the user defined E-value is larger than the E-value used for the pre-run BLASTX output, the actual cutoff value is the value in the BLASTX output.

5. Options for users to choose either downloading the results or receiving the output via email.
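Two of the input rules above are simple enough to sketch in Python: the effective E-value cutoff is the stricter of the user-defined value and the value used for the pre-run BLASTX search, and only the first 1000 sequences of an oversized submission are processed. The function names `effective_cutoff` and `first_records` are hypothetical, and the FASTA reader is deliberately minimal; it is not the server's own parser.

```python
def effective_cutoff(user_evalue, blastx_evalue):
    """Return the cutoff actually applied to BLASTX hits.

    A user value larger than the one used for the pre-run BLASTX search
    cannot recover hits that were never reported, so the smaller of the
    two values is the effective cutoff.
    """
    return min(user_evalue, blastx_evalue)


def first_records(fasta_text, limit=1000):
    """Return up to `limit` (header, sequence) pairs from FASTA text."""
    records = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if len(records) == limit:
                break  # ignore everything after the first `limit` sequences
            records.append([line[1:].strip(), ""])
        elif line and records:
            records[-1][1] += line  # sequence may span several lines
    return [tuple(r) for r in records]
```

With a user cutoff of 1e-5 and a BLASTX run performed at 1e-10, `effective_cutoff(1e-5, 1e-10)` gives 1e-10, matching the rule stated in item 4.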

Output

The TargetIdentifier output is tab-delimited and can be opened as a spreadsheet with Microsoft Excel. The output file includes a summary of the results obtained for the whole set of EST or EST-derived sequences and a detailed report for each sequence predicted to fall within the various categories. The detailed report includes the following fields: (i) the name of the subject protein in the high-scoring segment pair (HSP) of the BLASTX alignment; (ii) a query identifier; (iii) the HSP E-value; (iv) a prediction of whether the EST or EST-derived query sequence is full-length, short full-length, possible full-length, ambiguous, partial or 3′-sequenced partial; (v) the start codon position; (vi) the strand and the sequence status of the query sequence, indicating whether or not the protein coding region has been completely sequenced; and (vii) HSP heading information taken from the BLASTX output that includes the subject definition line, length, score, E-value, identities, positives and reading frame. To sort genes by gene name, the algorithm removes the terms ‘probable’, ‘putative’, ‘possible’ and ‘similar to’ from the subject definition.
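The name-based sorting step can be sketched in Python as follows. The function name `sort_key` is hypothetical, and case-insensitive matching and whitespace normalization are assumptions; the text states only which four terms are removed.

```python
import re

# The four qualifiers named in the text; matching whole words only.
_QUALIFIERS = re.compile(
    r"\b(?:probable|putative|possible|similar to)\b\s*", re.IGNORECASE
)


def sort_key(definition):
    """Strip qualifier terms from a subject definition line."""
    cleaned = _QUALIFIERS.sub("", definition)
    return " ".join(cleaned.split())  # collapse any leftover whitespace


definitions = [
    "Putative cellulase",
    "Aldehyde oxidase",
    "Similar to beta-glucosidase",
]
by_name = sorted(definitions, key=lambda d: sort_key(d).lower())
```

Sorting on the cleaned key groups hits by protein name, so that ‘Putative cellulase’ is listed with other cellulases rather than under ‘P’.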

ACCURACY EVALUATION

To evaluate TargetIdentifier, we used the human UniGene set and our own EST-derived A. niger unigene set of contigs and singletons. The human UniGene set (Build #160, Homo sapiens, February 16, 2003) was searched using BLASTX against the full-length human protein sequences (total 8956) downloaded from the Swiss-Prot database. TargetIdentifier predicted that there were 7210 full-length, 66 short full-length, 376 (5′) partial, 400 3′-sequenced partial and 81 ambiguous sequences in the human UniGene set. We used a random number generator to select a total of 270 human UniGene sequences and compared the TargetIdentifier output with manually obtained results. This comparison showed that TargetIdentifier correctly sorted 93% of the sequences into the full-length, short full-length, possible full-length, ambiguous and partial categories. We also assessed the TargetIdentifier predictions using our EST-derived A. niger assembly set. To assemble this dataset, bases were called from the EST sequence chromatograms by Phred (16), vector and low quality regions were removed by Lucy (15) and the ESTs were assembled by Phrap. The accuracy of TargetIdentifier was assessed using 98 EST assemblies that encode predicted protein sequences sharing >90% identity with an A. niger protein entry at GenBank. This revealed that of the 55 sequences classified as full-length by TargetIdentifier, 54 were correctly predicted (98%). The human UniGene sequences, the 98 A. niger EST assemblies and the TargetIdentifier prediction data are available online.
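The evaluation arithmetic can be checked with a short Python sketch. The helper names `sample_ids` and `percent_correct` are hypothetical, and the fixed seed is an assumption added for reproducibility; the text does not say which random number generator or seed was used.

```python
import random


def sample_ids(ids, k=270, seed=0):
    """Draw k distinct identifiers for manual checking."""
    rng = random.Random(seed)  # seeded so the same sample can be redrawn
    return rng.sample(list(ids), k)


def percent_correct(correct, predicted):
    """Share of predictions that were confirmed, as a percentage."""
    return 100.0 * correct / predicted


# 54 of the 55 A. niger assemblies classified as full-length were confirmed.
a_niger_full_length = percent_correct(54, 55)
```

Here `a_niger_full_length` evaluates to about 98.2, which rounds to the 98% reported for the A. niger full-length predictions.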

SUMMARY

TargetIdentifier is a webserver that uses BLASTX alignments to identify full-length cDNAs from an EST-derived dataset. We have evaluated the prediction accuracy with the human UniGene set and our own set of assembled A. niger ESTs, and found that it is >90% accurate. TargetIdentifier can therefore be used to search EST-derived datasets for sequences encoding specific functionalities and to predict whether or not a cDNA clone harboring the complete coding region has been identified.

16 in total

1. Prediction whether a human cDNA sequence contains initiation codon by combining statistical information and similarity with protein sequences.

Authors: T Nishikawa; T Ota; T Isogai
Journal: Bioinformatics Date: 2000-11 Impact factor: 6.937

2. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences.

Authors: C Iseli; C V Jongeneel; P Bucher
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1999

3. An optimized protocol for analysis of EST sequences.

Authors: F Liang; I Holt; G Pertea; S Karamycheva; S L Salzberg; J Quackenbush
Journal: Nucleic Acids Res Date: 2000-09-15 Impact factor: 16.971

4. DNA sequence quality trimming and vector removal.

Authors: H H Chou; M H Holmes
Journal: Bioinformatics Date: 2001-12 Impact factor: 6.937

5. Assessing protein coding region integrity in cDNA sequencing projects.

Authors: A A Salamov; T Nishikawa; M B Swindells
Journal: Bioinformatics Date: 1998-06 Impact factor: 6.937

6. Gene clusters and polycistronic transcription in eukaryotes (review).

Authors: T Blumenthal
Journal: Bioessays Date: 1998-06 Impact factor: 4.345

7. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

8. Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis.

Authors: A G Pedersen; H Nielsen
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1997

9. Clusters of multiple different small nucleolar RNA genes in plants are expressed as and processed from polycistronic pre-snoRNAs.

Authors: D J Leader; G P Clark; J Watters; A F Beven; P J Shaw; J W Brown
Journal: EMBO J Date: 1997-09-15 Impact factor: 11.598

10. Identification of protein coding regions by database similarity search.

Authors: W Gish; D J States
Journal: Nat Genet Date: 1993-03 Impact factor: 38.330
