
ESMP: A high-throughput computational pipeline for mining SSR markers from ESTs.

Ranjan Sarmah, Jagajjit Sahu, Budheswar Dehury, Kishore Sarma, Smita Sahoo, Mousumi Sahu, Madhumita Barooah, Priyabrata Sen, Mahendra Kumar Modi.

Abstract

With the advent of high-throughput sequencing technology, sequences from many genomes are being deposited in public databases at a brisk rate. Open access to large amounts of expressed sequence tag (EST) data in public databases has provided a powerful platform for simple sequence repeat (SSR) development in species where sequence information is not available. SSRs are markers of choice because of their high reproducibility, abundant polymorphism and high inter-specific transferability. Mining SSRs from ESTs requires several high-throughput computational tools that must be executed individually, which is computationally intensive and time-consuming. To reduce the time lag and to streamline the cumbersome process of SSR mining from ESTs, we have developed a user-friendly, web-based EST-SSR pipeline, the "EST-SSR-MARKER PIPELINE (ESMP)". This pipeline integrates EST pre-processing, clustering, assembly and the subsequent mining of SSRs from assembled EST sequences. The mining of SSRs from ESTs provides valuable information on the abundance of SSRs in ESTs and will facilitate the development of markers for genetic analysis and related applications such as marker-assisted breeding. AVAILABILITY: The database is available for free at http://bioinfo.aau.ac.in/ESMP.


Keywords: ESMP; Expressed Sequence Tag; Simple Sequence Repeats; Single Nucleotide Polymorphism

Year: 2012 PMID: 22419843 PMCID: PMC3302004 DOI: 10.6026/97320630008206

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Expressed sequence tags (ESTs) are short, unedited, randomly selected single-pass reads derived from cDNA libraries, and they provide an alternative to whole-genome sequencing of organisms. The analysis of EST data supports gene discovery, completion of genome annotation and gene structure identification, establishes the viability of alternative transcripts, guides single nucleotide polymorphism (SNP) characterization and facilitates proteomic exploration [1]. The ubiquity of microsatellites, or simple sequence repeats (SSRs), in eukaryotic genomes and their usefulness as genetic markers have been well established over the last decade. SSRs are tandem repeats of short (1-6 bp) DNA motifs; they are usually single-locus markers characterized by hypervariability, abundance, reproducibility and ease of detection by polymerase chain reaction with unique primer pairs that flank the repeat motif [2]. The availability of ESTs greatly accelerates the systematic identification of SSRs and the corresponding marker development based on computational approaches [3]. EST-SSR and genomic SSR markers are considered complementary for plant genome mapping, with EST-SSRs being less polymorphic but concentrated in gene-rich regions [4]. Several EST assembly and annotation pipelines, viz. the EST analysis pipeline (ESTAP) [5], the EST pipeline system [6] and ParPEST [7], are available, each with its own objectives; they clean EST sequences and annotate them using public databases. Mining SSRs from ESTs requires several high-throughput computational tools that must be executed individually, which is computationally intensive and time-consuming. To reduce the time lag and to streamline the cumbersome process of SSR mining from ESTs, we have developed the EST-SSR Marker Pipeline (ESMP) for mining putative SSRs from EST sequences. ESMP accomplishes EST pre-processing, clustering, assembly and the subsequent mining of SSRs from assembled EST sequences.
The analytical tools Cross_match [8], Trimest [9], CAP3 [10] and MISA [11] have been integrated into ESMP, each for its respective application, to perform EST assembly and the mining of putative SSRs. ESMP has a three-tier architecture. The presentation tier lets the user interact with ESMP through a web browser, whereas the business tier performs the analytical services according to the user-specified options. The data generated in the business tier are then deposited in the data tier. Using the pipeline does not require installing any database or application on the user's machine. Instead, the user simply uploads FASTA-formatted EST sequence data to the server and runs the pipeline with default parameters. User-defined parameters can also be chosen, which makes the pipeline more interactive, user-friendly and flexible.
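To make the SSR definition used above concrete (1-6 bp motifs repeated back to back), the following minimal Python sketch scans a sequence for perfect repeats. It is not the MISA implementation integrated in ESMP: the names find_ssrs, is_primitive and SSR, and the minimum repeat counts per motif length, are assumptions chosen for this example only.

```python
import re
from typing import List, Mapping, NamedTuple

# Hypothetical minimum repeat counts per motif length (illustrative values,
# not taken from the ESMP paper).
MIN_REPEATS: Mapping[int, int] = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


class SSR(NamedTuple):
    start: int    # 0-based start position in the sequence
    motif: str    # repeat unit, e.g. "AG"
    repeats: int  # number of tandem copies of the motif


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit ("AA", "AGAG")."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(sequence: str, min_repeats: Mapping[int, int] = MIN_REPEATS) -> List[SSR]:
    """Return perfect SSRs found in the sequence, sorted by start position."""
    seq = sequence.upper()
    hits: List[SSR] = []
    for unit_len, min_count in sorted(min_repeats.items()):
        # One captured motif followed by at least (min_count - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_count - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                hits.append(SSR(match.start(), motif, len(match.group(0)) // unit_len))
                pos = match.end()
            else:
                # Skip motifs such as "AA"; they are reported at the shorter unit length.
                pos = match.start() + 1
    return sorted(hits)


if __name__ == "__main__":
    example = "TTGC" + "AG" * 7 + "CCGT" + "A" * 12 + "GATTACA"
    for hit in find_ssrs(example):
        print(hit)
```

Each motif length is searched separately, so a homopolymer run is reported once as a mononucleotide repeat rather than again as "AA" or "AAA". Imperfect repeats and compound SSRs are deliberately left out of this sketch.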

Implementation

The ESMP interface has been developed using HTML, CSS, JavaScript and PHP. MySQL is used to store the input EST data, the intermediate data of the pipeline and the statistics of the mined SSRs. The database schema is available at the ESMP website. The backend system is a Linux machine with an Intel® Core™ 2 Duo CPU at 3.33 GHz and 3 GB of RAM. The architecture and workflow of the pipeline are depicted in Figure 1.

Figure 1

Architecture and workflow of ESMP pipeline.

Software input

The ESMP web interface allows the user to submit EST sequences in FASTA format with a “reads” extension. It also asks the user to upload vector sequences in plain-text format, which can be obtained from the UniVec database on the NCBI FTP site (ftp://ftp.ncbi.nih.gov/pub/UniVec/). Although most EST projects produce a large number of chromatogram files, ESMP cannot accept chromatogram files because of the file-size limitations of web-based uploading. Accordingly, chromatogram files have to be converted into DNA sequence files using a base-calling program such as phred [8].
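Because the pipeline expects FASTA-formatted reads, it can help to check a file locally before uploading it. The sketch below is a hypothetical pre-upload helper and not part of ESMP: the function name parse_fasta, the accepted alphabet (A, C, G, T, N) and the error messages are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional

# Assumed alphabet for EST reads; N marks an ambiguous base call.
VALID_BASES = set("ACGTN")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}, raising ValueError on bad input."""
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {number}: header without an identifier")
            current = fields[0]  # identifier is the first word of the header
            if current in records:
                raise ValueError(f"line {number}: duplicate identifier {current!r}")
            records[current] = []
        else:
            if current is None:
                raise ValueError(f"line {number}: sequence data before the first header")
            bases = line.upper()
            if not set(bases) <= VALID_BASES:
                raise ValueError(f"line {number}: unexpected characters in sequence")
            records[current].append(bases)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    demo = [">est1 clone A", "acgtn", "GGCC", "", ">est2", "TTTT"]
    for name, sequence in parse_fasta(demo).items():
        print(name, len(sequence))
```

A file on disk can be checked the same way by passing an open file handle, since the function only iterates over lines.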

Software output

The ESMP output is stored in a MySQL database. All output files are stored with a “rar” extension; they can be downloaded by the user or viewed on the current web page. The statistics file summarises the putative SSRs mined in the run: the total number of sequences examined, the total size of the examined sequences (bp), the total number of identified SSRs, the number of SSR-containing sequences, the number of sequences containing more than one SSR and the number of SSRs present in compound formation. The statistics file can be transferred into an Excel file for better visualisation of the putative SSRs.
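The summary figures listed above can be derived from per-sequence SSR coordinates. The following Python sketch shows one way to compute them; it does not reproduce the ESMP or MISA code. The function name summarize, the dictionary keys, and the rule that two SSRs count as compound when at most max_gap bases separate them (with an illustrative default of 100) are assumptions.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) of one SSR, 0-based, end exclusive


def summarize(
    sequences: Dict[str, str],
    ssrs: Dict[str, List[Interval]],
    max_gap: int = 100,  # assumed compound-SSR threshold, for illustration
) -> Dict[str, int]:
    """Compute run-level SSR statistics from per-sequence SSR intervals."""
    total_ssrs = 0
    with_ssr = 0
    with_multiple = 0
    in_compound = 0
    for seq_id in sequences:
        hits = sorted(ssrs.get(seq_id, []))
        total_ssrs += len(hits)
        if hits:
            with_ssr += 1
        if len(hits) > 1:
            with_multiple += 1
        # Mark both neighbours when the gap between consecutive SSRs is small enough.
        compound = set()
        for i in range(len(hits) - 1):
            if hits[i + 1][0] - hits[i][1] <= max_gap:
                compound.update((i, i + 1))
        in_compound += len(compound)
    return {
        "sequences_examined": len(sequences),
        "total_bp": sum(len(seq) for seq in sequences.values()),
        "ssrs_identified": total_ssrs,
        "sequences_with_ssr": with_ssr,
        "sequences_with_multiple_ssrs": with_multiple,
        "ssrs_in_compound": in_compound,
    }


if __name__ == "__main__":
    seqs = {"c1": "A" * 300, "c2": "C" * 50, "c3": "G" * 400}
    found = {"c1": [(10, 30), (40, 60), (250, 270)], "c3": [(5, 25)]}
    for key, value in summarize(seqs, found).items():
        print(f"{key}\t{value}")
```

Tab-separated output of this kind opens directly in a spreadsheet, which matches the export step described above.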

Conclusion

The ESMP pipeline integrates multiple tools, each otherwise run separately for its own application, to accomplish the mining of putative SSRs from ESTs.

Caveat and future development

ESMP currently supports pre-processing, assembly and putative SSR detection from EST datasets. This user-friendly, web-based EST-SSR pipeline reduces the time lag and streamlines the cumbersome process of SSR mining from ESTs. Our goal is not to limit this pipeline to EST-SSR mining but to extend it to annotation and to the detection of suitable primer pairs that flank the repeat motif. The mining of SSRs from ESTs provides valuable information on the abundance of SSRs in ESTs and will facilitate the development of markers for genetic analysis and related applications.

9 in total

1. CAP3: A DNA sequence assembly program.

Authors: X Huang; A Madan
Journal: Genome Res Date: 1999-09 Impact factor: 9.043

2. In silico analysis on frequency and distribution of microsatellites in ESTs of some cereal species.

Authors: Rajeev K Varshney; Thomas Thiel; Nils Stein; Peter Langridge; Andreas Graner
Journal: Cell Mol Biol Lett Date: 2002 Impact factor: 5.787

Review 3. Genic microsatellite markers in plants: features and applications.

Authors: Rajeev K Varshney; Andreas Graner; Mark E Sorrells
Journal: Trends Biotechnol Date: 2005-01 Impact factor: 19.536

4. ESTAP--an automated system for the analysis of EST data.

Authors: Chunhong Mao; John C Cushman; Gregory D May; Jennifer W Weller
Journal: Bioinformatics Date: 2003-09-01 Impact factor: 6.937

Review 5. A hitchhiker's guide to expressed sequence tag (EST) analysis.

Authors: Shivashankar H Nagaraj; Robin B Gasser; Shoba Ranganathan
Journal: Brief Bioinform Date: 2006-05-23 Impact factor: 11.622

6. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

7. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.).

Authors: T Thiel; W Michalek; R K Varshney; A Graner
Journal: Theor Appl Genet Date: 2002-09-14 Impact factor: 5.699

8. ParPEST: a pipeline for EST data analysis based on parallel computing.

Authors: Nunzio D'Agostino; Mario Aversano; Maria Luisa Chiusano
Journal: BMC Bioinformatics Date: 2005-12-01 Impact factor: 3.169

9. EST pipeline system: detailed and automated EST data processing and mining.

Authors: Hao Xu; Ling He; Yuanzhong Zhu; Wei Huang; Lijun Fang; Lin Tao; Yuedong Zhu; Lin Cai; Huayong Xu; Liang Zhang; Hong Xu; Yan Zhou
Journal: Genomics Proteomics Bioinformatics Date: 2003-08 Impact factor: 7.691

