Mining for SSRs and FDMs from expressed sequence tags of Camellia sinensis.

Jagajjit Sahu, Ranjan Sarmah, Budheswar Dehury, Kishore Sarma, Smita Sahoo, Mousumi Sahu, Madhumita Barooah, Mahendra Kumar Modi, Priyabrata Sen.

Abstract

Simple Sequence Repeats (SSRs) developed from Expressed Sequence Tags (ESTs), known as EST-SSRs, are among the most widely used and potentially valuable sources of gene-based markers because of their high cross-taxon portability and their rapid, inexpensive development. The EST sequence information in publicly available databases is increasing rapidly. Computational approaches provide a better alternative to conventional methods for developing SSR markers from ESTs. In the present study, 12,851 EST sequences of Camellia sinensis, downloaded from the National Center for Biotechnology Information (NCBI), were mined for the development of microsatellites. After preprocessing and assembly with various computational tools, 6148 non-redundant EST sequences (4779 singletons and 1369 contigs) were obtained. Out of a total of 3822.68 kb of sequence examined, 1636 (26.61%) EST sequences containing 2371 SSRs were detected, a density of 1 SSR/1.61 kb, leading to the development of 245 primer pairs. These mined EST-SSR markers will support further study of variability, mapping and evolutionary relationships in Camellia sinensis. In addition, the developed SSRs can also be applied in studies across species.

Keywords: Camellia sinensis; Expressed Sequence Tags (ESTs); Functional Domain Markers (FDM); Simple Sequence Repeats (SSRs)

Year: 2012 PMID: 22493533 PMCID: PMC3321235 DOI: 10.6026/97320630008260

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

SSRs (Simple Sequence Repeats), otherwise known as microsatellites, are short repeat motifs that are present in both protein-coding and non-coding regions of DNA sequences. SSRs show a high level of length polymorphism due to mutations that add or remove one or more repeats. The use of SSRs as molecular markers is favored because of their multiallelic nature, reproducibility, high abundance and extensive genome coverage [1]. Expressed Sequence Tags (ESTs) are single-pass sequences of cDNA clones that provide direct information on gene expression and serve as the main source for microsatellites [2]. The traditional methods of developing SSR markers are usually time-consuming and labor-intensive. Generally, these processes involve genomic library construction, hybridization with the repeated units of nucleotides and sequencing of the clones. The computational approach for developing SSR markers from ESTs provides a better platform than the conventional approach. EST databases store redundant EST sequences, so the same region may be represented more than once. The EST sequences can be obtained from the databases and then preprocessed and assembled to develop potential SSR markers. Numerous tools (both standalone and web-based) are available for mining EST data to design EST-SSR markers at a large scale. The free computational programs and the large sets of EST data available on the web help researchers perform data mining on their local systems rapidly and at very low cost. Tools such as cross_match and trimest provide non-redundant, high-quality EST sequences free of vector contamination and poly-A and poly-T tails. CAP3 assembles EST sequences that have overlapping regions and produces contigs by joining them. Sequences that share no common portion remain as singletons. MISA detects Simple Sequence Repeats in both contigs and singletons.
EST-SSR markers are potential candidates for gene tagging and comparative studies in various species because they are gene specific. The use of EST-SSR markers has been reported in a large number of commercially important plants. Camellia sinensis (tea), a perennial evergreen shrub, is best known for its commercial value. Tea is perhaps the second most consumed beverage in the world and is associated with some important health benefits. It is reported to promote a healthier immune system and to lower blood pressure and cholesterol. It is produced in major amounts in India and Sri Lanka in many varieties. A large amount of EST data for tea has been generated and stored in the databases. The present study deals with the detection of SSRs in the EST sequences of Camellia sinensis available in the public domain and with the functional annotation of the SSR-containing ESTs. The annotation helps to identify the putative functions of the ESTs and to find the important functional domain markers (FDMs) related to the SSR-ESTs, leading to a gene ontology study. The gene ontology covers three domains: biological process (operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs and organisms); cellular component (the parts of a cell or its extracellular environment); and molecular function (the elemental activities of a gene product at the molecular level, such as binding or catalysis).

Methodology

Data source:

Sequence data were collected from the public domain GenBank database at NCBI (National Center for Biotechnology Information) [3]. A search for EST sequences of Camellia sinensis returned 12,851 sequences. All the sequences were downloaded to a local machine for further analysis.

Processing EST sequence:

The 6148 non-redundant EST sequences were screened for vector sequence, which was removed using the cross_match software [4]. The vector sequences were obtained from the UniVec database [5]. The trimest program from EMBOSS was then used to remove the poly-A and poly-T ends of the EST sequences [6]. It is freely available on the web and can be downloaded and installed on a local PC. It trims the poly-A and poly-T ends from a given sequence according to the supplied parameters.
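The tail-trimming step can be illustrated with a minimal Python sketch. This is not the trimest implementation; it only shows the idea of removing a poly-T run at the 5' end and a poly-A run at the 3' end. The function name and the `min_run` threshold are assumptions made for illustration, and the study does not state the parameters it used.

```python
import re

def trim_poly_tails(seq: str, min_run: int = 4) -> str:
    """Remove a leading poly-T run and a trailing poly-A run.

    min_run is a hypothetical threshold (not taken from the study):
    runs shorter than this are left in place.
    """
    seq = seq.upper()
    # Strip a run of at least min_run T's anchored at the 5' end.
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    # Strip a run of at least min_run A's anchored at the 3' end.
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    return seq

# Hypothetical EST with a 6-base poly-T head and an 8-base poly-A tail.
example = "TTTTTTGCATGCCGTAGGCTAAAAAAAA"
print(trim_poly_tails(example))
```

A real trimming tool may also tolerate mismatches inside the tail; this sketch deliberately does not.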

EST assembly:

After the EST sequences had been processed through trimming, they were submitted to CAP3 for assembly [7]. CAP3 is a DNA sequence assembly program, freely available for academic use. CAP3 produces contig, singlet, quality, info and output files.
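Once the assembly has finished, the non-redundant set is simply the contigs plus the singlets. The sketch below counts FASTA records in each output, assuming both are plain FASTA text; the function names are hypothetical and the in-memory records are made up for the demonstration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))

def summarize_assembly(contig_lines, singlet_lines):
    """Summarize an assembly as contigs, singlets and their sum (unique sequences)."""
    contigs = count_fasta_records(contig_lines)
    singlets = count_fasta_records(singlet_lines)
    return {"contigs": contigs, "singlets": singlets, "unique": contigs + singlets}

# Tiny made-up example; real input would be the lines of the assembler's
# contig and singlet files.
demo_contigs = [">Contig1", "ACGTACGT", ">Contig2", "GGCCTTAA"]
demo_singlets = [">EST_0001", "ACGGT"]
print(summarize_assembly(demo_contigs, demo_singlets))
```

With the figures reported in this study, the same sum is 1369 contigs + 4779 singlets = 6148 unique sequences. In practice the two arguments could be open file handles for the contig and singlet files.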

SSR detection:

The 6148 unique EST sequences, i.e. 4779 singlets and 1369 contigs, were searched for microsatellites using MISA (MIcroSAtellite identification tool) [8]. MISA is a freely downloadable Perl script available on the internet. Along with misa.pl, the file misa.ini, which contains the search parameters, was also downloaded.
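The kind of search that a microsatellite finder performs can be sketched with regular expressions. The code below is not MISA and does not read misa.ini. The minimum repeat counts in `MIN_REPEATS` are assumed values chosen for illustration, since the study does not report its settings, and compound SSRs are not handled.

```python
import re

# Assumed thresholds: motif length -> minimum number of repeats.
# These are illustrative, not the study's confirmed settings.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeats) tuples with 1-based inclusive coordinates."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # Group 2 is the motif; it must be followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "CTCT"),
            # so a poly-A run is reported once as a mono-nucleotide SSR.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start() + 1, m.end(), motif, len(m.group(1)) // unit_len))
    return sorted(hits)

# Made-up sequence containing (A)12, (CT)7 and (GAA)5.
demo = "GGC" + "A" * 12 + "GCG" + "CT" * 7 + "GGC" + "GAA" * 5 + "TC"
for hit in find_ssrs(demo):
    print(hit)
```

Because matching is leftmost-first, the reported motif depends on where the repeat tract begins: the same tract can be reported as CT or TC, or as GAA, AGA or AAG, depending on the flanking bases.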

Primer designing:

Primer pairs were designed from the obtained SSR sequences using the Primer3 tool [9]. The default parameters were adjusted as follows: primer size was set to min 17, opt 21 and max 27, and GC% was set to min 45% and max 65%. Both forward and reverse primers were then searched for each SSR.
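The stated primer settings can be written out as a Primer3 input record. The sketch below builds a Boulder-IO record in Python, assuming the tag names of Primer3 2.x. Whether the study used this input form or the web interface is not stated, and the function name, sequence identifier and target coordinates are hypothetical.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length):
    """Build one Boulder-IO record that asks for primers flanking an SSR.

    ssr_start is 1-based here because PRIMER_FIRST_BASE_INDEX is set to 1.
    Size and GC limits are the values reported in the text above.
    """
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", f"{ssr_start},{ssr_length}"),
        ("PRIMER_FIRST_BASE_INDEX", 1),
        ("PRIMER_MIN_SIZE", 17),
        ("PRIMER_OPT_SIZE", 21),
        ("PRIMER_MAX_SIZE", 27),
        ("PRIMER_MIN_GC", 45),
        ("PRIMER_MAX_GC", 65),
    ]
    lines = [f"{key}={value}" for key, value in tags]
    lines.append("=")  # a lone '=' terminates a Boulder-IO record
    return "\n".join(lines) + "\n"

# Hypothetical template and SSR position, for illustration only.
print(primer3_record("contig_demo", "ACGT" * 40, 61, 20))
```

All other Primer3 settings are left at their defaults in this sketch.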

Functional annotation:

The Functional Domain Markers were identified in the SSR-containing sequences using InterProScan at EBI [10]. InterProScan provides a platform for analyzing functional domains with the help of member databases and applications such as BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, HAMAP, PatternScan, SuperFamily, SignalPHMM, TMHMM, HMMPanther and Gene3D. The SSR-EST sequences were searched for significant matches against non-redundant protein database entries using BLASTx at the National Center for Biotechnology Information [11]. BLASTx searches a protein database using a translated nucleotide query. BLASTx was performed with the identity threshold set to > 70%. The SSR-FDM contig sequences found with InterProScan were annotated for biological process, cellular component and molecular function using QuickGO (http://www.ebi.ac.uk/QuickGO) at the EBI server [12]. QuickGO is a fast web-based browser for Gene Ontology terms and annotations, provided by the UniProtKB-GOA group.
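The identity cut-off can be applied as a post-filter on tabular BLAST results. The sketch below assumes the hits were exported in the common 12-column tab-separated layout, in which the third column is percent identity. The study does not say how its results were exported, and the function name, sequence identifiers and example rows are hypothetical.

```python
def filter_by_identity(lines, min_identity=70.0):
    """Keep hits whose percent identity is strictly greater than min_identity.

    Each line is expected to be tab-separated with query id, subject id and
    percent identity in the first three columns.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if pident > min_identity:
            kept.append((query, subject, pident))
    return kept

# Made-up rows; the third column is percent identity.
rows = [
    "contig_1\tprot_A\t85.20\t120\t17\t1\t1\t360\t10\t129\t1e-50\t200",
    "contig_2\tprot_B\t64.00\t100\t36\t0\t1\t300\t5\t104\t1e-20\t95",
    "contig_3\tprot_C\t70.00\t90\t27\t0\t1\t270\t1\t90\t1e-25\t110",
]
print(filter_by_identity(rows))
```

The comparison is strict, matching the "> 70%" wording, so a hit at exactly 70% identity is excluded.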

Discussion

Powerful computational tools were used to mine the publicly available Camellia sinensis EST data. In the present study, 2371 SSRs were found in the total screened EST data, Table 1 (see supplementary material). The EST sequences containing SSRs are generally referred to as SSR-ESTs, whereas the markers developed from SSR-ESTs are called EST-SSRs. A total of 6148 sequences, 3822.68 kb in size, was examined for microsatellites, of which 1636 (26.61%) sequences were found to contain SSRs, a density of 1 SSR/1.61 kb. (Figure 1) shows the frequency distribution of the different repeat types identified and (Figure 2) shows the frequency distribution of mono-, di-, tri-, tetra-, penta- and hexa-nucleotide repeat motifs in EST sequences of Camellia sinensis. The most frequent repeat type within the EST sequences was mono-nucleotide repeats (65.9%), followed by di-nucleotide repeats (24.6%), tri-nucleotide repeats (7.8%), tetra-nucleotide repeats (0.8%) and penta- and hexa-nucleotide repeats (0.8%). The frequencies of the identified SSR motifs are provided in Table 2 (see supplementary material). The detected SSR motifs give insight into the frequency distribution of different types of nucleotide repeats in Camellia sinensis. The analysis showed that the frequency is not very high but is better than in many other species. Ignoring the mono-nucleotide repeats, the di-nucleotide repeats were the most abundant repeat type. Among the di-nucleotide repeat types, CT and AG have the highest frequency. This agrees with earlier findings that AG/CT repeats are the most frequent in plant EST-SSRs [13, 14]. According to previous reports, GC repeats are abundant in most eukaryotes; in this study, however, GC repeats were completely absent [15, 16]. Tri-nucleotide repeats are less frequent than di-nucleotide repeats in Camellia sinensis. Studies have shown that in plants the most common tri-nucleotide SSRs are AAG and CCG [17].
In tea, the AAG repeat was not present, and CCG repeats were present but at low frequency. The most frequent tri-nucleotide repeats in tea were GAA, TCT and CAC.
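The summary statistics quoted above follow directly from the reported counts, as the short Python check below shows. Variable names are illustrative; all input numbers are taken from the text.

```python
# Counts reported in the study.
total_ests = 6148      # non-redundant EST sequences examined
total_kb = 3822.68     # total sequence length examined, in kb
ssr_ests = 1636        # sequences containing at least one SSR
ssr_count = 2371       # SSRs detected

# Share of sequences that contain SSRs, in percent.
ssr_est_share = 100 * ssr_ests / total_ests
# Density expressed as kb of sequence per SSR.
kb_per_ssr = total_kb / ssr_count

print(f"SSR-containing ESTs: {ssr_est_share:.2f}%")
print(f"Density: 1 SSR per {kb_per_ssr:.2f} kb")

# Reported shares of each repeat class (mono, di, tri, tetra, penta+hexa).
class_shares = [65.9, 24.6, 7.8, 0.8, 0.8]
print(f"Sum of class shares: {sum(class_shares):.1f}%")
```

The class shares sum to slightly less than 100%, which is consistent with rounding to one decimal place.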

Figure 1

Frequency distribution of different repeat types identified in EST sequences of Camellia sinensis.

Figure 2

Frequency distribution of (A) mono- and di-, (B) tri-, (C) tetra-, (D) penta- and hexa-nucleotide repeat motifs in EST sequences of Camellia sinensis.

In tea, the length of the detected SSRs varied from 10 to 392. Polymorphic markers are mostly found in long repeats, so the ESTs containing SSRs longer than 100 were chosen for primer design. From 97 such SSR-ESTs, 245 primer pairs were designed. (Figure 3) shows the flowchart of the stepwise procedure involved in the in silico mining of SSRs from Camellia sinensis EST sequences. Of the 469 SSR-containing EST contigs, those containing mono-nucleotide SSRs were not taken into account. The remaining 314 contigs were analyzed using InterProScan and a total of 935 different functional domains were found, Table 3 (see supplementary material). 250 sequences were found to be SSR-FDMs, containing both SSRs and FDMs. These sequences were then assigned to gene ontology terms in the Swissprot database.

Figure 3

Work Flow Chart

All 469 SSR-containing sequences were annotated against the non-redundant protein database using BLASTx to determine their functions. The BLASTx results (Figure 4) comprise 54 (11.51%) putative proteins, 74 (15.77%) predicted proteins, 50 (10.67%) hypothetical proteins, 146 (31.13%) proteins of different functional classes and 145 (30.92%) sequences for which no function was found because there was no specific similarity. The SSR-ESTs were also assigned to gene ontology terms. 59 of the 250 contigs containing FDMs were found to have vital roles in various processes and functions. (Figure 5) describes the details of the biological process, cellular component and molecular function categories of gene ontology. The gene ontology provided information on ESTs related to many genes coding for secondary metabolite production, translation, lipid transport, stress response, GTP-binding proteins, iron and sulfur cluster-binding proteins and many other functions.
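Recomputing the category shares from the raw counts is a quick consistency check on the BLASTx summary. The sketch below uses the counts given in the paragraph above; the dictionary keys are shorthand labels introduced here.

```python
# Counts per annotation category, as reported for the BLASTx results.
counts = {
    "putative": 54,
    "predicted": 74,
    "hypothetical": 50,
    "other functional classes": 146,
    "no function found": 145,
}

total = sum(counts.values())
# Percentage share of each category, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(f"Total sequences: {total}")
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```

The counts add up to the 469 annotated sequences. Last-digit differences between recomputed and quoted percentages can arise from rounding versus truncation.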

Figure 4

BLASTX results analysis.

Figure 5

Gene Ontology (A) Biological Process; (B) Cellular Component; (C) Molecular Function

A number of studies are currently being reported on the development of EST-SSRs in a vast number of plant species using computational tools. Microsatellite markers are very important for studying genetic variability and understanding genetic information. In the present study, 2371 microsatellites were detected in the available EST data for Camellia sinensis. The redundancy in the EST sequences was reduced by preprocessing and assembly, which helped in detecting unique SSRs. The designed primer pairs can be checked for transferability in related species. BLASTx results helped to identify the putative functions of the ESTs, and the GO annotation revealed many functional domains. This will support the development of specific markers for secondary metabolites, stress-responsive genes and other genes that may be present in the plant.

Conclusion

Camellia sinensis, commonly known as tea, is a plant of great commercial value. To date, little work has been reported on the application of molecular markers in this plant, mainly because of the unavailability of suitable markers. Simple Sequence Repeats (SSRs) are among the most powerful genetic markers for genetic linkage analysis, diversity studies and marker-assisted selection. Microsatellite markers are widely used in plant genetic analysis owing to the falling cost of DNA sequencing and the increasing availability of EST sequence data in different plants. The development and application of molecular markers are of immense importance for examining the genetic make-up of Camellia sinensis, inter-species variability and evolutionary relationships. The EST-SSRs developed in the present study will help uncover this information. These SSRs can also be used as molecular markers to identify gene function if they are found to be linked to a gene of importance. Because EST-SSRs are highly transferable across species, the SSRs derived from these ESTs can be used in related species for which very few sequences are available. They can enhance cross-species applications for developing conserved orthologous marker sets. This study also gives a brief outline of the approach for developing computationally mined SSRs from ESTs.

11 in total

1. CAP3: A DNA sequence assembly program.

Authors: X Huang; A Madan
Journal: Genome Res Date: 1999-09 Impact factor: 9.043

2. Primer3 on the WWW for general users and for biologist programmers.

Authors: S Rozen; H Skaletsky
Journal: Methods Mol Biol Date: 2000

3. Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat.

Authors: Ramesh V Kantety; Mauricio La Rota; David E Matthews; Mark E Sorrells
Journal: Plant Mol Biol Date: 2002 Mar-Apr Impact factor: 4.076

4. A hitchhiker's guide to expressed sequence tag (EST) analysis.

Authors: Shivashankar H Nagaraj; Robin B Gasser; Shoba Ranganathan
Journal: Brief Bioinform Date: 2006-05-23 Impact factor: 11.622

5. Analysis of microsatellites derived from bee ESTs.

Authors: Bin Li; Qing-You Xia; Cheng Lu; Ze-Yang Zhou
Journal: Yi Chuan Xue Bao Date: 2004-10

6. Frequency and coverage of trinucleotide repeats in eukaryotes.

Authors: Paola Astolfi; Dina Bellizzi; Vittorio Sgaramella
Journal: Gene Date: 2003-10-23 Impact factor: 3.688

7. Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions.

Authors: Subbaya Subramanian; Rakesh K Mishra; Lalji Singh
Journal: Genome Biol Date: 2003-01-23 Impact factor: 13.583

8. InterProScan: protein domains identifier.

Authors: E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

9. Analysis on frequency and density of microsatellites in coding sequences of several eukaryotic genomes.

Authors: Bin Li; Qingyou Xia; Cheng Lu; Zeyang Zhou; Zhonghuai Xiang
Journal: Genomics Proteomics Bioinformatics Date: 2004-02 Impact factor: 7.691

10. QuickGO: a web-based tool for Gene Ontology searching.

Authors: David Binns; Emily Dimmer; Rachael Huntley; Daniel Barrell; Claire O'Donovan; Rolf Apweiler
Journal: Bioinformatics Date: 2009-09-10 Impact factor: 6.937

9 in total

1. Mining and comparative survey of EST-SSR markers among members of Euphorbiaceae family.

Authors: Surojit Sen; Budheswar Dehury; Jagajjit Sahu; Sunayana Rathi; Raj Narain Singh Yadav
Journal: Mol Biol Rep Date: 2018-04-06 Impact factor: 2.316

2. Rapid Genome-Wide Location-Specific Polymorphic SSR Marker Discovery in Black Pepper by GBS Approach.

Authors: Ankita Negi; Kalpana Singh; Sarika Jaiswal; Johnson George Kokkat; Ulavappa B Angadi; Mir Asif Iquebal; P Umadevi; Anil Rai; Dinesh Kumar
Journal: Front Plant Sci Date: 2022-05-27 Impact factor: 6.627

3. Rediscovering medicinal plants' potential with OMICS: microsatellite survey in expressed sequence tags of eleven traditional plants with potent antidiabetic properties.

Authors: Jagajjit Sahu; Priyabrata Sen; Manabendra Dutta Choudhury; Budheswar Dehury; Madhumita Barooah; Mahendra Kumar Modi; Anupam Das Talukdar
Journal: OMICS Date: 2014-05

4. Development of transcriptome based web genomic resources of yellow mosaic disease in Vigna mungo.

Authors: Rahul Singh Jasrotia; Mir Asif Iquebal; Pramod Kumar Yadav; Neeraj Kumar; Sarika Jaiswal; U B Angadi; Anil Rai; Dinesh Kumar
Journal: Physiol Mol Biol Plants Date: 2017-09-18

5. Development and characterization of genic SSR-FDM for stem gall disease resistance in coriander (Coriandrum sativum L.) and its cross species transferability.

Authors: Sharda Choudhary; Mahantesha B N Naika; R D Meena
Journal: Mol Biol Rep Date: 2021-05-22 Impact factor: 2.316

6. VigSatDB: genome-wide microsatellite DNA marker database of three species of Vigna for germplasm characterization and improvement.

Authors: Rahul Singh Jasrotia; Pramod Kumar Yadav; Mir Asif Iquebal; S B Bhatt; Vasu Arora; U B Angadi; Rukam Singh Tomar; Sarika Jaiswal; Anil Rai; Dinesh Kumar
Journal: Database (Oxford) Date: 2019-01-01 Impact factor: 3.451

7. Genome-Wide Identification of SSR and SNP Markers Based on Whole-Genome Re-Sequencing of a Thailand Wild Sacred Lotus (Nelumbo nucifera).

Authors: Jihong Hu; Songtao Gui; Zhixuan Zhu; Xiaolei Wang; Weidong Ke; Yi Ding
Journal: PLoS One Date: 2015-11-25 Impact factor: 3.240

8. SBMDb: first whole genome putative microsatellite DNA marker database of sugarbeet for bioenergy and industrial applications.

Authors: Mir Asif Iquebal; Sarika Jaiswal; U B Angadi; Gaurav Sablok; Vasu Arora; Sunil Kumar; Anil Rai; Dinesh Kumar
Journal: Database (Oxford) Date: 2015-12-08 Impact factor: 3.451

9. PlantFuncSSR: Integrating First and Next Generation Transcriptomics for Mining of SSR-Functional Domains Markers.

Authors: Gaurav Sablok; Antonio J Pérez-Pulido; Thac Do; Tan Y Seong; Carlos S Casimiro-Soriguer; Nicola La Porta; Peter J Ralph; Andrea Squartini; Antonio Muñoz-Merida; Jennifer A Harikrishna
Journal: Front Plant Sci Date: 2016-06-27 Impact factor: 5.753
