
Exploiting EST databases for the mining and characterization of short sequence repeat (SSR) markers in Catharanthus roseus L.

Raj Kumar Joshi, Basudeba Kar, Sanghamitra Nayak.

Abstract

Periwinkle (Catharanthus roseus L.) (Family: Apocynaceae) is an ornamental plant with great medicinal properties. Although the genus is represented by seven species, little work has been carried out on its genetic characterization owing to the non-availability of reliable molecular markers. Simple sequence repeats (SSRs) have been widely applied as molecular markers in genetic studies. With the rapid increase in the deposition of nucleotide sequences in public databases and the advent of bioinformatics tools, scanning for microsatellite repeats and converting them into potential genetic markers has become a fast and cost-effective approach. Expressed sequence tags (ESTs) from Catharanthus roseus were used to screen for Class I (hypervariable) simple sequence repeats (SSRs). A total of 502 microsatellite repeats were detected from 21730 EST sequences of Catharanthus roseus after redundancy elimination. The average density of Class I SSRs amounts to 1 SSR per 10.21 kb of EST sequence. Mononucleotides were the most abundant class of microsatellite motifs, accounting for 44.02% of the total, followed by trinucleotide (26.09%) and dinucleotide repeats (14.34%). Among all the repeat motifs, (A/T)n accounted for the highest proportion (36.25%), followed by (AAG)n. The detected SSRs can be used to design primers of functional importance and should also facilitate the analysis of genetic diversity, variability, linkage mapping and evolutionary relationships in plants, especially medicinal plants.


Keywords: Catharanthus roseus; Expressed sequence tags; SSR Locator; short sequence repeats

Year: 2011 PMID: 21383904 PMCID: PMC3044425 DOI: 10.6026/97320630005378

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Catharanthus roseus (L.) G. Don, commonly known as periwinkle, of the family Apocynaceae, is an ornamental plant with great medicinal value. The genus Catharanthus includes seven species besides C. roseus, namely C. coriaceus, C. lanceus, C. longifolius, C. ovalis, C. pusillus, C. scitulus and C. trichophyllus. C. pusillus is endemic to India, while the rest are abundantly present in Madagascar. Catharanthus roseus possesses a suitable genetic system, as it is biannual, seed cyclable, diploid (2n=16) and amenable to controlled pollination and micropropagation [1, 2]. As such, various genetic, proteomic and biotechnological studies are in progress in Catharanthus, making it an intensively investigated plant species. An important breeding objective for the improvement of C. roseus is the development of molecular markers. Variation at the DNA level is the key to modern genetic studies, and several DNA marker systems, including restriction fragment length polymorphism (RFLP), random amplified polymorphic DNA (RAPD), inter simple sequence repeats (ISSR), simple sequence repeats (SSR), amplified fragment length polymorphism (AFLP) and their variants, have been developed in recent times to analyze gross and specific DNA sequence variations in different species. Among the different classes of molecular markers, microsatellite markers are the most favored. Microsatellites, or simple sequence repeats (SSRs), are stretches of DNA consisting of tandemly repeated short units of 1-6 base pairs in length. Compared with other molecular markers, SSRs are more advantageous and are used for a variety of applications because of their multi-allelic nature, reproducibility, co-dominant inheritance, high abundance and extensive genome coverage [3]. Various studies have been undertaken to identify microsatellites in different crop species, including rice, wheat, maize, barley, soybean and several medicinal and aromatic plants of commercial importance [4].
In C. roseus, a number of microsatellite markers have recently been developed and deployed for the study of intraspecific and interspecific as well as intrageneric and intergeneric genetic polymorphism [5]. However, most of these are genomic SSRs, whose development is laborious, cumbersome and cost-intensive. Currently, with the accumulation of biological data originating from whole genome sequencing initiatives, the use of bioinformatics tools helps to maximize the identification of these sequences and, consequently, the number of markers generated. Advances in genomic technologies have generated a large number of expressed sequence tags (ESTs) that have been made available in public databases, thereby offering an opportunity to develop EST-derived SSR markers by data mining. ESTs are short, single-pass sequence reads from mRNA (cDNA) [6] representing a snapshot of the genes expressed in a given tissue. EST-SSR markers are expected to possess high interspecific transferability, as they belong to relatively conserved genic regions of the genome. With the recent emphasis on functional genomics, large datasets of ESTs are being developed, and with evolving bioinformatics tools it is now possible to identify and develop EST-SSR markers on a large scale in a time- and cost-effective manner [7, 8]. As of January 2011, GenBank had released 21730 EST sequences from Catharanthus roseus (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). In this context, the use of EST or cDNA-based SSRs has been reported for several species, including grape [9], sugarcane [10], durum wheat [11], rye [12] and medicinal plants such as basil [13]. With this in view, the present study aims to assess the suitability of existing public databases for the mining of simple sequence repeats. We mined the updated EST libraries of Catharanthus roseus to identify SSRs.
Various SSR detection programs are available, such as MISA [14], SSRFinder [15], SSRIT [16], TRF [17], TROLL [18], Sputnik (http://espressosoftware.com/pages/sputnik.jsp), Modified Sputnik I [19] and Modified Sputnik II [20]. However, we used a newer SSR detection program, SSR Locator [21], to identify the SSRs because of its user-friendly Windows interface and its ability to present results as HTML files.

Methodology

The EST database of NCBI contained 21730 Catharanthus roseus expressed sequence tags. The 21730 sequences retrieved were derived from different plant tissues, i.e. leaves and roots. The sequences were downloaded in FASTA format for sequence assembly and SSR analysis, and a single text file was compiled containing all 21730 EST sequences. The EST sequences were screened against the UniVec database from NCBI (ftp://ftp.ncbi.nih.gov/pub/UniVec/) to detect vector and adapter sequences using the program Cross_Match [22] with the following parameters: minmatch ≥13 and minscore ≥20. Furthermore, polyA/T tails and X characters were removed using the EST_trimmer.pl script (http://pgrc.ipk-gatersleben.de/misa/download/est_trimmer.pl) until no stretch of (A/T)5 or (X)1 was present in a window of 100 bp at the 5' or 3' end, respectively. Because dbEST contains redundant EST sequences, the ESTs were assembled using the contig assembly program CAP3 [23]. The sequence file was submitted as FASTA-formatted text. The results were written to different output files, e.g. unigenes, contigs and singlets. For SSR identification, we combined the contig and singlet sequences to form a non-redundant sequence data set. The SSR detection tool SSR Locator was used to detect EST-SSR loci. EST-derived SSRs were considered to contain motifs ranging in length from 2 to 6 nucleotides, with dinucleotide repeat numbers ≥9, trinucleotide repeat numbers ≥6, tetranucleotide repeat numbers ≥5, pentanucleotide repeat numbers ≥4, hexanucleotide repeat numbers ≥3, and compound SSR motif length ≥24 bp.
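The repeat-number thresholds listed above can be illustrated with a minimal Python sketch. This is not the SSR Locator algorithm; it is a simplified regular-expression scan that assumes perfect (uninterrupted) repeats only, and it does not handle compound SSRs or mononucleotide runs. The names MIN_REPEATS, is_primitive and find_ssrs are hypothetical and chosen only for this illustration.

```python
import re

# Minimum repeat counts per motif length, taken from the Methodology text.
MIN_REPEATS = {2: 9, 3: 6, 4: 5, 5: 4, 6: 3}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit
    (e.g. 'AGAG' is 'AG' twice and is therefore rejected)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k)
                   for k in range(1, n))

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)

# Toy sequence (invented for illustration): (AG)10 and (AAG)6 with short flanks.
example = "TTT" + "AG" * 10 + "CCC" + "AAG" * 6 + "T"
print(find_ssrs(example))  # [(3, 'AG', 10), (26, 'AAG', 6)]
```

The primitive-motif check matters because a run such as (AG)10 also matches the tetranucleotide and hexanucleotide patterns as (AGAG)5 and (AGAGAG)3; without the check the same locus would be counted several times.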

Results and Discussions

ESTs are often represented by redundant cDNA sequences, making it difficult to analyze them effectively for SSRs. To overcome this problem, the CAP3 program was used. The reduction in redundancy serves as a measure of the degree of overlap between EST sequences. The objective was to eliminate redundancy in the EST sequences and arrive at contiguous sequences (contigs) that can be used for the analysis of SSRs. CAP3 is a commonly used program [24], which identifies overlapping sequences and generates contigs with consensus sequences. The 21730 redundant EST sequences retrieved from NCBI and scanned for Class I microsatellite repeats represented approximately 9.25 Mb of the Catharanthus roseus genome. In this dataset, 632 SSRs were detected, corresponding to 1 SSR per 14.6 kb. During pre-processing, 835740 bp of empty vectors, low-quality sequences and poly A/T tails were removed. Trimming of poly A and poly T tails resulted in the removal of 9% of the original dataset. The rest of the sequences were clustered and assembled into a non-redundant dataset of 5928 unique gene sequences (1227 contigs and 4704 singlets). Scanning for Class I microsatellites in this non-redundant dataset revealed 502 unique SSR-containing sequences (see Table 3). This amounts to 1 SSR per 10.21 kb of non-redundant EST sequence (see Table 1). The reduction in redundancy of Class I SSRs obtained by trimming and clustering into the non-redundant dataset is shown in Figure 1. Seventeen cases were found where two microsatellites were immediately adjacent to each other; 57 ESTs contained two adjacent repeats at a distance of <10 bp from each other. Cardle et al. 2000 [25] estimated the average distances (in kb) between SSRs in sets of non-redundant ESTs of various plants, such as rice (3.4), soybean (7.4), tomato (11.1), Arabidopsis (13.8), poplar (14.0) and cotton (20.0), through a comprehensive computational study.
Considering the same criteria in the present study, SSRs occur with a frequency of 1 SSR per 14.6 kb in Catharanthus ESTs. This suggests that the frequency of cDNA-SSRs in the expressed portion of the Catharanthus genome is lower than in rice, soybean, tomato and Arabidopsis, close to that of poplar, and higher than in cotton.
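The density figures quoted in this section follow from simple arithmetic, which the short Python sketch below reproduces. It assumes 1 kb = 1000 bp and 1 Mb = 1,000,000 bp, uses only numbers stated in the text, and the function name kb_per_ssr is hypothetical. The size of the non-redundant dataset is not given in the text, so the value computed here is only what the reported density implies.

```python
def kb_per_ssr(total_bp, n_ssrs):
    """Average spacing between SSRs in kilobases (assuming 1 kb = 1000 bp)."""
    return total_bp / 1000 / n_ssrs

# Redundant dataset: about 9.25 Mb of EST sequence containing 632 SSRs.
redundant_spacing = kb_per_ssr(9.25e6, 632)
print(round(redundant_spacing, 1))  # 14.6

# Non-redundant dataset: 502 SSRs at 1 SSR per 10.21 kb implies this many kb.
implied_nonredundant_kb = 502 * 10.21
print(round(implied_nonredundant_kb))  # 5125

# Average spacing (kb) reported by Cardle et al. for other plants.
cardle_kb = {"rice": 3.4, "soybean": 7.4, "tomato": 11.1,
             "Arabidopsis": 13.8, "poplar": 14.0, "cotton": 20.0}

# Species whose ESTs carry SSRs more densely than 1 per 14.6 kb.
denser = [sp for sp, kb in cardle_kb.items() if kb < 14.6]
print(denser)
```

A smaller kb-per-SSR value means a higher SSR frequency, so the species collected in denser are those with more frequent EST-SSRs than the 14.6 kb figure.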

Figure 1

Distribution of EST-SSRs based on the motifs.

The occurrence of the individual SSR motifs among the non-redundant set of 502 SSRs is summarized in Table 2. The SSR unit sizes were not evenly distributed: 221 (44%) were mononucleotide, 72 (14%) dinucleotide, 131 (26%) trinucleotide, 27 (5%) tetranucleotide, 21 (4%) pentanucleotide and 22 (4%) hexanucleotide microsatellites; 1.6% of the microsatellites were of compound type. The relative abundance of mononucleotides even after the trimming of poly A/T tracts indicates that they originate from the genome rather than from the poly(A) tails at the ends of mRNA. Regarding dimeric SSRs, the motifs AG (41%) and AC (24%) were by far the most common, whereas AT was present only at low abundance (7%). The deficiency of AT SSRs in EST sequences is in accordance with reports from rice [26], Arabidopsis [25] and maize [27]. Depending on the reading frame, the AG/CT motif can represent the codons GAG, AGA, UCU and CUC in the mRNA population, which code for E, R, S and L, respectively. Since L is among the most frequent amino acids in proteins, the abundance of AG/CT in expressed sequences is plausible. CG repeats are the rarest dimeric repeats in cereal species [28], and in the present study the CG/GC motif was completely absent. Among trimeric microsatellites, AAG (43%), AGG (22%) and AGC-CCG (19%) were the most common motifs. ACG also accounted for 14%. In plants, AAG is a common motif, while CCG is a specific feature of monocot genomes [14]. Lu et al. 2010 [29] found (AAG)n to be the most abundant repeat motif in Gossypium barbadense. Siju et al. [30] also found (AAG)n to be the most abundant motif in turmeric, accounting for 8.2%. Similarly, trimeric motifs such as AGG/CCT and AGC/GCT are fairly well represented in rice [26], maize [27], pearl millet [28] and barley [14]. ATT represented only 3% of the total trimeric SSRs. In most monocots and dicots, ATT is the least frequent trimeric motif because TAA-based variants code for stop codons, which directly affect protein synthesis [27].
The most frequent tetrameric microsatellite motifs were AAAC/GTTT and AAAT/ATTT. Penta- and hexanucleotide repeats occurred at the lowest frequency. Across all repeat motifs, the most common SSR motif derived from the ESTs was A/T (36.25%), followed by AAG/CTT (8.56%), AG/CT (8.16%) and G/C (7.76%) (Figure 2). Each of the remaining repeat motifs contributed less than 5% of the total SSR motifs. Among the 16 repeat types, the most frequent repeat motifs were A/T, AG/CT, AAG/CTT, AAAC/GTTT, AAAAC/GTTTT and AAAGCC/GGCTTT, which accounted for 56.94%, 57.20%, 32.82%, 37.03%, 42.85% and 22.72% of the SSRs within their respective repeat classes.
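Paired labels such as AG/CT or AAG/CTT group a motif with its reverse complement, since both describe the same locus read from opposite strands, and all cyclic shifts of a motif (AG and GA, for example) describe the same repeat. A minimal Python sketch of this grouping is given below. It is an illustration only, not the procedure used by SSR Locator, and the names canonical_motif and label are hypothetical.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    """Reverse complement of a DNA motif, e.g. 'AAG' -> 'CTT'."""
    return motif.translate(COMPLEMENT)[::-1]

def rotations(motif):
    """All cyclic shifts of a motif, e.g. 'AAG' -> ['AAG', 'AGA', 'GAA']."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]

def canonical_motif(motif):
    """Alphabetically smallest form among all cyclic shifts of the motif
    and of its reverse complement."""
    motif = motif.upper()
    return min(rotations(motif) + rotations(reverse_complement(motif)))

def label(motif):
    """Paired label in the 'motif/reverse complement' style used in the text."""
    canon = canonical_motif(motif)
    return canon + "/" + reverse_complement(canon)

for m in ["CT", "GA", "TTC", "GTTT", "GGCTTT"]:
    print(m, "->", label(m))
```

With this grouping, CT and GA both fall into the AG/CT class and TTC falls into AAG/CTT, so counts can be pooled per class before proportions are calculated.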

Figure 2

Reduction in redundancy by trimming (Poly A/T tails) and assembling Catharanthus roseus ESTs.

Conclusion

Microsatellites serve diverse roles in the field of plant genomics. EST databases provide a valuable resource for the development of microsatellite markers that are associated with transcribed genes. Developing SSR markers from EST databases saves both cost and time, provided that sufficient EST sequences are available. In the present study, we identified 502 non-redundant hypervariable microsatellites from the EST data of Catharanthus roseus using the SSR identification tool SSR Locator. The frequency analysis described in this study could help in designing suitable repeat probes for the effective targeting and isolation of microsatellite repeats from Catharanthus. Moreover, these non-redundant SSR resources can be used to design informative SSR primers for studies of genetic variation, linkage mapping, comparative genomics and the distribution of genes on the chromosomes of Catharanthus species.

18 in total

1. Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes.

Authors: Michele Morgante; Michael Hanafey; Wayne Powell
Journal: Nat Genet Date: 2002-01-22 Impact factor: 38.330

2. Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat.

Authors: Ramesh V Kantety; Mauricio La Rota; David E Matthews; Mark E Sorrells
Journal: Plant Mol Biol Date: 2002 Mar-Apr Impact factor: 4.076

3. In silico analysis on frequency and distribution of microsatellites in ESTs of some cereal species.

Authors: Rajeev K Varshney; Thomas Thiel; Nils Stein; Peter Langridge; Andreas Graner
Journal: Cell Mol Biol Lett Date: 2002 Impact factor: 5.787

4. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets.

Authors: Geo Pertea; Xiaoqiu Huang; Feng Liang; Valentin Antonescu; Razvan Sultana; Svetlana Karamycheva; Yuandan Lee; Joseph White; Foo Cheung; Babak Parvizi; Jennifer Tsai; John Quackenbush
Journal: Bioinformatics Date: 2003-03-22 Impact factor: 6.937

5. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

6. Tandem repeats finder: a program to analyze DNA sequences.

Authors: G Benson
Journal: Nucleic Acids Res Date: 1999-01-15 Impact factor: 16.971

7. Maize simple repetitive DNA sequences: abundance and allele variation.

Authors: E C Chin; M L Senior; H Shu; J S Smith
Journal: Genome Date: 1996-10 Impact factor: 2.166

8. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.).

Authors: T Thiel; W Michalek; R K Varshney; A Graner
Journal: Theor Appl Genet Date: 2002-09-14 Impact factor: 5.699

9. Isolation of EST-derived microsatellite markers for genotyping the A and B genomes of wheat.

Authors: I. Eujayl; M. E. Sorrells; M. Baum; P. Wolters; W. Powell
Journal: Theor Appl Genet Date: 2002-02 Impact factor: 5.699

10. Development and mapping of simple sequence repeat markers for pearl millet from data mining of expressed sequence tags.

Authors: S Senthilvel; B Jayashree; V Mahalakshmi; P Sathish Kumar; S Nakka; T Nepolean; Ct Hash
Journal: BMC Plant Biol Date: 2008-11-27 Impact factor: 4.215
