Mining functional microsatellites in legume unigenes.

Abstract

Highly polymorphic and transferable microsatellites (SSRs) are important for comparative genomics, genome analysis and phylogenetic studies. Development of novel species-specific microsatellite markers remains costly and labor-intensive. Interest has therefore shifted from genomic to genic markers, which show high inter-species transferability because they are developed from conserved coding regions of the genome. This study presents a comparative analysis of genic microsatellites in nine important legumes (Arachis hypogaea, Cajanus cajan, Cicer arietinum, Glycine max, Lotus japonicus, Medicago truncatula, Phaseolus vulgaris, Pisum sativum and Vigna unguiculata) and two model plant species (Oryza sativa and Arabidopsis thaliana). Screening a total of 228090 putative unique sequences spanning 219610522 bp with the microsatellite search tool MISA identified 36248 microsatellite motifs, excluding mononucleotide repeats, in 12.18% of the unigenes. The frequency of legume unigene-derived SSRs was one SSR in every 6.0 kb of analyzed sequence. Trinucleotide repeats were predominant in all species except C. cajan, in which dinucleotide repeats were more prevalent than trinucleotide repeats. Dinucleotide and trinucleotide repeats together accounted for more than 90% of the total microsatellites. Among dinucleotide and trinucleotide repeats, AG and AAG motifs, respectively, were the most frequent. Microsatellite-positive chickpea unigenes were assigned Gene Ontology (GO) terms to identify their possible roles in various molecular and biological functions. These unigene-based microsatellite markers will prove valuable for recording allelic variance across germplasm collections, gene tagging and searching for putative candidate genes.

Keywords: Functional annotation; Legumes; Microsatellites; SSRs; Unigenes

Year: 2011 PMID: 22125396 PMCID: PMC3218422 DOI: 10.6026/97320630007264

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background:

Comparative genomics is an established tool for genome analysis, annotation and evolutionary studies [1]. Coding regions, in particular, can be exploited for developing DNA markers, which have already proved very useful in comparative studies. Microsatellites or Simple Sequence Repeats (SSRs) are ubiquitous in eukaryotic genomes and are non-randomly distributed across genomic regions. Microsatellites provide a rich source of hypervariable co-dominant markers owing to their high mutation rates, which generate allelic variation in array length [2]. Microsatellites have been implicated in genome evolution, gene regulation or functional evolution of genes [3]. They are important tools for comparative mapping because of their high polymorphism and transferability. Genic Microsatellite Markers (GMMs) have been used extensively in areas such as genome characterization, genome mapping, comparative genomics, phylogenetic studies, population genetics and molecular breeding [4]. In general, microsatellites are identified from both non-coding and coding regions of the genome. Standard methods for developing microsatellite markers require a considerable amount of time, money and labour [5]. Moreover, microsatellites developed by these standard methods show an element of bias depending on the method or probe used for their development. Recently, researchers have shifted their attention from genomic markers to genic markers, which represent coding sequences or the transcriptome [4]. Advances in genomics have led to the accumulation of a huge amount of sequence data in the public domain, including vast collections of expressed sequence tags (ESTs) and unigenes. These sequence data provide an alternative route to the identification and development of molecular markers. ESTs are a potentially rich source of GMMs [4]. 
GMMs are widely favoured as molecular markers because they are inexpensive to develop, represent transcribed genes/coding regions, and often allow a putative function to be deduced by a homology search. Development of EST-derived microsatellite markers, however, suffers from the high redundancy of EST sequences, which yields multiple markers at the same locus. To overcome this limitation, unigenes are derived by clustering ESTs into singletons and contigs. Microsatellite markers developed from these unigenes can be used to detect variation in the functional genome with unique identity and position [4]. Parida and co-workers [6] identified and characterized microsatellite motifs in the unigenes available for five cereal crops (rice, wheat, maize, sorghum, barley) and Arabidopsis. These unigene-derived microsatellite (UGMS) markers have high inter-specific transferability because they are developed from conserved coding regions of the genome. Moreover, they serve as a potential tool for studying functional diversity and genome evolution patterns more accurately. Unigene-derived microsatellite markers would therefore be of great use for comparative mapping and phylogenetic analysis, and would facilitate the development of syntenic networks for understanding the evolution of genes and genomes. Fabaceae (earlier included in Leguminosae) is the third largest family of flowering plants and is economically important. Members of Fabaceae include Arachis hypogaea, Cajanus cajan, Cicer arietinum, Glycine max, Pisum sativum and many other important legumes. These crops are staple foods and an essential source of dietary protein for vegetarians. Recent progress in the accumulation of genomic resources (ESTs, unigenes) for legumes has facilitated comparative mapping in these plant species. 
Although much has been achieved in developing genic microsatellite markers in plants, only a few studies have developed such resources for legumes. This study concentrates on identifying genic microsatellite repeats in legumes and on their comparative analysis, providing a resource for future studies of genome evolution, gene tagging and genetic diversity in legumes.

Methodology:

Sequence resources:

Unigene collections of 11 plant species, comprising nine legumes, namely Arachis hypogaea (Aha), Cajanus cajan (Cca), Cicer arietinum (Car), Glycine max (Gma), Lotus japonicus (Lja), Medicago truncatula (Mtr), Phaseolus vulgaris (Pvu), Pisum sativum (Psa) and Vigna unguiculata (Vun), and two model plants, Oryza sativa (Osa) and Arabidopsis thaliana (Ath), were subjected to in silico mining of microsatellites (Supplementary table 1). Unigene sequences for these species were downloaded from the NCBI UniGene database (ftp://ftp.ncbi.nih.gov/repository/UniGene/), except for C. cajan, C. arietinum and P. sativum, for which unigene data were not available in the public domain. For these three species, large numbers of ESTs are available in NCBI dbEST (http://www.ncbi.nlm.nih.gov/dbEST/). To remove the redundancy prevalent in ESTs, we used the sequence assembly program CAP3 [7] to cluster the ESTs into contigs and singletons, generating non-redundant unigenes for each of these legume species. The non-redundant unigene sequences were used to identify microsatellites and to perform gene ontology annotation.
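As a rough illustration of the clustering step, CAP3 writes its assembled contigs and the unassembled singlets to two FASTA files next to the input (`<input>.cap.contigs` and `<input>.cap.singlets`). The Python sketch below pools the two into a single unigene file. The function names, the `UG` identifier prefix and the file names are illustrative assumptions and are not taken from the authors' pipeline.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_cap3_output(est_fasta, out_fasta, prefix="UG"):
    """Pool CAP3 contigs and singlets into one unigene FASTA file.

    Assumes CAP3 was already run on `est_fasta`, for example with
    `cap3 ests.fasta > ests.cap.out`, so that the two output files exist.
    The `prefix` used for new identifiers is a hypothetical naming choice.
    """
    parts = [str(est_fasta) + ".cap.contigs", str(est_fasta) + ".cap.singlets"]
    count = 0
    with open(out_fasta, "w") as out:
        for part in parts:
            for header, seq in read_fasta(part):
                count += 1
                # Keep the original CAP3/EST name after the new identifier.
                out.write(f">{prefix}{count:06d} {header}\n{seq}\n")
    return count
```

Calling `merge_cap3_output("ests.fasta", "unigenes.fasta")` returns the number of unigenes written, which is the figure one would compare against the per-species unigene counts.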

Microsatellite mining and variability prediction:

A Perl script, MISA (MIcroSAtellite), was used to identify microsatellites in all these unigene sequences (http://pgrc.ipk-gatersleben.de/misa/) [8]. A simple sequence repeat with a motif length between 1 and 6 bp was identified as a microsatellite. Mononucleotide repeats were excluded from the analysis because poly A/T repeats are abundant and mostly result from sequencing artifacts and poly A tails. Repeat motifs such as AG, GA, TC and CT were placed in the same class, since they differ only by strand (complementary sequences) and/or reading frame. Compound microsatellites were defined as at least two different repeat motifs adjacent to each other without interruption. The mined microsatellites were analysed by motif length (di- to hexa-nucleotide), number and type of repeats, relative frequency of occurrence, and length class: class I (≥20 nucleotides) and class II (12 to <20 nucleotides) [9]. GC content was also calculated. Trinucleotide repeats were examined for the encoded amino acid motif and for codon bias.
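The search criteria above can be sketched in a few lines of Python. This is a regex-based stand-in for illustration, not MISA's own code: it finds perfect di- to hexa-nucleotide repeats, groups motifs that differ only by reading frame or strand into one class, and labels each hit as class I or class II by length. The minimum repeat counts in `MIN_REPEATS` are assumptions (the text does not state them), chosen so that every hit spans at least 12 nucleotides, and compound microsatellites are not handled.

```python
import re

# Assumed minimum repeat counts per motif length (not stated in the text);
# each threshold gives an SSR of at least 12 nucleotides.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Map a motif, its rotations and their reverse complements to one
    representative, so that AG, GA, CT and TC all become 'AG'."""
    motif = motif.upper()
    revcomp = motif.translate(COMPLEMENT)[::-1]
    variants = [m[i:] + m[:i] for m in (motif, revcomp) for i in range(len(m))]
    return min(variants)


def is_shorter_unit_repeated(motif):
    """True for motifs such as 'AA' or 'AGAG' that are a shorter unit repeated."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence):
    """Return sorted (start, motif_class, repeats, size_class) tuples.

    `start` is 0-based; size_class is 'I' for >= 20 nt and 'II' otherwise.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_shorter_unit_repeated(motif):
                continue  # mononucleotide runs and lower-order repeats
            length = len(match.group(0))
            size_class = "I" if length >= 20 else "II"
            hits.append((match.start(), canonical_motif(motif),
                         length // unit_len, size_class))
    return sorted(hits)


example = "TT" + "AG" * 7 + "CCGA" + "CTT" * 7 + "A"
print(find_ssrs(example))
# [(2, 'AG', 7, 'II'), (20, 'AAG', 7, 'I')]
```

In the example, the (CTT)7 repeat is reported under the AAG class because CTT is a rotation of the reverse complement of AAG, which mirrors the grouping rule described for AG, GA, TC and CT.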

Assessment of functional relevance of unigenes having SSRs:

Unigene sequences containing microsatellites were used for similarity search using Blast2GO [10] to identify their putative function. Unigene sequences not showing any match were considered as unique to that particular species. The C. arietinum microsatellite positive unigenes were run through a Gene Ontology (GO) assignment database in order to assess associations between SSR loci and biological processes, cellular components and molecular function of known genes.

Discussion:

The availability of large unigene collections for some legumes in the public domain allowed us to explore these resources for the presence and functional relevance of different microsatellite repeats. Unigenes, being longer and non-redundant, offer advantages over EST sequences for the development of microsatellite markers. However, developing microsatellites for species with no sequence information is expensive and time-consuming. To overcome this limitation, microsatellite markers developed in closely related species can be used [4]. EST-SSRs, which represent the coding regions of the genome, are expected to be conserved and to show a higher rate of cross-species transferability than genome-derived SSRs [11]. The success of EST-derived SSR markers across diverse taxonomic groups has been reported [12]. A total of 228090 putative unique sequences were screened for the presence of microsatellites, of which 12.18% (27791) contained the specified repeat motifs, excluding mononucleotide repeats, yielding 36248 unique SSRs. In legumes, a total of 156013 unique sequences were searched for microsatellites, of which only 6.85% (10688) contained microsatellites, representing 12220 unique SSRs (Table 1, see supplementary material). This is a relatively high abundance of SSRs for plant unique coding sequences compared with previous reports for some cereals [13] and wild Arachis species [14]. The abundance of SSRs is known to depend on the SSR search criteria, the size of the dataset, the database-mining tools and the species concerned [4]. The frequency of occurrence of unigene-derived SSRs was one SSR in every 6.0 kb. In previous reports, this frequency ranged from one SSR per 3.4 kb in rice to one per 20.0 kb in cotton [15]. In earlier reports, trinucleotide repeats generally formed the most common motif class in various plant species [15], regardless of the EST-SSR search criteria. 
However, an abundance of dinucleotide repeats has also been reported in many dicot species [16]. We also found trinucleotide repeats to be the most abundant, followed by dinucleotides, with the sole exception of C. cajan, in which dinucleotide repeats were more abundant than trinucleotide repeats (Figure 1a). Dinucleotide and trinucleotide repeats together accounted for more than 90% of all the microsatellites (Figure 1a).
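The percentages and the SSR density quoted in this paragraph can be re-derived from the reported totals. The short Python check below uses only numbers stated in the text; the helper names are illustrative. From the overall totals the density comes out near 6.06 kb per SSR, in line with the roughly 6.0 kb quoted. The base-pair total for the legume subset alone is not given in the text, so a legume-only density cannot be recomputed here.

```python
# Totals quoted in the text
TOTAL_UNIGENES = 228090
TOTAL_BP = 219610522
SSR_UNIGENES = 27791
TOTAL_SSRS = 36248
LEGUME_UNIGENES = 156013
LEGUME_SSR_UNIGENES = 10688


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def kb_per_ssr(total_bp, ssr_count):
    """Average sequence length (kb) per SSR, rounded to two decimals."""
    return round(total_bp / ssr_count / 1000, 2)


print(percent(SSR_UNIGENES, TOTAL_UNIGENES))          # 12.18
print(percent(LEGUME_SSR_UNIGENES, LEGUME_UNIGENES))  # 6.85
print(kb_per_ssr(TOTAL_BP, TOTAL_SSRS))               # 6.06
```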

Figure 1

a) Distribution of microsatellite repeats in various legumes, rice and Arabidopsis; b) Distribution of Class I and Class II microsatellite repeats

Tetranucleotide (~2.0%), pentanucleotide (<1.0%) and hexanucleotide (~1.0%) repeats showed very low abundance in all species. Among single SSR motifs, the dinucleotide motif AG/CT was the most frequent [9, 13, 17]. The two most dominant motif types recorded in our search were AG and AT, in agreement with a study on cultivated peanut and wild Arachis species [14]. The low abundance of “CG” repeats may be attributed to their tendency to form secondary structures (hairpins), leading to selective pressure against “CG” accumulation in genomes. Microsatellites were also classified into two classes by length: Class I microsatellites, at least 20 nucleotides long, and Class II microsatellites, shorter than 20 nucleotides. Class II microsatellites were more abundant (>70%) in all species (Figure 1b). Among trinucleotide repeat motifs, AAG was the most abundant; it is the second most abundant motif in Arabidopsis [6]. In other plant species, the most frequent trinucleotide repeat motifs were AAC/TTG in wheat, AAG/TTC in soybean, and CCG/GGC in barley, rice, maize and sorghum [13, 8, 18]. Previous studies on Arabidopsis and soybean [15] also reported an abundance of the trinucleotide motif AAG, in contrast to the abundance of the CCG motif in cereal species [6]. The trinucleotide repeats code for 21 amino acids and a stop codon. The predicted amino acid pattern for the trinucleotide motifs detected is shown in supplementary figure 1. CTA/CTC/CTG/CTT/TTA/TTG motifs coding for leucine were the most common, followed by AGC/AGT/TCA/TCC/TCG/TCT coding for serine and GAA/GAG coding for glutamic acid. 
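The grouping of trinucleotide motifs by encoded amino acid can be reproduced with the standard genetic code. The Python sketch below is a minimal illustration, assuming the motif is read on the sense strand; the function names are hypothetical. It lists the codons for a given amino acid and shows which amino acid a repeated motif would encode in each of the three reading frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_for(amino_acid):
    """Return the sorted codons that encode a one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def frame_amino_acids(motif):
    """Amino acid encoded by a repeated trinucleotide motif in each of the
    three reading frames (the three rotations of the motif)."""
    motif = motif.upper()
    return [CODON_TABLE[motif[i:] + motif[:i]] for i in range(3)]


print(codons_for("L"))           # ['CTA', 'CTC', 'CTG', 'CTT', 'TTA', 'TTG']
print(codons_for("S"))           # ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT']
print(frame_amino_acids("AAG"))  # ['K', 'R', 'E']
```

The last line shows why reading frame matters: an (AAG)n repeat encodes lysine, arginine or glutamic acid depending on the frame in which it is translated.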
The abundance of small/hydrophilic amino acid repeat motifs such as serine in the unigenes of cereals and Arabidopsis is explainable, since these repeats are tolerated in many proteins, whereas strong selection pressure possibly eliminates codon repeats encoding hydrophobic and other amino acids [19]. Trinucleotide repeats tend to preserve codon bias, and their frequency varies significantly so as to avoid frameshift mutations in coding regions [20]. In silico identification of SSRs from sequence resources such as genomic sequences, ESTs or unigenes is a low-cost and easy method for developing microsatellite markers. Such markers can help in understanding the nature and possible biological functions of the underlying sequences. EST-SSRs have been of great interest to researchers, and there are many recent reports on the development of EST-SSR markers in plants such as soybean [21], potato [22], seabuckthorn [23] and many more. To characterize unigene sequences harboring SSRs, we performed a sequence similarity search against the non-redundant NCBI protein database. On average, more than 70% of unigenes with SSRs showed homology to genes of known function in each species under study, with the exception of C. cajan, in which only 30% of such unigenes did. The remaining unigenes, which showed no hit in the similarity search, were considered organism-specific. Most of the unigene sequences represented enzymes of general metabolism, as reported earlier [17]. On the basis of GO annotation, microsatellite-positive chickpea unigenes were assigned GO terms associated with biological process, cellular component and molecular function. For biological processes, C. arietinum unigenes were assigned to thirty-three categories (Figure 2a), the largest being “transport” (15.5%). For cellular components, unigenes were assigned to nineteen categories, the largest being “plastid” (17.45%) (Figure 2b). 
For molecular functions, the unigenes were assigned to twenty-two categories, the largest being “binding” (18.18%) (Figure 2c). In general, microsatellite-containing C. arietinum unigene sequences matched proteins with distinct molecular functions such as binding, catalytic, transport, enzyme regulator and structural activities in different biological processes and in cellular and sub-cellular organization. Unigenes related to biological processes such as response to abiotic and biotic stresses should be explored as candidates for studying their role in the response to that particular stress or trait. One favorable approach would be a marker-trait association study based on phenotypic data and allelic variance across diverse collections.

Figure 2

Gene Ontology annotation of SSR positive Cicer arietinum unigenes. a) Biological process; b) Cellular component; and c) Molecular function

Conclusion:

The present study focused on the in silico mining and identification of microsatellites from the unique coding sequences of nine members of the Fabaceae family and two model plant species. Microsatellite markers developed from conserved coding regions of the genome show higher transferability through cross-amplification in related species than microsatellites developed from genomic regions. Computational development of microsatellite markers from coding regions has reduced the cost significantly and allows their use in related species with less sequence information. The frequency and types of microsatellites showed marked variability among the legume unigenes. Trinucleotide repeats were predominant in all the species analysed except C. cajan. Because unigene sequences are derived from the expressed portion of the genome, markers developed from these resources can be assayed as gene-based functional markers for diversity assessment, gene mapping and marker-assisted selection. To characterize unigene sequences with SSRs, we performed a sequence similarity search against the non-redundant NCBI protein database. Unigene-derived markers may be implicated in biological, cellular and molecular functions and provide an opportunity to investigate the possible role of microsatellites in various gene functions.

1. Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat.

Authors: Ramesh V Kantety; Mauricio La Rota; David E Matthews; Mark E Sorrells
Journal: Plant Mol Biol Date: 2002 Mar-Apr Impact factor: 4.076

2. Computational and experimental characterization of physically clustered simple sequence repeats in plants.

Authors: L Cardle; L Ramsay; D Milbourne; M Macaulay; D Marshall; R Waugh
Journal: Genetics Date: 2000-10 Impact factor: 4.562

3. Mining and survey of simple sequence repeats in expressed sequence tags of dicotyledonous species.

Authors: Siva P Kumpatla; Snehasis Mukhopadhyay
Journal: Genome Date: 2005-12 Impact factor: 2.166

4. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential.

Authors: S Temnykh; G DeClerck; A Lukashova; L Lipovich; S Cartinhour; S McCouch
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

5. Differential distribution of simple sequence repeats in eukaryotic genome sequences.

Authors: M V Katti; P K Ranjekar; V S Gupta
Journal: Mol Biol Evol Date: 2001-07 Impact factor: 16.240

6. Development of EST-based new SSR markers in seabuckthorn.

Authors: Ankit Jain; Rajesh Ghangal; Atul Grover; Saurabh Raghuvanshi; Prakash C Sharma
Journal: Physiol Mol Biol Plants Date: 2010-12-09

7. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research.

Authors: Ana Conesa; Stefan Götz; Juan Miguel García-Gómez; Javier Terol; Manuel Talón; Montserrat Robles
Journal: Bioinformatics Date: 2005-08-04 Impact factor: 6.937

8. Comparative genomics reveals conservative evolution of the xylem transcriptome in vascular plants.

Authors: Xinguo Li; Harry X Wu; Simon G Southerton
Journal: BMC Evol Biol Date: 2010-06-21 Impact factor: 3.260

9. ESTs from a wild Arachis species for gene discovery and marker development.

Authors: Karina Proite; Soraya C M Leal-Bertioli; David J Bertioli; Márcio C Moretzsohn; Felipe R da Silva; Natalia F Martins; Patrícia M Guimarães
Journal: BMC Plant Biol Date: 2007-02-15 Impact factor: 4.215

10. A White Campion (Silene latifolia) floral expressed sequence tag (EST) library: annotation, EST-SSR characterization, transferability, and utility for comparative mapping.

Authors: Maria Domenica Moccia; Christine Oger-Desfeux; Gabriel Ab Marais; Alex Widmer
Journal: BMC Genomics Date: 2009-05-25 Impact factor: 3.969