Literature DB >> 21083898

Functionally relevant microsatellites in sugarcane unigenes.

Swarup K Parida¹, Awadhesh Pandit, Kishor Gaikwad, Tilak R Sharma, Prem Shankar Srivastava, Nagendra K Singh, Trilochan Mohapatra.

Abstract

BACKGROUND: Unigene sequences constitute a rich source of functionally relevant microsatellites. The present study was undertaken to mine the microsatellites in the available unigene sequences of sugarcane for understanding their constitution in the expressed genic component of its complex polyploid/aneuploid genome, assessing their functional significance in silico, determining the extent of allelic diversity at the microsatellite loci and for evaluating their utility in large-scale genotyping applications in sugarcane.
RESULTS: The average frequency of perfect microsatellite was 1/10.9 kb, while it was 1/44.3 kb for the long and hypervariable class I repeats. GC-rich trinucleotides coding for alanine and the GA-rich dinucleotides were the most abundant microsatellite classes. Out of 15,594 unigenes mined in the study, 767 contained microsatellite repeats and for 672 of these putative functions were determined in silico. The microsatellite repeats were found in the functional domains of proteins encoded by 364 unigenes. Its significance was assessed by establishing the structure-function relationship for the beta-amylase and protein kinase encoding unigenes having repeats in the catalytic domains. A total of 726 allelic variants (7.42 alleles per locus) with different repeat lengths were captured precisely for a set of 47 fluorescent dye labeled primers in 36 sugarcane genotypes and five cereal species using the automated fragment analysis system, which suggested the utility of designed primers for rapid, large-scale and high-throughput genotyping applications in sugarcane. Pair-wise similarity ranging from 0.33 to 0.84 with an average of 0.40 revealed a broad genetic base of the Indian varieties in respect of functionally relevant regions of the large and complex sugarcane genome.
CONCLUSION: Microsatellite repeats were present in 4.92% of sugarcane unigenes, for most (87.6%) of which functions were determined in silico. High level of allelic diversity in repeats including those present in the functional domains of proteins encoded by the unigenes demonstrated their use in assay of useful variation in the genic component of complex polyploid sugarcane genome.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：
DNA, Plant

Year: 2010 PMID： 21083898 PMCID： PMC3017843 DOI： 10.1186/1471-2229-10-251

Source DB: PubMed Journal: BMC Plant Biol ISSN： 1471-2229 Impact factor: 4.215

Background

Sugarcane (Saccharum sp.) is a complex polyploid belonging to the family Poaceae of the tribe Andropogoneae. It is an important commercial sugar producing crop and a source of approximately 50% of the world's sugar and alcohol. The polyploid/aneuploid nature with variation in chromosome number has been largely responsible for its genetic and taxonomic complexity [1]. Characterization of such large genomes is greatly facilitated by the use of molecular markers. Microsatellite or simple sequence repeat (SSR) markers are being preferred because of their co-dominant inheritance, reproducibility, multi-allelic nature, chromosome-specific location and wide genomic distribution. These markers are amenable to high throughput genotyping due to multiplexing and efficient resolution of amplicons by automated fragment analysis [2,3]. In sugarcane, Cordeiro et al. (2000) [2] and Parida et al. (2009a) [4] developed a large number of microsatellite markers from the genomic sequences. Pinto et al. (2004; 2006) [5,6] and more recently Oliveira et al. (2009) [7] also designed such markers from the sugarcane ESTs, which were used for constructing high resolution functional genetic linkage map of Saccharum spp. [8]. However, information on a limited number of these genic microsatellite markers is available in the public domain. Recently, the EST sequences have been assembled into unigenes [9], which is expected to provide non-redundant, locus specific and novel gene-based functional markers for sugarcane having a large genome not amenable to complete sequencing. The unigene sequences of sugarcane have not yet been analyzed for their microsatellite constitution and compared with the other small genome members of the grass family. In India, systematic breeding of sugarcane has resulted in the development of a number of varieties with high productivity and stress tolerance by inter-specific hybridization [10]. However, the genetic base of modern Indian sugarcane cultivars is considered narrow due to use of a limited number of parental species clones in cross hybridization and repeated intercrossing of hybrids [11]. Understanding the extent of natural variation at molecular level is essential to develop new strategies for sugarcane improvement. Earlier, molecular markers such as RAPD (Randomly Amplified Polymorphic DNA), AFLP (Amplified Fragment Length Polymorphism), and maize and sugarcane genomic microsatellites have been used for this purpose [4,12-15]. No effort has yet been made to understand the genetic diversity of Indian sugarcane cultivars based on functionally relevant genic regions of its complex genome. The present study was undertaken to mine the available unigene sequences of sugarcane (Saccharum sp.) to understand the microsatellite structure and distribution in the expressed genic component of the genome, assess their functional significance in silico, design primers from the flanking regions of the identified microsatellites, assess the efficiency of a set of fluorescent dye labeled primers in genotyping using automated fragment analysis system and determine functional diversity among different species, related genera and Indian varieties of sugarcane.

Results

Frequency, distribution and organization of microsatellites in sugarcane unigenes

The type, frequency and relative distribution of the microsatellites in the unigene sequences of sugarcane are given in Table 1. The perfect microsatellite (excluding the mononucleotides) frequency in the unigenes of sugarcane was one in every 10.9 kb and the proportion of microsatellite carrying unigenes was 3.7% (584 out of 15,582). When 1,871 (12%) mononucleotide microsatellites were included, the proportion of unigenes carrying microsatellites increased to 17.4%. The mononucleotides in sugarcane showed a strong bias (84.6%) towards A/T repeat, with the majority (89%) being 9 to 30 bases long and the remaining (11%) extending up to 69 bases (T69). A total of 841 perfect microsatellites were identified in 584 unigene sequences of sugarcane. One hundred sixty-seven (28.6%) of these unigenes contained multiple microsatellites (that accounted for 424 microsatellites) which were interrupted by more than 100 nucleotides and the remaining 417 (71.4%) unigenes had a single microsatellite each. In addition to the perfect microsatellites, we identified 183 compound microsatellites, of which 74.9% were interrupting types and the rest being non-interrupting types (Table 1). The trinucleotide repeat motifs were the most prevalent (73.1%) class of microsatellites followed by dinucleotide (23.8%), tetranucleotide (1.8%), pentanucleotide (0.83%) and hexanucleotide (0.47%) repeat motifs (Table 1). The GC rich repeat motifs GCA/GCC/GCG/GCT (23%) coding for alanine were most abundant followed by arginine (AGA/AGG/CGA/CGC/CGG/CGT, 22%) and glycine (GGA/GGC/GGT, 11%) (see Additional file 1). Among the dinucleotide repeats, the most and least frequent motifs were GA and CG, which accounted for 21% and 1.4% of all the microsatellites, respectively (see Additional file 2).

Table 1

Distribution of microsatellites in the unigene sequences of sugarcane

Characters under study	Unigenes*
Number of sequences examined	15,594

Size (bp) of examined sequences	9,17,43,95

Number of identified perfect microsatellites	2,712 (17.4)

Number of perfect microsatellite containing sequences	2,230 (14.3)

Number of perfect microsatellite (excluding mononucleotides) containing sequences	584 (3.7)

Number of sequences containing more than one perfect microsatellites	167 (28.6)

Number of sequences containing single and unique perfect microsatellites	417 (71.4)

Number of mononucleotides	1,871 (12)

Number of dinucleotides	200 (23.8)

Number of trinucleotides	615 (73.1)

Number of tetranucleotides	15 (1.8)

Number of pentanucleotides	7 (0.83)

Number of hexanucleotides	4 (0.47)

Number of perfect microsatellites excluding mononucleotides	841 (5.4)

Size (kb) of sequences containing one perfect microsatellite	10.9

Number of perfect class I microsatellites	207 (24.6)

Size (kb) of sequences containing one perfect class I microsatellite	44.3

Number of primer pairs designed for perfect^amicrosatellites	810 (96.3)

Number of compound class I microsatellite containing sequences	183

Size (kb) of sequences containing one compound class I microsatellite	50.1

Number of compound class I microsatellites	183 (1.2)

Number of compound interrupting class I microsatellites	137 (74.9)

Number of compound non-interrupting class I microsatellites	46 (25.1)

Number of primer pairs designed for compound class I microsatellites	151 (82.5)

*The number in the bracket is the proportion expressed in percentage

aMononucleotides to hexanucleotides repeated up to 100 times without any interruption at a locus

Distribution of microsatellites in the unigene sequences of sugarcane *The number in the bracket is the proportion expressed in percentage aMononucleotides to hexanucleotides repeated up to 100 times without any interruption at a locus Out of 841 perfect microsatellites identified in sugarcane, 587 (69.8%) were found in the ORFs and the remaining were present either in the 3'UTRs (102, 12.1%) or in the 5'UTRs (152, 18.1%). The trinucleotide repeat motifs were significantly more frequent (about 86%) in the ORFs. In contrast, the GA-rich dinucleotide repeat motifs were more in the 5' (49%) and 3' (32%) UTRs. The density of longer motif containing perfect class I microsatellites was one in every 44.3 kb sequences, which accounted for 24.6% (207) of the total 841 microsatellites identified (Table 1, Figure 1).

Figure 1

Frequency and relative distribution of class I and class II microsatellite repeats in sugarcane unigenes. Comparative distribution of long hypervariable class I and potentially variable class II microsatellite repeats in the unigenes of sugarcane. Trinucleotide was the most abundant repeat-motif in both class I (56.5%) and class II (79%) category, which was followed by dinucleotide motifs.

Development of unigene derived microsatellite (UGMS) markers and evaluation of their polymorphic potential

The primer pairs could be designed for 810 perfect microsatellites that was 96.3% of the total microsatellites (841) identified in the present investigation. The primer sequences flanking all the perfect UGMS including 207 class I microsatellites with their Tm values and product sizes are given in the Additional file 3. Besides, the primer sequences for 151 compound class I UGMS were designed and provided in the Additional file 4. To validate the UGMS markers, 176 primer-pairs designed from different microsatellite containing unigenes were used in PCR amplification (see Additional file 3). One hundred sixty seven (94.9%) of these produced clear and reproducible amplicons, whereas remaining nine (5.1%) did not give amplification in the S. officinarum from which sequence the primers were designed. To verify that the primers did amplify the expected microsatellite repeat-motifs, the amplified products obtained with 19 of the primers in all the Saccharum species and related genera as well as cereals were sequenced and the presence of the target microsatellite motifs as well the flanking sequences was observed in all the cases (see Additional file 5). In silico polymorphism analysis was confined to Saccharum officinarum and five cereal species namely, rice, wheat, maize, Sorghum and barley for which unigene sequences were available in the database. The polymorphism based on variation in microsatellite repeat length was observed for a maximum of 163 primers (46.6%) between sugarcane and barley followed by 161 (42.5%) between sugarcane and wheat and least (92, 17.8%) between sugarcane and Sorghum (see Additional file 6). The actual level of polymorphism detected by automated fragment analysis using 47 fluorescent dye labeled primers was much higher than that based on in silico analysis although the trend was maintained. S. officinarum had maximum polymorphism with barley (92.7%) followed by wheat (90.6%), rice (85.8%), maize (67.7%) and Sorghum (61.4%). Forty-three (91.5%) of the 47 primers detected polymorphism (mean PIC of 0.85) among the 41 genotypes belonging to Saccharum species, related genera, tropical and sub-tropical Indian sugarcane varieties and five cereal species (Table 2). This included 21 (95.5%, mean PIC of 0.81) amplifying dinucleotide repeats, 16 (84.2%, PIC of 0.73) trinucleotide repeats, two (100%, PIC of 0.87) tetranucleotide repeats, two (100%, PIC of 0.86) pentanucleotide repeats and two (100%, PIC of 0.83) amplifying hexanucleotide repeats. All the 38 (100%, mean PIC of 0.82) primers that targeted amplification of class I repeats and five (55.6%, PIC of 0.55) of the nine that targeted class II repeats showed polymorphism among Saccharum species, genera, varieties and cereal species. The microsatellites in the 5' and 3'UTR sequences showed higher potential for polymorphism (15 out of 16, 94%, mean PIC of 0.83) as compared to that from the coding regions (28 out of 31, 90%, mean PIC of 0.77).

Table 2

Evaluation of the amplification efficiency and polymorphic potential of 47 fluorescent dye labeled primers

														Polymorphic potential

Sl. No.	Unigene Accession IDs'	Class IUGMS primers^A	Repeat-motifs	Location	Forward primer sequences (5'-3')	Reverse primer sequences (5'-3')	Putative functions	Actual annealing temperature (°C)	No. of locus	Total no. of alleles amplified	Size (bp) of allele (s) amplified	Type of allele size distribution	No. of heterozygous loci	Among sugarcane species and genera	Among sugarcane varieties

															P/M^B	PIC
1	CA297715	UGSuM2a	(AT)43	CDS	CTGTGTATATGTTCGTAGTTTG	CACTTAGTCACACTCTCACACAC	Sucrose phosphate synthase	55	2	20	206-302	Step-wise	1	P	P	0.86

		UGSuM2b								4	481-512	Mixed

2	CA278792	UGSuM5	(TA)28	CDS	TCACATCCATCATCCACAGC	TCCAATGCAAGCAAACTCAC	Maize-Cyclin III	55	1	23	80-170	Step-wise	0	P	P	0.84

3	AY596609	UGSuM11	(TA)21	3'UTRs	TGGTAACCCTAGGCAGGTGA	GTGCACCAGATTTGGATGGT	Fructose-bisphosphate aldolase	56	1	23	92-190	Step-wise	0	P	P	0.83

4	CA279221	UGSuM15a	(CCTCGC)6	CDS	GTTTAAGACAAGATGGTGTAGATG	TACATATTTACATTGTTACTCCGC	Hypothetical protein	56	2	17	170-247	Mixed	1	P	P	0.80

		UGSuM15b								3	360-378	Step-wise

5	CA278282	UGSuM16	(AT)18	CDS	GCGTCTTCATCATCTGCAAC	GCGTCTTCATCATCTGCAAC	Pathogenesis-related PRMS protein	55	1	22	150-244	Step-wise	0	P	P	0.82

6	CA253277	UGSuM17	(AG)18	CDS	TTTCCATTCTTCCATTCAACTG	GGCAGGCTGAGAGACTGTTC	Abcisic acid-protein kinase	55	1	10	90-188	Step-wise	1	P	P	0.78

7	CA227482	UGSuM18a	(GA)18	CDS	GGCGAGAGAGAGAGAGAGAGAG	AGGTGGAGATCTTGAGGTAGGC	Glycine decarboxylase	56	2	20	96-193	Mixed	1	P	P	0.81

		UGSuM18b								2	326-348	Step-wise

8	CA126180	UGSuM20a	(TCA)12	CDS	ATCCCTTATGCTACAGAAATGT	TTAGCCTAGAGGTTTGATTGAT	Acetyl-CoA synthetase	54	2	16	190-259	Step-wise	1	P	P	0.83

		UGSuM20b								4	392-428	Step-wise

9	CA177414	UGSuM21	(AGGA)9	CDS	CGCTCCCTCACCGTCATT	CTCCGCATCCTCGTCACC	Transcription regulator protein	62	1	20	160-252	Step-wise	0	P	P	0.81

10	CA223153	UGSuM22a	(GCG)12	3'UTRs	CTCCCTCCTCCTCCCGTTG	CTCTTGGGTGTGAACCAG	Polyadenylate-binding protein	64	2	19	96-187	Mixed	1	P	P	0.85

		UGSuM22b								3	322-358	Step-wise

11	CA122659^A	UGSuM24a	(TTTTC)7	5'UTRs	CTGTACAACAGCAATTATGAATCT	CTCGACTACGAGAGGATATGAT	Hypothetical protein	55	3	12	240-294	Mixed	1	P	P	0.86

		UGSuM24b								4	398-428	Step-wise

		UGSuM24c								5	531-571	Step-wise

12	BU103692^A	UGSuM26	(CT)17	5'UTRs	CTCGATCCCAGAGAGCTCCACAG	AGTACCGAATTCATTAAACTCCT	Beta-amylase	55	1	21	250-346	Step-wise	0	P	P	0.85

13	CA073284	UGSuM27a	(GGC)11	CDS	CTGCAGTACGGTCCGGAATC	GTACCACCATGGCTCTAGCTTC	30 S ribosomal protein S16	60	2	2	50-54	Mixed	1	P	P	0.84

		UGSuM27b								22	154-288	Mixed

14	AY596606	UGSuM33a	(AGC)10	CDS	CGAGGCACTGAACCCATATC	TGTTTGAACTGGATGGCGTA	Hypothetical protein	58	3	17	92-189	Mixed	1	P	P	0.84

		UGSuM33b								2	293-305	Step-wise

		UGSuM33c								2	431-491	Step-wise

15	CA268640	UGSuM34a	(AAG)10	CDS	TTACAAATGTAGCCTTGCCTTG	ATCTTTCCTTGCTTGCCTCTC	Soluble acid invertase	61	2	19	99-178	Mixed	1	P	P	0.83

		UGSuM34b								4	333-351	Step-wise

16	CA131350	UGSuM41	(CCG)10	CDS	ATCATTCTCCATCATTTCTCA	AGGCTCTTCAACCGTGCT	Unknown protein	54	1	20	120-210	Step-wise	0	P	P	0.80

17	CA133924	UGSuM42a	(CTCTCC)5	5'UTRs	TTCATACAGAAGAACCTCCAC	TCCATCAGAGACAAGCAGA	Auxin-independent growth promoter	54	3	12	130-212	Mixed	1	P	P	0.83

		UGSuM42b								8	325-367	Step-wise

		UGSuM42c								2	475-487	Step-wise

18	CA136599	UGSuM43a	(CCG)10	3'UTRs	CAAAGTGCTGTAGGGCTG	TTCAATGGGTGATAAGTGTGT	Ribose-phosphate pyrophosphokinase 1	55	2	17	90-187	Mixed	1	P	P	0.85

		UGSuM43b								2	332-368	Step-wise

19	CA139800	UGSuM44a	(CT)15	5'UTRs	TCCATCAAGCCGTTCCTC	GCCAAGCAGATAAAGAAGTG	Rudimentary enhancer	55	2	16	220-308	Step-wise	1	P	P	0.84

		UGSuM44b								1	419	-

20	CA171090	UGSuM45a	(AAAAG)6	CDS	ATCTCCTCTTATTCGTTCTGG	AGCAGCGTCTTATCTGGG	PAP fibrillin	56	3	5	91-141	Step-wise	1	P	P	0.79

		UGSuM45b								2	267-287	Step-wise

		UGSuM45c								2	368-393	Step-wise

21	CA196477	UGSuM46a	(GAC)10	CDS	ACTCCTCCCGCCTCCACTAC	CTCACCGAAGCAATCAAG	Hypothetical protein	60	2	15	101-188	Mixed	1	P	P	0.81

		UGSuM46b								3	341-377	Step-wise

22	CA228772	UGSuM47a	(GCC)10	CDS	ATTTATGGAGGAAGAAACGG	ATTACAAACAAGAAGAGCGG	Transport protein particle component	55	2	3	82-97	Step-wise	1	P	P	0.80

		UGSuM47b								14	215-270	Mixed

23	CA161416	UGSuM50	(TC)14	CDS	CTACTGCCGAGGAAAGATCG	GGAAAAGTTTGTGGCAAGGA	Hypothetical protein	58	1	16	94-188	Step-wise	0	P	P	0.80

24	CA228375	UGSuM73a	(CGC)8	5'UTRs	CTTTCAACCTCTACACCTCCAC	ACTAGAAGACTGAGAAGAACCAGT	40 S ribosomal protein S11	55	2	13	101-186	Mixed	1	P	P	0.82

		UGSuM73b								5	354-375	Step-wise

25	CA261182	UGSuM74	(ACA)8	CDS	TCAGCAGCTGTGAAGTTTCATT	CGTCTCTTTTGGGTTTCATCTC	Transcription regulator protein	55	1	18	190-271	Step-wise	0	P	P	0.81

26	CA219230	UGSuM75a	(TA)12	CDS	TTGTGCTGATGTTTCCTGCT	CAAGAGAAGATGCCATTAGCC	Patatin-like protein	55	2	13	94-176	Step-wise	1	P	P	0.83

		UGSuM75b								4	333-351	Step-wise

27	CA093071	UGSuM96a	(AT)11	CDS	TCAAACCAGGATCTAAGCTCAC	GGTAGTGCCATTGAGGTTGC	Putative apyrase	57	2	11	205-290	Mixed	1	P	P	0.81

		UGSuM96b								5	408-458	Step-wise

28	CA229840	UGSuM97a	(GA)11	5'UTRs	GCGAGAGAGATAGAGGGAGAGA	AGGTGCCGTTCATGAGGTAGT	Glycine decarboxylase	56	2	19	240-334	Step-wise	1	P	P	0.85

		UGSuM97b								2	453-469	Step-wise

29	CA112979	UGSuM149a	(AGC)7	CDS	GTTCAATCAAATCCCTCTCCTC	AGCTTGGTCAGCTCCTCATCGTT	Ubiquitin-specific protease 4 (UBP4)	60	2	4	104-141	Mixed	0	P	P	0.78

		UGSuM149b								3	276-297	Step-wise

30	CA116368	UGSuM150a	(GGC)7	CDS	ACACTGACCGATGGATCCTCTT	ATCAACGTGGACCAGATCTTCTT	hAT family dimerisation domain protein	60	2	12	90-167	Mixed	1	P	P	0.76

		UGSuM150b								2	279-291	Step-wise

31	CA280782	UGSuM177a	(TGC)7	CDS	GGTGCTGTCCCTATCACTAC	GCCCTTGTTTCTTTGTCTACT	Hypothetical protein	55	2	3	150-167	Mixed	0	P	P	0.79

		UGSuM177b								3	287-314	Step-wise

32	CA291445	UGSuM178	(GAC)7	5'UTRs	GGACTACTACGACTACTGCGA	ACCTTGCTTACATCTTCCTCT	O-diphenol-O-methyl transferase	54	1	15	100-196	Step-wise	0	P	P	0.80

33	CA244023	UGSuM186	(AG)10	CDS	AACATTTCGGCATTTGAAGC	GGTCTTTCTTGGGGATCTCTC	Ubiquitin C-terminal hydrolase	56	1	4	190-232	Step-wise	0	P	M	0.73

34	CA231668	UGSuM187a	(CT)10	CDS	CAACAATTGTCGAAGCCTCTC	TTTGCTTACCCCCTGTTGAC	ATP synthase	58	2	10	200-264	Step-wise	0	P	M	0.83

		UGSuM187b								6	376-418	Step-wise

35	AY596560	UGSuM188	(GA)10	3'UTRs	CCCAAGCGAGCTAGAGAGAG	TCTTCTTTCCTTCGCACAGC	Hypothetical protein	55	1	19	95-186	Mixed	0	P	P	0.81

36	CA241232	UGSuM189a	(CT)10	CDS	CCGCGACTCTCCTCTCTCT	GTTCTTCTCGGCGTTCCTC	Auxin-regulated protein	55	2	15	106-200	Step-wise	1	P	P	0.80

		UGSuM189b								3	345-377	Step-wise

37	CA133642	UGSuM196a	(AAAG)5	5'UTRs	GCTACTATGGACAACAGGG	ATGAAGAGACGAGACGAAGA	Cinnamoyl CoA reductase	54	2	18	90-181	Mixed	1	P	P	0.81

		UGSuM196b								2	373-393	Step-wise

38	CA134472	UGSuM197	(GA)10	3'UTRs	GAAGGAGCAGCAGCGCCAGT	GATTTGCCGTCCTAGGGTTT	Epsin N-terminal homology domain	56	1	22	200-290	Step-wise	0	P	P	0.84

39	CA297648	UGSuM343a	(CAG)6	5'UTRs	ACTCCTCCTCCTCGCCGT	TCTTGTTGTAGTAGCCCTTGT	SOUL heme-binding family protein	62	2	17	250-320	Mixed	1	P	M	0.87

		UGSuM343b								3	438-450	Step-wise

40	CA300679	UGSuM344	(CTC)6	CDS	CTATCCTCTTGTTGGGTCCT	TCCGCACCTCCGTTCACC	Nucleoside diphosphate kinase protein	55	1	1	260	Step-wise	0	M	M	0.0

41	CA084691	UGSuM345	(TC)8	CDS	TATACAAGAATGAAAGGTGAGAGA	AAGCATACTCCCTCTATCTCTATG	DC1 domain-containing protein	55	1	2	210-220	Step-wise	0	M	P	0.70

42	CA093455	UGSuM346	(AG)8	CDS	TATACGTAGTAGTGATGATGACCG	CTCCTTCGTCCAGTACCAGTAG	DNA-binding protein DF1	60	1	4	150-180	Step-wise	1	P	M	0.71

43	CA110745	UGSuM347a	(CT)8	5'UTRs	TCTGGCTTTATCGTAACTTGTAT	GAGCCTCGTTTGGGTGGCTTTC	Expressed protein	55	2	5	230-254	Step-wise	1	P	M	0.74

		UGSuM347b								1	365	-

44	CA112979	UGSuM348	(CG)8	CDS	CTACCTCCTCGTCTCCTCCCTCTT	AACAAGGAATATGGTCCCTGAG	Unknown protein	61	1	2	240-261	Mixed	0	M	M	0.0

45	CA116458	UGSuM349a	(TC)8	CDS	CAAGATGTACCCGGACATGGCT	TGCTATACTAGCTATCTCCTTCCT	Unknown protein	55	2	3	220-236	Step-wise	0	P	M	0.73

		UGSuM349b								2	385-395	Step-wise

46	CA123971	UGSuM464	(ACA)5	5'UTRs	GGCTACTTCAGACACGCA	TCTACGCATCAACCTCTCA	SNF2-domain-containing protein	55	1	1	220-232	Step-wise	0	M	M	0.0

47	CA125310	UGSuM465	(GCA)5	CDS	GCTAACCAACATCAGCAGT	AGGAGATTGACGAAGAAGAAG	Transducin family protein	53	1	2	260-279	Mixed	0	M	M	0.0

AUGSuM stands for unigene derived sugarcane microsatellite primers as indicated in the Additional file 3

BPolymorphic (P) and Monomorphic (M)

Evaluation of the amplification efficiency and polymorphic potential of 47 fluorescent dye labeled primers AUGSuM stands for unigene derived sugarcane microsatellite primers as indicated in the Additional file 3 BPolymorphic (P) and Monomorphic (M) Due to high polyploidy and heterozygosity of the Saccharum genome, most (28, 65.1%) of the primers amplified two to three different loci (each locus being designated as an UGMS marker) yielding a total of 60 loci with the number of alleles per locus ranging from 2 to 22 and the allele size varying from 82 to 408 bp across sugarcane species, genera and varieties (Figure 2) as well as five cereal species. The remaining 15 (34.9%) primers gave single locus amplification (Table 2) with 1 to 23 alleles each. Overall, the 43 polymorphic primers amplified an average of 7.42 alleles per UGMS marker locus with a total of 722 alleles across 74 loci. Forty-two (73 UGMS marker loci with 720 alleles, 97.7% polymorphic, mean PIC of 0.81) of the 43 informative primers revealed polymorphism between sugarcane species and related genera, whereas 37 (66 UGMS marker loci with 678 alleles, 86% polymorphic, PIC of 0.74) primers detected polymorphism among the Indian commercial sugarcane varieties (Figure 3). Twenty-six (34.7%) of the 75 UGMS marker loci showed heterozygosity in different genotypes used. Maximum heterozygous loci was detected in five sugarcane species (14, 18.7%) followed by commercial varieties (9, 12%) and minimum (4, 5.3%) in the three related genera and five cereal species.

Figure 2

Figure 3

Fragment length polymorphism detected among Indian sugarcane varieties using fluorescent dye labeled primers. Allelic variation in a representative set of 14 commercial Indian tropical and sub-tropical sugarcane varieties. Fourteen alleles were amplified by a class I UGMS marker (UGSuM17) designed from the unigene encoding Abcisic acid inducible protein kinase that contained (AG)18 repeat-motif. The amplicons generated by the fluorescent dye labeled primers were resolved in MegaBACE automated DNA sequencer and analyzed in Fragment Profiler software. Fragment size (bp) and average peak height for all the amplified alleles are indicated.

Different UGMS allele types detected by automated fragment analysis. Distribution of various allele types detected in Saccharum species, genera, commercial varieties and five cereal species using the automated fragment analysis. A, C, D, F and H: Multiple alleles in a single locus showing step-wise distribution, B, G and I: Multiple alleles in a single locus showing mixed distribution, E: Single allele in a unique locus, and J: Multiple loci showing both stepwise and mixed distribution in different loci. Fragment length polymorphism detected among Indian sugarcane varieties using fluorescent dye labeled primers. Allelic variation in a representative set of 14 commercial Indian tropical and sub-tropical sugarcane varieties. Fourteen alleles were amplified by a class I UGMS marker (UGSuM17) designed from the unigene encoding Abcisic acid inducible protein kinase that contained (AG)18 repeat-motif. The amplicons generated by the fluorescent dye labeled primers were resolved in MegaBACE automated DNA sequencer and analyzed in Fragment Profiler software. Fragment size (bp) and average peak height for all the amplified alleles are indicated.

Molecular basis of UGMS polymorphism and its functional significance

Comparison of fragment size (bp) of variant alleles amplified at 75 polymorphic UGMS loci with changes in number of repeats among Saccharum complex, varieties and five cereal species revealed purely step-wise allelic distribution for 51 (68%) loci, while remaining 24 (32%) loci showed mixed type of allele size distribution (Figure 2). To elucidate the distribution pattern of UGMS alleles showing fragment length polymorphism, 10 amplified size variant alleles each of two microsatellite primers namely, UGSuM2 and UGSuM27 showing both step-wise and mixed type of allele distribution were sequenced. High quality sequence alignment revealed that the size variant alleles showing step-wise distribution contained the expected microsatellite repeat sequences with conserved primer binding sites, but corresponded exactly to the step-wise multiples of the number of repeat units. The size variation of sequenced alleles was explained by differences in the number of repeat units. Mixed type of allele distribution resulted from both variation in the number of repeat-units and insertions/deletions in the sequences flanking the UGMS repeats. For instance, UGSuM2 (Figure 4A) and UGSuM27 (Figure 4B) showing both step-wise and mixed type allele size distribution had the expected (AT)n and (GGC)n UGMS motifs, respectively with varying number of repeat-units in different sugarcane species, genera, varieties and cereals along with a small stretch of nucleotide insertion/deletion in the flanking regions of target microsatellites.

Figure 4

DNA sequence alignment depicting the molecular basis of UGMS fragment length polymorphism among . Multiple sequence alignment of 10 size variant alleles showing both stepwise and mixed distributions amplified from sugarcane species, genera, varieties and five cereals for the two UGMS marker loci namely, UGSuM2 (A) and UGSuM27 (B). Variation in the number of repeats of (AT)n and (GGC)n microsatellite motifs at UGSuM2 and UGSuM27 loci, respectively and additional nucleotide insertions/deletions in the flanking sequences of the repeats are highlighted. In order to assess the functional significance of the UGMS, gene ontology classification of 767 unigenes carrying perfect and compound microsatellites was carried out. Six hundred seventy-two (87.6%) of these (see Additional files 3 and 4) could be functionally annotated and were shown to encode enzymes for sugar metabolism (39%), structural proteins (26%), transcription and translation factors (18%), signal transduction pathway proteins (8%), cell growth and development factors (5%) and disease resistance proteins (4%) (see Additional file 7). Three hundred sixty-four (54.2%) of 672 unigenes contained microsatellite repeats in 89 different functional domains of proteins encoded by these unigenes (see Additional file 8) which included cytochrome b/c oxidase, chlorophyll A/B binding, FAD and ubiquitin binding, sucrose synthase, alpha-amylase, acetyl COA, aldehyde dehydrogenase, glyceraldehyde-3-phosphate, lipase, S-adenosylmethionine, leucine rich repeat (LRR), ferritin, protein kinase, chitinase, basic leucine zipper (bZIP), zinc finger, TATA box, Myb and WRKY DNA binding domains (Figure 5). Fifteen primers designed from 15 different unigenes targeting such functional domains (see Additional file 9) that gave single locus amplification and step-wise allele distribution (Table 2) were selected for validation. All of these primers gave fragment length polymorphism among Saccharum complex, varieties and five cereals based on variation in the number of UGMS repeat-units within the functional domains.

Figure 5

Functional annotation of unigenes carrying microsatellites in their functional domains. Functional annotation of 364 unigenes carrying microsatellites in the functional domains of encoded proteins. These unigenes corresponded maximum (47.7%) to the domains responsible for photosynthesis (cytochrome b/c and chlorophyll A/B binding domains) and carbohydrate metabolism (sucrose synthase and alpha amylase domains) followed by transcription factor associated basic leucine zipper, zinc finger, TATA box, Myb and WRKY DNA binding domains (22%), and minimum to the abiotic and biotic stress related leucine rich repeat, protein kinase and chitinase domains (5%). One of these primers (UGSuM26) showing polymorphism targeted the amylase catalytic domain of β-amylase (Figure 6). To understand the possible biological significance of the variable UGMS repeat in the amylase domain, the structure predicted from the aminoacid sequences of the functional domain was analyzed in silico. It revealed variation in the active binding site involved in formation of the Ca2+ ligand complexes due to the presence of variable number of repeats encoding Leucine-Serine aminoacid residues (Figure 6). This would most likely alter the function of amylase gene in respect of carbohydrate metabolism in sugarcane. Similarly, the expansion/contraction of microsatellite motifs (AG)n encoding Arginine-Glutamine aminoacid residues was observed in the protein kinase functional domain of another sugarcane unigene which was amplified by the primer UGSuM17. In silico analysis suggested a novel secondary protein structure and catalytic domain binding site in the variant form that was different from the native form which contained (AG)18 microsatellite repeats (see Additional file 10).

Figure 6

Alignment, and predicted protein structure and catalytic domain binding sites depicting the functional relevance of microsatellite carrying unigenes. Multiple sequence alignment of the four size variant alleles showing step-wise distribution in the amylase catalytic domain amplified from sugarcane species, genera, varieties and five cereals using primer (UGSuM26) for β-amylase unigene. Variation in the number of repeat-units of CT and the encoding repeated tracts of Leucine-Serine aminoacid residues at a microsatellite locus predicted different three dimensional protein structures and protein-ligand complex binding sites which are highlighted.

Assessment of functional genetic diversity

The pair-wise similarity among 36 genotypes belonging to Saccharum species, related genera and 28 tropical and sub-tropical Indian varieties of sugarcane based on 43 fluorescent dye labeled polymorphic primers revealed a broad range from 0.12 to 0.91 with an average of 0.32 similarity index. The similarity among the sugarcane species varied from 0.24 (IJ-76-3-1-9 and 1151) to 0.88 (IJ-76-3-1-9 and IM-76-256) with an average of 0.36. Among the species, S. officinarum showed maximum similarity (0.88) with S. robustum followed by S. barberi with S. sinense (similarity of 0.82). S. spontaneum was most divergent from the rest of the Saccharum species. The similarity of the three genera of sugarcane, namely, Erianthus, Narenga and Sclerostachya with the sugarcane species varied from 0.22 to 0.90. Erianthus had the least similarity with the Saccharum species. Similarity among the 28 Indian tropical and sub-tropical sugarcane varieties varied from 0.33 (Bo 91 and Co 8021) to 0.84 (Co 8371 and Co 8021) with an average of 0.40. The pair-wise similarity between the sugarcane species and varieties varied from 0.20 to 0.78 with average of 0.35, while between the three related genera and varieties, it ranged from 0.13 to 0.80 with an average of 0.24. The average similarity between the sub-tropical and tropical varieties (0.40) was slightly lower than the average pair-wise similarity measures within the sub-tropical (0.47) and tropical varietal groups (0.43). The genetic relationship among the Saccharum species, related genera, and tropical and sub-tropical varieties is depicted in Figure 7. The UGMS markers clearly discriminated all the 36 genotypes from each other and resulted in a definitive grouping among different genera and species of Saccharum with high bootstrap values (81 to 100) that corresponded well with their known pedigree relationships. All the clones of five Saccharum species were included in the major cluster I in which S. officinarum and S. robustum clones were sub-clustered (Ia) with 98% occurrence in boot strap analysis; S. barberi and S. sinense clones grouped together in a separate sub-cluster (Ib) with 100% occurrence and S. spontaneum grouped distinctly from these sub-clusters with 92% occurrence. The three related genera of Saccharum being highly divergent from the Saccharum species and varieties, grouped in a separate cluster (III) with 98% occurrence.

Figure 7

Phylogenetic tree depicting genetic relationships among . Unrooted phylogenetic tree depicting the genetic relationships among the Saccharum species, related genera and 28 commercial Indian tropical and sub-tropical sugarcane varieties based on Nei and Li's similarity coefficient using 47 fluorescent dye labeled primers. Bootstrap values are indicated at the corresponding node for each cluster. Molecular classification corresponded to the known evolutionary and pedigree relationship as well as the adaptive environment. T and ST denote the tropical and sub-tropical region of adaptation, respectively. All the tropical and sub-tropical varieties were included in a distinct and separate cluster (II) with seven distinct sub-clusters (Figure 7) with high (88) bootstrap value. The tropical variety Co 62175 grouped in a major sub-cluster IIa with its tropical male parent Co 419 (similarity 0.71) with a bootstrap value of 99%. In this sub-cluster (IIa), the other tropical varieties namely, Co 8021, Co 8371, Co 7704 and Co 6304 were also grouped together with average similarity of 0.77 (supported by 93% occurrences in bootstrap) possibly because of their common ancestry involving Co 419. The two sub-tropical varieties, Co 8347 and Co 87263, and three other sub-tropical varieties, CoPant 84211, Co 89003 and Co 7717, in spite of having a common ancestor Co 775, were included in two distantly placed major sub-clusters; IIb and IIg, respectively. It could be due to inclusion of two diverse sub-tropical varieties, Co 87268 and CoLk 8102 having common parent Black Cheribon within the sub-cluster IIb. Similarly, the clustering of two tropical varieties namely, CoC 671 and Co 87025 in the sub-cluster IIc with average similarity of 0.69 (supported by 100% occurrences) and five tropical varieties, Co 7219, Co 740, Co 86010, Co 86032 and Co 85002 together in another separate large sub-cluster (IIf) with average similarity of 0.64 (supported by 92% occurrences) was observed even though they had one common distant progenitor parent Black Cheribon. The grouping of sub-tropical varieties namely, Co 1148, CoS 8436 and CoPant 84212 in cluster IId with an average similarity of 0.70 (88% occurrences in bootstrap) was clearly influenced by a common ancestry involving Co 1148. The sub-tropical varieties, CoJ 64, Co 1158 and CoS 88230 included in the cluster IIe had common progenitor parent PoJ 2878 that could have influenced their clustering with a high bootstrap value (94.3%).

Discussion

Unigene resources representing the transcriptome of an organism provide opportunities to understand the sequence organization in the genic regions of complex polylploids and allow design of sequence based robust markers for various genotyping applications. Considering the availability of a large unigene database of sugarcane in public domain, this resource was studied for the presence and functional relevance of different microsatellite repeats. The frequency of microsatellites in sugarcane unigenes (1/10.9 kb) was lower than that obtained in rice, Sorghum, barley and maize (1/3.6 kb, 1/5.9 kb, 1/8.9 kb and 1/9 kb, respectively) but similar to wheat (1/10.6 kb). In polyploids, extensive loss of duplicated genes and chromosomal rearrangement after polyploidization possibly have resulted in shortening and loss of microsatellite repeat-motifs from the genic coding sequences leading to dosage compensation, developmental stability and functional plasticity [16,17]. It would be of interest to explore the mechanism that restricts repeat expansion in genic regions of wheat and sugarcane leading to lower microsatellite frequency in these species. The observed frequency of the mononucleotides (12%) in sugarcane was much less than that reported earlier [18] in the unigenes of maize (75.8%), wheat (71%), barley (42.4%) and rice (41.6%), but similar to Sorghum (13%). This suggested a lack of correspondence with genome size and ploidy. The presence of longer (69 bases or more) A/T mononucleotide repeat-motifs in sugarcane is comparable to earlier observations on the nature and the frequency of such repeats in the genomic and EST sequences of cereals [19]. The abundance of trinucleotide microsatellite motifs in the sugarcane unigenes is consistent with the earlier observations based on EST sequences of sugarcane [3,5] and unigene sequences of cereals [18]. The limited expansion of non-triplet microsatellites in the unigenes of sugarcane could be due to selection against frameshift mutations in the coding regions resulting from length changes in non-triplet repeats. We observed that more than 50% of the identified trinucleotide repeat motifs in sugarcane were GC-rich, possibly due to their high GC content [3,5] and consequent usage bias in the coding sequences [20]. The abundance of GC-rich trinucleotide repeat motifs coding for small and hydrophilic aminoacid alanine and GA-rich dinucleotide motifs in sugarcane coding sequences parallels their abundance in cereal exons [18]. The GA-rich dinucleotide UGMS motifs were observed particularly in the 5' and 3' (32%) UTRs with balanced (46 to 52%) GC content and therefore would support better amplification as polymorphic markers [21]. The unigenes being longer in higher quality sequences offer advantages over the EST sequence resources for the development of microsatellite markers. In the present study, 961 (810 perfect and 151 compound) primer-pairs were designed from the 767 different unigene sequences carrying microsatellite repeats in the expressed component of the sugarcane genome. The primers designed from the unigene sequences flanking the microsatellite motifs were highly efficient with amplification success rate of 94.9% that suggested the utility of the unigene database in designing sequence based robust genic markers. The paucity of usable and robust sequence based markers in sugarcane has been a major limitation in genetic analysis in this important sugar crop. The UGMS markers developed by us have been placed in the public domain and thus would be immediately useful in various genotyping applications in sugarcane. For most (87.6%) of the unigene sequences from which the primers were designed, putative functions have been predicted. For instance, about 39% and 4% of the primers were from sequences related to sugar metabolic enzymes and disease resistance, respectively. Interestingly, 54% of the microsatellite repeats were present in various functional domains of proteins encoded by the unigenes. Correlation between fragment length polymorphism due to variable number of UGMS repeat-units in the functional domains and alteration of the predicted protein structure and active ligand binding sites suggested functional relevance of the genic microsatellites. The evolutionary and adaptive advantages of such variable microsatellite repeats affecting the structure and function of the encoded proteins to generate favorable alleles for relaxation of environmental stress impact under the action of high natural selection pressure through modulation of mutation/recombination in these loci have been reported in many eukaryotes [22]. For instance, in Saccharomyces cerevisiae, the expansion and contraction of microsatellite repeats in the coding regions of protein kinase genes have provided greater adaptability to various abiotic and biotic stresses [22]. Sugarcane is a tropical crop. However, varieties adapted to subtropical condition have been developed in India. Variation in microsatellites in the coding regions of these two groups was evident. For example, the tropical sugarcane varieties Co7219 and Co740 contained (AG)18 microsatellite repeats whereas varieties Co89003 and Co7717 adapted to sub-tropical condition showed contraction of AG repeats to (AG)10 leading to alteration of protein structure and possibly function. Understanding the adaptive significance of such variation is of relevance that needs further experimentation. Designing of genic microsatellite markers targeting functional domains in solanaceous crops like tomato and pepper [23] was reported. Association of such markers with many traits including diseases like neuronal disorders [22] and cancers [24] in humans based on expansion/contraction of repeated tracts of microsatellites encoding aminoacid residues in the functional domains of proteins that changes their secondary structure and function has been demonstrated. The information generated in this study thus would allow selection of candidate gene-based markers for rapidly establishing marker-trait linkages in sugarcane. The efficiency of sugarcane UGMS markers to detect polymorphism within Saccharum complex and among five cereals was evaluated in silico that was further validated experimentally. In silico polymorphism based on variation in length of the UGMS repeat among S. officinarum and each of five cereal species could be due to divergent microsatellite evolution in these lineages [25]. The actual level of polymorphism detected by automated fragment analysis using fluorescent dye labeled primers was much higher than that based on in silico analysis. The expected relationship in evolution was reflected by actual polymorphism observed between sugarcane and cereals. With the use of automated fragment analysis system, all the allelic variants could be captured efficiently in a set of 36 sugarcane genotypes and five cereal species, thereby providing a platform for rapid, automated and large-scale genotyping of sugarcane. Further, the allele size information generated would enable multiplexing of the UGMS markers, thus making them useful in various high-throughput genotyping applications. The observed polymorphism (97.7%) among the members of the Saccharum species and related genera is higher than that estimated earlier using the fluorescent dye labeled sugarcane genomic (90%, [2]) and EST derived microsatellite markers (81%, [3]). The higher potential of UGMS markers particularly the longer class I di- and tetra-nucleotide repeat-motifs to detect polymorphism as compared to the class I trinucleotide and class II motifs reflected correspondence between the type and length of repeats, and the level of polymorphism as observed earlier in sugarcane [4] and rice [21,26]. Higher polymorphic potential of UGMS markers derived from the UTRs than that from the conserved coding sequences, which are constrained by purifying selection [25,27] suggested the utility of unigenes having such repeat-motifs as a source of polymorphic microsatellite markers. The level of inter-varietal polymorphism (86%, mean PIC of 0.74) detected by the fluorescent dye labeled primers was higher than the level reported previously with the labeled sugarcane EST derived (38%, PIC of 0.23, [3]) microsatellite markers. The discrepancies could be due to use of different sets of markers and genotypes. However, genotyping using the automated fragment analysis system in this study revealed comparable level of polymorphism with the genic and genomic microsatellite markers in sugarcane. It suggested that the genic microsatellite markers developed in this study would be highly informative and useful in sugarcane genetics, genomics and breeding. Unigene sequences usually have advantages of unique identity and position in the transcribed regions of the genome. If this is the case then primers designed from the unigene sequences flanking the microsatellite repeats should amplify unique single locus. In contrast, in the present study, multiple loci and thus amplification of multiple sequences of the same gene was observed for 28 (65.1%) of the 43 primers designed from different microsatellite carrying unigenes. This is possibly due to poor representation of unigene sequences in the database that was scanned for microsatellite repeats. Alternatively, all the copies of a microsatellite carrying gene that are PCR amplified might not be transcribed due to dosage compensation leading to silencing of all but one copy in the large polyploid sugarcane genome [16,28]. In spite of high polyploidy and heterozygosity of the sugarcane genome, 15 (34.9%) of the 43 polymorphic UGMS primers amplified a single discrete locus with step-wise allele distribution and showed fragment length polymorphism across the members of Saccharum complex and varieties in an automated fragment analysis system. This could be due to polyploidization followed by selective gene loss [29,16,28] resulting in retention of a single copy during evolution of Saccharum. Gene ontology analysis of these 14 unigenes showed correspondence with the genes coding for basic metabolic process of energy generation, degradation of cellular building blocks, DNA recombination and repair proteins which played a vital role in plant biology and thus remained as single copy without alteration. Distribution pattern of size variant alleles amplified at UGMS loci showed higher proportionate (68%) distribution of alleles showing step-wise mutation than that of mixed allele distribution (32%). High-quality sequence alignment of the size variant alleles showing both step-wise and mixed distributions confirmed the presence of variable number of repeat-units in different amplified alleles and additional insertions/deletions in the flanking regions of microsatellite repeat-motifs, which contributed to the UGMS fragment length polymorphism observed in the Saccharum complex, varieties and cereal species. Such complex pattern of allele distribution and fragment length polymorphism at microsatellite loci due to variation in the copy number of microsatellite repeats and insertions/deletions in the flanking sequences of microsatellite motifs have been observed earlier in species like maize with a large genome size [30]. It is important to evaluate molecular diversity existing among the members of the Saccharum species and related genera, since major varietal improvement in sugarcane was through inter-specific hybridization. Use of S. spontaneum and Erianthus is vital since they carry many traits of agricultural importance including tolerance to biotic and abiotic stresses [14]. Identification of true inter-specific and inter-generic hybrids is crucial for successful transfer of the target traits. Besides, monitoring of introgression is required to verify the transfer of the target genomic regions from the wild relatives to the cultivated genetic backgrounds. 97.7% of markers revealed inter-specific and inter-generic polymorphism and thus would enable precise identification of inter-specific and inter-generic hybrids, and also provide opportunities to assess transfer of genic regions to desirable genetic backgrounds, thereby offering advantages over the random markers such as RAPD and AFLP. Evaluation of molecular diversity in a set of 28 commercial Indian tropical and sub-tropical sugarcane varieties using unigene based genic microsatellite markers revealed a wider range (0.33 to 0.84 with an average of 0.40) of genetic similarity than the level detected previously with RAPD (0.59 to 0.81 with average of 0.71, [12]), maize microsatellite (0.40 to 0.73 with average of 0.64, [13]), sugarcane genomic microsatellite (0.39 to 0.82 with average of 0.53, [4]) and AFLP (0.52 to 0.83 with average of 0.62, [14]) markers. Hence, high efficiency of UGMS markers in assaying functional diversity in the transcribed component of the sugarcane genome makes them valuable for understanding diversity pattern, and variety identification. The commercial Indian sugarcane varieties used in this study were bred for either tropical or subtropical agro-climatic conditions with differential contribution of S. spontaneum, the most variable Saccharum species [14]. The random markers assaying genetic variation largely in different non-genic genomic regions that contribute to large genome size [31] would be of little relevance to phenotypic selection exercised during variety development. In contrast, the genic microsatellite markers assaying variation in the transcribed non-repetitive regions of the genome might be directly related to the phenotypic variation [25]. Our results thus suggested that the genetic base of the Indian sugarcane varieties is not very narrow particularly in the genic regions of the genome, which is most likely due to selection for wider adaptability of these varieties. The adaptive environment as well as parentage of these varieties corresponded well with the clustering pattern obtained using a small set of these markers, that further suggested the usefulness of the designed markers in realistic assessment of genetic diversity.

Conclusion

The present study identified microsatellites in sugarcane unigenes and assessed their functional relevance. A total of 961 primer-pairs were designed targeting 767 different unigenes carrying microsatellite repeats in the expressed component of the sugarcane genome, which would extend the accessibility of such microsatellite markers to researchers for many genetic studies in sugarcane. Precise allele sizing in automated fragment analysis system has encouraging implications to various high-throughput genotyping applications in sugarcane. Assessment of functional genetic diversity revealed that the genetic base of the Indian sugarcane varieties is not narrow.

Methods

Mining of microsatellites in the unigenes of sugarcane and assessment of their functional relevance

Fifteen thousand five hundred ninty-four sugarcane (Saccharum sp.) unigenes (Build 13.0, 19th Feb' 2008) comprising of 9.17 million base sequences were acquired in FASTA format from the recent NCBI FTP UniGene repository database of S. officinarum [32] in batches and searched for microsatellites using a perl scripting language based program MISA (MIcroSAtellite) [33] considering complementary sequence of repeat-motifs (like AG, GA, TC and CT) in the same class [18]. The identified microsatellites were characterized as perfect (monomers to hexamers repeated up to 100 times without any interruption), compound (non-interrupting) with at least two different repeat-motifs without any interruption, compound (interrupting) with a maximum of 100 nucleotides interposing two microsatellite repeat-motifs, class I (≥20 nucleotides) and class II (12 to 20 nucleotides) types. The microsatellite (excluding monomers) containing unigenes were used in a batch module for designing forward and reverse primers employing the microsatellite primer discovery tool of BatchPrimer3 [34]. All default options except for the optimum and maximum primer sizes of 22 and 24 nucleotides, respectively, were used. The putative functions for these unigenes were determined using NCBI BLASTX search tool [35] as against the nr-protein database. The BLASTX output was annotated into four functional categories viz, exact, putative, unknown and hypothetical, and then extracted to Excel sheets. Different structural components namely, coding and untranslated regions (5'UTRs and 3'UTRs) in the unigene sequences was determined using the software tools FGENESH [36] and UTRScan [37], respectively. The aminoacid sequences encoded by the predicted coding nucleotide regions of the microsatellite carrying unigenes were analyzed using the software Pfam [38] to determine the presence of functional domain/protein family within the UGMS. The aminoacid sequences of the functional domain carrying microsatellites was analyzed further using the I-TASSER automated web server [39,40] for prediction of three dimensional protein structure and active binding sites with ligands. Five different protein models and active binding sites were predicted at significant cut-off confidence (C ≥ -1.5) and binding site (BS ≥ 0.5) scores. The high quality protein model of correct topology and protein-ligand complex active binding sites was identified based on high C- and BS- scores.

Evaluation of polymorphic potential

To assess the amplification success rate of microsatellite markers designed from sugarcane unigenes, 176 primers were used to amplify one genotype of S. officinarum. Forty-seven, including thirty-eight class I and nine class II markers of these, labeled with 6-carboxyflourescein (6-FAM) dye phosphoramidites, were used to genotype one accession each belonging to five Saccharum species and three related genera, and twenty-eight commercial Indian tropical and sub-tropical sugarcane varieties (Table 3) for evaluating their polymorphic potential and assessment of genetic diversity. Five cereal species namely, rice, wheat, maize, Sorghum and barley were included in the experiment for comparison. Standard PCR constituents and cycling conditions except for annealing temperature, which varied from 55°C to 64°C depending on the primers, were used for PCR amplification. The amplified products were purified, mixed with 3.75 μl of MegaBACE formamide loading buffer and 0.25 μl of internal-lane MegaBACE™ET 550-R ROX size standard, denatured, cooled and resolved in automated MegaBACE 1000 DNA sequencer (Amersham Biosciences, Piscataway NJ, USA). The electropherogram containing trace files were analyzed and automated allele calling was carried out using the Binning Peak Post Processor tool of MegaBACE™Fragment Profiler Software Version 1.2 (Amersham Biosciences, Piscataway NJ, USA). The actual allele size (bp) was determined and fragment length polymorphism among the sugarcane genotypes and cereal species was identified. The average peak height and peak quality of alleles generated for all the UGMS loci were graphed and alleles exhibiting average peak height and quality of ≥ 1500 and ≥ 6.0 fluorescence units [41,42] respectively were considered for sizing and base-pair estimation. Allele binning was performed for individual UGMS loci for precise and accurate allele size determination and discrimination of homozygous and heterozygous allele types [42,43] for each genotype.

Table 3

List of genotypes belonging to five cereal species and Saccharum complex used in the study

Sl. No.	Genus and species	Clones/varieties	Parentage/origin	Region of adaptation
1	Hordeum vulgare	AK559	Unknown	-

2	Triticum aestivum	Kalyansona	PJ"S" × GB 55	-

3	Oryza sativa	IR64	IR5857-33-2-1 × IR-2061-465-1-5-5	-

4	Zea mays	KA509	India	-

5	Sorghum bicolor	Pusa chari6	India	-

6	Saccharum officinarum	IJ-76-3-1-9	Indonesia	-

7	S. barberi	Mamjasahe	India	-

8	S. sinense	Malari	Unknown	-

9	S. robustum	IM-76-256	Indonesia	-

10	S. spontaneum	1151	India	-

11	Narenga	NarengaER	Unknown	-

12	Sclerostachya	S1135	Unknown	-

13	Erianthus	IK76-81	Indonesia	-

14	Saccharum spp.	Co 1158	Co 421 × Co 419	Sub-tropical

15	Saccharum spp.	CoJ 64	Co 976 × Co 617	Sub-tropical

16	Saccharum spp.	CoS 88230	Co 1148 × Co 775	Sub-tropical

17	Saccharum spp.	Bo 91	Bo 55 × Bo 43	Sub-tropical

18	Saccharum spp.	CoS 8436	MS 68/47 × Co 1148	Sub-tropical

19	Saccharum spp.	Co 1148	P 4383 × Co 301	Sub-tropical

20	Saccharum spp.	CoPant 84212	Co 1148 × Co 775	Sub-tropical

21	Saccharum spp.	Co 87268	Bo 91 × Co 62399	Sub-tropical

22	Saccharum spp.	CoLk 8102	Co 1158 GC	Sub-tropical

23	Saccharum spp.	Co 7717	Co 419 × Co 775	Sub-tropical

24	Saccharum spp.	Co 87263	Co 312 × Co 6806	Sub-tropical

25	Saccharum spp.	Co 89003	Co 7314 × Co 775	Sub-tropical

26	Saccharum spp.	CoPant 84211	Co 6806 × Co 6912	Sub-tropical

27	Saccharum spp.	Co 8347	Co 419 × CoC 671	Sub-tropical

28	Saccharum spp.	Co 7219	Co 449 × Co 658	Tropical

29	Saccharum spp.	Co 740	P 3247 × P 4775	Tropical

30	Saccharum spp.	Co 86010	Co 740 × Co 7409	Tropical

31	Saccharum spp.	Co 86032	Co 62198 × CoC 671	Tropical

32	Saccharum spp.	Co 85002	Co 62198 × (-)	Tropical

33	Saccharum spp.	Co 8021	Co 740 × Co 6806	Tropical

34	Saccharum spp.	Co 8371	Co 740 × Co 6806	Tropical

35	Saccharum spp.	Co 6304	Co 419 × Co 453	Tropical

36	Saccharum spp.	Co 419	PoJ 2878 × Co 290	Tropical

37	Saccharum spp.	Co 62175	Co 951 × Co 419	Tropical

38	Saccharum spp.	Co 7704	Co 740 × Co 6806	Tropical

39	Saccharum spp.	Co 86249	CoJ 64 × CoA 7601	Tropical

40	Saccharum spp.	CoC 671	Q 63 × Co 775	Tropical

41	Saccharum spp.	Co 87025	Co 7704 × Co 62198	Tropical

List of genotypes belonging to five cereal species and Saccharum complex used in the study Variation in the fragment size (bp) of amplified alleles at each polymorphic UGMS marker locus was compared with changes in the number of microsatellite repeat-units at that target locus, and the occurrence of "stepwise" and "mixed" type of allele size distribution was inferred. When the allele size differences strictly corresponded to the variation in the number of repeat-units, it was considered as stepwise distribution. A mixed allele distribution was assumed when the allele size differences could partly be explained by the stepwise model. Multiple amplicons obtained by a primer-pair with peaks ≥ 1500 fluorescence units showing ≥ 100 bp allele size differences were binned into different loci. To confirm that the primers amplified the target microsatellite repeat motifs in different species and genera, the amplified products were purified using Micropon PCR purification kits (Millipore, Bedford, MA, USA) and sequenced two times in both forward and reverse directions using a capillary-based Automated DNA Sequencer (MegaBACE 1000, Amersham Biosciences, Piscataway NJ, USA). The trace files were base called, checked for quality and assembled into contigs [44]. The high quality sequences thus obtained were used for interspecies comparison using CLUSTALW multiple sequence alignment tool employing BIOEDIT software [45]. Ten size variant amplicons each of three microsatellite markers (UGSuM2, UGSuM26 and UGSuM27) showing both step-wise and mixed type of allele size distribution in sugarcane species, genera, varieties and cereals were eluted, purified, cloned in pGEM-T Easy Vector (Promega, USA) and sequenced as described above. The polymorphic information content (PIC) was calculated using the formula, PIC = 1- ∑Pij2 [46], where Pij is the frequency of the jth allele for ith locus summed across all alleles for the locus. Cluster analysis among the 36 genotypes of Saccharum species, varieties and related genera was based on Nei and Li similarity coefficient [47] using the un-weighted pair group method (UPGMA) in PowerMarker Version 3.0 [48,49] software. The confidence limits of UPGMA based dendrogram was determined by 1000 bootstrap replicates and bootstrap of 50% majority rule consensus unrooted phylogenetic tree was constructed.

List of abbreviations

Sugarcane, Unigenes, Microsatellites, Functional genetic diversity

Authors' contributions

SKP conducted mining of UGMS, marker design, large-scale genotyping, polymorphism survey, functional diversity estimation and drafted the manuscript. AP was involved in genotyping and sequencing. KG and TRS participated in microsatellite mining and data analysis. PSS and NKS helped in data analyses, interpretation and drafting of the manuscript. TM designed the study, guided data analysis and interpretation, participated in drafting and correcting the manuscript and gave the final approval of the version to be published. All authors have read and approved the final manuscript.

Additional file 1

Distribution of codon repeats along with the corresponding aminoacids in the unigenes of sugarcane. Click here for file

Additional file 2

Frequency and abundance of microsatellite repeat motifs and distribution of class I and class II motifs in the unigenes of sugarcane. Click here for file

Additional file 3

Primers designed from sugarcane unigenes carrying perfect microsatellite repeat-motifs along with the expected amplicon size and putative unigene function. Click here for file

Additional file 4

Primers designed from sugarcane unigenes carrying compound class I microsatellite repeat-motifs along with the expected amplicon size and putative unigene function. Click here for file

Additional file 5

Alignment showing the presence of class I UGMS repeat-motifs in sugarcane species, related genera and five cereal species. Click here for file

Additional file 6

. Click here for file

Additional file 7

Functional annotation of 672 unigenes carrying microsatellites of sugarcane. Click here for file

Additional file 8

Primers targeting the microsatellite repeats present in the functional domains of proteins encoded by the unigenes. Click here for file

Additional file 9

Primers targeting the microsatellite repeats located within the functional domains of proteins encoded by sugarcane unigene sequences which gave single locus and step-wise allele amplification. Click here for file

Additional file 10

Alignment depicting the functional significance of sugarcane UGMS markers targeting the microsatellite repeat-motif (AG). Click here for file

27 in total

1. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

2. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential.

Authors: S Temnykh; G DeClerck; A Lukashova; L Lipovich; S Cartinhour; S McCouch
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

3. Characterisation of microsatellite markers from sugarcane (Saccharum sp.), a highly polyploid species.

Authors:
Journal: Plant Sci Date: 2000-06-29 Impact factor: 4.729

4. Microsatellites in Zea - variability, patterns of mutations, and use for evolutionary studies.

Authors: Y. Matsuoka; S. E. Mitchell; S. Kresovich; M. Goodman; J. Doebley
Journal: Theor Appl Genet Date: 2002-02 Impact factor: 5.699

5. Fluorescent-labeled microsatellite panels useful for detecting allelic diversity in cultivated rice ( Oryza sativa L.).

Authors: Matthew W. Blair; Veena Hedetale; Susan R. McCouch
Journal: Theor Appl Genet Date: 2002-05-23 Impact factor: 5.699

6. Survey in the sugarcane expressed sequence tag database (SUCEST) for simple sequence repeats.

Authors: Luciana Rossini Pinto; Karine Miranda Oliveira; Eugênio César Ulian; Antonio Augusto Franco Garcia; Anete Pereira de Souza
Journal: Genome Date: 2004-10 Impact factor: 2.166

7. Microsatellites in Brassica unigenes: relative abundance, marker design, and use in comparative physical mapping and genome analysis.

Authors: Swarup K Parida; Devendra K Yadava; Trilochan Mohapatra
Journal: Genome Date: 2010-01 Impact factor: 2.166

8. Informative genomic microsatellite markers for efficient genotyping applications in sugarcane.

Authors: Swarup K Parida; Sanjay K Kalia; Sunita Kaul; Vivek Dalal; G Hemaprabha; Athiappan Selvi; Awadhesh Pandit; Archana Singh; Kishor Gaikwad; Tilak R Sharma; Prem Shankar Srivastava; Nagendra K Singh; Trilochan Mohapatra
Journal: Theor Appl Genet Date: 2008-10-23 Impact factor: 5.699

9. Genic non-coding microsatellites in the rice genome: characterization, marker design and use in assessing genetic and evolutionary relationships among domesticated groups.

Authors: Swarup Kumar Parida; Vivek Dalal; Ashok Kumar Singh; Nagendra Kumar Singh; Trilochan Mohapatra
Journal: BMC Genomics Date: 2009-03-31 Impact factor: 3.969

10. A set of multiplex panels of microsatellite markers for rapid molecular characterization of rice accessions.

Authors: Marco Pessoa-Filho; André Beló; António A N Alcochete; Paulo H N Rangel; Márcio E Ferreira
Journal: BMC Plant Biol Date: 2007-05-21 Impact factor: 4.215

16 in total

1. CNMS: The preferred genic markers for comparative genomic, molecular phylogenetic, functional genetic diversity and differential gene regulatory expression analyses in chickpea.

Authors: Deepak Bajaj; Shouvik Das; Swarup K Parida
Journal: J Biosci Date: 2015-09 Impact factor: 1.826

2. Genes tagging and molecular diversity of red rot susceptible/tolerant sugarcane hybrids using c-DNA and unigene derived markers.

Authors: R K Singh; R B Singh; S P Singh; M L Sharma
Journal: World J Microbiol Biotechnol Date: 2011-12-16 Impact factor: 3.312

3. Single nucleotide polymorphism in sugar pathway and disease resistance genes in sugarcane.

Authors: Swarup K Parida; Sanjay Kalia; Awadhesh Pandit; Preetam Nayak; Ram Kushal Singh; Kishor Gaikwad; Prem Shankar Srivastava; Nagendra K Singh; Trilochan Mohapatra
Journal: Plant Cell Rep Date: 2016-06-11 Impact factor: 4.570

4. Single strand conformation polymorphism of genomic and EST-SSRs marker and its utility in genetic evaluation of sugarcane.

Authors: Sachin B Kalwade; Rachayya M Devarumath
Journal: Physiol Mol Biol Plants Date: 2014-05-28

5. Development and Integration of Genome-Wide Polymorphic Microsatellite Markers onto a Reference Linkage Map for Constructing a High-Density Genetic Map of Chickpea.

Authors: Yash Paul Khajuria; Maneesha S Saxena; Rashmi Gaur; Debasis Chattopadhyay; Mukesh Jain; Swarup K Parida; Sabhyata Bhatia
Journal: PLoS One Date: 2015-05-14 Impact factor: 3.240

6. Genome-wide conserved non-coding microsatellite (CNMS) marker-based integrative genetical genomics for quantitative dissection of seed weight in chickpea.

Authors: Deepak Bajaj; Maneesha S Saxena; Alice Kujur; Shouvik Das; Saurabh Badoni; Shailesh Tripathi; Hari D Upadhyaya; C L L Gowda; Shivali Sharma; Sube Singh; Akhilesh K Tyagi; Swarup K Parida
Journal: J Exp Bot Date: 2014-12-10 Impact factor: 6.992

7. Identification and characterization of microsatellites in expressed sequence tags and their cross transferability in different plants.

Authors: Shamshad Ul Haq; Rohit Jain; Meenakshi Sharma; Sumita Kachhwaha; S L Kothari
Journal: Int J Genomics Date: 2014-10-19 Impact factor: 2.326

8. Exploiting EST databases for the development and characterisation of 3425 gene-tagged CISP markers in biofuel crop sugarcane and their transferability in cereals and orphan tropical grasses.

Authors: Amaresh Chandra; Radha Jain; Sushil Solomon; Shiksha Shrivastava; Ajoy K Roy
Journal: BMC Res Notes Date: 2013-02-04

9. Functionally relevant microsatellite markers from chickpea transcription factor genes for efficient genotyping applications and trait association mapping.

Authors: Alice Kujur; Deepak Bajaj; Maneesha S Saxena; Shailesh Tripathi; Hari D Upadhyaya; C L L Gowda; Sube Singh; Mukesh Jain; Akhilesh K Tyagi; Swarup K Parida
Journal: DNA Res Date: 2013-04-29 Impact factor: 4.458

10. Natural allelic diversity, genetic structure and linkage disequilibrium pattern in wild chickpea.

Authors: Maneesha S Saxena; Deepak Bajaj; Alice Kujur; Shouvik Das; Saurabh Badoni; Vinod Kumar; Mohar Singh; Kailash C Bansal; Akhilesh K Tyagi; Swarup K Parida
Journal: PLoS One Date: 2014-09-15 Impact factor: 3.240