GeMprospector--online design of cross-species genetic marker candidates in legumes and grasses.

Jakob Fredslund, Lene H Madsen, Birgit K Hougaard, Niels Sandal, Jens Stougaard, David Bertioli, Leif Schauser.

Abstract

The web program GeMprospector (URL: http://cgi-www.daimi.au.dk/cgi-chili/GeMprospector/main) allows users to automatically design large sets of cross-species genetic marker candidates targeting either legumes or grasses. The user uploads a collection of ESTs from one or more legume or grass species, and they are compared with a database of clusters of homologous EST and genomic sequences from other legumes or grasses, respectively. Multiple sequence alignments between submitted ESTs and their homologues in the appropriate database form the basis of automated PCR primer design in conserved exons such that each primer set amplifies an intron. The only user input is a collection of ESTs, not necessarily from more than one species, and GeMprospector can boost the potential of such an EST collection by combining it with a large database to produce cross-species genetic marker candidates for legumes or grasses.

Year: 2006 PMID: 16845095 PMCID: PMC1538858 DOI: 10.1093/nar/gkl201

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Comparative genetics allows the transfer of genetic information from one species to another. In legumes (Fabaceae), comparative genetics holds the promise to transfer information from well-studied genetic models, such as Lotus japonicus and Medicago truncatula, to some of the agriculturally very important but genetically understudied legumes among the 18 000 species in this family (e.g. peas, beans, lentils, soybeans, peanuts). The family of grasses (Poaceae, also known as Gramineae) contains 10 000 species including rice, wheat, barley, maize and forage grasses; it is the only family of plants more important to humans than legumes. For grasses, the primary source of genetic information is rice. Genetic markers, i.e. DNA polymorphisms between the genomes of two mapping parents, are the work-horse of this information transfer by synteny. In order to detect polymorphisms at loci which can be placed at unique positions on the genetic maps of several related species, a polymorphism identification strategy which focuses on introns of highly conserved genes has been proposed [e.g. by Lyons et al. (1)]. We have built an automated bioinformatics pipeline for the identification of cross-species genetic marker candidates, defined as sets of primer pairs for PCR amplification of introns, which we have used extensively to find family-specific marker candidates in the legume and grass families (2). This paper presents a tool which lets users compare their own legume or grass EST sequence data with the two respective databases built by our pipeline in order to find novel cross-species genetic marker candidates. GeMprospector users should cite this paper and the GeMprospector URL (given in the Abstract) when referring to the program.

MATERIALS AND METHODS

The database holds gene indices (3), rice coding sequences and Arabidopsis peptides (4) from The Institute for Genomic Research, genomic Lotus sequences from The National Center for Biotechnology Information (NCBI) and genomic Medicago sequences from . We use the Blast program package from NCBI (5) for sequence comparisons, with a cut-off E-value of 10⁻⁷ for sequence homology. PriFi (6) is used for primer design; ClustalW is used to perform multiple alignments (with permission from the European Bioinformatics Institute website).
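As a minimal sketch of the homology cut-off, the following assumes that Blast hits have been exported in the standard 12-column tabular format, in which the E-value is the eleventh field. The function name and the sample rows are hypothetical, and treating hits at or below the cut-off as homologous is an assumption; this is not the authors' code.

```python
E_VALUE_CUTOFF = 1e-7  # cut-off E-value stated in the text


def homologous_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Yield (query, subject, evalue) for hits at or below the cut-off.

    Expects lines in 12-column tabular Blast output; comment and
    blank lines are skipped.
    """
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # 11th column holds the E-value
        if evalue <= cutoff:
            yield query, subject, evalue


# Hypothetical hits: one strong match, one too weak to count as homology.
sample = [
    "est1\tAt1g01010\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t190",
    "est2\tAt2g02020\t40.0\t60\t36\t0\t1\t60\t10\t69\t0.003\t35.0",
]
kept = list(homologous_hits(sample))
```

Only the first sample row passes the filter, since 0.003 is far above the cut-off.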

RESULTS

The preprocessing underlying GeMprospector

GeMprospector aims at identifying regions of sequence conservation across several related species that include at least one intron, and then designing primers such that the segment containing the intron is amplified (Figure 1). This maximizes the chance that the primers work for most species in the clade, including those for which no sequence information is available, and that the PCR product contains a polymorphism, making the locus a potential genetic marker.

Figure 1

Aligning an intron-containing genomic sequence with several homologous gene indices and designing primers in conserved regions.

For our grass application of the pipeline we use genomic sequences from Oryza sativa (rice) and gene indices from Oryza sativa, Sorghum bicolor and Hordeum vulgare (barley). In the legume application, genomic sequences from L. japonicus and M. truncatula are used, but since these genomes are not completely sequenced yet, Arabidopsis thaliana is also included as a reference species. Gene index collections derive from L. japonicus, M. truncatula, Glycine max, Phaseolus vulgaris and Arachis spp. The pipeline follows steps i–iv listed below.

(i) Markers are maximally useful if they define a unique genetic position. Since the currently available Lotus and Medicago genome sequences are incomplete, we cannot rule out that a given legume sequence has several copies in these two plants. To get a copy number estimate from a complete plant genome, legume gene indices are compared with the Arabidopsis proteome, and predicted single-copy gene indices from all species are indexed according to their best Arabidopsis hit. For grasses, all indices are compared by Blast against the rice genome and single-copy sequences are kept, indexed by their rice homologue.

(ii) Relevant gene indices are compared against their genomes in order to identify sequences with introns (Lotus and Medicago in the legume application, rice in the grass application). Gene indices are intron-tagged at the corresponding positions. For legumes, these sequences are again indexed according to their best Arabidopsis hit.

(iii) Each group of homologous sequences (in the case of the legumes, sequences with the same Arabidopsis index) is called a pot. The resulting two collections of pots, one for legumes and one for grasses, form the underlying database of GeMprospector. Each pot contains one sequence with inserted intron tags plus one or several gene index sequences from the other species, all homologous to the same Arabidopsis or rice sequence.

(iv) Finally, our specially designed software PriFi (6) is batch-run on all pots, first creating multiple alignments and then suggesting primers which fulfill the requirements in terms of conservation, intron length, melting temperature, etc. Forcing the primers to span an intron increases the chance of a polymorphic amplicon due to the lower selection pressure on introns, and hence increases the chance that the primers and amplicon constitute a genetic marker.

The legume version of the pipeline is diagrammed in Figure 2. Only some of the multiple sequence alignments yield marker candidates (marked by circles in Figure 2). The remaining alignments did not allow the design of valid primer pairs. Given new sequence information, these ‘dormant’ alignments may well become ‘activated’ and yield further marker candidates.

Figure 2

The legume version of the pipeline underlying GeMprospector. See text.
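The grouping of sequences into pots can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the `GeneIndex` record, the `build_pots` function and all sequence identifiers are hypothetical, and the rule that a usable pot needs at least one intron-tagged sequence plus at least one further sequence is an assumption derived from the description of a pot in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneIndex:
    seq_id: str
    species: str
    best_hit: str        # best Arabidopsis (legumes) or rice (grasses) homologue
    intron_tagged: bool  # True if intron positions were mapped from a genome


def build_pots(gene_indices):
    """Group sequences by reference homologue and keep usable groups.

    A group is kept only if it holds an intron-tagged sequence and at
    least one other sequence to align it with (assumed rule).
    """
    groups = defaultdict(list)
    for gi in gene_indices:
        groups[gi.best_hit].append(gi)

    pots = {}
    for hit, members in groups.items():
        has_tagged = any(m.intron_tagged for m in members)
        if has_tagged and len(members) >= 2:
            pots[hit] = members
    return pots


# Hypothetical records: two sequences share a homologue, one stands alone.
records = [
    GeneIndex("Lj_TC1", "Lotus japonicus", "At1g01010", True),
    GeneIndex("Gm_TC7", "Glycine max", "At1g01010", False),
    GeneIndex("Pv_TC3", "Phaseolus vulgaris", "At3g03030", False),
]
pots = build_pots(records)
```

In this example only the group indexed by `At1g01010` becomes a pot; the lone Phaseolus sequence has no intron-tagged partner and is left out.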

Here we present the tool GeMprospector. GeMprospector acts against the backdrop of this preassembled database. The user submits a set of ESTs (legume or grass) in Fasta format, and these ESTs undergo the same analysis as each of the above mentioned gene index collections, as shown in Figure 3. The submitted ESTs are drawn in blue; they are compared with the Arabidopsis proteome/rice genome, respectively, and non-rejected sequences are merged with the appropriate database sequences and subjected to PriFi. If the new sequences add sufficient information to some of the ‘dormant’ alignments, for example by raising the conservation score, valid primer pairs can now be produced and hence new marker candidates are found (marked by circles), each incorporating one of the uploaded sequences and data from the underlying database. Primers and associated information are reported to the user. In other words, the GeMprospector tool allows new ESTs from one species to drive the design of genetic markers for many species.

Figure 3

Running GeMprospector with new legume ESTs. Two dormant alignments give rise to new candidate markers because of the uploaded ESTs.
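The merging of uploaded ESTs into the preassembled database can be illustrated with a short sketch. It assumes that each non-rejected EST has already been assigned a best reference homologue, and that ESTs without a matching pot are simply skipped; both the data layout and the `merge_user_ests` function are hypothetical. The pots that receive a new sequence are the ones on which primer design would be worth repeating.

```python
def merge_user_ests(pots, est_best_hits):
    """Add user ESTs to matching pots.

    pots: dict mapping reference homologue -> list of sequence ids
    est_best_hits: dict mapping user EST id -> reference homologue,
        or None if the EST was rejected
    Returns (merged pots, set of pot keys that gained a sequence).
    The input dict is not modified.
    """
    merged = {hit: list(members) for hit, members in pots.items()}
    touched = set()
    for est_id, hit in est_best_hits.items():
        if hit is None or hit not in merged:
            continue  # rejected EST, or no pot for this homologue
        merged[hit].append(est_id)
        touched.add(hit)
    return merged, touched


# Hypothetical database with two pots and three uploaded ESTs.
database = {
    "At1g01010": ["Lj_TC1", "Gm_TC7"],
    "At3g03030": ["Mt_TC9", "Pv_TC3"],
}
uploads = {"user_est1": "At3g03030", "user_est2": None, "user_est3": "At5g05050"}
merged, touched = merge_user_ests(database, uploads)
```

Here only the `At3g03030` pot changes, so only its alignment would need to be rebuilt and passed to primer design again.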

The web interface

The main page of the GeMprospector website is very simple. The user must upload a Fasta file of ESTs (or choose the available demo file), select either the legume or the grass database, and start the analysis. There is also a link to the tool documentation (‘About this tool’). After submitting the sequences, the user is taken to a new page which dynamically reports the current step of the process, automatically reloading at suitably increasing intervals depending on the job size. When the analysis is complete, a summary reports the number of novel marker candidates found and offers links to view and download the results (Figure 4).

Figure 4

Screen shot displaying the analysis progress and summary.
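Before uploading, a user may want to check a Fasta file against the 10 MB file size maximum mentioned in the Running time section. The sketch below is one hypothetical way to do so; the function names are illustrative, and reading 10 MB as 10 000 000 bytes is an assumption based on the decimal file sizes quoted in Table 1.

```python
MAX_UPLOAD_BYTES = 10_000_000  # assumed decimal reading of the 10 MB limit


def summarize_fasta(text):
    """Return (number of sequences, total residues, size in bytes)."""
    n_seqs = 0
    n_residues = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_seqs += 1          # header line starts a new record
        else:
            n_residues += len(line)
    return n_seqs, n_residues, len(text.encode("utf-8"))


def fits_upload_limit(text, limit=MAX_UPLOAD_BYTES):
    """True if the Fasta text is no larger than the upload limit."""
    return summarize_fasta(text)[2] <= limit


# Hypothetical two-record Fasta text.
demo = ">est1 hypothetical\nACGTACGT\nACGT\n>est2 hypothetical\nGGCC\n"
n_seqs, n_residues, n_bytes = summarize_fasta(demo)
```

The sequence and residue counts correspond to the "Sequences" and "Nucleotides" columns of Table 1, which makes it easy to compare a planned upload with the runs reported there.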

Viewing results

Clicking ‘View’ takes the user to a tabular view of the results (Figure 6). The column headers of the table are the ID of the marker candidate, the best Arabidopsis/rice homologue, the forward and reverse primers, the PriFi score, and the annotation of the Arabidopsis/rice homologue. By clicking (some of) the headers, the user can sort the table based on various criteria, e.g. the PriFi score (expected quality) of the primer pair. In the results table, the score serves as a link to a report with detailed information about the corresponding primers. The report may hold up to three alternative primer suggestions [for details on PriFi primer reports, see (6)].

The ID string of each marker candidate serves as a link to a display of the multiple sequence alignment underlying the marker candidate, including the position of the suggested primers and intron(s). The alignment is shown in three forms: as multi-colored sequences of letters and gaps, as a multi-color line sketch for quickly surveying conserved regions (highlighted in olive-green) and primer placements, and as a ClustalW alignment (Figure 5).

Figure 5

Displaying the alignment underlying a marker candidate, including suggested primers.

The results can also be downloaded as a zipped file containing the same information as the results table.
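Downloaded results could be ranked offline in the same way as the web table. The sketch below assumes, purely for illustration, that the table is available as tab-separated text with a header row and that a higher PriFi score means a better primer pair; the actual layout of the zipped file is not described here, and the column names, identifiers and primer strings are made up.

```python
import csv
import io


def sort_by_score(tsv_text, score_column="PriFi score"):
    """Return table rows as dicts, best score first (assumed: higher is better)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return sorted(rows, key=lambda row: float(row[score_column]), reverse=True)


# Hypothetical two-row results table.
example = (
    "ID\tHomologue\tForward primer\tReverse primer\tPriFi score\n"
    "cand1\tAt1g01010\tACGT\tTGCA\t71.5\n"
    "cand2\tAt3g03030\tGGCC\tCCGG\t88.0\n"
)
ranked = sort_by_score(example)
```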

Running time

The running time of GeMprospector depends on the combined length of the uploaded sequences, and on how many markers are found. For the demonstration file containing five legume sequences of 3 kilobases combined length, the complete analysis takes ∼40 s. Running GeMprospector on our unpublished collection of 1081 Arachis hypogaea EST clusters (total 0.6 megabases, file size 641 KB) took 5 min against the legume database. For the full set of 9484 P. vulgaris gene indices (total 6.3 megabases, file size 7.2 MB) against the legume database, the analysis was completed in 46 min. Finally, we also compared a set of 7205 gene indices from maize (total 5.5 megabases, file size 6.1 MB) against the grass database which took 1 h and 19 min (see Table 1). Currently, to limit the work load on our server for the benefit of other users, there is a file size maximum of 10 MB.

Table 1

Analysis times for four particular runs with different EST collections

Uploaded data	Database	Sequences	Nucleotides	File size (bytes)	Markers found	Analysis time
Website test file	legumes	5	2954	3130	2	40 s
Arachis EST clusters	legumes	1081	614 424	640 916	11	5 min
Phaseolus GIs	legumes	9484	6 344 504	7 187 446	117	46 min
Maize GIs	grasses	7205	5 482 827	6 067 588	312	79 min
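The figures in Table 1 can be turned into a rough throughput estimate (nucleotides analysed per second). The short calculation below uses only the nucleotide counts and analysis times from the table; the variable names are illustrative, and the result is an estimate for these four runs, not a general performance guarantee.

```python
# (nucleotides, analysis time in seconds) taken from Table 1
runs = {
    "Website test file": (2954, 40),
    "Arachis EST clusters": (614_424, 5 * 60),
    "Phaseolus GIs": (6_344_504, 46 * 60),
    "Maize GIs": (5_482_827, 79 * 60),
}

throughput = {name: nt / seconds for name, (nt, seconds) in runs.items()}

for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} nt/s")
```

The maize run processes fewer nucleotides per second than the Phaseolus run despite a smaller input, which fits the statement that running time also depends on how many markers are found (312 versus 117).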

DISCUSSION

GeMprospector is a specialized tool for the design of cross-species marker candidates using user-submitted sequence data originating from either the legume family or the grass family. Because the user data are merged with a database of groups of mutually homologous sequences, submitted ESTs from only one species can still produce cross-species marker candidates. When mapped, such cross-species markers will allow information transfer through syntenic relationships between important crop and model plants. Any new primer pair proposed by GeMprospector will have a very high probability of amplifying an intron in the species of the submitted sequences. The primer pair is also likely to amplify introns in any other species within the clade represented by the aligned sequences. Furthermore, the primer pair is an educated guess for amplifying introns in species outside that clade. For example, we have currently designed 459 cross-species marker candidates in legumes; so far, 76 of these have been tested, resulting in the successful development of 56 markers in the ‘in-group’ bean and 43 in the ‘out-group’ peanut (2).

We have chosen the legume and grass families because of the availability of genomic sequences from the well-studied rice, Lotus japonicus and Medicago truncatula, and because of the enormous importance of these families to humans (7). In principle, ESTs from outside the legume and grass families might also align, but they are likely too different from the database gene indices to allow multiple sequence alignments of sufficient quality to meet PriFi’s primer design requirements. Our focus here is on family-wide anchor primer design, i.e. primers with potential to amplify sequences from distantly related members of the same plant family. Longer primers are expected to be less sensitive to the mismatches between primer and target which are likely to occur in this setup; therefore, with the current settings GeMprospector suggests primers with lengths between 18 and 35 nt. For our legume database, the average primer length is 29.5 nt; for the grasses, it is 28.7 nt.

We are planning a future GeMprospector version whose databases include maize and wheat, species for which large EST collections also exist. This will certainly lead to many additional markers, but with a potentially narrower application. We imagine a comprehensive tool which lets users pick individual species from a given set and combine their sequences with their own uploaded set in a ‘user-designed’ database, targeting the results to the user's specific needs.

GeMprospector allows maximal information gain from new legume/grass EST sequence collections when designing candidate cross-species genetic markers. Results of an analysis are only accessible to the submitting user.
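The numbers quoted in this section lend themselves to a small worked calculation: the fraction of tested marker candidates that succeeded in each species, and a check of a primer against the accepted length range of 18 to 35 nt. The names below are illustrative, and the length check covers only this one criterion, not PriFi's other requirements.

```python
MIN_PRIMER_NT, MAX_PRIMER_NT = 18, 35  # accepted primer lengths from the text


def primer_length_ok(primer):
    """True if the primer length lies within the accepted range."""
    return MIN_PRIMER_NT <= len(primer) <= MAX_PRIMER_NT


# Validation figures quoted in the Discussion.
tested = 76
successes = {"bean (in-group)": 56, "peanut (out-group)": 43}
rates = {species: n / tested for species, n in successes.items()}

for species, rate in rates.items():
    print(f"{species}: {rate:.1%} of tested candidates became markers")
```

Roughly three quarters of the tested candidates worked in bean and somewhat more than half in peanut, in line with the expectation that primers perform better within the clade of the represented sequences than outside it.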

REFERENCES

1. Comparative anchor tagged sequences (CATS) for integrative mapping of mammalian genomes.

Authors: L A Lyons; T F Laughlin; N G Copeland; N A Jenkins; J E Womack; S J O'Brien
Journal: Nat Genet Date: 1997-01

2. A general pipeline for the development of anchor markers for comparative genomics in plants.

Authors: Jakob Fredslund; Lene H Madsen; Birgit K Hougaard; Anna Marie Nielsen; David Bertioli; Niels Sandal; Jens Stougaard; Leif Schauser
Journal: BMC Genomics Date: 2006-08-14

3. The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species.

Authors: J Quackenbush; J Cho; D Lee; F Liang; I Holt; S Karamycheva; B Parvizi; G Pertea; R Sultana; J White
Journal: Nucleic Acids Res Date: 2001-01-01

4. Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release.

Authors: Brian J Haas; Jennifer R Wortman; Catherine M Ronning; Linda I Hannick; Roger K Smith; Rama Maiti; Agnes P Chan; Chunhui Yu; Maryam Farzad; Dongying Wu; Owen White; Christopher D Town
Journal: BMC Biol Date: 2005-03-22

5. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05

6. PriFi: using a multiple alignment of related sequences to find primers for amplification of homologs.

Authors: Jakob Fredslund; Leif Schauser; Lene H Madsen; Niels Sandal; Jens Stougaard
Journal: Nucleic Acids Res Date: 2005-07-01

7. Legumes: importance and constraints to greater use.

Authors: Peter H Graham; Carroll P Vance
Journal: Plant Physiol Date: 2003-03
