
AutoSNPdb: an annotated single nucleotide polymorphism database for crop plants.

Chris Duran, Nikki Appleby, Terry Clark, David Wood, Michael Imelfort, Jacqueline Batley, David Edwards.

Abstract

Single nucleotide polymorphisms (SNPs) may be considered the ultimate genetic marker as they represent the finest resolution of a DNA sequence (a single nucleotide), are generally abundant in populations and have a low mutation rate. Analysis of assembled EST sequence data provides a cost-effective means to identify large numbers of SNPs associated with functional genes. We have developed an integrated SNP discovery pipeline, which identifies SNPs from assembled EST sequences. The results are maintained in a custom relational database along with EST source and annotation information. The current database hosts data for the important crops rice, barley and Brassica. Users may rapidly identify polymorphic sequences of interest through BLAST sequence comparison, keyword searches of annotations derived from UniRef90 and GenBank comparisons, GO annotations or in genes corresponding to syntenic regions of reference genomes. In addition, SNPs between specific varieties may be identified for targeted mapping and association studies. SNPs are viewed using a user-friendly graphical interface. The database is freely accessible at http://autosnpdb.qfab.org.au/.


Year: 2008; PMID: 18854357; PMCID: PMC2686484; DOI: 10.1093/nar/gkn650

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

Molecular genetic markers describe genetic variations and provide a link between observed phenotypes and the underlying genotype. The development of high-throughput methods for the detection of single nucleotide polymorphisms (SNPs) and small insertion/deletions (indels) has led to a revolution in their use as molecular markers. SNPs may be considered the ultimate genetic marker as they represent the finest resolution of a DNA sequence, are generally abundant in populations and have a low mutation rate (1). However, SNP markers can be costly to develop, especially where re-sequencing from multiple individuals is required. The mining of readily available sequence data significantly reduces the costs associated with SNP discovery (2). The principal challenge in SNP discovery remains the discrimination between true genetic polymorphisms and the often more abundant sequence errors. Where sequence trace files are available to filter polymorphisms in traces of dubious quality, these can be used to differentiate between true SNPs and sequence error (3). Where trace files are unavailable, the identification of sequence errors can be based on two further methods to determine SNP confidence: redundancy of the polymorphism in a sequence alignment, and co-segregation of putative SNPs with haplotype. The frequency of occurrence of a polymorphism at a particular locus provides a measure of confidence that the SNP represents a true polymorphism; this is referred to as the SNP redundancy score. In addition, true SNPs that represent divergence between homologous genes co-segregate to define a conserved haplotype. A co-segregation score based on whether a SNP position contributes to defining a haplotype provides a second independent measure of SNP confidence. The SNP redundancy score and co-segregation score together provide an effective means for estimating confidence in the validity of SNPs independently of sequence trace files (4–6). 
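The two confidence measures described above can be illustrated with a minimal Python sketch. It is not the autoSNP implementation. It assumes that the redundancy score is the number of reads carrying the second most common allele at a position, and that two positions co-segregate when their alleles pair one-to-one across the reads covering both. The function names and the use of None for uncovered positions are hypothetical.

```python
from collections import Counter


def redundancy_score(column):
    """Number of reads carrying the second most common allele in one column.

    `column` holds one base per read; None marks a read that does not
    cover the position (an assumed convention, not from the paper).
    """
    counts = Counter(base for base in column if base is not None)
    if len(counts) < 2:
        return 0  # monomorphic column: no candidate SNP
    return counts.most_common(2)[1][1]


def cosegregates(col_a, col_b):
    """True if two columns split the reads covering both into the same groups."""
    pairs = {(a, b) for a, b in zip(col_a, col_b)
             if a is not None and b is not None}
    alleles_a = {a for a, _ in pairs}
    alleles_b = {b for _, b in pairs}
    # A one-to-one pairing of alleles means each position predicts the
    # other, which is the pattern expected for a shared haplotype.
    return len(alleles_a) > 1 and len(pairs) == len(alleles_a) == len(alleles_b)


# Illustrative columns: one base per read, in the same read order.
snp_1 = ["A", "A", "G", "G", None]
snp_2 = ["C", "C", "T", "T", "T"]
snp_3 = ["C", "T", "C", "T", "C"]

print(redundancy_score(snp_1))     # 2
print(cosegregates(snp_1, snp_2))  # True
print(cosegregates(snp_1, snp_3))  # False
```

A sequencing error typically appears in a single read and rarely follows the haplotype pattern of neighbouring positions, so it scores poorly on both measures.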
We have combined SNP discovery software and sequence annotation within the relational database schema of autoSNPdb to enable the efficient identification of SNP and indel polymorphisms related to specific genes or traits. Here, we present the application of autoSNPdb to barley, rice and Brassica species. AutoSNPdb has a flexible interface that supports a variety of queries. Users may search for SNPs within genes of predicted function, and through sequence identity with known genes. In addition, further levels of annotation and novel queries specific to areas of interest can be added. In the current version, we include plant cultivar information to allow the identification of SNPs that discriminate between plant cultivars.

METHODS

Data processing

Brassica, rice and barley expressed sequences were downloaded from GenBank release 159. RepeatMasker (www.repeatmasker.org) was used to identify and mask repeats prior to assembly using CAP3 (7) with the parameters -p 90 and -o 50. The resulting assemblies and singleton sequences were parsed into a MySQL database. SNP discovery used the autoSNP method (4) implemented with custom Perl scripts, and the results were parsed to the database. Assemblies containing four sequences or more were examined for polymorphisms, with gaps created during the assembly process classified as indels. The minimum redundancy score defining a polymorphism was varied in proportion to the number of sequences in the assembly at the SNP position. A minimum redundancy score of 2 was required for up to 7 sequences; 3 for between 8 and 11 sequences; 4 for between 12 and 19 sequences; and 5 for SNPs represented by 20 or more sequence reads. Each SNP was compared to all SNPs in an assembly to calculate the SNP co-segregation score, with the weighted co-segregation score calculated according to the proportion of missing data at that position in the assembly. Input sequences were annotated with cultivar type, tissue source and developmental stage where available. Consensus and singleton sequences were annotated based on sequence alignment using BLAST (8) against GenBank and UniRef90 databases. Gene Ontology (GO) annotations were derived from UniRef90 annotations. Comparative rice and Arabidopsis genome positions were derived by WU-BLAST comparison with TIGR rice pseudo-chromosomes (version 5) and TAIR Arabidopsis pseudo-chromosomes (v01222004), respectively.
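The depth-dependent thresholds above can be written as a short Python sketch. The pipeline itself used custom Perl scripts, so this is an illustration under stated assumptions rather than the original code. The threshold values and the four-sequence minimum are taken from the text. The function names, the use of a space for positions a read does not cover, and the definition of the redundancy score as the count of the second most common allele are assumptions.

```python
from collections import Counter


def min_redundancy(depth):
    """Minimum redundancy score for a position covered by `depth` reads."""
    if depth <= 7:
        return 2
    if depth <= 11:
        return 3
    if depth <= 19:
        return 4
    return 5


def call_snps(reads, min_assembly_size=4):
    """Return (position, allele_counts) for columns that pass the threshold.

    `reads` are equal-length aligned strings. '-' is an assembly gap and is
    kept as an allele (reported as an indel); ' ' marks a position the read
    does not cover (an assumed convention).
    """
    if len(reads) < min_assembly_size:
        return []  # assemblies with fewer than four sequences are skipped
    calls = []
    for pos, column in enumerate(zip(*reads)):
        counts = Counter(base for base in column if base != " ")
        if len(counts) < 2:
            continue  # no polymorphism at this position
        depth = sum(counts.values())
        minor = counts.most_common(2)[1][1]  # assumed redundancy score
        if minor >= min_redundancy(depth):
            calls.append((pos, dict(counts)))
    return calls


# Illustrative assembly of five aligned reads.
assembly = ["ACGT", "ACGT", "ATGT", "ATGT", "ACG-"]
print(call_snps(assembly))  # [(1, {'C': 3, 'T': 2})]
```

In this example the gap in the last column occurs in only one read, so it falls below the minimum score of 2 and is not reported.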

Database content, access and interface

Barley, rice and Brassica sequences were downloaded from GenBank and processed through the autoSNPdb pipeline. A custom web interface allows users to query and visualize the SNP and annotation data (Figure 1). The maintenance of these data within a relational database enables numerous query options. Sequence annotations may be searched by gene keyword, sequence ID, GO term or through similarity to defined regions of the rice or Arabidopsis genome. A BLAST interface enables identification by sequence similarity. SNPs that differentiate between cultivars may be retrieved, providing a valuable resource for genetic mapping and association studies. To aid interpretation of the predicted SNP data, SNPs are viewed graphically as vertical bars, where the position of the bar along the x-axis reflects the relative position of the SNP in the consensus sequence; the height of the bar represents the SNP redundancy score; and the bar colour reflects the SNP-weighted co-segregation score. Information about each SNP is displayed by moving the cursor over the bar, while selecting a bar centres the sequence assembly at that position. The sequence assembly may be moved using the scroll bar and can be toggled between the full sequence assembly and a SNP summary. Labels to the left of the sequence may also be toggled between cultivars, GenBank accession numbers, tissue type and developmental stage for the respective sequences. The interface is documented with help pages and database build information.
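As an illustration of how a relational layout supports the cultivar query, the following Python sketch uses an in-memory SQLite database. The autoSNPdb schema is not given in the text and the database itself runs on MySQL, so the table, its columns, the cultivar labels and the rows are all hypothetical. The query returns SNP positions at which each of two cultivars carries a single allele and the two alleles differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per read per SNP position.
conn.execute(
    "CREATE TABLE read_allele ("
    "assembly_id TEXT, snp_position INTEGER, accession TEXT, "
    "cultivar TEXT, base TEXT)"
)
rows = [  # made-up example data
    ("contig1", 120, "EST001", "cv_A", "A"),
    ("contig1", 120, "EST002", "cv_A", "A"),
    ("contig1", 120, "EST003", "cv_B", "G"),
    ("contig1", 120, "EST004", "cv_B", "G"),
    ("contig1", 305, "EST001", "cv_A", "C"),
    ("contig1", 305, "EST003", "cv_B", "C"),
    ("contig1", 305, "EST004", "cv_B", "T"),
]
conn.executemany("INSERT INTO read_allele VALUES (?, ?, ?, ?, ?)", rows)


def discriminating_snps(conn, cultivar_a, cultivar_b):
    """SNPs where each cultivar is fixed for one allele and the alleles differ."""
    query = """
        SELECT a.assembly_id, a.snp_position, a.base, b.base
        FROM (SELECT assembly_id, snp_position, MIN(base) AS base
              FROM read_allele WHERE cultivar = ?
              GROUP BY assembly_id, snp_position
              HAVING COUNT(DISTINCT base) = 1) AS a
        JOIN (SELECT assembly_id, snp_position, MIN(base) AS base
              FROM read_allele WHERE cultivar = ?
              GROUP BY assembly_id, snp_position
              HAVING COUNT(DISTINCT base) = 1) AS b
          ON a.assembly_id = b.assembly_id
         AND a.snp_position = b.snp_position
        WHERE a.base <> b.base
        ORDER BY a.assembly_id, a.snp_position
    """
    return conn.execute(query, (cultivar_a, cultivar_b)).fetchall()


print(discriminating_snps(conn, "cv_A", "cv_B"))
# [('contig1', 120, 'A', 'G')]
```

Position 305 is excluded because reads from cv_B carry both alleles there, so that SNP would not reliably separate the two cultivars in a mapping population.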

Figure 1.

The autoSNPdb web interface displaying the sequence assembly, predicted SNPs as vertical bars and SNP details in a mouse-over box.

FUTURE DIRECTIONS

The autoSNPdb system was developed for flexible use and can be extended to a broad range of annotations and species. We plan to extend the system to other crops, including wheat, and to next-generation Roche 454 sequence data.

FUNDING

Funding for open access charge: the Australian Research Council.

Conflict of interest statement: None declared.

REFERENCES

1. A C Syvänen. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet, 2001.

2. P Taillon-Miller; Z Gu; Q Li; L Hillier; P Y Kwok. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res, 1998.

3. G T Marth; I Korf; M D Yandell; R T Yeh; Z Gu; H Zakeri; N O Stitziel; L Hillier; P Y Kwok; W R Gish. A general approach to single-nucleotide polymorphism discovery. Nat Genet, 1999.

4. Gary Barker; Jacqueline Batley; Helen O'Sullivan; Keith J Edwards; David Edwards. Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics, 2003.

5. Jacqueline Batley; Gary Barker; Helen O'Sullivan; Keith J Edwards; David Edwards. Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol, 2003.

6. David Savage; Jacqueline Batley; Tim Erwin; Erica Logan; Christopher G Love; Geraldine A C Lim; Emmanuel Mongin; Gary Barker; German C Spangenberg; David Edwards. SNPServer: a real-time SNP discovery tool. Nucleic Acids Res, 2005.

7. X Huang; A Madan. CAP3: A DNA sequence assembly program. Genome Res, 1999.

8. S F Altschul; W Gish; W Miller; E W Myers; D J Lipman. Basic local alignment search tool. J Mol Biol, 1990.
