
SNP-Seek database of SNPs derived from 3000 rice genomes.

Nickolai Alexandrov¹, Shuaishuai Tai², Wensheng Wang³, Locedie Mansueto⁴, Kevin Palis⁴, Roven Rommel Fuentes⁴, Victor Jun Ulat⁴, Dmytro Chebotarov⁴, Gengyun Zhang⁵, Zhikang Li⁶, Ramil Mauleon⁴, Ruaraidh Sackville Hamilton⁴, Kenneth L McNally⁴.

Abstract

We have identified about 20 million rice SNPs by aligning reads from the 3000 rice genomes project with the Nipponbare genome. The SNPs and allele information are organized into the SNP-Seek system (http://www.oryzasnp.org/iric-portal/), which consists of an Oracle database holding close to 60 billion rows of SNP genotypes (20 M SNPs × 3 K rice lines) and a web interface for convenient querying. The database allows quick retrieval of SNP alleles for all varieties in a given genome region, finding different alleles from predefined varieties and querying basic passport and morphological phenotypic information about sequenced rice lines. SNPs can be visualized together with the gene structures in the JBrowse genome browser. Evolutionary relationships between rice varieties can be explored using phylogenetic trees or multidimensional scaling plots.

Year: 2014 PMID: 25429973 PMCID: PMC4383887 DOI: 10.1093/nar/gku1039

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The current rate of rice yield increase from traditional breeding is insufficient to feed the growing population in the near future (1). The observed trends in climate change and air pollution pose even bigger threats to the global food supply (2). A promising solution to this problem is the application of modern molecular breeding technologies to ongoing rice breeding programs. This approach has been used to improve disease resistance, drought tolerance and other agronomically important traits (3–5). Understanding the differences in genome structures, combined with phenotyping observations, gene expression and other information, is an important step toward establishing gene-trait associations, building predictive models and applying these models in the breeding process. The 3000 rice genomes project (6) produced millions of genomic reads for a diverse set of rice varieties. The SNP-Seek database is designed to provide user-friendly access to the single nucleotide polymorphisms (SNPs) identified from these data. Short, 83 bp paired-end Illumina reads were aligned using the BWA program (7) to the Nipponbare temperate japonica genome assembly (8), resulting in an average of 14× coverage of the rice genome across all the varieties. SNP calls were made using the GATK pipeline (9) as described in (6).

SNP DATA

For the SNP-Seek database we have considered only SNPs, ignoring indels. The union of all SNPs extracted from the 3000 VCF files consists of 23 M SNPs. To eliminate potentially false SNPs, we have retained only SNPs that have the minor allele in at least two different varieties. The number of such SNPs is 20 M. All the genotype calls at these positions were combined into one file of ∼20 M × 3 K SNP calls, and the data were loaded into an Oracle schema using three main tables: STOCK, SNP and SNP_GENOTYPE (Figure 1). Some varieties lack reads mapping to a given SNP position, and for them no SNP calls were recorded. The distribution of the SNP coverage is shown in Figure 2. About 90% of all SNP calls have four or more supporting reads. Of these, 98% have a major allele frequency >90% and are considered to be homozygous, 1.1% have two alleles with frequencies between 40 and 60% and are considered to be heterozygous, and the remaining 0.9% represent other cases in which the SNP call could not be classified as either heterozygous or homozygous. More than 98% of SNPs have exactly two different allelic variants in the 3000 varieties, 1.7% of SNPs have three variants and 0.02% of SNPs have all four nucleotides in different genomes mapped to that SNP position. There are 2.3× more transitions than transversions in our database (Table 1).
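The filtering and classification rules described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the database: the function names, the input layouts, the inclusive treatment of the 40–60% boundaries and the reading of "minor allele" as the second most common allele are all assumptions made for illustration.

```python
from collections import Counter

MIN_READS = 4             # calls with fewer supporting reads are not classified
HOM_MIN_FREQ = 0.90       # major allele frequency above this: homozygous
HET_RANGE = (0.40, 0.60)  # both top alleles inside this range: heterozygous


def classify_call(read_alleles):
    """Classify one SNP call from the alleles seen in its supporting reads.

    `read_alleles` is an iterable of single-letter alleles, one per read
    (a hypothetical input layout).
    """
    counts = Counter(read_alleles)
    depth = sum(counts.values())
    if depth < MIN_READS:
        return "low_coverage"
    ranked = counts.most_common(2)
    major_freq = ranked[0][1] / depth
    if major_freq > HOM_MIN_FREQ:
        return "homozygous"
    if len(ranked) == 2:
        second_freq = ranked[1][1] / depth
        low, high = HET_RANGE
        if low <= major_freq <= high and low <= second_freq <= high:
            return "heterozygous"
    return "other"


def keep_snp(genotypes):
    """Keep a SNP only if its minor allele occurs in at least two varieties.

    `genotypes` maps variety name -> called allele (hypothetical layout).
    The minor allele is taken to be the second most common one (assumption).
    """
    ranked = Counter(genotypes.values()).most_common()
    return len(ranked) >= 2 and ranked[1][1] >= 2


balanced_call = classify_call("AAAAAGGGGG")   # 5 reads of each allele
singleton_snp = keep_snp({"v1": "A", "v2": "A", "v3": "G"})
```

In this sketch `balanced_call` is classified as heterozygous, while `singleton_snp` is rejected because the alternative allele is seen in only one variety.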

Figure 1.

Basic schema of the SNP-Seek database
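As a rough illustration of a three-table layout of this kind, the following Python sketch builds a toy version in an in-memory SQLite database and runs a region query against it. Only the table names STOCK, SNP and SNP_GENOTYPE come from the text. The column names, the index, the toy rows and the use of SQLite in place of Oracle are assumptions, and the actual schema in Figure 1 may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (                 -- one row per sequenced variety
    stock_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE snp (                   -- one row per SNP position
    snp_id     INTEGER PRIMARY KEY,
    chromosome INTEGER NOT NULL,
    position   INTEGER NOT NULL,
    ref_allele TEXT NOT NULL
);
CREATE TABLE snp_genotype (          -- one row per (variety, SNP) call
    stock_id INTEGER NOT NULL REFERENCES stock(stock_id),
    snp_id   INTEGER NOT NULL REFERENCES snp(snp_id),
    allele   TEXT NOT NULL,
    PRIMARY KEY (stock_id, snp_id)
);
CREATE INDEX snp_by_region ON snp (chromosome, position);
""")

# Toy rows, invented for illustration only.
conn.executemany("INSERT INTO stock VALUES (?, ?)",
                 [(1, "variety_a"), (2, "variety_b")])
conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)",
                 [(1, 1, 100, "A"), (2, 1, 250, "C"), (3, 2, 100, "G")])
# variety_b has no row for SNP 2: a missing call is simply not stored.
conn.executemany("INSERT INTO snp_genotype VALUES (?, ?, ?)",
                 [(1, 1, "A"), (2, 1, "G"), (1, 2, "C"), (2, 3, "T")])


def calls_in_region(connection, chromosome, start, end):
    """Return (variety, position, allele) for all calls in a region."""
    query = """
        SELECT stock.name, snp.position, snp_genotype.allele
        FROM snp_genotype
        JOIN snp   ON snp.snp_id = snp_genotype.snp_id
        JOIN stock ON stock.stock_id = snp_genotype.stock_id
        WHERE snp.chromosome = ? AND snp.position BETWEEN ? AND ?
        ORDER BY snp.position, stock.name
    """
    return connection.execute(query, (chromosome, start, end)).fetchall()


region_calls = calls_in_region(conn, 1, 50, 300)
```

Storing one row per recorded call keeps missing calls implicit, which matches the statement that no SNP call is recorded when a variety lacks reads at a position.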

Figure 2.

Distribution of SNP coverage

Table 1.

Types of allele variants and their frequencies in rice SNPs

Allele variants	Frequency,%
A/G + C/T	70
A/C + G/T	15
A/T	9
C/G	6
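The transition-to-transversion ratio quoted in the text can be checked directly against Table 1. The short Python sketch below does this arithmetic; the helper `is_transition` and the variable names are illustrative and not part of SNP-Seek.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(allele_1, allele_2):
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    pair = {allele_1, allele_2}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


# Frequencies (%) copied from Table 1, keyed by the allele pairs in each row.
table_1 = {
    (("A", "G"), ("C", "T")): 70,
    (("A", "C"), ("G", "T")): 15,
    (("A", "T"),): 9,
    (("C", "G"),): 6,
}

transition_pct = 0
transversion_pct = 0
for allele_pairs, percent in table_1.items():
    # Every pair within one table row belongs to the same class.
    if all(is_transition(a, b) for a, b in allele_pairs):
        transition_pct += percent
    else:
        transversion_pct += percent

ts_tv_ratio = transition_pct / transversion_pct
```

With the values in Table 1, transitions account for 70% and transversions for 30% of SNPs, so the ratio is 70/30, which rounds to the 2.3 stated above.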

Not all SNPs have been called in all varieties. Actually, the distribution of the called SNPs among varieties is bimodal, with one mode at about 18 M SNP calls corresponding to japonica varieties which are close to the reference genome, and the second peak at about 14 M corresponding to the other varieties (Figure 3).

Figure 3.

SNP distribution by varieties. The major peak shows that about 14 M SNPs have been called in most varieties. The bimodal plot indicates that a fraction of SNPs are missing in some varieties, likely due to lack of mapped reads in variable regions.

GENOME ANNOTATION DATA

We used the CHADO database schema (10) to store the Nipponbare reference genome and gene annotation, downloaded from the MSU rice web site (http://rice.plantbiology.msu.edu/) (8). To browse and visualize genes and SNPs in the rice genome, we integrated the JBrowse genome browser (11) as a feature of our site.

PASSPORT AND MORPHOLOGICAL DATA

Most of the 3000 varieties (and eventually all) are conserved in the International Rice genebank housed at IRRI (12). Passport and basic morphological data from the source accession for the purified genetic stock are accessible via SNP-Seek.

INTERFACES

We deployed interfaces to facilitate the following major types of queries: (i) for two varieties, find all SNPs from a gene or genomic region that differentiate them; (ii) for a gene or genome region, show all SNP calls for all varieties (Supplementary Figure S1); (iii) find all sequenced varieties from a certain country or subpopulation, which can be viewed as a phylogenetic tree, built using the TreeConstructor class from BioJava (13) and rendered using the jsPhyloSVG JavaScript library (14) (Supplementary Figure S2), or as a multidimensional scaling plot (Figure 4). The results of a SNP search can be viewed as a table, exported to text files or visualized in JBrowse.
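Query type (i) can be sketched in a few lines of Python. This is only an illustration of the logic on an in-memory dictionary, not the SNP-Seek implementation, which runs against the Oracle database. The function name, the data layout and the toy genotypes are assumptions, as is the choice to skip positions where either variety has no call.

```python
def differing_snps(genotypes, variety_a, variety_b, start, end):
    """List SNPs in [start, end] at which two varieties carry different alleles.

    `genotypes` maps variety name -> {position: allele} for one chromosome
    (hypothetical layout). Positions lacking a call in either variety are
    skipped, since no difference can be asserted there.
    """
    calls_a = genotypes[variety_a]
    calls_b = genotypes[variety_b]
    shared_positions = calls_a.keys() & calls_b.keys()
    return sorted(
        (position, calls_a[position], calls_b[position])
        for position in shared_positions
        if start <= position <= end and calls_a[position] != calls_b[position]
    )


# Toy genotypes, invented for illustration only.
toy_genotypes = {
    "variety_a": {100: "A", 250: "C", 400: "G"},
    "variety_b": {100: "G", 250: "C", 900: "T"},
}

differences = differing_snps(toy_genotypes, "variety_a", "variety_b", 1, 500)
```

Here only position 100 is reported: position 250 carries the same allele in both varieties, and positions 400 and 900 are each called in only one of them.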

Figure 4.

Multidimensional scaling plot of the 3000 rice varieties. Ind1, ind2 and ind3 are three groups of indica rice, indx corresponds to other indica varieties, temp is temperate japonica, trop is tropical japonica, temp/trop and trop/temp are admixed temperate and tropical japonica varieties, japx is other japonica varieties, aus is aus, inax is admixed aus and indica, aro is aromatic and admix is all other unassigned varieties.

USE CASE EXAMPLE FOR QUERYING A REGION OF INTEREST

We used Rice SNP-Seek database to quickly examine the diversity of the entire panel at a particular region of interest. We chose the sd-1 gene as test case due to its scientific importance in rice breeding. This semi-dwarf locus, causing a semi-dwarf stature of rice, was discovered by three different research groups to be a spontaneous mutation of GA 20-oxidase (formally named sd-1 gene), originating from the Taiwanese indica variety Deo-woo-gen. Its incorporation into IR8 and other varieties by rice breeding programs spurred the First Green Revolution in rice production in the late 1960s (15). Sd-1 is annotated in the Nipponbare genome by Michigan State University's Rice Genome Annotation Project as LOC_Os01g66100, on chromosome 1 from position 38 382 382 to 38 385 504 base pairs. On the home page of SNP-Seek, the module was opened and the coordinates of sd-1 were used to define the region to retrieve all SNPs, with checked to select from all the varieties. Clicking on button resulted in the identification of 80 SNP positions (Supplementary Figure S1). An overall view of the SNP positions in the polymorphic panel shows at least eight distinct SNP blocks (Figure 5). In this particular panel group of mostly temperate japonica, two distinct SNP blocks can be seen as shared (Figure 5). Variety information can be obtained by typing the name of the varieties you see on the genome browser into the field of the Variety module. This use case is one of the examples detailed in the module.

Figure 5.

JBrowse view of the SNP genotypes within the sd-1 gene (each variety is one row). Red blocks indicate polymorphism of the variety against Nipponbare. Shared SNP blocks are seen as vertical columns in red. The blue rectangle at the bottom contains varieties that do not have these blocks.

CONCLUSION

We have organized the largest collection of rice SNPs into database structures for convenient querying and provided user-friendly interfaces to find SNPs in given genome regions. We have demonstrated that about 60 billion data points can be loaded into an Oracle database and queried with reasonably quick response times. Most of the varieties in the SNP-Seek database have passport and basic phenotypic data inherited from their source accession, enabling genome-wide or gene-specific tests of association. The database is developing quickly and will be expanded in the near future to include short indels, larger structural variations and SNP calls made against other rice reference genomes.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

14 in total

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

Review 2. Genetic engineering and breeding of drought-resistant crops.

Authors: Honghong Hu; Lizhong Xiong
Journal: Annu Rev Plant Biol Date: 2013-12-02 Impact factor: 26.379

3. JBrowse: a next-generation genome browser.

Authors: Mitchell E Skinner; Andrew V Uzilov; Lincoln D Stein; Christopher J Mungall; Ian H Holmes
Journal: Genome Res Date: 2009-07-01 Impact factor: 9.043

Review 4. Disease resistance in rice and the role of molecular breeding in protecting rice crops against diseases.

Authors: Shah Fahad; Lixiao Nie; Faheem Ahmed Khan; Yutiao Chen; Saddam Hussain; Chao Wu; Dongliang Xiong; Wang Jing; Shah Saud; Farhan Anwar Khan; Yong Li; Wei Wu; Fahad Khan; Shah Hassan; Abdul Manan; Amanullah Jan; Jianliang Huang
Journal: Biotechnol Lett Date: 2014-07 Impact factor: 2.461

5. jsPhyloSVG: a javascript library for visualizing interactive and vector-based phylogenetic trees on the web.

Authors: Samuel A Smits; Cleber C Ouverney
Journal: PLoS One Date: 2010-08-18 Impact factor: 3.240

6. A Chado case study: an ontology-based modular schema for representing genome-associated biological information.

Authors: Christopher J Mungall; David B Emmert
Journal: Bioinformatics Date: 2007-07-01 Impact factor: 6.937

7. BioJava: an open-source framework for bioinformatics in 2012.

Authors: Andreas Prlić; Andrew Yates; Spencer E Bliven; Peter W Rose; Julius Jacobsen; Peter V Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L Heuer; H Brandstätter-Müller; Philip E Bourne; Scooter Willis
Journal: Bioinformatics Date: 2012-08-09 Impact factor: 6.937

Review 8. The genes of the Green Revolution.

Authors: Peter Hedden
Journal: Trends Genet Date: 2003-01 Impact factor: 11.639

9. The 3,000 rice genomes project.

Authors:
Journal: Gigascience Date: 2014-05-28 Impact factor: 6.524

10. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data.

Authors: Yoshihiro Kawahara; Melissa de la Bastide; John P Hamilton; Hiroyuki Kanamori; W Richard McCombie; Shu Ouyang; David C Schwartz; Tsuyoshi Tanaka; Jianzhong Wu; Shiguo Zhou; Kevin L Childs; Rebecca M Davidson; Haining Lin; Lina Quesada-Ocampo; Brieanne Vaillancourt; Hiroaki Sakai; Sung Shin Lee; Jungsok Kim; Hisataka Numa; Takeshi Itoh; C Robin Buell; Takashi Matsumoto
Journal: Rice (N Y) Date: 2013-02-06 Impact factor: 4.783
