Hua Peng1,2, Kai Wang1, Zhuo Chen1,2, Yinghao Cao1, Qiang Gao1, Yan Li1, Xiuxiu Li1,2, Hongwei Lu1,2, Huilong Du1,2, Min Lu1,2, Xin Yang1, Chengzhi Liang1,2.
Abstract
To date, large amounts of genomic and phenotypic data have been accumulated in the fields of crop genetics and genomic research, and the data are increasing very quickly. However, the bottleneck to using big data in breeding is integrating the data and developing tools for revealing the relationship between genotypes and phenotypes. Here, we report a rice sub-database of an integrated omics knowledgebase (MBKbase-rice, www.mbkbase.org/rice), which integrates rice germplasm information, multiple reference genomes with a united set of gene loci, population sequencing data, phenotypic data, known alleles and gene expression data. In addition to basic data search functions, MBKbase provides advanced web tools for genotype searches at the population level and for visually displaying the relationship between genotypes and phenotypes. Furthermore, the database also provides online tools for comparing two samples by their genotypes and finding target germplasms by genotype or phenotype information, as well as for analyzing the user submitted SNP or sequence data to find important alleles in the germplasm. A soybean sub-database is planned for release in 3 months and wheat and maize will be added in 1-2 years. The data and tools integrated in MBKbase will facilitate research in crop functional genomics and molecular breeding.
Year: 2020 PMID: 31624841 PMCID: PMC7145604 DOI: 10.1093/nar/gkz921
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic of data, page and function modules in MBKbase. Each data module has many linked pages and functional units for displaying the information related to the module, as exemplified by the Germplasm module in the figure.
The summary of rice data in MBKbase
| | Total num | With pedigree | With source locations | Phenotype: Traits | Phenotype: Value num |
|---|---|---|---|---|---|
| Germplasm | 137 769 | 24 462 | 134 626 | 130 | 4 786 640 |
| WGS sample | 7010 | 1153 | 6351 | 122 | 207 158 |
| | Named gene num | Trait gene num | Verified alleles num | | |
| Known genes | 13 219 | 4821 | 91 | | |
| | Run num | Tissue | Stage or tissue | | |
| RNA-seq | 175 | 20 | 45 | | |
| | | MBK allele num | SNP num (AF ≥ 0.01) | Indel num (AF > 0.005) | |
| Reference | Nip | 51 722 | 14 850 931 | 2 250 804 | |
| Reference | R498 | 54 973 | 13 278 107 | 2 387 538 | |
Figure 2. An example of a locus page showing multiple types of information associated with the locus. Rice gene GS5 controlling grain size is on locus OsG00067204. (A) Two reference genomes contain different alleles at the locus. (B) Genotype table showing all genotypes in the rice population, with reference genome positions in the first row of the table and reference bases in the second row. Genotype IDs are enumerated from T1 as the most frequent genotype (allele or haplotype for homozygous genome). ‘0’ indicates that the genotype at this position is the same as in the reference genome. Capitalized bases indicate homozygous variants, lowercase bases indicate heterozygous variants and ‘-’ indicates missing information. In the REF row, the letter ‘N’ represents a short insertion in the reference genome compared with some other samples. In the GT row, the ‘N’ represents insertion of bases in the samples compared with the reference genome. (C) Distribution of each genotype (allele) in a rice population. For example, the majority of genotype (allele) T1 is found in temperate japonica lines (1251 out of 1769). (D) Boxplot showing the relationship between genotypes and phenotypes. The phenotype can be selected by users from all collected traits in the database. In this case, the locus OsG0067204 is known to be associated with grain width (GW) traits. (E) Expression profile at the locus in different reference genomes in different tissues.
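The cell encoding described for the genotype table in panel B can be illustrated with a small sketch. This is not MBKbase code: the function names `decode_cell` and `rank_genotypes`, the input layout (one tuple of cells per sample) and the tie-breaking behaviour are all assumptions made for illustration; only the symbol meanings and the "T1 is the most frequent genotype" rule come from the caption.

```python
from collections import Counter


def decode_cell(cell, ref_base):
    """Classify one genotype-table cell using the legend of Figure 2B.

    Hypothetical helper: the real table may use multi-base cells or
    other symbols that this sketch does not cover.
    """
    if cell == "-":
        return "missing"
    if cell == "0":
        return f"same as reference ({ref_base})"
    if cell == "N":  # checked before isupper(), since "N" is uppercase
        return "insertion relative to reference"
    if cell.isupper():
        return f"homozygous variant {cell}"
    if cell.islower():
        return f"heterozygous variant {cell}"
    return "unrecognized"


def rank_genotypes(sample_genotypes):
    """Map each distinct genotype to an ID, T1 being the most frequent.

    sample_genotypes: dict of sample name -> tuple of cells at the locus
    (an assumed layout). Ties keep first-seen order.
    """
    counts = Counter(sample_genotypes.values())
    return {
        genotype: f"T{rank}"
        for rank, (genotype, _) in enumerate(counts.most_common(), start=1)
    }


samples = {"s1": ("0", "T"), "s2": ("0", "T"), "s3": ("0", "0")}
genotype_ids = rank_genotypes(samples)
print(genotype_ids[samples["s1"]])  # genotype shared by s1 and s2
print(decode_cell("t", "A"))
```

Here two of the three toy samples share the genotype `("0", "T")`, so it receives the ID T1, mirroring how the caption describes the numbering.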
Figure 3. Advanced functions based on genotype. (A) Advanced germplasm query by genotype (GT). The options in the drop-down list of genotype source include position GT, Locus GT and Verified GT. Users can set up complex logical queries by selecting different combinations of variants. In this example, 177 samples passed the filter with the same genotype. (B) Advanced germplasm query by sequence. Users can submit a genome sequence and identify the germplasms containing the sequence. (C) Comparison of two germplasm samples (A and B) based on Locus GT. X-axis: window position along the chromosome; Y-axis: locus number in the window, positive for sample A, negative for sample B. A red line represents the number of identical genotypes between A and B. A green line represents the number of specific genotypes of sample A. A teal line represents the number of specific genotypes of sample B. A purple line represents the number of missing or low-frequency genotypes (containing sample number <10). (D) Identification of the known alleles in a WGS sample. This can be used for WGS samples either stored in the database or submitted by users through the web interface. (E) Tools for viewing the effect of variation. Clicking on any variation in a reference row of the genotype table, for example ‘T’ (at position 30,315,214 bp), makes the sequence view panel jump automatically to the corresponding position. As shown, this mutation results in an amino acid substitution (K→M).
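The windowed comparison in panel C can be sketched as a simple tally per chromosome window. This is a minimal illustration under stated assumptions, not the MBKbase implementation: the function name `compare_samples`, the input tuple layout, the 1 Mb default window and the rule for handling loci where only one sample has a usable genotype are all hypothetical. Only the four categories and the "fewer than 10 samples" threshold come from the caption.

```python
from collections import defaultdict


def compare_samples(loci, window_size=1_000_000, min_samples=10):
    """Tally locus genotypes of samples A and B per chromosome window.

    loci: iterable of (position, gt_a, n_a, gt_b, n_b), where gt_* is a
    genotype ID (None if missing) and n_* is the number of population
    samples carrying that genotype. This layout is an assumption.
    """
    windows = defaultdict(
        lambda: {"identical": 0, "a_specific": 0,
                 "b_specific": 0, "missing_or_rare": 0}
    )
    for position, gt_a, n_a, gt_b, n_b in loci:
        counts = windows[position // window_size]
        # A genotype is usable if present and carried by enough samples.
        a_ok = gt_a is not None and n_a >= min_samples
        b_ok = gt_b is not None and n_b >= min_samples
        if a_ok and b_ok and gt_a == gt_b:
            counts["identical"] += 1
            continue
        # Assumed rule: a usable but unshared genotype counts as specific.
        if a_ok:
            counts["a_specific"] += 1
        if b_ok:
            counts["b_specific"] += 1
        if not (a_ok and b_ok):
            counts["missing_or_rare"] += 1
    return dict(windows)


example = [
    (100, "T1", 500, "T1", 500),     # identical
    (200, "T1", 500, "T2", 300),     # both specific
    (300, "T3", 4, "T1", 500),       # A is rare, B is specific
    (1_500_000, None, 0, "T2", 300)  # A is missing, B is specific
]
for window, counts in sorted(compare_samples(example).items()):
    print(window, counts)
```

To draw a plot like panel C, the `a_specific` counts would go above the axis and the `b_specific` counts below it, with the window index scaled back to a chromosome position.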