Oryza sativa (Asian cultivated rice) is known to have five subpopulations which are indica, temperate japonica, tropical japonica, aus and aromatic. Of these, indica and japonica, comprised of temperate japonica and tropical japonica, are known as pivotal subpopulations; aus is known to be related to indica, and aromatic is intermediate between indica and japonica (Garris et al., 2005; Thomson et al., 2007; Kovach et al., 2009; Huang et al., 2012; Schatz et al., 2014; Civan et al., 2015; McCouch et al., 2016; Chin et al., 2017). There are genetic barriers, particularly between indica and japonica, which often challenge rice trait improvements by breeding (Kato & Kosaka, 1930; Chen et al., 2008; Kim, Jiang & Koh, 2009; Zhu et al., 2017). Because each subpopulation often has desirable characteristics for cultivars, overcoming the genetic barriers between the subpopulations will help rice breeders freely introgress desirable genes originating from different subpopulations into an elite line. This can eventually minimize the breeding costs and maximize the sustainability of rice production being threatened by climate changes, human population increases and loss of cultivation land. Classifying Oryza sativa is important because it can help breeders develop effective mating paths to overcome the genetic barriers by identifying accessions that can bridge between indica and japonica. Currently, classifying Oryza sativa is widely conducted with genomic tools because genomic data can reflect the variation of subpopulation characteristics at a molecular level (Garris et al., 2005; Kim, Jiang & Koh, 2009; Zhao et al., 2011; McCouch et al., 2016; Chin et al., 2017). However, the sole use of genomic data may have limitations to the understanding of Oryza sativa diversity because some subpopulation-associated traits are related to cytoplasmic effects (Bao, Sun & Corke, 2002; Zhao et al., 2010; Thomson et al., 2007). To overcome the gap that genomic data cannot fill, the use of phenotypic data is reasonable in that it comprehends the genomic and cytoplasmic effects.This study introduces how to computationally classify Oryza sativa accessions into indica and japonica by applying multiple phenotypic variables to the logistic regression model (LRM). This study used a publicly available data source containing information on 413 Oryza sativa accessions and demonstrated that the LRM is a promising tool for accurate indica/japonicaclassification in Oryza sativa by phenotype.
Materials and Methods
Phenotypic data
A set of phenotypic data and subpopulation information was used, which was originally generated and analyzed by Zhao et al. (2011). The data set consisted of 413 accessions originating from 82 countries, obtained at http://ricediversity.org/data/sets/44kgwas. Out of 32 phenotypic variables, 24 variables were selected because they were quantitative and not confined to a certain geography. The selected variables were divided into morphology (culm habit, flag leaf length, flag leaf width), yield components (panicle number per plant, plant height, panicle length, primary panicle branch number, seed number per panicle, florets per panicle, panicle fertility), seed morphology (seed length, seed width, seed volume, seed surface area, brown rice seed length, brown rice seed width, brown rice volume, seed length/width ratio, brown rice length/width ratio), stress tolerance (straighthead susceptibility, blast resistance) and quality (amylose content, alkali spreading value, protein content). Every accession in the data set belonged to one of the following subpopulation groups: admixed (62), aromatic (14), aus (57), indica (87), temperate japonica (96) and tropical japonica (97). In this study, temperate japonica and tropical japonica were combined into japonica.
Stepwise variable selection and parameter estimation
The LRM formula used in this study can be denoted as follows:where P(japonica|x1, …, x) is the probability of an accession being japonica (>0.5) or indica (<0.5) given the predictor variables, x1,…, x; β0 is the constant term; β1,…, β are the parameters for the predictor variables, x1, …, x, respectively.In Eq. (1), the response variable is binary between indica and japonica, and the predictor variables (phenotypic variables) are quantitative. To identify a set of predictor variables that can maximize the indica/japonica separation power, n sets were prepared from 1D to nD, in which the nD is a set containing all possible combinations of n different phenotypic variables (e.g., 1D contains every single phenotypic variable, 2D contains all possible pairs of phenotypic variables, and so forth). The n was increased by one until the maximum P(japonica|x1, …, x) was reached, in which every single selection from the nD set was subject to the following steps:Calculate a set of parameters by fitting the LRM (Eq. (1)) with the phenotypic variables in a selection.Applying the phenotypic variables in the selection used in Step 1 to Eq. (1) with the parameters estimated in Step 1. The resulting value must be between 0 and 1, from which the indica or japonicacan be determined at 0.5.In order to estimate the indica/japonica separation power, a receiver operating characteristic (ROC) curve was implemented using an R package called pROC (Robin et al., 2011). In an ROC space, the x- and y- axes ranging between 0 and 1 represent the false positive rate (FPR or specificity) and true positive rate (TPR or sensitivity), respectively. Thus, an area under an ROCcurve (AUC) can range between 0 and 1. The closer the AUC is to 1, the higher the indica/japonica separation power. By referring to the resulting AUCs, a set of phenotypic variables that maximized the indica/japonica separation power was identified, and the resulting parameters were used to define the customized LRM.
Applications of the customized LRM
1. Estimating the indica/japonicaclassification accuracy: given the 280 indica/japonica accessions, the customized LRM along with the related phenotypic variables were taken into 100 iterations of ten-fold cross-validations, from which the 100 values were obtained. The resulting values were averaged into the indica/japonicaclassification accuracy.2. Estimating the variable interaction: as the LRM uses multiple phenotypic variables, some portion of indica/japonicaclassification power might be derived from interactions between variables. The interaction magnitude for every variable with all other variables was calculated using the H-statistic (Frieman & Popescu, 2008). The resulting H-statisticcan range between 0 and 1 with no interaction resulting in 0 and full interaction resulting in 1. The H-statistic was calculated using an R package, iml (Molnar, Casalicchio & Bischl, 2018).3. Classifying accessions in each minor subpopulation group into indica and japonica: the customized LRM was applied to each minor subpopulation group (aromatic, aus, admixed) to divide accessions into indica and japonica. This examination aimed to observe how accessions in each minor subpopulation group are phenotypically divided from the indica/japonica perspective.
Dendrogram-based indica/japonica classification
To draw dendrograms, the genomic data set was obtained at http://ricediversity.org/data/sets/44kgwas (Zhao et al., 2011). The genomic data set consisted of the accessions in the phenotypic data, genotyped with 36,901 SNPs. Two dendrograms were drawn; one included the 280 indica/japonica accessions, and the other included all 413 accessions. Then, the dendrogram-based indica/japonicaclassifications were compared with the LRM-based indica/japonicaclassifications. Each dendrogam was graphed based on a genetic distance matrix in which the genetic distances between two different accessions were calculated using the following equation:where the IBS is the IBS (identical by state) coefficient between A and B, which can range between 0 and 2.The IBS matrix was computed using Numericware i (Kim & Beavis, 2017) which is freely available at https://figshare.com/articles/Numericware_i/3496787.
Data availability
With the exception of calculating the IBS matrix, all other computations were conducted using R (R Core Team, 2016). The data set and R scripts used in this study are freely available at https://github.com/bongsongkim/logit.regression.rice.
Results
Stepwise variable selection
Figure 1 and Table 1 suggest that more phenotypic variables led to stronger indica/japonica separation power of the LRM, and the fully accurate indica/japonica separation power (AUC = 1) was achieved with a 7D selection. Table 2 summarizes the phenotypic variables yielding the maximum indica/japonica separation power in each set (1D to 7D). The set of phenotypic variables yielding AUC = 1 comprises panicle number per plant, seed number per panicle, florets per panicle, panicle fertility, straighthead susceptibility, blast resistance and protein content.
Figure 1
The best indica/japonica classification for each set (1D to 7D).
(A) The best indica/japonica classification in 1D set, (B) the ROC curve for Fig. 1A, (C) the best indica/japonica classification in 2D set, (D) the ROC curve for Fig. 1C, (E) the best indica/japonica classification in 3D set, (F) the ROC curve for Fig. 1E, (G) the best indica/japonica classification in 4D set, (H) the ROC curve for Fig. 1G, (I) the best indica/japonica classification in 5D set, (J) the ROC curve for Fig. 1I, (K) the best indica/japonica classification in 6D set, (L) the ROC curve for Fig. 1K, (M) the best indica/japonica classification in 7D set, (N) the ROC curve for Fig. 1M.
Table 1
Summary of the resulting AUCs obtained in each set (1D to 7D).
Set name
Minimum
Median
Mean
Maximum
1D
0.5129
0.6937
0.6868
0.9396
2D
0.5192
0.7968
0.7829
0.9762
3D
0.5507
0.8441
0.8413
0.9892
4D
0.5861
0.8789
0.8805
0.9927
5D
0.6101
0.9045
0.9081
0.9968
6D
0.6299
0.9266
0.9283
0.9977
7D
0.6785
0.9435
0.9437
1.0000
Table 2
Summary of predictor variables yielding the maximum indica/japonica separation power in each set (1D to 7D).
Set name
Predictor variables
1D
panicle number per plant
2D
panicle number per plant, brown rice seed width
3D
panicle number per plant, straighthead susceptibility, blast resistance
4D
panicle number per plant, brown rice volume, straighthead susceptibility, blast resistance
5D
panicle number per plant, brown rice volume, straighthead susceptibility, blast resistance, protein content
6D
panicle number per plant, seed number per panicle, florets per panicle, panicle fertility, straighthead susceptibility, blast resistance
7D
panicle number per plant, seed number per panicle, florets per panicle, panicle fertility, straighthead susceptibility, blast resistance, protein content
The best indica/japonica classification for each set (1D to 7D).
(A) The best indica/japonicaclassification in 1D set, (B) the ROCcurve for Fig. 1A, (C) the best indica/japonicaclassification in 2D set, (D) the ROCcurve for Fig. 1C, (E) the best indica/japonicaclassification in 3D set, (F) the ROCcurve for Fig. 1E, (G) the best indica/japonicaclassification in 4D set, (H) the ROCcurve for Fig. 1G, (I) the best indica/japonicaclassification in 5D set, (J) the ROCcurve for Fig. 1I, (K) the best indica/japonicaclassification in 6D set, (L) the ROCcurve for Fig. 1K, (M) the best indica/japonicaclassification in 7D set, (N) the ROCcurve for Fig. 1M.
Estimating the indica/japonica classification accuracy
Given the seven phenotypic variables yielding AUC = 1, the indica/japonicaclassification accuracy was estimated using the 280 indica/japonica accessions, for which 100 iterations of ten-fold cross-validations were implemented. As a result, the indica/japonicaclassification accuracy of 100% was obtained. This indicates that the seven phenotypes were certainly impactful for the fully accurate indica/japonicaclassification. Assuming that some portion of classification power might be derived from interactions between variables, the H-statistic was calculated for the purpose of estimating how much each variable generates the classification power in collaboration with other variables. Table 3 summarizes the resulting H-statistic values, indicating nearly no interactions between variables.
Table 3
H-statistic summary representing how much the indica/japonica classification power was derived from each predictor variable in collaboration with other predictor variables.
Phenotypic variable
H-statistic
Panicle number per plant
3.231598e−15
Seed number per panicle
2.233564e−15
Florets per panicle
4.127335e−16
Panicle fertility
3.167362e−16
Straighthead susceptibility
1.660925e−16
Blast resistance
1.994817e−16
Protein content
1.312777e−16
Figure 2 shows the dendrogram-based indica/japonicaclassification with the 280 indica/japonica accessions, in which the indica- and japonica-varietal clades were accurately divided, and the japonica accessions were further accurately divided into temperate japonica and tropical japonica (Fig. S1). This result shows that the dendrogram-based indica/japonicaclassification accuracy was 100%.
Figure 2
Dendrogram-based indica/japonica classification of 280 accessions.
Dendrogram-based indica/japonica classification of 280 accessions, obtained by using 36,901 SNPs.
Dendrogram-based indica/japonica classification of 280 accessions.
Dendrogram-based indica/japonicaclassification of 280 accessions, obtained by using 36,901 SNPs.
Applying the LRM to each minor subpopulation group
The LRM customized with parameters yielding AUC = 1 was used to investigate how it classifies the accessions in each minor subpopulation group (admixed, aromatic, aus) into indica and japonica. The customized LRM was as follows:where P(japonica|x1, …, x7) is the probability of an accession being japonica (>0.5) or indica (<0.5) given the predictor variables, x1,…, x7; x1 = the predictor variable for panicle number per plant; x2 = the predictor variable for seed number per panicle; x3 = the predictor variable for florets per panicle; x4 = the predictor variable for panicle fertility; x5 = the predictor variable for straighthead susceptibility; x6 = the predictor variable for blast resistance; x7 = the predictor variable for protein content.The absolute values for parameters in Eq. (3) in descending order are 50943.2 (-) for panicle fertility, 38008.7 (-) for florets per panicle, 35800.4 (+) for seed number per panicle, 5832.5 (-) for panicle number per plant, 491.6 (-) for straighthead susceptibility, 353.7 (+) for blast resistance and 214.6 (-) for protein content. These suggest that, when it comes to the indica/japonica separation power, the panicle fertility is most impactful, followed by florets per panicle, seed number per panicle, panicle number per plant, straighthead susceptibility, blast resistance and protein content.Because of missing phenotypic records, the size of each minor subpopulation group was reduced from 62 to 52 for the admixed group, 14 to 12 for the aromatic group and 57 to 26 for the aus group. Figure 3 shows that Eq. (3) split each subpopulation group into indica and japonica in ratios of 5:47 for the admixed group, 5:7 for the aromatic group and 18:8 for the aus group, respectively. Meanwhile, the dendrogram drawn with the whole accessions (413) shows two major clades (upper and lower), in which temperate japonica, tropical japonica and aromatic formed accurately separate groups within the upper clade (hereafter called japonica-varietal clade), while indica and aus formed accurately separate groups within the lower clade (hereafter called indica-varietal clade). The admixed accessions were spread across all subpopulation groups (Fig. S2). Figure 4 shows three Venn diagrams, each of which represents comparison between the LRM-based and dendrogram-based classifications for each minor subpopulation group; the agreements were 92.3% (48/52) in the admixed group, 58.3% (7/12) in the aromatic group and 69.2% (18/26) in the aus group.
Figure 3
LRM-based indica/japonica classification for each minor subpopulation group.
(A) LRM-based indica/japonica classification for the admixed group, (B) LRM-based indica/japonica classification for the aromatic group, (C) LRM-based indica/japonica classification for the aus group.
Figure 4
Comparison between the LRM-based classification and the dendrogram-based classification for each minor subpopulation group.
(A) Comparison between the LRM-based classification and dendrogram-based classification for the admixed group, (B) comparison between the LRM-based classification and dendrogram-based classification for the aromatic group, (C) comparison between the LRM-based classification and dendrogram-based classification for the aus group.
LRM-based indica/japonica classification for each minor subpopulation group.
(A) LRM-based indica/japonicaclassification for the admixed group, (B) LRM-based indica/japonicaclassification for the aromatic group, (C) LRM-based indica/japonicaclassification for the aus group.
Comparison between the LRM-based classification and the dendrogram-based classification for each minor subpopulation group.
(A) Comparison between the LRM-based classification and dendrogram-based classification for the admixed group, (B) comparison between the LRM-based classification and dendrogram-based classification for the aromatic group, (C) comparison between the LRM-based classification and dendrogram-based classification for the aus group.
Discussion
The indica and japonica in Oryza sativacan be classified based on phenotypic observations by humans. However, the classification by humanscan be subjective so that the classification results could sometimes be biased by an observer’s perception. Meanwhile, the indica/japonicaclassification by the LRM is thoroughly systematic and quantitative. In this aspect, the indica/japonicaclassification qualities by the LRM are expected to be comparable to or better than the qualities made by humans. In fact, the customized LRM (Eq. (3)) achieved the indica/japonicaclassification accuracy of 100% given the 280 indica/japonica accessions.Table 1 and Fig. 1 suggest that the more predictor variables there are, the stronger the indica/japonica separation power (AUC). This implies that the variation in a single trait is narrowly distinct between indica and japonica perhaps due to intensive genetic admixture by breeding over history (Zhao et al., 2010; Xu et al., 2012), and that the variation given multiple traits is substantially distinct between indica and japonica because multiple layers of the narrow effects collectively magnify differences between indica and japonica. In this study, the fully accurate indica/japonica separation power (AUC = 1) of the LRM was achieved with a set of seven phenotypic variables (Table 1 and Fig. 1). The H-statistic value of nearly zero for every phenotypic variable indicates that the indica/japonica separation power was not overestimated by unexpected synergetic effects between the phenotypic variables.Table 2 shows that the panicle-related traits, straighthead susceptibility and blast resistance frequently appeared across all sets. This may be related to previous knowledge that the variations in panicle characteristics, straighthead susceptibility and blast resistance are strongly associated with the indica and japonica differentiation: panicles are long and sparse in indica but short and dense in japonica (Bai et al., 2016); straighthead resistance and blast resistance are greater in indica than japonica (Yan et al., 2005; Jia et al., 2011).The customized LRM (Eq. (3)) was applied to each minor subpopulation group (aromatic, aus, admixed). Applying Eq. (3) to the aromatic and aus groups aimed to see how well the resulting classifications reflect their known evolutionary relationships with both indica and japonica: aromatic is intermediate but narrowly closer to japonica; aus is distinct from but closely related to indica (Garris et al., 2005; Thomson et al., 2007; Kovach et al., 2009; Huang et al., 2012; Schatz et al., 2014; Civan et al., 2015; McCouch et al., 2016; Chin et al., 2017). Regarding the aromatic accessions, the dendrogram-based classification formed the aromatic group distantly from the japonica group in the japonica-varietal clade, which is consistent with the previous knowledge that aromatic is narrowly close to japonica between indica and japonica. The agreement between the dendrogram-based and LRM-based classifications was 58.3% (7/12). The low agreement may be related to the subtle phenotypic similarity between aromatic and japonica (Garris et al., 2005). Regarding the aus accessions, the dendrogram-based classification assigned all of them to the aus group in the indica-varietal clade, which was consistent with the previous knowledge that aus and indica are distinct within a close evolutionary relationship. The agreement of 69.2% (18/26) between the dendrogram-based and LRM-based classifications suggests that phenotypic variation between the aus and japonica groups overlaps to some extent. Perhaps, it may be because the sub-speciation of aus from indica, occurred in a geographically isolated area (Bangladesh, India) under short growing seasons and upland conditions, might have confounded its phenotypiccharacteristics (Garris et al., 2005; Londo et al., 2006). Regarding the admixed accessions, the dendrogram-based classification dispersed them across indica- and japonica-varietal clades. This indicated that the admixed group covered a wide spectrum of genomic properties. The admixed group was a collection of accessions with uncertain subpopulation membership due to intensive inter-subpopulation mating. This allows us to deduce that the admixed accessions may have a high potential to bridge between indica and japonica. The agreement of 92.3% (48/52) between the dendrogram-based and LRM-based classifications shows reliable classification ability of the LRM given Oryza sativa accessions with uncertain subpopulation membership.
Conclusion
This study showed that the indica/japonicaclassification based on phenotypes can be analyzed using the LRM. A set of phenotypes that can collectively generate the fully accurate indica/japonica separation power can be used for researching indica/japonica differentiation. For example, if a study aims to detect quantitative trait loci (QTL) associated with the indica/japonica differentiation, each phenotype that contributes to the fully accurate indica/japonica separation power can be used for genome-wide association studies (GWAS). Furthermore, research on the indica/japonica differentiation in Oryza sativa may benefit from the variation of genes responsible for the phenotypes that contribute to the fully accurate indica/japonica separation power.Click here for additional data file.Click here for additional data file.
Authors: Michael J Thomson; Endang M Septiningsih; Fatimah Suwardjo; Tri J Santoso; Tiur S Silitonga; Susan R McCouch Journal: Theor Appl Genet Date: 2006-11-29 Impact factor: 5.699
Authors: Keyan Zhao; Mark Wright; Jennifer Kimball; Georgia Eizenga; Anna McClung; Michael Kovach; Wricha Tyagi; Md Liakat Ali; Chih-Wei Tung; Andy Reynolds; Carlos D Bustamante; Susan R McCouch Journal: PLoS One Date: 2010-05-24 Impact factor: 3.240
Authors: Michael J Kovach; Mariafe N Calingacion; Melissa A Fitzgerald; Susan R McCouch Journal: Proc Natl Acad Sci U S A Date: 2009-08-17 Impact factor: 11.205
Authors: Keyan Zhao; Chih-Wei Tung; Georgia C Eizenga; Mark H Wright; M Liakat Ali; Adam H Price; Gareth J Norton; M Rafiqul Islam; Andy Reynolds; Jason Mezey; Anna M McClung; Carlos D Bustamante; Susan R McCouch Journal: Nat Commun Date: 2011-09-13 Impact factor: 14.919
Authors: Michael C Schatz; Lyza G Maron; Joshua C Stein; Alejandro Hernandez Wences; James Gurtowski; Eric Biggers; Hayan Lee; Melissa Kramer; Eric Antoniou; Elena Ghiban; Mark H Wright; Jer-ming Chia; Doreen Ware; Susan R McCouch; W Richard McCombie Journal: Genome Biol Date: 2014 Impact factor: 13.583
Authors: Susan R McCouch; Mark H Wright; Chih-Wei Tung; Lyza G Maron; Kenneth L McNally; Melissa Fitzgerald; Namrata Singh; Genevieve DeClerck; Francisco Agosto-Perez; Pavel Korniliev; Anthony J Greenberg; Ma Elizabeth B Naredo; Sheila Mae Q Mercado; Sandra E Harrington; Yuxin Shi; Darcy A Branchini; Paula R Kuser-Falcão; Hei Leung; Kowaru Ebana; Masahiro Yano; Georgia Eizenga; Anna McClung; Jason Mezey Journal: Nat Commun Date: 2016-02-04 Impact factor: 14.919