Devotha G Nyambo1, Edith T Luhanga1, Zaipuna O Yonah1, Fidalis D N Mujibi2.
Abstract
The heterogeneity of smallholder dairy production systems complicates service provision, information sharing, and dissemination of new technologies, especially those needed to maximize productivity and profitability. To obtain homogeneous groups within which interventions can be made, it is necessary to define clusters of farmers who undertake similar management activities. This paper explores the robustness of production cluster definition, using several unsupervised learning algorithms to assess the best approach to defining clusters. Data were collected from 8179 smallholder dairy farms in Ethiopia and Tanzania. From a total of 500 variables, the 35 variables used to define production clusters and household membership of these clusters were selected by Principal Component Analysis and domain expert knowledge. Three clustering algorithms, K-means, fuzzy, and Self-Organizing Maps (SOM), were compared in terms of their grouping consistency and prediction accuracy. The model with the least household reallocation between clusters for training and testing data was deemed the most robust. Prediction accuracy was obtained by fitting a fixed-effects model that included production cluster to milk yield, milk sales, and choice of breeding method. Results indicated that, for the Ethiopian dataset, clusters derived from the fuzzy algorithm had the highest predictive power (77% for milk yield and 48% for milk sales), while for the Tanzania data, clusters derived from Self-Organizing Maps were the best performing. The average cluster membership reallocation was 15%, 12%, and 34% for K-means, SOM, and fuzzy, respectively, for households in Ethiopia. Based on the divergent performance of the algorithms evaluated, it is evident that, despite similar information being available for the study populations, the uniqueness of the data from each country had an overriding influence on cluster robustness and prediction accuracy.
The results obtained in this study demonstrate the difficulty of generalizing model application and use across countries and production systems, despite seemingly similar information being collected.
Year: 2019 PMID: 30718979 PMCID: PMC6334318 DOI: 10.1155/2019/1020521
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
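
The abstract's robustness check compares cluster membership between training and testing data. A minimal sketch of that idea follows, assuming plain Lloyd-style K-means and nearest-centroid assignment of held-out households. The record does not state the authors' implementation, and the toy two-dimensional points and all function names here are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:  # assignments have stabilised
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical, already-scaled households (not data from the study).
train = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10), (10, 11), (11, 10), (11, 11)]
test = [(1.5, 1.5), (10.5, 10.5)]

centroids, train_labels = kmeans(train, k=2)
# Held-out households are assigned to the nearest training centroid.
test_labels = [nearest(p, centroids) for p in test]
print(train_labels, test_labels)
```

Cluster numbers produced this way are arbitrary, so any comparison between two separate fits needs the labels matched first; assigning test households to the training centroids, as here, sidesteps that step.
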
Features used in cluster analysis.
| Feature | Type | Range |
|---|---|---|
| Exclusive grazing in dry season | Boolean | 0 (no) or 1 (yes) |
| Exclusive grazing in rainy season | Boolean | 0 (no) or 1 (yes) |
| Mainly grazing in dry season | Boolean | 0 (no) or 1 (yes) |
| Mainly grazing in rainy season | Boolean | 0 (no) or 1 (yes) |
| Mainly stall feed in dry season | Boolean | 0 (no) or 1 (yes) |
| Mainly stall feed in rainy season | Boolean | 0 (no) or 1 (yes) |
| Use of concentrates | Discrete | 1 – 12 (months) |
| Watering frequency | Discrete | 0 – 4 |
| Distance to water source | Continuous | 0 – 15 |
| Total land holding | Continuous | 0 – 100 |
| Area under cash cropping | Continuous | 0 – 10 |
| Area under food cropping | Continuous | 0 – 83.25 |
| Area under fodder production | Continuous | 0 – 80 |
| Area under grazing | Continuous | 0 – 13 |
| Number of employees | Discrete | 1 – 10 |
| Number of casual labors | Discrete | 1 – 10 |
| Vaccination frequency | Discrete | 0 – 6 |
| Deworming frequency | Discrete | 0 – 5 |
| Self-deworming service | Boolean | 0 (no) or 1 (yes) |
| Membership in farmer groups | Discrete | 0 – 5 |
| Experience in dairy farming | Discrete | 1 – 50 |
| Years of schooling | Discrete | 0 – 21 |
| Preferred breeding method | Boolean | 0 (bull) or 1 (artificial insemination) |
| Distance to breeding service provider | Continuous | 0 – 100 |
| Frequency of visit by extension officer | Discrete | 1 – 54 |
| Herd size | Discrete | 1 – 50 |
| Number of milking cows | Discrete | 1 – 20 |
| Number of exotic cattle | Discrete | 1 – 48 |
| Number of sheep | Discrete | 1 – 80 |
| Peak milk production for the best cow | Continuous | 1 – 40 |
| Amount of milk sold in bulk | Continuous | 1 – 100 |
| Liters of milk sold | Continuous | 1 – 100 |
| Distance to milk buyers | Continuous | 1 – 37 |
| Total crop sale | Continuous | 0 – 21000 (Birr), 0 – 950000 (Tsh) |
| Distance to market | Continuous | 1 – 8 |
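
The features listed here span very different ranges (0 or 1 for the Boolean items, up to 21000 Birr for crop sales), and the table notes later in this record mention data scaled to unit variance and a mean of zero. A minimal sketch of that scaling for a single feature column follows. The population standard deviation and the sample values are assumptions, since the record does not say which estimator was used.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical values for two of the listed features (not data from the study).
herd_size = [2, 4, 4, 6, 9]
crop_sale = [0, 500, 1500, 3000, 21000]

scaled_herd = standardize(herd_size)
scaled_sale = standardize(crop_sale)
print([round(v, 2) for v in scaled_herd])
print([round(v, 2) for v in scaled_sale])
```

After scaling, both columns contribute on a comparable scale to any Euclidean distance, which is what distance-based methods such as K-means rely on.
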
Figure 1. Graph showing four optimal clusters for the Ethiopia dataset.
Cluster densities (number of households allocated to the cluster) for the Ethiopia dataset.
| Cluster | K-means | SOM | Fuzzy |
|---|---|---|---|
| 1 | 342 | 487 | 2673 |
| 2 | 875 | 2084 | 411 |
| 3 | 2689 | 1217 | 1309 |
| 4 | 487 | 605 | |
Figure 2. Graph showing six optimal clusters for the Tanzania dataset.
Cluster densities (number of households allocated to the cluster) for the Tanzania dataset.
| Cluster | K-means | SOM | Fuzzy |
|---|---|---|---|
| 1 | 811 | 1180 | 2506 |
| 2 | 452 | 952 | 811 |
| 3 | 374 | 203 | |
| 4 | 616 | 295 | |
| 5 | 372 | 516 | |
| 6 | 692 | 171 | |
Figure 3. Household allocation to four clusters using the K-means model for Ethiopia dairy farmers.
Figure 4. Node counts for household clusters derived using the SOM model for Ethiopia (a) and dendrogram for super clusters (b).
Figure 5. Household allocation into three clusters using the fuzzy model for Ethiopia dairy farmers.
Cluster composition parameters (intercluster adhesion and intracluster cohesion) for Ethiopian households.
| Model | No. of clusters | Within sum of squares | Mean distance from central nodes | Mean silhouette separation |
|---|---|---|---|---|
| K-means | 4 | 20758 | 0.74 | 0.66 |
| SOM | 4 | 23178 | 0.92 | 0.51 |
| Fuzzy | 3 | 21655 | 0.89 | 0.56 |
Cluster composition parameters (intercluster adhesion and intracluster cohesion) for Tanzania households.
| Model | No. of clusters | Within sum of squares | Mean distance from central nodes | Mean silhouette separation |
|---|---|---|---|---|
| K-means | 6 | 12628 | 2.1 | 0.66 |
| SOM | 6 | 11772 | 1.7 | 0.64 |
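
Two of the quantities reported in these cluster composition tables, the within sum of squares and the mean silhouette, can be computed directly from points and their cluster labels. The sketch below uses the standard textbook definitions with Euclidean distance; whether the authors used the same distance or software is not stated in this record, and the four toy points are hypothetical.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def within_sum_of_squares(points, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        center = centroid(members)
        total += sum(dist(p, center) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Mean silhouette width; needs at least two clusters and non-identical points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical points in two tight, well-separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]

wss = within_sum_of_squares(points, labels)
sil = mean_silhouette(points, labels)
print(wss, round(sil, 3))
```

A lower within sum of squares indicates tighter clusters, and a mean silhouette closer to 1 indicates better separation, which is how the two columns are usually read together.
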
Figure 6. Household allocation into six clusters using the K-means model for Tanzania dairy farmers.
Figure 7. Node counts for household clusters derived using the SOM model for Tanzania (a) and dendrogram for super clusters (b).
Cluster model parameters and ranking accuracy (membership reallocation) based on Spearman rank correlation for the Ethiopia dataset.
| Model | AIC | Residual deviance | Ranking accuracy (r) |
|---|---|---|---|
| K-means | 102 | 2.7e-2 | 0.85 |
| SOM | 102 | 2.8e-2 | -0.88 |
| Fuzzy | 68.09 | 9.35e-2 | 0.68 |
Cluster model parameters and ranking accuracy (membership reallocation) based on Spearman rank correlation for the Tanzania dataset.
| Model | AIC | Residual deviance | Ranking accuracy (r) |
|---|---|---|---|
| K-means | 200 | 0.001 | -0.21 |
| SOM | 200 | 0.006 | 0.39 |
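
The ranking accuracy in these tables is described as a Spearman rank correlation. A minimal sketch of that statistic follows, computed as the Pearson correlation of average ranks. Exactly which quantities the authors ranked is not recoverable from this record, so the two input vectors below are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical rankings of five households in a training and a testing run.
training_rank = [1, 2, 3, 4, 5]
testing_rank = [2, 1, 4, 3, 5]

rho = spearman(training_rank, testing_rank)
print(round(rho, 2))
```

Values near 1 or -1 mean the two runs order households almost identically or in reverse, while values near 0 mean the ordering is largely reshuffled.
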
Estimates of prediction accuracy for models fitting cluster of production for milk yield, milk sales, and choice of breeding method in Ethiopia.
Values are accuracy of prediction (r), 0 ≤ p ≤ 1.
| Algorithm | Milk yield | Milk sold | Preferred breeding method |
|---|---|---|---|
| K-means | 0.68 | 0.40 | 0.54 |
| SOM | 0.66 | 0.38 | 0.54 |
| Fuzzy | 0.77 | 0.48 | 0.55 |
Estimates of prediction accuracy for models fitting cluster of production for milk yield, milk sales, and choice of breeding method in Tanzania.
Values are accuracy of prediction (r), 0 ≤ p ≤ 1.
| Algorithm/Response variable | | | |
|---|---|---|---|
| K-means | | | |
| SOM | | | |
Proportion of variance accounted for by cluster of production in Ethiopia.
| Fitted model | Total variance | Residual variance | −2 log likelihood | P value | Variance accounted for by cluster |
|---|---|---|---|---|---|
| Model with cluster | 1.015 | 0.239 | 1867.4 | <0.00001 | 73% |
| Model without cluster | | 0.977 | 3718.4 | | |
| Model with cluster | 0.988 | 0.222 | 1770.1 | <0.00001 | 54% |
| Model without cluster | | 0.76 | 3388.6 | | |
| | | | | | |
| Model with cluster | 1.015 | 0.283 | 2091.8 | <0.00001 | 68% |
| Model without cluster | | 0.977 | 3718.4 | | |
| Model with cluster | 0.988 | 0.258 | 1969.8 | <0.00001 | 51% |
| Model without cluster | | 0.76 | 3388.6 | | |
| | | | | | |
| Model with cluster | 1.015 | 0.074 | 337 | <0.00001 | 89% |
| Model without cluster | | 0.977 | 3718.4 | | |
| Model with cluster | 0.988 | 0.073 | 319.4 | <0.00001 | 70% |
| Model without cluster | | 0.76 | 3388.6 | | |
∗Data scaled to have unit variance and mean of zero.
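
The record does not state how the "variance accounted for by cluster" column was derived. The reported percentages for Ethiopia are, however, reproduced by taking the drop in residual variance when cluster is added to the model and dividing it by the total variance. The sketch below checks that arithmetic, reading the first value in each "Model without cluster" row as that model's residual variance (an assumption about the flattened table); the function name is hypothetical.

```python
def variance_share(total, residual_with, residual_without):
    """Drop in residual variance from adding cluster, as a share of total variance."""
    return (residual_without - residual_with) / total


# (total variance, residual with cluster, residual without cluster), in table order.
ethiopia_rows = [
    (1.015, 0.239, 0.977),
    (0.988, 0.222, 0.76),
    (1.015, 0.283, 0.977),
    (0.988, 0.258, 0.76),
    (1.015, 0.074, 0.977),
    (0.988, 0.073, 0.76),
]

percentages = [round(100 * variance_share(*row)) for row in ethiopia_rows]
print(percentages)  # [73, 54, 68, 51, 89, 70]
```

For the first row, for example, (0.977 − 0.239) / 1.015 ≈ 0.727, which rounds to the 73% shown in the table.
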
Proportion of variances accounted for by cluster of production in Tanzania.
| Fitted model | Total variance | Residual variance | −2 log likelihood | P value | Variance accounted for by cluster |
|---|---|---|---|---|---|
| Model with cluster | 1.076 | 0.0027 | -2981 | <0.00001 | 71% |
| Model without cluster | | 0.771 | 2584.2 | | |
| Model with cluster | 1.09 | 0.018 | -1084.3 | <0.00001 | 65% |
| Model without cluster | | 0.723 | 2520 | | |
| | | | | | |
| Model with cluster | 1.076 | 0.294 | 1633 | <0.00001 | 44% |
| Model without cluster | | 0.771 | 2584.2 | | |
| Model with cluster | 1.09 | 0.228 | 1381.6 | <0.00001 | 45% |
| Model without cluster | | 0.723 | 2520.2 | | |
∗ indicates data scaled to have unit variance and mean of zero.