A machine learning approach using partitioning around medoids clustering and random forest classification to model groups of farms in regard to production parameters and bulk tank milk antibody status of two major internal parasites in dairy cows.

Andreas W Oehm¹, Andrea Springer², Daniela Jordan², Christina Strube², Gabriela Knubben-Schweizer¹, Katharina Charlotte Jensen³,⁴, Yury Zablotski¹.

Abstract

Fasciola hepatica and Ostertagia ostertagi are internal parasites of cattle compromising physiology, productivity, and well-being. Parasites are complex in their effect on hosts, sometimes making it difficult to identify clear directions of associations between infection and production parameters. Therefore, unsupervised approaches not assuming a structure reduce the risk of introducing bias to the analysis. They may provide insights which cannot be obtained with conventional, supervised methodology. An unsupervised, exploratory cluster analysis approach using the k-mode algorithm and partitioning around medoids detected two distinct clusters in a cross-sectional data set of milk yield, milk fat content, milk protein content as well as F. hepatica or O. ostertagi bulk tank milk antibody status from 606 dairy farms in three structurally different dairying regions in Germany. Parasite-positive farms grouped together with their respective production parameters to form separate clusters. A random forests algorithm characterised clusters with regard to external variables. Across all study regions, co-infections with F. hepatica or O. ostertagi, respectively, farming type, and pasture access appeared to be the most important factors discriminating clusters (i.e. farms). Furthermore, farm level lameness prevalence, herd size, BCS, stage of lactation, and somatic cell count were relevant criteria distinguishing clusters. This study is among the first to apply a cluster analysis approach in this context and potentially the first to implement a k-medoids algorithm and partitioning around medoids in the veterinary field. The results demonstrated that biologically relevant patterns of parasite status and milk parameters exist between farms positive for F. hepatica or O. ostertagi, respectively, and negative farms. 
Moreover, the machine learning approach confirmed results of previous work and shed further light on the complex setting of associations between parasitic diseases, milk yield and milk constituents, and management practices.

Substances:
Antibodies, Helminth

Year: 2022 PMID: 35816512 PMCID: PMC9273072 DOI: 10.1371/journal.pone.0271413

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

Parasitic diseases often represent a complex system with various consequences for the productivity and physiological integrity of the host. Globally, parasitic diseases rank among the most significant infectious diseases in ruminant livestock species [1, 2]. Fasciola hepatica and Ostertagia ostertagi represent the most abundant helminth species in dairy cattle around the world [3-5]. Infections in adult dairy cows have been associated with decreased animal health, impaired well-being, and compromised economic viability [6, 7]. Cows experience a reduction in milk yield, a decline in body condition, and poor reproductive performance [7-9]. For bovine fasciolosis, Schweizer et al. [10] have estimated financial losses of 299 € per infected cow in Switzerland. Furthermore, changes in milk composition such as lower milk fat and milk protein content have been linked to parasitic infections [11, 12]. Parasitic infections are complex: they involve a large set of relevant factors and manifold associations with the physiological integrity, health, and productivity of livestock, so it is often not possible to determine clearly which variable is the outcome and which is the exposure. Cluster analysis is an unsupervised, heuristic, exploratory approach that identifies underlying patterns within the data and sorts the most similar observations into clusters that share common characteristics [13-15]. The basic idea is to aggregate data points within a cluster that are as similar as possible, whereas patterns between clusters are as different as possible. Unsupervised methods reduce subjective influence and show whether patterns are contained within the data and, if so, of what kind. Such techniques may deliver insights that cannot be obtained with a traditional, supervised modelling approach. The objectives of the present study were (1) to explore if different clusters can be identified for farm-level bulk tank milk F. hepatica as well as O. ostertagi antibody status and milk parameters, and (2) to characterise potentially clustered farms and compare them in terms of external factors. We assumed that important associations exist among production parameters and antibody status that would naturally group farms in an unsupervised cluster analysis without a priori determination of target or predictor variables. These farms could subsequently be differentiated based on further criteria. We furthermore (3) intended to introduce a modelling technique that has as yet scarcely been implemented in the veterinary field and which may represent a promising perspective for future investigations on complex biological systems. To our knowledge, this is the first study implementing an unsupervised machine learning technique in this context and the first to use k-mode clustering and partitioning around medoids in veterinary epidemiology.

Materials and methods

Study farms

In an extensive, descriptive and cross-sectional study on dairy farms across Germany [16], data on housing conditions and animal health were collected. Dairy farms were located in three structurally and geographically different dairying regions in Germany. Within the three study regions North (federal states of Lower Saxony and Schleswig-Holstein), East (federal states of Thuringia, Saxony-Anhalt, Brandenburg, and Mecklenburg-Western Pomerania) and South (federal state of Bavaria), 765 farms (North: 253; East: 252; South: 260) with a total number of 86,304 dairy cows (North: 24,980 cows; East: 49,936 cows; South: 11,388 cows) were visited. The farm selection process and sample size calculation are elaborated on in [16, 17]. Briefly, different scenarios were calculated given a power of 80% and a level of significance of 5% in order to calculate an optimal and feasible sample size. Given these scenarios and considering feasibility, 250 farms were determined to be visited per study region. Farms were selected randomly, stratified by administrative district and herd size within the federal state and study region. The national animal information database (HIT) as well as farm data from the Milchprüfring Bayern e.V. provided information for sampling. Farms were randomly drawn from these databases using an automated approach. A response rate of 30–40% was expected. Within each study region, a total of 1,250 farms, i.e. five times as many farms as required for the study, were drawn from the underlying population so that a response rate as low as 20% would still yield the required number of farms. Region-specific herd size cut-off values were determined in order to obtain a realistic distribution of herd sizes within the study population and due to structural differences in dairy farming in Germany [18]. Selected farms received an invitation letter to participate, containing information on the study procedure.
The farm managers were asked to contact the regional study team on a voluntary basis. If they agreed to participate, they had to give their written consent for participation and data inspection. All farm-specific information was handled in accordance with the principles of the German and European data protection legislation. Study teams visited farms once between December 2016 and August 2019.
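The relationship between the 250 farms required per region, the 1,250 farms drawn, and the response rates mentioned above can be checked with a short calculation. The following is a minimal sketch; the function names are hypothetical and are not part of the study's sampling procedure, which is described in [16, 17].

```python
import math

def farms_to_draw(required: int, min_response_pct: int) -> int:
    """Invitations needed so that `required` farms remain if only
    `min_response_pct` percent of the invited farms take part."""
    return math.ceil(required * 100 / min_response_pct)

def expected_participants(drawn: int, response_pct: int) -> float:
    """Expected number of participating farms at a given response rate."""
    return drawn * response_pct / 100

drawn = farms_to_draw(required=250, min_response_pct=20)
print(drawn)                              # farms drawn per region
print(expected_participants(drawn, 30))   # lower end of the expected response
print(expected_participants(drawn, 40))   # upper end of the expected response
```

Under the expected response rate of 30–40%, drawing 1,250 farms would yield more volunteers than the 250 farms needed, which leaves a margin for lower participation.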

On-farm data collection

Data collected during the farm visits were recorded via data entry forms and later manually transferred to a central SQL database. The individual ear tag number was recorded for each cow. All lactating and dry animals present on the day of the farm visit were subjected to body condition and locomotion scoring. Body condition score (BCS) was assessed following the 5-point system with 0.25 increments presented by Edmonson et al. [19]. A five-point locomotion scoring approach was implemented [20] to record abnormalities in posture and gait in loosely housed cows. In tie stall operations, cows underwent stall lameness scoring to document weight shifting between the rear limbs, sparing of a limb while standing, unequal weight bearing when stepping from side to side, and standing on the edge of the curb [21]. During the study period, a bulk tank milk (BTM) sample was collected from the central bulk tank on each farm by the farm manager to be analyzed for F. hepatica and O. ostertagi antibodies. To improve comparability across farms, farm managers were asked to collect the BTM sample towards the end of the grazing season, i.e. between August and October. Nevertheless, BTM samples taken in November (North: n = 3; East: n = 4; South: n = 1) were also included in the analyses. Data on milk yield (in kg), milk fat content (in %), milk protein content (in %), somatic cell count (SCC, in cells/ml), parity, and days in milk (DIM) were retrieved from HIT and the national milk recording system (DHI). In this context, data on milk yield, milk fat content, milk protein content, and SCC were available for up to 12 months prior to the farm visit. Information on pasture access and farming type (conventional vs. organic) was obtained in a personal interview with the farm manager during the farm visit and recorded via questionnaire.
The farms were assigned to either ‘tie stall operation’, ‘free stall operation’ or ‘other’ if 80% of cows were housed in one of the two husbandry systems or another type of housing at the time of the visit.

BTM F. hepatica and O. ostertagi antibody status

Exposure to F. hepatica and O. ostertagi on farm level was assessed by BTM antibodies determined with the IDEXX Fasciolosis Verification Test (IDEXX GmbH) and the Svanovir O. ostertagi–Ab ELISA (Boehringer Ingelheim) in a previous study [22]. As in the previous study and in accordance with the manufacturer’s recommendation, F. hepatica ELISA results of S/P > 30% were considered seropositive, and for O. ostertagi ELISA results the threshold was set to ≥ 0.5 OD indicating herds likely to suffer from a negative impact on herd milk yield.
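The two cut-offs described above translate into simple classification rules. The following is a minimal sketch; the function names and the example readings are hypothetical and serve only to illustrate how the thresholds (S/P > 30% for F. hepatica, ≥ 0.5 OD for O. ostertagi) are applied.

```python
def fasciola_seropositive(sp_percent: float) -> bool:
    """F. hepatica BTM ELISA: an S/P value above 30% counts as seropositive."""
    return sp_percent > 30.0

def ostertagia_seropositive(od: float) -> bool:
    """O. ostertagi BTM ELISA: a reading of 0.5 OD or more counts as positive."""
    return od >= 0.5

# Hypothetical bulk tank milk readings for three farms: (S/P in %, OD).
readings = {"farm_a": (12.0, 0.31), "farm_b": (30.0, 0.50), "farm_c": (85.5, 0.92)}
for farm, (sp, od) in readings.items():
    print(farm, fasciola_seropositive(sp), ostertagia_seropositive(od))
```

Note the asymmetry at the boundary: a farm with exactly 30% S/P is negative for F. hepatica, while a farm with exactly 0.5 OD is positive for O. ostertagi.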

Data editing and preparation

The central SQL database automatically checked the plausibility of the data based on previously determined thresholds. Furthermore, two of the co-authors individually assessed the distribution and values of each variable, also in regard to other variables. If implausible values occurred, they were assessed both within the database (to detect potential inconsistencies during data export) as well as within the original paper-based entry forms (to identify incorrect transfer of data from written records to the database). The statistical software R versions 4.0.3 and 4.2.0 [23] and R Studio [24] were used for all statistical analyses. A list of all R packages used in the current work is provided in S1 Table. For loosely housed cows, a cow was regarded as lame with a locomotion score ≥ 3 in accordance with previous work [25]. Tied cows were classified as lame if two out of the four behavioral patterns of the stall lameness scoring were displayed during a 90-second period of observation [21, 26, 27]. As information on milk antibody status was retrieved on farm level, animal-level data needed to be raised to farm level. This applied to the variables BCS, DIM, lameness, milk yield, milk fat content, milk protein content, somatic cell count, and parity. The farm level prevalence of lameness was calculated for each farm. It is important to note that information on milk yield, milk fat content, and milk protein content was available for each cow for a period of up to 12 months prior to the farm visit. Therefore, individual animals may have had between one and twelve different values for each of these three parameters. To obtain a value that best reflected each of the three factors in the individual animal, a Bayesian non-parametric bootstrap approach with 1,000 resamples with replacement was implemented, which yielded Bayesian medians for all three variables. This allowed for a close reflection of the underlying values by the created estimates.
Subsequently, a second round of bootstrapping with 1,000 resamples with replacement was conducted in order to transfer the animal-level information to farm-level median values which best reflected the on-farm situation in regard to the three factors. Regarding the variables BCS, DIM, and parity, a single value was present for each individual animal. Hence, the information could be straightforwardly raised to farm level using the aforementioned Bayesian non-parametric bootstrap. This yielded a farm-level median value for each of BCS, DIM, and parity that most plausibly reflected the situation on each of the evaluated farms. A binary variable (Fasciola/Ostertagia seropositive/seronegative) was created based on the aforementioned thresholds of the BTM ELISAs.
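The two-stage aggregation described above (cow-level median from repeated test-day values, then farm-level median across cows) can be sketched as follows. This is an illustrative Python sketch of a Bayesian bootstrap in the sense of Rubin, using flat Dirichlet weights; the exact procedure and R packages used in the study are listed in S1 Table and may differ in detail. All function names and the example milk yields are hypothetical.

```python
import random
import statistics

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]

def bayesian_bootstrap_median(values, n_resamples=1000, rng=None):
    """Median of the posterior draws of the median under a Bayesian bootstrap."""
    if rng is None:
        rng = random.Random(0)
    draws = []
    for _ in range(n_resamples):
        # Independent Gamma(1, 1) draws are proportional to flat Dirichlet
        # weights; weighted_median only needs the weights up to a constant.
        weights = [rng.gammavariate(1.0, 1.0) for _ in values]
        draws.append(weighted_median(values, weights))
    return statistics.median(draws)

# Hypothetical test-day milk yields (kg) for three cows on one farm.
farm = {
    "cow_a": [24.1, 26.3, 25.0, 27.8],
    "cow_b": [30.2, 29.5, 31.0],
    "cow_c": [22.4],
}
rng = random.Random(42)
# Stage 1: one value per cow. Stage 2: one value per farm.
cow_medians = [bayesian_bootstrap_median(v, rng=rng) for v in farm.values()]
farm_median = bayesian_bootstrap_median(cow_medians, rng=rng)
print(cow_medians, farm_median)
```

For variables with a single value per cow (BCS, DIM, parity), only the second stage would be needed.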

Cluster analysis

Farms with missing values were excluded from the analysis and a complete cases data set was generated. In addition to the binary variable ‘antibody status’ (Fasciola or Ostertagia seropositive/seronegative), each cluster analysis included the input variables milk yield, milk fat content, and milk protein content. Clustering was carried out separately for each of the study regions and each of the parasites. Prior to cluster analysis, a distance measurement reflecting proximity or similarity across individual observations was required. As the input variables of the cluster analysis were of mixed nature (continuous variables milk yield, milk fat content, and milk protein content as well as the categorical variable Fasciola or Ostertagia BTM status), Gower’s distance, a common distance metric for a mix of categorical and continuous values first proposed in 1971 [28, 29], was computed as the average of partial dissimilarities across data points using the daisy() function from the cluster package [30]. The optimal number of clusters in the present study was determined via the silhouette method. The silhouette analysis evaluates the separation between resulting clusters, gives an indication of proximity between data points within a cluster and those in neighbouring clusters, and provides the optimal number of clusters [31]. The k-medoids algorithm applying partitioning around medoids (PAM) [32] was then implemented to infer similarities/dissimilarities from the Gower’s distance matrix and to identify clusters of milk parameters and seropositivity/seronegativity for F. hepatica or O. ostertagi, respectively. Compared with the popular k-means algorithm, which divides observations into k clusters, identifies k centroids (= means as the center of a cluster), and assigns each observation to the cluster with the nearest mean [13, 14, 33], PAM replaces means with medoids, i.e. representative observations of a cluster [34], since centroids as well as the Euclidean distance are not available for mixed data [35-37]. Each data point is thus allocated to the cluster whose medoid it is most similar to.
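The pipeline described above (Gower's distance, partitioning around medoids, silhouette-based choice of the number of clusters) can be sketched in a few lines. This is a simplified Python illustration, not the study's R code: the PAM variant below uses a greedy build followed by plain swaps, and the seven farms are made-up values (milk yield, fat %, protein %, antibody status).

```python
from itertools import combinations

def gower_matrix(rows, is_categorical):
    """Average of per-variable partial dissimilarities (range-scaled or 0/1)."""
    n, p = len(rows), len(is_categorical)
    ranges = [None if is_categorical[j] else
              (max(r[j] for r in rows) - min(r[j] for r in rows)) or 1.0
              for j in range(p)]
    dist = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        total = 0.0
        for j in range(p):
            if is_categorical[j]:
                total += 0.0 if rows[a][j] == rows[b][j] else 1.0
            else:
                total += abs(rows[a][j] - rows[b][j]) / ranges[j]
        dist[a][b] = dist[b][a] = total / p
    return dist

def total_cost(dist, medoids):
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Greedy build, then swap medoids with non-medoids while the cost drops."""
    n, medoids = len(dist), []
    for _ in range(k):
        medoids.append(min((i for i in range(n) if i not in medoids),
                           key=lambda i: total_cost(dist, medoids + [i])))
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids) - 1e-12:
                    medoids, improved = trial, True
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels

def mean_silhouette(dist, labels):
    n, widths = len(dist), []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # singleton clusters get width 0 by convention
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
                for c in set(labels) if c != labels[i])
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n

farms = [(26.4, 3.76, 3.17, "neg"), (27.1, 3.70, 3.15, "neg"),
         (25.9, 3.81, 3.20, "neg"), (26.8, 3.74, 3.18, "neg"),
         (25.2, 3.92, 3.26, "pos"), (24.8, 3.95, 3.28, "pos"),
         (25.5, 3.88, 3.24, "pos")]
dist = gower_matrix(farms, is_categorical=[False, False, False, True])
scores = {k: mean_silhouette(dist, pam(dist, k)[1]) for k in (2, 3)}
best_k = max(scores, key=scores.get)
medoids, labels = pam(dist, best_k)
print(best_k, medoids, labels)
```

In this toy data set, a mismatch in antibody status alone contributes 1/4 to the Gower distance between two farms, so status weighs heavily in the resulting partition.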

Characterisation of clusters by means of random forest

External variables that had not been incorporated within the cluster analyses were used to compare clusters and to understand how these clusters and the farms within each of the clusters can be characterised and distinguished. The external variables included housing system, farm level lameness prevalence, farming type, herd size, pasture access, farm level somatic cell count, parity, DIM, sample month (month of BTM sample being submitted), sample year (year of BTM sample being submitted), visit month (month of farm visit), visit season (season of farm visit; spring: March–May; summer: June–August; autumn: September–November; winter: December–February), and visit year (year of farm visit). The biological reasoning behind these factors was expressed by a network structure (S1 Fig) created with the free software DAGitty (http://www.dagitty.net). Variables were merged to the clustering data set. Farm level lameness prevalence was transformed into a categorical variable based on the variable’s distribution within regions North (low prevalence < 14.72%, medium prevalence 14.72–34.74%, high prevalence > 34.74%), East (low prevalence < 30.91%, medium prevalence 30.91–47.92%, high prevalence > 47.92%), and South (low prevalence < 15.10%, medium prevalence 15.10–33.33%, high prevalence > 33.33%). The number of cows housed on each farm was sorted into three categories to reflect herd size within study region: North (small < 51.50 cows; medium 51.50–115.50 cows; large > 115.50 cows), East (small < 129.00 cows; medium 129.00–418.00 cows; large > 418.00 cows), South (small < 27.00 cows; medium 27.00–59.00 cows; large > 59.00 cows). The R package randomForest was used to implement Breiman’s random forest algorithm for classification [38]. The mean decrease in accuracy, describing how much the accuracy of the model drops when the information carried by a single variable is removed by permuting its values, was used as an indicator of variable importance.
Hence, the variable with the highest importance gives the best prediction and contributes the most to the model fit and predictions compared with lower ranking variables [39]. This allowed clusters 1 and 2 of each cluster analysis to be compared with respect to all external variables. ROC curves were generated for each random forest procedure and are provided in S2 Fig. The R package rfPermute [40] was applied to perform a permutation test providing estimated p-values for the importance metric of the random forest.
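The idea behind a permutation-based mean decrease in accuracy can be illustrated without a full random forest. The sketch below is a simplified Python stand-in, not the randomForest or rfPermute implementation: it measures how much the accuracy of an arbitrary classifier drops when one variable is shuffled. The one-rule classifier, the variable names, and the eight farms are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the classifier returns the true label."""
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=50, seed=0):
    """Mean drop in accuracy after shuffling each variable in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importance = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy each row with only the current variable replaced.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importance[name] = sum(drops) / n_repeats
    return importance

# Hypothetical farms: cluster membership depends on pasture access only.
rows = [
    {"pasture": True, "herd_size": 40}, {"pasture": True, "herd_size": 120},
    {"pasture": True, "herd_size": 75}, {"pasture": True, "herd_size": 210},
    {"pasture": False, "herd_size": 55}, {"pasture": False, "herd_size": 130},
    {"pasture": False, "herd_size": 90}, {"pasture": False, "herd_size": 300},
]
labels = [2 if r["pasture"] else 1 for r in rows]

def predict(row):
    """Toy stand-in for a fitted model: a single rule on pasture access."""
    return 2 if row["pasture"] else 1

importance = permutation_importance(predict, rows, labels)
print(importance)
```

A variable the classifier never uses (here herd size) has an importance of exactly zero, whereas shuffling the variable that drives the prediction lowers accuracy. In a random forest, the same comparison is made per tree on out-of-bag observations and then averaged.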

Results

Descriptive results

Descriptive statistics for all study regions are summarised in Table 1.

Table 1

Descriptive statistics of the data across the three study regions (North = 191 farms, East = 201 farms, South = 214 farms).

	North					East					South
Variable	Mean	Range	1st Qu.	Median	3rd Qu.	Mean	Range	1st Qu.	Median	3rd Qu.	Mean	Range	1st Qu.	Median	3rd Qu.
Herd size¹	93.71	10.00–486.00	51.50	79.00	115.50	334.80	1.00–2,821.00	129.00	245.00	418.00	46.46	5.00–231.00	27.00	40.50	59.00
BCS²	3.05	2.54–4.58	2.90	3.03	3.16	3.33	2.34–3.98	3.18	3.36	3.50	3.68	2.71–4.26	3.54	3.74	3.85
Milk yield²,³	26.08	20.00–30.82	24.83	26.08	27.48	25.65	14.74–31.78	24.61	26.10	27.14	25.19	19.40–31.20	24.21	25.45	26.50
Milk fat²,⁴	3.81	3.37–4.32	3.69	3.78	3.92	3.68	3.09–4.58	3.61	3.67	3.74	3.96	3.53–4.35	3.88	3.94	4.35
Milk protein²,⁴	3.20	2.93–3.49	3.12	3.18	3.29	3.11	2.54–3.49	3.07	3.13	3.16	3.35	3.12–3.61	3.30	3.36	3.40
SCC²,⁵,⁶	217.40	122.90–663.90	183.90	221.70	239.20	228.27	27.64–365.94	199.49	222.58	254.36	205.20	106.2–421.8	167.0	197.7	233.0
Lameness⁷	25.82	0.00–76.92	14.72	23.08	34.74	38.51	0.00–77.63	30.91	39.00	47.92	25.59	0.00–76.47	15.10	23.57	33.33
DIM²	213.40	131.60–360.70	194.60	209.40	231.40	205.20	101.30–333.70	187.80	204.50	219.20	197.20	118.50–614.70	175.50	191.00	197.20
Parity²	2.85	1.79–5.10	2.60	2.77	3.08	2.80	1.86–5.15	2.54	2.72	2.97	2.90	1.85–4.76	2.57	2.85	3.11

1 number of lactating and dry cows

2 Bayesian median value per farm

3 in kg/day

4 in %

5 × 1,000

6 in number of cells/ml

7 Farm level prevalence in %

Out of the 765 farms visited throughout the study period, 723 farms were enrolled in DHI and 646 farms submitted BTM samples. After merging the datasets for analysis and removing rows containing missing values, a total of 606 dairy farms were included in the current work.

Region North

Region North included a total number of 17,898 dairy cows on 191 farms. On 162 farms (84.82%), cows were housed in free stall facilities compared with tie stall barns and other housing systems (e.g. deep bedded packs) on 29 farms (15.18%). Organic farming principles were complied with on nine farms (4.71%), whereas 182 farms (95.29%) were conventionally managed. Pasture access was granted on 151 operations (79.06%) and absent on 40 farms (20.94%). Thirty farms (15.71%) were seropositive for F. hepatica and 91 farms (47.64%) for O. ostertagi. Of these farms, 26 were seropositive for both F. hepatica and O. ostertagi, while four farms were seropositive only for F. hepatica and 65 farms only for O. ostertagi.

Region East

A total number of 201 farms with 24,980 cows was covered by the data set for region East. The vast majority of farms housed their cows in free stall facilities (n = 157, 78.11%). Cows had access to pasture on 107 operations (53.23%) and 20 farms (9.95%) were run on organic farming principles. Two BTM samples were seropositive for F. hepatica (1.00%) and 71 for O. ostertagi (35.32%).

Region South

The data set included 9,942 dairy cows on 214 farms. Out of the 214 farms, 55 housed their cows in tie stalls (25.70%) whereas 152 were free stall operations (71.03%) and seven were “other” (3.27%). A total of 181 farms (84.58%) were run conventionally compared with 33 organic farms (15.42%). Pasturing was carried out on 74 farms (34.58%). BTM samples from 51 farms (23.83%) were positive for F. hepatica and from 79 farms (36.92%) for O. ostertagi antibodies. Six farms were positive only for F. hepatica, and 34 farms only for O. ostertagi. Forty-five farms were positive for both parasites.

Cluster analyses

Clustering was performed separately for each region and each parasite. Cluster analyses for O. ostertagi were conducted in all study regions, while F. hepatica was evaluated in study regions North and South. In region East, only two farms were positive for F. hepatica; therefore, cluster analyses were not conducted for this parasite in region East. It is important to note that two clusters (a cluster 1 and a cluster 2) were present in each of the analyses. The cluster analyses were entirely mutually independent, e.g. cluster 1 in the F. hepatica analysis in region North does not represent the same group as cluster 1 in region South. For all cluster analyses, the silhouette method selected two clusters as the optimal number of clusters to group the data points, i.e. the farms, in alignment with the underlying data (S3 Fig).

F. hepatica cluster analyses

Fig 1 contains two cluster plots visualising the allocation of farms to the presented clusters for the F. hepatica analysis in study regions North and South.

Fig 1

Cluster plot of the partitioning around medoids clustering process for F. hepatica (regions North and South).

Region North (top): Two distinct clusters are displayed with 161 farms in cluster 1 (red) and 30 observations in cluster 2 (blue). Region South (bottom): Two clusters with 51 observations in cluster 1 (red) and 163 observations in cluster 2 (blue) naturally aggregated.

In region North, 161 farms (84.29%) were assigned to cluster 1 and 30 farms (15.71%) to cluster 2. All 30 farms in cluster 2 were positive for F. hepatica and grouped together with their respective production parameters. Descriptive cluster statistics for the F. hepatica cluster analyses in study regions North and South are displayed in Table 2 for continuous variables and in Table 3 for categorical variables.

Table 2

Descriptive cluster statistics of the F. hepatica cluster analysis in regions North and South (continuous variables).

	Region North					Region South
	Cluster 1					Cluster 1
Variable	Mean	Range	1st Qu.	Median	3rd Qu.	Mean	Range	1st Qu.	Median	3rd Qu.
BCS	3.06	2.54–4.58	2.93	3.05	3.16	3.55	2.71–4.07	3.34	3.64	3.78
SCC¹,²	218.90	122.90–663.90	184.30	211.80	242.70	210.40	112.50–393.40	175.20	205.50	224.50
Lame³	25.05	0.00–76.92	15.71	23.08	34.75	15.98	0.00–48.00	7.87	14.81	21.86
DIM	213.50	143.50–360.70	197.30	210.30	231.10	200.10	139.70–291.30	174.60	201.80	219.70
Parity	2.85	1.79–5.06	2.60	2.77	3.09	3.11	2.13–4.76	2.74	2.96	3.47
Milk yield⁴	26.24	20.77–30.31	24.95	26.36	27.60	24.59	21.12–29.23	23.00	24.48	25.60
Milk fat⁵	3.79	3.37–4.24	3.68	3.76	3.91	3.92	3.53–4.18	3.83	3.92	4.00
Milk protein⁵	3.19	2.93–3.49	3.11	3.17	3.27	3.33	3.18–3.61	3.27	3.32	3.39
	Cluster 2					Cluster 2
BCS	2.98	2.73–3.41	2.82	2.94	3.08	3.73	2.92–4.26	3.62	3.77	3.87
SCC¹,²	209.80	134.40–342.00	182.30	211.00	228.10	203.60	106.20–421.80	164.60	194.50	235.20
Lame³	24.56	0.00–73.13	13.40	22.57	33.77	28.60	0.00–76.47	19.59	26.32	36.24
DIM	212.70	131.60–360.60	183.80	202.70	233.70	196.30	118.50–614.70	176.70	189.90	210.80
Parity	2.85	1.90–4.15	2.61	2.77	3.07	2.84	1.85–4.50	2.54	2.81	3.07
Milk yield⁴	25.27	20.00–30.82	24.43	25.26	26.07	25.38	19.40–31.20	24.41	25.66	25.55
Milk fat⁵	3.92	3.62–4.32	3.81	3.90	4.04	3.97	3.69–4.35	3.88	3.96	4.05
Milk protein⁵	3.26	3.07–3.46	3.18	3.25	3.36	3.36	3.18–3.53	3.31	3.36	3.40

1 × 1,000

2 in number of cells/ml

3 Farm level prevalence in %

4 in kg

5 in %

Table 3

Descriptive statistics (observations per cluster) of the F. hepatica cluster analysis in study regions North and South (categorical variables).

Study Region Cluster (Counts [%])	Variable Categories Counts [%]
	Housing system			Pasture access		Farming type		Herd size¹			F. hepatica		O. ostertagi
	Tie Stall	Free Stall	Other	Yes	No	Conventional	Organic	Small	Medium	Large	Negative	Positive	Negative	Positive
Region North
Cluster 1 (161.00 [84.29])	7.00 [4.35]	137.00 [85.09]	17.00 [10.56]	121.00 [75.16]	40.00 [24.84]	156.00 [96.89]	5.00 [3.11]	41.00 [25.47]	79.00 [49.07]	41.00 [25.47]	161.00 [100]	0.00 [0.00]	96.00 [59.63]	65.00 [40.37]
Cluster 2 (30.00 [15.71])	1.00 [3.33]	25.00 [83.33]	4.00 [13.33]	30.00 [100.00]	0.00 [0.00]	26.00 [86.67]	4.00 [13.33]	7.00 [23.33]	16.00 [53.33]	7.00 [23.33]	0.00 [0.00]	30.00 [100.00]	4.00 [13.33]	26.00 [86.67]
Region South
Cluster 1 (51.00 [23.83])	22.00 [43.14]	28.00 [54.90]	1.00 [1.96]	47.00 [92.16]	4.00 [7.84]	30.00 [58.82]	21.00 [41.18]	23.00 [45.10]	26.00 [50.98]	2.00 [3.93]	0.00 [0.00]	51.00 [100.00]	6.00 [11.76]	45.00 [88.24]
Cluster 2 (163.00 [76.17])	33.00 [20.25]	124.00 [76.07]	6.00 [3.68]	136.00 [83.44]	27.00 [16.56]	151.00 [92.64]	12.00 [7.36]	29.00 [17.79]	84.00 [51.53]	50.00 [30.67]	163.00 [100.00]	0.00 [0.00]	129.00 [79.14]	34.00 [20.86]

1 number of cows present on farm

region North: small < 51.50 cows, medium 51.50–115.50 cows, large > 115.50 cows

region South: small < 27 cows, medium 27–59 cows, large > 59 cows

In region South, cluster 1 included 51 farms (23.83%) compared with 163 farms (76.17%) in cluster 2. Similarly to region North, all 51 farms positive for F. hepatica grouped together with their respective production parameters in one cluster (cluster 1). For both regions, the presence of pasture access appears to discriminate between clusters. In region North, a smaller proportion of farms in cluster 1 (75.16%) provided pasture access to their animals compared with farms in cluster 2 (100%). In study region South, pasture access appeared to be more common in farms within cluster 1 (92.16%) compared with cluster 2 (83.44%). Another predictor differing by clusters in both regions was farming type: conventional farming was more prevalent among Northern farms in cluster 1 (96.89%) than in cluster 2 (86.67%). Among farms in region South, the proportion of organic farms was higher in cluster 1 (41.18%) than in cluster 2 (7.36%). Moreover, positivity for O. ostertagi appeared to differ between clusters in both regions. In region North, the majority (86.67%) of farms in cluster 2 were positive for O. ostertagi compared with 40.37% of farms in cluster 1. Similarly, in region South, 88.24% of farms in cluster 1 were positive for O. ostertagi, whereas only 20.86% of farms in cluster 2 were positive.
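The percentages quoted for O. ostertagi are within-cluster proportions, i.e. the number of positive farms in a cluster divided by the size of that cluster. A minimal sketch that recomputes them from the counts in Table 3 (the function and dictionary names are hypothetical):

```python
def within_cluster_percent(positive: int, cluster_size: int) -> float:
    """Share of farms in a cluster that are positive, in percent."""
    return round(100.0 * positive / cluster_size, 2)

# (O. ostertagi-positive farms, farms in cluster), taken from Table 3.
table3 = {
    ("North", 1): (65, 161),
    ("North", 2): (26, 30),
    ("South", 1): (45, 51),
    ("South", 2): (34, 163),
}
for (region, cluster), (positive, size) in table3.items():
    print(region, cluster, within_cluster_percent(positive, size))
```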

O. ostertagi cluster analyses

The results from the clustering procedure for O. ostertagi (all study regions) are illustrated in Fig 2.

Fig 2

Cluster plot of the k-medoids clustering process for O. ostertagi in the three study regions.

Region North (top): Two clusters with 91 observations in cluster 1 (red) and 100 observations in cluster 2 (blue). Region East (middle): Two clusters with 71 observations in cluster 1 (red) and 130 observations in cluster 2 (blue). Region South (bottom): Two clusters with 79 observations in cluster 1 (red) and 135 observations in cluster 2 (blue).

In study region North, all 91 farms positive for O. ostertagi were grouped into cluster 1 whereas the remaining 100 farms were grouped into cluster 2. Similarly in the other two study regions, positive farms clustered together: in region East all 71 O. ostertagi-positive farms were included in cluster 1 and the remaining 130 farms in cluster 2. In region South, all 79 farms positive for O. ostertagi were assigned to cluster 1 and the 135 negative farms to cluster 2. Table 4 provides a summary of the cluster descriptive statistics across study regions for continuous variables, whereas results for categorical variables are provided in Table 5.

Table 4

Descriptive statistics of the O. ostertagi cluster analysis across study regions (continuous variables).

	Region North					Region East					Region South
	Cluster 1					Cluster 1					Cluster 1
Variable	Mean	Range	1st Qu.	Median	3rd Qu.	Mean	Range	1st Qu.	Median	3rd Qu.	Mean	Range	1st Qu.	Median	3rd Qu.
BCS	3.02	2.54–4.58	2.85	2.97	3.12	3.28	2.34–3.93	3.11	3.33	3.49	3.60	2.71–4.07	3.44	3.67	3.80
SCC¹,²	224.10	134.40–663.90	183.70	211.80	244.90	232.44	27.64–365.94	198.32	230.94	267.71	206.60	106.20–393.40	170.80	205.50	227.40
Lame³	22.14	0.00–76.92	10.19	19.12	28.00	36.52	0.00–70.95	25.40	38.12	49.07	19.92	0.00–57.90	9.55	18.75	26.77
DIM	211.40	131.60–360.60	187.80	205.00	230.20	210.40	101.30–333.70	193.90	208.40	226.40	201.40	118.50–291.30	176.30	203.00	220.90
Parity	2.90	1.79–5.10	2.59	2.78	3.18	2.82	1.97–5.09	2.52	2.71	3.01	3.00	1.92–4.76	2.59	2.92	3.21
Milk yield⁴	25.52	20.00–29.90	24.42	25.50	26.71	24.40	14.74–29.49	23.25	24.92	26.11	24.58	20.11–31.20	22.89	24.40	26.03
Milk fat⁵	3.84	3.37–4.32	3.72	3.79	3.96	3.71	3.40–4.41	3.63	3.70	3.77	3.92	3.53–4.18	3.84	3.92	4.00
Milk protein⁵	3.21	2.95–3.49	3.13	3.19	3.30	3.10	2.76–3.49	3.04	3.11	3.16	3.34	3.18–3.61	3.27	3.33	3.40
	Cluster 2					Cluster 2					Cluster 2
BCS	3.07	2.69–3.81	2.97	3.06	3.17	3.35	2.84–3.98	3.21	3.37	3.50	3.74	2.92–4.26	3.62	3.79	3.87
SCC¹,²	211.40	122.90–458.60	184.30	211.00	234.80	226.00	132.00–332.40	200.80	220.30	246.90	204.50	115.80–421.80	166.60	194.50	236.40
Lame³	29.17	3.18–61.36	17.67	27.48	39.10	39.60	0.00–77.63	32.64	39.40	47.59	28.90	0.00–76.47	19.59	27.12	36.47
DIM	215.20	166.60–360.70	199.10	212.20	232.60	202.30	127.00–320.10	185.00	203.80	217.20	194.70	134.50–614.70	175.20	188.90	203.90
Parity	2.80	1.98–4.18	2.60	2.74	2.99	2.79	1.86–5.15	2.56	2.74	2.95	2.84	1.85–4.50	2.56	2.82	3.07
Milk yield⁴	26.60	20.77–30.82	25.43	26.75	27.84	26.33	16.36–31.78	25.48	26.51	27.58	25.55	19.40–29.11	24.76	25.75	26.59
Milk fat⁵	3.78	3.44–4.13	3.68	3.76	3.91	3.66	3.09–4.58	3.60	3.66	3.72	3.98	3.69–4.35	3.89	3.97	4.05
Milk protein⁵	3.29	2.93–3.45	3.11	3.18	3.28	3.12	2.54–3.33	3.08	3.13	3.16	3.36	3.19–3.53	3.31	3.36	3.40

1 × 1,000

2 in number of cells/ml

3 Farm level prevalence in %

4 in kg

5 in %

Table 5

Descriptive statistics (observations per cluster) of the O. ostertagi cluster analysis across study regions (categorical variables).

Study Region Cluster (Counts [%])	Variable Categories Counts [%]
	Housing system			Pasture access		Farming type		Herd size¹			F. hepatica		O. ostertagi
	Tie Stall	Free Stall	Other	Yes	No	Conventional	Organic	Small	Medium	Large	Negative	Positive	Negative	Positive
Region North
Cluster 1 (91.00 [47.64])	7.00 [7.69]	70.00 [76.92]	14.00 [15.38]	85.00 [93.41]	6.00 [6.59]	82.00 [90.11]	9.00 [9.89]	27.00 [29.67]	54.00 [59.34]	10.00 [10.99]	65.00 [71.43]	26.00 [28.57]	0.00 [0.00]	91.00 [100.00]
Cluster 2 (100.00 [52.36])	1.00 [1.00]	92.00 [92.00]	7.00 [7.00]	66.00 [66.00]	34.00 [34.00]	100.00 [100.00]	0.00 [0.00]	21.00 [21.00]	41.00 [41.00]	38.00 [38.00]	96.00 [96.00]	4.00 [4.00]	100.00 [100.00]	0.00 [0.00]
Region East
Cluster 1 (71.00 [35.32])	2.00 [2.82]	55.00 [77.46]	14.00 [19.72]	55.00 [77.46]	16.00 [22.54]	55.00 [77.46]	16.00 [22.54]	24.00 [33.80]	38.00 [53.52]	9.00 [12.68]	-	-	0.00 [0.00]	71.00 [100.00]
Cluster 2 (130.00 [64.68])	0.00 [0.00]	102.00 [78.46]	28.00 [21.54]	78.00 [60.00]	52.00 [40.00]	126.00 [96.92]	4.00 [3.08]	26.00 [20.00]	63.00 [48.46]	41.00 [31.54]	-	-	130.00 [100.00]	0.00 [0.00]
Region South
Cluster 1 (79.00 [36.92])	27.00 [34.18]	50.00 [63.29]	2.00 [2.53]	57.00 [72.15]	22.00 [27.85]	52.00 [65.82]	27.00 [34.18]	27.00 [34.18]	39.00 [49.37]	13.00 [16.46]	34.00 [43.04]	45.00 [56.96]	0.00 [0.00]	79.00 [100.00]
Cluster 2 (135.00 [63.08])	28.00 [20.74]	102.00 [75.56]	5.00 [3.70]	17.00 [12.59]	118.00 [87.41]	129.00 [95.56]	6.00 [4.44]	25.00 [18.52]	71.00 [52.59]	39.00 [28.89]	129.00 [95.56]	6.00 [4.44]	135.00 [100.00]	0.00 [0.00]

1 number of cows present on farm, categorised

region North: small < 51.50 cows, medium 51.50–115.50 cows, large > 115.50 cows

region East: small < 129.00 cows, medium 129.00–418.00 cows, large > 418.00 cows

region South: small < 27 cows, medium 27–59 cows, large > 59 cows
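The region-specific herd size categories in the footnote above can be written as a small lookup. The following is a minimal sketch, not the authors' code (their analysis was carried out in R); the names `HERD_SIZE_CUTOFFS` and `herd_size_category` are hypothetical, and only the cut-off values are taken from the footnote.

```python
# Region-specific herd size cut-offs (number of cows), taken from the table footnote.
# The dictionary and function names are illustrative assumptions.
HERD_SIZE_CUTOFFS = {
    "North": (51.5, 115.5),
    "East": (129.0, 418.0),
    "South": (27.0, 59.0),
}


def herd_size_category(region: str, n_cows: float) -> str:
    """Return 'small', 'medium' or 'large' for a farm in the given study region."""
    lower, upper = HERD_SIZE_CUTOFFS[region]
    if n_cows < lower:
        return "small"
    if n_cows > upper:
        return "large"
    return "medium"  # cut-off values themselves fall into the medium category
```

A herd of 60 cows would thus count as large in region South but as medium in region North, which illustrates why herd size was categorised separately per region.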

Pasture access appeared to differ between clusters in all study regions: in each region, it was more prevalent in one cluster than in the other. Similarly, more organic farms were assigned to one cluster than to the other in all study regions. Positivity for F. hepatica did not appear to distinguish between clusters as clearly as positivity for O. ostertagi did in the F. hepatica cluster analyses.

Classification of clusters by means of random forest

F. hepatica analyses

Fig 3 illustrates the outcome of the random forest classification of the F. hepatica cluster analyses for study regions North (Fig 3A) and South (Fig 3B).

Fig 3

Mean decrease accuracy plots and variable importance plots for the random forest classification process of clusters 1 and 2 in study regions North and South (F. hepatica analyses).

A: study region North. B: study region South. The mean decrease accuracy plot expresses how much accuracy the model loses by excluding each variable. The more the accuracy suffers, the more important the variable is in distinguishing clusters; in other words, the higher the mean decrease accuracy, the higher the importance of the variable in the model. The first three criteria appear to be the most valuable ones in characterising clusters in region North, whereas five variables were ranked as the most relevant in region South.

BTM positivity for O. ostertagi (North: p = 0.01; South: p = 0.01) and pasture access (North: p = 0.050; South: p = 0.01) were the two most important variables discriminating clusters in both study regions North and South. Farming type ranked third in region North (p = 0.042) and fourth in region South (p = 0.01). The third most important variable in region South was farm level lameness prevalence (p = 0.01). Furthermore, herd size appeared to be relevant in discriminating among clusters in region South (p = 0.01).
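The idea behind a mean decrease accuracy ranking can be illustrated with a permutation-based sketch. This is not the authors' implementation: a deliberately simple nearest-centroid classifier stands in for the random forest, the toy data are simulated, and accuracy is evaluated on the training data rather than on out-of-bag samples as a random forest would typically do. All function names are hypothetical.

```python
import random


def fit_nearest_centroid(X, y):
    """Per-class feature means; a simple stand-in for a random forest."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, x) == lab for x, lab in zip(X, y))
    return hits / len(y)


def mean_decrease_accuracy(centroids, X, y, n_repeats=20, seed=0):
    """Average loss of accuracy after shuffling one variable at a time."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [x[j] for x in X]
            rng.shuffle(column)  # break the link between variable j and the outcome
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Simulated toy data: variable 0 determines the cluster, variable 1 is pure noise.
toy_rng = random.Random(1)
X = [[toy_rng.random(), toy_rng.random()] for _ in range(200)]
y = [int(x[0] > 0.5) for x in X]
model = fit_nearest_centroid(X, y)
mda = mean_decrease_accuracy(model, X, y)
```

In this sketch, shuffling the informative variable lowers accuracy considerably, whereas shuffling the noise variable changes it very little, so the informative variable receives the larger importance value.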

O. ostertagi analyses

As for the O. ostertagi analyses, eight variables appeared to be the most important for discriminating clusters in region North (Fig 4A): lameness prevalence (p = 0.01), BTM status for F. hepatica (p = 0.01), BCS (p = 0.01), herd size (p = 0.01), pasture access (p = 0.01), SCC (p = 0.01), DIM (p = 0.04), and farming type (p = 0.02). In region East, three variables, i.e. farming type (p = 0.01), pasture access (p = 0.01), and DIM (p = 0.04), were the most important criteria differentiating the two clusters (Fig 4B). BTM status for F. hepatica (p = 0.01), pasture access (p = 0.01), and farming type (p = 0.01) were the top ranking variables in region South (Fig 4C). Since only two farms were positive for F. hepatica in region East, the variable was excluded from the random forest approach in this study region.

Fig 4

Mean decrease accuracy plots and variable importance plots for the random forest classification process of clusters 1 and 2 in study regions North, East, and South (O. ostertagi analyses).

A: study region North. B: study region East. C: study region South. Changes of prediction accuracy, i.e. the decrease of prediction accuracy when excluding single variables. In region North, the first eight factors are the most valuable ones characterising clusters compared with the remaining variables. In study regions East and South, the first three variables were the most important ones differentiating clusters.

Discussion

The aims of the present study were (1) to evaluate whether distinct groups of farms can be identified with regard to milk parameters and BTM antibodies against two major helminth endoparasites in dairy cattle by means of an exploratory, unsupervised machine learning approach and (2) to compare potential clusters based on external factors. Furthermore, we intended to introduce k-medoids clustering and partitioning around medoids to a veterinary context. As we assumed that a certain number of groups may be present within the data, cluster analysis was chosen to reveal these groups in an unsupervised way. This assumption was based on previous work on the association of production parameters with the presence of F. hepatica and O. ostertagi [7–9, 42]. Cluster analysis is an exploratory technique that autonomously identifies naturally present similarities among observations within an underlying data set and subsequently assigns data points to different clusters [13–15]. The k-medoids algorithm was chosen for clustering as it allows for the evaluation of mixed data. Furthermore, it has the advantage of being robust to outliers and extreme values [32, 34, 41]. To our knowledge, this is the first study implementing a cluster analysis approach to evaluate production parameters and anti-parasite BTM antibody status in dairy cows. Moreover, the current work is among the very first of its kind to implement a k-medoids cluster algorithm and PAM in epidemiology [42, 43].

Clustering revealed two distinct clusters for all analyses, as assumed. We can hence infer that similarities exist between farms regarding the input variables (milk yield, milk fat content, milk protein content, antibody status) of the present analysis. The formation of distinct clusters of F. hepatica and O. ostertagi BTM results and parameters of milk production (i.e. milk yield) and milk quality (i.e. milk fat content, milk protein content) in the current study supports previous findings on associations of parasite exposure with production parameters [9, 44, 45]. Interestingly, all parasite-positive farms grouped together with their respective production parameters across all analyses. Hence, seropositivity for F. hepatica or O. ostertagi was a principal feature characterising clusters, as seropositive farms strongly grouped together with the respective production parameters and formed a separate cluster compared with seronegative farms. This finding not only corroborates that relevant differences exist between seropositive and seronegative farms per se, but also indicates that marked differences exist between those farms with regard to production parameters. Associations between parasite exposure and milk composition have been previously described [7–9]. The results of distinct groups of production parameters and BTM antibody status for F. hepatica and O. ostertagi in the current study appear intuitive and plausible and confirm the evidence from previous work [46–48].

May et al. [49] estimated an average loss of 1.62 kg milk per cow and day in dairy herds strongly infected with F. hepatica based on BTM ELISA, similar to the current work. The authors furthermore showed negative effects of F. hepatica infection status on milk quality as represented by milk protein and milk fat. For O. ostertagi, the situation appears similar to the evidence available for F. hepatica. Blanco-Penedo et al. [50] have reported an adverse effect of BTM infection status on milk yield, which is underscored by further studies indicating a strong association between O. ostertagi antibody status and milk production and quality parameters [51–53]. Interestingly, in their recent study, which in large parts was based on the data set collected in the underlying three-year cross-sectional study throughout Germany, Springer et al. [22] identified a negative impact on milk production level only in high-performance dairy breeds, while lower milk fat content was observed in dual-purpose herds positive for O. ostertagi. The authors traced this back to region-specific characteristics which may be relevant in this context. They also speculated that dual-purpose breeds may be more resilient towards the adverse effects of parasitic infection compared with high-performance dairy breeds. In the current study, breed was not included in the data set as the main focus was different. Furthermore, Springer et al. [22] built generalised linear models, and results may differ as a result of the chosen technique.

The implementation of an unsupervised machine learning approach in the present study is a particular advantage in comparison with common supervised modelling techniques, since no prior specification is made about response and explanatory variables and all factors enter the analysis equally. It is nevertheless important to be aware that cluster analysis is exploratory and does not explain the quality of similarity within clusters or the quality of dissimilarity between them. Therefore, further steps are required following a cluster analysis in order to compare the clusters formed and to identify external variables (not part of the cluster analysis) which characterise clusters. The random forest algorithm was implemented in the present study to identify factors distinguishing clusters for the single analyses. As two clusters were present in each of the cluster analyses, cluster was used as a binary outcome for the random forest approach. Random forests consistently provide among the highest prediction accuracies compared with other classification techniques, and complex random forest classifiers produce high discriminative performance [54, 55].
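To make the k-medoids idea discussed above more concrete, the following minimal sketch clusters mixed data (three numeric milk parameters and one categorical antibody status) by partitioning around medoids on a Gower-type dissimilarity. It is an illustration only and not the authors' R code: the choice of Gower dissimilarity is an assumption made for this sketch, the swap search is a naive variant of PAM started from the first k observations, and the six farms are hypothetical values.

```python
from itertools import product


def gower_matrix(rows, numeric_idx):
    """Pairwise Gower-type dissimilarities for rows mixing numeric and categorical values."""
    n, p = len(rows), len(rows[0])
    ranges = {}
    for j in numeric_idx:
        col = [r[j] for r in rows]
        ranges[j] = (max(col) - min(col)) or 1.0  # guard against a zero range
    D = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j in range(p):
                if j in ranges:  # numeric: range-scaled absolute difference
                    total += abs(rows[a][j] - rows[b][j]) / ranges[j]
                else:  # categorical: simple mismatch
                    total += 0.0 if rows[a][j] == rows[b][j] else 1.0
            D[a][b] = D[b][a] = total / p
    return D


def total_cost(D, medoids):
    """Sum of each observation's dissimilarity to its nearest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))


def pam(D, k):
    """Naive PAM: start from the first k observations, swap while the cost improves."""
    medoids = list(range(k))
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        for pos, cand in product(range(k), range(len(D))):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(D, trial)
            if cost < best - 1e-12:
                medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(len(D))]
    return medoids, labels


# Hypothetical farms: (milk yield in kg, milk fat in %, milk protein in %, BTM status)
farms = [
    (24.1, 3.9, 3.3, "positive"),
    (24.8, 3.8, 3.2, "positive"),
    (23.9, 3.9, 3.3, "positive"),
    (26.9, 3.7, 3.2, "negative"),
    (27.3, 3.7, 3.1, "negative"),
    (26.5, 3.8, 3.2, "negative"),
]
D = gower_matrix(farms, numeric_idx=[0, 1, 2])
medoids, labels = pam(D, k=2)
```

Because each cluster is represented by an actual observation (the medoid) rather than by a mean, the procedure works directly on a dissimilarity matrix and therefore accommodates categorical variables such as antibody status.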
Throughout the cluster analyses, the random forest algorithm confirmed that organic farming and pasture access were relevant factors differentiating clusters of similar quality both for F. hepatica and O. ostertagi. In nearly all random forests, these two factors were among the top ranking criteria of classification. The only exception was the O. ostertagi cluster analysis of region North, where pasture access (ranking fifth) and organic farming (ranking eighth), although identified as relevant based on the permutation test, did not rank at the very top of the variables discriminating clusters. This is probably because pasture access is more prevalent on farms in region North than in the remaining two study regions, so that other criteria appeared to be more relevant for differentiating between farms.

Organic farming is based on the idea of maintaining animal health and productivity through proactive rather than reactive management [56]. Practices often differ between organic and conventional operations [57, 58]. Dairy cow health and welfare represent one of the main components of an organic farming mindset and are achieved by improved housing conditions, less intensive production, and smaller herds. Furthermore, as organic farming associations often impose strict rules on the provenance and usage of feed components, the proportion of organic concentrate is limited and feeding regimes are frequently less intensive in an organic setting. Most importantly, pasturing cows is a central element of organic dairy farming [56–58]. Based on the epidemiology of F. hepatica and O. ostertagi, the results of the random forests regarding organic farming and pasture access are plausible. Feeding grass is a crucial element in the transmission dynamics of endoparasitic helminths in cattle. Infective stages present on pasture are transmitted to potential hosts via fresh grass, hay or silage [59–61].
Moreover, the aforementioned differences between farming types may also explain the difference between the clusters with regard to the other input variables (milk yield, milk fat, milk protein). The milk composition related input variables of the cluster analyses may well be influenced by the varying settings (e.g. pasture access, feeding regimes) on conventional and organic farms.

BTM positivity for F. hepatica was an important factor differentiating clusters in regions North and South of the O. ostertagi analysis. Similarly, BTM positivity for O. ostertagi appeared to be top ranking when discriminating clusters in the F. hepatica analyses. This is supported by the descriptive results showing that co-infections with both parasites were very common throughout study regions. Given the biology and life cycle of the parasites, which involve transmission through oral ingestion of infective stages via grass or pasture, it appears plausible that both might benefit from similar factors present at farm level and that co-infections may be common [62]. However, it is noteworthy that not only is co-infection very frequent, but the presence of one parasite also appeared to characterise the presence of the other. Evaluating how both species are intertwined in their impact on dairy farms would be a promising approach for future investigation. Furthermore, identifying criteria that farms with co-infection share as well as criteria that differ between co-infected, mono-infected, and parasite-free operations may yield practically relevant insights. Knowledge of infection intensity, impact, spatial distribution, and risk factors is pivotal to develop targeted intervention strategies.

Interestingly, lameness prevalence appeared to be a relevant variable discriminating clusters in the F. hepatica analysis of region South as well as in the O. ostertagi analysis of region North.
In the latter, lameness prevalence was the most important criterion to classify clusters. Lameness is a very common, major problem in dairy production that impinges upon animal welfare, physiological integrity, and economic viability [17, 63, 64]. Lame animals are not only impaired in their natural behavioural patterns but also experience severe, often chronic pain [64, 65]. The association of lameness with production related parameters has been demonstrated by previous research [17, 25, 66, 67]. It is hence reasonable to assume that lameness also has an association with the milk composition parameters used as input variables in the current cluster analysis. In addition, lameness is a condition that is strongly associated with housing and management practices on the farm [17, 25, 68]. As a consequence, farm-level lameness prevalence may reflect a certain type of on-farm setting regarding management elements and housing conditions, which may translate into additional animal health and welfare related issues. This may further characterise farms, which is the reason we chose this variable as an external predictor in our analysis.

As for the O. ostertagi cluster analysis in study region North, BCS appeared to be a relevant factor to discriminate clusters. Body condition scoring is an invaluable tool to assess animal health, production level, and management of dairy cows [69, 70]. A generally lower BCS on farm level may hence indicate impaired health of cows or management shortcomings. Parasitic infections have been demonstrated to be associated with a lower BCS [8, 22, 71], which renders it most plausible that BCS was identified as an important criterion to differentiate between parasite-positive and negative farms. Moreover, BCS has been associated with production parameters, which further supports the outcome of this variable separating clusters [72, 73].

Herd size was ranked fourth by the random forest for the O. ostertagi cluster analysis in study region North and fifth in the F. hepatica cluster analysis in study region South. Herd size may reflect different farm characteristics. For example, large farms may follow more intense managing procedures and feeding regimes. On the other hand, small farms may rather be organically managed or provide pasture access. These characteristics mediated by herd size translate not only into production parameters, but also into parasite status, as production levels may differ between small and large herds and the risk of parasite infection has also been associated with herd size [62].

In regions North and East, DIM was identified as an important variable differentiating clusters for the O. ostertagi analysis. According to Caffin et al. [74], the selective transport of antibodies to the udder decreases during peak lactation. Furthermore, in late lactation, antibodies are constantly transported to the mammary gland while milk production drops. This may translate into the level of specific antibodies against O. ostertagi detected by the ELISA used in the present study [75, 76]. On the other hand, stage of lactation has an effect on production parameters [77, 78], i.e. milk yield, milk fat, and milk protein, all input variables of the cluster analyses, which could also explain the relevance of this variable in discriminating clusters.

The random forest of the cluster analysis in study region North revealed the relevance of SCC as a factor differing between clusters. This is plausible given that SCC, a broadly used indicator for mastitis, has been shown to be correlated with the optical density ratio (ODR) for O. ostertagi in milk samples. Charlier et al. [79] stated that acute mastitis led to a subsequent increase in the O. ostertagi ODR values. Apart from that, SCC is associated with other input variables of the cluster analysis such as milk yield [80].
Furthermore, SCC has been associated with lameness [17], the top ranking variable discriminating clusters in study region North. Together, both factors are very common in dairy production and may act as a representation of a certain type of farm management and housing conditions.

One striking advantage of the present study is the evaluation of a total of 606 dairy farms with more than 50,000 cows in three structurally different dairy regions in Germany. It is paramount to be aware of regional variations, which can be marked in the dairy sector in Germany [18] and which necessitate that epidemiological studies are stratified by region to obtain valid and reliable results. This was the main rationale for conducting analyses separately for the three study regions. As a wide variety of dairy cow management systems and local characteristics were included within the current study, the results are not only applicable to the entire country; we are also confident that they can be extrapolated to other countries and the range of different systems in dairy production. Furthermore, the evaluation of data from all three study regions made it possible to reveal relevant factors which may not have been uncovered by analysing only one region or a smaller sample.

However, the cross-sectional nature of the present study has several limitations inherent to the study design itself [81] which need to be addressed. Firstly, as potential target variables and predictors are assessed simultaneously, a certain degree of bias can enter the data collection at this stage. Even though we cannot entirely rule out the incorporation of bias during the farm visits, we are confident that we minimised the risk of assessment bias by implementing a strict study protocol, concise standard operating procedures, and continuing assessment of observers.
Secondly, the cross-sectional study design does not allow for the inference of causality among variables. For this purpose, specific study designs ought to be used. One important feature of the present study was the voluntary participation of farmers, which may have created a certain level of selection bias either by encouraging proactive farms with improved housing and management conditions or by motivating farms that are dealing with specific problems in their management to participate. As outlined previously, farmer characteristics are crucial regarding animal health as well as the intrinsic compliance with external consultation [82–84]. The present study as well as the sampling procedure were based on randomisation in each of the study regions in alignment with region-specific herd size. Although we cannot entirely rule out selection bias in this context, we believe the introduction of bias was minimised by the way farm selection was randomised. Additionally, all outcomes are in accordance with the literature, which further corroborates our results.

A Bayesian non-parametric bootstrap approach was chosen to calculate Bayesian medians. A non-parametric bootstrap can be implemented to estimate a parameter from a set of observations where the assumption about the distribution can be relaxed. Repeated measures for production parameters (milk yield, milk fat, milk protein, SCC) were present for each of the cows within the data set up to 12 months prior to this study. In order to be able to compare farms, one single value per farm was necessary for the clustering procedure. We furthermore had a large set of data, of which we intended to use the entirety of available values for each animal and subsequently transfer them to farm level. According to Rubin [85], who introduced the Bayesian bootstrap, the most evident advantage of this method is the likelihood statement made about a parameter rather than just a frequency statement.
Therefore, the most plausible value reflecting the individual animal in a specific farm setting could be used to transfer the animal-level information to farm level during a second round of bootstrapping to obtain farm-level Bayesian medians. Since we intended to compare farms, the principal aim was to conserve the highest level of information with regard to the individual animal and subsequently with regard to the single farm. Generating a simple median would have entailed a considerable loss of information. A Bayesian bootstrap, however, represented an innovative, promising method to obtain the most plausible and most reliable values reflecting different farms.

We were able to detect farm-level patterns of parameters of milk composition and the BTM F. hepatica and O. ostertagi antibody status in a large set of farms across three structurally different regions in Germany. Co-infections with F. hepatica or O. ostertagi, respectively, pasture access, organic farming, BCS, farm level lameness prevalence, DIM, SCC, and herd size appeared to discriminate clusters, as confirmed by the random forest classifier. One striking finding of our study is that the cluster analysis detected evident differences between parasite-positive and negative farms without any supervision, solely based on the incorporated data. This was further supported by the biological reasoning with regard to other external predictors, which renders the findings even more plausible. K-medoids clustering and partitioning around medoids proved to be useful and innovative tools to handle complex data. This approach represents a promising tool for the evaluation of complex settings in a biological or veterinary context.
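The Bayesian bootstrap of a median described above can be sketched for a single level of aggregation as follows. This is an illustrative assumption of how such a calculation may look, not the authors' R implementation: the function names and the milk yield values are hypothetical, and the second round of bootstrapping from animal level to farm level is omitted.

```python
import random


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


def bayesian_bootstrap_median(values, n_draws=2000, seed=0):
    """Sorted posterior draws of the median under Rubin's Bayesian bootstrap."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Exp(1) variates, once normalised, follow a flat Dirichlet(1, ..., 1)
        # distribution; the weighted median does not require the normalisation.
        weights = [rng.expovariate(1.0) for _ in values]
        draws.append(weighted_median(values, weights))
    draws.sort()
    return draws


# Hypothetical repeated daily milk yields (kg) of one cow
yields = [24.1, 25.3, 26.0, 23.8, 27.2, 25.9, 24.7]
posterior = bayesian_bootstrap_median(yields)
point_estimate = posterior[len(posterior) // 2]  # posterior median of the median
interval = (
    posterior[int(0.025 * len(posterior))],
    posterior[int(0.975 * len(posterior))],
)
```

Instead of resampling observations with replacement, each draw reweights all observations, so every available measurement contributes to every draw, and the spread of the draws can be read as a posterior statement about the median.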

Conclusions

Based on the findings of the current study, it is reasonable to infer that considerable and biologically relevant differences exist between farms positive for F. hepatica or O. ostertagi, respectively, and negative farms. An unsupervised cluster analysis was implemented using the partitioning around medoids algorithm. The partitioning around medoids algorithm of the present study confirmed previous evidence and shed further light on the complex systems of associations between parasitic diseases, milk yield and milk constituents, and management practices. Future efforts may well use the presented approach on a broader scale to gain insights and identify relevant factors in complex disease settings.

List of R packages.

A complete list of implemented R packages in alphabetical order. (PDF)

Network structure of interrelationships among variables included in the random forest.

Input variables of cluster analyses and potential associations with external factors. (TIF)

ROC curves of the random forests.

(TIF)

Silhouette plots of each cluster analysis.

Across all cluster analyses, silhouette plots suggested k = 2 clusters to be the most appropriate number of clusters given the underlying data. (TIF)

22 Apr 2022

PONE-D-22-04434

A machine learning approach using k – mode clustering and random forest classification to model groups of production parameters and bulk tank milk antibody status of two major internal parasites in dairy cows

PLOS ONE

Dear Dr. Oehm,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 06 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Rebecca Lee Smith, D.V.M., M.S., Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure: “Farm visits and data collection in the context of the underlying cross-sectional study were financially supported by the German Federal Ministry of Food and Agriculture (BMEL) through the Federal Office for Agriculture and Food (BLE) grant number 2814HS008.” Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

Additional Editor Comments (if provided):

Please be certain to address all reviewer concerns, particularly as to terminology and full description of data.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file).
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: In the title the authors write "k-mode clustering" but in the method section it states "K-medoids" without mentioning "k-mode"(line 207). It should be noted that these terms and the corresponding algorithms are very different. title - What are these "groups of production parameters and bulk tank milk antibody status"? I don't see any groups of parameters reported. The groups should be of the farms according to the context. line 67, This sentence is confusing. what is the difference between the terms "cluster" and "group" in this sentence. For clustering analyses, the cluster is the group. line 191, "Missing values were excluded from" -> Farms with missing values were excluded Line 192, "Apart from" could mean "except for" or " in addition to". Is antibody status in or not in the model? I think " in addition to" is better here. 
line 213 "PAM replaces centroids with medoids" -> PAM replaces means with medoids

line 218 "Classification of Clusters by means of Random Forest". Clustering analysis is for unsupervised classification; I don't see why clusters need to be classified again. Besides, there are only 2 clusters reported (figure 1); how could these be classified again?

Reviewer #2: The manuscript describes a strong and novel work that used k-mode clustering to separate the internal features of the different dairy farms and used random forest to analyze features that are important to the cluster separation. The authors used very rigorous statistical approaches in data acquisition, cleaning, aggregation, and analysis. The authors clearly demonstrated the methodology and results, and came to a well-supported conclusion that the antibody status of two major internal parasites in dairy cows strongly affects the milk yield and nutrition content. The work thoroughly covered a vast number of farms over major dairy production regions in Germany. The conclusion was also supported by relevant studies and field. In addition to the scientific insight that brings direct economical benefits, the manuscript demonstrated the feasibility of using un-supervised methods to perform non-biased analysis on the topic. Therefore, I strongly recommend publication of the work.

********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: Weihao Ge

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site.
Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 25 Apr 2022 Dear Editor Dear Reviewer 1 Dear Reviewer 2 We wish to express our gratitude to each of you for taking the time and effort to critically revise our manuscript. We appreciate the overall supportive feedback and feel that the raised points are worth being addressed. Please find below your remarks and our replies. While working on the improvement of this work, we have applied great care to comply with your comments. If there are any further points that need to be modified, we are happy to discuss. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf We have modified the naming for the supplementary files both within the main body of the manuscript as well as of the uploaded files. We furthermore have effected some minor modifications within the references section. 2. 
Thank you for stating the following financial disclosure: “Farm visits and data collection in the context of the underlying cross-sectional study were financially supported by the German Federal Ministry of Food and Agriculture (BMEL) through the Federal Office for Agriculture and Food (BLE) grant number 2814HS008.” Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

We have added the required information that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We have also added this to the cover letter.

Reviewer #1: In the title the authors write "k-mode clustering" but in the method section it states "K-medoids" without mentioning "k-mode" (line 207). It should be noted that these terms and the corresponding algorithms are very different.

We have modified the title accordingly.

title - What are these "groups of production parameters and bulk tank milk antibody status"? I don't see any groups of parameters reported. The groups should be of the farms according to the context.

Yes, the groups are indeed the farms; in fact, groups of farms were modelled. We have modified the title to clarify this further.

line 67, This sentence is confusing. What is the difference between the terms "cluster" and "group" in this sentence? For clustering analyses, the cluster is the group.

We understand the confusion in this regard and have modified the sentence.

line 191, "Missing values were excluded from" -> Farms with missing values were excluded

Effected.

Line 192, "Apart from" could mean "except for" or "in addition to". Is antibody status in or not in the model? I think "in addition to" is better here.

Thank you. We have modified the expression in alignment with your remark.

line 213 "PAM replaces centroids with medoids" -> PAM replaces means with medoids

Effected.

line 218 "Classification of Clusters by means of Random Forest". Clustering analysis is for unsupervised classification; I don't see why clusters need to be classified again. Besides, there are only 2 clusters reported (figure 1); how could these be classified again?

The cluster analysis solely grouped the farms together in the first place. As you said, this is already a sort of classification, sorting similar data points (i.e. farms) together. However, after this initial clustering (i.e. assignment of farms to one of the two clusters within each study region), we only know that within each study region there are apparently two different groups of farms, represented by the clusters. So far, so good. Our intention was to further understand what lay behind these two groups of farms, i.e. how both clusters can be characterised or differentiated based on further variables. Therefore, after the initial sorting of farms into the two clusters, we took external variables which were not used for cluster creation and which were potentially related to the input variables of the cluster analyses (please refer to the network structure, where we illustrate this) in order to find out how these external factors could help characterise the clusters. This was done using the random forest approach, which allowed us to identify factors that characterise and differentiate the initially formed clusters and to find the most important variables reflecting differences among clusters. To avoid confusion, we have replaced the term “classification” with the term “characterisation” to make this clearer. Furthermore, we have added some more information to the first sentence of this paragraph.

Reviewer #2: The manuscript describes a strong and novel work that used k-mode clustering to separate the internal features of the different dairy farms and used random forest to analyze features that are important to the cluster separation. The authors used very rigorous statistical approaches in data acquisition, cleaning, aggregation, and analysis. The authors clearly demonstrated the methodology and results, and came to a well-supported conclusion that the antibody status of two major internal parasites in dairy cows strongly affects the milk yield and nutrition content. The work thoroughly covered a vast number of farms over major dairy production regions in Germany. The conclusion was also supported by relevant studies and field. In addition to the scientific insight that brings direct economical benefits, the manuscript demonstrated the feasibility of using un-supervised methods to perform non-biased analysis on the topic. Therefore, I strongly recommend publication of the work.

We greatly appreciate this very positive and supportive feedback.

Submitted filename: Response to Reviewers.docx

21 Jun 2022
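As an illustration of the clustering step discussed in the exchange with Reviewer #1, the following minimal Python sketch groups a handful of farms with a k-medoids objective on Gower's dissimilarity. It is not the authors' code (their analysis was carried out in R): the six farms and their values are invented, the function names are hypothetical, and an exhaustive search over medoid pairs stands in for PAM's build and swap heuristics. The search optimises the same objective (total dissimilarity of each farm to its nearest medoid) but is only feasible for very small data sets.

```python
from itertools import combinations

# Hypothetical toy farms (invented values, not study data):
# (milk yield, milk fat %, milk protein %, bulk tank milk antibody status)
farms = [
    (31.0, 4.0, 3.5, "negative"),
    (30.0, 4.1, 3.4, "negative"),
    (29.5, 3.9, 3.5, "negative"),
    (24.0, 4.4, 3.2, "positive"),
    (23.0, 4.5, 3.1, "positive"),
    (25.0, 4.3, 3.2, "positive"),
]


def gower_matrix(rows):
    """Gower dissimilarity: range-scaled absolute difference for numeric
    columns, simple mismatch (0/1) for categorical columns, averaged."""
    n_cols = len(rows[0])
    ranges = []
    for j in range(n_cols):
        col = [r[j] for r in rows]
        if isinstance(col[0], str):
            ranges.append(None)  # categorical column
        else:
            ranges.append((max(col) - min(col)) or 1.0)  # avoid division by zero

    def dissimilarity(a, b):
        total = 0.0
        for j in range(n_cols):
            if ranges[j] is None:
                total += 0.0 if a[j] == b[j] else 1.0
            else:
                total += abs(a[j] - b[j]) / ranges[j]
        return total / n_cols

    return [[dissimilarity(a, b) for b in rows] for a in rows]


def k_medoids_exhaustive(dist, k):
    """Try every set of k medoids and keep the one with the smallest total
    dissimilarity; a brute-force stand-in for PAM on tiny data."""
    n = len(dist)
    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = [
        min(range(k), key=lambda c: dist[i][best_medoids[c]]) for i in range(n)
    ]
    return best_medoids, labels


dist = gower_matrix(farms)
medoids, labels = k_medoids_exhaustive(dist, k=2)
print("medoid farms:", medoids)
print("cluster labels:", labels)
```

In this toy data the antibody-negative and antibody-positive farms end up in separate clusters; characterising such clusters with variables that were not used for clustering is then a separate, supervised step.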

PONE-D-22-04434R1

A machine learning approach using partitioning around medoids clustering and random forest classification to model groups of farms in regard to production parameters and bulk tank milk antibody status of two major internal parasites in dairy cows

Please submit your revised manuscript by Aug 05 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Rebecca Lee Smith, D.V.M., M.S., Ph.D.
Academic Editor
PLOS ONE

Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. Additional Editor Comments: Please consider the recommendations of the reviewer. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #2: All comments have been addressed Reviewer #3: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #2: Yes Reviewer #3: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #2: Yes Reviewer #3: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. 
If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified. Reviewer #2: Yes Reviewer #3: Yes

********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #2: Yes Reviewer #3: Yes

********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The manuscript is strong. The work applied an unsupervised approach that naturally separated the studied farms by positive/negative of common bacterial antibodies. With further feature analysis, the work identified the variables that are associated with infections as well as milk yield. The work is of important direct economical impact and is based on sound statistics. Therefore, I recommend the publication of the work.

Reviewer #3: The results presented in this paper are interesting and of practical importance in identifying biologically relevant differences between farms positive for F. hepatica or O. ostertagi, respectively and negative farms using farm-level bulk tank milk. This is important in the veterinary field. The revisions suggested are minor and involve some additional explanations, plots, and/or statistical procedures.

Minor Revisions and comments:

1. p. 9, line 200: The authors have chosen the Gower’s distance matrix for clustering stating that it is the most common distance matrix for a mix of categorical and continuous values and cite 2 papers (28, 29). It is probably the first distance measure proposed in 1971, but there are currently many more choices. An unsupervised random forest can be used to obtain a proximity matrix that could also be used for clustering. A random forest easily handles mixed types of data. See Conrad and Bailey 2015 PLoS One paper on clustering Cystic Fibrosis patients, for an example. It will probably not make much difference in the clustering results, but the authors should be aware of other measures.

2. p. 15, line 303: For all cluster analyses, the silhouette method selected 2 clusters to be optimal … Authors should provide an average silhouette plot that shows that 2 clusters are optimal. This is because in the Cluster Analyses Section and the description of Fig 1 (F. hepatica cluster analyses) and Fig 2 (O. ostertagi cluster analyses) there were often a “majority cluster” and a “mixed cluster”. It would be very interesting to know whether, if k=3 clusters were chosen (assuming that the average silhouette plot showed that 2 or 3 clusters were reasonable choices), the mixed cluster would divide into 2 additional more homogeneous or identifiable clusters.

3. p. 27-28 Section: Classification of Clusters by means of Random Forest Results of a supervised random forest and variable importance plots are given in Fig 3 (F. hepatica for North (Fig 3 A) and South (Fig 3 B)) and Fig 4 (O. ostertagi for North (Fig 4 A) and East (Fig 4 B) and South (Fig 4 C)). Statements are made about which variables are “most important”. This is obtained from the ranking of the variables. There is an rfPermute package that will perform a permutation test and provide estimated permutation p-values for the importance metric of the random forest by permuting the response variable. This would allow the authors to make statements about important variables with a reasonable p-value cut-off. This will strengthen the statements about identifying variables that are most important in classifying or separating the clusters.

4. p. 2: In the Abstract it is stated “Across all study regions, co-infections with F. hepatica or O. ostertagi, respectively, farming type, and pasture access appeared to be the most important factors discriminating clusters (i.e. farms). Furthermore, herd size, BCS, and stage of lactation were relevant criteria distinguishing clusters.” Permutation p-values will allow the statement to be made using a p-value cut-off.

5. Discussion: p. 29, line 477: Authors state “Moreover, the current work is probably the first of its kind to implement a k-medoids cluster algorithm and PAM.” Please see and cite:
DJ Conrad, BA Bailey, 2015 Multidimensional clinical phenotyping of an adult cystic fibrosis patient population PLoS One 10 (3), e0122705
DJ Conrad, J Billings, C Teneback, J Koff, D Rosenbluth, BA Bailey, ..., 2021 Multi-dimensional clinical phenotyping of a national cohort of adult cystic fibrosis patients Journal of Cystic Fibrosis 20 (1), 91-96

********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #2: Yes: Weihao Ge Reviewer #3: No

********** [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/.
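Reviewer #3's second comment concerns choosing the number of clusters from the average silhouette width. The following minimal Python sketch shows how that quantity is computed from a dissimilarity matrix and a set of cluster labels. The one-dimensional values and both partitions are invented and hand-assigned for illustration only; in practice the clustering would be re-run for each candidate k, and the function name is hypothetical.

```python
def average_silhouette(dist, labels):
    """Average silhouette width for a labelling, given a square
    dissimilarity matrix. For each object, a is the mean dissimilarity to
    its own cluster, b the smallest mean dissimilarity to another cluster,
    and the silhouette is (b - a) / max(a, b). Singletons score 0."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # usual convention for a one-member cluster
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / n


# Invented one-dimensional values forming two well-separated groups
values = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
dist = [[abs(u - v) for v in values] for u in values]

two_clusters = [0, 0, 0, 1, 1, 1]    # hand-assigned k=2 partition
three_clusters = [0, 0, 2, 1, 1, 1]  # hand-assigned k=3 partition (splits group 1)

sil_k2 = average_silhouette(dist, two_clusters)
sil_k3 = average_silhouette(dist, three_clusters)
print(f"average silhouette, k=2: {sil_k2:.3f}")
print(f"average silhouette, k=3: {sil_k3:.3f}")
```

For these toy values the two-cluster partition has the clearly higher average silhouette width, which is the kind of comparison an average silhouette plot over several values of k makes visible.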

24 Jun 2022

Reviewer #2: The manuscript is strong. The work applied an unsupervised approach that naturally separated the studied farms by positive/negative of common bacterial antibodies. With further feature analysis, the work identified the variables that are associated with infections as well as milk yield. The work is of important direct economical impact and is based on sound statistics. Therefore, I recommend the publication of the work.

Dear Reviewer 2, we again greatly appreciate your support and positive appraisal of this article.

Reviewer #3: The results presented in this paper are interesting and of practical importance in identifying biologically relevant differences between farms positive for F. hepatica or O. ostertagi, respectively and negative farms using farm-level bulk tank milk. This is important in the veterinary field.

Dear Reviewer 3, thank you very much for the positive and constructive feedback on our work. Although the points you have raised are minor, we feel they are important to address and will considerably improve and round off this article.

The revisions suggested are minor and involve some additional explanations, plots, and/or statistical procedures. Minor Revisions and comments:

1. p. 9, line 200: The authors have chosen the Gower’s distance matrix for clustering stating that it is the most common distance matrix for a mix of categorical and continuous values and cite 2 papers (28, 29). It is probably the first distance measure proposed in 1971, but there are currently many more choices. An unsupervised random forest can be used to obtain a proximity matrix that could also be used for clustering. A random forest easily handles mixed types of data. See Conrad and Bailey 2015 PLoS One paper on clustering Cystic Fibrosis patients, for an example. It will probably not make much difference in the clustering results, but the authors should be aware of other measures.

Thank you. We have changed the wording of this sentence.

2. p. 15, line 303: For all cluster analyses, the silhouette method selected 2 clusters to be optimal … Authors should provide an average silhouette plot that shows that 2 clusters are optimal. This is because in the Cluster Analyses Section and the description of Fig 1 (F. hepatica cluster analyses) and Fig 2 (O. ostertagi cluster analyses) there were often a “majority cluster” and a “mixed cluster”. It would be very interesting to know whether, if k=3 clusters were chosen (assuming that the average silhouette plot showed that 2 or 3 clusters were reasonable choices), the mixed cluster would divide into 2 additional more homogeneous or identifiable clusters.

We have provided the silhouette plots for every one of the cluster analyses. We agree that, given the constant presence of a “majority cluster” and a “mixed cluster”, it would have been very interesting to know whether, had k=3 clusters been chosen (assuming that the average silhouette plot showed that 2 or 3 clusters were reasonable choices), the mixed cluster would have divided into 2 additional more homogeneous or identifiable clusters. However, the silhouette plots unequivocally demonstrated that a 2-cluster solution was statistically most appropriate for the underlying data set, as the objects matched well with their own cluster and poorly with other clusters, indicated by a silhouette score of > 0.6. By contrast, the silhouette score for a k=3 cluster solution was visibly and considerably lower. Considering the subsequent descriptive results from the clustering as well as the random forest results, and taking biological reasoning into account, the selection of two clusters seems even more plausible and strengthens the work. By providing this information and including all silhouette plots as additional files to this manuscript, we believe that the plausibility, transparency, and traceability of our analyses are ensured.

3. p. 27-28 Section: Classification of Clusters by means of Random Forest Results of a supervised random forest and variable importance plots are given in Fig 3 (F. hepatica for North (Fig 3 A) and South (Fig 3 B)) and Fig 4 (O. ostertagi for North (Fig 4 A) and East (Fig 4 B) and South (Fig 4 C)). Statements are made about which variables are “most important”. This is obtained from the ranking of the variables. There is an rfPermute package that will perform a permutation test and provide estimated permutation p-values for the importance metric of the random forest by permuting the response variable. This would allow the authors to make statements about important variables with a reasonable p-value cut-off. This will strengthen the statements about identifying variables that are most important in classifying or separating the clusters.

We appreciate this very useful suggestion. We have now included a permutation test as suggested using the rfPermute package (also updated in the supplementary file of R packages used). Based on the results, we have made modifications in the Results and Discussion sections (and a short remark in the Materials & Methods section) in order to present all relevant variables discriminating clusters based on these p values. We are convinced that this has had a very positive impact on the quality of this work and has increased the strength of the conclusions drawn from our analyses. We therefore greatly appreciate your comment prompting us to look in this direction.

4. p. 2: In the Abstract it is stated “Across all study regions, co-infections with F. hepatica or O. ostertagi, respectively, farming type, and pasture access appeared to be the most important factors discriminating clusters (i.e. farms). Furthermore, herd size, BCS, and stage of lactation were relevant criteria distinguishing clusters.” Permutation p-values will allow the statement to be made using a p-value cut-off.

We have now included the estimation of permutation p values within the manuscript to support this statement. In order to comply with the journal's guidelines on abstract length, which should not exceed 300 words, we have refrained from including these p values in the abstract.

5. Discussion: p. 29, line 477: Authors state “Moreover, the current work is probably the first of its kind to implement a k-medoids cluster algorithm and PAM.” Please see and cite:
DJ Conrad, BA Bailey, 2015 Multidimensional clinical phenotyping of an adult cystic fibrosis patient population PLoS One 10 (3), e0122705
DJ Conrad, J Billings, C Teneback, J Koff, D Rosenbluth, BA Bailey, ..., 2021 Multi-dimensional clinical phenotyping of a national cohort of adult cystic fibrosis patients Journal of Cystic Fibrosis 20 (1), 91-96

We have rephrased the sentence and included both references.

Submitted filename: Response to Reviewers.docx

30 Jun 2022

A machine learning approach using partitioning around medoids clustering and random forest classification to model groups of farms in regard to production parameters and bulk tank milk antibody status of two major internal parasites in dairy cows

PONE-D-22-04434R2

Dear Dr. Oehm,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Rebecca Lee Smith, D.V.M., M.S., Ph.D. Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: 1 Jul 2022 PONE-D-22-04434R2 A machine learning approach using partitioning around medoids clustering and random forest classification to model groups of farms in regard to production parameters and bulk tank milk antibody status of two major internal parasites in dairy cows Dear Dr. Oehm: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. 
Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Rebecca Lee Smith Academic Editor PLOS ONE
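The permutation p-values that Reviewer #3 suggested, and that the authors report obtaining with the rfPermute R package, rest on a simple idea: compute an importance score for a variable, then recompute it many times after shuffling the response (here, cluster membership) to see how often chance alone produces a score at least as large. The Python sketch below illustrates only that idea. It does not reproduce rfPermute or a random forest: the best accuracy of a single-threshold split (a decision stump) stands in for the forest's importance metric, and the data, variable names, and function names are invented.

```python
import random


def stump_importance(x, y):
    """Stand-in importance score: best accuracy achievable when predicting
    the binary label y from x with one threshold (either direction)."""
    best = 0.0
    for t in sorted(set(x)):
        predicted = [1 if v > t else 0 for v in x]
        acc = sum(p == c for p, c in zip(predicted, y)) / len(y)
        best = max(best, acc, 1.0 - acc)
    return best


def permutation_p_value(x, y, n_perm=999, seed=1):
    """Permute the response n_perm times and return the observed score with
    its permutation p-value, (exceedances + 1) / (n_perm + 1)."""
    rng = random.Random(seed)
    observed = stump_importance(x, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stump_importance(x, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Invented example: cluster membership of 12 farms and two external variables
cluster = [0] * 6 + [1] * 6
herd_size = [45, 52, 48, 60, 55, 50, 95, 110, 88, 102, 120, 99]  # differs by cluster
unrelated = [3, 9, 1, 7, 5, 8, 2, 6, 4, 10, 0, 11]               # no real pattern

obs_herd, p_herd = permutation_p_value(herd_size, cluster)
obs_unrel, p_unrel = permutation_p_value(unrelated, cluster)
print(f"herd_size: score={obs_herd:.2f}, permutation p={p_herd:.3f}")
print(f"unrelated: score={obs_unrel:.2f}, permutation p={p_unrel:.3f}")
```

A variable that truly separates the clusters yields a small permutation p-value, whereas an unrelated variable can still reach a moderate score by chance and yields a large one; this is why a p-value cut-off is more defensible than a ranking alone.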

60 in total

1. Association between Dictyocaulus viviparus status and milk production parameters in Dutch dairy herds.

Authors: M Dank; M Holzhauer; A Veldhuis; K Frankena
Journal: J Dairy Sci Date: 2015-08-20 Impact factor: 4.034

2. Moderate lameness leads to marked behavioral changes in dairy cows.

Authors: H C Weigele; L Gygax; A Steiner; B Wechsler; J-B Burla
Journal: J Dairy Sci Date: 2017-12-28 Impact factor: 4.034

3. Apparent prevalence of and risk factors for infection with Ostertagia ostertagi, Fasciola hepatica and Dictyocaulus viviparus in Swiss dairy herds.

Authors: C F Frey; R Eicher; K Raue; C Strube; M Bodmer; B Hentrich; B Gottstein; N Marreros
Journal: Vet Parasitol Date: 2017-12-05 Impact factor: 2.738

4. Diagnosis before treatment: Identifying dairy farmers' determinants for the adoption of sustainable practices in gastrointestinal nematode control.

Authors: F Vande Velde; E Claerebout; V Cauberghe; L Hudders; H Van Loo; J Vercruysse; J Charlier
Journal: Vet Parasitol Date: 2015-07-20 Impact factor: 2.738

5. Lameness prevalence and risk factors in organic and non-organic dairy herds in the United Kingdom.

Authors: Kenneth M D Rutherford; Fritha M Langford; Mhairi C Jack; Lorna Sherwood; Alistair B Lawrence; Marie J Haskell
Journal: Vet J Date: 2008-05-06 Impact factor: 2.688

6. Prevalence of Fasciola hepatica in the intermediate host Lymnaea truncatula detected by real time TaqMan PCR in populations from 70 Swiss farms with cattle husbandry.

Authors: G Schweizer; M L Meli; P R Torgerson; H Lutz; P Deplazes; U Braun
Journal: Vet Parasitol Date: 2007-11-01 Impact factor: 2.738