D Schokker, I N Athanasiadis, B Visser, R F Veerkamp, C Kamphuis.
Abstract
With the increasing availability of large amounts of data in the livestock domain, we face the challenge of storing, combining and analysing these data efficiently. In this study, we explored the use of a data lake for storing and analysing data to improve scalability and interoperability. Data originated from a 2-day animal experiment in which the gait score of approximately 200 turkeys was determined through visual inspection by an expert. Additionally, inertial measurement units (IMUs), a 3D-video camera and a force plate (FP) were installed to explore the effectiveness of these sensors in automating the visual gait scoring. We deployed a data lake using the IMU and FP data of a single day of that animal experiment. This encompassed data from 84 turkeys, which we preprocessed by performing an 'extract, transform and load' (ETL) procedure. To test the scalability of the ETL-procedure, we simulated increasing volumes of the available data from this animal experiment and computed the 'wall time' (elapsed real time) for converting FP data into comma-separated files and storing these files. With a simulated data set of 30 000 turkeys, the wall time decreased from 1 h with 1 core to less than 15 min with 12 cores. This demonstrated that the ETL-procedure is scalable. Subsequently, a machine learning (ML) pipeline was developed to test the potential of a data lake to automatically distinguish between two classes, that is, very bad gait scores v. other scores. In conclusion, we have set up a dedicated customized data lake, loaded data and developed a prediction model via the creation of an ML pipeline. A data lake appears to be a useful tool to face the challenge of storing, combining and analysing increasing volumes of data of varying nature in an effective manner.
Keywords: data lake; extract, transform and load; machine learning; scalability; sensors
Year: 2020 PMID: 32624081 PMCID: PMC7538337 DOI: 10.1017/S175173112000155X
Source DB: PubMed Journal: Animal ISSN: 1751-7311 Impact factor: 3.240
Summary statistics for the sensor data and the expert-assigned gait score of turkeys
| Sensor | Attribute | Unit | Type | Minimum | Maximum | Mean | SD |
|---|---|---|---|---|---|---|---|
| Force plate | Walk duration | Seconds | Numeric | 0.69 | 14.7 | 2.5 | 1.9 |
| Force plate | Max vertical axis force | Newton | Numeric | 13.75 | 57.27 | 19.5 | 5.6 |
| Inertial measurement units | Roll axis sign changes | Dimensionless (integer) | Numeric | 3 | 2568 | 406.2 | 445.3 |
| Human expert | Gait score | Dimensionless (ordinal) | Categorical (two classes) | Very bad (30), otherwise (54) | | | |
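
The IMU attribute in the table above, "Roll axis sign changes", is an integer count per turkey. The record does not say how it was computed, so the following is only a minimal sketch of one plausible definition: the number of times a roll-axis signal switches between positive and negative values. The function name and the rule of skipping exact zeros are assumptions made for illustration.

```python
def count_sign_changes(samples):
    """Count switches between positive and negative values.

    Exact zeros are skipped (an assumption; the source does not
    define how zero readings are treated).
    """
    changes = 0
    previous = 0  # sign of the last non-zero sample seen so far
    for value in samples:
        sign = (value > 0) - (value < 0)  # -1, 0 or +1
        if sign == 0:
            continue
        if previous != 0 and sign != previous:
            changes += 1
        previous = sign
    return changes


# Made-up roll-axis readings, not data from the experiment.
toy_roll = [1.0, -1.0, 2.0, 0.0, -3.0, -4.0, 5.0]
toy_changes = count_sign_changes(toy_roll)
print(toy_changes)
```
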
Figure 1. Flow diagram of the data lake. Turkey data are first ingested into the data lake, followed by the ‘extract, transform and load’ procedure, and lastly the data can be processed and analysed. IMU, inertial measurement unit.
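
The ETL step in this flow converts binary force plate data into comma-separated files. A minimal sketch of such a conversion is given here, using only the Python standard library. The binary layout (pairs of little-endian 32-bit floats), the column names and the rounding step are all hypothetical; the record does not describe the actual file format or the transformations applied.

```python
import csv
import io
import struct

# Hypothetical record layout: (time_s, vertical_force_n) as two float32 values.
RECORD = struct.Struct("<ff")


def extract(raw):
    """Decode fixed-size binary records into a list of tuples."""
    return list(RECORD.iter_unpack(raw))


def transform(records, ndigits=3):
    """Round values to a fixed number of decimals (illustrative only)."""
    return [(round(t, ndigits), round(f, ndigits)) for t, f in records]


def load(rows, out):
    """Write the rows as comma-separated text to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time_s", "vertical_force_n"])
    writer.writerows(rows)


# Two made-up samples packed into the hypothetical binary format.
raw = RECORD.pack(0.0, 13.75) + RECORD.pack(0.5, 57.25)
buffer = io.StringIO()
load(transform(extract(raw)), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In a data lake the `load` target would typically be distributed storage rather than an in-memory buffer; the buffer is used here only to keep the sketch self-contained.
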
File sizes of the turkey data sets, in the original binary format and after the ETL-procedure, used to test data preprocessing and data analyses
| Number of turkeys | Original data size (binary format) | Data size after the ETL-procedure (open data format) |
|---|---|---|
| 30 | 8.4 MB | 18.6 MB |
| 300 | 84 MB | 186.1 MB |
| 3000 | 837 MB | 1.8 GB |
| 30 000 | 8.4 GB | 18.2 GB |
ETL = extract, transform and load.
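
The sizes in the table above imply that the open data format takes roughly twice the space of the binary format at every scale. The short calculation below makes that ratio explicit. It assumes 1 GB = 1000 MB, which the record does not state; with 1024 MB per GB the ratios shift slightly but stay close to 2.2.

```python
MB_PER_GB = 1000  # assumption: decimal units; the source does not specify

# number of turkeys -> (binary size in MB, size after ETL in MB), from the table
sizes_mb = {
    30: (8.4, 18.6),
    300: (84.0, 186.1),
    3000: (837.0, 1.8 * MB_PER_GB),
    30000: (8.4 * MB_PER_GB, 18.2 * MB_PER_GB),
}

# Expansion factor of the open format relative to the binary format.
ratios = {n: after / before for n, (before, after) in sizes_mb.items()}

for n, ratio in ratios.items():
    print(f"{n:>6} turkeys: x{ratio:.2f}")
```
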
Figure 2. Wall time (s, min and h) for converting binary force plate data of turkeys into comma-separated file format and storing them on the Hadoop Distributed File System (HDFS). The x-axis depicts the number of cores for each configuration, whereas the y-axis depicts the wall time (note the logarithmic scale).
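
To put the wall times in perspective, the abstract reports a drop from 1 h on 1 core to less than 15 min on 12 cores for the simulated 30 000-turkey data set. The sketch below shows one way to measure wall time and to turn those two figures into a speedup and a parallel efficiency. The helper names are illustrative, and the two times are treated as round numbers (60 and 15 min), so the results are lower bounds rather than reported values.

```python
import time


def wall_time(func, *args):
    """Run func(*args) and return (result, elapsed real time in seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Speedup per core; 1.0 would mean perfect linear scaling."""
    return speedup(t_serial, t_parallel) / cores


# Round figures taken from the abstract (minutes); 15 min is an upper bound.
T1_MIN, T12_MIN, CORES = 60.0, 15.0, 12

min_speedup = speedup(T1_MIN, T12_MIN)
min_efficiency = efficiency(T1_MIN, T12_MIN, CORES)
print(f"speedup >= {min_speedup:.1f}, efficiency >= {min_efficiency:.2f}")

# Timing an arbitrary callable, here a built-in, to show the helper in use.
total, elapsed_s = wall_time(sum, range(1000))
```

On these figures the 12-core run is at least four times faster, which corresponds to a parallel efficiency of at least about one third.
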
Area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the random forest classification models using 5-fold cross-validation of turkey data
| Number of trees | AUROC | AUPRC |
|---|---|---|
| 3 | 0.754 (0.04) | 0.688 (0.05) |
| 10 | 0.752 (0.02) | 0.682 (0.04) |
| 25 | 0.8 (0.02) | 0.718 (0.04) |
| 40 | 0.783 (0.01) | 0.704 (0.03) |
Reported average values over five different seeds (SD in brackets).
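
As a reading aid for the two metrics in this table, the sketch below computes AUROC and a step-wise AUPRC (average precision) from labels and scores in plain Python, and summarises repeated runs as mean and SD. It is not the authors' pipeline: the function names and the toy scores are made up, and the use of the sample SD is an assumption. The class counts (30 'very bad' v. 54 other turkeys) come from the summary statistics table; a classifier that ranks at random is expected to reach an AUROC of 0.5 and an AUPRC close to the positive-class prevalence.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """Chance that a random positive scores above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Tied scores keep their input order (a simplification).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each positive
    return total / hits


def mean_sd(values):
    """Mean and sample SD, e.g. over runs with different seeds."""
    return mean(values), stdev(values)


# Toy example with made-up scores, not data from the study.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auroc = auroc(labels, scores)
toy_auprc = average_precision(labels, scores)

# Positive-class prevalence from the summary statistics table.
prevalence = 30 / (30 + 54)
print(f"AUROC={toy_auroc:.3f} AUPRC={toy_auprc:.3f} baseline={prevalence:.3f}")
```

Assuming all 84 turkeys were used for evaluation, the AUPRC baseline is about 0.36, which is the value to compare the AUPRC column against.
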