| Literature DB >> 32582448 |
Fritz A Francisco, Paul Nührenberg, Alex Jordan.
Abstract
BACKGROUND: Acquiring high resolution quantitative behavioural data underwater often involves installation of costly infrastructure, or capture and manipulation of animals. Aquatic movement ecology can therefore be limited in taxonomic range and ecological coverage.Entities:
Keywords: 3D tracking; Aquatic ecosystems; Collective behaviour; Computer vision; Machine learning; Structure from motion
Year: 2020 PMID: 32582448 PMCID: PMC7310323 DOI: 10.1186/s40462-020-00214-w
Source DB: PubMed Journal: Mov Ecol ISSN: 2051-3933 Impact factor: 3.600
Summary of acquired datasets
| Dataset | Location | Approach | Species | Duration (m:ss) | N | Setup | Dist. (m) | Tags | Pose | Complexity |
|---|---|---|---|---|---|---|---|---|---|---|
| single | STARESO | snorkel | C. conger | 0:20 | 1 | stereo | 0.4 | no | yes | high |
| mixed | STARESO | snorkel | M. surmuletus, D. vulgaris | 7:09 | 1, 2 | stereo | 0.6 | no | yes | low |
| school | Tanganyika | dive | L. callipterus | 0:35 | 11 | multi (12) | 0.2 | yes | no | medium |
| accuracy | STARESO | dive | NA | 3:14 - 5:00 | NA | multi (4) | 0.6 | NA | NA | varying |
N lists the number of tracked individuals, Dist. the minimum camera-to-camera distance in the setups, Tags whether individual animals were tagged and Pose if animal spine pose estimation was used during tracking. Complexity lists an estimate of overall complexity: high (single individual with complex posture, variable lighting and contrast, motile background elements), medium (multiple individuals, high turbidity and greater depth, visible tags), low (few individuals, good lighting, homogeneous background), varying (intentionally varied complexity). NA: not applicable
Dataset parameters and accuracy metrics
| Dataset | Annotations | Rate (Hz) | Resolution (px) | Coverage (%) | Metric | Reconstruction (cm) | Reprojection (px) | Tracking (cm) |
|---|---|---|---|---|---|---|---|---|
| single | 171 | 30 | 2.7k | 97.79 | median | 0.30 | 9.65 | NA |
| | | | | | RMSE | 1.28 | 16.30 | NA |
| single w/ sv | as above | as above | as above | 100.00 | | as above | as above | as above |
| mixed | 80 | 30 | 4k | 69.60 | median | 0.44 | 3.77 | NA |
| | | | | | RMSE | 1.09 | 7.77 | NA |
| school | 160 | 60 | 2.7k | 78.38 | median | 0.06 | 2.57 | NA |
| | | | | | RMSE | 0.30 | 3.78 | NA |
| school w/ sv | as above | as above | as above | 94.02 | | as above | as above | as above |
| accuracy | 73 | 30 | 4k | 80.64 ±16.73 | median | -0.14 ±0.06 | 3.53 ±1.96 | 0.14 ±0.33 |
| | | | | | RMSE | 1.34 ±0.79 | 8.56 ±5.21 | 1.09 ±0.47 |
| accuracy w/ sv | as above | as above | as above | 97.29 ±2.20 | median | as above | as above | 0.28 ±0.32 |
| | | | | | RMSE | as above | as above | 2.12 ±1.37 |
’w/ sv’ indicates that trajectory points were also estimated from single-view projections at an interpolated depth component. Annotations lists how many frames were annotated for training Mask R-CNN, Rate the frames per second of each video set, i.e. the temporal tracking resolution. Resolution is video resolution, 2.7k: 2704×1520 px, 4k: 3840×2160 px. Coverage is the mean coverage of all individual trajectories of a dataset. Reconstruction metrics refer to the deviation of reconstructed camera-to-camera distances from the actual distance, Reprojection metrics to the reprojection of triangulated 3D tracks to the original video pixel coordinates, and Tracking to the deviation of the tracked calibration wand length from its actual length. For the ’accuracy’ dataset, the accuracy results are listed as the mean and standard deviation of the four repeated trials. NA: not applicable
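The two summary statistics reported in the table (median and RMSE of per-frame deviations) are straightforward to compute. Below is a minimal sketch; the function name and the sample error values are illustrative, not taken from the paper's data.

```python
import numpy as np

def error_summary(errors):
    """Summarize per-frame deviations the way the table reports them:
    the median and the root-mean-square error (RMSE)."""
    errors = np.asarray(errors, dtype=float)
    return {
        "median": float(np.median(errors)),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
    }

# Hypothetical per-frame reprojection errors in pixels.
stats = error_summary([2.0, 3.5, 4.1, 2.8, 9.0])
```

Note that the RMSE weights large outliers more heavily than the median, which is why the two values can differ substantially (e.g. 9.65 px median vs. 16.30 px RMSE for the 'single' dataset).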
Fig. 1 Schematic workflow. Data processing starts with the acquisition of synchronized, multi-view videos, which serve as input to the SfM reconstruction pipeline to recover camera positions and movement. In addition, Mask R-CNN predictions, after training the detection model on a subset of images, result in segmented masks for each video frame, from which animal poses can be estimated. These serve as the locations of the animal trajectories in each view's pixel coordinate system. Subsequently, trajectories can be triangulated using known camera parameters and positions from the SfM pipeline, yielding 3D animal trajectories and poses. Integrating the environmental information from the scene reconstruction, these data can be used for in-depth downstream analyses
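The triangulation step of this workflow can be sketched with standard linear (DLT) triangulation, assuming 3×4 projection matrices per camera as recovered by the SfM stage. This is a generic sketch of the technique, not the authors' implementation; the synthetic cameras below are purely illustrative.

```python
import numpy as np

def triangulate(P_list, xy_list):
    """Linear (DLT) triangulation of one 3D point from >= 2 views.
    P_list: 3x4 camera projection matrices (e.g. from the SfM step),
    xy_list: the matching pixel coordinates (x, y) in each view."""
    rows = []
    for P, (x, y) in zip(P_list, xy_list):
        # Each observation contributes two linear constraints on X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    # The homogeneous solution is the right singular vector with
    # the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Two synthetic pinhole cameras observing the 3D point (1, 2, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([1.0, 2.0, 5.0, 1.0])
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
point = triangulate([P1, P2], [x1, x2])  # recovers approx. (1, 2, 5)
```

Applied per frame and per detected individual, this converts the 2D trajectories from all views into the 3D trajectories shown in Fig. 3.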
Fig. 2 Accuracy validation. Top down view of one of the ’accuracy’ dataset trials with the COLMAP dense reconstruction in the background (left). A calibration wand with a length of 0.5 m was moved through the environment to create two trajectories with known per-frame distances (visualized as lines at a frequency of 3 Hz; the full temporal resolution of the trajectories is 30 Hz). This allowed the calculation of relative tracking errors as the deviation of the triangulated calibration wand end-to-end distance from its known length of 0.5 m, resulting in the shown error distribution (normalized histogram with probability density function, right). The per-frame tracking error is visualized as line color
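The wand-based validation described above reduces to a simple per-frame computation once the two wand tips have been triangulated. A minimal sketch, assuming the two tip trajectories are available as (n_frames, 3) arrays (the function and array names here are hypothetical):

```python
import numpy as np

WAND_LENGTH = 0.5  # known calibration wand length in metres

def wand_tracking_errors(tips_a, tips_b, length=WAND_LENGTH):
    """Per-frame tracking error: triangulated end-to-end wand
    distance minus the known wand length (in metres).
    tips_a, tips_b: (n_frames, 3) arrays of the two wand tips."""
    dist = np.linalg.norm(np.asarray(tips_a) - np.asarray(tips_b), axis=1)
    return dist - length

# Hypothetical two-frame example: tips 0.5 m and 0.51 m apart,
# giving errors of 0 m and 0.01 m (1 cm).
errors = wand_tracking_errors([[0, 0, 0], [0, 0, 0]],
                              [[0.5, 0, 0], [0.51, 0, 0]])
```

Collecting these per-frame errors over a whole trial yields the error distribution shown in the histogram of Fig. 2.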
Fig. 3 3D environments and animal trajectories. a Top down view of the ’single’ dataset result. Red lines and dots show estimated spine poses and head positions of the tracked European conger (C. conger, visualized with one pose per second). The point cloud resulting from the COLMAP reconstruction is shown in the background. b Trajectories of M. surmuletus (orange) and D. vulgaris (purple/blue), and the dense point cloud resulting from the ’mixed’ dataset. Dots highlight three positions per second, lines visualize the trajectories at full temporal resolution (30 Hz) over a duration of seven minutes. c Reconstruction results and trajectories of the ’school’ dataset, visualizing the trajectories of a small school of L. callipterus in Lake Tanganyika. See Additional files 1, 2, 3 for high resolution images