Martino Trapanotto, Loris Nanni, Sheryl Brahnam, Xiang Guo.
Abstract
The classification of vocal individuality for passive acoustic monitoring (PAM) and censusing of animals is an increasingly popular area of research. Nearly all studies in this field have relied on classic audio representations and classifiers, such as Support Vector Machines (SVMs) trained on spectrograms or Mel-Frequency Cepstral Coefficients (MFCCs). In contrast, most current bioacoustic species classification exploits the power of deep learners and more cutting-edge audio representations. A significant reason deep learning has been avoided in vocal identity classification is the tiny sample size of the available collections of labeled individual vocalizations; as is well known, deep learners require large datasets to avoid overfitting. One way to handle small datasets with deep learning methods is transfer learning. In this work, we evaluate the performance of three pretrained CNNs (VGG16, ResNet50, and AlexNet) on a small, publicly available lion roar dataset containing approximately 150 samples taken from five male lions. Each of these networks is retrained on eight representations of the samples: MFCCs, the spectrogram, and the Mel spectrogram, along with several newer ones, such as VGGish and the Stockwell transform, and those based on the recently proposed LM spectrogram. The performance of these networks, both individually and in ensembles, is analyzed, corroborated using the Equal Error Rate (EER), and shown to surpass previous classification attempts on this dataset; the best single network achieved over 95% accuracy and the best ensembles over 98% accuracy. This study contributes to the field of individual vocal classification by demonstrating that it is valuable and possible, with caution, to use transfer learning with single pretrained CNNs on the small datasets available in this problem domain.
We also make a contribution to bioacoustics generally by offering a comparison of the performance of many state-of-the-art audio representations, including, for the first time, the LM spectrogram and Stockwell representations. All source code for this study is available on GitHub.
Keywords: African lions; convolutional neural networks; transfer learning; vocal individuality
Year: 2022 PMID: 35448223 PMCID: PMC9029749 DOI: 10.3390/jimaging8040096
Source DB: PubMed Journal: J Imaging ISSN: 2313-433X
Figure 1. Overview of the proposed system. Each of the three pretrained CNNs is trained on each audio representation; some of the resulting networks are combined in ensembles (of two or three) by the Sum Rule.
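The Sum Rule fusion named in Figure 1 can be sketched as follows: each network produces one score per class (per lion), the per-class scores are summed across networks, and the class with the largest total wins. This is a minimal illustration with made-up toy scores, not values from the paper.

```python
def sum_rule(score_lists):
    """Sum Rule fusion: add the per-class scores from each network
    in the ensemble and predict the class with the largest total.
    score_lists: one score vector (one float per class) per network."""
    n_classes = len(score_lists[0])
    totals = [sum(scores[c] for scores in score_lists)
              for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)

# Toy example: three networks score the same roar over five lions.
net1 = [0.10, 0.60, 0.10, 0.10, 0.10]
net2 = [0.05, 0.25, 0.50, 0.10, 0.10]
net3 = [0.05, 0.55, 0.20, 0.10, 0.10]
pred = sum_rule([net1, net2, net3])  # lion 1 wins on summed scores
```

Note that summing raw softmax scores (rather than majority-voting the hard decisions) lets a confident network outvote two lukewarm ones, which is why score-level fusion is the usual choice for small ensembles.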
Figure 2. Illustration of the eight audio representations used in this work, each generated from our dataset’s first sample.
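Several of the representations in Figure 2 start from a short-time magnitude spectrogram: the waveform is cut into overlapping windowed frames and each frame is transformed to the frequency domain. A minimal pure-Python sketch follows (naive DFT with a Hann window; a real pipeline would use an FFT routine from NumPy or SciPy, and the frame length, hop, and test tone here are illustrative choices, not the paper's settings).

```python
import math
import cmath

def stft_magnitude(signal, frame_len=64, hop=32):
    """Short-time magnitude spectrogram via a naive DFT.
    Returns a list of frames; each frame holds the magnitudes of the
    non-negative frequency bins (frame_len // 2 + 1 values)."""
    # Hann window tapers frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    # A dB-scaled variant (cf. the "dB" entries in the tables) would
    # apply 20 * log10(magnitude + eps) to each value.
    return frames

# Toy test tone: exactly 8 cycles per 64-sample frame -> energy in bin 8.
sig = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
S = stft_magnitude(sig)
```

Mel spectrograms and MFCCs then follow by pooling these linear-frequency bins through a Mel filterbank (and, for MFCCs, applying a log and a discrete cosine transform).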
Figure 3. Stockwell images of two different lions (each column contains three images from the same lion).
Performance Accuracy of the Ten Best Single Networks on the Day and Bout Datasets.
| Network and Feature | Day | Bout |
|---|---|---|
| VGG16 LM S | 95.67% | 98.9% |
| VGG16 dB LM S | 94.50% | 100% |
| AlexNet custom LM S | 93.66% | 96.8% |
| ResNet50 S | 91.97% | 97.6% |
| VGG16 S | 91.10% | 97.2% |
| ResNet50 min-max S | 90.59% | 95.9% |
| AlexNet box_n Mel S | 90.17% | 96.3% |
| VGG16 MFCC | 90.10% | 94.8% |
| VGG16 L2M S | 89.78% | 94.6% |
| VGG16 min-max S | 89.69% | 97.6% |
Performance Accuracy of the 2-Ensembles on the Day and Bout Datasets (sorted by performance).
| Network 1 | Network 2 | Day | Bout |
|---|---|---|---|
| VGG16 MFCC | VGG16 LM S | 97.64% | 99.3% |
| ResNet50 S | AlexNet LM S | 97.61% | 98.7% |
| ResNet50 S | VGG16 LM S | 97.61% | 99.4% |
| ResNet50 S | VGG16 dB LM S | 97.45% | 99.4% |
| ResNet50 min–max S | VGG16 LM S | 97.31% | 100% |
| AlexNet min–max scaled S | VGG16 LM S | 97.06% | 100% |
| ResNet50 min–max scaled S | VGG16 LM S | 96.90% | 99.1% |
| ResNet50 Mel S | VGG16 LM S | 96.78% | 99.5% |
| AlexNet dB LM S | VGG16 LM S | 96.78% | 99.5% |
| VGG16 min-max scaled S | VGG16 LM S | 96.68% | 97.6% |
Performance Accuracy of the 3-Ensembles on the Day and Bout Datasets (sorted by performance).
| Network 1 | Network 2 | Network 3 | Day | Bout |
|---|---|---|---|---|
| VGG16 min–max S | AlexNet LM S | VGG16 LM S | 98.67% | 100% |
| ResNet50 S | ResNet50 Mel S | VGG16 dB LM S | 98.42% | 100% |
| ResNet50 min–max S | AlexNet LM S | VGG16 dB LM S | 98.42% | 100% |
| ResNet50 S | VGG16 MFCC | VGG16 LM S | 98.42% | 100% |
| ResNet50 S | VGG16 dB LM S | VGG16 LM S | 98.19% | 100% |
| AlexNet min–max Mel S | VGG16 dB LM S | VGG16 LM S | 98.17% | 100% |
| VGG16 dB LM S | AlexNet box_n Mel S | VGG16 LM S | 98.17% | 100% |
| ResNet50 S | AlexNet VGGish | VGG16 dB LM S | 98.14% | 100% |
| ResNet50 S | AlexNet LM S | VGG16 LM S | 98.14% | 100% |
| ResNet50 min–max S | AlexNet dB LM S | VGG16 LM S | 98.14% | 100% |
Equal Error Rate (EER) of the Best Performing Single Networks on the Day and Bout Datasets (sorted by performance).
| Network and Feature | Day | Bout |
|---|---|---|
| VGG16 LM S | 5.62 | 0.53 |
| VGG16 dB LM S | 3.67 | 0.69 |
| AlexNet LM S | 4.87 | 2.38 |
| ResNet50 S | 6.97 | 1.92 |
| VGG16 S | 7.27 | 2.38 |
| ResNet50 min–max S | 7.87 | 1.84 |
| AlexNet box_n Mel S | 3.82 | 2.99 |
| VGG16 MFCC | 11.6 | 6.07 |
| VGG16 L2M Mel S | 9.82 | 6.07 |
| VGG16 min-max S | 7.49 | 2.45 |
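The EER values reported in these tables can be computed by sweeping a decision threshold over the verification scores and finding the point where the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). A minimal sketch with toy scores (the exact scoring protocol is not reproduced here):

```python
def equal_error_rate(genuine, impostor):
    """Approximate the Equal Error Rate: sweep a threshold over all
    observed scores and return the operating point where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2

# Toy match scores: genuine pairs should score high, impostors low.
genuine = [0.9, 0.8, 0.85, 0.6, 0.95]
impostor = [0.1, 0.3, 0.2, 0.7, 0.15]
eer = equal_error_rate(genuine, impostor)  # 0.2, i.e., 20%
```

With finite score sets the FAR and FRR curves rarely cross exactly, so averaging the two rates at the closest point is a common discrete approximation.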
Equal Error Rate (EER) of the Best Performing 2-Ensembles on the Day and Bout Datasets (sorted by performance).
| Network 1 | Network 2 | Day | Bout |
|---|---|---|---|
| VGG16 MFCC | VGG16 LM S | 6.82 | 1.53 |
| ResNet50 S | AlexNet LM S | 2.69 | 0.53 |
| ResNet50 S | VGG16 LM S | 5.02 | 0.53 |
| ResNet50 S | VGG16 dB LM S | 4.34 | 0 |
| ResNet50 min–max S | VGG16 LM S | 2.47 | 0.53 |
| AlexNet min–max S | VGG16 LM S | 5.47 | 0.53 |
| ResNet50 min–max S | VGG16 LM S | 4.20 | 0.53 |
| ResNet50 Mel S | VGG16 LM S | 4.49 | 0.53 |
| AlexNet dB LM S | VGG16 LM S | 6.07 | 1.15 |
| VGG16 min–max S | VGG16 LM S | 6.75 | 0.69 |
Equal Error Rate (EER) of the Best Performing 3-Ensembles on the Day and Bout Datasets (sorted by performance).
| Network 1 | Network 2 | Network 3 | Day | Bout |
|---|---|---|---|---|
| VGG16 min–max S | AlexNet custom LM S | VGG16 LM S | 3.00 | 0.53 |
| ResNet50 S | ResNet50 Mel S | VGG16 dB LM S | 5.09 | 0 |
| ResNet50 min–max S | AlexNet LM S | VGG16 LM S | 3.00 | 0.53 |
| ResNet50 S | VGG16 MFCC | VGG16 LM S | 5.47 | 0.61 |
| ResNet50 S | VGG16 dB LM S | VGG16 LM S | 3.89 | 0.07 |
| AlexNet min–max Mel S | VGG16 dB LM S | VGG16 LM S | 3.60 | 0.53 |
| VGG16 dB LM S | AlexNet box_n Mel S | VGG16 LM S | 2.40 | 0.15 |
| ResNet50 S | AlexNet VGGish | VGG16 dB LM S | 6.22 | 0.07 |
| ResNet50 S | AlexNet LM S | VGG16 LM S | 2.47 | 0.23 |
| ResNet50 min–max S | AlexNet dB LM S | VGG16 LM S | 3.29 | 0.53 |
Performance accuracy of other CNN topologies on the Bout dataset.
| Network and Feature | Bout |
|---|---|
| VGG19 LM S | 98.9% |
| ResNet101 LM S | 97.6% |
| MobileNetV2 LM S | 96.3% |
Classification time (seconds) for a batch of 100 spectrograms.
| Network | Classification Time (s) |
|---|---|
| AlexNet | 0.148 |
| ResNet50 | 0.299 |
| VGG16 | 0.688 |
Computation times (seconds) for representing an audio file as an image.
| Representation | Computation Time (s) |
|---|---|
| Spectrograms | 0.015 |
| MFCC | 0.009 |
| Stockwell | 0.340 |
| VGGish | 0.015 |
| Mel spectrogram | 0.055 |