Khurram Ashfaq Qazi, Tabassam Nawaz, Zahid Mehmood, Muhammad Rashid, Hafiz Adnan Habib.
Abstract
Recent research on speech segregation and music fingerprinting has improved speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation; however, music segregation becomes challenging in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled, stationary, noisy audio signals using a deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden-layer segregation model is applied to remove stationary noise, and a dictionary-based Fisher algorithm is employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate its robustness for speech segregation. The qualitative and quantitative analyses carried out on the three datasets demonstrate the efficiency of the proposed method compared with state-of-the-art speech segregation and classification-based methods.
Year: 2018 PMID: 29558485 PMCID: PMC5860734 DOI: 10.1371/journal.pone.0194151
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
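The paper's classifier is a deep belief network. As a rough illustration of the DBN idea (greedy unsupervised pretraining of stacked RBMs feeding a supervised output layer), here is a minimal sketch using scikit-learn's BernoulliRBM; the architecture, data, and hyperparameters are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.random((200, 64))              # placeholder feature frames in [0, 1]
y = rng.integers(0, 2, 200)            # toy labels: 0 = music, 1 = speech

# Two stacked RBMs (greedy layer-wise pretraining) feeding a supervised
# output layer, the classic DBN training recipe.
dbn = Pipeline([
    ('rbm1', BernoulliRBM(n_components=128, learning_rate=0.05, random_state=0)),
    ('rbm2', BernoulliRBM(n_components=64, learning_rate=0.05, random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])
dbn.fit(X, y)
print('train accuracy:', dbn.score(X, y))
```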
Fig 1. Time-frequency graph of the audio sample.
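A minimal sketch of producing a time-frequency representation like Fig 1, using SciPy's STFT-based spectrogram; the toy signal and analysis settings are assumptions, since the paper's exact parameters are not given here.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(fs) / fs                                         # 1 s time axis
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)    # toy tone + noise

f, frames, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
print(Sxx.shape)   # (frequency bins, time frames)
```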
Fig 2. Proposed architecture of speech/music segregation for an audio sample with background noise.
Fig 3. Layer-separation architecture.
Fig 4. Hamming window of the input noisy audio signal.
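A small sketch of the windowing step behind Fig 4: splitting the noisy input into overlapping frames and applying a Hamming window. The frame length and hop size are illustrative values, not the paper's.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split x into overlapping frames and apply a Hamming window."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))   # 1 s of noise at 16 kHz
print(frames.shape)                             # (n_frames, frame_len)
```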
Fig 5. Dictionary-based sparse coding.
Dictionary-based Fisher algorithm (Input: signal items, dictionary items).
Fig 6. Computation of segments for dictionary item creation.
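Figs 5 and 6 cover dictionary-based sparse coding and dictionary item creation. A minimal sketch of the generic step (learn a dictionary from training segments, then code new segments sparsely against it) using scikit-learn; this substitutes a standard dictionary learner for the paper's Fisher-based procedure, and all sizes are placeholders.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
segments = rng.standard_normal((200, 64))    # stand-in feature segments

# Learn 32 dictionary items, then code each segment with at most
# 5 non-zero coefficients via orthogonal matching pursuit.
dico = DictionaryLearning(n_components=32, transform_algorithm='omp',
                          transform_n_nonzero_coefs=5, random_state=0)
codes = dico.fit_transform(segments)         # sparse codes, shape (200, 32)
atoms = dico.components_                     # dictionary items, shape (32, 64)
print(codes.shape, atoms.shape)
```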
Comparison of the accuracy rates of different extracted features across classification algorithms.
| Features | SVM | K-NN | Naive Bayes | DBN (Dictionary-based Fisher) |
|---|---|---|---|---|
| MFCC (Proposed) | 88.10% | 85.80% | 86.20% | 91.60% |
| Bark Scale | 82.10% | 79.90% | 80.10% | 87.23% |
| GFCC | 84.10% | 81.20% | 82.00% | 83.20% |
| MRCG | 78.60% | 73.20% | 76.00% | 82.19% |
| STFT | 77.97% | 72.12% | 71.23% | 81.23% |
| Chromagram | 76.78% | 71.19% | 70.15% | 80.67% |
| Spectral Skewness | 75.45% | 70.89% | 69.67% | 79.65% |
| Spectral Kurtosis | 74.32% | 69.87% | 68.37% | 77.37% |
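For context on the table above, a minimal sketch of running the three baseline classifiers (SVM, k-NN, naive Bayes) with scikit-learn; the synthetic features stand in for the MFCC and other descriptors, and the DBN column has no direct one-line scikit-learn equivalent.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for frame-level audio features (e.g. 13 MFCCs).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

for name, clf in [('SVM', SVC()),
                  ('K-NN', KNeighborsClassifier()),
                  ('Naive Bayes', GaussianNB())]:
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f'{name}: {score:.3f}')
```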
Comparison of different speech separation models with respect to the methodology used.
| Algorithm/System | Methodology Used | Accuracy Rate (%) | Processing Time (s) |
|---|---|---|---|
| Proposed Model | Multi-layered separation with a deep recurrent neural network, MFCC features, and DBN-model classification | 91.60% | 1.4 |
| Panako | Local maxima are calculated using a constant-Q spectrogram; a set of hashes is generated for matching | 87.25% | 2.1 |
| Echoprint | 8 bins and sub-fingerprints generated using cosine band filtration | 85.90% | 2.4 |
| Landmark | 16 bins and sub-fingerprints generated using the STFT | 84.90% | 2.6 |
| Chromaprint | 12 hash bins and sub-fingerprints generated using the STFT | 82.35% | 2.7 |
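The compared systems are all landmark-style audio fingerprinters: pick spectral peaks, pair nearby peaks into compact hashes, and match hash sets. A generic sketch of that idea (not the actual code of Panako, Echoprint, Landmark, or Chromaprint; all thresholds and sizes are illustrative):

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def fingerprint(x, fs=16000):
    """Return a set of (f1, f2, dt) landmark hashes for signal x."""
    f, t, S = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
    # Local maxima above a crude energy threshold serve as landmarks.
    peaks = (S == maximum_filter(S, size=15)) & (S > S.mean())
    fi, ti = np.nonzero(peaks)
    order = np.argsort(ti)                 # walk the peaks in time order
    fi, ti = fi[order], ti[order]
    hashes = set()
    for i in range(len(fi)):
        for j in range(i + 1, min(i + 4, len(fi))):   # pair nearby peaks
            hashes.add((fi[i], fi[j], ti[j] - ti[i]))
    return hashes

print(len(fingerprint(np.random.randn(16000))))
```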
Performance comparison on the TIMIT and MusicBrainz datasets with respect to STOI and PESQ for the noisy signal and the proposed algorithm.
| Dataset | SNR (dB) | Noisy Original Signal: STOI | Noisy Original Signal: PESQ | Proposed Algorithm: STOI | Proposed Algorithm: PESQ |
|---|---|---|---|---|---|
| TIMIT | 3 | 0.802 | 1.395 | | |
| TIMIT | 0 | 0.743 | 1.259 | | |
| TIMIT | -3 | 0.678 | 1.124 | | |
| MusicBrainz | 3 | 0.752 | 1.415 | | |
| MusicBrainz | 0 | 0.857 | 1.359 | | |
| MusicBrainz | -3 | 0.669 | 1.224 | | |
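A minimal sketch of computing the STOI and PESQ scores reported above, assuming the third-party `pystoi` and `pesq` Python packages (our choice of tooling; the random signals are placeholders for real clean/processed pairs):

```python
import numpy as np
from pystoi import stoi
from pesq import pesq

fs = 16000
clean = np.random.randn(fs * 3)                    # placeholder "clean" signal
degraded = clean + 0.3 * np.random.randn(fs * 3)   # placeholder noisy version

print('STOI:', stoi(clean, degraded, fs))          # intelligibility, roughly 0..1
print('PESQ:', pesq(fs, clean, degraded, 'wb'))    # wideband quality, ~1..4.5
```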
Descriptive statistics for the paired samples test.
| Measure | T-Value | df | P-Value | Mean | Std. Deviation | Skewness (Statistic) | Skewness (Std. Error) | Kurtosis (Statistic) | Kurtosis (Std. Error) |
|---|---|---|---|---|---|---|---|---|---|
| Success ratio for sample test | 23.050 | 153 | .000 | .73 | .447 | -1.031 | .195 | -.950 | .389 |
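A minimal sketch of the paired-samples t-test summarized above, using `scipy.stats.ttest_rel`; the score vectors are placeholders, sized so that df = n - 1 = 153 matches the table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.random(154)                 # df = n - 1 = 153 implies n = 154
after = before + 0.1 * rng.random(154)   # placeholder "improved" scores

t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.3f}, df = {len(before) - 1}, p = {p_value:.4f}")
```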