
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation.

Yi Luo, Nima Mesgarani.   

Abstract

Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency of the entire system. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a much shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
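The encoder-mask-decoder pipeline described in the abstract can be sketched in a few lines of numpy. This is a minimal illustration of the data flow only: the encoder and decoder bases are random (untrained), and the paper's deep TCN separator is replaced by a single random linear layer with a softmax, so the outputs show the plumbing, not actual separation. All function and parameter names below are hypothetical.

```python
import numpy as np

def conv_tasnet_sketch(mixture, n_filters=16, win=8, n_speakers=2, seed=0):
    """Toy encoder -> masks -> decoder pipeline in the style of Conv-TasNet."""
    rng = np.random.default_rng(seed)
    hop = win // 2  # 50% overlap between frames

    # 1. Segment the waveform into overlapping frames of length `win`.
    n_frames = (len(mixture) - win) // hop + 1
    frames = np.stack([mixture[i * hop : i * hop + win]
                       for i in range(n_frames)])          # (T, win)

    # 2. Linear encoder: a learned basis (random stand-in) maps each frame
    #    to a nonnegative representation (ReLU, as in the paper).
    U = rng.standard_normal((win, n_filters))
    w = np.maximum(frames @ U, 0.0)                        # (T, N)

    # 3. Separator: the paper uses a TCN of stacked dilated 1-D conv blocks;
    #    here a random linear layer + softmax stands in, so the estimated
    #    masks are nonnegative and sum to one across speakers.
    M = rng.standard_normal((n_filters, n_speakers * n_filters))
    logits = (w @ M).reshape(n_frames, n_speakers, n_filters)
    masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # 4. Apply each mask to the encoder output and invert with a linear
    #    decoder, reconstructing waveforms by overlap-add.
    V = rng.standard_normal((n_filters, win))
    out = np.zeros((n_speakers, len(mixture)))
    for c in range(n_speakers):
        rec = (masks[:, c, :] * w) @ V                     # (T, win)
        for i in range(n_frames):
            out[c, i * hop : i * hop + win] += rec[i]
    return out

# Toy two-component mixture; outputs are arbitrary since nothing is trained.
mix = np.sin(0.1 * np.arange(64)) + np.sin(0.37 * np.arange(64))
est = conv_tasnet_sketch(mix)
print(est.shape)  # (2, 64): one estimated waveform per speaker
```

In the actual system, the encoder/decoder bases and the TCN mask network are all learned jointly end-to-end on a time-domain objective, which is what lets the representation out-perform fixed time-frequency (STFT) front-ends.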


Keywords:  Source separation; deep learning; real-time; single-channel; time-domain

Year:  2019        PMID: 31485462      PMCID: PMC6726126          DOI: 10.1109/TASLP.2019.2915167

Source DB:  PubMed          Journal:  IEEE/ACM Trans Audio Speech Lang Process


References: 10 in total

1.  Blind source separation by sparse decomposition in a signal dictionary.

Authors:  M Zibulevsky; B A Pearlmutter
Journal:  Neural Comput       Date:  2001-04       Impact factor: 2.026

2.  Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers.

Authors:  Christopher J Darwin; Douglas S Brungart; Brian D Simpson
Journal:  J Acoust Soc Am       Date:  2003-11       Impact factor: 1.840

3.  Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization.

Authors:  Fa-Yu Wang; Chong-Yung Chi; Tsung-Han Chan; Yue Wang
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-05       Impact factor: 6.226

4.  Supervised Speech Separation Based on Deep Learning: An Overview.

Authors:  DeLiang Wang; Jitong Chen
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2018-05-30

5.  Convex and semi-nonnegative matrix factorizations.

Authors:  Chris Ding; Tao Li; Michael I Jordan
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-01       Impact factor: 6.226

6.  Tonotopic organization of the auditory cortex: pitch versus frequency representation.

Authors:  C Pantev; M Hoke; B Lütkenhöner; K Lehnertz
Journal:  Science       Date:  1989-10-27       Impact factor: 47.728

7.  Deep Attractor Network for Single-Microphone Speaker Separation.

Authors:  Zhuo Chen; Yi Luo; Nima Mesgarani
Journal:  Proc IEEE Int Conf Acoust Speech Signal Process       Date:  2017-06-19

8.  Tonotopic organization of the human auditory cortex.

Authors:  G L Romani; S J Williamson; L Kaufman
Journal:  Science       Date:  1982-06-18       Impact factor: 47.728

9.  On Training Targets for Supervised Speech Separation.

Authors:  Yuxuan Wang; Arun Narayanan; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2014-12

10.  Deep Clustering and Conventional Networks for Music Separation: Stronger Together.

Authors:  Yi Luo; Zhuo Chen; John R Hershey; Jonathan Le Roux; Nima Mesgarani
Journal:  Proc IEEE Int Conf Acoust Speech Signal Process       Date:  2017-06-19
Cited by: 17 in total

1.  A two-stage deep learning algorithm for talker-independent speaker separation in reverberant conditions.

Authors:  Masood Delfarah; Yuzhou Liu; DeLiang Wang
Journal:  J Acoust Soc Am       Date:  2020-09       Impact factor: 1.840

2.  Hierarchical Encoding of Attended Auditory Objects in Multi-talker Speech Perception.

Authors:  James O'Sullivan; Jose Herrero; Elliot Smith; Catherine Schevon; Guy M McKhann; Sameer A Sheth; Ashesh D Mehta; Nima Mesgarani
Journal:  Neuron       Date:  2019-10-21       Impact factor: 17.173

3.  Monaural Speech Dereverberation Using Temporal Convolutional Networks with Self Attention.

Authors:  Yan Zhao; DeLiang Wang; Buye Xu; Tao Zhang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2020-05-18

4.  Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation.

Authors:  Yuzhou Liu; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2019-09-12

5.  Causal Deep CASA for Monaural Talker-Independent Speaker Separation.

Authors:  Yuzhou Liu; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2020-07-08

6.  Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation.

Authors:  Zhong-Qiu Wang; Peidong Wang; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2021-05-26

7.  Dense CNN with Self-Attention for Time-Domain Speech Enhancement.

Authors:  Ashutosh Pandey; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2021-03-08

8.  Estimating and interpreting nonlinear receptive field of sensory neural responses with deep neural network models.

Authors:  Menoua Keshishian; Hassan Akbari; Bahar Khalighinejad; Jose L Herrero; Ashesh D Mehta; Nima Mesgarani
Journal:  Elife       Date:  2020-06-26       Impact factor: 8.140

9.  Deep Layer Kernel Sparse Representation Network for the Detection of Heart Valve Ailments from the Time-Frequency Representation of PCG Recordings.

Authors:  Samit Kumar Ghosh; R N Ponnalagu; R K Tripathy; U Rajendra Acharya
Journal:  Biomed Res Int       Date:  2020-12-21       Impact factor: 3.411

10.  Towards Model Compression for Deep Learning Based Speech Enhancement.

Authors:  Ke Tan; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2021-05-21
