
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation.

Yi Luo, Nima Mesgarani.   

Abstract

Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency of the entire system. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a much shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
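The encoder-mask-decoder pipeline described in the abstract can be sketched in a few lines of numpy. This is a minimal illustration of the data flow only: the encoder and decoder bases are random (untrained), and the paper's deep TCN separator is replaced by a single random linear layer with a softmax, so the outputs show the plumbing, not actual separation. All function and parameter names below are hypothetical.

```python
import numpy as np

def conv_tasnet_sketch(mixture, n_filters=16, win=8, n_speakers=2, seed=0):
    """Toy encoder -> masks -> decoder pipeline in the style of Conv-TasNet."""
    rng = np.random.default_rng(seed)
    hop = win // 2  # 50% overlap between frames

    # 1. Segment the waveform into overlapping frames of length `win`.
    n_frames = (len(mixture) - win) // hop + 1
    frames = np.stack([mixture[i * hop : i * hop + win]
                       for i in range(n_frames)])          # (T, win)

    # 2. Linear encoder: a learned basis (random stand-in) maps each frame
    #    to a nonnegative representation (ReLU, as in the paper).
    U = rng.standard_normal((win, n_filters))
    w = np.maximum(frames @ U, 0.0)                        # (T, N)

    # 3. Separator: the paper uses a TCN of stacked dilated 1-D conv blocks;
    #    here a random linear layer + softmax stands in, so the estimated
    #    masks are nonnegative and sum to one across speakers.
    M = rng.standard_normal((n_filters, n_speakers * n_filters))
    logits = (w @ M).reshape(n_frames, n_speakers, n_filters)
    masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # 4. Apply each mask to the encoder output and invert with a linear
    #    decoder, reconstructing waveforms by overlap-add.
    V = rng.standard_normal((n_filters, win))
    out = np.zeros((n_speakers, len(mixture)))
    for c in range(n_speakers):
        rec = (masks[:, c, :] * w) @ V                     # (T, win)
        for i in range(n_frames):
            out[c, i * hop : i * hop + win] += rec[i]
    return out

# Toy two-component mixture; outputs are arbitrary since nothing is trained.
mix = np.sin(0.1 * np.arange(64)) + np.sin(0.37 * np.arange(64))
est = conv_tasnet_sketch(mix)
print(est.shape)  # (2, 64): one estimated waveform per speaker
```

In the actual system, the encoder/decoder bases and the TCN mask network are all learned jointly end-to-end on a time-domain objective, which is what lets the representation out-perform fixed time-frequency (STFT) front-ends.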


Keywords:  Source separation; deep learning; real-time; single-channel; time-domain

Year:  2019        PMID: 31485462      PMCID: PMC6726126          DOI: 10.1109/TASLP.2019.2915167

Source DB:  PubMed          Journal:  IEEE/ACM Trans Audio Speech Lang Process


References: 10 in total

1.  Blind source separation by sparse decomposition in a signal dictionary.

Authors:  M Zibulevsky; B A Pearlmutter
Journal:  Neural Comput       Date:  2001-04       Impact factor: 2.026

2.  Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers.

Authors:  Christopher J Darwin; Douglas S Brungart; Brian D Simpson
Journal:  J Acoust Soc Am       Date:  2003-11       Impact factor: 1.840

3.  Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization.

Authors:  Fa-Yu Wang; Chong-Yung Chi; Tsung-Han Chan; Yue Wang
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-05       Impact factor: 6.226

4.  Supervised Speech Separation Based on Deep Learning: An Overview.

Authors:  DeLiang Wang; Jitong Chen
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2018-05-30

5.  Convex and semi-nonnegative matrix factorizations.

Authors:  Chris Ding; Tao Li; Michael I Jordan
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-01       Impact factor: 6.226

6.  Tonotopic organization of the auditory cortex: pitch versus frequency representation.

Authors:  C Pantev; M Hoke; B Lütkenhöner; K Lehnertz
Journal:  Science       Date:  1989-10-27       Impact factor: 47.728

7.  Deep Attractor Network for Single-Microphone Speaker Separation.

Authors:  Zhuo Chen; Yi Luo; Nima Mesgarani
Journal:  Proc IEEE Int Conf Acoust Speech Signal Process       Date:  2017-06-19

8.  Tonotopic organization of the human auditory cortex.

Authors:  G L Romani; S J Williamson; L Kaufman
Journal:  Science       Date:  1982-06-18       Impact factor: 47.728

9.  On Training Targets for Supervised Speech Separation.

Authors:  Yuxuan Wang; Arun Narayanan; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2014-12

10.  Deep Clustering and Conventional Networks for Music Separation: Stronger Together.

Authors:  Yi Luo; Zhuo Chen; John R Hershey; Jonathan Le Roux; Nima Mesgarani
Journal:  Proc IEEE Int Conf Acoust Speech Signal Process       Date:  2017-06-19
Cited by: 17 in total

1.  A two-stage deep learning algorithm for talker-independent speaker separation in reverberant conditions.

Authors:  Masood Delfarah; Yuzhou Liu; DeLiang Wang
Journal:  J Acoust Soc Am       Date:  2020-09       Impact factor: 1.840

2.  Hierarchical Encoding of Attended Auditory Objects in Multi-talker Speech Perception.

Authors:  James O'Sullivan; Jose Herrero; Elliot Smith; Catherine Schevon; Guy M McKhann; Sameer A Sheth; Ashesh D Mehta; Nima Mesgarani
Journal:  Neuron       Date:  2019-10-21       Impact factor: 17.173

3.  Monaural Speech Dereverberation Using Temporal Convolutional Networks with Self Attention.

Authors:  Yan Zhao; DeLiang Wang; Buye Xu; Tao Zhang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2020-05-18

4.  Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation.

Authors:  Yuzhou Liu; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2019-09-12

5.  Causal Deep CASA for Monaural Talker-Independent Speaker Separation.

Authors:  Yuzhou Liu; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2020-07-08

6.  Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation.

Authors:  Zhong-Qiu Wang; Peidong Wang; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2021-05-26

7.  Dense CNN with Self-Attention for Time-Domain Speech Enhancement.

Authors:  Ashutosh Pandey; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2021-03-08

8.  Estimating and interpreting nonlinear receptive field of sensory neural responses with deep neural network models.

Authors:  Menoua Keshishian; Hassan Akbari; Bahar Khalighinejad; Jose L Herrero; Ashesh D Mehta; Nima Mesgarani
Journal:  Elife       Date:  2020-06-26       Impact factor: 8.140

9.  Deep Layer Kernel Sparse Representation Network for the Detection of Heart Valve Ailments from the Time-Frequency Representation of PCG Recordings.

Authors:  Samit Kumar Ghosh; R N Ponnalagu; R K Tripathy; U Rajendra Acharya
Journal:  Biomed Res Int       Date:  2020-12-21       Impact factor: 3.411

10.  Towards Model Compression for Deep Learning Based Speech Enhancement.

Authors:  Ke Tan; DeLiang Wang
Journal:  IEEE/ACM Trans Audio Speech Lang Process       Date:  2021-05-21
