Monaural Target Speaker Extraction via Distance and Speaker Information

Accepted by INTERSPEECH 2023.

Abstract

Target Speaker Extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor, which leverages distance information for TSE to reliably extract speaker information without requiring prior speaker enrollment, an approach we call speaker embedding self-enrollment (SESE). Full- and sub-band modeling is introduced to improve the NS-Extractor's adaptability to environments with significant reverberation. Experimental results from several cross-dataset evaluations demonstrate the effectiveness of our improvements and the strong performance of the proposed NS-Extractor in different application scenarios.

Comparison with baseline models

Case 1: 1 near speaker & 1 far speaker.

Mixture: SI-SDR = -4.94 dB
GT
LSTM: SI-SDR = -3.8 dB
UNet: SI-SDR = 4.62 dB
Proposed: SI-SDR = 10.17 dB

Case 2: 1 near speaker & 2 far speakers.

Mixture: SI-SDR = -1.21 dB
GT
LSTM: SI-SDR = 9.86 dB
UNet: SI-SDR = 10.11 dB
Proposed: SI-SDR = 13.12 dB

Case 3: 1 near speaker & 3 far speakers.

Mixture: SI-SDR = -8.46 dB
GT
LSTM: SI-SDR = -0.05 dB
UNet: SI-SDR = -1.86 dB
Proposed: SI-SDR = 3.94 dB
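
The SI-SDR (scale-invariant signal-to-distortion ratio) values listed for each sample compare an output signal with the GT signal. The following is a minimal sketch of the standard SI-SDR definition, not the authors' evaluation script. The function name `si_sdr`, the `eps` stabilizer, and the mean removal step are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; mean removal is a common
    convention and may differ from the evaluation used for this page.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove any DC offset so it does not count as signal or distortion.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference

    # Everything in the estimate that is not the scaled target is distortion.
    distortion = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled to best match the estimate, multiplying the estimate by any nonzero constant leaves the score unchanged. A negative value, as for several of the mixtures above, means the distortion carries more energy than the target component.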

Ablation Study 1

Case 1: 1 near speaker & 1 far speaker

Mixture: SI-SDR = -4.84 dB
GT
Proposed: SI-SDR = 15.73 dB
w/o F-Att: SI-SDR = 14.84 dB
w/o T-Att: SI-SDR = 12.71 dB
w/o SE: SI-SDR = 15.73 dB

Case 2: 1 near speaker & 2 far speakers.

Mixture: SI-SDR = -4.27 dB
GT
Proposed: SI-SDR = 7.13 dB
w/o F-Att: SI-SDR = 4.19 dB
w/o T-Att: SI-SDR = 3.67 dB
w/o SE: SI-SDR = -2.25 dB

Ablation Study 2

Case 1: normal RIR & unintruded speech

Mixture: SI-SDR = -4.53 dB
GT
w/o SE: SI-SDR = 9.44 dB
Proposed: SI-SDR = 10.60 dB

Case 2: normal RIR & intruded speech

Mixture: SI-SDR = -14.88 dB
GT
w/o SE: SI-SDR = -22.15 dB
Proposed: SI-SDR = 5.85 dB

Case 3: faint RIR & unintruded speech

Mixture: SI-SDR = 2.27 dB
GT
w/o SE: SI-SDR = 14.22 dB
Proposed: SI-SDR = 16.26 dB

Case 4: faint RIR & intruded speech

Mixture: SI-SDR = -0.53 dB
GT
w/o SE: SI-SDR = 0.34 dB
Proposed: SI-SDR = 11.63 dB