Speech-dependent Data Augmentation for Own Voice Reconstruction with Hearable Microphones in Noisy Environments

Mattes Ohlenbusch, Christian Rollwage, Simon Doclo

Abstract

Hearable devices, equipped with one or more microphones, can be used to capture the user’s own voice for speech communication in noisy environments. In such environments, an own voice reconstruction (OVR) system is needed to enhance the quality and intelligibility of the recorded own voice. In this work, we aim to estimate clean broadband speech from a microphone at the outer face of the hearable and an in-ear microphone, which captures the own voice at a higher signal-to-noise ratio than the outer microphone, but with a limited bandwidth. Training a supervised deep learning-based OVR system requires a substantial number of own voice signals as training data. Such training data can be collected by recording many utterances from many different talkers using a specific device, which is costly, or by augmenting standard clean speech datasets. In this paper, we investigate several data augmentation techniques to simulate a large number of in-ear own voice signals from a limited number of recorded own voice signals. More specifically, we consider different models for the own voice transfer characteristics between the outer microphone and the in-ear microphone, ranging from a fixed relative transfer function to a phoneme-dependent individual model. Experimental results show that training with the proposed speech-dependent individual data augmentation technique, followed by fine-tuning with recorded signals, considerably improves own voice quality in terms of objective performance metrics, even when only a few recorded own voice signals are available.
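The simplest augmentation model named in the abstract, a fixed relative transfer function (RTF) between the outer and the in-ear microphone, can be illustrated with a short sketch. This is not the authors' implementation: the function names `simulate_in_ear` and `toy_lowpass_rtf`, the 2 kHz cutoff, the 65-tap filter length, and the 16 kHz sampling rate are all assumptions chosen for illustration, and the low-pass filter merely stands in for an RTF that would in practice be estimated from recorded own voice signals.

```python
import numpy as np


def simulate_in_ear(outer_speech, rtf_fir):
    """Filter an outer-microphone signal with a fixed RTF given as FIR taps.

    Returns a simulated in-ear signal with the same length as the input.
    """
    # Linear convolution, truncated to the input length.
    return np.convolve(outer_speech, rtf_fir)[: len(outer_speech)]


def toy_lowpass_rtf(num_taps=65, cutoff_hz=2000.0, fs=16000):
    """Hypothetical stand-in for a measured RTF: a windowed-sinc low-pass.

    It only mimics the limited bandwidth of the in-ear microphone.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff_hz / fs * np.sinc(2 * cutoff_hz / fs * n)
    h *= np.hamming(num_taps)
    return h / h.sum()  # normalize to unit gain at DC


fs = 16000  # assumed sampling rate in Hz
rng = np.random.default_rng(0)
outer = rng.standard_normal(fs)  # white-noise stand-in for 1 s of clean speech
in_ear = simulate_in_ear(outer, toy_lowpass_rtf(fs=fs))
```

Under the same assumptions, a speech-dependent variant would select different filter taps per frame according to a phoneme label instead of applying one filter to the whole utterance.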

clean outer microphone	noisy outer microphone	noisy in-ear microphone
only recorded	only augmented (speech-dependent individual)	augmented+finetune full (speech-dependent individual)
only recorded (3 talkers)	only augmented (3 talkers) (speech-dependent individual)	augmented+finetune (3 talkers) (speech-dependent individual)
only recorded (25 utterances)	only augmented (25 utterances) (speech-dependent individual)	augmented+finetune (25 utterances) (speech-dependent individual)

clean outer microphone	noisy outer microphone	noisy in-ear microphone
only recorded	only augmented (speech-dependent individual)	augmented+finetune full (speech-dependent individual)
only recorded (3 talkers)	only augmented (3 talkers) (speech-dependent individual)	augmented+finetune (3 talkers) (speech-dependent individual)
only recorded (25 utterances)	only augmented (25 utterances) (speech-dependent individual)	augmented+finetune (25 utterances) (speech-dependent individual)

Links

Results

Audio Examples

Diffuse Factory Noise, 5 dB SNR

Metal Grinder Noise, 0 dB SNR