Speech-dependent Data Augmentation for Own Voice Reconstruction with Hearable Microphones in Noisy Environments

Mattes Ohlenbusch, Christian Rollwage, Simon Doclo

Abstract

Hearable devices, equipped with one or more microphones, can be used to capture the user’s own voice for speech communication in noisy environments. In such environments, an own voice reconstruction (OVR) system is needed to enhance the quality and intelligibility of the recorded own voice. In this work, we aim to estimate clean broadband speech from a microphone at the outer face of the hearable and an in-ear microphone, which captures the own voice at a higher signal-to-noise ratio than the outer microphone, but with a limited bandwidth. Training a supervised deep learning-based own voice reconstruction system requires a substantial amount of own voice signals as training data. Such training data can be collected by recording many utterances from many different talkers using a specific device, which is costly, or by augmenting standard clean speech datasets. In this paper, we investigate several data augmentation techniques to simulate a large amount of in-ear own voice signals from a limited amount of recorded own voice signals. More specifically, we consider different models for the own voice transfer characteristics between the outer microphone and the in-ear microphone, ranging from a fixed relative transfer function to a phoneme-dependent individual model. Experimental results show that training using the proposed speech-dependent individual data augmentation technique and additional fine-tuning with recorded signals considerably improves own voice quality in terms of objective performance metrics, even when only a few recorded own voice signals are available.
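To make the simplest model mentioned in the abstract more concrete, the following is a minimal sketch of a fixed relative transfer function (RTF) between the outer and in-ear microphone, estimated per frequency bin from one recorded signal pair and then applied to other speech. It is not the implementation used in the paper: the function names (`stft`, `estimate_rtf`, `simulate_in_ear`), the least-squares estimator, and the STFT parameters are all assumptions chosen for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; returns an array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def estimate_rtf(outer, in_ear, n_fft=512, hop=256, eps=1e-12):
    """Least-squares estimate of one fixed RTF (outer -> in-ear) per frequency bin.

    Assumption: both signals are time-aligned recordings of the same own voice.
    """
    Y_outer = stft(outer, n_fft, hop)
    Y_in_ear = stft(in_ear, n_fft, hop)
    cross_power = np.sum(Y_in_ear * np.conj(Y_outer), axis=0)
    auto_power = np.sum(np.abs(Y_outer) ** 2, axis=0)
    return cross_power / (auto_power + eps)

def simulate_in_ear(speech, rtf, n_fft=512, hop=256):
    """Apply the RTF bin-wise to speech; returns the simulated in-ear STFT."""
    return stft(speech, n_fft, hop) * rtf[None, :]

# Toy usage: a pure gain of 0.5 stands in for the real transfer characteristics.
rng = np.random.default_rng(0)
outer = rng.standard_normal(16000)
in_ear = 0.5 * outer
rtf = estimate_rtf(outer, in_ear)
simulated = simulate_in_ear(outer, rtf)
```

Under the same assumptions, a speech-dependent variant could estimate one such RTF vector per phoneme class, using only the frames labeled with that phoneme, and select the vector frame by frame during simulation.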

Links

arXiv preprint: https://arxiv.org/abs/2405.11592

Dataset of German own voice recordings: https://doi.org/10.5281/zenodo.10844599

Transfer function measurements for simulating environmental noise at hearable microphones: https://doi.org/10.5281/zenodo.11196867
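As a rough illustration of how such measurements might be used, the sketch below filters a noise source with an impulse response and scales it to a target signal-to-noise ratio before adding it to speech. The function names (`spatialize_noise`, `mix_at_snr`), the toy impulse response, and the choice to define the SNR on the full-length signals at a single microphone are assumptions for this example, not details taken from the paper or the dataset.

```python
import numpy as np

def spatialize_noise(noise_source, impulse_response):
    """Filter a noise source with an impulse response (source -> microphone)."""
    return np.convolve(noise_source, impulse_response)[:len(noise_source)]

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that the speech-to-noise power ratio equals snr_db.

    Assumption: noise is at least as long as speech.
    Returns the noisy mixture and the gain applied to the noise.
    """
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise, gain

# Toy usage with random signals and a hypothetical 3-tap impulse response.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise_source = rng.standard_normal(16000)
impulse_response = np.array([1.0, 0.5, 0.25])
noise_at_mic = spatialize_noise(noise_source, impulse_response)
noisy, gain = mix_at_snr(speech, noise_at_mic, snr_db=5.0)
```

Computing the gain from the filtered noise rather than from the source keeps the target SNR valid at the microphone, where it is actually observed.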

Results

PESQ results (talker count): PESQ improvement achieved by the own voice reconstruction system trained with different numbers of talkers.

PESQ results (utterance count): PESQ improvement achieved by the own voice reconstruction system trained with different numbers of utterances per talker.

Audio Examples

Diffuse Factory Noise, 5 dB SNR

clean outer microphone
noisy outer microphone
noisy in-ear microphone
only recorded
only augmented (speech-dependent individual)
augmented+finetune full (speech-dependent individual)
only recorded (3 talkers)
only augmented (3 talkers) (speech-dependent individual)
augmented+finetune (3 talkers) (speech-dependent individual)
only recorded (25 utterances)
only augmented (25 utterances) (speech-dependent individual)
augmented+finetune (25 utterances) (speech-dependent individual)

Metal Grinder Noise, 0 dB SNR

clean outer microphone
noisy outer microphone
noisy in-ear microphone
only recorded
only augmented (speech-dependent individual)
augmented+finetune full (speech-dependent individual)
only recorded (3 talkers)
only augmented (3 talkers) (speech-dependent individual)
augmented+finetune (3 talkers) (speech-dependent individual)
only recorded (25 utterances)
only augmented (25 utterances) (speech-dependent individual)
augmented+finetune (25 utterances) (speech-dependent individual)