Low Complexity Own Voice Reconstruction for Hearables with an In-ear Microphone

Mattes Ohlenbusch, Christian Rollwage, Simon Doclo

Abstract

Hearable devices, equipped with one or more microphones, are commonly used for speech communication. Here, we consider the scenario where a hearable is used to capture the user’s own voice in a noisy environment. In this scenario, own voice reconstruction (OVR) is essential for enhancing the quality and intelligibility of the recorded noisy own voice signals. In previous work, we developed a deep learning-based OVR system, aiming to reduce the amount of device-specific recordings for training by using data augmentation with phoneme-dependent models of own voice transfer characteristics. Given the limited computational resources available on hearables, in this paper we propose low-complexity variants of an OVR system based on the FT-JNF architecture and investigate the required amount of device-specific recordings for effective data augmentation and fine-tuning. Simulation results show that the proposed OVR system considerably improves speech quality, even under constraints of low complexity and a limited amount of device-specific recordings.

Links

Dataset of German own voice recordings: https://doi.org/10.5281/zenodo.10844599

Transfer function measurements for simulating environmental noise at hearable microphones: https://doi.org/10.5281/zenodo.11196867

Results

Performance, size and complexity of the baseline systems and the proposed FT-JNF variants (XL, L, M, S, XS). 'M' indicates millions, and 'G' indicates billions. Rows with a gray background (the systems marked IM) indicate systems using only the in-ear microphone.

                    Intrusive metrics      Size and complexity
System              PESQ   ESTOI  LSD      Param.     MACs/s    RTF
Unprocessed         1.25   0.51   2.46     -          -         -
UNet (IM)           1.85   0.65   1.30     10.278 M   6.03 G    0.157
EBEN (IM)           1.51   0.57   1.64     1.946 M    1.02 G    0.034
FT-JNF XL (IM)      1.47   0.61   1.73     1.390 M    22.38 G   0.387
GCBFSNet            1.93   0.68   1.36     0.100 M    0.31 G    0.303
FT-JNF XL           2.58   0.78   1.08     1.390 M    22.45 G   0.392
FT-JNF L            2.50   0.77   1.10     0.466 M    7.55 G    0.173
FT-JNF M            2.22   0.72   1.27     0.118 M    1.93 G    0.071
FT-JNF S            2.18   0.72   1.28     0.031 M    0.50 G    0.029
FT-JNF XS           1.95   0.69   1.40     0.013 M    0.23 G    0.011
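
To make the trade-off between quality and complexity in the table above easier to read, the following minimal sketch compares each FT-JNF variant with FT-JNF XL. The numbers are copied from the table; the dictionary `VARIANTS` and the helper `relative_to_xl` are hypothetical names introduced only for this illustration and are not part of the authors' code.

```python
# Values copied from the results table:
# (PESQ, parameters in millions, MACs/s in billions, RTF)
VARIANTS = {
    "XL": (2.58, 1.390, 22.45, 0.392),
    "L": (2.50, 0.466, 7.55, 0.173),
    "M": (2.22, 0.118, 1.93, 0.071),
    "S": (2.18, 0.031, 0.50, 0.029),
    "XS": (1.95, 0.013, 0.23, 0.011),
}


def relative_to_xl(name):
    """Compare one variant with FT-JNF XL (illustrative helper).

    Returns a tuple:
      PESQ difference (variant minus XL),
      factor by which the parameter count shrinks,
      factor by which MACs/s shrink.
    """
    pesq, params, macs, _rtf = VARIANTS[name]
    ref_pesq, ref_params, ref_macs, _ref_rtf = VARIANTS["XL"]
    return pesq - ref_pesq, ref_params / params, ref_macs / macs


if __name__ == "__main__":
    for name in VARIANTS:
        d_pesq, param_factor, mac_factor = relative_to_xl(name)
        print(
            f"FT-JNF {name:2s}: PESQ {d_pesq:+.2f} vs. XL, "
            f"parameters reduced {param_factor:.1f}x, "
            f"MACs/s reduced {mac_factor:.1f}x"
        )
```

Read this way, the table suggests that the smaller variants give up a moderate amount of PESQ in exchange for a large reduction in parameters and MACs/s.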
PESQ Improvement
PESQ improvement of the baseline systems and the proposed FT-JNF variants for different amounts of device-specific recordings (talkers, utterances). Different systems are distinguished by different symbols, while different amounts of recordings are represented by different colors.

Audio Examples (recorded pseudo-diffuse surgery noise at 5 dB SNR)

System Audio example
Clean outer microphone
Noisy outer microphone
Noisy in-ear microphone
EBEN
UNet
GCBFSNet
FT-JNF XL
FT-JNF L
FT-JNF M
FT-JNF S
FT-JNF XS
FT-JNF XL (3 talkers)
FT-JNF XL (25 utterances)
FT-JNF S (3 talkers)
FT-JNF S (25 utterances)