Mattes Ohlenbusch, Christian Rollwage, Simon Doclo
Hearable devices, equipped with one or more microphones, are commonly used for speech communication. Here, we consider the scenario where a hearable is used to capture the user’s own voice in a noisy environment. In this scenario, own voice reconstruction (OVR) is essential for enhancing the quality and intelligibility of the recorded noisy own voice signals. In previous work, we developed a deep learning-based OVR system, aiming to reduce the amount of device-specific recordings for training by using data augmentation with phoneme-dependent models of own voice transfer characteristics. Given the limited computational resources available on hearables, in this paper we propose low-complexity variants of an OVR system based on the FT-JNF architecture and investigate the required amount of device-specific recordings for effective data augmentation and fine-tuning. Simulation results show that the proposed OVR system considerably improves speech quality, even under constraints of low complexity and a limited amount of device-specific recordings.
Dataset of German own voice recordings: https://doi.org/10.5281/zenodo.10844599
Transfer function measurements for simulating environmental noise at hearable microphones: https://doi.org/10.5281/zenodo.11196867

Intrusive metrics (PESQ, ESTOI, LSD) and size and complexity (Param., MACs/s, RTF) for each system:

System | PESQ | ESTOI | LSD | Param. | MACs/s | RTF |
---|---|---|---|---|---|---|
Unprocessed | 1.25 | 0.51 | 2.46 | - | - | - |
UNet (IM) | 1.85 | 0.65 | 1.30 | 10.278 M | 6.03 G | 0.157 |
EBEN (IM) | 1.51 | 0.57 | 1.64 | 1.946 M | 1.02 G | 0.034 |
FT-JNF XL (IM) | 1.47 | 0.61 | 1.73 | 1.390 M | 22.38 G | 0.387 |
GCBFSNet | 1.93 | 0.68 | 1.36 | 0.100 M | 0.31 G | 0.303 |
FT-JNF XL | 2.58 | 0.78 | 1.08 | 1.390 M | 22.45 G | 0.392 |
FT-JNF L | 2.50 | 0.77 | 1.10 | 0.466 M | 7.55 G | 0.173 |
FT-JNF M | 2.22 | 0.72 | 1.27 | 0.118 M | 1.93 G | 0.071 |
FT-JNF S | 2.18 | 0.72 | 1.28 | 0.031 M | 0.50 G | 0.029 |
FT-JNF XS | 1.95 | 0.69 | 1.40 | 0.013 M | 0.23 G | 0.011 |
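
The trade-off between model size and speech quality in the table above can be made explicit with a little arithmetic. The following is a minimal sketch, not part of the authors' evaluation code: it copies the FT-JNF rows from the table and expresses each variant relative to FT-JNF XL. The dictionary `FT_JNF`, the helper `relative_to_xl`, and the "PESQ gain retained" measure (the variant's PESQ improvement over the unprocessed signal divided by that of FT-JNF XL) are illustrative assumptions introduced here.

```python
# Values copied from the results table:
# (PESQ, parameters in millions, GMACs per second, RTF)
FT_JNF = {
    "XL": (2.58, 1.390, 22.45, 0.392),
    "L": (2.50, 0.466, 7.55, 0.173),
    "M": (2.22, 0.118, 1.93, 0.071),
    "S": (2.18, 0.031, 0.50, 0.029),
    "XS": (1.95, 0.013, 0.23, 0.011),
}
PESQ_UNPROCESSED = 1.25  # "Unprocessed" row of the table


def relative_to_xl(variant):
    """Compare one FT-JNF variant against FT-JNF XL (hypothetical helper)."""
    pesq, params, macs, _rtf = FT_JNF[variant]
    ref_pesq, ref_params, ref_macs, _ref_rtf = FT_JNF["XL"]
    return {
        # How many times fewer parameters / MACs per second than XL
        "param_reduction": ref_params / params,
        "mac_reduction": ref_macs / macs,
        # Share of XL's PESQ improvement over the unprocessed signal
        # that the smaller variant still achieves (illustrative measure)
        "pesq_gain_retained": (pesq - PESQ_UNPROCESSED)
        / (ref_pesq - PESQ_UNPROCESSED),
    }


for name in FT_JNF:
    r = relative_to_xl(name)
    print(
        f"FT-JNF {name}: {r['param_reduction']:.1f}x fewer parameters, "
        f"{r['mac_reduction']:.1f}x fewer MACs/s, "
        f"{100 * r['pesq_gain_retained']:.0f}% of the XL PESQ gain"
    )
```

Read this way, the smaller variants give up part of the PESQ improvement in exchange for a much larger reduction in parameters and MACs/s. The same comparison could be repeated for ESTOI or LSD by swapping in the corresponding table columns.
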

System | Audio example |
---|---|
Clean outer microphone | |
Noisy outer microphone | |
Noisy in-ear microphone | |
EBEN | |
UNet | |
GCBFSNet | |
FT-JNF XL | |
FT-JNF L | |
FT-JNF M | |
FT-JNF S | |
FT-JNF XS | |
FT-JNF XL (3 talkers) | |
FT-JNF XL (25 utterances) | |
FT-JNF S (3 talkers) | |
FT-JNF S (25 utterances) |