Low Complexity Own Voice Reconstruction for Hearables with an In-ear Microphone

Mattes Ohlenbusch, Christian Rollwage, Simon Doclo

Abstract

Hearable devices, equipped with one or more microphones, are commonly used for speech communication. Here, we consider the scenario where a hearable is used to capture the user’s own voice in a noisy environment. In this scenario, own voice reconstruction (OVR) is essential for enhancing the quality and intelligibility of the recorded noisy own voice signals. In previous work, we developed a deep learning-based OVR system, aiming to reduce the amount of device-specific recordings for training by using data augmentation with phoneme-dependent models of own voice transfer characteristics. Given the limited computational resources available on hearables, in this paper we propose low-complexity variants of an OVR system based on the FT-JNF architecture and investigate the required amount of device-specific recordings for effective data augmentation and fine-tuning. Simulation results show that the proposed OVR system considerably improves speech quality, even under constraints of low complexity and a limited amount of device-specific recordings.

Results

Performance, size and complexity of the baseline systems and the proposed FT-JNF variants (XL, L, M, S, XS). 'M' indicates millions, and 'G' indicates billions. Systems marked (IM) use only the in-ear microphone.
	Intrusive metrics			Size and complexity
System	PESQ	ESTOI	LSD	Param.	MACs/s	RTF
Unprocessed	1.25	0.51	2.46	-	-	-
UNet (IM)	1.85	0.65	1.30	10.278 M	6.03 G	0.157
EBEN (IM)	1.51	0.57	1.64	1.946 M	1.02 G	0.034
FT-JNF XL (IM)	1.47	0.61	1.73	1.390 M	22.38 G	0.387
GCBFSNet	1.93	0.68	1.36	0.100 M	0.31 G	0.303
FT-JNF XL	2.58	0.78	1.08	1.390 M	22.45 G	0.392
FT-JNF L	2.50	0.77	1.10	0.466 M	7.55 G	0.173
FT-JNF M	2.22	0.72	1.27	0.118 M	1.93 G	0.071
FT-JNF S	2.18	0.72	1.28	0.031 M	0.50 G	0.029
FT-JNF XS	1.95	0.69	1.40	0.013 M	0.23 G	0.011
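
To make the quality-versus-complexity trade-off in the table easier to read, the following minimal Python sketch computes, for each proposed FT-JNF variant, the PESQ gain over the unprocessed signal and the reduction in MACs/s and parameter count relative to FT-JNF XL. The numbers are copied from the table; the names `ROWS`, `UNPROCESSED_PESQ`, and `tradeoff` are hypothetical and chosen only for this illustration, and the derived ratios are simple arithmetic on the reported values rather than results reported by the authors.

```python
# Values copied from the results table:
# (system, PESQ, parameters in millions, MACs/s in billions, RTF)
ROWS = [
    ("FT-JNF XL", 2.58, 1.390, 22.45, 0.392),
    ("FT-JNF L", 2.50, 0.466, 7.55, 0.173),
    ("FT-JNF M", 2.22, 0.118, 1.93, 0.071),
    ("FT-JNF S", 2.18, 0.031, 0.50, 0.029),
    ("FT-JNF XS", 1.95, 0.013, 0.23, 0.011),
]
UNPROCESSED_PESQ = 1.25  # PESQ of the unprocessed signal, from the table


def tradeoff(rows, ref_name="FT-JNF XL"):
    """Return PESQ gain and complexity reduction relative to a reference system."""
    ref = next(r for r in rows if r[0] == ref_name)
    summary = {}
    for name, pesq, params_m, macs_g, _rtf in rows:
        summary[name] = {
            # Improvement over the unprocessed signal
            "pesq_gain": round(pesq - UNPROCESSED_PESQ, 2),
            # How many times fewer MACs/s than the reference system
            "macs_reduction": round(ref[3] / macs_g, 1),
            # How many times fewer parameters than the reference system
            "param_reduction": round(ref[2] / params_m, 1),
        }
    return summary


summary = tradeoff(ROWS)
for name, stats in summary.items():
    print(name, stats)
```

Under these assumptions, the smallest variant (FT-JNF XS) needs roughly two orders of magnitude fewer MACs/s than FT-JNF XL while still showing a PESQ gain over the unprocessed signal.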

PESQ improvement of the baseline systems and the proposed FT-JNF variants for different amounts of device-specific recordings (talkers, utterances). Different systems are distinguished by different symbols, while different amounts of recordings are represented by different colors.

Audio Examples (recorded pseudo-diffuse surgery noise at 5 dB SNR)

System

Audio example

Clean outer microphone