An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings

Accepted by ICASSP 2022

Abstract

Many mispronunciation detection and diagnosis (MD&D) research approaches try to exploit both acoustic and linguistic features as input. Yet the performance improvement is limited, partially due to the shortage of large amounts of training data annotated at the phoneme level. Phonetic embeddings, extracted from ASR models trained with huge amounts of word-level annotations, can serve as a good representation of the content of input speech in a noise-robust and speaker-independent manner. These embeddings, when used as implicit phonetic supplementary information, can alleviate the shortage of explicit phoneme annotations. We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly to build a more powerful MD&D system. Experimental results on the L2-ARCTIC database show that the proposed approach outperforms the baseline by 9.93%, 10.13% and 6.17% on detection accuracy, diagnosis error rate and F-measure, respectively.
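To make the fusion idea concrete, the sketch below shows one plausible way to combine the three embedding streams before phoneme classification. It is a minimal illustration, not the authors' exact architecture: the frame-level alignment of the linguistic stream, all dimensions, and the BLSTM/linear layer choices are assumptions.

```python
import torch
import torch.nn as nn

class APLPhonemeRecognizer(nn.Module):
    """Minimal sketch: fuse frame-aligned Acoustic, Phonetic and
    Linguistic embeddings and predict a phoneme posterior per frame.
    Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, d_acoustic=80, d_phonetic=512, d_linguistic=64,
                 d_hidden=256, n_phonemes=42):
        super().__init__()
        d_in = d_acoustic + d_phonetic + d_linguistic
        self.encoder = nn.LSTM(d_in, d_hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_hidden, n_phonemes)

    def forward(self, acoustic, phonetic, linguistic):
        # Each input: (batch, frames, dim). Concatenating along the
        # feature axis yields the joint APL representation.
        apl = torch.cat([acoustic, phonetic, linguistic], dim=-1)
        h, _ = self.encoder(apl)
        return self.classifier(h)  # per-frame phoneme logits
```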

Experiments

Phoneme Recognition

Confusion Matrix

The confusion matrices of the most frequently misrecognized vowels and consonants for Baseline-2 (AL), PL-2 and APL-2 are listed in Table 1 and Table 2, respectively. Almost all vowels and consonants in the tables are recognized more accurately when the acoustic features are replaced by the new phonetic embeddings (PL-2), and APL-2 improves the performance further.
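For reference, such confusion counts can be accumulated with a few lines of code; a minimal sketch, assuming the annotated and recognized phoneme sequences are already aligned one-to-one (e.g., after edit-distance alignment with <eps> padding for insertions and deletions):

```python
from collections import Counter

def confusion_counts(annotated, recognized):
    """Count (annotated, recognized) phoneme pairs from aligned sequences."""
    counts = Counter()
    for ref, hyp in zip(annotated, recognized):
        counts[(ref, hyp)] += 1
    return counts

# Example: 'ih' recognized correctly twice, once confused with 'iy'.
print(confusion_counts(['ih', 'ih', 'ih'], ['ih', 'iy', 'ih']))
```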

Table 1: Confusion matrix of frequently misrecognized vowels (rows: annotated phoneme; columns: recognized phoneme)

Model            Annot.    aa     ah     ae     eh     ih     iy
Baseline-2 (AL)  aa       220     63     10      2      0      0
                 ah        31   2025     23     23    104     13
                 ae        10     64    617     63     16      1
                 eh         1     95     58    534     79      7
                 ih         0    128      1     32   1212    183
                 iy         0     23      1      8    184   1043
PL-2             aa       239     51     14      2      0      0
                 ah        13   2152     10     18     83      5
                 ae        12     41    660     56     14      0
                 eh         1     49     45    654     37     11
                 ih         1     74      3     22   1399    130
                 iy         0     13      0      3    174   1138
APL-2            aa       290     18     11      3      0      0
                 ah         8   2298     11     10     19     12
                 ae         5     39    728     16      4      0
                 eh         0     41     34    710     17     15
                 ih         0     35      0     15   1526     89
                 iy         0     17      0      2    147   1185

Table 2: Confusion matrix of frequently misrecognized consonants (rows: annotated phoneme; columns: recognized phoneme)

Model            Annot.     d     dh      t     sh      s      z
Baseline-2 (AL)  d       1085    104    107      1      6      3
                 dh       149    104     20      0      3      0
                 t        141      4   1212      0     12      1
                 sh         0      0      0    324      3      0
                 s          2      0      6     20   1485     83
                 z          2      0      1      6    180    247
PL-2             d       1159    185     74      0      6      4
                 dh       106    202      9      0      1      0
                 t         95     13   1337      3     11      3
                 sh         0      0      1    327      4      1
                 s          0      0      7      4   1410    204
                 z          1      0      3      0    121    346
APL-2            d       1187    230     33      0      5      2
                 dh       104    215      0      0      0      1
                 t         74      9   1379      0      8      1
                 sh         0      0      1    327      4      0
                 s          0      0      3      1   1479    125
                 z          0      0      1      0    141    325

Phonemes that APL Most Improves on (PL vs. APL)

When acoustic features are added on top of the phonetic and linguistic embeddings (APL vs. PL), the top 5 phonemes that benefit are 'aa', 'jh', 'ao', 'ae' and 'uh', with relative accuracy improvements of 21.24%, 14.29%, 12.14%, 10.30% and 10.24%, respectively. Most of them are vowels.
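The relative improvement quoted above can be computed from per-phoneme accuracies derived from full confusion-matrix rows; a minimal sketch with placeholder numbers (not values from the paper):

```python
def phoneme_accuracy(confusion_row, diag_index):
    """Accuracy for one annotated phoneme: correct count / row total."""
    return confusion_row[diag_index] / sum(confusion_row)

def relative_improvement(acc_before, acc_after):
    """Relative accuracy gain, as reported per phoneme in the text."""
    return (acc_after - acc_before) / acc_before

# Hypothetical per-phoneme accuracies (placeholders, not paper values):
print(f"{relative_improvement(0.70, 0.85):.2%}")  # -> 21.43%
```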

Mispronunciation Detection and Diagnosis

Mispronunciations Hardest to Detect across Different Accents

The top 3 most challenging mispronunciations for the APL-2 model across the six accents in the test set are shown in Table 3, where each mispronunciation is represented as a tuple of (canonical phoneme, actual realization). For example, (r, r*) indicates that 'r' is mispronounced as 'r*', and (<eps>, ax) stands for an insertion error, i.e., an extra 'ax' is produced. Since only one speaker per accent is included in the test set, more data is needed to establish a robust trend across the six accents. A sketch of how such tuples can be recovered by aligning phoneme sequences follows Table 3.

Table 3: Top 3 challenging mispronunciations across accents

Mother Tongue   Top 3 Mispronunciations
Arabic          (r, r*); (ih, eh); (ow, ao)
Hindi           (t, d); (<eps>, ax); (w, w*)
Korean          (v, f); (jh, ch); (aw, aa)
Mandarin        (th, t); (d, t); (l, w)
Spanish         (ey, eh); (ah, eh); (<eps>, eh)
Vietnamese      (ah, ao); (s, z); (ae, eh)
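Tuples like those in Table 3 can be recovered by aligning the canonical phoneme sequence with the annotated realization via standard edit distance; substitution pairs and <eps>-padded insertions/deletions then become the mispronunciation tuples. A minimal sketch (the function and its interface are illustrative, not from the paper):

```python
def align(canonical, annotated, eps="<eps>"):
    """Edit-distance alignment of canonical vs. annotated phonemes.
    Returns (canonical, annotated) pairs; eps marks ins/del."""
    n, m = len(canonical), len(annotated)
    # DP table of edit costs.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (canonical[i-1] != annotated[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    # Trace back to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i-1][j-1] + (canonical[i-1] != annotated[j-1])):
            pairs.append((canonical[i-1], annotated[j-1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((canonical[i-1], eps)); i -= 1   # deletion
        else:
            pairs.append((eps, annotated[j-1])); j -= 1   # insertion
    return pairs[::-1]

# An extra 'ax' was produced: yields the insertion tuple (<eps>, ax).
print(align(["dh", "ax"], ["dh", "ax", "ax"]))
```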