An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings

Accepted by ICASSP 2022

Abstract

Many mispronunciation detection and diagnosis (MD&D) research approaches try to exploit both acoustic and linguistic features as input. Yet the performance improvement is limited, partially due to the shortage of large amounts of training data annotated at the phoneme level. Phonetic embeddings, extracted from ASR models trained with huge amounts of word-level annotations, can serve as a good representation of the content of input speech in a noise-robust and speaker-independent manner. These embeddings, when used as implicit phonetic supplementary information, can alleviate the shortage of explicit phoneme annotations. We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly to build a more powerful MD&D system. Experimental results on the L2-ARCTIC database show that the proposed approach outperforms the baseline by 9.93%, 10.13% and 6.17% on detection accuracy, diagnosis error rate and F-measure, respectively.
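To make the fusion idea concrete, the sketch below shows one plausible way to combine the three embedding streams before phoneme classification. It is a minimal illustration, not the authors' exact architecture: the frame-level alignment of the linguistic stream, all dimensions, and the BLSTM/linear layer choices are assumptions.

```python
import torch
import torch.nn as nn

class APLPhonemeRecognizer(nn.Module):
    """Minimal sketch: fuse frame-aligned Acoustic, Phonetic and
    Linguistic embeddings and predict a phoneme posterior per frame.
    Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, d_acoustic=80, d_phonetic=512, d_linguistic=64,
                 d_hidden=256, n_phonemes=42):
        super().__init__()
        d_in = d_acoustic + d_phonetic + d_linguistic
        self.encoder = nn.LSTM(d_in, d_hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_hidden, n_phonemes)

    def forward(self, acoustic, phonetic, linguistic):
        # Each input: (batch, frames, dim). Concatenating along the
        # feature axis yields the joint APL representation.
        apl = torch.cat([acoustic, phonetic, linguistic], dim=-1)
        h, _ = self.encoder(apl)
        return self.classifier(h)  # per-frame phoneme logits
```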

Experiments

Phoneme Recognition

Confusion Matrix

The confusion matrices of the most frequently misrecognized vowels and consonants for Baseline-2 (AL), PL-2 and APL-2 are listed in Table 1 and Table 2, respectively. Almost all vowels and consonants in the tables are recognized more accurately when the acoustic features are replaced by the new phonetic embeddings (PL-2), and APL-2 improves the performance further.
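For reference, such confusion counts can be accumulated with a few lines of code; a minimal sketch, assuming the annotated and recognized phoneme sequences are already aligned one-to-one (e.g., after edit-distance alignment with <eps> padding for insertions and deletions):

```python
from collections import Counter

def confusion_counts(annotated, recognized):
    """Count (annotated, recognized) phoneme pairs from aligned sequences."""
    counts = Counter()
    for ref, hyp in zip(annotated, recognized):
        counts[(ref, hyp)] += 1
    return counts

# Example: 'ih' recognized correctly twice, once confused with 'iy'.
print(confusion_counts(['ih', 'ih', 'ih'], ['ih', 'iy', 'ih']))
```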

Table 1: Confusion matrix of frequently misrecognized vowels (rows: annotated phoneme; columns: recognized phoneme)

Model            Annot.    aa     ah     ae     eh     ih     iy
Baseline-2 (AL)  aa       220     63     10      2      0      0
                 ah        31   2025     23     23    104     13
                 ae        10     64    617     63     16      1
                 eh         1     95     58    534     79      7
                 ih         0    128      1     32   1212    183
                 iy         0     23      1      8    184   1043
PL-2             aa       239     51     14      2      0      0
                 ah        13   2152     10     18     83      5
                 ae        12     41    660     56     14      0
                 eh         1     49     45    654     37     11
                 ih         1     74      3     22   1399    130
                 iy         0     13      0      3    174   1138
APL-2            aa       290     18     11      3      0      0
                 ah         8   2298     11     10     19     12
                 ae         5     39    728     16      4      0
                 eh         0     41     34    710     17     15
                 ih         0     35      0     15   1526     89
                 iy         0     17      0      2    147   1185

Table 2: Confusion matrix of frequently misrecognized consonants (rows: annotated phoneme; columns: recognized phoneme)

Model            Annot.     d     dh      t     sh      s      z
Baseline-2 (AL)  d       1085    104    107      1      6      3
                 dh       149    104     20      0      3      0
                 t        141      4   1212      0     12      1
                 sh         0      0      0    324      3      0
                 s          2      0      6     20   1485     83
                 z          2      0      1      6    180    247
PL-2             d       1159    185     74      0      6      4
                 dh       106    202      9      0      1      0
                 t         95     13   1337      3     11      3
                 sh         0      0      1    327      4      1
                 s          0      0      7      4   1410    204
                 z          1      0      3      0    121    346
APL-2            d       1187    230     33      0      5      2
                 dh       104    215      0      0      0      1
                 t         74      9   1379      0      8      1
                 sh         0      0      1    327      4      0
                 s          0      0      3      1   1479    125
                 z          0      0      1      0    141    325

Phonemes that APL Most Improves on (PL vs. APL)

When acoustic features are added on top of the phonetic and linguistic embeddings (APL vs. PL), the top 5 phonemes that benefit are 'aa', 'jh', 'ao', 'ae' and 'uh', with relative accuracy improvements of 21.24%, 14.29%, 12.14%, 10.30% and 10.24%, respectively. Most of them are vowels.
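The relative improvement quoted above can be computed from per-phoneme accuracies derived from full confusion-matrix rows; a minimal sketch with placeholder numbers (not values from the paper):

```python
def phoneme_accuracy(confusion_row, diag_index):
    """Accuracy for one annotated phoneme: correct count / row total."""
    return confusion_row[diag_index] / sum(confusion_row)

def relative_improvement(acc_before, acc_after):
    """Relative accuracy gain, as reported per phoneme in the text."""
    return (acc_after - acc_before) / acc_before

# Hypothetical per-phoneme accuracies (placeholders, not paper values):
print(f"{relative_improvement(0.70, 0.85):.2%}")  # -> 21.43%
```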

Mispronunciation Detection and Diagnosis

Mispronunciations Hardest to Detect across Different Accents

The top 3 most challenging mispronunciations for the APL-2 model across the six accents in the test set are shown in Table 3, where each mispronunciation is represented as a tuple of (canonical phoneme, actual realization). For example, (r, r*) indicates that 'r' is mispronounced as 'r*', and (<eps>, ax) stands for an insertion error, i.e., an extra 'ax' is produced. Since only one speaker per accent is included in the test set, more data is needed to establish a robust trend across the six accents. A sketch of how such tuples can be recovered by aligning phoneme sequences follows Table 3.

Table 3: Top 3 challenging mispronunciations across accents

Mother Tongue   Top 3 Mispronunciations
Arabic          (r, r*); (ih, eh); (ow, ao)
Hindi           (t, d); (<eps>, ax); (w, w*)
Korean          (v, f); (jh, ch); (aw, aa)
Mandarin        (th, t); (d, t); (l, w)
Spanish         (ey, eh); (ah, eh); (<eps>, eh)
Vietnamese      (ah, ao); (s, z); (ae, eh)
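Tuples like those in Table 3 can be recovered by aligning the canonical phoneme sequence with the annotated realization via standard edit distance; substitution pairs and <eps>-padded insertions/deletions then become the mispronunciation tuples. A minimal sketch (the function and its interface are illustrative, not from the paper):

```python
def align(canonical, annotated, eps="<eps>"):
    """Edit-distance alignment of canonical vs. annotated phonemes.
    Returns (canonical, annotated) pairs; eps marks ins/del."""
    n, m = len(canonical), len(annotated)
    # DP table of edit costs.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (canonical[i-1] != annotated[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    # Trace back to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i-1][j-1] + (canonical[i-1] != annotated[j-1])):
            pairs.append((canonical[i-1], annotated[j-1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((canonical[i-1], eps)); i -= 1   # deletion
        else:
            pairs.append((eps, annotated[j-1])); j -= 1   # insertion
    return pairs[::-1]

# An extra 'ax' was produced: yields the insertion tuple (<eps>, ax).
print(align(["dh", "ax"], ["dh", "ax", "ax"]))
```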