Abstract
Many mispronunciation detection and diagnosis (MD&D) approaches exploit both acoustic and linguistic features as input. Yet the performance improvement is limited, partly because large amounts of training data annotated at the phoneme level are scarce. Phonetic embeddings, extracted from ASR models trained on huge amounts of word-level annotations, can represent the content of input speech well, in a noise-robust and speaker-independent manner. Used as implicit supplementary phonetic information, these embeddings can alleviate the shortage of explicit phoneme annotations. We propose to use Acoustic, Phonetic and Linguistic (APL) embedding features jointly to build a more powerful MD&D system. Experimental results on the L2-ARCTIC database show that the proposed approach outperforms the baseline by 9.93%, 10.13% and 6.17% on detection accuracy, diagnosis error rate and F-measure, respectively.
Experiments
Phoneme Recognition
Confusion Matrix
The confusion matrices of the most frequently misrecognized vowels and consonants for Baseline-2 (AL), PL-2 and APL-2 are listed in Table 1 and Table 2, respectively. Almost all vowels and consonants in the tables are recognized more accurately when the acoustic features are replaced by the new phonetic embeddings (PL-2). APL-2 further improves performance.
Table 1: Confusion matrix of frequently misrecognized vowels
Model | Recognized (row) / Annotation (column) | aa | ah | ae | eh | ih | iy |
---|---|---|---|---|---|---|---|
Baseline-2 (AL) | aa | 220 | 63 | 10 | 2 | 0 | 0 |
Baseline-2 (AL) | ah | 31 | 2025 | 23 | 23 | 104 | 13 |
Baseline-2 (AL) | ae | 10 | 64 | 617 | 63 | 16 | 1 |
Baseline-2 (AL) | eh | 1 | 95 | 58 | 534 | 79 | 7 |
Baseline-2 (AL) | ih | 0 | 128 | 1 | 32 | 1212 | 183 |
Baseline-2 (AL) | iy | 0 | 23 | 1 | 8 | 184 | 1043 |
PL-2 | aa | 239 | 51 | 14 | 2 | 0 | 0 |
PL-2 | ah | 13 | 2152 | 10 | 18 | 83 | 5 |
PL-2 | ae | 12 | 41 | 660 | 56 | 14 | 0 |
PL-2 | eh | 1 | 49 | 45 | 654 | 37 | 11 |
PL-2 | ih | 1 | 74 | 3 | 22 | 1399 | 130 |
PL-2 | iy | 0 | 13 | 0 | 3 | 174 | 1138 |
APL-2 | aa | 290 | 18 | 11 | 3 | 0 | 0 |
APL-2 | ah | 8 | 2298 | 11 | 10 | 19 | 12 |
APL-2 | ae | 5 | 39 | 728 | 16 | 4 | 0 |
APL-2 | eh | 0 | 41 | 34 | 710 | 17 | 15 |
APL-2 | ih | 0 | 35 | 0 | 15 | 1526 | 89 |
APL-2 | iy | 0 | 17 | 0 | 2 | 147 | 1185 |
Table 2: Confusion matrix of frequently misrecognized consonants
Model | Recognized (row) / Annotation (column) | d | dh | t | sh | s | z |
---|---|---|---|---|---|---|---|
Baseline-2 (AL) | d | 1085 | 104 | 107 | 1 | 6 | 3 |
Baseline-2 (AL) | dh | 149 | 104 | 20 | 0 | 3 | 0 |
Baseline-2 (AL) | t | 141 | 4 | 1212 | 0 | 12 | 1 |
Baseline-2 (AL) | sh | 0 | 0 | 0 | 324 | 3 | 0 |
Baseline-2 (AL) | s | 2 | 0 | 6 | 20 | 1485 | 83 |
Baseline-2 (AL) | z | 2 | 0 | 1 | 6 | 180 | 247 |
PL-2 | d | 1159 | 185 | 74 | 0 | 6 | 4 |
PL-2 | dh | 106 | 202 | 9 | 0 | 1 | 0 |
PL-2 | t | 95 | 13 | 1337 | 3 | 11 | 3 |
PL-2 | sh | 0 | 0 | 1 | 327 | 4 | 1 |
PL-2 | s | 0 | 0 | 7 | 4 | 1410 | 204 |
PL-2 | z | 1 | 0 | 3 | 0 | 121 | 346 |
APL-2 | d | 1187 | 230 | 33 | 0 | 5 | 2 |
APL-2 | dh | 104 | 215 | 0 | 0 | 0 | 1 |
APL-2 | t | 74 | 9 | 1379 | 0 | 8 | 1 |
APL-2 | sh | 0 | 0 | 1 | 327 | 4 | 0 |
APL-2 | s | 0 | 0 | 3 | 1 | 1479 | 125 |
APL-2 | z | 0 | 0 | 1 | 0 | 141 | 325 |
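As a rough illustration of how the confusion matrices can be read, the sketch below computes, for each annotated vowel, the share of counts that fall on the diagonal. The counts are copied from the Baseline-2 (AL) and APL-2 blocks of Table 1. Treating rows as recognized phonemes and columns as annotations is an assumption, and the helper name `diagonal_share` is hypothetical. Because only six vowels are listed, these shares are not the recognition accuracies reported in the paper.

```python
VOWELS = ["aa", "ah", "ae", "eh", "ih", "iy"]

# Counts copied from Table 1.
# Assumption: rows are recognized phonemes, columns are annotated phonemes.
BASELINE_2 = [
    [220, 63, 10, 2, 0, 0],
    [31, 2025, 23, 23, 104, 13],
    [10, 64, 617, 63, 16, 1],
    [1, 95, 58, 534, 79, 7],
    [0, 128, 1, 32, 1212, 183],
    [0, 23, 1, 8, 184, 1043],
]
APL_2 = [
    [290, 18, 11, 3, 0, 0],
    [8, 2298, 11, 10, 19, 12],
    [5, 39, 728, 16, 4, 0],
    [0, 41, 34, 710, 17, 15],
    [0, 35, 0, 15, 1526, 89],
    [0, 17, 0, 2, 147, 1185],
]


def diagonal_share(matrix):
    """Return, per column, the fraction of counts on the diagonal.

    Only the phonemes listed in the table are counted, so this is a
    within-table share, not a full recognition accuracy.
    """
    size = len(matrix)
    shares = []
    for col in range(size):
        column_total = sum(matrix[row][col] for row in range(size))
        shares.append(matrix[col][col] / column_total)
    return shares


if __name__ == "__main__":
    rows = zip(VOWELS, diagonal_share(BASELINE_2), diagonal_share(APL_2))
    for vowel, base, apl in rows:
        print(f"{vowel}: Baseline-2 {base:.3f} -> APL-2 {apl:.3f}")
```

Under these assumptions, every listed vowel has a larger diagonal share with APL-2 than with Baseline-2 (AL).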
Phonemes That APL Improves Most (PL vs. APL)
When acoustic features are added (APL vs. PL), the five phonemes that benefit most are ‘aa’, ‘jh’, ‘ao’, ‘ae’ and ‘uh’, with relative accuracy improvements of 21.24%, 14.29%, 12.14%, 10.30% and 10.24%, respectively. Most of them are vowels.
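The paper does not spell out how the relative improvement is computed. A minimal sketch, assuming the usual definition (the gain divided by the PL accuracy), is given below. The function names and the accuracy values are hypothetical and only illustrate the ranking step; they are not results from the paper.

```python
def relative_improvement(acc_before, acc_after):
    """Relative accuracy improvement in percent (assumed definition)."""
    if acc_before <= 0:
        raise ValueError("acc_before must be positive")
    return (acc_after - acc_before) / acc_before * 100.0


def top_k_improved(acc_before, acc_after, k=5):
    """Rank phonemes by relative improvement, largest first."""
    gains = {
        phoneme: relative_improvement(acc_before[phoneme], acc_after[phoneme])
        for phoneme in acc_before
        if phoneme in acc_after
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical per-phoneme accuracies, for illustration only.
PL_ACC = {"aa": 0.50, "ih": 0.80, "s": 0.90}
APL_ACC = {"aa": 0.60, "ih": 0.84, "s": 0.909}

if __name__ == "__main__":
    for phoneme, gain in top_k_improved(PL_ACC, APL_ACC, k=2):
        print(f"{phoneme}: {gain:.2f}% relative improvement")
```

A relative measure favors phonemes that start from a low accuracy, which is worth keeping in mind when comparing vowels with consonants.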
Mispronunciation Detection and Diagnosis
Mispronunciations Hardest to Detect across Different Accents
The three most challenging mispronunciations for the APL-2 model in each of the six accents in the test set are shown in Table 3, where each mispronunciation is represented as a tuple. For example, (r, r*) indicates that ‘r’ is mispronounced as ‘r*’, and (<eps>, ax) stands for an insertion error. Since the test set includes only one speaker per accent, more data is needed to show a more robust trend across the six accents.
Table 3: Top 3 challenging mispronunciations across accents
Mother Tongue | Top 3 Mispronunciations |
---|---|
Arabic | (r, r*); (ih, eh); (ow, ao) |
Hindi | (t, d); (<eps>, ax); (w, w*) |
Korean | (v, f); (jh, ch); (aw, aa) |
Mandarin | (th, t); (d, t); (l, w) |
Spanish | (ey, eh); (ah, eh); (<eps>, eh) |
Vietnamese | (ah, ao); (s, z); (ae, eh) |
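One way such a ranking could be produced is sketched below. It assumes that "hardest to detect" means the (canonical, pronounced) pairs the model most often fails to flag, which the text does not state explicitly. The record layout, the function name `hardest_pairs` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def hardest_pairs(records, k=3):
    """Return, per accent, the k most frequently missed mispronunciations.

    Each record is (accent, canonical, pronounced, detected). An insertion
    uses "<eps>" as the canonical phoneme. Correct pronunciations
    (canonical == pronounced) are ignored.
    """
    missed = defaultdict(Counter)
    for accent, canonical, pronounced, detected in records:
        if canonical != pronounced and not detected:
            missed[accent][(canonical, pronounced)] += 1
    return {
        accent: [pair for pair, _ in counts.most_common(k)]
        for accent, counts in missed.items()
    }


# Toy records, for illustration only (not data from the paper).
RECORDS = (
    [("Mandarin", "th", "t", False)] * 3
    + [("Mandarin", "d", "t", False)] * 2
    + [("Mandarin", "l", "w", False)]
    + [("Mandarin", "th", "t", True)]   # detected, so not counted
    + [("Mandarin", "s", "s", True)]    # correct pronunciation, ignored
    + [("Hindi", "<eps>", "ax", False)]
)

if __name__ == "__main__":
    for accent, pairs in hardest_pairs(RECORDS).items():
        print(accent, pairs)
```

With a single speaker per accent, counts like these are small, so the resulting order is sensitive to a few utterances.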