"Speech verification" redirects here and is not to be confused with speaker verification.
Automatic pronunciation assessment is the use of speech recognition to verify the correctness of pronounced speech,[1][2] as distinguished from manual assessment by an instructor or proctor.[3] Also called speech verification, pronunciation evaluation, or pronunciation scoring, the technology's main application is computer-aided pronunciation teaching (CAPT) when combined with computer-aided instruction for computer-assisted language learning (CALL), speech remediation, or accent reduction.
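As an illustration of the basic idea only, and not any particular product's method, the sketch below transcribes a learner recording with the open-source Whisper recognizer and counts how many of the prompted words appear in the hypothesis; the audio file name and prompt are placeholders.

```python
# Minimal sketch: verify pronounced words by transcribing learner audio with an
# off-the-shelf speech recognizer and comparing the hypothesis to the expected
# prompt. Uses the openai-whisper package; file name and prompt are placeholders.
import re
import whisper

def word_match_score(audio_path: str, expected_text: str) -> float:
    """Fraction of prompted words that the recognizer heard in the recording."""
    model = whisper.load_model("base")
    hypothesis = model.transcribe(audio_path)["text"]
    heard = set(re.findall(r"[a-z']+", hypothesis.lower()))
    expected = re.findall(r"[a-z']+", expected_text.lower())
    return sum(word in heard for word in expected) / max(len(expected), 1)

print(word_match_score("learner.wav", "The quick brown fox jumps over the lazy dog"))
```

CAPT systems typically go further, for example by aligning the audio to the expected phone sequence and scoring individual phones, but the word-level comparison conveys the core verification step.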
In 2022, researchers found that some newer speech-to-text systems, based on end-to-end reinforcement learning to map audio signals directly into words, produce word and phrase confidence scores closely correlated with genuine listener intelligibility.[25] In 2023, others assessed intelligibility using a dynamic time warping based distance between Wav2Vec2 representations of the test speech and those of known-good speech.[26]
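A rough sketch of that general idea follows, assuming torchaudio's pretrained Wav2Vec2 bundle and short mono recordings; the plain Euclidean dynamic time warping and the file names are illustrative stand-ins, not the cited authors' exact setup.

```python
# Rough sketch: score a learner recording by the dynamic time warping (DTW)
# distance between its Wav2Vec2 frame representations and those of a reference
# recording of intelligible speech. Assumes torchaudio's pretrained bundle and
# short mono WAV files; the file names below are placeholders.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def wav2vec2_frames(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)                      # (channels, time)
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.no_grad():
        layers, _ = model.extract_features(waveform[:1])      # keep one channel
    return layers[-1][0]                                      # (frames, dims)

def dtw_distance(a: torch.Tensor, b: torch.Tensor) -> float:
    """Length-normalised DTW over frame-wise Euclidean distances."""
    cost = torch.cdist(a, b)
    acc = torch.full_like(cost, float("inf"))
    for i in range(cost.shape[0]):
        for j in range(cost.shape[1]):
            prev = 0.0 if i == 0 and j == 0 else min(
                acc[i - 1, j] if i > 0 else float("inf"),
                acc[i, j - 1] if j > 0 else float("inf"),
                acc[i - 1, j - 1] if i > 0 and j > 0 else float("inf"),
            )
            acc[i, j] = cost[i, j] + prev
    return float(acc[-1, -1]) / (cost.shape[0] + cost.shape[1])

# Lower distance to the reference suggests higher intelligibility.
print(dtw_distance(wav2vec2_frames("learner.wav"), wav2vec2_frames("reference.wav")))
```

The quadratic-time DTW loop is adequate for utterance-length inputs; a library such as dtaidistance could replace it for longer audio.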
Evaluation
Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use in improving assessment quality.[27][28] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of the genuine intelligibility evident from blinded listener transcriptions.[5]
Ethical issues arise in both human and automatic pronunciation assessment. Validity, fairness, and the mitigation of bias in evaluation are all crucial, and automatic pronunciation assessment models should be trained on diverse speech data. Combining human judgment with automated feedback can improve accuracy and fairness.[29]
^ El Kheir, Yassine; et al. (October 21, 2023), Automatic Pronunciation Assessment — A Review, Conference on Empirical Methods in Natural Language Processing, arXiv:2310.13974, S2CID 264426545
^ a b O'Brien, Mary Grantham; et al. (31 December 2018). "Directions for the future of technology in pronunciation research and teaching". Journal of Second Language Pronunciation. 4 (2): 182–207. doi:10.1075/jslp.17001.obr. hdl:2066/199273. ISSN 2215-1931. S2CID 86440885. pronunciation researchers are primarily interested in improving L2 learners' intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners' intelligibility.
^ Bernstein, Jared; et al. (November 18, 1990), "Automatic Evaluation and Training in English Pronunciation" (PDF), First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan: International Speech Communication Association, pp. 1185–1188, retrieved 11 February 2023, listeners differ considerably in their ability to predict unintelligible words.... Thus, it seems the quality rating is a more desirable... automatic-grading score. (Section 2.2.2.)
^ Bonk, Bill (25 August 2020). "New innovations in assessment: Versant's Intelligibility Index score". Resources for English Language Learners and Teachers. Pearson English. Archived from the original on 2023-01-27. Retrieved 11 February 2023. you don't need a perfect accent, grammar, or vocabulary to be understandable. In reality, you just need to be understandable with little effort by listeners.
^ Gao, Yuan; et al. (May 25, 2018), "Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art", 2nd IEEE Advanced Information Management, Communication, Electronic and Automation Control Conference (IMCEC 2018), pp. 924–927, arXiv:1709.01713, doi:10.1109/IMCEC.2018.8469649, ISBN 978-1-5386-1803-5, S2CID 31125681
^ Alnafisah, Mutleb (September 2022), "Technology Review: Speechace", Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference (Virtual PSLLT), no. 40, vol. 12, St. Catharines, Ontario, ISSN 2380-9566, retrieved 14 February 2023
^ E.g., CMUDICT, "The CMU Pronouncing Dictionary". www.speech.cs.cmu.edu. Retrieved 15 February 2023. Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat."
^ Fu, Kaiqi; Peng, Linkai; Yang, Nan; Zhou, Shuran (18 July 2024). "Pronunciation Assessment with Multi-modal Large Language Models". arXiv:2407.09209 [cs.CL].