Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization


Recent advances in deep learning and automatic speech recognition have boosted the accuracy of end-to-end speech recognition to a new level. However, recognition of personal content such as contact names remains a challenge. In this work, we present a personalization solution for an end-to-end system based on connectionist temporal classification. Our solution uses a class-based language model, in which a general language model provides modeling of the context for named entity classes, and personal named entities are compiled in a separate finite-state transducer. We further introduce a phoneme-to-wordpiece model to map rare named entities to more frequent homophonic wordpieces, and also wordpiece prior normalization to bias toward rare wordpieces, leading to another 48.9% relative improvement in personal named entity accuracy on top of an already personalized baseline. This work allows our systems to match highly competitive personalized hybrid systems on personal named entity recognition.
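
The abstract does not spell out how wordpiece prior normalization is computed. As a rough illustration only, the sketch below assumes the common formulation in which a scaled log prior is subtracted from each per-frame log posterior, so that frequent wordpieces are penalized and rare ones are favored. The function name `normalize_by_prior`, the scaling factor `alpha`, the toy vocabulary, and all probabilities are hypothetical and are not taken from the paper.

```python
import math


def normalize_by_prior(log_posteriors, log_priors, alpha=0.5):
    """Subtract alpha * log prior from every per-frame log posterior.

    Assumed formulation (not confirmed by the source):
        score(w | frame) = log p(w | frame) - alpha * log p(w)
    """
    return [
        [lp - alpha * prior for lp, prior in zip(frame, log_priors)]
        for frame in log_posteriors
    ]


def argmax_label(scores, vocab):
    """Return the vocabulary entry with the highest score."""
    best = max(range(len(vocab)), key=lambda i: scores[i])
    return vocab[best]


# Hypothetical toy vocabulary: a blank symbol, a rare wordpiece,
# and a frequent homophonic wordpiece.
vocab = ["<blank>", "_jon", "_john"]

# Hypothetical wordpiece priors, e.g. estimated from training alignments.
priors = [0.90, 0.01, 0.09]
log_priors = [math.log(p) for p in priors]

# One hypothetical frame of CTC posteriors over the vocabulary.
log_post = [[math.log(p) for p in (0.10, 0.40, 0.50)]]

scores = normalize_by_prior(log_post, log_priors, alpha=0.5)[0]

best_before = argmax_label(log_post[0], vocab)  # frequent wordpiece wins
best_after = argmax_label(scores, vocab)        # rare wordpiece wins
print(best_before, best_after)
```

In this toy frame the frequent wordpiece has the highest raw posterior, while the rare wordpiece has the highest score after normalization, which is the kind of bias toward rare wordpieces the abstract describes. How the priors are estimated and how strongly they are scaled in the actual system is not stated here.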


