Leveraging Large Language Models for Exploiting ASR Uncertainty


With the help of creative prompt engineering and in-context learning, large language models (LLMs) are known to generalize well on a variety of text-based natural language processing (NLP) tasks. However, for performing well on spoken language understanding (SLU) tasks, LLMs either need to be equipped with in-built speech modality or they need to rely on speech-to-text conversion from an off-the-shelf automation speech recognition (ASR) system. In this work, we focus on the latter setup where the accuracy of LLM on SLU tasks is constrained by the accuracy of a frozen ASR system on the given speech input. Specifically, we tackle the task of speech intent classification where a high word-error-rate (WER) implies that the LLM may not have the correct textual information to understand the spoken intent. To alleviate this problem, we propose to prompt the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We first explore prompting the LLM with descriptive prompts which explain the concept of n-best lists to invoke LLM’s emergent abilities to understand the task; followed by finetuning of LoRA adapters on the intent classification task. We demonstrate the efficacy of our approach on a binary device-directed speech detection task as well as on a keyword spotting task on Google speech commands dataset where systems using n-best list prompts outperform the ones using 1-best ASR outputs; thus paving way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.



Source link