Contextual Biasing (2022)


Objectives

  • We want to improve the accuracy of the Transducer model on a specific domain for which we only have text data (e.g., a contact list from your phone).
  • We want to do so without extra training of either the Transducer model or a separate language model.

Previous Work

  • Shallow fusion
    • Build a language model (an n-gram model if extra training is not allowed) and linearly combine the next-token log-probabilities predicted by the speech model and the language model.
    • An alternative is to estimate the internal language model (ILM) and subtract its score from the speech model's score before adding the external language model (HAT); a hedged sketch of both scoring schemes follows this list.
      • Here the internal language model is estimated as Joint(Pred), under the assumptions that Joint(Tran + Pred) = Joint(Pred) + Joint(Tran) and that the prediction network holds the language information.
  • Fine-tuning
    • Fine-tune the prediction network with the rest of the model frozen. [paper]
    • Fine-tune the joint network only.
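
A minimal sketch of the two shallow-fusion scoring schemes above, applied to one decoding step. The function names, the toy vocabularies, and the weights `lm_weight` / `ilm_weight` are hypothetical illustrations, not values from any of the cited papers.

```python
import numpy as np

def shallow_fusion_score(asr_logprobs, ext_lm_logprobs, lm_weight=0.3):
    """Plain shallow fusion: linearly combine ASR and external LM log-probs."""
    return asr_logprobs + lm_weight * ext_lm_logprobs

def hat_fusion_score(asr_logprobs, ext_lm_logprobs, ilm_logprobs,
                     lm_weight=0.3, ilm_weight=0.3):
    """HAT-style fusion: subtract the estimated internal LM score
    (e.g., Joint(Pred)) before adding the external LM score."""
    return asr_logprobs - ilm_weight * ilm_logprobs + lm_weight * ext_lm_logprobs

# Toy 4-token vocabulary; the probabilities are illustrative only.
asr    = np.log(np.array([0.6, 0.2, 0.1, 0.1]))
ext_lm = np.log(np.array([0.1, 0.7, 0.1, 0.1]))
ilm    = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
print(shallow_fusion_score(asr, ext_lm))
print(hat_fusion_score(asr, ext_lm, ilm))
```

In both cases the combined scores are used to rank hypotheses during beam search; the HAT variant tries to avoid double-counting the language information already absorbed by the prediction network.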

Experiments

  • What if we build a detachable language model and feed its output into the prediction network instead of the previous token? (A rough sketch of this idea follows this list.)
  • While the ASR accuracy on the target domain increased, there were more addition (insertion) errors in the results.
    • Repeated text tokens occur because the blank token is not predicted properly.
    • The projection layer in the Joint network captures the pattern in the Transcription/Prediction vectors that indicates when a text token should be emitted.
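
A rough sketch of the experiment's idea, assuming a toy Transducer whose prediction network normally consumes the previous token's embedding; here a detachable LM's hidden state is fed in instead. All module names and dimensions (`DetachableLMPrediction`, `external_lm`, etc.) are hypothetical and only illustrate the wiring, not the actual model used.

```python
import torch
import torch.nn as nn

class DetachableLMPrediction(nn.Module):
    """Prediction network that consumes an external LM's hidden state
    instead of the raw previous-token embedding (hypothetical sketch)."""

    def __init__(self, vocab_size=1000, lm_hidden_dim=256, pred_dim=320):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, lm_hidden_dim)
        # Detachable LM: could be swapped per domain without retraining the Transducer.
        self.external_lm = nn.LSTM(lm_hidden_dim, lm_hidden_dim, batch_first=True)
        self.pred_rnn = nn.LSTM(lm_hidden_dim, pred_dim, batch_first=True)

    def forward(self, prev_tokens):
        # Standard Transducer: pred_rnn(embed(prev_tokens)).
        # Sketch of the experiment: the LM output replaces the previous-token
        # embedding as the input to the prediction network.
        emb = self.token_embed(prev_tokens)
        lm_out, _ = self.external_lm(emb)
        pred_out, _ = self.pred_rnn(lm_out)
        return pred_out

prev_tokens = torch.randint(0, 1000, (2, 5))   # (batch, label sequence)
print(DetachableLMPrediction()(prev_tokens).shape)  # torch.Size([2, 5, 320])
```

Because the joint network was trained on the original prediction vectors, changing their distribution in this way can plausibly disturb the blank/label decision, which is consistent with the insertion errors observed above.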

Similar Work

  • Modular Hybrid Autoregressive Transducer [paper]
    • The decoder is explicitly divided into “label decoder” and “blank decoder”.
    • The AM score and ILM score are combined more simply, whereas the original Joint network is used for predicting the blank label. (A hedged sketch of this split follows this list.)
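
A hedged sketch of that label/blank split, assuming encoder and decoder states are already available. The additive combination of AM and label-decoder (ILM) scores and all names here are illustrative; the exact formulation is in the Modular HAT paper.

```python
import torch
import torch.nn as nn

class ModularJointSketch(nn.Module):
    """Illustrative split: a joint projection scores only the blank label,
    while label scores come from a simple AM + label-decoder combination."""

    def __init__(self, enc_dim=320, dec_dim=320, vocab_size=1000):
        super().__init__()
        self.blank_joint = nn.Linear(enc_dim + dec_dim, 1)   # blank-decoder path
        self.am_proj = nn.Linear(enc_dim, vocab_size)        # acoustic label score
        self.ilm_proj = nn.Linear(dec_dim, vocab_size)       # label-decoder (ILM) score

    def forward(self, enc_t, dec_u):
        # enc_t: (batch, enc_dim) encoder frame; dec_u: (batch, dec_dim) decoder state.
        blank_logit = self.blank_joint(torch.cat([enc_t, dec_u], dim=-1))
        label_logits = self.am_proj(enc_t) + self.ilm_proj(dec_u)  # "joined more simply"
        return torch.cat([blank_logit, label_logits], dim=-1)

enc_t = torch.randn(2, 320)
dec_u = torch.randn(2, 320)
print(ModularJointSketch()(enc_t, dec_u).shape)  # torch.Size([2, 1001])
```

Separating the blank decision from the label scores is what makes the ILM term easy to isolate and, in principle, to replace with a domain-specific language model.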

Inspirations

  • Training does not necessarily follow the "design" of the model; we need to conduct experiments to confirm what the learned representations actually mean.
  • What are the ways to explicitly encourage the model to be trained as designed (model structure, objective functions, etc.)?