Out Of Vocabulary
Words that are unknown (OOV): they appear in the test corpus but not in the training corpus.
Usually words such as names and locations.
Idea: Model OOV words by
- adding a new word token, conventionally written <UNK>, to the vocabulary,
- in the training corpus, replacing the first occurrence of each previously unknown word by <UNK>,
- counting n-grams as usual, treating <UNK> as a regular word.
This trick can be refined if we have a word classifier: then use a new token per class, e.g. one for names and one for locations.
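A minimal sketch of the three steps in Python. Assumptions not taken from the notes: the token is spelled <UNK>, the function names are illustrative, and at test time every word that no longer appears in the marked training corpus is mapped to the token.

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown-word token


def mark_first_occurrences(tokens, unk=UNK):
    """Replace the first occurrence of every word type by `unk`."""
    seen = set()
    out = []
    for tok in tokens:
        if tok in seen:
            out.append(tok)   # word was seen before: keep it
        else:
            seen.add(tok)
            out.append(unk)   # previously unknown word: replace it
    return out


def count_ngrams(tokens, n):
    """Count n-grams, treating `unk` like any other word."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def map_oov(tokens, vocab, unk=UNK):
    """Assumed test-time step: map words outside `vocab` to `unk`."""
    return [tok if tok in vocab else unk for tok in tokens]


train = "the cat sat on the mat the cat ran".split()
marked = mark_first_occurrences(train)
vocab = set(marked)            # words that survive in the marked corpus, plus UNK
bigrams = count_ngrams(marked, 2)
test = map_oov("the dog sat".split(), vocab)
```

Words that occur only once in training ("sat", "on", "mat", "ran") are absorbed entirely into the token counts, so they are treated as unknown at test time as well. For the per-class refinement, `mark_first_occurrences` could take a classifier and emit a class-specific token instead of a single one.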