Command line utility to calculate the perplexity of a corpus given a model. Verbose mode gives the per-word contribution to the perplexity.

Out-of-vocabulary (OOV) items can be handled in two ways. If the model already contains an OOV token, which therefore has probability mass, that symbol can be specified with the switch --OOV_symbol, and every symbol not found in the vocabulary is mapped to it. If the model has no OOV symbol with allocated probability mass, the option --OOV_probability allocates unigram probability mass to the class of OOVs.

Note that any OOV symbol represents a class of words. To assign an appropriate probability to any given instance, the class probability should be shared among the members of that class. This requires an OOV class size, set with --OOV_class_size, which defaults to 10000.
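As a rough illustration of how the class probability might be shared among the members of the OOV class, the following Python sketch divides the class mass evenly by the class size. The function name `oov_instance_probability` is hypothetical, and the uniform split is an assumption about what "shared among the set" means; it is not taken from the ngramperplexity source.

```python
import math


def oov_instance_probability(class_probability, class_size=10000.0):
    """Probability of one OOV word, assuming the class mass is split evenly.

    class_probability: mass given to the whole OOV class (cf. --OOV_probability)
    class_size: assumed number of distinct OOV words (cf. --OOV_class_size)
    """
    if class_size <= 0:
        raise ValueError("class_size must be positive")
    return class_probability / class_size


# Hypothetical values: 0.01 of unigram mass for OOVs, default class size.
p = oov_instance_probability(0.01)
print(p, math.log10(p))
```

Under this assumption, a larger class size lowers the probability of each individual OOV word and therefore raises the reported perplexity.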


ngramperplexity [--options] ngram.fst [in.far [out.txt]]
  --OOV_symbol: type = string, default = ""
  --OOV_class_size: type = double, default = 10000
  --OOV_probability: type = double, default = 0


$ ngramperplexity earnest.aa.mod earnest.ab.far


If no --OOV_symbol is specified and --OOV_probability is zero, any OOVs encountered, which would receive zero probability under these settings, are ignored in the perplexity calculation.
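The effect of ignoring zero-probability OOVs can be sketched in Python as follows. This is a minimal illustration, not the tool's implementation: the function name `perplexity` and the input probabilities are hypothetical, and details such as how end-of-sentence tokens are counted are left out.

```python
import math


def perplexity(word_probs):
    """Perplexity over the words that have nonzero probability.

    Words with probability 0 (unhandled OOVs) are dropped from both the
    log-probability sum and the word count, rather than making the
    perplexity infinite.
    """
    scored = [p for p in word_probs if p > 0.0]
    if not scored:
        raise ValueError("no words with nonzero probability")
    avg_neg_log = -sum(math.log(p) for p in scored) / len(scored)
    return math.exp(avg_neg_log)


# Hypothetical per-word probabilities; the second word is an unhandled OOV.
ppl = perplexity([0.25, 0.0, 0.25, 0.25])
print(ppl)  # about 4: the OOV is skipped, leaving three words at 0.25
```

Because skipped words do not count against the model, perplexities computed this way are only comparable between models that ignore the same set of OOVs.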

Topic revision: r5 - 2012-03-04 - BrianRoark