OpenGrm NGram Forum

BrianRoark - 2014-05-15 - 11:39

Hi Aaron,

ngramcount wants an explicit symbol table, which is why the utf8 token_type for farcompilestrings isn't working. And, correct, farcompilestrings uses whitespace as the symbol delimiter, so representing spaces is a problem. Agreed, it would be nice to have that option in ngramcount, perhaps in a subsequent version...

In the meantime, we generally use underscore as a proxy for space for these sorts of LMs. So: convert whitespace to underscore, then whitespace-delimit the characters and run as with a standard corpus. Then you just have the chore of converting to/from underscore when using the model.

As an aside, you might find Witten-Bell to be a good smoothing method for character-based LMs, or for any scenario with a relatively small vocabulary and a large number of observations. You can set the witten_bell_k switch to a value above 10, and that should give you better regularization.

Hope that helps.
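
For what it's worth, the underscore conversion can be sketched in Python roughly as below. This is only an illustration, not part of OpenGrm: the function names `to_char_tokens` and `from_char_tokens` are made up here, and the sketch assumes the corpus contains no literal underscores (otherwise you would need a different proxy symbol) and that collapsing runs of whitespace into a single space is acceptable.

```python
def to_char_tokens(line: str, space_proxy: str = "_") -> str:
    """Turn a line of text into whitespace-delimited character tokens.

    Runs of whitespace are collapsed and replaced by `space_proxy`,
    so "the cat" becomes "t h e _ c a t".
    Assumes `space_proxy` does not otherwise occur in the text.
    """
    words = line.split()              # drops leading/trailing/repeated whitespace
    joined = space_proxy.join(words)  # real spaces become the proxy symbol
    return " ".join(joined)           # one token per character


def from_char_tokens(tokens: str, space_proxy: str = "_") -> str:
    """Invert to_char_tokens: drop the delimiters, restore real spaces."""
    return "".join(" " if tok == space_proxy else tok for tok in tokens.split())


if __name__ == "__main__":
    encoded = to_char_tokens("the cat  sat")
    print(encoded)
    print(from_char_tokens(encoded))
```

Each output line is then an ordinary whitespace-delimited "sentence" whose tokens are single characters, so it can go through the usual symbol table and farcompilestrings steps like any word-level corpus.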


