Difference: GrmNGramForum (114 vs. 115)

Revision 115, 2016-07-13 - BrianRoark


OpenGrm NGram Forum

  Given counts1.grm, an FST produced by ngramcount, I would have thought "ngramprint counts1.grm | ngramread - counts2.grm" would result in counts2.grm being identical to counts1.grm, but this isn't true in general. With the earnest.cnts example the new version is only slightly different in terms of file size. For part of my own corpus the difference in file size is much more substantial. Could this difference be due to symbol table changes only, or could the procedure change the FST in other ways?

BrianRoark - 2016-07-13 - 09:23

Without seeing the specifics of your model, I would speculate that when you filter your n-grams, some prefixes or suffixes of the retained n-grams are being pruned. For example, if you leave in the n-gram xyz, then the FST topology needs xy in the model (to be able to reach the history state from the unigram state) and yz in the model (to back off to). These n-grams can be re-introduced into the topology if they are missing, but their counts will be zero. Kneser-Ney modifies the counts of lower-order n-grams based on the number of higher-order n-grams that have them as a suffix, so the missing counts are repaired in that method, avoiding the error. I would have thought that your filtering method would retain prefixes and suffixes, but their absence would explain the behavior you are seeing.

It also explains the size differences. Your round trip can produce small file size differences in some cases, because some operations on the model (pruning, for example) remove 'useless' states (histories that only back off), while reading n-grams builds those states in the course of building the model topology and does not remove them. In your case, though, the large differences seem to indicate that many 'needed' n-grams (prefixes and suffixes) are being added.

You might try the following round trip to debug: ngramprint counts1.grm >counts1.ngrams.txt; ngramread counts1.ngrams.txt | ngramprint - >counts2.ngrams.txt. Then compare the resulting n-grams, which will include the costs. This should show you which n-grams ngramread had to 'hallucinate' to achieve a canonical n-gram topology. Hope that helps, and thanks for bringing up these issues!
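The comparison step of the round trip described above could be sketched in Python roughly as follows. This is only an illustration, not part of OpenGrm. It assumes each line of the ngramprint text output is the n-gram's words separated by spaces, then a tab, then a weight. The function names (parse_ngram_lines, missing_affixes, added_by_round_trip) are hypothetical, and the sketch treats <s> and </s> as ordinary tokens, which may not match the toolkit's handling exactly.

```python
from typing import Dict, Iterable, Set, Tuple

NGram = Tuple[str, ...]


def parse_ngram_lines(lines: Iterable[str]) -> Dict[NGram, str]:
    """Parse lines ASSUMED to look like 'w1 w2 w3<TAB>weight'.

    The weight is kept as a string so no number format is assumed.
    """
    table: Dict[NGram, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        words, _, weight = line.partition("\t")
        table[tuple(words.split())] = weight.strip()
    return table


def missing_affixes(ngrams: Iterable[NGram]) -> Set[NGram]:
    """Return prefixes and suffixes the topology needs but the set lacks.

    For xyz, the prefix xy (to reach the history state) and the suffix
    yz (to back off to) are required, and so on recursively.
    """
    present = set(ngrams)
    missing: Set[NGram] = set()
    stack = list(present)
    while stack:
        ngram = stack.pop()
        if len(ngram) < 2:
            continue  # unigrams need no prefix or suffix
        for affix in (ngram[:-1], ngram[1:]):
            if affix not in present and affix not in missing:
                missing.add(affix)
                stack.append(affix)  # its own affixes are needed too
    return missing


def added_by_round_trip(before: Dict[NGram, str],
                        after: Dict[NGram, str]) -> Dict[NGram, str]:
    """N-grams present after the round trip but absent before it."""
    return {ng: w for ng, w in after.items() if ng not in before}


if __name__ == "__main__":
    # Hypothetical filtered counts: xyz kept, but xy and yz pruned.
    before = parse_ngram_lines(["x\t5", "y\t4", "z\t3", "x y z\t3"])
    print(sorted(missing_affixes(before)))  # [('x', 'y'), ('y', 'z')]
```

To use it on real files, one could pass the open file objects for counts1.ngrams.txt and counts2.ngrams.txt to parse_ngram_lines and then call added_by_round_trip on the two tables. If the speculation above is right, the n-grams it reports should largely coincide with what missing_affixes predicts from the first file.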