You can use the formatting commands describes in TextFormattingRules in your comment.
If you want to post some code, surround it with <verbatim> and </verbatim> tags.
Auto-linking of WikiWords is now disabled in comments, so you can type VectorFst and it won't result in a broken link.
You now need to use <br> to force new lines in your comment (unless inside verbatim tags). However, a blank line will automatically create a new paragraph.
Hi Luke,
this is basically a floating point precision issue, the system is trying to subtract two approximately equal numbers (while calculating backoff weights). The new version of the library coming out in a month or so has much improved floating point precision, which will help. In the meantime, you can get this to work by modifying a constant value in src/include/ngram/ngram-model.h which will allow these two numbers to be judged to be approximately equal. Look for:
static const double kNormEps = 0.000001;
near the top of that file. Change to 0.0001, then recompile.
This sort of problem usually comes up when you train a model with a relatively small vocabulary (like a phone or POS-tag model) and a relatively large corpus. The n-gram counts end up not following Good-Turing assumptions about what the distribution should look like (hence the odd discount values). In those cases, you're probably better off with Witten-Bell smoothing with the --witten_bell_k=15 or something like that. Or even trying an unsmoothed model.
And stay tuned for the next release, which deals more gracefully with some of these small vocabulary scenarios.
Brian
I generated an ngram model from a .arpa file with the following command:
ngramread --ARPA lm.arpa > lm.model
ngramread does not complain, but ngraminfo and trying to load the model from C++ code generate the following error:
FATAL: NGramModel: bad ngram model topology
How can I troubleshoot the problem?
Hi,
that error is coming from a sanity check that verifies that every state in the language model (other than the start and unigram states) is reached by exactly one 'ascending' arc, that goes from a lower order to a higher order state. ARPA format models can diverge from this, by, for example, having 'holes' (e.g., bigrams pruned but trigrams with that bigram as a suffix retained). But ngramread should plug all of those. maybe duplication? I'll email you about this.
Benoit found a case where certain 'holes' from a pruned ARPA model were not being filled appropriately in the conversion. The sanity check routines on loading the model ensured that this anomaly was caught (causing the errors he mentioned), and we were able to find the cases where this was occurring and update the code. The updated conversion functions will be in the forthcoming version update release of the library, within the next month or two. In the meantime, if anyone encounters this problem, let me know and I can provide a workaround.
-- CyrilAllauzen - 09 Aug 2012