Forum Web>GrmNGramForum (revision 102)~~EditAttach~~

OpenGrm NGram Forum

You need to be a registered user to participate in the discussions.
Log In or Register

OpenGrm NGram Forum

You can start a new discussion here:

You can use the formatting commands describes in TextFormattingRules in your comment.
If you want to post some code, surround it with <verbatim> and </verbatim> tags.
Auto-linking of WikiWords is now disabled in comments, so you can type VectorFst and it won't result in a broken link.
You now need to use <br> to force new lines in your comment (unless inside verbatim tags). However, a blank line will automatically create a new paragraph.

Subject
Comment

Context across multiple sentences.

PafnoutiT - 2016-04-26 - 08:46

Hi,

In my training corpus, I have paragraphs, with many sentences following each other. I would like to be able to keep the context across these sentences. It would be equivalent to computing an ngram on a sentences formed with this pattern: <s> (word+ <s>). And here I would have only one symbol for start/end of sentence (so no </s>).

Is it possible to do this with OpenGrm?

Best regards.

PafnoutiT - 2016-04-26 - 09:27

Basically this unique start/end of sentence symbol should behave like any other word, except that it should also be a start and end state of the fst I think. I thought about having a different symbol explicitly in my text like <period>, but then <s> and </s> would not get the right ngram counts.

BrianRoark - 2016-04-26 - 11:43

Hi,

Haven't really done this before, but should be possible, though you'll need to do some FST modification after training, so it requires a bit of FST know-how. Yes, having a single long string with a sentence delimiter token is the right way to go, something like <period>. Pad the beginning and end of your corpus with k <period> tokens, where k is the ngram order of your model. Once the n-gram model is build with OpenGrm, you'll want to make the bigram <period> state the start state, either by explicitly doing this or by replacing all arcs leaving the start state (including backoff) with a single (free) epsilon arc to the bigram <period>state. Note that the final cost at that state will be quite high, since you only really end the string at the end of the corpus, which makes the model a bit weird, so maybe pad any string you are scoring with multiple <periods> to get the right exit probability. As for replacing the arc, a quick way to do this is to use fstprint to create a textual representation of the ngram model. The first printed arcs will be leaving the start state. The epsilon arc from there points to the unigram state, and the <period> arc from the unigram state will point to the bigram state that you want to be your new real start state. Then just get rid of all the start state arcs and add just one with epsilons to the desired target state, all in the textual representation. Then use fstcompile to compile the model. Quite a bit of a work around, but should give you the functionality you want.

PafnoutiT - 2016-04-27 - 05:00

Thanks. It's a bit convoluted, I hoped something simpler! So there is no way of specifying the sos/eos words when building the ngram? If I could set both of them to be <period> that could solve my problem. If I wanted to modify the source code to do that, where should I look at?

BrianRoark - 2016-04-27 - 10:01

sos/eos are not words/symbols in the FST model. sos is represented by the start state, and eos is represented by the final cost. You can get closure (representing many strings in sequence) by using fstclosure, which will simply create epsilon arcs from final states back to the start state. But that won't capture cross sentence dependencies (i.e., P(the | end <period>)) which is what I think you want. In essence, these are no longer sos/eos, because the 'string' that you are modeling is the paragraph as a whole. Actually, now that I think of it, if it is paragraph that you want to model, the sos/eos would just be start-of-paragraph and end-of-paragraph and you could simply concatenate the sentences together with a <period> delimiter and that would be fine. The probabilities at start-of-paragraph would not be the same as at the start of a sentences (which is fine, because there is no previous sentence in the paragraph to condition the probability on), and the probability of end-of-paragraph would be a separate prediction in addition to end of sentence. But that would be a proper 'paragraph' granularity model that should just work. If you want start of paragraph to be the same as the bigram start of sentence history, that's when all the FST hacking starts, but isn't obvious to me that this is absolutely needed. As far as changing the way start/stop of string works in the library, the concept of start state encoding <s> and final cost encoding </s> is pretty central to the whole thing, might as well rewrite from scratch. hope that helps.

PafnoutiT - 2016-04-27 - 11:14

Thanks for your help, I see the issue.

The use would be to score an arbitrary number of sentences in a row without the concept of paragraph. So I'd like just to model the transition between these sentences, to have the right sos/eos context not only at the beginning and end of this batch, but also across the sentences.

For example the score of "<period> Hi <period> How are you <period>" should be higher than "<period> Bye <period> How are you <period>".

I used SRILM in the past which allows not to have counts for <s> and </s>, and I wanted to switch to something a bit more modern, but maybe that OpenGrm is not adapted to my use.

make fails when compiling Ngram on OSX 10.9.5

KorNat - 2016-04-13 - 09:25

Hi, I should install OpenGRM Ngram as one of the dependencies for the gramophone tool (http://kaskade.dwds.de/~moocow/gramophone/README.txt

) which I'd like to try out. I have installed the OpenFst tool which was also on the list of dependencies. Then I tried to install Ngram 1.1.0 (this version was tested with gramphopone). I ran the configure command: <verbatim> $./configure CXX="clang++ -std=c++11 -stdlib=libstdc++" </verbatim> and it didn't yield any errors, so I continued with the make command and got the following error: <verbatim> /bin/sh ../../libtool --tag=CXX --mode=link clang++ -std=c++11 -stdlib=libstdc++ -g -O2 -L/usr/local/lib/fst -lfstfar -lfst -lm -ldl -o ngramapply ngramapply.o ../lib/libngram.la libtool: link: clang++ -std=c++11 -stdlib=libstdc++ -g -O2 -o .libs/ngramapply ngramapply.o -Wl,-bind_at_load -L/usr/local/lib/fst -lfstfar -lfst -lm -ldl ../lib/.libs/libngram.dylib ld: warning: directory not found for option '-L/usr/local/lib/fst' ld: library not found for -lfstfar clang: error: linker command failed with exit code 1 (use -v to see invocation) make[3]: * [ngramapply] Error 1 make[2]: * [all-recursive] Error 1 make[1]: * [all-recursive] Error 1 make: * [all] Error 2 </verbatim>

I have spent many hours trying to resolve this problem, but with no success. I hope someone here can help me. Thank you!

BrianRoark - 2016-04-14 - 11:32

It looks like it can't find the FAR extension to OpenFst. When you install OpenFst, you need to configure with --enable-far so that this particular extension is enabled.

KorNat - 2016-04-17 - 13:00

Thank you for your answer, Brian. I compiled the OpenFST 1.5.2 (read somewhere that python bindings are supported in this version) with this configuration: ./configure --enable-pdt --enable-bin --enable-ngram-fsts --enable-far=true --enable-python. I am trying to continue my Ngram compilation, indicating the include path with fst headers (otherwise they are not found at all): ./configure CPPFLAGS="-I../openfst-1.5.2/src/include/". Now the headers are kinda found but looks like they cannot be compiled... checking fst/fst.h usability... no checking fst/fst.h presence... yes configure: WARNING: fst/fst.h: present but cannot be compiled configure: WARNING: fst/fst.h: check for missing prerequisite headers? configure: WARNING: fst/fst.h: see the Autoconf documentation configure: WARNING: fst/fst.h: section "Present But Cannot Be Compiled" configure: WARNING: fst/fst.h: proceeding with the compiler's result configure: WARNING: ## ------------------------------------ ## configure: WARNING: ## Report this to ngram@www.opengrm.org ## configure: WARNING: ## ------------------------------------ ## checking for fst/fst.h... no configure: error: fst/fst.h header not found.

The autotools' page ( https://autotools.io/autoconf/finding.html ) says it has something to do with the incompatibility of the C dialects.. Could it be true? and if so, which C dialect should I use?

BrianRoark - 2016-04-18 - 01:53

yeah, latest OpenGrm NGram library was released a couple of months ago, built to be compatible with OpenFst 1.5.1, and we haven't got the 1.5.2 compatible version out yet (working on it). If you are really interested in the ngram library, you'll have to work with 1.5.1 for the next week or two. We are planning a fairly large release quite soon, so keep an eye peeled for that. But in the meantime, you can access OpenFst version 1.5.1 on openfst.org.

KorNat - 2016-04-18 - 09:17

I installed the 1.5.1 (at least it compiled without errors), but I am still getting the same "Present But Cannot Be Compiled" error for fst.h when running ./configure CPPFLAGS="-I../openfst-1.5.1/src/include/"

BrianRoark - 2016-04-18 - 13:34

Sorry, I missed that you were installing NGram 1.1.0. The new openfst has been ported to C++11, so, yeah, this is the issue. you'll need to either (1) include a newer version of opengrm or (2) include an older version of openfst. Looks like openfst 1.3.4 was the last version that is not C++11.

KorNat - 2016-04-19 - 11:49

Hi Brian, thank you for pointing out this version story )) I followed your first advice: I got the Ngram 1.2.2 and looks like it compiled just fine! Now I will struggle with the pyfst python binding. Thank you for being so helpful and responsive.

loading ngram format fst fail on arm

JerryLin - 2016-02-20 - 09:24

Hi all, I built G.fst in ngram format on x86 , and run on x86. it works fine I built G.fst in ngram format on arm , and run on arm, it throws bad_malloc exception. if you built G.fst in vector format on arm, and run on arm, it works fine. How to solve the problem of loading ngram format fst ? Thank you.

BrianRoark - 2016-02-20 - 14:18

I haven't played with that particular processor combination, but it is often the case that binary format FSTs built on one system architecture won't be readable in another. The easy workaround is to use fstprint (from the openfst library) to print the FST to a text representation on x86, then use fstcompile (also from the openfst library) on arm to get the binary format on that side. Keep in mind that you'll need the symbol table, too, so fstprint has flags to save_isymbols to a file; and fstcompile allows you to specify the isymbols and osymbols (both are the same in the case of the ngram models, what you saved from fstprint) and also that you want to keep_isymbols and keep_osymbols with the model. That would result in something that is in standard opengrm FST format.

Error Compiling OpenGRM

YossarianLives - 2015-08-31 - 05:15

Hi, I want to run my code (below) : <verbatim>

using namespace ngram; using namespace fst; using namespace std;

int main() { cout<<"\n Hello Wordl" <<endl;

// ngramcount -order=5 in.fst > out.fst NGramCounter<Log64Weight> ngram_counter(3); StdMutableFst *fst = StdMutableFst::Read("in.fst", true); ngram_counter.Count(*fst); VectorFst<StdArc> ofst; ngram_counter.GetFst(&ofst); ofst.Write("out.fst");

return(0); } </verbatim>

but it gives me the error:

<verbatim>In function `fst::CompatProperties(unsigned long, unsigned long)': ... undefined reference to `fst::PropertyNames' undefined reference to 'fst::PropertyNames'</verbatim>.

I think that it is a link error, but i can't get any solution. Thanks.

BrianRoark - 2015-08-31 - 12:00

Not sure how much help we can be in these cases, this depends on the OS, the way in which you installed openfst, and so many other factors local to your setup. It certainly isn't finding something and you are likely correct that this is a linking error. I would suggest double checking that your library linking paths are correct. perhaps reinstalling openfst and opengrm with as few divergences from standard installation as possible. Not sure that helps. Good luck.

Class based ngrams

BillyK - 2015-03-15 - 14:10

Hi is it possible to create class-based ngrams with opengrm? I have been creating them in SRILM and then converting them, but I was wondering if there is a method for doing this in opengrm. smile

BrianRoark - 2015-03-17 - 12:07

yes, you can build class-based models with opengrm, though you have to build a second transducer that maps from words to classes. (See section 5 of this paper: http://www.aclweb.org/anthology/P/P03/P03-1006.pdf

.) For example, if you are using POS-tags, then you would have a single state transducer T mapping from words to POS-tags, each arc having the word on the input side and the POS-tag on the output side, weighted with the cost -log P(word | tag). Then train a standard ngram model on POS strings, which gives you a WFST G encoding the sequential dependencies of the classes. Finally, compose T and G, and this is your class-based language model. The paper above contains information on more general cases, such as classes including multiple words. While there is no single command to run and produce a class-based language model, the WFST encoding of the models makes this quite straightforward by using composition. Hope that helps!

brian

outputs of fstprint and of ngramprint | ngramread | fstprint different

GezaKiss - 2015-02-26 - 10:46

Why is it that the above two command sequences result in very different outputs for automata created using ngramcount | ngrammake? When using an n-gram order of 1, the first one has just one state (and numerous arcs), the second one has numerous states (half as many as arcs). Is there a way to print and read back the n-gram model without changing it? Thanks! Géza

BrianRoark - 2015-02-26 - 19:56

Hi Géza,

Short answer: use the --ARPA flag to produce ARPA format models, which do a better job of that sort of round-trip conversion. The other format is more for inspection -- for one thing, the backoff information is elided unless explicitly asked for. But to answer your question about what's going on here, for the single state unigram model, it seems that ngramread is constructing the ngram automaton so that each unigram points to a bigram state, but since there are no bigrams, that bigram state backs off (with probability 1, cost 0) to the unigram state. So you get the unigram arc, a bigram state and the backoff arc (hence 2x the number of arcs as states). This is technically a correct topology for the model, but not the most compact that it could be. If you are only using a unigram, you could perform epsilon removal (fstrmepsilon) after ngramread and you would get what you want. For higher order ngram models, that's not going to work. There I would suggest the --ARPA flag to use that format.

eliminating n-grams with identical outputs?

GezaKiss - 2015-02-15 - 12:55

Hi there! Is there a command to remove arcs that have the same input symbol as another arc in the WFST model, but a different output symbol and a higher cost? Thanks!

Background: I created a transducer from n-grams. But the resulting model contains multiple different outputs for some input symbol sequences, and so it cannot be determinized. Just removing input string with multiple output sequences from the training data doesn't solve this unfortunately.

BrianRoark - 2015-02-17 - 06:26

Hi Geza,

no, there is no command to do this, but it is related to some work we did using the lexicographic semiring to combine output labels with standard costs in a new, composite cost, which then allows determinization to remove higher cost arcs with the same input. (See http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00198) Within a language model, however, such determinization may be costly, depending on the amount of ambiguity, given that it's a cyclic automaton -- in fact, under certain circumstances it may not be determinizable. If you perform the determinization after composing the model with inputs -- on a lattice of possible outputs -- then things should work out.

not sure if this helps.

brian

GezaKiss - 2015-02-26 - 03:19

Hi Brian, Thanks for your answer! I realize that my question was naive anyway, as of course you cannot do this on the level on n-grams, and you do need that ambiguity to find the best path. Thanks for this paper! The approach you mention is available through in C++ only, right? Is perhaps a description available somewhere that focuses on how to use it (technical details)? Thanks! Géza

Error compiling with nonstandard OpenFST location

CooperE - 2014-07-29 - 16:52

Hello,

I am trying to install OpenGrm on Ubuntu while linking it to a nonstandard OpenFST installation location. I ran configure with the following options:

./configure LDFLAGS=-L/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/include CPPFLAGS=-I/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/include --prefix=/proj/speech/tools/phonetisaurus/3rdparty/opengrm-ngram-1.0.3/installation/

Then I noticed that the compilation was still looking for OpenFST stuff in the standard install location, so I saw that I had to change AM_LDFLAGS in src/bin/Makefile to the following:

AM_LDFLAGS = -L/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/lib/fst -lfstfar -lfst -lm -ldl

When I run make, I still get what appears to be a linking error. The output is very large so you can see it here:

http://pastebin.com/1ms42A2b

Am I missing any flags to link to OpenFST? Any pointers would be greatly appreciated.

BrianRoark - 2014-07-30 - 10:40

One thing you might try is changing:

libngram_la_LDFLAGS = -version-info 1:0:0

in src/lib/Makefile.am to

libngram_la_LDFLAGS = -version-info 0:1:0 -lfst

prior to configure to make FST linking explicit. (h/t to Giulio Paci for this.)

word prediction using openfst/opengrm-ngram

GeorgSchlunz - 2014-07-17 - 12:25

I would like to write an Android app that does word prediction and completion as you type, using an n-gram language model.

I find a lot of references on google on how to build language models and evaluate their perplexity (srilm, mitlm, etc. and here, openfst and opengrm-ngram), but not a lot on how to apply them to a word prediction problem in practice. I am completely new to wfsts. Is there perhaps a standard recipe for word prediction using openfst and opengrm-ngram?

From what I can gather, it seems that you must build the language model fst, then compose it with an fst that is built from the seen word sequence up to the point in question, and finally do a shortest path search. Perhaps something along the lines of the following? :

<verbatim> fstcompose words.fsa lm.fst | fstshortestpath > words.fst </verbatim>

where lm.fst is hopefully a language model fst that can be created by opengrm-ngram, as described in http://www.openfst.org/twiki/bin/view/GRM/NGramQuickTour?

How would words.fsa be created?

How would the resulting words.fst be processed to obtain the n-best predicted words? Perhaps using something like? :

<verbatim> fstproject --project_output </verbatim>

Regarding Android, I see that they have packaged openfst for the platform. Is a language model fst created by opengrm-ngram completely "backwards-compatible" with openfst, so that I can load the model fst and do composition and shortest path search with only the openfst API in Android?

Thanks in advance!

BrianRoark - 2014-07-21 - 18:43

Hi,

there are some wrinkles to using a language model (LM) fst in the way you describe. But as to your last question, the language models that are produced are simply FSTs in the format of openfst, so you should be able to use them via the openfst API in Android. The problem with the fstcompose approach that you detail is that the opengrm LM fst provides probabilities over whole sequences, not just the next word. One approach to building words.fsa would be to make this represent abc(sigma) where abc is your history and sigma ranges over your possible next words. But the probabilities that you would derive for each x would be for string the abcx, where x is the last word in the string. This doesn't include probability for abcxy. instead, to get the right probability over all possible continuations, you want words.fsa to represent abc(sigma)*. That would give you the right probability mass, but unfortunately would be expensive to compute, especially after each word is entered. The better option is for you to use the C++ library functionality to walk the model with your history, find the correct state in the model (the model is deterministic for each input, when treating backoff arcs as failures), then collect the probabilities for all words from that state, following backoff arcs correctly to achieve appropriate smoothing. This requires being very aware of exactly how to walk the model and collect probabilities for all possible following words, then finding the most likely ones, etc., efficiently. It is possible, but non-trivial. As with other openfst functionality, it is there for you to create your own functions, but not much in the way of hand holding to get you there. Very interesting application, though; and by the end you'd understand the FST n-gram model topology well and be able to build other interesting applications with it. Hope that answers at least some of your question.

Brian

Creating a character n-gram model

AaronDunlop - 2014-05-14 - 17:49

Is there a path similar to that in the quick tour (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/NGramQuickTour#OpenGrm_NGram_Library_Quick_Tour

) for character n-gram models?

I have a trivial corpus that works fine with the instructions described there for a word n-gram model, but I can't figure out how to use ngramcount to build a character model.

I'm trying (with a 4-line corpus file titled 'animals.txt'):

farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt >animals.char.far ngramcount -order=5 animals.char.far >animals.cnt

The 'farcompilestrings' command appears to work, but ngramcount fails with the error "ERROR: None of the input FSTs had a symbol table". I've tried with '-keep_symbols=0', and and without any '-keep_symbols' argument, and the results seem to be the same

I tried creating a symbol file with the UTF-8 characters in my 'corpus', but farcompilestrings doesn't like the space symbol in that file (it reports 'ERROR: SymbolTable::ReadText: Bad number of columns (1), file = animals.char.syms, line = 5:<>').

It seems like there must be an option to tell ngramcount to use bytes or UTF-8 characters as symbols (analogous to farcompilestrings '-token_type=utf8'), but I haven't found it.

Thanks in advance for any suggestions.

BrianRoark - 2014-05-15 - 11:39

Hi Aaron,

ngramcount wants an explicit symbol table, which is why the utf8 token_type for farcompilestrings isn't working. And, correct, farcompilestrings uses whitespace as the symbol delimiter, so representing spaces is a problem. Agreed, it would be nice to have that option in ngramcount, perhaps in a subsequent version... In the meantime, we generally use underscore as a proxy for space for these sorts of LMs. So, convert whitespace to underscore, then whitespace delimit and run as with a standard corpus. Then you just have the chore of converting to/from underscore when using the model. As an aside, you might find Witten-Bell to be a good smoothing method for character-based LMs, or any scenario with a relatively small vocabulary and a large number of observations. You can set the witten_bell_k switch to be above 10, and that should give you better regularization. Hope that helps.

brian

CyrilAllauzen - 2014-05-16 - 16:09

Hi Aaron,

You were actually on the right track.

When using farcompilestrings, the symbol table options are ignored. So just do: farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt.
When using ngramcount, use --require_symbols=false.
When generating your utf8 symbol table, make sure to use tabs as separator between the utf-8 character and integer label.
Use fstsymbols --isymbols=your_utf8_symbols --osymbols=your_ut8_symbols -- --fst_field_separator=`echo -e "\t"` animals.cnt animals_with_symbols.cnt.

Then the rest of the ngram pipeline should work using animals_with_symbols.cnt.

Cyril

CyrilAllauzen - 2014-05-16 - 16:14

Hi Aaron,

For step 3, remember to include 0 as integer label for epsilon (e.g. <epsilon>) and choose something else than an actual tab for the symbol corresponding to the utf8 integer value for tab.

Cyril

BrianRoark - 2014-05-17 - 18:24

D'oh! Learn something new every day... cool, thanks Cyril.

brian

AaronDunlop - 2014-05-20 - 14:04

Nice! Thanks, Cyril (and Brian - I'd gotten your method to work as well, but the utf8 symbol table should be cleaner).

Error in installing opengrm

AbhirajTomar - 2014-04-04 - 05:06

I get the following error when i execute the make command for opengrm installation

/usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties' /usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties' /usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties' collect2: ld returned 1 exit status make[3]: * [ngramapply] Error 1 make[3]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3/src/bin' make[2]: * [all-recursive] Error 1 make[2]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3/src' make[1]: * [all-recursive] Error 1 make[1]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3' make: * [all] Error 2

i have unstalled openfst multiple times, still cant figure out what needs to be done. I'd really appreciate some help

CyrilAllauzen - 2014-04-07 - 14:16

It looks like ld is not finding the OpenFst library shared objects. It would be great is you could show us what was the ld command line make was trying to run at that time. Also could you tell us which configure=/=make commands/parameters you used to install OpenFst and try to compile OpenGrm. Finally, which OS are using?

Negative weights associated with epsilon transitions

VinodhR - 2014-04-02 - 16:36

I understand that the transition weights in the ngram model are negative log probabilities. (correct me if I am wrong)

But I can see that the epsilon transitions also take negative weights. What do they actually represent ?

BrianRoark - 2014-04-03 - 12:53

epsilon transitions are for backoff, and the weights on those arcs are negative log alpha, where alpha is the backoff constant that ensures normalization over all words leaving that state. Ideally, those backoff transitions are interpreted as failure transitions, i.e., only traversed if the symbol does not label a transition leaving the state. See the ngramapply command for an example of the use of a phi matcher to handle failure transitions.

Brian

Constraining the minimum length of sentences in random generating

VivekKulkarni - 2014-03-18 - 21:31

While sampling sentences using ngramrandgen do we have an option to constrain the minimum length of sentence generated just as we have for maximum length

BrianRoark - 2014-03-19 - 09:10

Sorry, no, we don't have that option. The solution is to oversample, then throw away strings with length less than your minimum using grep or something like that.

Models without <s> or </s>

PaidiCreed - 14 Jan 2014 - 12:34

Is it possible to build models from raw ngram counts which do not contain the symbols for start and end of sentence?

BrianRoark - 16 Jan 2014 - 14:26

Hi,

To build an automaton that has valid paths, you need to specify some kind of termination cost, which is what the final symbol allows for. So, you can't really do without having an entry for end-of-string in your raw n-gram counts; otherwise, it will represent this as an automaton that accepts no strings (empty) since no string can ever terminate. If you have a large set of counts without start or stop symbols, you should just add a unigram of the end-of-string symbol with some count (maybe just 1). The start symbol you can do without (though it is often an important context to include). Hope that helps.

brian

Perplexity without <s> or </s>

ErKn - 05 Sep 2013 - 03:35

I have been using ngramapply and ngramperplexity tools. I am wondering if there is an option to specify not to use the ending state in the calcul of logprob.

ErKn - 05 Sep 2013 - 03:36

ending state = </s>

BrianRoark - 05 Sep 2013 - 09:44

Hi,

if you run it in verbose mode (ngramperplexity --v=1 as shown in the quick tour) then you can see the -log probability of each item in the string. You can then read this text into your own code to calculate perplexity in whatever way you wish. However, typically the end of string marker is included in standard perplexity calculations, and should be included if you are comparing with other results. No options to exclude it in the library. Hope that helps.

brian

ErKn - 05 Sep 2013 - 10:10

Hi,

Thanks you for the hint. Could you tell me why I get different perplexity values when running ngramapply and ngramperplexity tools (in verbose mode) with same data/parameters.

JosefNovak - 06 Sep 2013 - 06:28

ngramapply only applies the LM weights to your input token sequence, it does not compute perplexity.

The total weight output by ngramapply for each string should be equal to the logprob(base 10) generated in the verbose output for ngramperplexity (but note that the ngram apply value is negative log using base e).

If you want to convince yourself you can try this:

#Apply the model to your test data
$ ngramapply toy.mod test.far test.scored.far
#Extract the individual results
$ farextract --filename_prefix=tst test.scored.far
#Print out one of the sentences, which now have individual arc weights
$ fstprint tsttest.sent-1
0   1   a   a   0.750776291
1   2   b   b   1.20206916
2   3   a   a   1.43548453
3   4   a   a   0.864784181
4   5   a   a   0.864784181
5   1.27910674
#Add all the values up and compute the perplexity in the normal manner
$ python
>>> import math
>>> -1.*(0.750776291+1.20206916+1.43548453+0.864784181+0.864784181+1.27910674)/math.log(10)
-2.778184008253953
>>> math.pow(10, ((0.750776291+1.20206916+1.43548453+0.864784181+0.864784181+1.27910674)/math.log(10))/(5+1))
2.9042277315216354

Those values should match whatever you get out of ngramperplexity (again careful about the log base) in verbose mode for each individual sentence. In my toy example case above:

$ ngramperplexity --v=1 toy.mod test.far
WARNING: OOV probability = 0; OOVs will be ignored in perplexity calculation
a b a a a
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( a | <s> )                         = [2gram]  0.326058
        p( b | a ...)                        = [2gram]  0.522052
        p( a | b ...)                        = [1gram]  0.623423
        p( a | a ...)                        = [2gram]  0.375571
        p( a | a ...)                        = [2gram]  0.375571
        p( </s> | a ...)                     = [2gram]  0.555509
1 sentences, 5 words, 0 OOVs
logprob(base 10)= -2.77818;  perplexity = 2.90423

Experimenting with =ngramnormalize

JosefNovak - 02 Sep 2013 - 08:38

I have been experimenting a bit with the new ngrammarginalize tool - very cool. I noticed that in some cases it can get stuck in the do{}while loop in the StationaryStateProbs function.

I trained a model using a small amount of data and witten_bell smoothing + pruning. It may be worth noting that the data is very noisy.
I have been digging around in the paper and source code a bit to figure out what might be the issue. The issue seems to be here:

if (fabs((*probs)[st] - last_probs[st]) > converge_eps * last_probs[st] )

where last_probs[st] can occasionally be a very small negative value. If I am following the paper correctly, this corresponds to the epsilon value mentioned at the end of Section 4.2. (I'm not entirely convinced I followed correctly to this point though so please correct me if I'm wrong).
Anyway, if last_probs[st] turns out to be negative, for whatever reason, then there is a tendency to get stuck in this block. This also seems to be affected by the fact that the comparison delta is computed relative to last_probs[st] , so if last_probs[st] happens to get smaller with each iteration, then the comparison also becomes more 'sensitive', in the sense that a smaller absolute difference will still evaluate to 'true'. So as we iterate, it seems that some values become more sensitive to noise (maybe this is the numerical error you mention?) and the process gets stuck.

I noticed that if I set the comparison to an absolute:

if (fabs( fabs((*probs)[st]) - fabs(last_probs[st])) > converge_eps )

then it always terminates, and does so quickly for each iteration, even for very small values of converge_eps.

I have not convinced myself that this is a theoretically acceptable solution, nor have I validated the resulting models, but it does some reasonable at a superficial level.

I would definitely appreciate some feedback if available.

JosefNovak - 02 Sep 2013 - 11:38

EDIT: this second issue may be ignored. It was my fault.

Also, I notice that I am still getting 'unplugged' 'holes' some of my pruned models when I run ngram-read after pruning with SRILM.

Specifically, I prune with srilm, then run: ngramread it terminates without issue. If I run the info command I get:

$ ngraminfo l2p5k.train.p5.mod
FATAL: NGramModel: bad ngram model topology

and if I run the marginalize command (without any of the modifications I mentioned above) I get:

$ ngrammarginalize --v=3 --iterations=2  --norm_eps=0.001 l2p5k.train.p5.mod  l2p5k.train.p5.marg.mod
INFO: Iteration #1
INFO: FstImpl::ReadHeader: source: l2p5k.train.p5.mod, fst_type: vector, arc_type: standard, version: 2, flags: 3
INFO: CheckTopology: bad backoff arc from: 557 with order: 4 to state: 0 with order: 1
FATAL: NGramModel: bad ngram model topology

Maybe there is a flag I am missing? Or I should just pre-process and plug them manually prior to conversion?

BrianRoark - 02 Sep 2013 - 22:21

Hi Josef,

For the second issue, I would be interested in seeing the original ARPA format file and follow this through to see what the potential issue might be. So please email or point me to that if you can. For the first issue, I have seen some problems with the convergence of the steady state probs for Witten-Bell models in particular, which is due, I believe, to under-regularization (as we say in the paper). Your tracing this to the value of last_probs[st] is something I hadn't observed, so that could be very useful. In answer to your question, I wonder if there's not a way to maintain the original convergence criterion, but control for the probability falling below zero (presumably due to floating point issues). One way to do this would be to set a very small minimum probability for the state and set a state's probability to that minimum if the calculated value falls below it. Since last_probs[st] gets its value from (*probs)[st], then it would never fall below 0 either. I'll look into this, too, thanks.

brian

JosefNovak - 04 Sep 2013 - 03:27

Hi, The 2nd issue was my own embarrassing mistake. I was setting up a test distribution to share, and realized that I still had an old, duplicate version of ngramread on my $PATH. This was the source of the conversion issue. When I switched it things worked as expected for all models.

I will share an example of the other via email.

How to handle unknown words?

GeorgePetasis - 29 Aug 2013 - 07:18

Hi, I am new to this librbary. I have created a small language model using a corpus. Is there a solution in applying the model in user input, which may contain an unknown word?

How can I handle unknown words?

BrianRoark - 29 Aug 2013 - 10:02

Hi,

when you apply the model to new text, you are typically using farcompilestrings to compile each string into FSTs. It has a switch (--unknown_symbol) to map out of vocabulary (OOV) items to a particular symbol. The language model then needs to include that symbol, typically as some small probability at the unigram state. There are a couple of ways to do this -- see the discussion topic below entitled: Model Format: OOV arcs.

brian

ngramrandgen with a prefix

GiulianoLancioni - 17 Aug 2013 - 11:45

Is it possible to constraint the output of ngramrandgen by requiring that it continues a (well-formed) prefix, as it is possible in SRILM? Thank you?

BrianRoark - 17 Aug 2013 - 14:40

Hello,

that is certainly possible, and we'll add it to the list of features for a future release. Likely not in the soon-to-be-released latest version, but perhaps in the next update after that.

brian

GiulianoLancioni - 17 Aug 2013 - 15:36

Brian,

thank you for your answer. I am no expert of FSTs, but I guess It should be possible to obtain the same result by somewhat filtering the complete FST, through composition (?) with another FST. After all, it should look like a FST where some paths have already been walked through. Please excuse my vagueness.

GiulianoLancioni - 17 Aug 2013 - 15:49

For instance, fstintersect of the ngram FST with an acceptor for the prefix could do the trick, couldn't it?

BrianRoark - 18 Aug 2013 - 18:41

yes, this is the way to think about it. The complication comes from the random sampling algorithm, and the way that it interprets the backoff arcs in the model. The algorithm assumes a particular model topology during the sampling procedure, which will not be preserved in the composed machine that you propose. But, yes, essentially that is what would be done to modify the algorithm.

brian

BrianRoark - 18 Aug 2013 - 18:42

and, of course, the benefit of open-source is that you could play around with the algorithm yourself, if you like.

GiulianoLancioni - 20 Aug 2013 - 06:38

Thanks!

skip-grams

GezaKiss - 24 Jul 2013 - 14:12

Do you support or plan to support skip-grams? Thanks!

BrianRoark - 25 Jul 2013 - 10:23

Hi Geza,

Skip k-grams generally require a model structure that is trickier to represent compactly in a WFST than standard n-gram models. This is because there are generally more than one state to backoff to. For example, in a trigram state, if the specific trigram 'abc' doesn't exist, a standard backoff n-gram model will backoff to the bigram state and look for 'bc'. With a skip-gram, there is also the skip-1 bigram 'a_c' to look for, i.e., there are two backoff directions. So, to answer your question, it is something we've thought about and are thinking about, but nothing imminent.

brian

calculating perplexity for >10 utterances using example command

GezaKiss - 19 Jul 2013 - 14:09

I am a newbie who wants to calculate perplexity for a text file consisting of more than 10 lines. The examples provided for that only works for < 10, and it is not obvious to me how to bypass that. (Yes, I know I can use a loop over the utterances, I just hope I do not have to.)

Using stdin with farcompilestrings results in an error message: FATAL: STListWriter::Add: key disorder: 10 FATAL: STListReader: error reading file: stdin

And specifying a text file as the first parameter for farcompilestrings does not work either: FATAL: FarCompileStrings: compiling string number 1 in file test.txt failed with token_type = symbol and entry_type = line FATAL: STListReader::STListReader: wrong file type:

I could not find related usage info. I'd appreciate some help. Thanks!

GezaKiss - 19 Jul 2013 - 14:40

The example command would be this:

echo -e "A HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nBAG HAND A" | farcompilestrings -generate_keys=1 -symbols=earnest.syms --keep_symbols=1 | ngramperplexity --v=1 earnest.mod -

BrianRoark - 20 Jul 2013 - 14:43

Hi Geza,

the -generate_keys=N switch for farcompilestrings creates numeric keys for each of the FSTs in the FAR file using N digits per FST. (One FST for each string in your case.) So, with N=1 you can index 0-9 strings, with N=2 you can index 0-99 strings, etc. So, for your example, you just need to up your -generate_keys argument with the number of digits in the total corpus count.

brian

GezaKiss - 24 Jul 2013 - 13:56

Thanks, Brian!

One more question, just so that I can see more clearly how ngramperplexity works. When I specify the "--OOV_probability=0.00001 --v=1" arguments, I get "[OOV] 9" for "ngram -logprob" in the output when using unigrams, and somewhat higher values for longer n-grams. What is the reason for his? (I guess I do not yet see how exactly perplexity calculation works.)

Thanks, Géza

GezaKiss - 24 Jul 2013 - 14:22

Sorry, I was too quick to write. (The solution is of course --OOV_class_size=10000, hence the 4 difference from the expected.)

GiulianoLancioni - 17 Aug 2013 - 04:48

As to the FarCompileStrings fatal error signaled by Géza, it depends from file encoding: text files must be encoded with Unix, noth Windows EOL.

Model Format: OOV arcs

SamS - 15 Jul 2013 - 00:31

When I load a pre-trained language model into c++ and iterate over arcs, I see that there are no arcs labeled with the given OOV symbol, although it is contained in the symbol table.

Are OOV words handled somewhere other than the FST itself, or is the absence of these arcs likely due to a quirk in my particular language model?

Any insight or ideas would be greatly appreciated. Thanks!

BrianRoark - 15 Jul 2013 - 22:37

Hi,

The model will only have probability mass for the OOV symbol if there are counts for it in the training corpus. ngramperplexity does have a utility for including an OOV probability, but this is done on the fly, not in the model structure itself. If you want to provide probability mass at the unigram state for the OOV symbol, you could create a corpus consisting of just that symbol, train a unigram model, then use ngrammerge to mix (either counts or model) with your main model. Then there would be explicit probability mass allocated to that symbol. You can use merge parameters to dictate how much probability that symbol should have. Hope that helps.

brian

SamS - 16 Jul 2013 - 13:09

This absolutely helps. Thanks for the answer!

AaronDunlop - 2014-06-19 - 18:39

A related question: would it be sensible to replace some (or all) singletons in the training corpus with the OOV symbol, thus learning OOV probabilities in context instead of solely at the unigram level? Is that approach likely to improve LM performance on unseen data?

OpenGrm in c++ application

EricYoo - 03 Jul 2013 - 08:42

Hi, I built 3-gram language model on few English words. My c++ program receive streaming character one by one. I would like to use the 3-gram model to score the up-coming character with history context. What example can I start with ? Is it possible not to convert to farstrings each time ?

BrianRoark - 15 Jul 2013 - 22:32

Hi,

This is one of the benefits of having the open-source library interface in C++, you can write functions of your own. We choose to score strings (when calculating perplexity for example) when encoding the strings as fars, but you could perform a similar function in your own C++ code. I would look at the code for ngramperplexity as a starting point, and learn how to use the arc iterators. Once you understand the structure of the model, you should be able to make that work. Alternatively, print out the model using fstprint and read it into your own data structures. Good luck!

brian

How to understand the order one count result?

HeTianxing - 08 Jun 2013 - 10:06

I tried order-1 ngram count with this simple text
Goose is hehe
Goose is hehe
Goose is
Goose
But I don't understand the resulting count fst.
0 -1.3863
0 0 Goose Goose -1.3863
0 0 is is -1.0986
0 0 hehe hehe -0.69315
The document says Transitions and final costs are weighted with the negative log count of the associated n-gram, but I can't make sense with these numbers, can someone help me out? Thx!!

BrianRoark - 09 Jun 2013 - 10:12

Hi,

The counts are stored as negative natural log (base e), so -0.69315 is -log(2), -1.0986 is -log(3) and -1.3863 is -log(4). The count of each word is kept on arcs in a single state machine (since this is order 1) and the final cost is for the end of string (which occurred four times in your example). You printed this using fstprint, but you can also try ngramprint which in this case yields:

<s> 4
Goose 4
is 3
hehe 2
</s> 4

where <s> and </s> are the implicit begin of string and end of string events. These are implicit because we don't actually use the symbols to encode them in the fst.

Hope that clears it up for you. If not, try the link in the 'Model Format' section of the quick tour, to the page on 'precise details'.

best,

brian

HuanliangWang - 15 May 2013 - 03:02

Hi,

I try to merge two LM by following command line:

./tools/opengrm-ngram-1.0.3/bin/ngrammerge --alpha=0.2 --beta=0.8 --normalize --use_smoothing A.fst B.fst AB.mrg.fst

But I get a error:

FATAL: NGramModel: input not deterministic

A.fst is normal fst LM trained by SRILM. B.fst is class-expanded LM by fstreplace command.

I also try to convert fst LM into arpa LM by command line:

./tools/opengrm-ngram-1.0.3/bin/ngramprint --ARPA B.fst > B.arpa

But I got a same error:

FATAL: NGramModel: input not deterministic

How to fix the error?

Thanks,

Huanliang wang

BrianRoark - 16 May 2013 - 22:17

Hi,

you've introduced non-determinism into the ngram models via your replace class modification. The ngrammerge (and ngramprint) commands are simple operations that expect a standard n-gram topology, hence the error messages. For more complex model topologies of the sort you have, you'll have to write your own model merge function that does the right thing when presented with non-determinism. The base library functions don't handle these complex examples, but the code should give you some indication of how to approach such a model mixture. Such is the benefit of open source!

brian

error when converting LM genereted by HTK into fst format

HuanliangWang - 07 May 2013 - 03:08

Hi,

I try to convert arpa LM genereted by HTK tool into fst format. The command is :

./tools/opengrm-ngram-1.0.3/bin/ngramread --ARPA test.arpa > test.lm.fst

But I get a error:

FATAL: NGramInput: Have a backoff cost with no state ID!

How to fix the error?

Thanks,

Huanliang wang

BrianRoark - 07 May 2013 - 22:18

Hi,

it appears that you have n-grams ending in your stop symbol (probably </s>) that have backoff weights, i.e., the ARPA format has an n-gram that looks like:

-1.583633 XYZ </s> -0.30103

But </s> means end-of-string, which we encode as final cost, not an arc leading to a new state. Hence there is no state where that backoff cost would be used. (Think of it this way: what's the next word you predict after </s>? In the standard semantics of </s>, it is the last term predicted, so nothing comes afterwards.) Do you also have n-grams that start with </s>?

So, one fix on your ARPA format is just to remove the backoff weight after n-grams that end in </s>.

hope that helps,

brian

HuanliangWang - 07 May 2013 - 23:08

Hi, Brian Thank you! I get it.

Another case is there are n-grams that start with </s> in my HTK LM. I think it is a bug of HTK tool, but It is a the only choice to train class-based LM. How do I fix it? Is it reasonable to remove directly these n-grams?

Thanks,

Huanliang Wang

HuanliangWang - 07 May 2013 - 23:10

Hi, Brian

Thank you! I get it.

Another case is there are some n-grams that start with </s> in my HTK LM. I think it is a bug of HTK tool, but it is my only choice to train class-based LM with automatic class clustering from large plain data . How do I fix it? Is it reasonable to remove directly these n-grams?

Thanks,

Huanliang Wang

BrianRoark - 08 May 2013 - 09:02

yes, you might try just removing those n-grams. In the ARPA format, you'll have to adjust the n-gram counts at the top of the file to match the number you have at each order.

HuanliangWang - 09 May 2013 - 05:06

Hi, Brian

Thank you! I got it.

Could you give me an example how to use fstreplace to replace e nonterminal label in a fst by another fst?

Thanks,

Huanliang Wang

BrianRoark - 09 May 2013 - 10:08

You might try the forum over at openfst.org, that's an fst command.

brian

error when converting LM genereted by HTK into fst format

HuanliangWang - 07 May 2013 - 03:06

Hi, I try to convert arpa LM genereted by HTK tool into fst format. The command is : ./tools/opengrm-ngram-1.0.3/bin/ngramread --ARPA test.arpa > test.lm.fst But I get a error:

FATAL: NGramInput: Have a backoff cost with no state ID!

How to fix the error?

Thanks, Huanliang wang

Infinity values / ill formed FST

RolandSchwarz - 03 Apr 2013 - 05:50

Hi,

I'm currently playing around with a test example and I noticed than after ngrammake if I call fstinfo (not ngraminfo) on the resulting language model fstinfo complains about the model being ill-formed. This is due to transitions (typically on epsilons) that have "Infinity" weight, which does not seem to be supported by openFST. Is that "working as intended"? The problem is later if I call fstshortestpath to get e.g. the n most likely sentences from the model the result contain not only "Infinity" weights but also "BadNumber" which might be a result of the infinite values.

Thanks,

Roland

BrianRoark - 03 Apr 2013 - 10:56

Hi Roland,

yes, under certain circumstances, some states in the model end up with infinite backoff cost, i.e., zero probability of backoff. In many cases this is, in fact, the correct weight to assign to backoff. For example, with a very small vocabulary and many observations, you might have a bigram state that has observations for every symbol in the vocabulary, hence no probability mass should be given to backoff. Still, this does cause some problems with OpenFst. In the next version (due to be released in the next month or so) we will by default have a minimum backoff probability of some very small epsilon (i.e., very large negative log probability). As a workaround in the meantime, I would suggest using fstprint to print the model to text, then use sed or perl or whatever to replace Infinity with some very large cost -- I think SRILM uses 99 in such cases, which would work fine.

hope that helps,

brian

BrianRoark - 03 Apr 2013 - 10:58

oh, I forgot to say that, once you have replaced Infinity with 99, use fstcompile to recompile the model into an FST.

brian

RolandSchwarz - 04 Apr 2013 - 05:12

Hey, thanks a lot for the quick reply, I'll give that a go!

Roland

RolandSchwarz - 04 Apr 2013 - 05:32

If I may add another quick question, when running fstshortestpath on the ngram count language model (i.e. after ngramcount but before ngrammake) I was expecting to get the most frequent n-gram, but instead the algorithm never seems to terminate. Any idea why that is? I though that shortestpath over the tropical sr should always terminate anyway.

Thanks,

Roland

BrianRoark - 04 Apr 2013 - 09:22

The ngram count Fst contains arcs with negative log counts. Since the counts can be greater than one, the negative log counts can be less than zero. Hence the shortest path is an infinite string repeating the most frequent symbol. Each symbol emission shortens the path, hence non-termination.

brian

RolandSchwarz - 05 Apr 2013 - 10:48

Oh I should have thought of that myself, thanks for clarifying. I somehow assumed the counts model would be acyclic.

Build failure on Fedora 17

JerryJames - 18 Dec 2012 - 10:42

Hi. I maintain several voice-recognition-related packages, including openfst, for the Fedora Linux distribution. I am working on an OpenGrm NGram package. My first attempt at building version 1.0.3 (with GCC 4.7.2 and glibc 2.15) failed:

In file included from ngramrandgen.cc:32:0:
./../include/ngram/ngram-randgen.h:55:48: error: there are no arguments to 'getpid' that depend on a template parameter, so a declaration of 'getpid' must be available [-fpermissive]
./../include/ngram/ngram-randgen.h:55:48: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope

It appears that an explicit #include <unistd.h> is needed in ngram-randgen.h. That header was probably pulled in through some other header in previous versions of either gcc or glibc.

BrianRoark - 19 Dec 2012 - 17:22

ok, that header file will be included in the next version. Thanks for the heads up.

brian

Expected result when using a lattice with ngram perplexity?

JosefNovak - 28 Nov 2012 - 06:40

I was wondering what the expected result is when feeding a lattice, rather than a string/sentence, to the ngramperplexity utility? Is this supported? It seems to report the perplexity of an arbitrary path through the lattice.

BrianRoark - 28 Nov 2012 - 20:46

Hi Josef,

ngramperplexity reports the perplexity of the path through the lattice that you get by taking the first arc out of each state that you reach. (Note that this is what you want for strings encoded as single-path automata.) Not sure what the preferred functionality should be for general lattices. Could make sense to show a warning or an error there; but at this point the onus is on the user to ensure that what is being scored is the same as what you get from farcompilestrings - unweighted, single-path automata. If you have an idea of what preferred functionality would be for non-string lattices, email me.

brian

is there a way to use NGramApply in c++

MarkusFreitag - 19 Oct 2012 - 12:55

Hi,

I do not want to print my fst and execute NGramApply in bash before reading the new fst again in c++.

Is there a method to use the method NGramApply directly in c++ ?

Thanks

BrianRoark - 21 Oct 2012 - 21:32

Hi Markus,

there is no single method; rather there are several ways to perform composition with the model, depending on how you want to interpret the backoff arcs. The most straightforward way to do this in your own code is to look at src/bin/ngramapply.cc and use the composition method for the particular kind of backoff arc, e.g., ngram.FailLMCompose() when interpreting the backoff as a failure transition. In other words, write your own ngramapply method based on inspection of the ngramapply code.

Hope that helps,

brian

MarkusFreitag - 22 Oct 2012 - 09:58

Hi,

thanks, I think yes that should work. I am using FailureArcs and my LM fst is created, so I do not need to build a lm fst out of strings or an ARPA lm.

I first just need to read the fst lm from my disk:

#include <ngram/ngram.h>

fst::StdMutableFst *fstforNGram; fstforNGram->Read($MYNGRAMFST); ngram::NGramModel ngram(fstforNGram); // that seems not to work, as: undefined reference to `ngram::NGramModel::InitModel()'

If I read the lm , I could then just add:

ngram.FailLMCompose(*lattice, &cfst, kSpecialLabel);

and the composed fst should be ready, right?

Thanks for helping

BrianRoark - 23 Oct 2012 - 09:05

Correct, that is the method for composing with failure arcs.

MarkusFreitag - 23 Oct 2012 - 09:30

yes, but I have a problem to read the fst lm in c++:

fst::StdMutableFst *fstforNGram;

fstforNGram->Read($MYNGRAMFST);

to that point it works.

ngram::NGramModel ngram(fstforNGram);

that seems not to work, as: undefined reference to `ngram::NGramModel::InitModel()'

Thanks

Fractional Kneser Ney

JosefNovak - 09 Oct 2012 - 04:31

Hi, I have been using OpenGrm with my Grapheme-to-Phoneme conversion tools for a while now and recently added some functionality to output weighted alignment lattices in .far format.

It is my understanding that these weighted lattices can only currently be utilized with Witten-Bell smoothing; is this correct?

Is there any plan to support fractional counts with Kneser-Ney smoothing, for instance along the lines of,

"Correlated Bigram LSA for Unsupervised Language Model Adaptation", Tam and Schultz.

or would I be best advised to implement this myself?

BrianRoark - 09 Oct 2012 - 09:32

Hi Josef,

Witten-Bell generalizes straightforwardly to fractional counts, as you point out. No immediate plans for new versions of other smoothing methods along those lines, so if that's something that you need urgently, you would need to implement it.

brian

JosefNovak - 09 Oct 2012 - 18:21

Understood, and thanks very much for your response!

FATAL: NegLogDiff

LukeCarmichael - 25 Sep 2012 - 17:21

Hello, I run this sequence of commands with the following output.

home$ ngramcounts a.far > a.cnts
home$  ngrammake --v=4 --method=katz a.cnts > katz.mod
INFO: FstImpl::ReadHeader: source: a.cnts, fst_type: vector, arc_type: standard, version: 2, flags: 3
Count bin   Katz Discounts (1-grams/2-grams/3-grams)
Count = 1   -nan/0.253709/-0.343723
Count = 2   -nan/1.26571/1.19095
Count = 3   -nan/0.467797/-0.532465
Count = 4   -nan/1.29438/1.18879
Count = 5   -nan/0.0740557/-0.489831
Count > 5   1/1/1
FATAL: NegLogDiff: undefined -10.2649 -10.2651

Other methods work fine.

How can I diagnose this problem?

Thanks, Luke

BrianRoark - 26 Sep 2012 - 13:12

Hi Luke,

this is basically a floating point precision issue, the system is trying to subtract two approximately equal numbers (while calculating backoff weights). The new version of the library coming out in a month or so has much improved floating point precision, which will help. In the meantime, you can get this to work by modifying a constant value in src/include/ngram/ngram-model.h which will allow these two numbers to be judged to be approximately equal. Look for: static const double kNormEps = 0.000001; near the top of that file. Change to 0.0001, then recompile.

This sort of problem usually comes up when you train a model with a relatively small vocabulary (like a phone or POS-tag model) and a relatively large corpus. The n-gram counts end up not following Good-Turing assumptions about what the distribution should look like (hence the odd discount values). In those cases, you're probably better off with Witten-Bell smoothing with the --witten_bell_k=15 or something like that. Or even trying an unsmoothed model.

And stay tuned for the next release, which deals more gracefully with some of these small vocabulary scenarios.

Brian

FATAL: NGramModel: bad ngram model topology

BenoitFavre - 10 Sep 2012 - 09:18

I generated an ngram model from a .arpa file with the following command:

ngramread --ARPA lm.arpa > lm.model

ngramread does not complain, but ngraminfo and trying to load the model from C++ code generate the following error:

FATAL: NGramModel: bad ngram model topology

How can I troubleshoot the problem?

BenoitFavre - 10 Sep 2012 - 09:27

Adding verbosity results in more mystery...

ngraminfo --v=2 lm.model INFO: FstImpl::ReadHeader: source: lm.model, fst_type: vector, arc_type: standard, version: 2, flags: 3 INFO: Incomplete # of ascending n-grams: 1377525 FATAL: NGramModel: bad ngram model topology

BrianRoark - 11 Sep 2012 - 10:33

Hi,

that error is coming from a sanity check that verifies that every state in the language model (other than the start and unigram states) is reached by exactly one 'ascending' arc, that goes from a lower order to a higher order state. ARPA format models can diverge from this, by, for example, having 'holes' (e.g., bigrams pruned but trigrams with that bigram as a suffix retained). But ngramread should plug all of those. maybe duplication? I'll email you about this.

BrianRoark - 18 Sep 2012 - 10:50

Benoit found a case where certain 'holes' from a pruned ARPA model were not being filled appropriately in the conversion. The sanity check routines on loading the model ensured that this anomaly was caught (causing the errors he mentioned), and we were able to find the cases where this was occurring and update the code. The updated conversion functions will be in the forthcoming version update release of the library, within the next month or two. In the meantime, if anyone encounters this problem, let me know and I can provide a workaround.

Access control:

Set ALLOWTOPICCHANGE = AllAuthUsersGroup

-- CyrilAllauzen - 09 Aug 2012

Topic revision: r102 - 2016-04-27 - PafnoutiT

Forum