Difference: GrmNGramForum (123 vs. 124)

Revision 124 - 2017-02-03 - BrianRoark


OpenGrm NGram Forum

  Short answer: there is no such utility currently in the library. Longer answer: forthcoming...

BrianRoark - 2017-02-03 - 15:17

Because the models are FSTs, we can build an FST that restricts the LM to strings with a particular prefix. Here's what I did, using a model called earnest.3g.mod.fst that I built from our example corpus earnest.txt. With standard ngramrandgen, here's what we get:

ngramrandgen -remove_epsilon --max_sents=3 earnest.3g.mod.fst rg.norestrict.far

farprintstrings rg.norestrict.far

I DON T ALLOW I FEEL THAT MY DEAR UNCLE JACK DOES NOT MR BUNBURY FOR YOUR SHALL PROBABLY NEVER AN AUNT BEING AT THE ARISTOCRACY
YES ANYTHING CAST HATE PEOPLE
THEY SO GRIEVED YES SIR

Now suppose I want to restrict generation to strings prefixed with "THE ROOM". First I will build an FST that accepts all strings over my vocabulary (including <epsilon>) that are prefixed by "THE ROOM". Here's a bash/awk sequence of commands to build such a restriction FST:

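# Prefix to enforce; every word must appear in earnest.syms.
prefix="THE ROOM"
# The number of prefix words is also the index of the final, looping state.
st=$(echo "$prefix" | awk '{print NF}')
# One arc per prefix word: state i-1 -> state i.
echo "$prefix" |\
  awk '{for (i = 1; i <= NF; i++) printf("%d\t%d\t%s\t%s\n",i-1,i,$i,$i)}' \
    >restrict.txt
# A self-loop at the final state for every symbol in the vocabulary.
awk -v ST="${st}" '{printf("%d\t%d\t%s %s\n",ST,ST,$1,$1)}' earnest.syms \
  >>restrict.txt
# Mark the last state as final.
echo "$st" >>restrict.txt
fstcompile \
  -isymbols=earnest.syms -keep_isymbols \
  -osymbols=earnest.syms -keep_osymbols restrict.txt restrict.fst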

If you look at restrict.txt, you'll see it looks something like this:

0       1       THE     THE
1       2       ROOM    ROOM
2       2       <epsilon> <epsilon>
2       2       MORNING MORNING
2       2       ROOM ROOM
2       2       IN IN
2       2       ALGERNON ALGERNON
...
2
That is, there is a loop over every vocabulary word at state 2, and the lone 2 on the last line marks state 2 as the final state. This automaton accepts all strings made up of words from the vocabulary, prefixed by "THE ROOM".
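
The same construction can be wrapped in a small shell function so that any prefix can be used. This is only a sketch: the function name make_prefix_restriction is made up for illustration (it is not part of OpenGrm or OpenFst), and it assumes the symbol table has one entry per line with the symbol in the first column, as in earnest.syms. It also refuses a prefix containing a word that is missing from the symbol table, on the assumption that such a word could not be compiled anyway.

```shell
# make_prefix_restriction PREFIX SYMS
# Hypothetical helper (not part of OpenGrm): writes to stdout a text-format
# acceptor for all strings over the symbols in SYMS that start with PREFIX.
make_prefix_restriction() {
  prefix="$1"
  syms="$2"
  printf '%s\n' "$prefix" | awk -v SYMS="$syms" '
    BEGIN {
      # Read the vocabulary: first column of each symbol-table line.
      while ((getline line < SYMS) > 0) {
        split(line, f, " ")
        if (f[1] != "") { vocab[++n] = f[1]; known[f[1]] = 1 }
      }
    }
    {
      # Reject the prefix if any of its words is out of vocabulary.
      for (i = 1; i <= NF; i++)
        if (!($i in known)) {
          printf("unknown symbol: %s\n", $i) > "/dev/stderr"
          bad = 1
        }
      if (bad) exit 1
      # One arc per prefix word: state i-1 -> state i.
      for (i = 1; i <= NF; i++)
        printf("%d\t%d\t%s\t%s\n", i - 1, i, $i, $i)
      # Self-loop on every symbol at the last state, then mark it final.
      for (j = 1; j <= n; j++)
        printf("%d\t%d\t%s\t%s\n", NF, NF, vocab[j], vocab[j])
      print NF
    }'
}
```

Usage would be along the lines of make_prefix_restriction "THE ROOM" earnest.syms >restrict.txt, followed by the same fstcompile command as above.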

Now we can compose our restriction FST restrict.fst with our model, to produce a new model:

fstcompose --compose_filter=null restrict.fst earnest.3g.mod.fst >earnest.restrict.3g.mod.fst

We use --compose_filter=null so that composition treats <epsilon> just like any other symbol. Now we can randomly generate from the composed model, and we get:

ngramrandgen -remove_epsilon --max_sents=3 earnest.restrict.3g.mod.fst rg.restrict.far

farprintstrings rg.restrict.far

THE ROOM YOU ARE YOU ARE AS ANY OUT
THE ROOM IS MARRIED TO I I AM THINK MARRIED LADY BLOXHAM THROUGH MOMENT A FIRST CONFESSED TO YOU HAVE WORTHING I MERELY CAME BACK
THE ROOM NEXT WEEK COUNTIES MONEY
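
As a quick sanity check, one can confirm that every generated string really starts with the prefix. The following is a sketch under a couple of assumptions: the helper name check_prefix is hypothetical, and the input is expected to have one generated string per line, as in the farprintstrings output above. The comparison is done word by word, so a string beginning with "THE ROOMS" would not count as having the prefix "THE ROOM".

```shell
# check_prefix PREFIX
# Hypothetical helper: reads one string per line on stdin and prints every
# line that does not start with the words of PREFIX. Exits non-zero if at
# least one such line is found.
check_prefix() {
  awk -v PREFIX="$1" '
    BEGIN { n = split(PREFIX, want, " ") }
    {
      # A line passes only if its first n words equal the prefix words.
      ok = (NF >= n)
      for (i = 1; ok && i <= n; i++)
        if ($i != want[i]) ok = 0
      if (!ok) { print; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }'
}
```

For example: farprintstrings rg.restrict.far | check_prefix "THE ROOM". If the restriction worked, this prints nothing and exits with status 0.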

So, yeah, no 'easy' command to provide such a prefix, but with some FST processing, you can get what you want.

 