OpenFst Examples
Reading the quick tour first is recommended. It includes a simple example of FST application using either the C++ template level or the shell-level operations. The advanced usage topic also contains an implementation using the template-free intermediate scripting level.
In the examples below, we use the shell-level operations for convenience. The following data files are used in these examples:
With these files and the descriptions below, the reader should be able to repeat the examples.
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)
Tokenization
The first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped.
To do so we will need a lexicon transducer that maps the letter sequence of each word to its corresponding word token. A simple way to generate this is using the OpenFst text format. For example, the word Mars would have the form:
$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >Mars.fst <<EOF
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
EOF
This can be drawn with:
$ fstdraw --isymbols=ascii.syms -osymbols=wotw.syms -portrait Mars.fst | dot -Tjpg >Mars.jpg
which produces the image Mars.jpg.
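To make the meaning of the text format above concrete, here is a minimal Python sketch that reads the same arc listing and applies it to a sequence of input symbols. It is not part of OpenFst: the names parse_text_fst and transduce are hypothetical, and the sketch assumes an unweighted transducer that is deterministic on its input labels and has no input epsilons, which holds for the single-word Mars example.

```python
# Text-format FST for the word "Mars", copied from the example above.
MARS_TEXT = """\
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
"""

def parse_text_fst(text):
    """Parse transducer text format: 'src dst in out' arcs, 'state' finals."""
    arcs = {}       # (state, input label) -> (next state, output label)
    finals = set()
    start = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if start is None:
            start = fields[0]  # source state of the first line is the start state
        if len(fields) >= 4:
            arcs[(fields[0], fields[2])] = (fields[1], fields[3])
        else:
            finals.add(fields[0])
    return start, arcs, finals

def transduce(text_fst, symbols):
    """Return the output symbols for the input, or None if it is rejected."""
    state, arcs, finals = parse_text_fst(text_fst)
    output = []
    for symbol in symbols:
        if (state, symbol) not in arcs:
            return None
        state, out_label = arcs[(state, symbol)]
        if out_label != "<epsilon>":  # epsilon outputs emit nothing
            output.append(out_label)
    return output if state in finals else None

print(transduce(MARS_TEXT, list("Mars")))  # ['Mars']
```

The word token is emitted on the first arc and the remaining letters output epsilon, so the whole spelling maps to the single token Mars. An incomplete spelling such as "Mar" ends in a non-final state and is rejected.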
Suppose that Martian.fst and man.fst have been created similarly. Then:
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst
produces a finite-state lexicon that transduces a sequence of zero or more spelled-out words into their word tokens.
The non-determinism and non-minimality introduced by the construction can be removed with:
$ fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst
resulting in a compact lexicon transducer.
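One reason the optimized lexicon is smaller is that words with a common spelling prefix, such as Mars and Martian, can share states once the machine is deterministic. The following rough Python sketch counts states for one separate path per word versus a prefix tree over the input letters. It is only an illustration of prefix sharing: it ignores output labels, the closure, and the suffix merging done by minimization, so the numbers are not those of lexicon_opt.fst. The function names are hypothetical.

```python
def separate_path_states(words):
    """States needed when every word has its own path (len(word) + 1 each)."""
    return sum(len(word) + 1 for word in words)

def prefix_tree_states(words):
    """States needed when words share a state for every common prefix."""
    prefixes = {word[:i] for word in words for i in range(len(word) + 1)}
    return len(prefixes)

words = ["man", "Mars", "Martian"]
print(separate_path_states(words))  # 17
print(prefix_tree_states(words))    # 12
```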
In order to handle punctuation symbols, we change the lexicon construction to:
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst
where:
$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >punct.fst <<EOF
0 1 <space> <epsilon>
0 1 . <epsilon>
0 1 , <epsilon>
0 1 ? <epsilon>
0 1 ! <epsilon>
1
EOF
is a transducer that deletes common punctuation symbols.
Now, the tokenization of an example string, Mars man, encoded as an FST (Marsman.fst), can be done with:
$ fstcompose Marsman.fst lexicon_opt.fst | fstproject --project_output | fstrmepsilon >tokens.fst
giving the token sequence in tokens.fst.
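The relation computed by the punctuation-aware lexicon can be mimicked in plain Python to see what the composition accepts. This is a sketch, not OpenFst code: the function name tokenize is hypothetical, the space character stands in for the `<space>` symbol, and the trailing period in the sample input is an assumption. Under the construction above (each word concatenated with punct.fst, then closed), every word, including the last, must be followed by exactly one deleted symbol.

```python
WORDS = ["man", "Mars", "Martian"]
PUNCT = {" ", ".", ",", "?", "!"}  # symbols deleted by punct.fst

def tokenize(text, words=WORDS, punct=PUNCT):
    """Return the word tokens if text matches (word punct)*, else None."""
    if text == "":
        return []  # the closure accepts the empty string
    for word in words:
        end = len(word)
        if text.startswith(word) and end < len(text) and text[end] in punct:
            rest = tokenize(text[end + 1:], words, punct)
            if rest is not None:
                return [word] + rest
    return None

print(tokenize("Mars man."))  # ['Mars', 'man']
print(tokenize("Mars man"))   # None
```

The second call returns None because the final word is not followed by a symbol from punct.fst. A string with two consecutive punctuation symbols, such as "Mars, man.", is rejected by this sketch for the same reason.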
To generate a full lexicon of all 7102 distinct words in the War of the Worlds, it is convenient to dispense with the union of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms.
Here is a Python script that does that and was used, along with the above steps, to generate the full optimized lexicon.
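For readers without access to that script, the idea can be sketched as follows. This is not the original script: the function name lexicon_text_fst is hypothetical, and the sketch assumes that each line of the symbol file has the form "symbol id", that entries in angle brackets such as `<epsilon>` are not words, and that every character of a word is itself a symbol in ascii.syms.

```python
def lexicon_text_fst(symbol_lines):
    """Build one text-format FST with a path from state 0 for each word."""
    arcs = []
    finals = []
    next_state = 1  # state 0 is the shared start state
    for line in symbol_lines:
        fields = line.split()
        if not fields:
            continue
        word = fields[0]
        if word.startswith("<"):  # skip <epsilon> and similar special symbols
            continue
        src = 0
        for i, char in enumerate(word):
            dst = next_state
            next_state += 1
            # Emit the word token on the first arc, epsilon afterwards.
            out_label = word if i == 0 else "<epsilon>"
            arcs.append(f"{src} {dst} {char} {out_label}")
            src = dst
        finals.append(str(src))  # last state of each word path is final
    return "\n".join(arcs + finals) + "\n"

if __name__ == "__main__":
    sample = ["<epsilon> 0", "Mars 1", "man 2"]
    print(lexicon_text_fst(sample), end="")
```

The resulting text could then be compiled with fstcompile using the same symbol tables as in the single-word examples, and concatenated with punct.fst, closed, and optimized as described above.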