OpenFst Examples
Reading the
quick tour first is recommended. That includes a simple
example of FST application using either the C++ template level or the shell-level operations. The
advanced usage topic contains an
implementation using the template-free intermediate
scripting level as well.
In the examples below, we use the shell-level operations for convenience. The following data files are used in these examples:
With these files and the descriptions below, the reader should be able to repeat the examples.
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)
Tokenization
The first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped.
To do so we will need a
lexicon transducer that maps from letters to their corresponding word tokens. A simple way to generate this
is using the
OpenFst text format. For example, the word
Mars would have the form:
$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >Mars.fst <<EOF
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
EOF
This can be drawn with:
$ fstdraw --isymbols=ascii.syms -osymbols=wotw.syms -portrait Mars.fst | dot -Tjpg >Mars.jpg
which produces:
.
Suppose that
Martian.fst
and
man.fst
have similarly been created, then:
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst
produces a finite-state lexicon that transduces zero or more spelled-out word sequences into to their word tokens.
The non-determinism and non-minimality introduced by the construction can be removed with:
$ fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst
resulting in the compact:
In order to handle punctuation symbols, we change the lexicon construction to:
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst
where:
$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >punct.fst <<EOF
0 1 <space> <epsilon>
0 1 . <epsilon>
0 1 , <epsilon>
0 1 ? <epsilon>
0 1 ! <epsilon>
1
EOF
is a transducer that deletes common punctuation symbols.
Now, the tokenizaton of the an example string
Mars man encoded as an FST:
can be done with:
$ fstcompose Marsman.fst lexicon_opt.fst | fstproject --project_output | fstrmepsilon >tokens.fst
giving:
.
To generate a full lexicon of all 7102 distinct words in the
War of Worlds, it is convenient to dispense with the union
of individual word FSTs above and instead generate a single text FST from the word symbols in
wotw.syms
.
Here is a python script that does that and was used, along with the above steps,
to generate the
full optimized lexicon.
Exercise
The above tokenization does not handle numeric character input.
(a) Create a transducer that maps numbers in the range 0 - 999999 represented as digit strings to their English read form, e.g.:
1 -> one
11 -> eleven
111 -> one hundred eleven
1111 -> one thousand one hundred eleven
11111 -> eleven thousand one hundred eleven.
(b) Incorporate this transduction into the letter-to-token transduction above and apply to the input
Mars is 4225 miles across. represented as letters.
Downcasing Text
The next example converts case-sensitive input to all lowercase output. To do the conversion, we create a
flower transducer of the form:
$ fstcompile -isymbols=ascii.syms -osymbols=ascii.syms >downcase.fst <<EOF
0 0 a a
0 0 b b
0 0 A a
0 0 B b
0 0 ! !
0
EOF
which produces:
A downcasing flower transducer for the full character set is
here. This transducer can be applied to the
Mars men automaton from the previous example with:
$ fstcompose Marsman.fst downcase.fst | fstproject -project_output >marsman.fst
giving:
A transducer that downcases at the token level can be created with:
$ fstinvert lexicon_opt.fst downcase.fst | fstcompose - lexicon_opt.fst >downcase_token.fst
Exercise
The letter-level downcasing transducer downcases any ASCII input. For which inputs does the token-level downcasing
transducer work?
Case Restoration in Text
Edit Distance
Spelling Correction