TWiki> FST Web>FstExamples (revision 6)EditAttach

Work in progress, under construction OpenFst Examples

Reading the quick tour first is recommended. That includes a simple example of FST application using either the C++ template level or the shell-level operations. The advanced usage topic contains an implementation using the template-free intermediate scripting level as well.

The following data files are used in the examples below:

File Description Source
wotw.txt (normalized) text of H.G. Well's War of the Worlds public domain
wotw.lm.gz 5-gram language model for wotw.txt in OpenFst text format www.opengrm.org
wotw.syms FST symbol table file for wotw.lm www.opengrm.org
ascii.syms FST symbol table file for ASCII letters Python: for i in range(33,127): print "%c %d\n" % (i,i)
lexicon.txt.gz letter-to-token FST for wotw.syms see first example below
lexicon_opt.txt.gz optimized letter-to-token FST for wotw.syms see first example below
downcase.txt ASCII letter-to-downcased letter FST awk 'NR>1 { print 0,0,$1,tolower($1) } ; END { print 0 }' <ascii.syms >downcase.txt
With these files and the descriptions below, the reader should be able to repeat the examples.

(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)

A few general comments about the examples:

  1. For the most part, we illustrate with the shell-level commands for convenience. (On non-Posix systems, there may be issues with binary file I/O to standard input and output. If so, pass input and output files as program arguments instead.)

  1. The fstcompose operation is used often here. Typically, one or both of the input FSTs should be appropriately sorted before composition. In the examples below, however, we have only illustrated sorting where it is necessary, to keep the presentation shorter. The provided data files are pre-sorted for their intended use.

Tokenization

The first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped. To do so we will need a lexicon transducer that maps from letters to their corresponding word tokens. A simple way to generate this is using the OpenFst text format. For example, the word Mars would have the form:

$ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >Mars.fst <<EOF
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
EOF

This can be drawn with:

$ fstdraw --isymbols=ascii.syms --osymbols=wotw.syms -portrait Mars.fst | dot -Tjpg >Mars.jpg
which produces:

Mars.jpg.

Suppose that Martian.fst and man.fst have similarly been created, then:

$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst

produces a finite-state lexicon that transduces zero or more spelled-out word sequences into to their word tokens.

lexicon.png

The non-determinism and non-minimality introduced by the construction can be removed with:

$ fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst

resulting in the compact:

lexiconmin.png

In order to handle punctuation symbols, we change the lexicon construction to:

$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst

where:

$ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >punct.fst <<EOF
0 1 <space> <epsilon>
0 1 . <epsilon>
0 1 , <epsilon>
0 1 ? <epsilon>
0 1 ! <epsilon>
1
EOF

is a transducer that deletes common punctuation symbols. The full punctuation transducer is here.

Now, the tokenizaton of the an example string Mars man encoded as an FST:

Marsman.png

can be done with:

$ fstcompose Marsman.fst lexicon_opt.fst | fstproject --project_output | fstrmepsilon >tokens.fst

giving:

tokens.jpg.

To generate a full lexicon of all 7102 distinct words in the War of Worlds, it is convenient to dispense with the union of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms. Here is a python script that does that and was used, along with the above steps, to generate the full optimized lexicon.

Exercise 1

The above tokenization does not handle numeric character input.

(a) Create a transducer that maps numbers in the range 0 - 999999 represented as digit strings to their English read form, e.g.:

1 -> one
11 -> eleven
111 -> one hundred eleven
1111 -> one thousand one hundred eleven
11111 -> eleven thousand one hundred eleven.

(b) Incorporate this transduction into the letter-to-token transduction above and apply to the input Mars is 4225 miles across. represented as letters.

Downcasing Text

The next example converts case-sensitive input to all lowercase output. To do the conversion, we create a flower transducer of the form:

$ fstcompile --isymbols=ascii.syms --osymbols=ascii.syms >downcase.fst <<EOF
0 0 ! !
0 0 A a
0 0 B b
0 0 a a
0 0 b b
0
EOF

which produces:

downcase.jpg

A downcasing flower transducer for the full character set is here. This transducer can be applied to the Mars men automaton from the previous example with:

$ fstproject Marsman.fst | fstcompose - downcase.fst | fstproject --project_output >marsman.fst

giving:

marsman.png

A transducer that downcases at the token level can be created with:

$ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstcompose - lexicon_opt.fst | 
fstrmepsilon | fstdeterminize | fstminimize  >downcase_token.fst

Exercise 2

(a) The letter-level downcasing transducer downcases any ASCII input. For which inputs does the token-level downcasing transducer work? What changes would be necessary to cover all inputs from wotw.syms?

(b) If a token The were applied to downcase_token.fst, what would the output look like? What would it look like if the optimizations (epsilon-removal, determinization and minimization) were omitted from the construction of downcase_token.fst.

Case Restoration in Text

This example creates a transducer that attempts to restore the case of downcased input. This is the first non-trivial task and there is no error-free way to do this. The approach taken here will be to use case statistics gathered from the The War of the Worlds source text to help solve this. In particular, we will use an n-gram language model created on this text that is represented as a finite-state automaton in OpenFst format. Here is a typical path in this 5-gram automaton:

$ fstrandgen --select=log_prob wotw.lm | fstprint  --isymbols=wotw.syms  --osymbols=wotw.syms 
0   1   The   The
1   2   desolating   desolating
2   3   cry   cry
3   4   <epsilon>   <epsilon>
4   5   worked   worked
5   6   <epsilon>   <epsilon>
6   7   upon   upon
7   8   my   my
8   9   mind   mind
9   10   <epsilon>   <epsilon>
10   11   once   once
11   12   <epsilon>   <epsilon>
12   13   <epsilon>   <epsilon>
13   14   I   I
14   15   <epsilon>   <epsilon>
15   16   <epsilon>   <epsilon>
16   17   slept   slept
17   18   <epsilon>   <epsilon>
18   19   little   little
19

This model is constructed to have a transition for every 1-gram to 5-gram seen in 'War of the Worlds' with its weight related to the (negative log) probability of that n-gram occurring in the text corpus. The transitions correspond to backoff transitions used in the smoothing of the model to allow accepting input sequences not seen in training.

Given this language model and using the lexicon and downcasing transducers from the previous examples (preferably a more complete one constructed in Exercise 2a), a solution is:

$ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst

The first FST, wotw.fst, maps from letters to tokens following the probability distribution of the language model. The second FST, case_restore.fst is similar but uses only downcased letters. Case prediction can then be performed with:

$ fstcompose marsman.fst case_restore.fst | fstshortestpath | 
fstproject --project_output | fstrmepsilon >prediction.fst

which gives:

prediction.png

In other words, the most likely case of the input is determinized with respect to the n-gram model.

There is a serious problem, however, with the above solution. For all but tiny corpora, the first composition will blow up with the classical composition algorithm since the output labels in lexicon_opt.fst have been pushed back when it was determinized and this greatly delays matching with the labels in wotw.lm. There are three possible solutions:

First, we can use the input to restrict the composition chain as:

$ fstcompose downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst | 
fstcompose - wotw.lm | fstshortestpath | fstproject -project_output | fstrmepsilon >prediction.fst

This works fine but has the disadvantage that we don't have a single transducer to apply and we are depending on the input being a string or otherwise small. A second solution, which gives a single optimized transducer, is to replace transducer determinization and minimization of lexicon.fst with automata determinization and minimization (via encoding the input and output label pairs into a single new label) followed by the transducer determinization and minimization of the result of the composition with wotw.fst:

$ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | 
fstencode --decode - enc.dat >lexicon_compact.fst
$ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst

This solution is a natural and simple one but has the disadvantage that the transducer determinization and minimization steps are quite expensive. A final solution is to use an FST representation that allows lookahead matching, which composition can exploit to avoid the matching delays:

$ fstconvert --fst_type=olabel_lookahead --save_relabel_opairs=relabel.pairs lexicon_opt.fst >lexicon_lookahead.fst
$ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm | fstarcsort --sort_type=ilabel >wotw_relabel.lm
$ fstcompose lexicon_lookahead.fst wotw_relabel.lm >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst

The relabeling of the input labels of the language model is a by-product of how the lookahead matching works. Note in order to use the lookahead FST formats you must use --enable-lookahead-fsts in the library compilation and you must set your LD_LIBRARY_PATH (or equivalent) appropriately.

Exercise 3

(a) Find the weight of the second shortest distinct token sequence in the prediction example above.

(b) Find the weight of the second shortest distinct token sequence in the prediction example above without using the --nshortest flag (hint: use fstdifference).

(c) Find all paths within weight 5 of the shortest path in prediction example.

Edit Distance

Spelling Correction

Topic attachments
I Attachment History Action Size Date Who Comment
JPEGjpg Mars.jpg r1 manage 10.8 K 2010-12-08 - 06:15 MichaelRiley  
PNGpng Marsman.png r3 r2 r1 manage 13.0 K 2010-12-09 - 01:52 MichaelRiley  
Unknown file formatsyms ascii.syms r2 r1 manage 0.5 K 2010-12-08 - 06:06 MichaelRiley  
JPEGjpg downcase.jpg r2 r1 manage 3.8 K 2010-12-08 - 23:39 MichaelRiley  
Texttxt full_downcase.txt r2 r1 manage 0.8 K 2010-12-08 - 22:48 MichaelRiley  
Texttxt full_punct.txt r1 manage 0.7 K 2010-12-09 - 01:42 MichaelRiley  
JPEGjpg lexicon.jpg r2 r1 manage 15.6 K 2010-12-08 - 06:42 MichaelRiley  
PNGpng lexicon.png r5 r4 r3 r2 r1 manage 18.4 K 2010-12-08 - 07:06 MichaelRiley  
Unknown file formatgz lexicon.txt.gz r1 manage 300.9 K 2010-12-09 - 06:39 MichaelRiley  
Unknown file formatgz lexicon_opt.txt.gz r2 r1 manage 226.2 K 2010-12-09 - 02:27 MichaelRiley  
PNGpng lexiconmin.png r1 manage 20.1 K 2010-12-08 - 07:06 MichaelRiley  
Texttxt makelex.py.txt r1 manage 0.4 K 2010-12-08 - 08:48 MichaelRiley  
PNGpng marsman.png r2 r1 manage 12.9 K 2010-12-09 - 02:11 MichaelRiley  
PNGpng prediction.png r1 manage 10.4 K 2010-12-09 - 06:10 MichaelRiley  
JPEGjpg tokens.jpg r1 manage 4.5 K 2010-12-09 - 02:09 MichaelRiley  
PNGpng tokens.png r2 r1 manage 14.4 K 2010-12-08 - 07:47 MichaelRiley  
Unknown file formatgz wotw.lm.gz r1 manage 3331.7 K 2010-12-08 - 05:28 MichaelRiley  
Unknown file formatsyms wotw.syms r1 manage 88.8 K 2010-12-08 - 05:13 MichaelRiley  
Texttxt wotw.txt r1 manage 331.0 K 2010-12-08 - 05:11 MichaelRiley  
Edit | Attach | Watch | Print version | History: r22 | r8 < r7 < r6 < r5 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r6 - 2010-12-09 - MichaelRiley
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback