
OpenFst Examples

Work in progress, under construction.

We recommend reading the quick tour first. It includes a simple example of FST application using either the C++ template-level or the shell-level operations. The advanced usage topic also contains an implementation using the template-free intermediate scripting level.

In the examples below, we use the shell-level operations for convenience. The following data files are used in these examples:

| File | Description | Source |
| wotw.txt | (normalized) text of H.G. Wells's War of the Worlds | public domain |
| wotw.lm.gz | 5-gram language model for wotw.txt in OpenFst text format | www.opengrm.org |
| wotw.syms | FST symbol table file for wotw.lm | www.opengrm.org |
| ascii.syms | FST symbol table file for ASCII letters | Python: for i in range(33, 127): print("%c %d" % (i, i)) |
| lexicon_opt.txt.gz | letter-to-token FST for wotw.syms | see first example below |

With these files and the descriptions below, the reader should be able to repeat the examples.
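
The Python one-liner given for ascii.syms can be expanded into a short Python 3 sketch. The entries for <epsilon> (label 0) and <space> (label 32) are assumptions, added because the examples below use those two symbols; the attached ascii.syms may define them differently.

```python
def ascii_symbol_lines(include_specials=True):
    """Return symbol-table lines of the form '<symbol> <integer label>'."""
    lines = []
    if include_specials:
        # Assumption: label 0 is epsilon (the OpenFst convention) and the
        # space character is spelled <space>, as in the examples below.
        lines.append("<epsilon> 0")
        lines.append("<space> 32")
    # Printable, non-space ASCII characters, each labeled by its code point.
    for code in range(33, 127):
        lines.append("%c %d" % (code, code))
    return lines
```

Writing "\n".join(ascii_symbol_lines()) to a file gives one symbol per line, which is the layout the --isymbols and --osymbols options expect.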

(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)

Tokenization

The first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped. To do so we will need a lexicon transducer that maps from letters to their corresponding word token. A simple way to generate this is using the OpenFst text format. For example, the word Mars would have the form:

$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >Mars.fst <<EOF
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
EOF
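
Typing such a file by hand for every word quickly gets tedious. As a minimal sketch (the function name word_fst_text is hypothetical, not part of OpenFst), the text-format lines for any word can be generated like this, following the same pattern as the Mars example: the word token is emitted on the first arc and <epsilon> on the rest.

```python
def word_fst_text(word, epsilon="<epsilon>"):
    """Return fstcompile text-format lines for a one-word lexicon transducer."""
    lines = []
    for state, letter in enumerate(word):
        # The word token is output on the first arc; later arcs output epsilon.
        output = word if state == 0 else epsilon
        lines.append("%d %d %s %s" % (state, state + 1, letter, output))
    # A line holding only a state number marks that state as final.
    lines.append(str(len(word)))
    return lines
```

Printing "\n".join(word_fst_text("Mars")) reproduces the five lines fed to fstcompile above.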

This can be drawn with:

$ fstdraw --isymbols=ascii.syms --osymbols=wotw.syms --portrait Mars.fst | dot -Tjpg >Mars.jpg

which produces:

Mars.jpg.

Suppose that Martian.fst and man.fst have been created in the same way. Then:

$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst

produces a finite-state lexicon that transduces a sequence of zero or more spelled-out words into their word tokens.

lexicon.png

The non-determinism and non-minimality introduced by the construction can be removed with:

$ fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst

resulting in the compact:

lexiconmin.png
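
To get a feel for why determinization shrinks the lexicon, it can help to look only at the input (letter) side. The sketch below is a simplified analogy, not what fstdeterminize computes: it ignores output labels, the closure loop, and the suffix sharing that fstminimize adds. It compares one path per word against a prefix tree in which words with a common prefix share states.

```python
def count_states_separate(words):
    # One shared start state plus one fresh state per letter of each word.
    return 1 + sum(len(w) for w in words)


def count_states_prefix_tree(words):
    # One state per distinct prefix; the empty prefix is the start state.
    prefixes = {""}
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    return len(prefixes)


words = ["man", "Mars", "Martian"]
separate = count_states_separate(words)
shared = count_states_prefix_tree(words)
```

For these three words, Mars and Martian share the prefix "Mar", so the prefix tree needs three fewer states than the separate paths.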

In order to handle punctuation symbols, we change the lexicon construction to:

$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst

where:

$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >punct.fst <<EOF
0 1 <space> <epsilon>
0 1 . <epsilon>
0 1 , <epsilon>
0 1 ? <epsilon>
0 1 ! <epsilon>
1
EOF

is a transducer that deletes common punctuation symbols.

Now, the tokenization of an example string, Mars man, encoded as an FST:

Marsman.png

can be done with:

$ fstcompose Marsman.fst lexicon_opt.fst | fstproject --project_output | fstrmepsilon >tokens.fst

giving:

tokens.png.
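
The relation that the composition computes can be mimicked in plain Python, without OpenFst, as a rough sketch. It assumes the three-word lexicon and the punct.fst symbols from above, and models the lexicon as (word punct)*. Read literally, that construction requires exactly one punctuation symbol after every word, so the sketch uses the input "Mars man." with a trailing period; this is an assumption about the input, not something stated above.

```python
WORDS = ("man", "Mars", "Martian")
PUNCT = (" ", ".", ",", "?", "!")


def tokenizations(text, words=WORDS, punct=PUNCT):
    """Return every token sequence that (word punct)* assigns to text."""
    results = []

    def walk(pos, tokens):
        if pos == len(text):
            # The whole input was consumed at a word boundary: accept.
            results.append(tokens)
            return
        for word in words:
            end = pos + len(word)
            # A word must match here and be followed by one punctuation symbol.
            if text.startswith(word, pos) and end < len(text) and text[end] in punct:
                walk(end + 1, tokens + [word])

    walk(0, [])
    return results
```

Under these assumptions tokenizations("Mars man.") yields the single token sequence Mars man, while an input with no punctuation after the last word yields no result, which corresponds to an empty composition.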

To generate a full lexicon of all 7102 distinct words in The War of the Worlds, it is convenient to dispense with the union of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms. The attached Python script makelex.py.txt does that and was used, along with the above steps, to generate the full optimized lexicon.
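
The following is a minimal sketch of how such a single text FST might be produced. It is not the attached script: the function name lexicon_fst_text is hypothetical, and skipping symbols written in angle brackets (such as <epsilon>) is an assumption about which entries of wotw.syms are real words.

```python
def lexicon_fst_text(symbol_lines, epsilon="<epsilon>"):
    """Build one text-format FST with a letter path for each word symbol."""
    lines = []
    next_state = 1  # state 0 is the shared start state
    for line in symbol_lines:
        fields = line.split()
        if not fields:
            continue
        word = fields[0]
        # Assumption: symbols in angle brackets are special, not words.
        if word.startswith("<") and word.endswith(">"):
            continue
        state = 0
        for i, letter in enumerate(word):
            # Emit the word token on the first arc, epsilon afterwards.
            output = word if i == 0 else epsilon
            lines.append("%d %d %s %s" % (state, next_state, letter, output))
            state = next_state
            next_state += 1
        lines.append(str(state))  # the last state of each path is final
    return lines
```

Every word path starts at state 0, so the output already represents the union of the per-word transducers. It would then go through fstcompile and the concatenation, closure, and optimization steps shown above.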

Topic attachments

| Attachment | Size | Date | Who |
| Mars.jpg | 10.8 K | 2010-12-08 - 06:15 | MichaelRiley |
| Marsman.png | 13.0 K | 2010-12-08 - 07:49 | MichaelRiley |
| ascii.syms | 0.5 K | 2010-12-08 - 06:06 | MichaelRiley |
| lexicon.jpg | 15.6 K | 2010-12-08 - 06:42 | MichaelRiley |
| lexicon.png | 18.4 K | 2010-12-08 - 07:06 | MichaelRiley |
| lexiconmin.png | 20.1 K | 2010-12-08 - 07:06 | MichaelRiley |
| makelex.py.txt | 0.4 K | 2010-12-08 - 08:48 | MichaelRiley |
| tokens.png | 14.4 K | 2010-12-08 - 07:47 | MichaelRiley |
| wotw.lm.gz | 3331.7 K | 2010-12-08 - 05:28 | MichaelRiley |
| wotw.syms | 88.8 K | 2010-12-08 - 05:13 | MichaelRiley |
| wotw.txt | 331.0 K | 2010-12-08 - 05:11 | MichaelRiley |
Topic revision: r3 - 2010-12-08 - MichaelRiley
 