TWiki> FST Web>FstExamples (revision 4)EditAttach

Work in progress, under construction OpenFst Examples

Reading the quick tour first is recommended. That includes a simple example of FST application using either the C++ template level or the shell-level operations. The advanced usage topic contains an implementation using the template-free intermediate scripting level as well.

In the examples below, we use the shell-level operations for convenience. The following data files are used in these examples:

File Description Source
wotw.txt (normalized) text of H.G. Well's War of the Worlds public domain
wotw.lm.gz 5-gram language model for wotw.txt in OpenFst text format www.opengrm.org
wotw.syms FST symbol table file for wotw.lm www.opengrm.org
ascii.syms FST symbol table file for ASCII letters Python: for i in range(33,127): print "%c %d\n" % (i,i)
lexicon_opt.txt.gz letter-to-token FST for wotw.syms see first example below
downcase.txt ASCII letter-to-downcased letter FST awk 'NR>1 { print 0,0,$1,tolower($1) } ; END { print 0 }' <ascii.syms >downcase.txt
With these files and the descriptions below, the reader should be able to repeat the examples.

(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)

Tokenization

The first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped. To do so we will need a lexicon transducer that maps from letters to their corresponding word tokens. A simple way to generate this is using the OpenFst text format. For example, the word Mars would have the form:

$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >Mars.fst <<EOF
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
EOF

This can be drawn with:

$ fstdraw --isymbols=ascii.syms -osymbols=wotw.syms -portrait Mars.fst | dot -Tjpg >Mars.jpg
which produces:

Mars.jpg.

Suppose that Martian.fst and man.fst have similarly been created, then:

$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst

produces a finite-state lexicon that transduces zero or more spelled-out word sequences into to their word tokens.

lexicon.png

The non-determinism and non-minimality introduced by the construction can be removed with:

$ fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst

resulting in the compact:

lexiconmin.png

In order to handle punctuation symbols, we change the lexicon construction to:

$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst

where:

$ fstcompile -isymbols=ascii.syms -osymbols=wotw.syms >punct.fst <<EOF
0 1 <space> <epsilon>
0 1 . <epsilon>
0 1 , <epsilon>
0 1 ? <epsilon>
0 1 ! <epsilon>
1
EOF

is a transducer that deletes common punctuation symbols.

Now, the tokenizaton of the an example string Mars man encoded as an FST:

Marsman.png

can be done with:

$ fstcompose Marsman.fst lexicon_opt.fst | fstproject --project_output | fstrmepsilon >tokens.fst

giving:

tokens.png.

To generate a full lexicon of all 7102 distinct words in the War of Worlds, it is convenient to dispense with the union of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms. Here is a python script that does that and was used, along with the above steps, to generate the full optimized lexicon.

Exercise

The above tokenization does not handle numeric character input.

(a) Create a transducer that maps numbers in the range 0 - 999999 represented as digit strings to their English read form, e.g.:

1 -> one
11 -> eleven
111 -> one hundred eleven
1111 -> one thousand one hundred eleven
11111 -> eleven thousand one hundred eleven.

(b) Incorporate this transduction into the letter-to-token transduction above and apply to the input Mars is 4225 miles across. represented as letters.

Downcasing Text

The next example converts case-sensitive input to all lowercase output. To do the conversion, we create a flower transducer of the form:

$ fstcompile -isymbols=ascii.syms -osymbols=ascii.syms >downcase.fst <<EOF
0 0 a a
0 0 b b
0 0 A a
0 0 B b
0 0 ! !
0
EOF

which produces:

downcase.jpg

A downcasing flower transducer for the full character set is here. This transducer can be applied to the Mars men automaton from the previous example with:

$ fstcompose Marsman.fst downcase.fst | fstproject -project_output >marsman.fst

giving:

marsman.png

A transducer that downcases at the token level can be created with:

$ fstinvert lexicon_opt.fst downcase.fst | fstcompose - lexicon_opt.fst >downcase_token.fst

Exercise

The letter-level downcasing transducer downcases any ASCII input. For which inputs does the token-level downcasing transducer work?

Case Restoration in Text

Edit Distance

Spelling Correction

Topic attachments
I Attachment History Action Size Date Who Comment
JPEGjpg Mars.jpg r1 manage 10.8 K 2010-12-08 - 06:15 MichaelRiley  
PNGpng Marsman.png r3 r2 r1 manage 13.0 K 2010-12-08 - 07:49 MichaelRiley  
Unknown file formatsyms ascii.syms r2 r1 manage 0.5 K 2010-12-08 - 06:06 MichaelRiley  
JPEGjpg downcase.jpg r2 r1 manage 3.8 K 2010-12-08 - 23:39 MichaelRiley  
Texttxt full_downcase.txt r2 r1 manage 0.8 K 2010-12-08 - 22:48 MichaelRiley  
JPEGjpg lexicon.jpg r2 r1 manage 15.6 K 2010-12-08 - 06:42 MichaelRiley  
PNGpng lexicon.png r5 r4 r3 r2 r1 manage 18.4 K 2010-12-08 - 07:06 MichaelRiley  
PNGpng lexiconmin.png r1 manage 20.1 K 2010-12-08 - 07:06 MichaelRiley  
Texttxt makelex.py.txt r1 manage 0.4 K 2010-12-08 - 08:48 MichaelRiley  
PNGpng tokens.png r2 r1 manage 14.4 K 2010-12-08 - 07:47 MichaelRiley  
Unknown file formatgz wotw.lm.gz r1 manage 3331.7 K 2010-12-08 - 05:28 MichaelRiley  
Unknown file formatsyms wotw.syms r1 manage 88.8 K 2010-12-08 - 05:13 MichaelRiley  
Texttxt wotw.txt r1 manage 331.0 K 2010-12-08 - 05:11 MichaelRiley  
Edit | Attach | Watch | Print version | History: r22 | r6 < r5 < r4 < r3 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r4 - 2010-12-08 - MichaelRiley
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback