OpenGrm SFST Quick Tour

Below is a brief tutorial on the OpenGrm SFST library based on a running example. We use the command-line SFST utilities for this; we could have instead used the corresponding C++-level utilities in a program (see Available Operations).

Example Data and Models

For our running example, we will use the text of Oscar Wilde's The Importance of Being Earnest, which has been upper-cased and had punctuation removed. The first 850 sentences, earnest_train.txt, are used as training data and the remaining 838 sentences, earnest_test.txt, are used as test data. We also provide an OpenFst-style symbol table, earnest.syms, which contains the 1000 most frequent words in the training set plus the <epsilon> token for the empty string and the <unk> token for the unknown word. These training and test sets can be compiled into OpenFst archives with:
$ farcompilestrings --unknown_symbol="<unk>" --keep_symbols --symbols=earnest.syms earnest_train.txt >earnest_train.far
$ farcompilestrings --unknown_symbol="<unk>" --keep_symbols --symbols=earnest.syms earnest_test.txt >earnest_test.far
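Conceptually, compilation maps each token of a sentence to its integer id in the symbol table, substituting <unk> for out-of-vocabulary words. A minimal Python sketch of that lookup, using made-up symbol ids rather than the actual contents of earnest.syms:

```python
# Hypothetical miniature symbol table; the real ids come from earnest.syms.
symtab = {"<epsilon>": 0, "<unk>": 1, "YES": 2, "BUT": 3}

def to_ids(sentence, symtab, unk="<unk>"):
    """Map each whitespace-separated token to its symbol id, OOVs to <unk>."""
    return [symtab.get(tok, symtab[unk]) for tok in sentence.split()]

print(to_ids("YES BUT MAYBE", symtab))  # MAYBE is OOV -> [2, 3, 1]
```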

Using the OpenGrm NGram utilities, these archives are used to construct bigram Katz LMs on the training data (earnest_train.mod) and the test data (earnest_test.mod). From earnest_train.mod we then create an entropy-pruned LM containing 2000 n-grams, earnest_train.pru.

$ ngramcount --order=2 earnest_train.far >earnest_train.cnts
$ ngrammake earnest_train.cnts >earnest_train.mod

$ ngramcount --order=2 earnest_test.far >earnest_test.cnts
$ ngrammake earnest_test.cnts >earnest_test.mod

$ ngramshrink --method="relative_entropy" --target_number_of_ngrams=2000 earnest_train.mod >earnest_train.pru

Perplexity

We can compute the perplexity of these models relative to the test set using the SFST utility sfstperplexity as follows (ngramperplexity returns the same values but is restricted to n-gram input):

$ unk_flags="--unknown_label=1001 --unknown_class_size=1"

$ sfstperplexity $unk_flags -phi_label=0 earnest_train.mod earnest_test.far
# of sentences                                    838
cross entropy                                     41835.6
perplexity                                        73.4146
# of words                                        8900
# of OOVs                                         1363

$ sfstperplexity $unk_flags -phi_label=0 earnest_test.mod earnest_test.far
# of sentences                                    838
cross entropy                                     30568.8
perplexity                                        23.0836
# of words                                        8900
# of OOVs                                         1363

$ sfstperplexity $unk_flags -phi_label=0 earnest_train.pru earnest_test.far
# of sentences                                    838
cross entropy                                     43164.4
perplexity                                        84.1474
# of words                                        8900
# of OOVs                                         1363
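The reported figures are internally consistent: the cross entropy is the total negative log-probability of the test set in nats, and the perplexity is its exponentiated per-event average, where each sentence contributes one end-of-sentence event in addition to its words. A quick check (this accounting is inferred from the numbers above, not from a documented formula):

```python
import math

def perplexity(total_cross_entropy_nats, num_words, num_sentences):
    # Each sentence also ends with an end-of-sentence prediction, so the
    # average is taken over (#words + #sentences) events.
    return math.exp(total_cross_entropy_nats / (num_words + num_sentences))

print(round(perplexity(41835.6, 8900, 838), 2))  # 73.41, matching the report
```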

Sampling

We can sample from these SFSTs as follows:

$ sfstrandgen -phi_label=0 earnest_train.mod | fstprint --acceptor
0   1   YES
1   2   BUT
2   3   I
3   4   SHALL
4   5   PROBABLY
5   6   NEVER
6   7   <epsilon>
7   8   THAT
8   9   NAME
9   10   <epsilon>
10   11   YOUR
11   12   COUSIN
12   13   CECILY
13
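Conceptually, random generation walks the automaton from its start state, choosing each outgoing arc with probability proportional to its weight until a final state is reached (the <epsilon> lines above are presumably traversed backoff arcs). A toy sketch of such a walk over a hypothetical word-level model, not the actual earnest_train.mod:

```python
import random

# Hypothetical bigram model: each state lists (next word, probability) pairs.
model = {
    "<s>": [("YES", 0.6), ("BUT", 0.4)],
    "YES": [("BUT", 0.5), ("</s>", 0.5)],
    "BUT": [("</s>", 1.0)],
}

def sample(model, start="<s>", end="</s>", rng=random):
    """Random walk: pick arcs with probability proportional to their weight."""
    state, out = start, []
    while state != end:
        words, probs = zip(*model[state])
        state = rng.choices(words, weights=probs)[0]
        if state != end:
            out.append(state)
    return out

print(sample(model))  # e.g. ['YES', 'BUT']
```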

Approximation

Idempotency

The following steps show that SFST approximation is effectively idempotent: approximating an SFST onto its own topology yields the same perplexity as the source model.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx
$ sfstperplexity $unk_flags -phi_label=0 earnest_train.approx earnest_test.far
# of sentences                                    838
cross entropy                                     41835.6
perplexity                                        73.4145
# of words                                        8900
# of OOVs                                         1363

An alternative, equivalent way to perform this approximation is to break it into two steps, where the counting and normalization are done separately.

$ sfstcount -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx_cnts
$ sfstnormalize -method=kl_min -phi_label=0 earnest_train.approx_cnts >earnest_train.approx2
$ sfstperplexity $unk_flags -phi_label=0 earnest_train.approx2 earnest_test.far
# of sentences                                    838
cross entropy                                     41835.6
perplexity                                        73.4145
# of words                                        8900
# of OOVs                                         1363
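The decomposition above can be pictured with ordinary counts: the first step accumulates (expected) counts of the events in the target topology, and the second turns each state's outgoing counts into a normalized distribution. A toy sketch with made-up counts, not actual SFST output:

```python
# Hypothetical per-state counts of outgoing events.
counts = {
    "I": {"SHALL": 30.0, "AM": 10.0},
    "SHALL": {"PROBABLY": 4.0, "NEVER": 12.0},
}

def normalize(counts):
    """Turn each state's outgoing counts into a probability distribution."""
    probs = {}
    for state, arcs in counts.items():
        total = sum(arcs.values())
        probs[state] = {word: c / total for word, c in arcs.items()}
    return probs

print(normalize(counts)["I"]["SHALL"])  # 30 / 40 -> 0.75
```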

Test Set-Based Target Topology

If we use the test-set bigram as the target topology, we would expect a better test-set perplexity, since all the relevant bigrams are then included.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_test.mod >earnest_train_test.approx
$ sfstperplexity $unk_flags -phi_label=0 earnest_train_test.approx earnest_test.far
# of sentences                                    838
cross entropy                                     41327.8
perplexity                                        69.6842
# of words                                        8900
# of OOVs                                         1363

Pruned Target Topology

If we use the pruned training-set bigram as the target topology, we will get a different weighting than the greedy relative-entropy pruning performed above. This approximation seeks the minimum-KL-divergence solution and proves somewhat better than the 84.1474 perplexity above.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.pru >earnest_train.pru.approx
$ sfstperplexity $unk_flags -phi_label=0 earnest_train.pru.approx earnest_test.far
# of sentences                                    838
cross entropy                                     43044.8
perplexity                                        83.1204
# of words                                        8900
# of OOVs                                         1363
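The quantity being minimized here is the Kullback-Leibler divergence D(p || q) = sum_x p(x) log(p(x)/q(x)) between the source distribution p and the approximation q; it is zero exactly when the two agree, which is also why approximating onto the source's own topology was idempotent above. A small numeric sketch with toy distributions, not taken from the models above:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy next-word distributions for illustration only.
p = {"NEVER": 0.75, "PROBABLY": 0.25}
q = {"NEVER": 0.5, "PROBABLY": 0.5}

print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # positive: q diverges from p
```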
Topic attachments:
- earnest.syms (27.7 K, 2019-11-08, MichaelRiley)
- earnest_test.txt (44.6 K, 2019-11-08, MichaelRiley)
- earnest_train.txt (44.5 K, 2019-11-08, MichaelRiley)
Topic revision: r2 - 2019-11-08 - MichaelRiley
 