OpenFst Quick Tour

Below is a brief tutorial on the OpenGrm SFST library based on a running example. We use the command-line SFST utilities for this; we could have instead used the corresponding C++-level utilities in a program (see Available Operations).

Example Data and Models

For our running example, we will use the text of Oscar Wilde's The Importance of Being Earnest, which has been upper-cased and had punctuation removed. The first 850 sentences, earnest_train.txt, are used as training data and the remaining 838 sentences, earnest_test.txt, are used as test data. We also provide an OpenFst

-style symbol table, earnest_train_1000.syms, which contains the 1000 most frequent words in the training set plus the <epsilon> token for the empty string and the <unk> token for the unknown word. These training and test sets can be compiled into OpenFst archives with:

$ farcompilestrings --unknown_symbol="<unk>" -keep_symbols -symbols=earnest_train_1000.syms earnest_train.txt >earnest_train.far
$ farcompilestrings --unknown_symbol="<unk>" -keep_symbols -symbols=earnest_train_1000.syms earnest_test.txt >earnest_test.far

Using the OpenGrm NGram utilities, these are used to construct a bigram Katz LM on the training data, earnest_train.mod, and the test data, earnest_test.mod. From earnest_train.mod we create a 2000 n-gram, entropy-pruned LM, earnest_train.pru.

$ ngramcount -order=$order earnest_train.far >earnest_train.cnts
$ ngrammake earnest_train.cnts >earnest_train.mod

$ ngramcount -order=$order earnest_test.far >earnest_test.cnts
$ ngrammake earnest_test.cnts >earnest_test.mod

$ ngramshrink --method="relative_entropy" --target_number_of_ngrams=2000 earnest_train.mod >earnest_train.pru

Perplexity

We can compute the perplexity of these models relative to the test set using the SFst operation sfstperplexity as follows (ngramperplexity returns the same values, restricted to n-gram input):

$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_test.far
# of sources                                      838
cross entropy/source                              49.92315
perplexity/symbol                                 73.4146
# of OOVs                                         1363

$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_train.far
# of sources                                      850
cross entropy/source                              37.8284
perplexity/symbol                                 26.19
# of OOVs                                         568

$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_test.far
# of sources                                      838
cross entropy/source                              51.5088
perplexity/symbol                                 84.1474
# of OOVs                                         1363

We can also compute the self perplexities and cross perplexities of the models:

$ sfstperplexity -phi_label=0 earnest_train.mod
# of sources                                      1
self entropy/source                               55.9322
perplexity/symbol                                 72.518

$ sfstperplexity -phi_label=0 earnest_train.pru
# of sources                                      1
self entropy/source                               56.3617
perplexity/symbol                                 84.8603

$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_train.mod
# of sources                                      1
cross entropy/source                              59.2278
perplexity/symbol                                 93.3396
# of OOVs                                         196

$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_train.pru
# of sources                                      1
cross entropy/source                              57.839
perplexity/symbol                                 95.336
# of OOVs                                         23

Sampling

We can sample from these SFSTs as follows:

$ sfstrandgen -phi_label=0 earnest_train.mod | fstprint --acceptor
0   1   YES
1   2   BUT
2   3   I
3   4   SHALL
4   5   PROBABLY
5   6   NEVER
6   7   <epsilon>
7   8   THAT
8   9   NAME
9   10   <epsilon>
10   11   YOUR
11   12   COUSIN
12   13   CECILY
13

Approximation

Idempotency

The following steps show that SFT approximation is effectively idempotent: the perplexity of the approximation of a SFST onto the same topology gives the same perplexity as the source.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx earnest_test.far
# of sources                                      838
cross entropy/source                              49.9231
perplexity/symbol                                 73.4145
# of OOVs                                         1363

An alternative, equivalent way to perform this approximation is to break it into two steps, where the counting and normalization are done separately.

$ sfstcount -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx_cnts
$ sfstnormalize -method=kl_min -phi_label=0 earnest_train.approx_cnts >earnest_train.approx2
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx2 earnest_test.far
# of sources                                      838
cross entropy/source                              49.9231
perplexity/symbol                                 73.4145
# of OOVs                                         1363

Test Set-Based Target Topology

If we use the test-set bigram as the target topology we would expect to get a better test set perplexity since we include all the relevant bigrams.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_test.mod >earnest_train.approx2
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx2 earnest_test.far
# of sources                                      838
cross entropy/source                              49.3171
perplexity/symbol                                 69.6842
# of OOVs                                         1363

Pruned Target Topology

If we use the pruned training-set bigram as the target topology, we will get a different weighting than the greedy relative-entropy pruning performed above. This approximation seeks the minimum KL-distance solution and proves somewhat better than the 84.1474 above.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.pru >earnest_train.pru.approx
$ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru.approx earnest_test.far
# of sources                                      838
cross entropy/source                              51.3661
perplexity/symbol                                 83.1204
# of OOVs                                         1363

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who
txt	earnest_test.txt	r1	manage	44.6 K	2019-11-08 - 00:46	MichaelRiley
txt	earnest_train.txt	r1	manage	44.5 K	2019-11-08 - 00:46	MichaelRiley
syms	earnest_train_1000.syms	r1	manage	10.7 K	2020-07-06 - 02:01	MichaelRiley

Topic revision: r4 - 2020-07-06 - MichaelRiley