`<epsilon>` token for the empty string and the `<unk>` token for the unknown word. These training and test sets can be compiled into OpenFst archives with:
    $ farcompilestrings --unknown_symbol="<unk>" -keep_symbols -symbols=earnest_train_1000.syms earnest_train.txt >earnest_train.far
    $ farcompilestrings --unknown_symbol="<unk>" -keep_symbols -symbols=earnest_train_1000.syms earnest_test.txt >earnest_test.far
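The symbol table `earnest_train_1000.syms` is supplied as an attachment, and this page does not show how it was built. As a rough sketch only, a table of the same shape could be derived from the training text with standard Unix tools. The helper name `make_syms` and the labeling scheme (0 for `<epsilon>`, 1 to N for the N most frequent words, the next label for `<unk>`) are assumptions, chosen to be consistent with the `--unknown_label=1001` flag used later on this page.

```shell
# Hypothetical helper (not part of OpenFst or OpenGrm): print a symbol table
# for the N most frequent whitespace-separated tokens of a text file.
# Assumed labeling: 0 = <epsilon>, 1..N = words by rank, next label = <unk>.
make_syms() {
  corpus=$1
  n=$2
  printf '<epsilon>\t0\n'
  tr -s ' \t' '\n\n' <"$corpus" |   # one token per line
    grep -v '^$' |                  # drop empty lines
    LC_ALL=C sort |
    uniq -c |                       # count each distinct token
    LC_ALL=C sort -k1,1nr -k2,2 |   # most frequent first, ties by spelling
    head -n "$n" |
    awk '{ printf "%s\t%d\n", $2, NR }
         END { printf "<unk>\t%d\n", NR + 1 }'
}
```

For example, `make_syms earnest_train.txt 1000 >my_train_1000.syms` would write up to 1002 entries. The result need not match the attached file, since the tie-breaking rule among equally frequent words is a choice made here.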

Using the OpenGrm NGram utilities, we construct from these archives a bigram Katz LM on the training data, `earnest_train.mod`, and another on the test data, `earnest_test.mod`. From `earnest_train.mod` we then create an entropy-pruned LM with 2000 n-grams, `earnest_train.pru`. The commands take the n-gram order from the shell variable `$order`, which is 2 for a bigram model.

    $ ngramcount -order=$order earnest_train.far >earnest_train.cnts
    $ ngrammake earnest_train.cnts >earnest_train.mod
    $ ngramcount -order=$order earnest_test.far >earnest_test.cnts
    $ ngrammake earnest_test.cnts >earnest_test.mod
    $ ngramshrink --method="relative_entropy" --target_number_of_ngrams=2000 earnest_train.mod >earnest_train.pru

The perplexity of these models on the test and training data can be measured with `sfstperplexity` as follows (`ngramperplexity` returns the same values, but is restricted to n-gram input):

    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_test.far
    # of sources                838
    cross entropy/source        49.92315
    perplexity/symbol           73.4146
    # of OOVs                   1363

    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_train.far
    # of sources                850
    cross entropy/source        37.8284
    perplexity/symbol           26.19
    # of OOVs                   568

    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_test.far
    # of sources                838
    cross entropy/source        51.5088
    perplexity/symbol           84.1474
    # of OOVs                   1363
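To put a number on the cost of pruning, the two test-set perplexities above can be compared directly. The sketch below is only illustrative: `get_ppl` and `ppl_change` are hypothetical helper names, and the parser assumes the tool prints one label and value per line.

```shell
# Hypothetical helper: read sfstperplexity output on stdin and print the
# "perplexity/symbol" value, assuming one "label value" pair per line.
get_ppl() {
  awk '$1 == "perplexity/symbol" { print $NF }'
}

# Hypothetical helper: percentage change from a baseline perplexity ($1)
# to a new perplexity ($2), printed with one decimal place.
ppl_change() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", 100 * (new - base) / base }'
}

# Test-set figures copied from the runs above (unpruned vs. pruned model).
unpruned=$(printf 'cross entropy/source 49.92315\nperplexity/symbol 73.4146\n' | get_ppl)
pruned=$(printf 'cross entropy/source 51.5088\nperplexity/symbol 84.1474\n' | get_ppl)

ppl_change "$unpruned" "$pruned"   # prints 14.6
```

With the tools installed, the same parser could presumably be fed live output, for example `sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_test.far | get_ppl`.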

We can also compute the self perplexities and cross perplexities of the models:

    $ sfstperplexity -phi_label=0 earnest_train.mod
    # of sources                1
    self entropy/source         55.9322
    perplexity/symbol           72.518

    $ sfstperplexity -phi_label=0 earnest_train.pru
    # of sources                1
    self entropy/source         56.3617
    perplexity/symbol           84.8603

    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru earnest_train.mod
    # of sources                1
    cross entropy/source        59.2278
    perplexity/symbol           93.3396
    # of OOVs                   196

    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.mod earnest_train.pru
    # of sources                1
    cross entropy/source        57.839
    perplexity/symbol           95.336
    # of OOVs                   23
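Since the KL divergence satisfies D(p || q) = H(p, q) - H(p), the self and cross entropies above give a rough estimate of how far the two models are from each other. This is a sketch under assumptions: that the first argument of `sfstperplexity` is the scoring model and the second is the source, that both entropies are reported in the same units, and that the OOV handling in the cross-entropy runs can be ignored. The helper name `entropy_gap` is hypothetical.

```shell
# Hypothetical helper: difference between a cross entropy ($1) and a self
# entropy ($2), both per source and in the units sfstperplexity reports.
entropy_gap() {
  awk -v cross="$1" -v self="$2" 'BEGIN { printf "%.4f\n", cross - self }'
}

# Source earnest_train.mod scored by model earnest_train.pru.
entropy_gap 59.2278 55.9322   # prints 3.2956
# Source earnest_train.pru scored by model earnest_train.mod.
entropy_gap 57.839 56.3617    # prints 1.4773
```

The two gaps differ, as expected for an asymmetric divergence.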

    $ sfstrandgen -phi_label=0 earnest_train.mod | fstprint --acceptor
    0 1 YES
    1 2 BUT
    2 3 I
    3 4 SHALL
    4 5 PROBABLY
    5 6 NEVER
    6 7 <epsilon>
    7 8 THAT
    8 9 NAME
    9 10 <epsilon>
    10 11 YOUR
    11 12 COUSIN
    12 13 CECILY
    13
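The arc listing above is easier to read as a sentence. A minimal sketch, assuming the sampled path is linear and its arcs are printed in order (as in the listing): keep the third column of every arc line and skip `<epsilon>` labels. The helper name `arcs_to_sentence` is hypothetical.

```shell
# Hypothetical helper: turn `fstprint --acceptor` output for a single linear
# path into a space-separated sentence. Arc lines have at least three fields
# (source, destination, label); the final-state line has fewer and is skipped.
arcs_to_sentence() {
  awk 'NF >= 3 && $3 != "<epsilon>" { printf "%s%s", sep, $3; sep = " " }
       END { print "" }'
}

# The sample path shown above.
arcs_to_sentence <<'EOF'
0 1 YES
1 2 BUT
2 3 I
3 4 SHALL
4 5 PROBABLY
5 6 NEVER
6 7 <epsilon>
7 8 THAT
8 9 NAME
9 10 <epsilon>
10 11 YOUR
11 12 COUSIN
12 13 CECILY
13
EOF
# prints: YES BUT I SHALL PROBABLY NEVER THAT NAME YOUR COUSIN CECILY
```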

    $ sfstapprox -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx
    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx earnest_test.far
    # of sources                838
    cross entropy/source        49.9231
    perplexity/symbol           73.4145
    # of OOVs                   1363

An alternative, equivalent way to perform this approximation is to break it into two steps, where the counting and normalization are done separately.

    $ sfstcount -phi_label=0 earnest_train.mod earnest_train.mod >earnest_train.approx_cnts
    $ sfstnormalize -method=kl_min -phi_label=0 earnest_train.approx_cnts >earnest_train.approx2
    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx2 earnest_test.far
    # of sources                838
    cross entropy/source        49.9231
    perplexity/symbol           73.4145
    # of OOVs                   1363

    $ sfstapprox -phi_label=0 earnest_train.mod earnest_test.mod >earnest_train.approx2
    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.approx2 earnest_test.far
    # of sources                838
    cross entropy/source        49.3171
    perplexity/symbol           69.6842
    # of OOVs                   1363

    $ sfstapprox -phi_label=0 earnest_train.mod earnest_train.pru >earnest_train.pru.approx
    $ sfstperplexity --unknown_label=1001 -phi_label=0 earnest_train.pru.approx earnest_test.far
    # of sources                838
    cross entropy/source        51.3661
    perplexity/symbol           83.1204
    # of OOVs                   1363
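To compare the runs at a glance, the test-set perplexities reported above can be collected and sorted. This is only a convenience sketch: the helper name `rank_by_ppl` and the row labels are made up here, with `approx(a,b)` standing for the output of `sfstapprox` called with `earnest_train.a` (or `earnest_test.mod`) as its first and second arguments.

```shell
# Hypothetical helper: sort "label perplexity" lines by ascending perplexity.
rank_by_ppl() {
  LC_ALL=C sort -k2,2n
}

# perplexity/symbol on earnest_test.far, copied from the runs above.
rank_by_ppl <<'EOF'
mod 73.4146
pru 84.1474
approx(mod,mod) 73.4145
approx(mod,test.mod) 69.6842
approx(mod,pru) 83.1204
EOF
```

In these figures, the approximation made with `earnest_train.pru` as the second argument scores 83.1204, slightly below the 84.1474 of the pruned model itself, while the approximation made with `earnest_train.mod` twice essentially reproduces the original model's 73.4146.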

Type | Attachment | Revision | Size | Date | Who |
---|---|---|---|---|---|
txt | earnest_test.txt | r1 | 44.6 K | 2019-11-08 - 00:46 | MichaelRiley |
txt | earnest_train.txt | r1 | 44.5 K | 2019-11-08 - 00:46 | MichaelRiley |
syms | earnest_train_1000.syms | r1 | 10.7 K | 2020-07-06 - 02:01 | MichaelRiley |

Topic revision: r4 - 2020-07-06 - MichaelRiley

Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
