Baum-Welch documentation


Training is performed using the TrainBaumWelch function. It takes as arguments a FAR or WFST representing the plaintext, and channel model and a FAR representing the ciphertext.

Normalization strategies

Training takes a template argument argument representing the desired expectation table type. Two types are available. The state expectation table normalizes so that all arcs leaving a state sum to semiring one. The state/input-label expectation table normalizes so that the weights of all arcs leaving a state with the same input label sum to semiring one; this is arguably the most intuitive normalization strategy and so it used as the default.

Semirings and count collection strategies

Training function is controlled in part by the arc type of the inputs.

In a semiring without the path property, like the log semiring, the E-step sums over all state sequences and thus performs true Baum-Welch training. In the case the input is a FST language model with backoffs, attach a $\phi$-matcher (fstspecial --fst_type phi) to ensure proper interpretation.

In a semiring with the path property, like the tropical semiring, the E-step sums over only the most likely state sequence, an approximation known as Viterbi training (Brown et al. 1993:293). In the case the input is an FST language model with backoffs, backoffs can be encoded exactly using a $\phi$-matcher or approximated by $\epsilon$ using the default matcher. In practice, Baum-Welch training is slightly slower than Viterbi training, though somewhat more accurate (Jansche 2003).


TrainBaumWelch takes an struct representing options for training. These define the size of the batch, the maximum number of iterations for training, the learning rate $$\alpha$$, and the comparison/quantization $$\delta$$ used to detect convergence.

Since expectation maximization training is only locally optimal, random (re)starts can be used to avoid local minima. For particularly difficult problems, it may be beneficial to perform a very large number of random starts (e.g., Berg-Kilpatrick and Klein 2013). In this library, we randomly initialize weights by sampling from the uniform distribution $(\delta, 1]$ in the real numbers and then mapping these values onto the appropriate values in the desired semiring. RandomizeBaumWelch can be used to generate random restarts.


Berg-Kilpatrick, T., and Klein, D. 2013. Decipherment with a million random restarts. In _Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing_, pages 874-878. Seattle.

Brown, P., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2): 263-311.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. _Journal of the Royal Statistical Society, Series B_, 39(1): 1-38.

Hart, P. E., Nilsson, N. J., Raphael, B. 1968. A formal basis for the heuristic determination of minimal cost paths. _IEEE Transactions on Systems Science and Cybernetics_ 4(2): 100-107.

Jansche, M. 2003. Inference of string mappings for language technology. Doctoral dissertation, Ohio State University.

Knight, K., Nair, A., Rashod, N., and Yamada, K. 2006. Unsupervised analysis for decipherment problems. _Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions_, pages 499-506. Sydney.

Liang, P., and Klein, D. 2009. Online EM for unsupervised models. In _Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics_, pages 611-619. Boulder.

Novak, J. R., Minematsu, N., and Hirose, K. 2012. WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. _Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing_, pages 45-49. Donostia–San Sebastián.

Novak, J. R., Minematsu, N., and Hirose, K. 2016. Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering 22(6): 907-938.

Edit | Attach | Watch | Print version | History: r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r3 - 2021-05-18 - KyleGorman
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback