Line: 1 to 1  

OpenFst Examples  
Line: 17 to 17  
With these files and the descriptions below, the reader should be able to repeat the examples. At about 340,000 words, The War of the Worlds is a small corpus, but one that still allows nontrivial examples.  
Deleted:  
< <  (Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)  
A few general comments about the examples:  
Changed:  
< < 
 
> > 
 
Tokenization  
Line: 284 to 282  
The relabeling of the input labels of the language model is a byproduct of how the lookahead matching works. Note that in order to use  
Changed:  
< <  the lookahead FST formats you must use --enable-lookahead-fsts=yes in the library configuration and you must set your  
> >  the lookahead FST formats you must use --enable-lookahead-fsts in the library configuration and you must set your  
LD_LIBRARY_PATH (or equivalent) appropriately.
Exercise 5 
Line: 1 to 1  

OpenFst Examples  
Line: 195 to 195  
Case Restoration in Text
This example creates a transducer that attempts to restore the case of downcased input. This is the first nontrivial example and, in general, there is no error-free way to do this. The approach taken here will be to use case statistics gathered from The War of the Worlds source text  
Changed:  
< <  to help solve this. In particular, we will use an n-gram language model created on this text that is represented as a finite-state automaton in OpenFst format, which you should compile to the file wotw.lm . Here is a typical path in this 5-gram automaton:  
> >  to help solve this. In particular, we will use an n-gram language model created on this text that is represented as a finite-state automaton in OpenFst format, which you should compile to the file wotw.lm.fst . Here is a typical path in this 5-gram automaton:  
Changed:  
< <  $ fstrandgen --select=log_prob wotw.lm | fstprint --isymbols=wotw.syms --osymbols=wotw.syms  
> >  $ fstrandgen --select=log_prob wotw.lm.fst | fstprint --isymbols=wotw.syms --osymbols=wotw.syms  
0 1 The The
1 2 desolating desolating
2 3 cry cry  
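The corpus-statistics idea behind this section can be sketched outside OpenFst with a deliberately simplified model: instead of the 5-gram automaton, pick for each downcased token the casing most often observed in the corpus. This is only an illustration of the statistics being exploited (a unigram stand-in, not the method used below), and the tiny corpus and function names are invented for the sketch:

```python
from collections import Counter, defaultdict

def build_case_model(tokens):
    """For each downcased form, count how often each surface casing occurs."""
    stats = defaultdict(Counter)
    for tok in tokens:
        stats[tok.lower()][tok] += 1
    return stats

def restore_case(words, stats):
    """Pick the most frequent observed casing; fall back to the input form."""
    out = []
    for w in words:
        forms = stats.get(w.lower())
        out.append(forms.most_common(1)[0][0] if forms else w)
    return out

# A toy stand-in for the full source text.
corpus = "No one would have believed that Mars was watched . The Martian came .".split()
model = build_case_model(corpus)
print(restore_case("the martian saw mars".split(), model))
# -> ['The', 'Martian', 'saw', 'Mars']
```

The unigram model gets sentence-internal "the" wrong whenever "The" happens to dominate the counts; that is exactly why the examples below use an n-gram language model, which scores casings in context.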
Line: 228 to 228  
# Before trying this, read the whole section.
 
Changed:  
< <  $ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst  
> >  $ fstcompose lexicon_opt.fst wotw.lm.fst | fstarcsort --sort_type=ilabel >wotw.fst  
$ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst
Line: 249 to 249  
There is a serious problem, however, with the above solution. For all but tiny corpora,
the first composition is extremely expensive with the classical composition algorithm since the
output labels in lexicon_opt.fst have been pushed back when it was determinized and this greatly delays matching  
Changed:  
< <  with the labels in wotw.lm . There are three possible solutions:  
> >  with the labels in wotw.lm.fst . There are three possible solutions:  
First, we can use the input to restrict the composition chain as:
$ fstcompose full_downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst |  
Changed:  
< <  fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
> >  fstcompose - wotw.lm.fst | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
This works fine but has the disadvantage that we don't have a single transducer to apply and we are depending on the input being  
Line: 266 to 266  
$ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst  
Changed:  
< <  $ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize | fstarcsort --sort_type=ilabel >wotw.fst  
> >  $ fstcompose lexicon_compact.fst wotw.lm.fst | fstdeterminize | fstminimize | fstarcsort --sort_type=ilabel >wotw.fst  
$ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst  
Line: 277 to 277  
# Converts to a lookahead lexicon
$ fstconvert --fst_type=olabel_lookahead --save_relabel_opairs=relabel.pairs lexicon_opt.fst >lexicon_lookahead.fst
 
Changed:  
< <  $ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm | fstarcsort --sort_type=ilabel >wotw_relabel.lm  
> >  $ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm.fst | fstarcsort --sort_type=ilabel >wotw_relabel.lm  
# Relabels the language model input (required by lookahead implementation)
$ fstcompose lexicon_lookahead.fst wotw_relabel.lm >wotw.fst
$ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst
Line: 1 to 1  

OpenFst Examples  
Line: 169 to 169  
A transducer that downcases at the token level (but see Exercise 3a) can be created with:  
Changed:  
< <  $ fstinvert lexicon_opt.fst | fstcompose - full_downcase.fst | fstcompose - lexicon_opt.fst | fstrmepsilon | fstdeterminize | fstminimize >downcase_token.fst  
> >  $ fstinvert lexicon_opt.fst | fstcompose - full_downcase.fst | fstcompose - lexicon_opt.fst | fstrmepsilon | fstdeterminize | fstminimize >downcase_token.fst  
Exercise 2  
Line: 236 to 236  
The second FST, case_restore.fst, is similar but uses only downcased letters. Case prediction can then be performed with:  
Changed:  
< <  $ fstcompose marsman.fst case_restore.fst | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
> >  $ fstcompose marsman.fst case_restore.fst | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
which gives:  
Line: 254 to 254  
First, we can use the input to restrict the composition chain as:  
Changed:  
< <  $ fstcompose full_downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst | fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
> >  $ fstcompose full_downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst | fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
This works fine but has the disadvantage that we don't have a single transducer to apply and we are depending on the input being  
Line: 264 to 264  
the transducer determinization and minimization of the result of the composition with wotw.fst :  
Changed:  
< <  $ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst  
> >  $ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst  
$ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize | fstarcsort --sort_type=ilabel >wotw.fst
$ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst  
Line: 392 to 392  
$ fstcompose ref.fst edit1.fst | fstarcsort >ref_edit.fst
$ fstcompose edit2.fst hyp.fst | fstarcsort >hyp_edit.fst  
Changed:  
< <  $ fstcompose ref_edit.fst hyp_edit.fst | fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
> >  $ fstcompose ref_edit.fst hyp_edit.fst | fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
Here is the output (with some added color to make it easier to read): 
Line: 1 to 1  

OpenFst Examples  
Line: 365 to 365  
Given these factors, compute:  
Changed:  
< <  $ fstcompose ref.fst edit1.fst >ref_edit.fst
$ fstcompose edit2.fst hyp.fst >hyp_edit.fst  
> >  $ fstcompose ref.fst edit1.fst | fstarcsort >ref_edit.fst
$ fstcompose edit2.fst hyp.fst | fstarcsort >hyp_edit.fst  
$ fstcompose ref_edit.fst hyp_edit.fst | fstshortestdistance --reverse | head -1  
Line: 385 to 385  
Create a transducer ref.fst representing a correctly capitalized English sentence using words from the corpus and with adequate whitespace. You might want to use words which appear both capitalized and uncapitalized in the source text to have a chance to observe a nonzero edit distance. A suitable (nonsensical) example is the following: "The nice chief astronomer says that both the terraces of the south tower and the western mills in the East use the English Channel as a supply pool "  
Changed:  
< <  We will now downcase ref.fst (with full_downcase.fst which we presented above) and feed it to (i.e. compose it with) the case restoration FST. We select the shortest path to get the hypothesis of case_restore.fst for this input and compose that with the reversed tokenizer to get its representation as a sequence of characters, not tokens. This is hyp.fst .  
> >  You can now downcase ref.fst (with the full_downcase.fst transducer presented above), apply case_restore.fst to it and get the hypothesis output for this input (as was explained in the section about case restoration). Compose that with the reversed tokenizer to get the hypothesis represented as a sequence of characters, not tokens. This is hyp.fst , which should be an FST representing a string along the lines of "The Nice chief Astronomer says that both the terraces of the south Tower and the western Mills in the east use the English channel as a Supply Pool ".  
Changed:  
< <  $ fstcompose ref.fst full_downcase.fst | fstcompose - case_restore.fst | fstshortestpath \
  | fstproject --project_output | fstrmepsilon | fstcompose - <(fstinvert lexicon_opt.fst) \
  | fstshortestpath | fsttopsort | fstproject --project_output \
  | fstpush --push_weights --remove_total_weight > hyp.fst
The result should be an FST representing a string along the lines of "The Nice chief Astronomer says that both the terraces of the south Tower and the western Mills in the east use the English channel as a Supply Pool ". Now, we can compute the edit distance as in the example above.
$ fstcompose <(fstcompose ref.fst edit1.fst | fstarcsort) <(fstcompose edit2.fst hyp.fst | fstarcsort) \
  | fstshortestdistance --reverse | head -1
For the given  
> >  Now, you can compute the edit distance as in the example above. For the given ref.fst and hyp.fst , the edit distance should be 8. You can also show the alignment (which, in the present case, will only include substitutions):  
Changed:  
< <  $ fstcompose <(fstcompose ref.fst edit1.fst | fstarcsort) <(fstcompose edit2.fst hyp.fst | fstarcsort) | fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
> >  $ fstcompose ref.fst edit1.fst | fstarcsort >ref_edit.fst
$ fstcompose edit2.fst hyp.fst | fstarcsort >hyp_edit.fst
$ fstcompose ref_edit.fst hyp_edit.fst | fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
Changed:  
< <  Here is the output:  
> >  Here is the output (with some added color to make it easier to read):  
Changed:  
< <  
> >  
0 1 T T
1 2 h h
2 3 e e
Changed:
< <  3 4
> >  3 4 <space> <space>
4 5 n N 1
5 6 i i
6 7 c c
7 8 e e
Changed:
< <  8 9
> >  8 9 <space> <space>
9 10 c c
10 11 h h
11 12 i i
12 13 e e
13 14 f f
Changed:
< <  14 15
> >  14 15 <space> <space>
15 16 a A 1
16 17 s s
17 18 t t
18 19 r r  
Line: 443 to 429  
22 23 m m
23 24 e e
24 25 r r
Changed:
< <  25 26
> >  25 26 <space> <space>
26 27 s s
27 28 a a
28 29 y y
29 30 s s
Changed:
< <  30 31
> >  30 31 <space> <space>
31 32 t t
32 33 h h
33 34 a a
34 35 t t
Changed:
< <  35 36
> >  35 36 <space> <space>
36 37 b b
37 38 o o
38 39 t t
39 40 h h
Changed:
< <  40 41
> >  40 41 <space> <space>
41 42 t t
42 43 h h
43 44 e e
Changed:
< <  44 45
> >  44 45 <space> <space>
45 46 t t
46 47 e e
47 48 r r  
Line: 471 to 457  
50 51 c c
51 52 e e
52 53 s s
Changed:
< <  53 54
> >  53 54 <space> <space>
54 55 o o
55 56 f f
Changed:
< <  56 57
> >  56 57 <space> <space>
57 58 t t
58 59 h h
59 60 e e
Changed:
< <  60 61
> >  60 61 <space> <space>
61 62 s s
62 63 o o
63 64 u u
64 65 t t
65 66 h h
Changed:
< <  66 67
> >  66 67 <space> <space>
67 68 t T 1
68 69 o o
69 70 w w
70 71 e e
71 72 r r
Changed:
< <  72 73
> >  72 73 <space> <space>
73 74 a a
74 75 n n
75 76 d d
Changed:
< <  76 77
> >  76 77 <space> <space>
77 78 t t
78 79 h h
79 80 e e
Changed:
< <  80 81
> >  80 81 <space> <space>
81 82 w w
82 83 e e
83 84 s s  
Line: 506 to 492  
85 86 e e
86 87 r r
87 88 n n
Changed:
< <  88 89
> >  88 89 <space> <space>
89 90 m M 1
90 91 i i
91 92 l l
92 93 l l
93 94 s s
Changed:
< <  94 95
> >  94 95 <space> <space>
95 96 i i
96 97 n n
Changed:
< <  97 98
> >  97 98 <space> <space>
98 99 t t
99 100 h h
100 101 e e
Changed:
< <  101 102
> >  101 102 <space> <space>
102 103 E e 1
103 104 a a
104 105 s s
105 106 t t
Changed:
< <  106 107
> >  106 107 <space> <space>
107 108 u u
108 109 s s
109 110 e e
Changed:
< <  110 111
> >  110 111 <space> <space>
111 112 t t
112 113 h h
113 114 e e
Changed:
< <  114 115
> >  114 115 <space> <space>
115 116 E E
116 117 n n
117 118 g g  
Line: 540 to 526  
119 120 i i
120 121 s s
121 122 h h
Changed:
< <  122 123
> >  122 123 <space> <space>
123 124 C c 1
124 125 h h
125 126 a a
126 127 n n
127 128 n n
128 129 e e
129 130 l l
Changed:
< <  130 131
> >  130 131 <space> <space>
131 132 a a
132 133 s s
Changed:
< <  133 134
> >  133 134 <space> <space>
134 135 a a
Changed:
< <  135 136
> >  135 136 <space> <space>
136 137 s S 1
137 138 u u
138 139 p p
139 140 p p
140 141 l l
141 142 y y
Changed:
< <  142 143
> >  142 143 <space> <space>
143 144 p P 1
144 145 o o
145 146 o o
146 147 l l
Changed:
< <  147 148
> >  147 148 <space> <space>
148
Changed:  
< <  
> >  
Line: 1 to 1  

OpenFst Examples  
Line: 407 to 407  
$ fstcompose <(fstcompose ref.fst edit1.fst | fstarcsort) <(fstcompose edit2.fst hyp.fst | fstarcsort) \  
Changed:  
< <   | fstreverse | fstshortestpath | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
> >   | fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
Added:  
> >  Here is the output:
0 1 T T
1 2 h h
2 3 e e
3 4 <space> <space>
4 5 n N 1
5 6 i i
6 7 c c
7 8 e e
8 9 <space> <space>
9 10 c c
10 11 h h
11 12 i i
12 13 e e
13 14 f f
14 15 <space> <space>
15 16 a A 1
16 17 s s
17 18 t t
18 19 r r
19 20 o o
20 21 n n
21 22 o o
22 23 m m
23 24 e e
24 25 r r
25 26 <space> <space>
26 27 s s
27 28 a a
28 29 y y
29 30 s s
30 31 <space> <space>
31 32 t t
32 33 h h
33 34 a a
34 35 t t
35 36 <space> <space>
36 37 b b
37 38 o o
38 39 t t
39 40 h h
40 41 <space> <space>
41 42 t t
42 43 h h
43 44 e e
44 45 <space> <space>
45 46 t t
46 47 e e
47 48 r r
48 49 r r
49 50 a a
50 51 c c
51 52 e e
52 53 s s
53 54 <space> <space>
54 55 o o
55 56 f f
56 57 <space> <space>
57 58 t t
58 59 h h
59 60 e e
60 61 <space> <space>
61 62 s s
62 63 o o
63 64 u u
64 65 t t
65 66 h h
66 67 <space> <space>
67 68 t T 1
68 69 o o
69 70 w w
70 71 e e
71 72 r r
72 73 <space> <space>
73 74 a a
74 75 n n
75 76 d d
76 77 <space> <space>
77 78 t t
78 79 h h
79 80 e e
80 81 <space> <space>
81 82 w w
82 83 e e
83 84 s s
84 85 t t
85 86 e e
86 87 r r
87 88 n n
88 89 <space> <space>
89 90 m M 1
90 91 i i
91 92 l l
92 93 l l
93 94 s s
94 95 <space> <space>
95 96 i i
96 97 n n
97 98 <space> <space>
98 99 t t
99 100 h h
100 101 e e
101 102 <space> <space>
102 103 E e 1
103 104 a a
104 105 s s
105 106 t t
106 107 <space> <space>
107 108 u u
108 109 s s
109 110 e e
110 111 <space> <space>
111 112 t t
112 113 h h
113 114 e e
114 115 <space> <space>
115 116 E E
116 117 n n
117 118 g g
118 119 l l
119 120 i i
120 121 s s
121 122 h h
122 123 <space> <space>
123 124 C c 1
124 125 h h
125 126 a a
126 127 n n
127 128 n n
128 129 e e
129 130 l l
130 131 <space> <space>
131 132 a a
132 133 s s
133 134 <space> <space>
134 135 a a
135 136 <space> <space>
136 137 s S 1
137 138 u u
138 139 p p
139 140 p p
140 141 l l
141 142 y y
142 143 <space> <space>
143 144 p P 1
144 145 o o
145 146 o o
146 147 l l
147 148 <space> <space>
148  
Exercise 8
Create an edit transducer that: 
Line: 1 to 1  

OpenFst Examples  
Line: 169 to 169  
A transducer that downcases at the token level (but see Exercise 3a) can be created with:  
Changed:  
< <  $ fstinvert lexicon_opt.fst | fstcompose - full_downcase.fst | fstcompose - lexicon_opt.fst | fstrmepsilon | fstdeterminize | fstminimize >downcase_token.fst  
> >  $ fstinvert lexicon_opt.fst | fstcompose - full_downcase.fst | fstcompose - lexicon_opt.fst | fstrmepsilon | fstdeterminize | fstminimize >downcase_token.fst  
Exercise 2  
Line: 236 to 236  
The second FST, case_restore.fst is similar but uses only downcased letters. Case prediction can then be performed with:  
Changed:  
< <  $ fstcompose marsman.fst case_restore.fst | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
> >  $ fstcompose marsman.fst case_restore.fst | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
which gives:  
Line: 254 to 254  
First, we can use the input to restrict the composition chain as:  
Changed:  
< <  $ fstcompose full_downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst | fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
> >  $ fstcompose full_downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst | fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
This works fine but has the disadvantage that we don't have a single transducer to apply and we are depending on the input being  
Line: 264 to 264  
the transducer determinization and minimization of the result of the composition with wotw.fst :  
Changed:  
< <  $ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst  
> >  $ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst  
$ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize | fstarcsort --sort_type=ilabel >wotw.fst
$ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst  
Line: 360 to 360  
These transducers include new symbols <sub>, <del>, and <ins> that are used for the substitution, deletion and
insertion of other symbols respectively. In fact, the composition of these two transducers is equivalent to the
original edit transducer edit.fst . However, each of these transducers has 4V transitions where V is the number of  
Changed:  
< <  distinct symbols whereas the original edit transducer has (V+1)^2 - 1 transitions.  
> >  distinct symbols, whereas the original edit transducer has (V+1)^2 - 1 transitions.  
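As a quick arithmetic check of the arc counts claimed above, a sketch (the symbol-count value is illustrative, not taken from the corpus):

```python
def edit_arc_counts(V):
    """Arc counts for an alphabet of V distinct symbols:
    full  - every (input, output) label pair, including <eps> on one side,
            minus the useless <eps>:<eps> pair: (V+1)^2 - 1
    factor - one factored transducer: identity, x:<sub>, x:<del>, and
             <ins>:x arcs, about 4 per symbol: 4V
    """
    full = (V + 1) ** 2 - 1
    factor = 4 * V
    return full, factor

# e.g. a 100-symbol alphabet: the full edit transducer needs 10200 arcs,
# while each factor needs only 400.
print(edit_arc_counts(100))  # -> (10200, 400)
```

The quadratic-versus-linear gap is what makes the factored form practical for large symbol tables.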
Given these factors, compute:
$ fstcompose ref.fst edit1.fst >ref_edit.fst  
Changed:  
< <  $ fstcompose edit2.hyp hyp.fst >hyp_edit.fst  
> >  $ fstcompose edit2.fst hyp.fst >hyp_edit.fst  
$ fstcompose ref_edit.fst hyp_edit.fst | fstshortestdistance --reverse | head -1  
Changed:  
< <  With large inputs, the short distance algorithm may need to use inadmissable pruning.  
> >  With large inputs, the shortest distance algorithm may need to use inadmissible pruning.  
This is because the edit transducer allows arbitrary insertions and deletions, so the search space is quadratic in the length of the input. Alternatively, the edit transducer could be changed (see Exercise 8b).  
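For intuition, the value that fstshortestdistance extracts from the ref/edit/hyp composition is the classic Levenshtein distance, which can be cross-checked with the textbook dynamic program (a standalone sketch, independent of OpenFst):

```python
def levenshtein(ref, hyp):
    """Unit-cost edit distance, the same value computed by composing
    ref.fst, edit.fst and hyp.fst and taking the shortest distance."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))  # distances for the empty-prefix row
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                          # deletion
                d[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1])  # substitution or match
            )
    return d[n]

print(levenshtein("kitten", "sitting"))  # -> 3
```

Unlike the transducer composition, this dynamic program restricts itself to the O(mn) table, which is why it does not need the pruning discussed above.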
Line: 379 to 379  
A threeway composition algorithm or specialized composition matchers and filters are approaches that could implement this more efficiently.  
Added:  
> >  As an example, we can see to what extent the case restoration transducer errs on a given input by computing the edit distance between the output it yields and the reference answer. We will use the Levenshtein distance.
First, generate
Create a transducer
We will now downcase
$ fstcompose ref.fst full_downcase.fst | fstcompose - case_restore.fst | fstshortestpath \
  | fstproject --project_output | fstrmepsilon | fstcompose - <(fstinvert lexicon_opt.fst) \
  | fstshortestpath | fsttopsort | fstproject --project_output \
  | fstpush --push_weights --remove_total_weight > hyp.fst
The result should be an FST representing a string along the lines of "The Nice chief Astronomer says that both the terraces of the south Tower and the western Mills in the east use the English channel as a Supply Pool ". Now, we can compute the edit distance as in the example above.
$ fstcompose <(fstcompose ref.fst edit1.fst | fstarcsort) <(fstcompose edit2.fst hyp.fst | fstarcsort) \
  | fstshortestdistance --reverse | head -1
For the given
$ fstcompose <(fstcompose ref.fst edit1.fst | fstarcsort) <(fstcompose edit2.fst hyp.fst | fstarcsort) \
  | fstreverse | fstshortestpath | fstprint --isymbols=levenshtein.syms --osymbols=levenshtein.syms  
Exercise 8
Create an edit transducer that:  
Line: 416 to 447  
 
Changed:  
< < 
 
> > 

Line: 1 to 1  

OpenFst Examples  
Line: 23 to 23  
 
Added:  
> > 
 
Tokenization  
Line: 46 to 47  
which produces:  
Changed:  
< <  .  
> >  
Suppose that Martian.fst and man.fst have similarly been created, then:  
Line: 101 to 102  
giving:  
Changed:  
< <  .  
> > 
Note that our construction of  
To generate a full lexicon of all 7102 distinct words in The War of the Worlds, it is convenient to dispense with the union
of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms .
Here is a Python script that does that and was used, along with the above steps,  
Changed:  
< <  to generate the full optimized lexicon.  
> >  to generate the full optimized lexicon (which you should compile to lexicon_opt.fst ).  
Exercise 1  
Line: 119 to 122  
11 -> eleven
111 -> one hundred eleven
1111 -> one thousand one hundred eleven  
Changed:  
< <  11111 -> eleven thousand one hundred eleven.  
> >  11111 -> eleven thousand one hundred eleven  
(b) Incorporate this transduction into the letter-to-token transduction above and apply to the input Mars is 4225 miles across. represented as letters.  
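The digit-to-words mapping asked for in this exercise can be prototyped in plain Python before being encoded as a transducer; this sketch (names invented for illustration) only covers numbers up to five digits, matching the examples above, and leaves the FST construction to the exercise:

```python
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def number_to_words(n):
    """Spell out 0 <= n <= 99999 in the style of the exercise examples."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
    if n < 1000:
        rest = number_to_words(n % 100) if n % 100 else ""
        return ONES[n // 100] + " hundred" + (" " + rest if rest else "")
    rest = number_to_words(n % 1000) if n % 1000 else ""
    return number_to_words(n // 1000) + " thousand" + (" " + rest if rest else "")

print(number_to_words(11111))  # -> eleven thousand one hundred eleven
print(number_to_words(4225))   # -> four thousand two hundred twenty five
```

In the FST version, each of these branches becomes a sub-transducer (units, teens, tens, hundreds, thousands) combined with concatenation and union.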
Line: 146 to 149  
A downcasing flower transducer for the full character set is here. This transducer can be applied to the Mars men automaton from the previous example with:  
Changed:  
< <  $ fstproject Marsman.fst | fstcompose - downcase.fst | fstproject --project_output >marsman.fst  
> >  $ fstproject Marsman.fst | fstcompose - full_downcase.fst | fstproject --project_output >marsman.fst  
giving:  
Line: 154 to 157  
Why use transducers for this when UNIX commands like  
Changed:  
< <  as easy to write (see Example 2). Second, the inverse of this transduction is less trivial and can be quite useful (see the next section).  
> >  as easy to write (see Example 2). Second, inverting this transduction is less trivial, and the inverse can be quite useful (see the next section).  
Finally, this transducer operates on any finitestate input not just a string. For example,  
Changed:  
< <  $ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstinvert >lexicon_opt_downcase.fst  
> >  $ fstinvert lexicon_opt.fst | fstcompose - full_downcase.fst | fstinvert >lexicon_opt_downcase.fst  
downcases the letters in the lexicon from the previous example.  
Line: 166 to 169  
A transducer that downcases at the token level (but see Exercise 3a) can be created with:  
Changed:  
< <  $ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstcompose - lexicon_opt.fst |  
> >  $ fstinvert lexicon_opt.fst | fstcompose - full_downcase.fst | fstcompose - lexicon_opt.fst |  
fstrmepsilon | fstdeterminize | fstminimize >downcase_token.fst  
Line: 175 to 178  
(a) upcases letters that are string-initial or after a punctuation symbol/space (capitalization transducer).  
Changed:  
< <  (b) converts lowercase underscore-separated identifiers such as num_final_states to the form NumFinalStates (camelcase transducer).  
> >  (b) converts lowercase underscore-separated identifiers such as num_final_states to the form NumFinalStates (CamelCase transducer).  
Exercise 3
(a) The letter-level downcasing transducer downcases any ASCII input. For which inputs does the token-level downcasing transducer work? What changes would be necessary to cover all inputs from wotw.syms ?  
Line: 186 to 189  
Create a 1,000,000 ASCII character string represented as an FST. Compose it on the left with downcase.fst and time the computation.
Compose it on the right and time the computation. The labels in downcase.fst were presorted on one side; use fstinfo to
determine which side. Use fstarcsort to sort downcase.fst on the opposite side and repeat the experiments above.  
Changed:  
< <  Given that composition matching uses binary search on the sorted side (with the higher outdegree, if both sides sorted), explain the differences in computation time that you observe.  
> >  Given that composition matching uses binary search on the sorted side (with the higher outdegree, if both sides are sorted), explain the differences in computation time that you observe.  
Case Restoration in Text
This example creates a transducer that attempts to restore the case of downcased input. This is the first nontrivial example and, in general, there is no error-free way to do this. The approach taken here will be to use case statistics gathered from The War of the Worlds source text  
Changed:  
< <  to help solve this. In particular, we will use an n-gram language model created on this text that is represented as a finite-state automaton in OpenFst format. Here is a typical path in this 5-gram automaton:  
> >  to help solve this. In particular, we will use an n-gram language model created on this text that is represented as a finite-state automaton in OpenFst format, which you should compile to the file wotw.lm . Here is a typical path in this 5-gram automaton:  
$ fstrandgen --select=log_prob wotw.lm | fstprint --isymbols=wotw.syms --osymbols=wotw.syms  
Line: 224 to 229  
# Before trying this, read the whole section.
$ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst
 
Changed:  
< <  $ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
> >  $ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst  
The first FST,  
Line: 249 to 254  
First, we can use the input to restrict the composition chain as:  
Changed:  
< <  $ fstcompose downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst |  
> >  $ fstcompose full_downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst |  
fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
Line: 262 to 267  
$ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst
$ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize | fstarcsort --sort_type=ilabel >wotw.fst  
Changed:  
< <  $ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
> >  $ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst  
This solution is a natural and simple one but has the disadvantage that the transducer determinization and minimization steps are quite  
Line: 275 to 280  
$ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm | fstarcsort --sort_type=ilabel >wotw_relabel.lm
# Relabels the language model input (required by lookahead implementation)
$ fstcompose lexicon_lookahead.fst wotw_relabel.lm >wotw.fst  
Changed:  
< <  $ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
> >  $ fstinvert full_downcase.fst | fstcompose - wotw.fst >case_restore.fst  
The relabeling of the input labels of the language model is a byproduct of how the lookahead matching works. Note that in order to use  
Line: 297 to 302  
Exercise 7
Create a transducer that converts the digits 0-9 into their possible telephone keypad alphabetic equivalents (e.g., 2: a,b,c; 3: d,e,f) and  
Changed:  
< <  allow for spaces as well. Use this transducer to convert the sentence no one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely into digits and spaces. Use the lexicon alone to disambiguate this  
> >  allows for spaces as well. Use this transducer to convert the sentence no one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely into digits and spaces. Use the lexicon alone to disambiguate this  
digit and space sequence (cf. T9 phone input). Now use both the lexicon and the language model to disambiguate it.
Edit Distance 
Line: 1 to 1  

OpenFst Examples  
Line: 293 to 293  
(a) The case restoration above can only work for words that are found in the text corpus. Describe an alternative that gives a plausible result on any letter sequence.  
Changed:  
< <  (b) Punctuation can give clues to the case of nearby words (e.g. i was in cambridge, ma. before. it was nice.). Describe a method to to exploit this information in case restoration.  
> >  (b) Punctuation can give clues to the case of nearby words (e.g. i was in cambridge, ma. before. it was nice.). Describe a method to exploit this information in case restoration.  
Exercise 7
Create a transducer that converts the digits 0-9 into their possible telephone keypad alphabetic equivalents (e.g., 2: a,b,c; 3: d,e,f) and  
Line: 324 to 323  
$ fstcompose ref.fst edit.fst | fstcompose - hyp.fst |
# Returns shortest distance from final states to the initial (first) state  
Changed:  
< <  fstshortdistance --reverse | head -1  
> >  fstshortestdistance --reverse | head -1  
computes the edit distance between the reference and hypothesis according to the edit transducer  
Line: 363 to 362  
$ fstcompose ref.fst edit1.fst >ref_edit.fst
$ fstcompose edit2.hyp hyp.fst >hyp_edit.fst  
Changed:  
< <  $ fstcompose ref_edit.fst hyp_edit.fst | fstshortdistance --reverse | head -1  
> >  $ fstcompose ref_edit.fst hyp_edit.fst | fstshortestdistance --reverse | head -1  
With large inputs, the short distance algorithm may need to use inadmissible pruning. 
Line: 1 to 1  

OpenFst Examples  
Line: 13 to 13  
 
Changed:  
< < 
 
> > 
 
With these files and the descriptions below, the reader should be able to repeat the examples. With about 340,000 words in The War of the Worlds, it is a small corpus that allows nontrivial examples.  
Line: 71 to 71  
In order to handle punctuation symbols, we change the lexicon construction to:
 
Changed:  
< <  $ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst  
> >  $ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst  
where:  
Line: 223 to 223  
# Before trying this, read the whole section.
 
Changed:  
< <  $ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst  
> >  $ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst  
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst 
Line: 1 to 1  

OpenFst Examples  
Line: 22 to 22  
A few general comments about the examples:
 
Changed:  
< < 
 
> > 
 
Tokenization  
Line: 181 to 181  
(a) The letter-level downcasing transducer downcases any ASCII input. For which inputs does the token-level downcasing transducer work? What changes would be necessary to cover all inputs from wotw.syms ?
(b) If a token The were applied to  
Added:  
> > 
Exercise 4
Create a 1,000,000 ASCII character string represented as an FST. Compose it on the left with downcase.fst and time the computation.
Compose it on the right and time the computation. The labels in downcase.fst were presorted on one side; use fstinfo to
determine which side. Use fstarcsort to sort downcase.fst on the opposite side and repeat the experiments above.
Given that composition matching uses binary search on the sorted side (with the higher outdegree, if both sides sorted), explain the differences in computation time that you observe.  
Case Restoration in Text
This example creates a transducer that attempts to restore the case of downcased input. This is the first nontrivial example and, in general, there is no error-free way to do this. The approach taken here will be to use case statistics gathered from The War of the Worlds source text  
Line: 216 to 222  
Given this language model and using the lexicon and downcasing transducers from the previous examples, a solution is:
 
Changed:  
< <  # Before trying this read the whole section. $ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst  
> >  # Before trying this, read the whole section. $ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst  
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
Line: 255 to 261  
$ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst  
Changed:  
< <  $ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize >wotw.fst  
> >  $ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize | fstarcsort --sort_type=ilabel >wotw.fst  
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
Line: 276 to 282  
the lookahead FST formats you must use --enable-lookahead-fsts=yes in the library configuration and you must set your
LD_LIBRARY_PATH (or equivalent) appropriately.  
Changed:  
< <  Exercise 4  
> >  Exercise 5  
(a) Find the weight of the second shortest distinct token sequence in the prediction example above.
(b) Find the weight of the second shortest distinct token sequence in the prediction example above without using the --nshortest flag (hint: use fstdifference). (c) Find all paths within weight 10 of the shortest path in the prediction example.  
Changed:  
< <  Exercise 5  
> >  Exercise 6  
(a) The case restoration above can only work for words that are found in the text corpus. Describe an alternative that gives a
plausible result on any letter sequence.
(b) Punctuation can give clues to the case of nearby words (e.g. i was in cambridge, ma. before. it was nice.). Describe a method to exploit this information in case restoration.  
Changed:  
< <  Exercise 6  
> >  Exercise 7  
Create a transducer that converts the digits 09 into their possible telephone keypad alphabetic equivalents (e.g., 2: a,b,c; 3: d,e,f) and allow for spaces as well. Use this transducer to convert the sentence no one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely into digits and spaces. Use the lexicon alone to disambiguate this digit and space sequence (cf. T9 phone input). Now use both the lexicon and the language model to disambiguate it.  
Line: 328 to 334  
This counts any substitution (a:b, b:a), insertion (<epsilon>:a, <epsilon>:b), or deletion (a:<epsilon>, b:<epsilon>) as 1 edit and matches (a:a, b:b) as zero edits. For word error rate, we use the Levenshtein edit distance, i.e. where the costs of substitutions, insertions, and deletions are all the same. However, each pairing of a symbol (or epsilon) with another symbol can be given a separate cost in a more general edit distance. This can obviously be implemented by choosing different weights  
Changed:  
< <  for the corresponding edit transducer transitions. Even more general edit distances can be defined (see Exercise 7).  
> >  for the corresponding edit transducer transitions. Even more general edit distances can be defined (see Exercise 8).  
Note that if the hypothesis is not a string but a more general automaton representing a set of  
Changed:  
< <  hypotheses (e.g. the result from Exercise 4c) then this procedure returns the oracle edit distance, i.e., the edit distance of the bestmatching ('oracleprovided') hypothesis  
> >  hypotheses (e.g. the result from Exercise 5c) then this procedure returns the oracle edit distance, i.e., the edit distance of the bestmatching ('oracleprovided') hypothesis  
compared to the reference. The corresponding oracle error rate is a measure of the quality of the hypothesis set (often called a 'lattice').
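The quantity the composed machines compute is the classical Levenshtein recurrence, and the oracle edit distance is simply its minimum over the hypothesis set. As a sanity check, here is a toy dynamic-programming sketch (plain Python, not OpenFst code; the function names are our own):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitution, insertion, and deletion each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from "" to each prefix of hyp
    for i in range(1, len(ref) + 1):
        cur = [i]
        for j in range(1, len(hyp) + 1):
            cur.append(min(
                prev[j - 1] + (ref[i - 1] != hyp[j - 1]),  # match (a:a) or substitution (a:b)
                cur[j - 1] + 1,                            # insertion (<epsilon>:b)
                prev[j] + 1))                              # deletion (a:<epsilon>)
        prev = cur
    return prev[-1]

def oracle_edit_distance(ref, hyps):
    """Edit distance of the best-matching hypothesis in a set (a 'lattice')."""
    return min(edit_distance(ref, h) for h in hyps)

print(edit_distance("Mars man", "mars men"))                        # 2
print(oracle_edit_distance("Mars man", ["mars men", "Mars man!"]))  # 1
```

The transducer formulation generalizes this immediately: changing an arc weight changes the cost of one particular pairing, which the plain recurrence cannot express without extra tables.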
There is one serious problem with this approach and that is when the symbol set is large. For the 95 letter  
Line: 362 to 368  
With large inputs, the shortest-distance algorithm may need to use inadmissible pruning. This is because the edit transducer allows arbitrary insertions and deletions, so the search space is quadratic in the length of  
Changed:  
< <  the input. Alternatively the edit transducer could be changed (see Exercise 7b).  
> >  the input. Alternatively the edit transducer could be changed (see Exercise 8b).  
With more general edit transducers, this factoring may not be possible. In that case, representing the edit transducer in some specialized compact FST representation would be possible but pairwise compositions might be very expensive. A three-way composition algorithm or specialized composition matchers and filters are approaches that could implement this more efficiently.  
Changed:  
< <  Exercise 7  
> >  Exercise 8  
Create an edit transducer that:
(a) allows only a fixed number N of contiguous insertions or deletions.  
Line: 378 to 384  
common spelling variants like or vs our or ction vs. xion are given lower cost.  
Changed:  
< <  Exercise 8  
> >  Exercise 9  
Provide a way to:
(a) compute the error rate rather than the edit distance using transducers. 
Line: 1 to 1  

OpenFst Examples  
Line: 290 to 290  
(b) Punctuation can give clues to the case of nearby words (e.g. i was in cambridge, ma. before. it was nice.). Describe a method to exploit this information in case restoration.  
Added:  
> >  Exercise 6Create a transducer that converts the digits 09 into their possible telephone keypad alphabetic equivalents (e.g., 2: a,b,c; 3: d,e,f) and allow for spaces as well. Use this transducer to convert the sentence no one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely into digits and spaces. Use the lexicon alone to disambiguate this digit and space sequence (cf. T9 phone input). Now use both the lexicon and the language model to disambiguate it.  
Edit DistanceSince the predictions made in the previous example might not always be correct, we may want to measure the error when  
Line: 323 to 328  
This counts any substitution (a:b, b:a), insertion (<epsilon>:a, <epsilon>:b), or deletion (a:<epsilon>, b:<epsilon>) as 1 edit and matches (a:a, b:b) as zero edits. For word error rate, we use the Levenshtein edit distance, i.e. where the costs of substitutions, insertions, and deletions are all the same. However, each pairing of a symbol (or epsilon) with another symbol can be given a separate cost in a more general edit distance. This can obviously be implemented by choosing different weights  
Changed:  
< <  for the corresponding edit transducer transitions. Even more general edit distances can be defined (see Exercise 6).  
> >  for the corresponding edit transducer transitions. Even more general edit distances can be defined (see Exercise 7).  
Note that if the hypothesis is not a string but a more general automaton representing a set of hypotheses (e.g. the result from Exercise 4c) then this procedure returns the oracle edit distance, i.e., the edit distance of the bestmatching ('oracleprovided') hypothesis  
Line: 357 to 362  
With large inputs, the shortest-distance algorithm may need to use inadmissible pruning. This is because the edit transducer allows arbitrary insertions and deletions, so the search space is quadratic in the length of  
Changed:  
< <  the input. Alternatively the edit transducer could be changed (see Exercise 6b).  
> >  the input. Alternatively the edit transducer could be changed (see Exercise 7b).  
With more general edit transducers, this factoring may not be possible. In that case, representing the edit transducer in some specialized compact FST representation would be possible but pairwise compositions might be very expensive. A three-way composition algorithm or specialized composition matchers and filters are approaches that could implement this more efficiently.  
Changed:  
< <  Exercise 6  
> >  Exercise 7  
Create an edit transducer that:
(a) allows only a fixed number N of contiguous insertions or deletions.  
Line: 373 to 378  
common spelling variants like or vs our or ction vs. xion are given lower cost.  
Changed:  
< <  Exercise 7  
> >  Exercise 8  
Provide a way to:
(a) compute the error rate rather than the edit distance using transducers. 
Line: 1 to 1  

OpenFst Examples  
Line: 236 to 236  
In other words, the most likely case of the input is determined with respect to the n-gram model.
There is a serious problem, however, with the above solution. For all but tiny corpora,  
Changed:  
< <  the first composition will blow up with the classical composition algorithm since the  
> >  the first composition is extremely expensive with the classical composition algorithm since the  
output labels in lexicon_opt.fst have been pushed back when it was determinized and this greatly delays matching
with the labels in wotw.lm . There are three possible solutions:  
Line: 263 to 263  
expensive. A final solution is to use an FST representation that allows lookahead matching, which composition can exploit to avoid the matching delays:  
Changed:  
< <  $ fstconvert --fst_type=olabel_lookahead --save_relabel_opairs=relabel.pairs lexicon_opt.fst >lexicon_lookahead.fst
$ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm | fstarcsort --sort_type=ilabel >wotw_relabel.lm
$ fstcompose lexicon_lookahead.fst wotw_relabel.lm >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
> >  # Converts to a lookahead lexicon
$ fstconvert --fst_type=olabel_lookahead --save_relabel_opairs=relabel.pairs lexicon_opt.fst >lexicon_lookahead.fst
$ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm | fstarcsort --sort_type=ilabel >wotw_relabel.lm
# Relabels the language model input (required by lookahead implementation)
$ fstcompose lexicon_lookahead.fst wotw_relabel.lm >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
The relabeling of the input labels of the language model is a byproduct of how the lookahead matching works. Note that in order to use
the lookahead FST formats you must use --enable-lookahead-fsts=yes in the library configuration and you must set your  
Line: 309 to 311  
 
Changed:  
< <  fstcompose ref.fst edit.fst | fstcompose - hyp.fst |  
> >  $ fstcompose ref.fst edit.fst | fstcompose - hyp.fst |  
# Returns shortest distance from final states to the initial (first) state  
Changed:  
< <  fstshortestdistance --reverse | head -1  
> >  $ fstshortestdistance --reverse | head -1  
computes the edit distance between the reference and hypothesis according to the edit transducer  
Line: 348 to 350  
Given these factors, compute:  
Changed:  
< <  fstcompose ref.fst edit1.fst >ref_edit.fst
fstcompose edit2.fst hyp.fst >hyp_edit.fst
fstcompose ref_edit.fst hyp_edit.fst | fstshortestdistance --reverse | head -1  
> >  $ fstcompose ref.fst edit1.fst >ref_edit.fst
$ fstcompose edit2.fst hyp.fst >hyp_edit.fst
$ fstcompose ref_edit.fst hyp_edit.fst | fstshortestdistance --reverse | head -1  
With large inputs, the shortest-distance algorithm may need to use inadmissible pruning.  
Line: 378 to 380  
(b) compute the oracle error path as well as the oracle rate for a lattice.  
Deleted:  
< <  Spelling Correction  

Line: 1 to 1  

OpenFst Examples  
Line: 15 to 15  
 
Changed:  
< <  With these files and the descriptions below, the reader should be able to repeat the examples. With about 340,000 words in 'The War of the Worlds', it is a small but not toy corpus that allows nontrivial examples.  
> >  With these files and the descriptions below, the reader should be able to repeat the examples. With about 340,000 words in The War of the Worlds, it is a small corpus that allows nontrivial examples.  
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)  
Line: 163 to 163  
downcases the letters in the lexicon from the previous example.  
Changed:  
< <  A transducer that downcases at the token level can be created with:  
> >  A transducer that downcases at the token level (but see Exercise 3a) can be created with:  
$ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstcompose - lexicon_opt.fst  
Line: 215 to 215  
Given this language model and using the lexicon and downcasing transducers from the previous examples, a solution is:  
Changed:  
< <  $ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst $ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst  
> >  # Before trying this read the whole section.
$ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst
 
The first FST, wotw.fst , maps from letters to tokens following the probability distribution of the language model.
The second FST, case_restore.fst is similar but uses only downcased letters. Case prediction can then be performed with:
$ fstcompose marsman.fst case_restore.fst | fstshortestpath |  
Changed:  
< <  fstproject --project_output | fstrmepsilon >prediction.fst  
> >  fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
which gives:  
Line: 243 to 244  
$ fstcompose downcase.fst marsman.fst | fstinvert | fstcompose - lexicon_opt.fst |  
Changed:  
< <  fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon >prediction.fst  
> >  fstcompose - wotw.lm | fstshortestpath | fstproject --project_output | fstrmepsilon | fsttopsort >prediction.fst  
This works fine but has the disadvantage that we don't have a single transducer to apply and we are depending on the input being  
Line: 306 to 307  
Suppose the reference and (unweighted) hypothesis are represented as finitestate automata ref.fst and hyp.fst respectively. Then:  
Changed:  
< <  fstcompose ref.fst edit.fst | fstcompose - hyp.fst | fstshortestdistance --reverse | head -1  
> > 
fstcompose ref.fst edit.fst | fstcompose - hyp.fst 
# Returns shortest distance from final states to the initial (first) state
fstshortestdistance --reverse | head -1
 
computes the edit distance between the reference and hypothesis according to the edit transducer edit.fst . The edit transducer for
two letters a and b is the flower automaton:  
Line: 372 to 376  
(a) compute the error rate rather than the edit distance using transducers.  
Changed:  
< <  (b) compute the oracle error path as well as the oracle rate on a lattice.  
> >  (b) compute the oracle error path as well as the oracle rate for a lattice.  
Spelling Correction 
Line: 1 to 1  

OpenFst Examples  
Line: 230 to 230  
which gives:  
Changed:  
< <  
> >  
In other words, the most likely case of the input is determined with respect to the n-gram model.  
Line: 278 to 278  
(b) Find the weight of the second shortest distinct token sequence in the prediction example above without using the --nshortest flag (hint: use fstdifference).  
Changed:  
< <  (c) Find all paths within weight 5 of the shortest path in the prediction example.  
> >  (c) Find all paths within weight 10 of the shortest path in the prediction example.  
Exercise 5(a) The case restoration above can only work for words that are found in the text corpus. Describe an alternative that gives a  
Line: 287 to 287  
(b) Punctuation can give clues to the case of nearby words (e.g. i was in cambridge, ma. before. it was nice.). Describe a method to exploit this information in case restoration.  
Changed:  
< <  Edit Distance  
> >  Edit Distance  
Added:  
> >  Since the predictions made in the previous example might not always be correct, we may want to measure the error when we have the correct reference answers as well. One common error measure is computed by aligning the hypothesis and reference, defining:  
Added:  
> >  edit distance = # of substitutions + # of deletions + # of insertions
and then defining  
Changed:  
< <  Spelling Correction  
> >  error rate = edit distance / # of reference symbols
If this is computed on letters, it is called the letter error rate; on words, it is called the word error rate.
Suppose the reference and (unweighted) hypothesis are represented as finite-state automata ref.fst and hyp.fst respectively. Then:
fstcompose ref.fst edit.fst | fstcompose - hyp.fst | fstshortestdistance --reverse | head -1
computes the edit distance between the reference and hypothesis according to the edit transducer edit.fst .
This counts any substitution (a:b, b:a), insertion (<epsilon>:a, <epsilon>:b), or deletion (a:<epsilon>, b:<epsilon>) as 1 edit and matches (a:a, b:b) as zero edits. For word error rate, we use the Levenshtein edit distance, i.e. where the costs of substitutions, insertions, and deletions are all the same. However, each pairing of a symbol (or epsilon) with another symbol can be given a separate cost in a more general edit distance. This can obviously be implemented by choosing different weights for the corresponding edit transducer transitions. Even more general edit distances can be defined (see Exercise 6). Note that if the hypothesis is not a string but a more general automaton representing a set of hypotheses (e.g. the result from Exercise 4c) then this procedure returns the oracle edit distance, i.e., the edit distance of the best-matching ('oracle-provided') hypothesis compared to the reference. The corresponding oracle error rate is a measure of the quality of the hypothesis set (often called a 'lattice').
There is one serious problem with this approach and that is when the symbol set is large. For the 95 letter
For the Levenshtein distance, there is a simple solution: factor the edit transducer into two components. Using the example above, the left
factor,  
Added:  
> >  
Added:  
> >  and the right factor, edit2.fst , is:
These transducers include new symbols <sub>, <del>, and <ins> that are used for the substitution, deletion and
insertion of other symbols respectively. In fact, the composition of these two transducers is equivalent to the
original edit transducer. Given these factors, compute:
fstcompose ref.fst edit1.fst >ref_edit.fst
fstcompose edit2.fst hyp.fst >hyp_edit.fst
fstcompose ref_edit.fst hyp_edit.fst | fstshortestdistance --reverse | head -1
With large inputs, the shortest-distance algorithm may need to use inadmissible pruning. This is because the edit transducer allows arbitrary insertions and deletions, so the search space is quadratic in the length of the input. Alternatively the edit transducer could be changed (see Exercise 6b). With more general edit transducers, this factoring may not be possible. In that case, representing the edit transducer in some specialized compact FST representation would be possible but pairwise compositions might be very expensive. A three-way composition algorithm or specialized composition matchers and filters are approaches that could implement this more efficiently.
Exercise 6Create an edit transducer that:(a) allows only a fixed number N of contiguous insertions or deletions. (b) computes the Levenshtein distance between American and English spellings of words except that common spelling variants like or vs our or ction vs. xion are given lower cost.
Exercise 7Provide a way to:(a) compute the error rate rather than the edit distance using transducers. (b) compute the oracle error path as well as the oracle rate on a lattice.
Spelling Correction  
 
Line: 312 to 393  
 
Deleted:  
< < 
 
 
Added:  
> > 

Line: 1 to 1  

 
Changed:  
< <  OpenFst Examples  
> >  OpenFst Examples  
Reading the quick tour first is recommended. That includes a simple example of FST application using either the C++ template level or the shelllevel operations. The advanced usage topic contains an implementation using the templatefree intermediate scripting level as well.  
Line: 14 to 14  
 
Changed:  
< <  With these files and the descriptions below, the reader should be able to repeat the examples.  
> >  With these files and the descriptions below, the reader should be able to repeat the examples. With about 340,000 words in 'The War of the Worlds', it is a small but not toy corpus that allows nontrivial examples.  
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)  
Line: 54 to 54  
$ fstunion man.fst Mars.fst  fstunion  Martian.fst  fstclosure >lexicon.fst  
Changed:  
< <  produces a finite-state lexicon that transduces zero or more spelled-out word sequences into their word tokens.  
> >  produces a finite-state lexicon that transduces zero or more spelled-out word sequences into their word tokens:  
Line: 64 to 64  
$ fstrmepsilon lexicon.fst  fstdeterminize  fstminimize >lexicon_opt.fst  
Changed:  
< <  resulting in the compact:  
> >  resulting in the equivalent, deterministic and minimal:  
Line: 89 to 89  
is a transducer that deletes common punctuation symbols. The full punctuation transducer is here.  
Changed:  
< <  Now, the tokenizaton of the an example string Mars man encoded as an FST:  
> >  Now, the tokenization of the example string Mars man encoded as an FST:  
Line: 153 to 153  
Added:  
> >  Why use transducers for this when UNIX commands like tr and C library routines like tolower are some of the many easy ways to downcase text? Transducers have several advantages over these approaches. First, more complex transformations are almost
as easy to write (see Example 2). Second, the inverse of this transduction is less trivial and can be quite useful (see the next section).
Finally, this transducer operates on any finitestate input not just a string. For example,
$ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstinvert >lexicon_opt_downcase.fst
downcases the letters in the lexicon from the previous example.  
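The downcasing machine is a single-state "flower" transducer: every arc loops back to the one state, mapping each symbol to its output. A toy sketch of that behavior (plain Python, not OpenFst code; names are our own):

```python
# A one-state flower transducer is just a per-symbol map applied at every
# position; symbols without an explicit arc here pass through unchanged
# (these are the identity arcs in the real transducer).
DOWNCASE = {c: c.lower() for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}

def apply_flower(mapping, text):
    return "".join(mapping.get(c, c) for c in text)

print(apply_flower(DOWNCASE, "The Mars man."))  # the mars man.
```

Unlike this string-only sketch, the transducer version also applies to any finite-state input, which is exactly what the lexicon-downcasing composition above exploits.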
A transducer that downcases at the token level can be created with:  
Line: 161 to 171  
Exercise 2  
Added:  
> >  Create a transducer that:
(a) upcases letters that are stringinitial or after a punctuation symbol/space (capitalization transducer).  
Added:  
> >  (b) converts lowercase underscoreseparated identifiers such as num_final_states to the form NumFinalStates (camelcase transducer).
Exercise 3  
(a) The letterlevel downcasing transducer downcases any ASCII input. For which inputs does the tokenlevel downcasing transducer work? What changes would be necessary to cover all inputs from wotw.syms ?
(b) If a token The were applied to downcase_token.fst , what would the output look like? What would it look like if the optimizations were omitted from the construction of downcase_token.fst ?
Case Restoration in Text  
Changed:  
< <  This example creates a transducer that attempts to restore the case of downcased input. This is the first nontrivial task and there is no errorfree way to do this. The approach taken here will be to use case statistics gathered from the The War of the Worlds source text to help solve this. In particular, we will use an ngram language model created on this text that is represented as a finitestate automaton in OpenFst format. Here is a typical path in this 5gram automaton:  
> >  This example creates a transducer that attempts to restore the case of downcased input. This is the first nontrivial example and, in general, there is no errorfree way to do this. The approach taken here will be to use case statistics gathered from the The War of the Worlds source text
to help solve this. In particular, we will use an ngram language model created on this text that is represented as a finitestate automaton in OpenFst format. Here is a typical path in this 5gram automaton:  
$ fstrandgen --select=log_prob wotw.lm | fstprint --isymbols=wotw.syms --osymbols=wotw.syms  
Line: 195 to 211  
This model is constructed to have a transition for every 1gram to 5gram seen in 'War of the Worlds' with its weight related to the  
Changed:  
< <  (negative log) probability of that ngram occurring in the text corpus. The transitions correspond to backoff transitions
used in the smoothing of the model to allow accepting input sequences not seen in training.  
> >  (negative log) probability of that n-gram occurring in the text corpus. The epsilon transitions correspond to backoff transitions in the smoothing of the model that was performed to allow accepting input sequences not seen in training.  
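The backoff mechanism can be sketched with toy numbers (not taken from wotw.lm; all costs here are invented for illustration): when an n-gram arc exists its negative-log probability is added directly, and otherwise the epsilon backoff arc's cost is paid before falling back to a lower-order estimate.

```python
# Toy bigram model with backoff, mirroring the <epsilon> backoff arcs in the
# language-model automaton. Costs are negative log probabilities (invented).
bigram_cost = {("the", "man"): 1.2, ("man", "said"): 2.0}
unigram_cost = {"the": 1.0, "man": 3.0, "said": 4.0}
backoff_cost = {"the": 0.5, "man": 0.7}  # cost of taking the backoff (epsilon) arc

def sentence_cost(words):
    total = unigram_cost[words[0]]
    for prev, cur in zip(words, words[1:]):
        if (prev, cur) in bigram_cost:
            total += bigram_cost[(prev, cur)]          # seen bigram arc
        else:
            total += backoff_cost[prev] + unigram_cost[cur]  # epsilon arc, then unigram
    return total

print(sentence_cost(["the", "man", "said"]))  # 4.2
```

In the tropical semiring the shortest path through the automaton accumulates exactly these per-arc costs, which is why fstshortestpath picks the most probable token sequence.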
Changed:  
< <  Given this language model and using the lexicon and downcasing transducers from the previous examples (preferably a more complete one constructed in Exercise 2a), a solution is:  
> >  Given this language model and using the lexicon and downcasing transducers from the previous examples, a solution is:  
$ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst  
Line: 255 to 270  
The relabeling of the input labels of the language model is a byproduct of how the lookahead matching works. Note in order to use  
Changed:  
< <  the lookahead FST formats you must use --enable-lookahead-fsts in the library compilation and you must set your  
> >  the lookahead FST formats you must use --enable-lookahead-fsts=yes in the library configuration and you must set your  
LD_LIBRARY_PATH (or equivalent) appropriately.  
Changed:  
< <  Exercise 3  
> >  Exercise 4  
(a) Find the weight of the second shortest distinct token sequence in the prediction example above.
(b) Find the weight of the second shortest distinct token sequence in the prediction example above without using the --nshortest flag (hint: use fstdifference). (c) Find all paths within weight 5 of the shortest path in the prediction example.  
Changed:  
< <  Edit Distance  
> >  Exercise 5(a) The case restoration above can only work for words that are found in the text corpus. Describe an alternative that gives a plausible result on any letter sequence.(b) Punctuation can give clues to the case of nearby words (e.g. i was in cambridge, ma. before. it was nice.). Describe a method to exploit this information in case restoration.
Edit Distance  
Changed:  
< <  Spelling Correction  
> >  Spelling Correction  
Line: 1 to 1  

OpenFst ExamplesReading the quick tour first is recommended. That includes a simple example of FST application using either the C++ template level or the shelllevel operations. The advanced usage topic contains an implementation using the templatefree intermediate scripting level as well.  
Changed:  
< <  In the examples below, we use the shelllevel operations for convenience. The following data files are used in these examples:  
> >  The following data files are used in the examples below:  
 
Changed:  
< < 
 
> > 
 
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)  
Added:  
> >  A few general comments about the examples:
 
TokenizationThe first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped.  
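The net effect of the tokenization pipeline can be summarized with a toy sketch (plain Python, not the transducer construction; the function name is our own): delete punctuation, then split on whitespace to get word tokens.

```python
import string

# Toy sketch of what the lexicon/punctuation transducers accomplish:
# punctuation symbols are deleted and the remaining letter runs
# become word tokens.
def tokenize(text):
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()

print(tokenize("Mars man!"))  # ['Mars', 'man']
```

The transducer version does this with a lexicon of known words, so, unlike this sketch, it also maps each letter sequence to a single word-token symbol from wotw.syms.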
Line: 102 to 109  
to generate the full optimized lexicon.  
Changed:  
< <  Exercise  
> >  Exercise 1  
The above tokenization does not handle numeric character input.
(a) Create a transducer that maps numbers in the range 0 - 999999 represented as digit strings to their English read form, e.g.:  
Line: 149 to 156  
A transducer that downcases at the token level can be created with:  
Changed:  
< <  $ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstcompose - lexicon_opt.fst >downcase_token.fst  
> >  $ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstcompose - lexicon_opt.fst | fstrmepsilon | fstdeterminize | fstminimize >downcase_token.fst  
Changed:  
< <  Exercise  
> >  Exercise 2  
Changed:  
< <  The letterlevel downcasing transducer downcases any ASCII input. For which inputs does the tokenlevel downcasing transducer work?  
> >  (a) The letterlevel downcasing transducer downcases any ASCII input. For which inputs does the tokenlevel downcasing transducer work? What changes would be necessary to cover all inputs from wotw.syms ?  
Added:  
> >  (b) If a token The were applied to downcase_token.fst , what would the output look like? What would it look like if the optimizations (epsilonremoval, determinization and minimization) were omitted from the construction of downcase_token.fst .  
Case Restoration in Text  
Changed:  
< <  Edit Distance  
> >  This example creates a transducer that attempts to restore the case of downcased input. This is the first nontrivial task and there is no errorfree way to do this. The approach taken here will be to use case statistics gathered from the The War of the Worlds source text to help solve this. In particular, we will use an ngram language model created on this text that is represented as a finitestate automaton in OpenFst format. Here is a typical path in this 5gram automaton:  
Changed:  
< <  Spelling Correction  
> >  $ fstrandgen --select=log_prob wotw.lm | fstprint --isymbols=wotw.syms --osymbols=wotw.syms
0 1 The The
1 2 desolating desolating
2 3 cry cry
3 4 <epsilon> <epsilon>
4 5 worked worked
5 6 <epsilon> <epsilon>
6 7 upon upon
7 8 my my
8 9 mind mind
9 10 <epsilon> <epsilon>
10 11 once once
11 12 <epsilon> <epsilon>
12 13 <epsilon> <epsilon>
13 14 I I
14 15 <epsilon> <epsilon>
15 16 <epsilon> <epsilon>
16 17 slept slept
17 18 <epsilon> <epsilon>
18 19 little little
19
This model is constructed to have a transition for every 1-gram to 5-gram seen in 'War of the Worlds' with its weight related to the
(negative log) probability of that n-gram occurring in the text corpus. The <epsilon> transitions correspond to backoff transitions used in the smoothing of the model to allow accepting input sequences not seen in training. Given this language model and using the lexicon and downcasing transducers from the previous examples (preferably a more complete one constructed in Exercise 2a), a solution is:
$ fstcompose lexicon_opt.fst wotw.lm | fstarcsort --sort_type=ilabel >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst
The first FST, wotw.fst , maps from letters to tokens following the probability distribution of the language model. The second FST, case_restore.fst , is similar but uses only downcased letters. Case prediction can then be performed with:
$ fstcompose marsman.fst case_restore.fst | fstshortestpath | fstproject --project_output | fstrmepsilon >prediction.fst
which gives:
In other words, the most likely case of the input is determined with respect to the n-gram model.  
Added:  
> >  There is a serious problem, however, with the above solution. For all but tiny corpora,
the first composition will blow up with the classical composition algorithm since the
output labels in lexicon_opt.fst have been pushed back when it was determinized and this greatly delays matching
with the labels in wotw.lm . There are three possible solutions:  
Added:  
> >  First, we can use the input to restrict the composition chain as:
$ fstcompose downcase.fst marsman.fst  fstinvert  fstcompose  lexicon_opt.fst  fstcompose  wotw.lm  fstshortestpath  fstproject project_output  fstrmepsilon >prediction.fst
This works fine but has the disadvantage that we don't have a single transducer to apply and we are depending on the input being
a string or otherwise small. A second solution, which gives a single optimized transducer, is to replace transducer determinization and minimization of lexicon.fst with automata determinization and minimization
(via encoding the input and output label pairs into a single new label) followed by
the transducer determinization and minimization of the result of the composition with
$ fstencode --encode_labels lexicon.fst enc.dat | fstdeterminize | fstminimize | fstencode --decode - enc.dat >lexicon_compact.fst
$ fstcompose lexicon_compact.fst wotw.lm | fstdeterminize | fstminimize >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst
This solution is a natural and simple one but has the disadvantage that the transducer determinization and minimization steps are quite expensive. A final solution is to use an FST representation that allows lookahead matching, which composition can exploit to avoid the matching delays:
$ fstconvert --fst_type=olabel_lookahead --save_relabel_opairs=relabel.pairs lexicon_opt.fst >lexicon_lookahead.fst
$ fstrelabel --relabel_ipairs=relabel.pairs wotw.lm | fstarcsort --sort_type=ilabel >wotw_relabel.lm
$ fstcompose lexicon_lookahead.fst wotw_relabel.lm >wotw.fst
$ fstinvert downcase.fst | fstcompose - wotw.fst >case_restore.fst
The relabeling of the input labels of the language model is a byproduct of how the lookahead matching works. Note in order to use
the lookahead FST formats you must use --enable-lookahead-fsts in the library compilation and you must set your LD_LIBRARY_PATH (or equivalent) appropriately.
Exercise 3(a) Find the weight of the second shortest distinct token sequence in the prediction example above.
(b) Find the weight of the second shortest distinct token sequence in the prediction example above without using the --nshortest flag (hint: use fstdifference). (c) Find all paths within weight 5 of the shortest path in the prediction example.
Edit Distance
Spelling Correction  
Line: 184 to 288  
 
Line: 1 to 1  

OpenFst Examples  
Line: 12 to 12  
 
With these files and the descriptions below, the reader should be able to repeat the examples.
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)  
Line: 24 to 24  
is using the OpenFst text format. For example, the word Mars would have the form:  
Changed:  
< <  $ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >Mars.fst <<EOF  
> >  $ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >Mars.fst <<EOF  
0 1 M Mars
1 2 a <epsilon>  
Line: 35 to 35  
This can be drawn with:  
Changed:  
< <  $ fstdraw --isymbols=ascii.syms --osymbols=wotw.syms --portrait Mars.fst | dot -Tjpg >Mars.jpg  
> >  $ fstdraw --isymbols=ascii.syms --osymbols=wotw.syms --portrait Mars.fst | dot -Tjpg >Mars.jpg  
which produces:  
Line: 70 to 70  
where:  
Changed:  
< <  $ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >punct.fst <<EOF  
> >  $ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >punct.fst <<EOF  
0 1 <space> <epsilon>  
Line: 80 to 80  
EOF  
Changed:  
< <  is a transducer that deletes common punctuation symbols.  
> >  is a transducer that deletes common punctuation symbols. The full punctuation transducer is here.  
Now, the tokenization of an example string Mars man encoded as an FST:  
Line: 94 to 94  
giving:  
To generate a full lexicon of all 7102 distinct words in the War of Worlds, it is convenient to dispense with the union
of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms .  
Line: 122 to 122  
The next example converts casesensitive input to all lowercase output. To do the conversion, we create a flower transducer of the form:  
Changed:  
< <  $ fstcompile --isymbols=ascii.syms --osymbols=ascii.syms >downcase.fst <<EOF 0 0 a a 0 0 b b  
> >  $ fstcompile --isymbols=ascii.syms --osymbols=ascii.syms >downcase.fst <<EOF 0 0 ! !  
0 0 A a 0 0 B b  
Changed:  
< <  0 0 ! !  
> >  0 0 a a 0 0 b b  
0
EOF  
Line: 139 to 139  
A downcasing flower transducer for the full character set is here. This transducer can be applied to the Mars men automaton from the previous example with:  
Changed:  
< <  $ fstcompose Marsman.fst downcase.fst | fstproject --project_output >marsman.fst  
> >  $ fstproject Marsman.fst | fstcompose - downcase.fst | fstproject --project_output >marsman.fst  
giving:  
Line: 149 to 149  
A transducer that downcases at the token level can be created with:  
Changed:  
< <  $ fstinvert lexicon_opt.fst downcase.fst  fstcompose  lexicon_opt.fst >downcase_token.fst  
> >  $ fstinvert lexicon_opt.fst | fstcompose - downcase.fst | fstcompose - lexicon_opt.fst >downcase_token.fst  
Exercise  
Line: 175 to 175  
 
Line: 1 to 1  

OpenFst Examples  
Line: 8 to 8  
 
With these files and the descriptions below, the reader should be able to repeat the examples.
(Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)  
Line: 20 to 20  
TokenizationThe first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped.  
Changed:  
< <  To do so we will need a lexicon transducer that maps from letters to their corresponding word token. A simple way to generate this  
> >  To do so we will need a lexicon transducer that maps from letters to their corresponding word tokens. A simple way to generate this  
is using the OpenFst text format. For example, the word Mars would have the form:  
Line: 41 to 41  
.  
Changed:  
< <  Suppose that Martian.fst and man.fst have similarly been created, then:  
> >  Suppose that Martian.fst and man.fst have similarly been created, then:  
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst  
Changed:  
< <  produces a finite-state lexicon of that transduces zero or more spelled-out word sequences into to their word tokens.  
> >  produces a finite-state lexicon that transduces zero or more spelled-out word sequences into to their word tokens.  
Line: 97 to 97  
.
To generate a full lexicon of all 7102 distinct words in the War of Worlds, it is convenient to dispense with the union  
Changed:  
< <  of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms.  
> >  of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms .  
Here is a python script that does that and was used, along with the above steps, to generate the full optimized lexicon.  
Added:  
> >  Exercise
The above tokenization does not handle numeric character input.
(a) Create a transducer that maps numbers in the range 0 - 999999 represented as digit strings to their English read form, e.g.:
1 -> one
11 -> eleven
111 -> one hundred eleven
1111 -> one thousand one hundred eleven
11111 -> eleven thousand one hundred eleven
(b) Incorporate this transduction into the letter-to-token transduction above and apply to the input Mars is 4225 miles across. represented as letters.
Downcasing Text
The next example converts case-sensitive input to all lowercase output. To do the conversion, we create a flower transducer of the form:
$ fstcompile --isymbols=ascii.syms --osymbols=ascii.syms >downcase.fst <<EOF
0 0 a a
0 0 b b
0 0 A a
0 0 B b
0 0 ! !
0
EOF
which produces:
A downcasing flower transducer for the full character set is here. This transducer can be applied to the Mars men automaton from the previous example with:
$ fstcompose Marsman.fst downcase.fst | fstproject --project_output >marsman.fst
giving:
A transducer that downcases at the token level can be created with:
$ fstinvert lexicon_opt.fst downcase.fst  fstcompose  lexicon_opt.fst >downcase_token.fst
Exercise
The letter-level downcasing transducer downcases any ASCII input. For which inputs does the token-level downcasing transducer work?
Case Restoration in Text
Edit Distance
Spelling Correction  
 
Line: 115 to 178  
 
Line: 1 to 1  

 
Changed:  
< <  OpenFst Examples  
> >  OpenFst Examples  
Reading the quick tour first is recommended. That includes a simple example of FST application using either the C++ template level or the shell-level operations. The advanced usage topic contains an implementation using the template-free intermediate scripting level as well.  
Changed:  
< <  In the following, we code only at the C++ template level, which is the most general and flexible. Using the other levels for these examples, however, should be straightforward.  
> >  In the examples below, we use the shell-level operations for convenience. The following data files are used in these examples:  
Changed:  
< <  T9 Recognizer  
> > 
With these files and the descriptions below, the reader should be able to repeat the examples. (Note the OpenGrm Library, used to build the language model, is currently in development but will be released for general public use soon.)
Tokenization
The first example converts a sequence of ASCII characters into a sequence of word tokens with punctuation and whitespace stripped. To do so we will need a lexicon transducer that maps from letters to their corresponding word token. A simple way to generate this is using the OpenFst text format. For example, the word Mars would have the form:
$ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >Mars.fst <<EOF
0 1 M Mars
1 2 a <epsilon>
2 3 r <epsilon>
3 4 s <epsilon>
4
EOF
This can be drawn with:
$ fstdraw --isymbols=ascii.syms --osymbols=wotw.syms --portrait Mars.fst | dot -Tjpg >Mars.jpg
which produces:
Suppose that Martian.fst and man.fst have similarly been created, then:
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstclosure >lexicon.fst
produces a finite-state lexicon of that transduces zero or more spelled-out word sequences into to their word tokens.
The nondeterminism and nonminimality introduced by the construction can be removed with:
$ fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst
resulting in the compact:
In order to handle punctuation symbols, we change the lexicon construction to:
$ fstunion man.fst Mars.fst | fstunion - Martian.fst | fstconcat - punct.fst | fstclosure >lexicon.fst
where:
$ fstcompile --isymbols=ascii.syms --osymbols=wotw.syms >punct.fst <<EOF
0 1 <space> <epsilon>
0 1 . <epsilon>
0 1 , <epsilon>
0 1 ? <epsilon>
0 1 ! <epsilon>
1
EOF
is a transducer that deletes common punctuation symbols. Now, the tokenization of an example string Mars man encoded as an FST:
can be done with:
$ fstcompose Marsman.fst lexicon_opt.fst | fstproject --project_output | fstrmepsilon >tokens.fst
giving:
To generate a full lexicon of all 7102 distinct words in the War of Worlds, it is convenient to dispense with the union of individual word FSTs above and instead generate a single text FST from the word symbols in wotw.syms. Here is a python script that does that and was used, along with the above steps, to generate the full optimized lexicon.

Line: 1 to 1  

OpenFst Examples  
Changed:  
< <  Reading the quick tour first is recommended. That includes a simple example of FST application using both the C++ template level and the shell-level operations. The advanced usage topic gives an implementation using the template-free intermediate scripting level as well.  
> >  Reading the quick tour first is recommended. That includes a simple example of FST application using either the C++ template level or the shell-level operations. The advanced usage topic contains an implementation using the template-free intermediate scripting level as well.  
In the following, we code only at the C++ template level, which is the most general and flexible. Using the other levels for these examples, however, should be straightforward. 
Line: 1 to 1  

Added:  
> > 
OpenFst Examples
Reading the quick tour first is recommended. That includes a simple example of FST application using both the C++ template level and the shell-level operations. The advanced usage topic gives an implementation using the template-free intermediate scripting level as well. In the following, we code only at the C++ template level, which is the most general and flexible. Using the other levels for these examples, however, should be straightforward.
T9 Recognizer 