OpenGrm NGram Forum

CyrilAllauzen - 2014-05-16 - 16:09

Hi Aaron,

You were actually on the right track.

  1. When using farcompilestrings, the symbol table options are ignored. So just do: farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt.
  2. When using ngramcount, use --require_symbols=false.
  3. When generating your utf8 symbol table, make sure to use a tab as the separator between each UTF-8 character and its integer label.
  4. Use fstsymbols --isymbols=your_utf8_symbols --osymbols=your_utf8_symbols -- --fst_field_separator=`echo -e "\t"` animals.cnt animals_with_symbols.cnt.

Then the rest of the ngram pipeline should work using animals_with_symbols.cnt.


CyrilAllauzen - 2014-05-16 - 16:14

Hi Aaron,

For step 3, remember to include 0 as the integer label for epsilon (e.g. <epsilon>), and choose something other than an actual tab as the symbol for the utf8 integer value of tab.
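
For illustration, here is a minimal sketch of how the symbol table from step 3 could be generated. It assumes that with -token_type=utf8 the integer label of a character is its Unicode code point. The function name build_utf8_symbol_table and the placeholder symbol <tab> are hypothetical choices, not part of OpenFst or OpenGrm.

```python
def build_utf8_symbol_table(lines, tab_symbol="<tab>"):
    """Return symbol-table rows of the form 'symbol<TAB>label'.

    Assumption: the label for a character is its Unicode code point.
    tab_symbol is a hypothetical placeholder for the tab character,
    since a literal tab would clash with the field separator.
    """
    chars = set()
    for line in lines:
        # Each line is one string; the trailing newline is not a symbol.
        chars.update(line.rstrip("\n"))

    # Label 0 is reserved for epsilon.
    rows = ["<epsilon>\t0"]
    for ch in sorted(chars):
        symbol = tab_symbol if ch == "\t" else ch
        rows.append(f"{symbol}\t{ord(ch)}")
    return rows


# Small in-memory example; in practice the lines would come from animals.txt.
example_rows = build_utf8_symbol_table(["cat\n", "dog\n"])
```

Writing the rows to a file, one per line and encoded as UTF-8, should give a table usable as your_utf8_symbols in step 4. Because the rows use only a tab as separator, a space character in the corpus gets its own entry, which is why step 4 sets the field separator to a tab.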


