Difference: GrmNGramForum (71 vs. 72)

Revision 72 - 2014-05-16 - CyrilAllauzen

OpenGrm NGram Forum

CyrilAllauzen - 2014-05-16 - 16:09

Hi Aaron,

You were actually on the right track.

  1. When using farcompilestrings, the symbol table options are ignored. So just do: farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt.
  2. When using ngramcount, use --require_symbols=false.
  3. When generating your utf8 symbol table, make sure to use a tab as the separator between each utf-8 character and its integer label.
  4. Use fstsymbols --isymbols=your_utf8_symbols --osymbols=your_utf8_symbols -- --fst_field_separator=`echo -e "\t"` animals.cnt animals_with_symbols.cnt.

Then the rest of the ngram pipeline should work using animals_with_symbols.cnt.


CyrilAllauzen - 2014-05-16 - 16:14

Hi Aaron,

For step 3, remember to include an entry with integer label 0 for epsilon (e.g. <epsilon>), and choose something other than an actual tab character as the symbol for the utf8 integer value of tab, since tab is already the field separator.
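
In case it helps, here is a minimal Python sketch of how the symbol table from step 3 could be generated. It is only an illustration under some assumptions: the function name build_utf8_symbol_table and the placeholder <tab> are made up for this example, and it assumes the integer label of each character is its Unicode code point and that newlines only separate the strings in the corpus.

```python
def build_utf8_symbol_table(text, tab_name="<tab>"):
    """Return symbol-table lines of the form: symbol, a tab, integer label.

    Hypothetical helper for illustration; not part of OpenFst or OpenGrm.
    """
    # Label 0 is reserved for epsilon.
    lines = ["<epsilon>\t0"]
    for ch in sorted(set(text)):
        if ch == "\n":
            # Assumption: newlines delimit strings and are not symbols.
            continue
        # A literal tab would collide with the field separator,
        # so it gets a printable placeholder name instead.
        symbol = tab_name if ch == "\t" else ch
        # Assumption: the label is the character's Unicode code point.
        lines.append(f"{symbol}\t{ord(ch)}")
    return lines


# Small demonstration on an in-memory string.
print("\n".join(build_utf8_symbol_table("cat dog")))
```

Writing the returned lines, one per line, to your_utf8_symbols gives a file that can be passed to fstsymbols in step 4. Note that the space character appears as a symbol of its own, which is why the tab-only field separator is needed.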

