Difference: GrmNGramForum (71 vs. 72)

Revision 72 - 2014-05-16 - CyrilAllauzen


OpenGrm NGram Forum

  brian

CyrilAllauzen - 2014-05-16 - 16:09

Hi Aaron,

You were actually on the right track.

  1. When using farcompilestrings, the symbol table options are ignored. So just do: farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt.
  2. When using ngramcount, use --require_symbols=false.
  3. When generating your utf8 symbol table, make sure to use a tab as the separator between each utf-8 character and its integer label.
  4. Use fstsymbols --isymbols=your_utf8_symbols --osymbols=your_utf8_symbols -- --fst_field_separator=`echo -e "\t"` animals.cnt animals_with_symbols.cnt.

Then the rest of the ngram pipeline should work using animals_with_symbols.cnt.

Cyril

CyrilAllauzen - 2014-05-16 - 16:14

Hi Aaron,

For step 3, remember to include 0 as the integer label for epsilon (e.g. <epsilon>), and choose something other than an actual tab as the symbol for the utf8 integer value of tab, since a literal tab would clash with the field separator.
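
As a rough illustration of step 3, here is a minimal Python sketch that builds such a symbol table from the training text. It assumes each integer label is the character's Unicode code point. The function names and the <tab> placeholder are made up for this example and are not part of OpenGrm or OpenFst.

```python
def build_utf8_symbol_table(text, tab_symbol="<tab>"):
    """Return symbol-table lines of the form: symbol, a tab, integer label."""
    # Label 0 is reserved for epsilon.
    lines = ["<epsilon>\t0"]
    # Assumption: the label of a character is its Unicode code point.
    # Line breaks are skipped because they only delimit the input strings.
    for ch in sorted(set(text) - {"\n", "\r"}):
        # A literal tab would clash with the field separator, so the tab
        # character gets a printable placeholder (hypothetical name).
        symbol = tab_symbol if ch == "\t" else ch
        lines.append(f"{symbol}\t{ord(ch)}")
    return lines


def write_utf8_symbol_table(corpus_path, table_path):
    """Read a UTF-8 corpus and write its symbol table to table_path."""
    with open(corpus_path, encoding="utf-8") as src:
        table = build_utf8_symbol_table(src.read())
    with open(table_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(table) + "\n")
```

For example, write_utf8_symbol_table("animals.txt", "your_utf8_symbols") would produce the file passed to fstsymbols in step 4.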

Cyril

 