---+ !OpenGrm NGram Library Quick Tour

%TOC%

This tour is organized around the stages of n-gram model creation, modification and use:

   * corpus I/O (=ngramsymbols=, =farcompilestrings= and =farprintstrings=)
   * n-gram model format
   * n-gram counting (=ngramcount=)
   * n-gram model parameter estimation (=ngrammake=)
   * n-gram model merging, pruning and constraining (=ngrammerge=, =ngramshrink= and =ngrammarginalize=)
   * model I/O (=ngramread=, =ngramprint= and =ngraminfo=)
   * n-gram model sampling, application and evaluation (=ngramrandgen=, =ngramapply= and =ngramperplexity=)

For additional details, follow the links to each operation's full documentation, found in each section and in the summary table of [[NGramQuickTour#AvailableOperations][available operations]] below.

#TextIo
---+++ Corpus I/O

Text corpora are represented as binary [[http://www.openfst.org/twiki/bin/view/FST/FstExtensions#FstArchives][finite-state archives]], with one automaton per sentence. This allows efficient later processing by the NGram Library utilities and, if desired, more general probabilistic input (e.g. weighted DAGs or _lattices_).

The first step is to generate an !OpenFst-style [[http://www.openfst.org/twiki/bin/view/FST/FstQuickTour#CreatingFsts][symbol table]] for the text tokens in the input corpus. This can be done with the command-line utility [[NGramSymbols][ngramsymbols]]. For example, the symbols in the text of Oscar Wilde's _Importance of Being Earnest_, using the suitably normalized copy found [[%ATTACHURL%/earnest.txt][here]], can be extracted with:

<verbatim>
$ ngramsymbols <earnest.txt >earnest.syms
</verbatim>

If multiple corpora, e.g. a separate training set and a test set, are to be processed together, the same symbol table should be used throughout. This can be accomplished by concatenating the corpora when they are passed to =ngramsymbols=, which eliminates out-of-vocabulary symbols. By default, =ngramsymbols= creates symbol table entries for _<epsilon>_ and an out-of-vocabulary token _<unk>_; the identity of these labels can be changed using flags. A flag can then be passed to [[http://www.openfst.org/twiki/bin/view/FST/FstExtensions#FstArchives][farcompilestrings]] to specify the out-of-vocabulary label, so that words not in the symbol table are mapped to that index.

---

Given a symbol table, a text corpus can be converted to a binary FAR archive with:

<verbatim>
$ farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt >earnest.far
</verbatim>

and can be printed with:

<verbatim>
$ farprintstrings earnest.far >earnest.txt
</verbatim>

#ModelFormat
---+++ Model Format

All n-gram models produced by the utilities here, including those with unnormalized counts, are encoded as cyclic weighted finite-state transducers (FSTs) using the [[http://www.openfst.org][OpenFst library]]. For the precise details of the n-gram format, see [[NGramModelFormat][here]].

The model is normally stored in a general-purpose, mutable ([[http://www.openfst.org/twiki/bin/view/FST/FstAdvancedUsage#Fst_Types][VectorFst]]) format, which is convenient for the various processing steps described below. It can be converted to a more compact (but immutable) format specifically for n-gram models ([[http://www.openfst.org/twiki/bin/view/FST/FstAdvancedUsage#Fst_Types][NGramFst]]) once the desired final model has been generated.
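That conversion can also be done directly from C++. The following is only a minimal sketch, assuming !OpenFst was configured with its n-gram FST extension (=--enable-ngram-fsts=) so that =fst::NGramFst= and its header are available; it reads a model produced by the tools described below (e.g. =earnest.mod=) and writes it back out in the compact format. The file names are illustrative.

<verbatim>
// convert_to_ngramfst.cc -- sketch: convert a VectorFst-format model to NGramFst.
#include <fst/fstlib.h>
#include <fst/extensions/ngram/ngram-fst.h>

int main() {
  // A model produced later in this tour, e.g. "ngrammake earnest.cnts >earnest.mod".
  fst::StdVectorFst *model = fst::StdVectorFst::Read("earnest.mod");
  if (!model) return 1;

  // NGramFst builds a compact, immutable representation of the same model;
  // the input must already have the n-gram topology described above.
  fst::NGramFst<fst::StdArc> compact(*model);
  compact.Write("earnest.ngram.fst");

  delete model;
  return 0;
}
</verbatim>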
#NgramCounting
---+++ N-gram Counting

[[NGramCount][ngramcount]] is a command-line utility for counting n-grams from an input corpus, represented in FAR format. It produces an n-gram model in the FST format described above, in which transitions and final costs are weighted with the negative log count of the associated n-gram. The _--order_ switch selects the maximum length n-gram to count; all n-grams observed in the input corpus of length less than or equal to the specified order are counted. By default, the order is 3 (a trigram model).

The 1-gram through 5-gram counts for the =earnest.far= finite-state archive created above can be produced with:

<verbatim>
$ ngramcount -order=5 earnest.far >earnest.cnts
</verbatim>

#ModelEstimation
---+++ N-gram Model Parameter Estimation

[[NGramMake][ngrammake]] is a command-line utility for normalizing and smoothing an n-gram model. It takes as input the FST produced by =ngramcount=, which contains raw, unnormalized counts. The 5-gram counts in =earnest.cnts= created above can be converted into an n-gram model with:

<verbatim>
$ ngrammake earnest.cnts >earnest.mod
</verbatim>

Flags to [[NGramMake][ngrammake]] specify the smoothing method (e.g. Katz, Kneser-Ney, etc.), with the default being Witten-Bell.
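To give a sense of the default method: Witten-Bell smoothing reserves probability mass for unseen events in proportion to the number of distinct words observed after each history. In textbook form (the notation below is ours, and the library's =--witten_bell_k= flag generalizes the novel-event term, so treat this as an approximation of what =ngrammake= computes rather than its exact formula), for a history h with total count c(h) and T(h) distinct following words:

<verbatim>
\tilde{p}(w \mid h) = \frac{c(hw)}{c(h) + T(h)}   for observed n-grams hw

\beta(h) = \frac{T(h)}{c(h) + T(h)}
</verbatim>

where the reserved mass \beta(h) is redistributed over unseen words according to the lower-order model.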
Here is a generated sentence from the language model (using =ngramrandgen=, which is described below):

<verbatim>
$ ngramrandgen earnest.mod | farprintstrings
I <epsilon> WOULD STRONGLY <epsilon> ADVISE YOU MR WORTHING TO TRY <epsilon> AND <epsilon> ACQUIRE <epsilon> SOME RELATIONS AS <epsilon> <epsilon> <epsilon> FAR AS THE PIANO IS CONCERNED <epsilon> SENTIMENT <epsilon> IS MY FORTE <epsilon>
</verbatim>

(An epsilon transition is emitted for each backoff.)

#MergeAndPrune
---+++ N-gram Model Merging, Pruning and Constraining

[[NGramMerge][ngrammerge]] is a command-line utility for merging two n-gram models -- either unnormalized counts or smoothed, normalized models -- into a single model. For example, suppose we split our corpus into two parts, =earnest.aa= and =earnest.ab=, and derive 5-gram counts from each independently using =ngramcount= as shown above. We can then merge the counts to get the same counts as derived above from the full corpus (=earnest.cnts=):

<verbatim>
$ ngrammerge earnest.aa.cnts earnest.ab.cnts >earnest.merged.cnts
$ fstequal earnest.cnts earnest.merged.cnts
</verbatim>

Note that, unlike this example of merging unnormalized counts, merging two smoothed models that have each been built from half the corpus will result in a different model than one built from the corpus as a whole, due to the smoothing and mixing. Each of the two model or count FSTs can be weighted, using the _--alpha_ switch for the first input FST and the _--beta_ switch for the second input FST.

---

[[NGramShrink][ngramshrink]] is a command-line utility for pruning n-gram models. The following command shrinks the 5-gram model created above to roughly 1/10 its original size using relative entropy pruning:

<verbatim>
$ ngramshrink -method=relative_entropy -theta=.00015 earnest.mod >earnest.pru
</verbatim>

A random sentence generated from this pruned LM is:

<verbatim>
$ ngramrandgen earnest.pru | farprintstrings
I THINK <epsilon> BE ABLE TO <epsilon> DIARY GWENDOLEN WONDERFUL SECRETS MONEY <epsilon> YOU <epsilon>
</verbatim>

---

[[NGramMarginal][ngrammarginalize]] is a command-line utility for re-estimating smoothed n-gram models using marginalization constraints similar to those of Kneser-Ney smoothing. The following imposes marginalization constraints on the 5-gram model created above:

<verbatim>
$ ngrammarginalize earnest.mod >earnest.marg.mod
</verbatim>

This functionality is available in version 1.1.0 and higher. Note that this algorithm may need to be run for several iterations, using the _--iterations_ switch. See the full operation documentation for further considerations and references.

#ModelIo
---+++ N-gram Model Reading, Printing and Info

[[NGramPrint][ngramprint]] is a command-line utility for reading in n-gram models and producing text files. Both raw counts and normalized models are encoded with the [[NGramQuickTour#ModelFormat][same automaton structure]], so either can be used with this function. There are multiple options for output. For example, using the example 5-gram model created above, the following prints out a portion of it in ARPA format:

<verbatim>
$ ngramprint --ARPA earnest.mod >earnest.ARPA
$ head -15 earnest.ARPA

\data\
ngram 1=2306
ngram 2=10319
ngram 3=14796
ngram 4=15218
ngram 5=14170

\1-grams:
-99        <s>        -0.9399067
-1.064551  </s>
-3.337681  MORNING    -0.3590219
-2.990894  ROOM       -0.4771213
-1.857355  IN         -0.6232494
-2.87695   ALGERNON   -0.4771213
</verbatim>

[[NGramRead][ngramread]] is a command-line utility for reading in textual representations of n-gram models and producing FSTs appropriate for use by other functions and utilities. It has several options for input. For example,

<verbatim>
$ ngramread --ARPA earnest.ARPA >earnest.mod
</verbatim>

generates an n-gram model in FST format from the ARPA n-gram language model specification.

[[NGramInfo][ngraminfo]] is a command-line utility that prints out various information about an n-gram language model in FST format:

<verbatim>
$ ngraminfo earnest.mod
# of states            39076
# of ngram arcs        51618
# of backoff arcs      39075
initial state          1
unigram state          0
# of final states      5190
ngram order            5
# of 1-grams           2305
# of 2-grams           10319
# of 3-grams           14796
# of 4-grams           15218
# of 5-grams           14170
well-formed            y
normalized             y
</verbatim>

#SamplingApplicationEvaluation
---+++ N-gram Model Sampling, Application and Evaluation

[[NGramRandGen][ngramrandgen]] is a command-line utility for sampling from n-gram models:

<verbatim>
$ ngramrandgen --max_sents=1 earnest.mod | farprintstrings
IT IS SIMPLY A VERY INEXCUSABLE MANNER
</verbatim>

---

[[NGramApply][ngramapply]] is a command-line utility for applying n-gram models. It can be called to apply a model to a concatenated archive of automata:

<verbatim>
$ ngramapply earnest.mod earnest.far | farprintstrings -print_weight
</verbatim>

The result is a FAR weighted by the n-gram model.

---

[[NGramPerplexity][ngramperplexity]] can be used to evaluate an n-gram model. For example, the following calculates the perplexity of two strings (_a hand bag_ and _bag hand a_) under the example 5-gram model generated above:

<verbatim>
echo -e "A HAND BAG\nBAG HAND A" |\
farcompilestrings -generate_keys=1 -symbols=earnest.syms --keep_symbols=1 |\
ngramperplexity --v=1 earnest.mod -
A HAND BAG
                                  ngram   -logprob
        N-gram probability        found   (base10)
        p( A | <s> )           = [2gram]  1.87984
        p( HAND | A ...)       = [2gram]  2.56724
        p( BAG | HAND ...)     = [3gram]  0.0457417
        p( </s> | BAG ...)     = [4gram]  0.507622
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -5.00044;  perplexity (base 10)= 17.7873

BAG HAND A
                                  ngram   -logprob
        N-gram probability        found   (base10)
        p( BAG | <s> )         = [1gram]  4.02771
        p( HAND | BAG ...)     = [1gram]  3.35968
        p( A | HAND ...)       = [1gram]  2.51843
        p( </s> | A ...)       = [1gram]  1.53325
1 sentences, 3 words, 0 OOVs
logprob(base 10)= -11.4391;  perplexity (base 10)= 724.048

2 sentences, 6 words, 0 OOVs
logprob(base 10)= -16.4395;  perplexity (base 10)= 113.485
</verbatim>
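The reported perplexities follow directly from the log probabilities above: each predicted word and each end-of-sentence event counts once, and perplexity is 10 raised to the negative average log probability per event. For the first string,

<verbatim>
PP = 10^{-\log_{10} P / (W + S)} = 10^{5.00044 / (3 + 1)} \approx 17.79
</verbatim>

which matches the reported 17.7873 up to rounding; the combined figure of 113.485 is obtained the same way over all 6 words and 2 sentence ends.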
#LibraryUse
---+++ Using the C++ Library

The <nop>OpenGrm NGram library is a C++ library. Users can call the [[NGramQuickTour#AvailableOperations][available operations]] from that level rather than from the command line if desired.

From C++, include =<ngram/ngram.h>= in the installation include directory and link to =libfst.so=, =libfar.so= and =libngram.so= in the installation library directory. This assumes you have installed [[http://www.openfst.org][OpenFst]] (with =--enable-far=yes=). (You may instead include just those header files for the classes and functions that you need.) All classes and functions are in the =ngram= namespace.

As mentioned earlier, each n-gram model, including those with unnormalized counts, is represented as a weighted FST. Each of the n-gram operation classes holds the FST in the common base class =NGramModel=. A partial description of this class follows:

<pre>
class NGramModel {
 public:
  typedef int StateId;

  %RED% // Construct an n-gram model container holding the input FST, whose
  // ownership is retained by the caller. %ENDCOLOR%
  NGramModel(StdMutableFst *fst);

  %RED% // Returns highest n-gram order. %ENDCOLOR%
  int HiOrder() const;

  %RED% // Returns order of a given state. %ENDCOLOR%
  int StateOrder(StateId state) const;

  %RED% // Returns the unigram state. %ENDCOLOR%
  StateId UnigramState() const;

  %RED% // Validates that the model has a well-formed n-gram topology. %ENDCOLOR%
  bool CheckTopology() const;

  %RED% // Validates that states are fully normalized (probabilities sum to 1.0). %ENDCOLOR%
  bool CheckNormalization() const;

  %RED% // Gets a const reference to the internal (expanded) FST. %ENDCOLOR%
  StdExpandedFst &GetFst() const;

  %RED% // Gets a pointer to the internal (mutable) FST. %ENDCOLOR%
  StdMutableFst *GetMutableFst() const;

 private:
  StdMutableFst *fst_;
};
</pre>

From this class are derived [[NGramCount]] for counting, [[NGramMake]] for parameter estimation/smoothing, [[NGramShrink]] for model pruning and [[NGramMerge]] for model interpolation/merging (among others). =NGramMake= and =NGramShrink= are further sub-classed for each specific smoothing and pruning method. For example, =NGramMake= has methods (some abstract) common to most or all parameter estimation/smoothing techniques, while =NGramKatz= has the specific implementations for that method.
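As a small, concrete illustration of this interface, the following sketch loads the model built earlier with =ngrammake= and queries it using only the =NGramModel= methods listed above. The file name and build line are illustrative; paths and flags may differ on your installation.

<verbatim>
// model_info.cc -- sketch: load an n-gram model FST and query it via NGramModel.
// Example build: g++ model_info.cc -o model_info -lngram -lfar -lfst
#include <iostream>
#include <fst/fstlib.h>
#include <ngram/ngram.h>

int main() {
  // Model produced by "ngrammake earnest.cnts >earnest.mod".
  fst::StdVectorFst *fst = fst::StdVectorFst::Read("earnest.mod");
  if (!fst) return 1;

  // The container does not take ownership of the FST.
  ngram::NGramModel model(fst);

  std::cout << "ngram order   " << model.HiOrder() << std::endl;
  std::cout << "unigram state " << model.UnigramState() << std::endl;
  std::cout << "well-formed   " << (model.CheckTopology() ? "y" : "n") << std::endl;
  std::cout << "normalized    " << (model.CheckNormalization() ? "y" : "n") << std::endl;

  delete fst;
  return 0;
}
</verbatim>

Its output corresponds to a few of the fields reported by =ngraminfo= above.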
#AvailableOperations
---+++ Available Operations

Click on an operation name for additional information.

| *Operation* | *Usage* | *Description* |
| [[NGramApply][NGramApply]] | ngramapply [--bo_arc_type] ngram.fst [in.far [out.far]] | Intersect n-gram model with fst archive |
| [[NGramCount][NGramCount]] | ngramcount [--order] [in.far [out.fst]] | count n-grams from fst archive |
| | !NGramCounter(order); | --- n-gram counter |
| [[NGramInfo][NGramInfo]] | ngraminfo [in.mod] | print various information about an n-gram model |
| [[NGramMake][NGramMake]] | ngrammake [--method] [--backoff] [--bins] [--witten_bell_k] [--discount_D] [in.fst [out.fst]] | n-gram model smoothing and normalization |
| | !NGramAbsolute(&CountFst); | --- Absolute Discount smoothing |
| | !NGramKatz(&CountFst); | --- Katz smoothing |
| | !NGramKneserNey(&CountFst); | --- Kneser-Ney smoothing |
| | !NGramUnsmoothed(&CountFst); | --- no smoothing |
| | !NGramWittenBell(&CountFst); | --- Witten-Bell smoothing |
| [[NGramMarginal][NGramMarginal]] | ngrammarginalize [--iterations] [--max_bo_updates] [--output_each_iteration] [--steady_state_file] [in.mod [out.mod]] | impose marginalization constraints on input model |
| | !NGramMarginal(&M); | --- n-gram marginalization constraint class |
| [[NGramMerge][NGramMerge]] | ngrammerge [--alpha] [--beta] [--use_smoothing] [--normalize] in1.fst in2.fst [out.fst] | merge two count or model FSTs |
| | !NGramMerge(&M1, &M2, alpha, beta); | --- n-gram merge class |
| [[NGramPerplexity][NGramPerplexity]] | ngramperplexity [--OOV_symbol] [--OOV_class_size] [--OOV_probability] ngram.fst [in.far [out.txt]] | calculate perplexity of input corpus from model |
| [[NGramPrint][NGramPrint]] | ngramprint [--ARPA] [--backoff] [--integers] [--negativelogs] [in.fst [out.txt]] | print n-gram model to text file |
| [[NGramRandGen][NGramRandGen]] | ngramrandgen [--max_sents] [--max_length] [--seed] [in.mod [out.far]] | randomly sample sentences from an n-gram model |
| [[NGramRead][NGramRead]] | ngramread [--ARPA] [--epsilon_symbol] [--OOV_symbol] [in.txt [out.fst]] | read n-gram counts or model from file |
| [[NGramShrink][NGramShrink]] | ngramshrink [--method=count,relative_entropy,seymore] [--count_pattern] [--theta] [in.mod [out.mod]] | n-gram model pruning |
| | !NGramCountPrune(&M, count_pattern); | --- count-based model pruning |
| | !NGramRelativeEntropy(&M, theta); | --- relative-entropy-based model pruning |
| | !NGramSeymoreShrink(&M, theta); | --- Seymore/Rosenfeld-based model pruning |
| [[NGramSymbols][NGramSymbols]] | ngramsymbols [--epsilon_symbol] [--OOV_symbol] [in.txt [out.txt]] | create symbol table from corpus |

#ConvenienceScript
---+++ Convenience Script %ICON{"wip"}%

The shell script =ngram.sh= is provided to run some common !OpenGrm NGram pipelines of commands and to provide some rudimentary [[NGramAdvancedUsage#DistributedComputation][distributed computation support]]. For example:

<pre>
$ ngram.sh --itype=text_sents --otype=pruned_lm --ifile=in.txt --ofile=lm.fst --symbols=in.syms --order=5 --smooth_method=katz --shrink_method=relative_entropy --theta=.00015
</pre>

will read a text corpus in the format accepted by =farcompilestrings= and output a backoff 5-gram LM pruned with a relative entropy threshold of .00015. See =ngram.sh --help= for the available options and values, and see [[NGramAdvancedUsage#DistributedComputation][here]] for a discussion of the distributed computation support.