The shortest path algorithm expands the whole transducer and memory usage goes up to 2 GB! (I did see a previous post about setting the option first_path to true, but i need the first 100 or 1000 best paths to be outputted).
It would be of great help if someone had ideas on how to proceed.
Thanks a lot in advance.
x32/fstdeterminize.exe --fst_default_cache_gc_limit=10000000000 in.fst out.fst ==> 292.40 secs.
x64/fstdeterminize.exe --fst_default_cache_gc_limit=10000000000 in.fst out.fst ==> 39034.45 secs.
Best, Cyril
Thank you.
If anyone needs details please, write me to alexei_v_ivanov@ieee.org
Or alternatively, am I totally mistaken? Can someone shed some light on my confusion?
Cheers,
Roland
We haven't implemented Floyd-Warshall. So there is no backup in case of a bad cycle, the algorithm would simply not converge.
Best, Cyril
potentials.txt
input is the ouptut of fstshortestdistance
.
fstshortestdistance a.fst > potentials.txt fstreweight --to_initial a.fst potentials.txt > b.fstThis is equivalent to:
fstpush --push_weights --to_initial a.fst > b.fstThis is actually the way pushing is implemented in the library, shortest-distance followed by reweighting.
I am trying to read a fst into a simple C++ application, however, for some reason I am being told that certain symbols are undefined:
Ld build/Debug/fstprintallpaths normal x86_64 cd /.../fstprintallpaths setenv MACOSX_DEPLOYMENT_TARGET 10.6 /Developer/usr/bin/g++-4.2 -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.6.sdk -L/Users/michael/code/fstprintallpaths/build/Debug -L/usr/local/lib -L/usr/local/lib/Cabal-1.6.0.2 -L/usr/local/lib/ImageMagick-6.4.1 -L/usr/local/lib/pkgconfig -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4 -L/usr/local/lib/ImageMagick-6.4.1/config -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16 -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Language -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16/coders -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16/filters -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Compat -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/PackageDescription -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Language/Haskell -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple/Build -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple/GHC -L. -F/Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/Debug -filelist /Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/fstprintallpaths.build/Debug/fstprintallpaths.build/Objects-normal/x86_64/fstprintallpaths.LinkFileList -mmacosx-version-min=10.6 -o /Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/Debug/fstprintallpaths Undefined symbols: "fst::Int64ToStr(long long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from: fst::FloatWeightTpl<float>::GetPrecisionString() in fstprintallpaths.o "fst::FstHeader::Read(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)", referenced from: fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Read(std::basic_istream<char, std::char_traits<char> >&, fst::FstReadOptions const&)in fstprintallpaths.o ld: symbol(s) not found collect2: ld returned 1 exit status ----------------------------------------------------------------- fstprintallpaths.cpp: #include "fstprintallpaths.h" #include <iostream> #include <vector> #include <fst/fstlib.h> #include <fst/fst-decl.h> #include <fst/vector-fst.h> #include <fst/weight.h> using namespace fst; int main (int argc, char * const argv[]) { // insert code here... std::cout << "Hello, World!\n"; StdFst *input = StdFst::Read("blah.fst"); //StdFst *model = StdFst::Read("blah.fst"); //printAllStrings(&fst); return 0; } string itos(int i) { // convert int to string; from Bjarne Stroustrup's FAQ stringstream s; s << i; return s.str(); } string vectorToString(vector<int> v) { if(v.size() == 0) return "<>"; string result = "<" + itos(v[0]); for(int i = 1; i < v.size(); i++) { result + "," + itos(v[i]); } return result + ">"; } void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, TropicalWeight cost) { if(fst.Final(state) == TropicalWeight::Zero()) cout << vectorToString(str) << " with cost " << (Times(cost,fst.Final(state))) << endl; for(ArcIterator <VectorFst <StdArc> > aiter(fst,state); ! aiter.Done(); aiter.Next()) { StdArc arc = aiter.Value(); str.push_back(arc.ilabel); printAllStringsHelper(fst, arc.nextstate, str, Times(cost, arc.weight.Value())); str.pop_back(); } } // a bad idea if there are loops :) void printAllStrings(VectorFst<StdArc> &fst) { vector<int> str(0); TropicalWeight tw(TropicalWeight::One()); printAllStringsHelper(fst, 0, str, tw); } ----------------------------------------------------------------- fstprintallpaths.h /* * fstprintallpaths.h * fstprintallpaths * */ #include <iostream> #include <vector> #include <fst/fstlib.h> #include <fst/fst-decl.h> #include <fst/vector-fst.h> #include <fst/weight.h> using namespace fst; int main(int argc, char * const argv[]); string itos(int i); string vectorToString(vector<int> v); void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, TropicalWeight cost); void printAllStrings(VectorFst<StdArc> &fst);
$ g++ fstprintallpaths.cpp fstprintallpaths.h -I /usr/local/include Undefined symbols: "fst::FstHeader::Read(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)", referenced from: fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Read(std::basic_istream<char, std::char_traits<char> >&, fst::FstReadOptions const&)in ccspqZtg.o "fst::Int64ToStr(long long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from: fst::FloatWeightTpl<float>::GetPrecisionString() in ccspqZtg.o ld: symbol(s) not found collect2: ld returned 1 exit status
$ fstinfo a_binary.fst fst type vector arc type standard input symbol table ia.txt output symbol table oa.txt # of states 24 # of arcs 9095 initial state 0 # of final states 4 # of input/output epsilons 0 # of input epsilons 0 # of output epsilons 9095 # of accessible states 24 # of coaccessible states 24 # of connected states 24 # of strongly conn components 24 expanded y mutable y acceptor n input deterministic n output deterministic n input/output epsilons n input epsilons n output epsilons y input label sorted n output label sorted y weighted y cyclic n cyclic at initial state n top sorted y accessible y coaccessible y string n $ fstinfo a_rmeps.fst fst type vector arc type standard input symbol table ia.txt output symbol table oa.txt # of states 24 # of arcs 645 initial state 0 # of final states 4 # of input/output epsilons 0 # of input epsilons 0 # of output epsilons 645 # of accessible states 24 # of coaccessible states 24 # of connected states 24 # of strongly conn components 24 expanded y mutable y acceptor n input deterministic n output deterministic n input/output epsilons n input epsilons n output epsilons y input label sorted n output label sorted y weighted y cyclic n cyclic at initial state n top sorted y accessible y coaccessible y string n $ fstinfo a_determinized.fst fst type vector arc type standard input symbol table ia.txt output symbol table oa.txt # of states 1026 # of arcs 22054 initial state 0 # of final states 906 # of input/output epsilons 0 # of input epsilons 0 # of output epsilons 22054 # of accessible states 1026 # of coaccessible states 1026 # of connected states 1026 # of strongly conn components 1026 expanded y mutable y acceptor n input deterministic y output deterministic n input/output epsilons n input epsilons n output epsilons y input label sorted y output label sorted y weighted y cyclic n cyclic at initial state n top sorted y accessible y coaccessible y string n
Is this normal?
My code is of the form:
#include "fst/fstlib.h"
using namespace std; using namespace fst;
int main() { StdVectorFst? fst1; StdVectorFst? fst2; StdVectorFst? fst3;
ArcSort? (fst1, StdOLabelCompare? ()); fst3 = ComposeFst? (fst1, fst2);
return 0; }
I posted this question as part of an old topic, but I am reposting it as a new topic because I am not sure whether old topics are still reviewed.
Thank you very much, Jonathan
#include "fst/fstlib.h" using namespace std; using namespace fst; int main() { StdVectorFst fst1; StdVectorFst fst2; StdVectorFst fst3; ArcSort(fst1, StdOLabelCompare()); fst3 = ComposeFst(fst1, fst2); return 0; }
ArcSort(&fst1, StdOLabelCompare()); fst3 = StdComposeFst(fst1, fst2);
I know that an epsilon transition encodes an empty string. However, I'd like to compose two FSA. In case a label exists in one FSA, but not in another, do we have a fail transition, instead of the epsilon transition, can encode this?
Thanks very much!!
You want to use PhiMatcher
(or maybe RhoMatcher
, I'm not completely sure from your message). See the matcher documentation here.
Best, Cyril
Best, Camille
Thanks
Example FST for word : cat
0 1 c c 1 2 a a 2 3 t n,sg 3
verbatim
tags.
As for your original question, you want to use Composition.
I have another question:
if we know the total cost of path1 and we know the total cost of path 2 and path 1 is part of path 2 (say path 1 has 1 arc, and path 2 has 2 arcs, and path2's 1st arc label is the same as the path1's arc's label.)
then, is there a way to automatically combine these two paths in a single fst, by keeping path1's cost as the 1st arc's cost, and subtracting the path1's total cost from path2's total cost, and assign it as the cost of the second arc?
thanks!
if we know the total cost of path1
and we know the total cost of path 2
and path 1 is part of path 2 (say path 1 has 1 arc, and path 2 has 2 arcs, and path2's 1st arc label is the same as the path1's arc's label.)
This is still not very clear to me. Could you clarify further?
Best, Cyril
Now I see how to solve my question. Actually I did not know hot to indicate a path's weight when writing this path into a fst format. After I can do that, I can use the commands to do the calculation, and now, I know how to indicate a path's weight, so, the question is solved!
Thanks!
Camille
fstmap --map_type=rmweight
I'v been experimenting around with the openfst library lately and have the following problem: I'm trying to align sequences using a transducer (i.e. find the edit distance with specifics weights to indels and mismatches). The transducer has 4 states and some epsilon transitions to model indels (no double epsilon transitions tho). The sequences to be aligned are nucleotide sequences (alphabet of size 4) and about 1000 nucleotides long. Composition of two of these sequences with the alignment transducer (as in I o T o O, where I is the input and O the output sequence/acceptor) results in a 400mb fst file and takes approximately 3-4 minutes. Computation of the shortest path again another 2 minutes or so.
Now here's my question. It is somehow obvious, that the number of states explodes in composition (1000 * 4 * 1000 are at least 4 mill states), so maybe direct composition is not the way to go. But what can I honestly do to improve performance? I switched from the shell versions to the C library and tried lazy composition. Now composition is done in an instant, but the shortest path algorithm still seems to expand the whole transducer, as memory usage goes up to 500mb and it still takes about 4 minutes.
Is it simply such that the transducer approach to alignment is slow? Computing 200 of these pairwise alignments using traditional dynamic programming takes about 20 seconds. Or did I make a mistake somewhere? Does the shortest path algorithm neccessarily have to expand the full transducer, i.e. does lazy composition help at all in this case? I didn't even expect it to be competitive in terms of speed to dedicated tools, but I hoped it to be at least computationally feasible. I would really appreciate any kind of help.
Cheers,
Roland
The issue is indeed that computing the direct composition is not the way to go since it is rather expensive. Your idea of using lazy composition was good but as you noticed, ShortestPath
expands the full machine by default. However, there is a way to avoid this behaviour: by using the shortest-first queue discipline and setting the option first_path
to true
.
ComposeFstOptions opts(); opts.gc_limit=0; // This disable the caching of the result of composition // and should lower the memory usage at no computational cost since // shortest-path should visit each state in the composed machine at most once. ComposeFst<Arc> cfst(fst1, fst2, opt); NaturalShortestFirstQueue<StateId, Weight> state_queue(distance); ShortestPathOptions<Arc, NaturalShortestFirstQueue<StateId, Weight>, AnyArcFilter<Arc> > opts(&state_queue, AnyArcFilter<Arc>()); opts.first_path = true; ShortestPath(cfst, ofst, &distance, opts);
This example assume composing two machines. In you case fst1
could be I o T (itself delayed or not, since it is a smaller machine both might be worth tryingboth) and fst2
could be O.
This should improve things a lot although it should still be slower than dedicated tools. I am very interested in this, so let me know how things turn out.
Best, Cyril
All the best,
Roland
Assume we have an FST A as depicted below:
By label pushing FST A (to final state), e.g. by the command
fstpush --push_labels --to_final A.fst PA.fst,
one might expect the following FST:
But instead the result FST PA.fst is generated, depicted below:
Question 1. In many cases, as is the example, there is an obvious way of performing pushing of the output labels to a final state, without adding any new states to the FST. Can you explain this growth behavior – under what circumstances it is triggered and, possibly, how it can be avoided?
When an FST that has been label pushed, is label pushed again, one might expect it to stabilize. However, Applying Push() to the already Pushed FST A generates the following output.
Question 2. Why is not the OpenFst implementation of label pushing idempotent, (e.g. like weight pushing is) so that the result of Push(A) is identical to result of Push(Push(A)) ?
Thanks!
-John
The reason for this is that label-pushing is implemented by considering the input transducer as a weighted automaton over the string semiring and applying weight-pushing (over the string semiring). The result is then the following weighted automaton over the string semiring:
The issue is how to convert that weighted automaton over the string into a transducer. Currently, we use the second step of the epsilon-normalization algorithm to do this. This is also what the cause the operation to be non idempotent.
One way to partially avoid the problem, would be to first reverse the transducer, apply label-pushing towards the initial state and reverse again:
fstreverse A.fst | fstpush --push_labels | fstreverse > PA.fst
This would result in:
The extra epsilon transitions are due to the use of reverse.
Best, Cyril
Thank you for taking your time to answer this question so thoroughly. I assume there exists, at least theoretically, a generic pushing algorithm to accomplish idempotent and “non-growing” label pushing. Do you agree on this? And if so, would you endorse such an algorithm to be included in any future version of OpenFST?
Best, John
But the problem is when I determinize it, it gives an error that the FST is non-functional i.e. for a particulat input there are more than one possible outputs, which will be there in this case.
So can someone please help me with this ??
Thank you, Vipul
Best
Dogan Can
Could you please tell me an estimate of the maximum size (in terms of #states and/or #transitions) of the FSTs, the openfst toolkit can work with (let's say on a machine with 8G of memory)? Alternatively, what was the size of your largest FST you worked with?
Thank you. Oana
OpenFst is a very efficient library which scales well to large problems. The answer to your question depends on the properties of the fsts and what you would like to do with them. For instance, you may run the shortest path and pruning algorithms over very large fsts (several millions of states and transitions) while you may not be able to determinize them or compose them with other large machines. What you can do also depends on whether your input fsts are acyclic, epsilon free etc. Using lazy implementations may help if your particular task does nor require complete fsts to be loaded onto the memory upfront. In short, (in my opinion) your best bet is to try and see.
Best,
Dogan
There are two main concrete class in the library ConstFst
and VectorFst
.
For ConstFst
, the memory requirement is 20 bytes per state. and 16 bytes per arc. For VectorFst
, the per-state memory requirement is about twice as much due to the overhead introduced by STL data structures.
Doing the math, you will see that you can easily represent machines with hundred of millions of states and transitions in a few GB of RAM.
If memory usage is critical, the CompactFst
class can also be used to provide more efficient memory representations.
This addresses the memory requirement for representing an FST in memory. But, as Dogan pointed out, your memory usage will also depends on which algorithms you want to use.
Best, Cyril
Also, are there any examples of employing specific queue disciplines with RmEpsilon? I've not been able to get this to work in practice.
Thank you for this great library; I am using it for my B.A. thesis project in C.L. I am using it through the shell from within my Java code.
I have a bit of difficulty with a few things. This is one of them:
- Is there a compact way to specify "any other symbol". As in: in state 0 transduce "a" to "b" and transduce any other symbol to "c"?
The problem specifically is that I am creating acceptors where the symbols are words rather than letters, in some cases I might have 1000 words in my isymbols list. I want to build a rational kernel (transducers to count sequences) to work with these acceptors. At some states I need to be able to map every non-explicitly-mentioned input symbol to epsilon with weight 1. How can I specify that? Do I really need to add a thousand arcs for each state?
For your second question, you will need to write your own function for this but it should be rather easy. Remember that the number of strings accepted by an acyclic automaton is in the worst case exponential in the size of the automaton.
Best, Cyril
A Chinese lexicon L: # of states 215371 # of arcs 303286 A Chinese bigram G: # of states 87918 # of arcs 16822088 Run fstcompose to get L o G # of states 391202 # of arcs 17125372 After determinization by running fstdeterminize, det(L o G) # of states 3479987 # of arcs 20071667 more complex !
I don't know why. I follows exactly the paper "Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, 2008". The output word label in L is already placed on the initial transitions of the word.
The code is as follows:
fstarcsort.exe --sort_type=olabel L.fst L.fst.sort fstarcsort.exe --sort_type=ilabel G.fst G.fst.sort fstcompose.exe L.fst.sort G.fst.sort LG.fst fstdeterminize.exe LG.fstb LG.fst.det fstminimize.exe LG.fst.det LG.fst.min Memory allocation failed
I run 64-bit openFST on windows 2003 server 64bit edition. The fstminimize.exe returns "Memory allocation failed" is a surprise. 64bit memory space is not sufficient?
Any suggestion that you could offer would be much appreciated.
Best, Zhijian
fstencode
to encode the labels, apply minimization and use fstencode
to decode (see FST.EncodeDecodeDoc for more details).
Best, Cyril
As mentioned in the paper "Allauzen, Mohri and Roark, Generalized algorithms for constructing statistical language models, ACL 2003." , in constructing a ngram model, there are transitions labeled with the end symbol < / s >, that lead to the final state.
To compose L with G, I need to add a psuedo word as follows to the L.
r eh d #0 read r eh d #1 read ... #0 </s>
Am I correct?
(btw: if I replace the < / s > in G with < eps >, the G is not deterministic. This is not a good way. So I propose the above way.)
Thanks for your patience. Your help is greately appreciated.
Best, Zhijian
</s>
in G and adding </s>
as a word with empty pronunciation (or silence) in L. The GRM library also allowed replace </s>
in G by a final weight.
Best, Cyril
I built a transducer of a trigram, following the paper "Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, 2008"
As we note, after we introduce the backoff state, the G is only an approximate representation, if there are such seen trigram that have lower probability than its backed-off bigram, i.e. p(z|x,y) < alph(x,y)*p(z|y). To my viewpoints, there are two solutions.
1) One is to splitting states before replacing failure-transition by eps-transition, as proposed in "Allauzen, Mohri and Roark, Generalized algorithms for constructing statistical language models, ACL 2003."
2) The other, as I think, is to create a backoff model from an interpolated algorithm (e.g. modified kn smoothing). Then we have p(z|x,y) > alph(x,y)*p(z|y) for all trigrams. And if we further work in the tropical semiring, the representation will be exact.
So I chose to construct G in the tropical semiring. Now the question comes. As we know, weight pushing in the tropical semiring is inferior to pusing in the log semiring. I'm not sure whether the following procedure is correct, and if it's correct, how to do it? The semiring info is embeded in the arc type. If I use shell command, how can I change the arc type?
1) Switch from representing G (or L o G) in the tropical semiring to representing in the log semiring
2) run fstminimize to get min(L o G)
3) Swtich back to representing min(L o G) in the tropical semiring.
Thanks for your patience. Your help is greately appreciated.
Best, Zhijian
VectorFst<StdArc> ifst; VectorFst<StdArc> ofst; VectorFst<LogArc> lfst; Map(ifst,&lfst,StdToLogMapper()); Minimize(&lfst); Map(lfst,&ofst,LogToStdMapper());
Best, Dogan
Sorry for the above lengthy question. To summarize, the main question is: when both lexicon L and grammar G are constructed in tropical semiring. Compared with directly run fstminimize.exe (i.e. in tropical semiring), is it better to switch to log semiring, run fstminimize.exe, and then switch back ?
Best, Zhijian
is there anybody who has compiled openfst-1.1 (latest version) on Cygwin successfully? Which "3rd-party" libs need to be available on Cygwin in advance?
Best regards, Henrik
I tried to run fstcompile.exe on a transducer with 23M states and 41M arcs, from AT&T FSM format text file. It terminates, saying that "Memory allocation failed". The 2G address space is not sufficent for compiling such a large transducer. Does the current openFST code support 64bit compile? Could you give me some instructions to build a 64bit openFST?
It is very appreciated that you could favor me with a reply. Thank you.
Zhijian
It's very kind of you if you can add a 64bit profile to the Visual Studio solution.
Best regards, Zhijian
//Building 3 trivial fsts for fst replacement VectorFst<StdArc> fst1,fst2,fst3; fst1.AddState(); fst1.AddState(); fst1.AddArc(0, StdArc(0, 123456789, StdArc::Weight::One(), 1)); fst1.SetFinal(1,StdArc::Weight::One()); fst1.SetStart(0); fst2.AddState(); fst2.AddState(); fst2.AddArc(0, StdArc(0, 1234567890, StdArc::Weight::One(), 1)); fst2.SetFinal(1,StdArc::Weight::One()); fst2.SetStart(0); fst3.AddState(); fst3.AddState(); fst3.AddState(); fst3.SetFinal(2,StdArc::Weight::One()); fst3.SetStart(0); fst3.AddArc(0,StdArc(0, 1000, StdArc::Weight::One(), 1)); fst3.AddArc(1,StdArc(0, 2000, StdArc::Weight::One(), 2)); vector< pair< StdArc::Label, const Fst<StdArc> * > > pairlabelfsts; pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(1000,&fst1) ); pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(2000,&fst2) ); pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(3000,&fst3) ); VectorFst<StdArc> rulefst; StdArc::Label slabel=3000; #ifndef USEOPENFSTBETA rulefst=ReplaceFst<StdArc>(pairlabelfsts,ReplaceFstOptions<StdArc>(slabel,false)); #else rulefst=ReplaceFst<StdArc>(pairlabelfsts,ReplaceFstOptions(slabel,false)); #endif RmEpsilon(&rulefst); //valgrind complains here rulefst.Write(some.fst); // and valgrind complains here too return 0; }
(I may submit a link to a full example code)
Cheers!
Thanks for pointing this out! I'll look more deeply into this.
Best, Cyril
Hi,
Thanks for such a splendid tool as openfst! I'm a very happy user of the beta version for more than a year now.
I'm writing here because I have encountered a specific problem with the delayed union operation using openfst-1.1. library. Basically, I perform a delayed union of many fsts. Afterwards, I update by reinstancing to a VectorFst . I have isolated a particular case in which the version compiled with openfst-beta works things out very quickly (less than one minute), whilst it takes 40 minutes on openfst 1.1 to create the same fst. I have a small but full example including the necessary fsts to read from (~6MB) and perform union with. I don't understand this behaviour and it seems to me something could be wrong within version 1.1
Perhaps this is a known issue? May I post a link to a tar file containing the example? Please let me know how should I proceed.
Thanks in advance!
I've started to look into this and I think I found the reason for this inefficiency. We'll fix this for the next release.
Thanks a lot for your feedback!
Best, Cyril
Union(RationalFst<A> *fst1, const Fst<A> &fst2)
as follows:
UnionFst<A> *union = new UnionFst<A>(fsts[0], fsts[1]); for (int i = 2; i < fsts.size(); ++i) Union(union, fsts[i]);This takes advantage of the fact that UnionFst derives from RationalFst and that delayed rational operations such as union can be applied destructively to a RationalFst.
Cyril
madhu
Say you have the string "x y z" in the set defined by the regex and the final string is "a b".
If you want the transduction "x y z" -> "a b a b a b", one possible solution is to create a simple transducer T in the form:
0 1 X a 1 2 <eps> b 2 0 <eps> <eps> 2where
X
is any symbol in your input alphabet (You need to add a transition in the form "0 1 X a" for each X in your alphabet) and compose the regex with T.
If you want the transduction "x y z" -> "a b" instead, then one possible solution is to create a transducer T in the form:
0 1 X <eps> 1 0 <eps> <eps> 1 2 <eps> a 2 3 <eps> b 3where
X
is any symbol in your input alphabet and compose the regex with T.
You might achieve the same behaviour using sigma matcher in composition (At least I believe it is possible).
You might even try (for "x y z" -> "a b"):
Best, Dogan
1) How to easily get the command-line help in openFST's shell-level exe? You know, in FsmLib, an argument "-?" will print help information.
2) If exe is sufficient for my application, which one do you recommend to use? In particular, I have a special emphasis on the performance of acceptor optimization operations, e.g., determinize, minimize, remove-epsilon. As we see, minimizing a transducer is not supported at the command level.
Thanks in advance.
Best regards, ZhijianOu
I'd like you to point out whether my following understandings are correct.
1) If the input is an acceptor, the implementation of fstminimize is first do weight pushing, and then do (unweighted) acceptor minimization.
2) If 1) is correct, then using different semirings will have different weight pushing results, and different minimization results.
3) In the description of fstminimize in openFST (or fsmminimize in fsmLib), there is no mention of the default semiring used. What semiring is the default?
4) For a deterministic transducer, P(I, x, y, F) contains only one path, using the notation in the paper - Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, Springer-Verlag, 2008.
5) fstdeterminize depends on the semiring ?
6) If we create binary fst from text file using fstcompile or fsmcompile, where to specify the semiring ?
It is very appreciated that you could favor me with a reply. Thank you.
Best regards, ZhijianOu
For fsmcompile use the -s switch fsmcompile -t -s log test.txt > test.fst
The -s switch for fsmcompile is to give the states textual names.
The output from "fsmcompile -?" is FSM Version 3.7 Copyright (C) 1998 AT&T Corp. All rights reserved. Usage: fsmcompile [-opts] [fsm] Options: -i symbols input arc symbols -o symbols output arc symbols -s symbols state symbols -t transducer -V file input potentials file -F file output FSM file -? info/options Continue with my question 3. In openFST, we can use --arc_type to specify the semiring. But we can run fstcompile without --arc_type, such as fstcompile --isymbols=isyms.txt --osymbols=osyms.txt text.fst binary.fst The default --arc_type is "standard", which means tropical semiring. Am I right? But in fsmLib, if we run fsmcompile as follow: fsmcompile -t -i i.syms -o o.syms < T.txt > T.fst It is not clear what semiring is used. Thanks for your attention.
Best, ZhijianOu
Yes, standard means tropical semiring (Explained in previous post "Why_is_TropicalWeight_considered_standard?")
As said in the openfstwin-1.1.zip's readme.txt, I open the solution file openfstwin.sln in Visual Studio 2008 and compile. But there are errors as follows:
1>Compiling...
1>fst.cc
1>d:\openfstwin-1.1\openfst-1.1\src\include\fst\unordered_set(19) : fatal error C1083: Cannot open include file: 'unordered_set': No such file or directory
1>properties.cc
1>d:\openfstwin-1.1\openfst-1.1\src\include\fst\unordered_map(19) : fatal error C1083: Cannot open include file: 'unordered_map': No such file or directory
1>Generating Code...
1>Build log was saved at "file://d:\openfstwin-1.1\openfstwin\openfstwinlib\Debug\BuildLog.htm"
1>openfstwinlib - 2 error(s), 0 warning(s)
It is very appreciated that you could favor me with a reply. Thank you.
Best regards, ZhijianOu
First, I'll make sure that you are using Visual Studio 2008 Service Pack 1 as mentioned here.
This looks like a missing TR1 issue. You might not have a working version of TR1 currently installed. As far as I know (I have never used VS2008), an implementation TR1 is included with SP1.
Best, Cyril
It works. Thank you. After installing VS2008 Service Pack1, I compiled openfstwin-1.1 successfully, though still with warnings, either warning C4396 or warning C4800. For example,
d:\openfstwin-1.1\openfst-1.1\src\include\fst/pair-weight.h(195) : warning C4396: 'fst::operator >>' : the inline specifier cannot be used when a friend declaration refers to a specialization of a function template
d:\openfstwin-1.1\openfst-1.1\src\include\fst/matcher.h(317) : warning C4800: 'uint64' : forcing value to bool 'true' or 'false' (performance warning)
Best regards, ZhijianOu
If you think any of the warnings are bogus you can the warning number to the pragma statement at the top of config.h
my build of OpenFST-1.1 with Visual Studio 2008 SP1 was successful.
Now I would like to compile/link an example test application which is shipped together with OpenFST in file 'openfst-1.1\src\test\fst_test.cc'.
Visual Studio reports the following compile errors for fst_tes.cc:
1>------ Rebuild All started: Project: TestFst, Configuration: Debug Win32 ------ 1>Deleting intermediate and output files for project 'TestFst', configuration 'Debug|Win32' 1>Compiling... 1>stdafx.cpp 1>Compiling... 1>TestFst.cpp 1>d:\usr\local\include\fst\test-properties.h(24) : error C3083: 'tr1': the symbol to the left of a '::' must be a type 1>d:\usr\local\include\fst\test-properties.h(24) : error C2039: 'unordered_set' : is not a member of 'std' 1>d:\usr\local\include\fst\test-properties.h(24) : error C2873: 'unordered_set' : symbol cannot be used in a using-declaration 1>TestFst - 3 error(s), 1 warning(s)
What do I have to do in order to use 'unordered_set'?
Thanks in advance, Henrik
Do OpenFst and AT&T FSM support UTF8 or Unicode Encoding?
Thank you in advance,
Best
This question was discussed previously in the forum: link.
Let me know if that answer your questions.
Best, Cyril
I'm interested in pattern matching, i.e. finding occurrences of a set of strings within a text.
M. Mohri [1997] has described an algorithm to generate an efficient pattern search automaton based on failure functions in this paper:
Mohri, Mehryar. String-Matching with Automata. Nordic Journal of Computing, 4(2):217-231, Summer 1997. http://www.cs.nyu.edu/~mohri/postscript/njc.ps
This method has also been referred to by W. Skut [FSMNLP 2008].
Is there anyone who has submitted (or could submit) code implementing this algorithm that would be free for commercial use?
Thank you, Phil Sours
http://swtch.com/~rsc/regexp/regexp1.html
although it is very simple to implement, this algorithm generates transducers which are, in some cases, not terribly well-suited to further optimization or post-processing, especially epsilon removal. the following paper provides an overview of some more efficient ( but more complex ) alternatives which may be better suited to general or more complex inputs,
www.cs.nyu.edu/~mohri/postscript/glush.pdf
dunno if that is what you are looking for, and perhaps it's overkill, but depending on what you mean by 'match strings in a text' it seems like openfst itself might be overkill.
A version of this algorithm is implemented in the GRM library. The binaries are available for free for non-commercial use, no source code however.
Cyril
I need to prune an fst while keeping the state ID's unchanged. I need this behaviour since I keep auxiliary information (i.e potentials) about the states of the input fst. Is there a convenient way of doing this?
Thanks, Dogan Can
Cyril
I noticed the following comment in the advanced usage section.
In a user-defined binary, the command line options processing will all also work if the user calls:
SetFlags(usage, &argc, &argv, true);In that case, the user can set his own flags as well, following the conventions in <fst/flags.h>.
My problem is that when I add a new flag to a custom fst binary, I need to put its definition along with other flag definitions in <src/bin/main.cc> or in <src/lib/flags.cc> and rebuild the library, otherwise SetFlags
method does not recognize it.
I have the generic OpenFst binary setup:
fstbinary.cc -- includes binary-main.h, calls SetFlags
and Arc
templated BinaryMain
function
binary-main.h -- includes <fst/main.h>, declares a new flag using DECLARE_type(newflag)
and defines BinaryMain
function
Defining the new flag using DEFINE_type(newflag,default,help_message)
in fstbinary.cc or binary-main.h allows me to use the newflag
with its default value, however i cannot pass it from the command line.
Is it possible to to define and use new flags in a user-defined binary without modifying the library source code? If so, could you provide an example?
Many thanks, Dogan Can
#include "test-main.h" namespace fst { REGISTER_FST_MAIN(TestMain, StdArc); REGISTER_FST_MAIN(TestMain, LogArc); } DEFINE_double(myflag, 1234, "My Flag"); using fst::CallFstMain; int main(int argc, char **argv) { string usage = "Test flags \n\n Usage: "; usage += argv[0]; usage += " Flags: myflag\n"; std::set_new_handler(FailedNewHandler); SetFlags(usage.c_str(), &argc, &argv, true); if (argc !=1) { ShowUsage(); return 1; } return CallFstMain("TestMain", argc, argv, FLAGS_arc_type); }
#ifndef TEST_MAIN_H__ #define TEST_MAIN_H__ #include <fst/main.h> DECLARE_double(myflag); DECLARE_string(arc_type); namespace fst { template <class Arc> int TestMain(int argc, char **argv, istream&, const FstReadOptions&) { cerr << "myflag=" << FLAGS_myflag << endl; return 0; } } #endif
This is exactly what I tried before. Seems like there is a problem with my build settings. Thanks for your help.
After going through my build settings, I discovered that GCC_SYMBOLS_PRIVATE_EXTERN compiler flag was set to YES in default Release settings of XCode, and changing it to NO solved the problem. This flag sets symbol visibility and I suppose it makes flag definitions invisible to the flagregister.
unfortunately, although I've now read through,
this one point still eludes me, and does not seem to be described or illustrated in any of these papers ( or any others that I was able to find ). Almost every other aspect of the transducer construction and cascade composition seems to be very clearly explained and usually illustrated with simple examples, but the trick with auxiliary phones and composition continues to elude me.
At present my approach consists of the following steps,
min( det( *C* o det( *L* o *G* ) ) )
eh+d-#1will be replaced with '-'
there are clearly a couple of problems with the last stage of my process. first, each auxiliary symbol at the context-independent level is being mapped to multiple different different auxiliary symbols at the context-dependent level. nevertheless the literature implies that this should not in fact be the case,
For further determinizations at the context-dependent phone level and distribution level, each auxiliary phone must be mapped to a distinct context-dependent phone. Thus, self-loops are added at each state of C mapping each auxiliary phone to a new auxiliary context-dependent phone. The augmented context-dependency transducer is denoted by C.
so i suppose that the mapping should be one-to-one, rather that one-to-many, which is what my current approach yields.
another clearly negative consequence of my current botched approach is that the final auxiliary-symbol removal step results in lost context wherever such auxiliary symbols appear. for example, the following transducer
0 1 SIL+r-eh red 1 2 r+eh-d - 2 3 eh+d-#2 - 3 4 d+#2-b - 4 5 #2+b-ah ball 5 6 b+ah-l - ... ...
defaults to,
0 1 SIL+r-eh red 1 2 r+eh-d - 2 3 - - 3 4 - - 4 5 - ball 5 6 b+ah-l - ... ...
which, although i can actually use it for recognition, is obviously wrong.
so finally my question: are there any other papers which provide either diagrams of an augmented component context-dependent triphone transducer, or a more detailed description of the I(M) operation described in "Generalized Optimization...",
For any transducer M, we also denote by I(M) the transducer augmented with extra transitions such that each new symbol on its input side is mapped to some new and distinct output symbol.
try as I might still don't quite get this. sorry if this seems a foolish question but im teaching this to myself and dont really know where else i might ask.
The most trivial, canonical example of a deterministic context-dependent triphone transducer.
The same example, inverted, and augmented with 2 auxiliary phones, #0
, and #1
Assuming we have an example grammar transducer of the following form,
0 1 a testa 1 2 b - 2 3 #0 - 3 4 a testb 4 5 b - 5 6 #1 - 6
Then composition with the CD transducer yields,
where the auxiliary triphone symbols can now be safely replaced with epsilons, or short-pauses, or removed without any loss of information.
so, in contrast to the above examples where we have a list of monophones which includes auxiliary phones, 'a, b, #0, #1', which are all treated the same. and further in contrast to the self-loop approach, where we have two separate classes of symbols, monophones: 'a, b', and aux. symbols: '#0, #1' we instead of a selective combination depending on the structure of the lexicon transducer, e.g., monophones: 'a#0, b, a#1' which are all treated as normal monophones. this approach has an advantage in that there is no need to treat anything specially, and at the C-level ( assuming we've no need to worry about anything further down in the cascade ), there is no loss of information. instead of 'eps-a+b, #0c:, a-b+a' we get 'epse-a#0-b, a#0-b+a#1'. and instead of epsilon replacement, we have a simple symbol replacement.
this actually seems like a reasonable solution so long as, a. the degree of homophony is small, and b. there is no need to interact with lower levels in the cascade. if either a. or b. do not hold this alternative rapidly becomes intractable due to the increase in monophones and the increase in the size of the resulting C-level transducer.
However, I want to point out that the first approach does not really require to physically add the self-loops and can be achieved without increasing the size of C by using custom matchers in composition.
a b c d e f
in which each even-numbered symbol (counting from 0) represents an Input symbol and each odd-numbered symbol represents an Output symbol. (This is the "linearized transducer".)
Question: Is there an easy way to turn this Acceptor (a kind of linearized transducer) into a real Transducer where each path like the one above is converted into a path of half the length, with these labels:
a:b c:d e:f
1) Use composition to turn a b c d e f
into a:eps eps:b c:eps eps:d
.
This can easily be done using two two-state transducers T_1
and T_2
and computing T_1 o A o T_2
. T_2
is the transducer erasing any symbol in odd position: for all a
in the alphabet,
0 1 a eps 1 0 a a 1
T_1
is the inverse of the reverse of T_2
2) Apply Synchronize() followed by RmEpsilon().
Cyril
I noticed two small bugs/problems in the library and here are my fixes for them.
1. fstcompile does not give any error message when no input file is specified.
Solution: change line 00218 in <compile-main.h>
< if (!istrm) {
> if (!*istrm) {
2. <lexicographic-weight.h> defines pair weight type using "_<_" symbol between component weight types. However "<" symbol creates problems say when you want to register LexicographicArc
for fst binaries since you can neither supply type1_<_type2
as an arc type nor create a dynamic shared object named type1_<_type2-arc.so
.
Solution: change line 00075 in <lexicographic-weight.h>
< static const string type = W1::Type() + "_<_" + W2::Type();
> static const string type = W1::Type() + "_L_" + W2::Type();
Best, Dogan Can
Best, Cyril
as a software developer of an OCR team I'm looking for a library which can
(a) represent certain kind of language models (e.g., OCR error correction rules, n-grams, word dictionaries, morphology and grammar rules for German language) in a unique way, for example, as weighted finite automata
and
(b) has some high-level operations already implemented to process the language models, e.g. functions like 'compose', 'minimize' and 'shortest-path-search'.
The BOOST Graph Library (http://www.boost.org/doc/libs/1_39_0/libs/graph/doc/index.html) does also have weighted finite automata which could be used to represent language models and BOOST Graph offers lot's of functions on them.
My question: Where are the supposed differences between BOOST Graph and OpenFST in relation to my OCR application, and, what are the strenghts and weaknesses of both of them?
Best regards, Henrik
The Boost Graph Library focuses on directed and undirected graphs. As far as we know, it does not provide specific operations for weighted automata and transducers such as determinization, minimization or composition. You can probably use the BGL to represent language models, but you'll have to re-implement these algorithms. If you know otherwise, please point us to the relevant documents as there is no mention of automata-specific algorithms at the URL you provided.
The OpenFst library was especially designed for handling weighted automata and transducers. It provides a large selection of algorithms (including minimization and composition), some of them being implemented as on-the-fly (lazy) operations. The library also includes a general semiring framework. OpenFst provides some of the basic graph algorithms offered by the BGL such as generic visitors (depth-first, breadth-first, shortest-first, ...), generic shortest-path algorithms (Dijkstra, Bellman-Ford, ...), topological sort, strongly-connected components, ... However, OpenFst does not offer some other algorithms provided by BGL (spamming tree algorithms for instance).
The OpenFst library has been successful used by several natural language processing applications: speech recognition, speech synthesis, machine translation and OCR among others. It is for instance one of the core components of the speech recognition and speech synthesis efforts at Google. In that particular example, it is used among other things to represent (and combine) n-gram language models, word dictionaries and regular grammars.
To conclude, the Boost Graph library could provide a) but not b). The OpenFst library provides a) and b).
Best regards, Cyril
Are there any plans to support the three-way composition mentioned in the technical report "N-way composition of weighted finite-state transducers" (by Cyril Allauzen and Mehryar Mohri) in the near future? It would come in handy for one of my projects and I would be willing to help with the implementation if no-one is working on it, but if it's already being worked on I'll wait for the next release.
Does anyone know of an existing implementation of the generalized n-way composition algorithm from that paper? I'm considering working on my own implementation if not.
Best regards, Greg
I am working with the 1.1 version and it seems that there is a change in behavior of SetInputSymbols/SetOutputSymbols. As from version 1.1, it seems that these two functions make a deep copy of the SymbolTable given in argument. This is due to the addition of a copy constructor in symbol-table.h :
explicit SymbolTableImpl(const SymbolTableImpl& impl) : name_(impl.name_), available_key_(0), dense_key_limit_(0), check_sum_finalized_(false) { for (size_t i = 0; i < impl.symbols_.size(); ++i) { AddSymbol(impl.symbols_[i], impl.Find(impl.symbols_[i])); } }* Am I right ?
* if yes, i don't see the use of having a reference counting on a deep copy
* if yes, the following comment in symbol-table.h is not true anymore :
// SymbolTables are reference counted and can therefore be shared across // multiple machines. For example a language model grammar G, with a // SymbolTable for the words in the language model can share this symbol // table with the lexical representation L o G.Thank you for your answer and for this great library !
This is inexact. SetInputSymbols and SetOutputSymbols still do a shallow copy of the SymbolTable in 1.1. They both call SymbolTable::Copy that in turn calls the copy constructor of SymbolTable:
SymbolTable(const SymbolTable& table) : impl_(table.impl_) { impl_->IncrRefCount(); }The copy constructor of SymbolTableImpl is not called since impl_ is a pointer to a SymbolTableImpl.
Best, Cyril
Off-topic post deleted.
MichaelRiley .
Back on 14 August 2007, you wrote:
"The library currently only supports the determinization of functional transducers (if two successful paths have the same input label, they need to also have the same output label). The reason for that is that we use the weighted automata determinization algorithm viewing the output labels as weights in the string semiring ....
A workaround is to use Encode/Decode to view the transducer as an acceptor, considering the pair (ilabel, olabel) as one symbol.
1. Encode: EncodeMapper encoder(kEncodeLabels, ENCODE); Encode(fst, &encoder);
2. Determinize.
3. Decode: Decode(fst, encoder);
*** End of Quotation ***
I need a bit of mind-tuning.
Question: Is the Encode/Determinize/Decode workaround recommended ONLY when determinizing a NON-functional Fst? If so, what's the best way to determine if an Fst candidate for determization is functional or not?
Background: In my application, Fsts are created from arbitrarily complex regular expressions, and the resulting Fsts may be functional or non-functional. Without the workaround, statements like
$v = a | a:b ;
currently cause a crash when the result Fst is determinized (without the Encode/Decode workaround), because the Fst maps "a" to "a" and to "b"--i.e. in this case the Fst is not functional.
Thanks,
Ken
After following the INSTALL instructions using the defaults settings and using the library in my own program as described in the README file. The compiler gives and error looking for config.h (included compat.h). Is config.h copied as part of the installation? Everything compiles fine if I add another include path to the location where the configure script was run from.
That means that if you then try to assign one MutableFst to another, nothing happens.
Recommended fix: add an operator=(const MutableFst &fst) method that simply calls the existing (abstract virtual) assignment operator, via static_cast or something. I can submit a diff if this isn't clear.
Indeed, we found out about this bug and fixed it a while back. The latest release of the library (version 1.0) included that fix.
Best, Cyril
$myfst = a* b+ ((c|d|e|f) - (m|n|d)){2} ((dog|cat|rat) & (elephant|dog|pig))? ;
After the regular expression is parsed, the interpreter walks the parse tree and makes multiple calls to OpenFst library functions--Closure, Compose, Concat, Difference, Intersect, etc--building many intermediate Fsts and combining them using the various operations until the final Fst is created and (in this case) bound to the variable $myfst. Obviously, all this processing gets done automatically, without human intervention. The user is interested only in the final result, $myfst, which then might get used in subsequent regular expressions.
Ideally, the final result would be automatically determinized and minimized.
I understand, with help from Cyril , that if an Fst is tested and found to be 1) acyclic, or 2) both (unweighted AND an Acceptor) then it can be safely Determinized and Minimized. (Corrections would be welcome.) So after each intermediate or final Fst is created by the Kleene interpreter, it might be sent to the following clean-up function
// first-draft pseudo-code if ( the fst is acyclic || (the fst is unweighted && the fst is an acceptor) ) { RmEpsilon() Determinize() Minimize() } else { // this is questionable, see below RmEpsilon() }
However, I'm aware that this could still be problematic. In a response on this forum, Cyril stated
"Indeed, in some situations you cannot remove epsilons because it will lead to a space blow up. In that case, you might still want to optimize the machine with epsilons [i.e. without first invoking RmEpsilon] using determinization and minimization."
In another response, Cyril stated
"Indeed, union introduces epsilon-transitions and epsilon-transitions will slow down composition. Non-determinism will also slow down composition. So, I would recommend you first epsilon-remove and then determinize your fst. However, there is always a risk with determinization (it might blow up or even not terminate). Hence, if what I suggest does not work better, you should investigate using determinize only, or rmepsilon only or determinize followed by rmepsilon."
This puts in question the 'else' clause in the pseudo-code above, where RmEpsilon() is automatically called on Fsts that don't satisfy the requirements for safe determinization and minimization.
Question: Assuming my Kleene scenario, where regular expressions are being interpreted and many Fsts (intermediate and final) are being created automatically, when can RmEpsilon(), Determinize() and Minimize() be safely invoked automatically? I want the interpreter to do all the optimizations that are mathematically safe, but to attempt nothing more.
We can assume that the interpreter can interrogate various features of the candidate Fst (cyclic vs. acyclic, weighted vs. unweighted, acceptor vs. transducer, etc.). The interpreter also knows exactly which operation has just been performed to produce the Fst in question (Union, Concat, Closure, etc.).
Thanks,
Ken
To clarify, when talking about determinization:
Assuming this definition of safe/unsafe, determinization is safe for acyclic FSTs and for unweighted acceptors and rmepsilon and minimization are always safe.
However, safe does not guarantee that the algorithms will terminate on your machine because you have a finite amount of memory and don't want to wait more than x minutes.
The complexity of determinization in the safe case is in the worst case exponential in the size of the input machine, and quadratic in the size of the result (see DeterminizeDoc).
The complexity of epsilon removal is in the worst case quadratic in the size of the input (see RmEpsilonDoc).
There is no way you can ensure these algorithms will never blowup of memory. However, you can ensure that for most "reasonable" input they will behave "reasonnably".
Due to potentially large cost of these algorithms, it is always recommended you try to reduce the size of the input Fst before hand. This is why I would recommend first optimizing the machine with epsilon, applying epsilon-removal and then optimizing the machine without epsilons, Hence, the "safest" optimization function would be:
// first-draft pseudo-code if ( the fst is acyclic || (the fst is unweighted && the fst is an acceptor) ) { Determinize() Minimize() RmEpsilon() Determinize() Minimize() } else { // this is questionable, see below RmEpsilon() }
#include <ctime>
I should also have Visual Studio build instructions ready soon.
Paul
I'm running into some trouble trying to build static recognition transducers using OpenFst. Specifically, for even a fairly trivial language model G:
# of states 42980 # of arcs 183468 # of final states 923 # of input/output epsilons 43867
and (determinized and minimized) dictionary L~:
# of states 1180 # of arcs 2424 # of final states 1 # of input/output epsilons 1
when I try to compose these with fstcompose, it rapidly uses all available memory. I haven't let it run long enough to see if it actually terminates.
I'm not sure this is specific to OpenFst, because I find that the FSM library has the same problem. It's possible that I'm building the language model incorrectly, though I'm not sure why this would cause composition to blow up. I am planning to do a sanity check using the GRM library but haven't figured this out yet.
Is it necessary to use on-demand composition when building a transducer of this sort?
You should try not to determinize/minimize L and make sure that the output labels (i.e words) are on the first non-input epsilon on a path. Also, make sure that L is output label-sorted. (I assumed you do fstcompose L G > LG)
Let me know if this works.
Cyril
I assume the problem with a minimized L is that its maximum out-degree is much higher than an un-minimized one?
I'm glad to read that this works quite well now. The issue is with determinization and not minimization. Determinization pushes back the output labels creating a lot of long epsilons paths. In composition, while trying to match a label x at a state in G, all this epsilon paths in L are going to be followed in order to find an x. A lot of this work is wasteful since there is only a few of these paths at the end of which an x can be found.
Cyril
when the output labels (i.e words) are on the first non-input epsilon on a path,I cannot determinize the Lexicon.Could you please tell me how to determinize it?
I'm trying to determinize an FST and the amount of memory it uses gradually increases and it reaches to the limit, results "memory allocation failed" error. However, after removing 2-3 certain arcs in the FST, the process ends pretty quickly. Those arcs don't have any particular characteristics, so I guess it's due to a structure of the FST which consists of more than 2-3 arcs.
Could you give me any solution or hint?
Roughly, two states are sibling states if they are reachable from the start state on different paths that have the same label sequence, and there are cycles at these two states that have similar labels. If these two cycles have the same weight then they are twins.
On this machine, for example, fstdeterminize doesn't terminate: 0 1 1 1 0 2 1 2 1 1 2 3 1 3 3 5 2 2 2 4 2 3 4 6 3
0 1 1 1 0 2 1 2 1 1 2 3 1 3 3 5 2 2 2 4 2 3 4 6 3
Thanks,
Ken
The definition of the header can be found in fst/lib/fst.h, and the symbol table format can be gleaned from fst/lib/symbol-table.cc. How the subsequent state and arc definitions are stored depends entirely on what particular Fst and Arc classes are being used.
In the case of VectorFst
int32 magic_number = 2125659606; string fst_type = "vector"; string arc_type = "standard"; int32 file_version = 2; /* file format version */ int32 flags; /* bitfield: HAS_ISYMBOLS = 1, HAS_OSYMBOLS = 2 */ uint64 properties; /* bitfield, defined in fst/lib/properties.h */ int64 start; /* start state ID */ int64 num_states; int64 num_arcs; /* input symbol table, if flags & HAVE_ISYMBOLS */ int32 magic_number = 2125658996; string name; int64 next_available_key; /* Next available key (number) */ int64 size; struct { string symbol; int64 key; } symbols[size]; /* output symbol table, if flags & HAVE_OSYMBOLS - see above... */ /* states and arcs */ struct { float final_weight; int64 num_arcs; struct { int ilabel; int olabel; float weight; int next_state; } arcs[num_arcs]; } states[num_states];
Access control: