OpenFst Forum 2009 Archive

OpenFst Forum 2009 Archive

Finding n-shortest paths

NewUser - 29 Dec 2009 - 16:44

I was wondering what is the best way to go about finding n-shortest paths in an fst when the fst is considerably large (the fst in question is 1.3 GB in size; has around 10 million states and three times more number of arcs).

The shortest path algorithm expands the whole transducer and memory usage goes up to 2 GB! (I did see a previous post about setting the option first_path to true, but i need the first 100 or 1000 best paths to be outputted).

It would be of great help if someone had ideas on how to proceed.

Thanks a lot in advance.

Why is 64bit determinization slower 100 times than 32bit determinization?

KyuwoongHwang - 13 Dec 2009 - 21:29

To overcome memory limitation, I compiled the openfst for x64 in visual studio. It seems working well. but the speed is very slow. I suspected garbage collection and set the limit to 10GB to practically disable it. It didn't matter. The in.fst file size is 84MB. Did anyone have same problem? Thanks in advance for any help.

x32/fstdeterminize.exe --fst_default_cache_gc_limit=10000000000 in.fst out.fst ==> 292.40 secs.

x64/fstdeterminize.exe --fst_default_cache_gc_limit=10000000000 in.fst out.fst ==> 39034.45 secs.

NewestUser - 22 Dec 2009 - 04:32

is either one using swap? seems unlikely given the fairly small input size and x64, but i noticed that if the operation starts thrashing the swap it generally takes orders of magnitude longer to complete. also, what is your pointer size?

CyrilAllauzen - 11 Jan 2010 - 12:21

We commonly use 64 bit binaries on Linux and didn't notice any inefficiencies. Swapping could be the issue as pointed out by NewestUser since going 64 bits would increase the memory usage. You should also double-check that you are using the right compiler optimization options.

Best, Cyril

Adding other properties to arcs

AlexeiIvanov - 22 Nov 2009 - 17:00

I wonder what is the best way to augment arcs with properties other then input/output symbols and weights? E.g. in ASR lattice re-scoring sometimes it is beneficial to keep the time stamps of elementary acoustic events and propagate them to the final FST.

Thank you.

AlexeiIvanov - 24 Nov 2009 - 07:01

OK, I have found a way to do it. It works perfectly. It was achieved without any code modification. All necessary information is inserted into arc label.

If anyone needs details please, write me to alexei_v_ivanov@ieee.org

k-closed semirings

RolandSchwarz - 20 Nov 2009 - 08:28

I probably did not totally understand the Mohri papers, but what I got was that the single source shortest distance algorithm only works for locally closed semirings, where in the general case the Floyd-Warshall or Gauss-Jordan all-pairs shortest distance has to be applied. Now the code documentation of the openFST library tells me that the implementation there is the single source one according to Mohri 2002 and that it only works for k-closed semirings. Now here's what I think that means: (i) the code documentation is not totally precise, in that the algorithm also works for locally closed, not just k-closed semirings (ii) it therefore works on the tropical (which is 1-closed) and the log semiring with positive weights (which corresponds to the real semiring with 0 <= weights <=1) up to a certain accuracy (iii) if that is true, it does not work on the log semiring with positive and negative weights. It does not give an error though, and it seems to compute sensible distnances. What happens in that case? Is there an internal check for closedness? Does it fall back to an all-pairs implementation of the Floyd-Warshall-whatever variant?

Or alternatively, am I totally mistaken? Can someone shed some light on my confusion?

Cheers,

Roland

CyrilAllauzen - 11 Jan 2010 - 12:17

As long as there are no cycles with negative weights (in the log semiring), the single source shortest-distance will converge. This is what M. Mohri tried to convey by the notion of being k-closed for a given automaton.

We haven't implemented Floyd-Warshall. So there is no backup in case of a bad cycle, the algorithm would simply not converge.

Best, Cyril

ReWeight: Working Example

NewestUser - 08 Nov 2009 - 04:11

Would it be possible to get a practical working example of the ReWeight operation, along with an example of the 'potentials.txt' input? At present it is not exactly clear how this operation might be used, or what it might be useful for.

CyrilAllauzen - 09 Nov 2009 - 11:45

An example of potentials.txt input is the ouptut of fstshortestdistance.

fstshortestdistance a.fst > potentials.txt
fstreweight --to_initial a.fst potentials.txt > b.fst

This is equivalent to:

fstpush --push_weights --to_initial a.fst > b.fst

This is actually the way pushing is implemented in the library, shortest-distance followed by reweighting.

read function causes problems

MichaelBerger - 30 Oct 2009 - 16:38

Hi,

I am trying to read a fst into a simple C++ application, however, for some reason I am being told that certain symbols are undefined:

Ld build/Debug/fstprintallpaths normal x86_64
cd /.../fstprintallpaths
setenv MACOSX_DEPLOYMENT_TARGET 10.6
/Developer/usr/bin/g++-4.2 -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.6.sdk -L/Users/michael/code/fstprintallpaths/build/Debug -L/usr/local/lib -L/usr/local/lib/Cabal-1.6.0.2 -L/usr/local/lib/ImageMagick-6.4.1 -L/usr/local/lib/pkgconfig -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4 -L/usr/local/lib/ImageMagick-6.4.1/config -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16 -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Language -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16/coders -L/usr/local/lib/ImageMagick-6.4.1/modules-Q16/filters -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Compat -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/PackageDescription -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Language/Haskell -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple/Build -L/usr/local/lib/Cabal-1.6.0.2/ghc-6.10.4/Distribution/Simple/GHC -L. -F/Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/Debug -filelist /Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/fstprintallpaths.build/Debug/fstprintallpaths.build/Objects-normal/x86_64/fstprintallpaths.LinkFileList -mmacosx-version-min=10.6 -o /Users/michael/research/projects/lattice_parsing/code/fstprintallpaths/build/Debug/fstprintallpaths

Undefined symbols:
  "fst::Int64ToStr(long long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from:
      fst::FloatWeightTpl<float>::GetPrecisionString()  in fstprintallpaths.o
  "fst::FstHeader::Read(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)", referenced from:
      fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Read(std::basic_istream<char, std::char_traits<char> >&, fst::FstReadOptions const&)in fstprintallpaths.o
ld: symbol(s) not found
collect2: ld returned 1 exit status



-----------------------------------------------------------------
fstprintallpaths.cpp:

#include "fstprintallpaths.h"

#include <iostream>
#include <vector>
#include <fst/fstlib.h>
#include <fst/fst-decl.h>
#include <fst/vector-fst.h>
#include <fst/weight.h>

using namespace fst;

int main (int argc, char * const argv[]) {
    // insert code here...
   std::cout << "Hello, World!\n";
   StdFst *input = StdFst::Read("blah.fst");
   //StdFst *model = StdFst::Read("blah.fst");
   //printAllStrings(&fst);
    return 0;
}

string itos(int i) { // convert int to string; from Bjarne Stroustrup's FAQ
   stringstream s;
   s << i;
   return s.str();
}

string vectorToString(vector<int> v) {
   if(v.size() == 0)
      return "<>";
   string result = "<" + itos(v[0]);
   for(int i = 1; i < v.size(); i++) {
      result + "," + itos(v[i]);
   }
   return result + ">";
}

void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, TropicalWeight cost) {
   if(fst.Final(state) == TropicalWeight::Zero()) 
      cout << vectorToString(str) << " with cost " << (Times(cost,fst.Final(state))) << endl;
   for(ArcIterator <VectorFst <StdArc> > aiter(fst,state); ! aiter.Done(); aiter.Next()) {
      StdArc arc = aiter.Value();
      str.push_back(arc.ilabel);
      printAllStringsHelper(fst, arc.nextstate, str, Times(cost, arc.weight.Value()));
      str.pop_back();
   }
}

// a bad idea if there are loops :)
void printAllStrings(VectorFst<StdArc> &fst) {
   vector<int> str(0);
   TropicalWeight tw(TropicalWeight::One());
   printAllStringsHelper(fst, 0, str, tw);
}

-----------------------------------------------------------------

fstprintallpaths.h

/*
 *  fstprintallpaths.h
 *  fstprintallpaths
 *
 */

#include <iostream>
#include <vector>
#include <fst/fstlib.h>
#include <fst/fst-decl.h>
#include <fst/vector-fst.h>
#include <fst/weight.h>

using namespace fst;

int main(int argc, char * const argv[]);

string itos(int i);

string vectorToString(vector<int> v);

void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, TropicalWeight cost);

void printAllStrings(VectorFst<StdArc> &fst);

MichaelBerger - 30 Oct 2009 - 16:57

Trying it on the commandline gives the same problems:

$ g++ fstprintallpaths.cpp fstprintallpaths.h -I  /usr/local/include 
Undefined symbols:
  "fst::FstHeader::Read(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)", referenced from:
      fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >::Read(std::basic_istream<char, std::char_traits<char> >&, fst::FstReadOptions const&)in ccspqZtg.o
  "fst::Int64ToStr(long long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*)", referenced from:
      fst::FloatWeightTpl<float>::GetPrecisionString()  in ccspqZtg.o
ld: symbol(s) not found
collect2: ld returned 1 exit status

PaulDixon - 31 Oct 2009 - 08:19

Add a link switch -lfst for openfst to the g++ command

Determinization increases size of FST

MichaelBerger - 25 Oct 2009 - 14:36

Hi, I am trying to determinize a FST using fstdeterminize but for some reason the determinized fst is much larger than the original one:

$ fstinfo a_binary.fst
fst type                                          vector
arc type                                          standard
input symbol table                                ia.txt
output symbol table                               oa.txt
# of states                                       24
# of arcs                                         9095
initial state                                     0
# of final states                                 4
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              9095
# of accessible states                            24
# of coaccessible states                          24
# of connected states                             24
# of strongly conn components                     24
expanded                                          y
mutable                                           y
acceptor                                          n
input deterministic                               n
output deterministic                              n
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   y
input label sorted                                n
output label sorted                               y
weighted                                          y
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            n


$ fstinfo a_rmeps.fst
fst type                                          vector
arc type                                          standard
input symbol table                                ia.txt
output symbol table                               oa.txt
# of states                                       24
# of arcs                                         645
initial state                                     0
# of final states                                 4
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              645
# of accessible states                            24
# of coaccessible states                          24
# of connected states                             24
# of strongly conn components                     24
expanded                                          y
mutable                                           y
acceptor                                          n
input deterministic                               n
output deterministic                              n
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   y
input label sorted                                n
output label sorted                               y
weighted                                          y
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            n


$ fstinfo a_determinized.fst 
fst type                                          vector
arc type                                          standard
input symbol table                                ia.txt
output symbol table                               oa.txt
# of states                                       1026
# of arcs                                         22054
initial state                                     0
# of final states                                 906
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              22054
# of accessible states                            1026
# of coaccessible states                          1026
# of connected states                             1026
# of strongly conn components                     1026
expanded                                          y
mutable                                           y
acceptor                                          n
input deterministic                               y
output deterministic                              n
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   y
input label sorted                                y
output label sorted                               y
weighted                                          y
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            n

MichaelBerger - 25 Oct 2009 - 14:38

sorry for the cluttered command line output. The point I would like to get across is that the size of the original FST increases by 1000 states and the number of arcs also more than doubles when determinizing it.

Is this normal?

CyrilAllauzen - 02 Nov 2009 - 14:31

I fixed the formating of your original post. I don't see anything very unusual here. It is possible for determinization to significantly increase the size of the Fst.

ArcSort error with Mutable FSTs

JonathanB - 16 Oct 2009 - 16:45

I am receiving the compile error "no matching function for call to 'ArcSort(fst::StdVectorFst&, fst::StdOLabelCompare)' " when I attempt to compose two mutable FSTs. Although an answer to an older post with the same error stated that making FSTs mutable would resolve this error, I receive the same error with mutable FSTs. I would really appreciate some help!

My code is of the form:

#include "fst/fstlib.h"

using namespace std; using namespace fst;

int main() { StdVectorFst? fst1; StdVectorFst? fst2; StdVectorFst? fst3;

ArcSort? (fst1, StdOLabelCompare? ()); fst3 = ComposeFst? (fst1, fst2);

return 0; }

I posted this question as part of an old topic, but I am reposting it as a new topic because I am not sure whether old topics are still reviewed.

Thank you very much, Jonathan

JonathanB - 16 Oct 2009 - 16:48

I apologize for the poor formatting. The code should read:

#include "fst/fstlib.h"

using namespace std;
using namespace fst;

int main()
{
StdVectorFst fst1;
StdVectorFst fst2;
StdVectorFst fst3;

ArcSort(fst1, StdOLabelCompare());
fst3 = ComposeFst(fst1, fst2);
return 0;
}

PaulDixon - 16 Oct 2009 - 20:27

Changing to the following compiles for me

 
ArcSort(&fst1, StdOLabelCompare());
fst3 = StdComposeFst(fst1, fst2);

do we have a transition that encodes a fail transition?

CamilleChen - 15 Oct 2009 - 15:37

Hi,

I know that an epsilon transition encodes an empty string. However, I'd like to compose two FSA. In case a label exists in one FSA, but not in another, do we have a fail transition, instead of the epsilon transition, can encode this?

Thanks very much!!

CyrilAllauzen - 15 Oct 2009 - 16:00

Hi Camille,

You want to use PhiMatcher (or maybe RhoMatcher, I'm not completely sure from your message). See the matcher documentation here.

Best, Cyril

CamilleChen - 15 Oct 2009 - 17:35

Hi Cyril, Yeah, these matchers are quite useful! Thanks!

Best, Camille

regular expression compiler

FrancoisBarthelemy - 15 Oct 2009 - 05:10

Is there an available regular expression compiler for OpenFst, namely a compiler which compiles a regular expression into an fst?

Thanks

Traverse the FST to check if it accepts a string or not and give appropriate output

VipulMittal - 15 Oct 2009 - 02:09

I have built an FST for a dictionary and I want to traverse it to check if the given word is a valid dictionary entry. If its a valid word, I'll output its feature values. Can someone please help me with how can I traverse a FST ? My Fst size is around 100MB.

Example FST for word : cat

0 1 c c 1 2 a a 2 3 t n,sg 3

VipulMittal - 15 Oct 2009 - 02:15

dont knw why the FST appeared like this. It is: 0 1 c c 1 2 a a 2 3 t n,sg 3

CyrilAllauzen - 15 Oct 2009 - 14:41

What you type is interpreted as twiki markup language. As mentioned above, you would have need to use the verbatim tags.

As for your original question, you want to use Composition.

compute weight automatically

CamilleChen - 14 Oct 2009 - 23:34

Hi,

I have another question:

if we know the total cost of path1 and we know the total cost of path 2 and path 1 is part of path 2 (say path 1 has 1 arc, and path 2 has 2 arcs, and path2's 1st arc label is the same as the path1's arc's label.)

then, is there a way to automatically combine these two paths in a single fst, by keeping path1's cost as the 1st arc's cost, and subtracting the path1's total cost from path2's total cost, and assign it as the cost of the second arc?

thanks!

CamilleChen - 14 Oct 2009 - 23:36

sorry: let me reorganize the condition a little bit:

if we know the total cost of path1

and we know the total cost of path 2

and path 1 is part of path 2 (say path 1 has 1 arc, and path 2 has 2 arcs, and path2's 1st arc label is the same as the path1's arc's label.)

CyrilAllauzen - 15 Oct 2009 - 16:02

Hi Camille,

This is still not very clear to me. Could you clarify further?

Best, Cyril

CamilleChen - 16 Oct 2009 - 19:18

Hi Cyril,

Now I see how to solve my question. Actually I did not know hot to indicate a path's weight when writing this path into a fst format. After I can do that, I can use the commands to do the calculation, and now, I know how to indicate a path's weight, so, the question is solved! smile

Thanks!

Camille

how to remove a weighted FSA's weights?

CamilleChen - 13 Oct 2009 - 21:50

Hi all, Suppose now I have a weighted FSA ready, and I'd like to rewrite it into another FSA by removing all the weights. Can anybody give me some suggestions? Thanks a lot!

RolandSchwarz - 14 Oct 2009 - 03:49

I don't know if there is a built-in function, but you could always try it with fstprint | cut -f 1-4 | fstcompile ... i.e. you print it, extract the fields which you want to keep and then recompile it or redirect it to a new text file.

RolandSchwarz - 14 Oct 2009 - 03:50

btw, I'm still looking for a solution to my problem below...

PaulDixon - 14 Oct 2009 - 19:00

fstmap with map_type set to rmweight can remove the arc and final weights

fstmap --map_type=rmweight

CamilleChen - 14 Oct 2009 - 23:27

both way works!! Thanks so much!!!

Performance issues?

RolandSchwarz - 09 Oct 2009 - 10:07

Hiya,

I'v been experimenting around with the openfst library lately and have the following problem: I'm trying to align sequences using a transducer (i.e. find the edit distance with specifics weights to indels and mismatches). The transducer has 4 states and some epsilon transitions to model indels (no double epsilon transitions tho). The sequences to be aligned are nucleotide sequences (alphabet of size 4) and about 1000 nucleotides long. Composition of two of these sequences with the alignment transducer (as in I o T o O, where I is the input and O the output sequence/acceptor) results in a 400mb fst file and takes approximately 3-4 minutes. Computation of the shortest path again another 2 minutes or so.

Now here's my question. It is somehow obvious, that the number of states explodes in composition (1000 * 4 * 1000 are at least 4 mill states), so maybe direct composition is not the way to go. But what can I honestly do to improve performance? I switched from the shell versions to the C library and tried lazy composition. Now composition is done in an instant, but the shortest path algorithm still seems to expand the whole transducer, as memory usage goes up to 500mb and it still takes about 4 minutes.

Is it simply such that the transducer approach to alignment is slow? Computing 200 of these pairwise alignments using traditional dynamic programming takes about 20 seconds. Or did I make a mistake somewhere? Does the shortest path algorithm neccessarily have to expand the full transducer, i.e. does lazy composition help at all in this case? I didn't even expect it to be competitive in terms of speed to dedicated tools, but I hoped it to be at least computationally feasible. I would really appreciate any kind of help.

Cheers,

Roland

CyrilAllauzen - 15 Oct 2009 - 15:51

Hi Roland,

The issue is indeed that computing the direct composition is not the way to go since it is rather expensive. Your idea of using lazy composition was good but as you noticed, ShortestPath expands the full machine by default. However, there is a way to avoid this behaviour: by using the shortest-first queue discipline and setting the option first_path to true.

ComposeFstOptions opts();
opts.gc_limit=0;  
// This disable the caching of the result of composition
// and should lower the memory usage at no computational cost since
// shortest-path should visit each state in the composed machine at most once.
ComposeFst<Arc> cfst(fst1, fst2, opt);

NaturalShortestFirstQueue<StateId, Weight> state_queue(distance);
ShortestPathOptions<Arc, NaturalShortestFirstQueue<StateId, Weight>,
AnyArcFilter<Arc> >
opts(&state_queue, AnyArcFilter<Arc>());
opts.first_path = true;
ShortestPath(cfst, ofst, &distance, opts);

This example assume composing two machines. In you case fst1 could be I o T (itself delayed or not, since it is a smaller machine both might be worth tryingboth) and fst2 could be O.

This should improve things a lot although it should still be slower than dedicated tools. I am very interested in this, so let me know how things turn out.

Best, Cyril

RolandSchwarz - 16 Oct 2009 - 05:56

Thanks alot Cyril, this is really working now. For some sequences it still takes 10 seconds or so, but memory consumption is low and most sequences are aligned within 1-2 seconds. Very nice. Does the NaturalShortestFirstQueue impose any restrictions on the type of semiring I chose? Or could I use the same strategy if I'm working on the log semiring and would be interested in the shortest distance?

All the best,

Roland

Label pushing - unexpected behavior?

JohnS - 02 Oct 2009 - 08:52

Hello, I have a couple of questions regarding label pushing in OpenFst.
It seems like a relatively common behavior of the operation Push(), in label pushing, is to generate resulting FSTs that are a lot larger than the original FST.
For fairly large FSTs, this (for me unexpected) growth in number of states and arcs, can be so significant that it makes any further processing impossible due to practical memory limitations.

I have attached the following example that produces the same behavior, using a very small FST.

Assume we have an FST A as depicted below:

By label pushing FST A (to final state), e.g. by the command
fstpush --push_labels --to_final A.fst PA.fst,
one might expect the following FST:

But instead the result FST PA.fst is generated, depicted below:

Question 1. In many cases, as is the example, there is an obvious way of performing pushing of the output labels to a final state, without adding any new states to the FST. Can you explain this growth behavior – under what circumstances it is triggered and, possibly, how it can be avoided?

When an FST that has been label pushed, is label pushed again, one might expect it to stabilize. However, Applying Push() to the already Pushed FST A generates the following output.

Question 2. Why is not the OpenFst implementation of label pushing idempotent, (e.g. like weight pushing is) so that the result of Push(A) is identical to result of Push(Push(A)) ?
Thanks!
-John

CyrilAllauzen - 15 Oct 2009 - 16:29

Hi John,

The reason for this is that label-pushing is implemented by considering the input transducer as a weighted automaton over the string semiring and applying weight-pushing (over the string semiring). The result is then the following weighted automaton over the string semiring:

The issue is how to convert that weighted automaton over the string into a transducer. Currently, we use the second step of the epsilon-normalization algorithm to do this. This is also what the cause the operation to be non idempotent.

One way to partially avoid the problem, would be to first reverse the transducer, apply label-pushing towards the initial state and reverse again:

fstreverse A.fst | fstpush --push_labels | fstreverse > PA.fst

This would result in:

The extra epsilon transitions are due to the use of reverse.

Best, Cyril

JohnS - 23 Oct 2009 - 05:48

Hi Cyril,

Thank you for taking your time to answer this question so thoroughly. I assume there exists, at least theoretically, a generic pushing algorithm to accomplish idempotent and “non-growing” label pushing. Do you agree on this? And if so, would you endorse such an algorithm to be included in any future version of OpenFST?

Best, John

Problem with Determinize and Minimize

VipulMittal - 01 Oct 2009 - 02:17

Hi, I want to build an FST for the words of a language where if the word is accepted, the output will be the features of the word. For this I'm using two FSTs - one for the prefix (root) part and other for the suffix - and I concatenate them to get the fina FST. Both the input files are in AT&T format.

But the problem is when I determinize it, it gives an error that the FST is non-functional i.e. for a particulat input there are more than one possible outputs, which will be there in this case.

So can someone please help me with this ??

Thank you, Vipul

DoganCan - 01 Oct 2009 - 19:55

Current version of OpenFst library does not support determinization of non-functional transducers. You can take the "encode -> determinize -> decode" path to determinize your final fst over (input label,output label) pairs. The result will be non-deterministic over input labels but may satisfy your current needs. Check the answer to the "det(L o G) becomes more complex than L o G" question below for a pointer to the encode/decode mechanism.

Best

Dogan Can

Constrains for the maximum size of the FSTs?

OanaNicolov - 30 Sep 2009 - 19:53

Hi,

Could you please tell me an estimate of the maximum size (in terms of #states and/or #transitions) of the FSTs, the openfst toolkit can work with (let's say on a machine with 8G of memory)? Alternatively, what was the size of your largest FST you worked with?

Thank you. Oana

DoganCan - 01 Oct 2009 - 20:14

Hi Oana,

OpenFst is a very efficient library which scales well to large problems. The answer to your question depends on the properties of the fsts and what you would like to do with them. For instance, you may run the shortest path and pruning algorithms over very large fsts (several millions of states and transitions) while you may not be able to determinize them or compose them with other large machines. What you can do also depends on whether your input fsts are acyclic, epsilon free etc. Using lazy implementations may help if your particular task does nor require complete fsts to be loaded onto the memory upfront. In short, (in my opinion) your best bet is to try and see.

Best,

Dogan

CyrilAllauzen - 15 Oct 2009 - 14:58

Hi Oana,

There are two main concrete class in the library ConstFst and VectorFst.

For ConstFst, the memory requirement is 20 bytes per state. and 16 bytes per arc. For VectorFst, the per-state memory requirement is about twice as much due to the overhead introduced by STL data structures.

Doing the math, you will see that you can easily represent machines with hundred of millions of states and transitions in a few GB of RAM.

If memory usage is critical, the CompactFst class can also be used to provide more efficient memory representations.

This addresses the memory requirement for representing an FST in memory. But, as Dogan pointed out, your memory usage will also depends on which algorithms you want to use.

Best, Cyril

RmEpsilon, AutoQueue and Queue Disciplines

NewestUser - 11 Sep 2009 - 11:51

Hi, I would like to know how 'reliable' the AutoQueue associated with RmEpsilon is; that is, how effective is this at selecting the optimal queue discipline given a particular input fst.

Also, are there any examples of employing specific queue disciplines with RmEpsilon? I've not been able to get this to work in practice.

NewestUser - 11 Sep 2009 - 23:59

Nevermind the second part of the above question, I was looking in the wrong place.

a stand in symbol for any symbol; printing out the strings for FSA

AnasElghafari - 28 Aug 2009 - 12:32

Hi there,

Thank you for this great library; I am using it for my B.A. thesis project in C.L. I am using it through the shell from within my Java code.

I have a bit of difficulty with a few things. This is one of them:

- Is there a compact way to specify "any other symbol". As in: in state 0 transduce "a" to "b" and transduce any other symbol to "c"?

The problem specifically is that I am creating acceptors where the symbols are words rather than letters, in some cases I might have 1000 words in my isymbols list. I want to build a rational kernel (transducers to count sequences) to work with these acceptors. At some states I need to be able to map every non-explicitly-mentioned input symbol to epsilon with weight 1. How can I specify that? Do I really need to add a thousand arcs for each state?

AnasElghafari - 28 Aug 2009 - 12:34

The second question (I included it in the title but forgot to include it in the body): is there a way to obtain/print out all the string accepted by some FSA ? (in the case where FSA accepts a finite set, of course)

CyrilAllauzen - 01 Sep 2009 - 11:46

For your first question, you can use RhoMatcher. See the advanced usage section.

For your second question, you will need to write your own function for this but it should be rather easy. Remember that the number of strings accepted by an acyclic automaton is in the worst case exponential in the size of the automaton.

Best, Cyril

det(L o G) becomes more complex than L o G

ZhijianOu - 25 Aug 2009 - 03:45

The experiment setup is as follows.

A Chinese lexicon L:
# of states                                       215371
# of arcs                                         303286

A Chinese bigram G:
# of states                                        87918
# of arcs                                       16822088

Run fstcompose to get
L o G 
# of states                                       391202
# of arcs                                       17125372

After determinization by running fstdeterminize,
det(L o G)
# of states                                      3479987
# of arcs                                       20071667
more complex !

I don't know why. I follows exactly the paper "Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, 2008". The output word label in L is already placed on the initial transitions of the word.

The code is as follows:

fstarcsort.exe --sort_type=olabel  L.fst L.fst.sort
fstarcsort.exe --sort_type=ilabel  G.fst G.fst.sort
fstcompose.exe L.fst.sort G.fst.sort LG.fst
fstdeterminize.exe LG.fstb LG.fst.det
fstminimize.exe LG.fst.det LG.fst.min
Memory allocation failed

I run 64-bit openFST on windows 2003 server 64bit edition. The fstminimize.exe returns "Memory allocation failed" is a surprise. 64bit memory space is not sufficient?

Any suggestion that you could offer would be much appreciated.

Best, Zhijian

CyrilAllauzen - 30 Sep 2009 - 12:43

I'm not sure what the issue is. You might want to try minimizing the transducer as an acceptor: use fstencode to encode the labels, apply minimization and use fstencode to decode (see FST.EncodeDecodeDoc for more details).

Best, Cyril

How to deal with < / s > label in constructing ngram model

ZhijianOu - 23 Aug 2009 - 23:08

Dear all,

As mentioned in the paper "Allauzen, Mohri and Roark, Generalized algorithms for constructing statistical language models, ACL 2003." , in constructing a ngram model, there are transitions labeled with the end symbol < / s >, that lead to the final state.

To compose L with G, I need to add a psuedo word as follows to the L.

r eh d #0 read
r eh d #1 read
...
#0 </s>

Am I correct?

(btw: if I replace the < / s > in G with < eps >, the G is not deterministic. This is not a good way. So I propose the above way.)

Thanks for your patience. Your help is greately appreciated.

Best, Zhijian

CyrilAllauzen - 01 Sep 2009 - 12:09

As far as I remember, we were using </s> in G and adding </s> as a word with empty pronunciation (or silence) in L. The GRM library also allowed replace </s> in G by a final weight.

Best, Cyril

Switching weight pushing in the log semiring and in the tropical semiring

ZhijianOu - 23 Aug 2009 - 22:51

dear all,

I built a transducer of a trigram, following the paper "Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, 2008"

As we note, after we introduce the backoff state, the G is only an approximate representation, if there are such seen trigram that have lower probability than its backed-off bigram, i.e. p(z|x,y) < alph(x,y)*p(z|y). To my viewpoints, there are two solutions.

1) One is to splitting states before replacing failure-transition by eps-transition, as proposed in "Allauzen, Mohri and Roark, Generalized algorithms for constructing statistical language models, ACL 2003."

2) The other, as I think, is to create a backoff model from an interpolated algorithm (e.g. modified kn smoothing). Then we have p(z|x,y) > alph(x,y)*p(z|y) for all trigrams. And if we further work in the tropical semiring, the representation will be exact.

So I chose to construct G in the tropical semiring. Now the question comes. As we know, weight pushing in the tropical semiring is inferior to pusing in the log semiring. I'm not sure whether the following procedure is correct, and if it's correct, how to do it? The semiring info is embeded in the arc type. If I use shell command, how can I change the arc type?

1) Switch from representing G (or L o G) in the tropical semiring to representing in the log semiring

2) run fstminimize to get min(L o G)

3) Swtich back to representing min(L o G) in the tropical semiring.

Thanks for your patience. Your help is greately appreciated.

Best, Zhijian

DoganCan - 24 Aug 2009 - 21:08

Current shell-level map command (fstmap) does not support mapping between std and log arcs. A simple workaround is to print the fst and compile the text representation with the appropriate arc type. On the other hand, C++ library includes mappers for conversion between std and log arcs. Following code might help:

VectorFst<StdArc> ifst;
VectorFst<StdArc> ofst;
VectorFst<LogArc> lfst;

Map(ifst,&lfst,StdToLogMapper());
Minimize(&lfst);
Map(lfst,&ofst,LogToStdMapper());

Best, Dogan

ZhijianOu - 25 Aug 2009 - 03:35

Hi, thanks to Dogan.

Sorry for the above lengthy question. To summarize, the main question is: when both lexicon L and grammar G are constructed in tropical semiring. Compared with directly run fstminimize.exe (i.e. in tropical semiring), is it better to switch to log semiring, run fstminimize.exe, and then switch back ?

Best, Zhijian

Compiling openfst-1.1 on cygwin

HenrikKinnemann - 11 Aug 2009 - 03:28

Hi,

is there anybody who has compiled openfst-1.1 (latest version) on Cygwin successfully? Which "3rd-party" libs need to be available on Cygwin in advance?

Best regards, Henrik

PaulDixon - 11 Aug 2009 - 04:08

I have compiled it in cygwin with the gcc-4 package and without any 3rd party libraries ./configure CXX=g++-4.exe CC=gcc-4.exe make install make

build 64bit openFST

ZhijianOu - 06 Aug 2009 - 10:19

hi,

I tried to run fstcompile.exe on a transducer with 23M states and 41M arcs, from AT&T FSM format text file. It terminates, saying that "Memory allocation failed". The 2G address space is not sufficent for compiling such a large transducer. Does the current openFST code support 64bit compile? Could you give me some instructions to build a 64bit openFST?

It is very appreciated that you could favor me with a reply. Thank you.

Zhijian

PaulDixon - 06 Aug 2009 - 23:10

Are you running the Windows version? If so, I can try add a 64bit profile to the Visual Studio solution

ZhijianOu - 14 Aug 2009 - 00:03

Hi, PaulDixon, I have installed Windows server 2003 Enterprise x64 edition.

It's very kind of you if you can add a 64bit profile to the Visual Studio solution.

Best regards, Zhijian

PaulDixon - 17 Aug 2009 - 08:47

Hi I added an updated solution to the contrib section. To build 64bit select the Release64 x64 in your build toolbar. If you are running VS express edition you might also need the platform SDK to get the 64 bit compiler.

ZhijianOu - 23 Aug 2009 - 22:53

PaulDixon, It works! Thank you very much. Best, Zhijian

valgrind complaints with Replace on openfst-1.1

GonzaloIglesias - 02 Aug 2009 - 13:36

Hi again, I have discovered that valgrind starts complaining with the openfst library with many operations, but only after a replace operation is performed. This didn't happen if compiled using openfst beta. I'm using valgrind 3.2.3. Here goes a small example:



    //Building 3 trivial fsts for fst replacement                                                                                                            
    VectorFst<StdArc> fst1,fst2,fst3;
    fst1.AddState();
    fst1.AddState();
    fst1.AddArc(0, StdArc(0, 123456789, StdArc::Weight::One(), 1));
    fst1.SetFinal(1,StdArc::Weight::One());
    fst1.SetStart(0);
    fst2.AddState();
    fst2.AddState();
    fst2.AddArc(0, StdArc(0, 1234567890, StdArc::Weight::One(), 1));
    fst2.SetFinal(1,StdArc::Weight::One());
    fst2.SetStart(0);
    fst3.AddState();
    fst3.AddState();
    fst3.AddState();
    fst3.SetFinal(2,StdArc::Weight::One());
    fst3.SetStart(0);
    fst3.AddArc(0,StdArc(0, 1000, StdArc::Weight::One(), 1));
    fst3.AddArc(1,StdArc(0, 2000, StdArc::Weight::One(), 2));
    vector< pair< StdArc::Label, const Fst<StdArc> * > > pairlabelfsts;
    pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(1000,&fst1) );
    pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(2000,&fst2) );
    pairlabelfsts.push_back(pair< StdArc::Label, const Fst<StdArc> * >(3000,&fst3) );
    VectorFst<StdArc> rulefst;
    StdArc::Label slabel=3000;

#ifndef USEOPENFSTBETA
    rulefst=ReplaceFst<StdArc>(pairlabelfsts,ReplaceFstOptions<StdArc>(slabel,false));
#else
    rulefst=ReplaceFst<StdArc>(pairlabelfsts,ReplaceFstOptions(slabel,false));
#endif
    RmEpsilon(&rulefst); //valgrind complains here
    rulefst.Write(some.fst); // and valgrind complains here too
    return 0;
}

(I may submit a link to a full example code)

Cheers!

CyrilAllauzen - 11 Aug 2009 - 10:09

Hi Gonzalo,

Thanks for pointing this out! I'll look more deeply into this.

Best, Cyril

very slow delayed union on openfst 1.1

GonzaloIglesias - 02 Aug 2009 - 11:45

Hi,

Thanks for such a splendid tool as openfst! I'm a very happy user of the beta version for more than a year now.

I'm writing here because I have encountered a specific problem with the delayed union operation using openfst-1.1. library. Basically, I perform a delayed union of many fsts. Afterwards, I update by reinstancing to a VectorFst . I have isolated a particular case in which the version compiled with openfst-beta works things out very quickly (less than one minute), whilst it takes 40 minutes on openfst 1.1 to create the same fst. I have a small but full example including the necessary fsts to read from (~6MB) and perform union with. I don't understand this behaviour and it seems to me something could be wrong within version 1.1

Perhaps this is a known issue? May I post a link to a tar file containing the example? Please let me know how should I proceed.

Thanks in advance!

CyrilAllauzen - 11 Aug 2009 - 10:08

Hi Gonzalo,

I've started to look into this and I think I found the reason for this inefficiency. We'll fix this for the next release.

Thanks a lot for your feedback!

Best, Cyril

CyrilAllauzen - 13 Aug 2009 - 16:11

Update: there is actually no efficiency issue for delayed union. The issue is that Gonzalo was using a sub-optimal approach. The most efficient way of performing a large number of delayed union is to use Union(RationalFst<A> *fst1, const Fst<A> &fst2) as follows:

UnionFst<A> *union = new UnionFst<A>(fsts[0], fsts[1]);
for (int i = 2; i < fsts.size(); ++i)
  Union(union, fsts[i]);

This takes advantage of the fact that UnionFst derives from RationalFst and that delayed rational operations such as union can be applied destructively to a RationalFst.

Cyril

Transforming a collection of strings

MadhuTherani - 29 Jul 2009 - 14:35

I am trying to transduce a set of strings (defined by a Regex) into one final string..I have an FST defining the Regex and another FST defining the final string.. I need the second FST to match to any output label of the first FST in the composition fo the two FSTs..Can anyone suggest how to go about it..? Sigma matcher in the composition.? thanks

madhu

DoganCan - 30 Jul 2009 - 15:47

Depending on what you mean by "I need the second FST to match to any output label of the first FST in the composition of the two FSTs", there might be many different solutions to your problem.

Say you have the string "x y z" in the set defined by the regex and the final string is "a b".

If you want the transduction "x y z" -> "a b a b a b", one possible solution is to create a simple transducer T in the form:

0 1 X a
1 2 <eps> b
2 0 <eps> <eps>
2

where X is any symbol in your input alphabet (You need to add a transition in the form "0 1 X a" for each X in your alphabet) and compose the regex with T.

If you want the transduction "x y z" -> "a b" instead, then one possible solution is to create a transducer T in the form:

0 1 X <eps>
1 0 <eps> <eps>
1 2 <eps> a
2 3 <eps> b
3

where X is any symbol in your input alphabet and compose the regex with T.

You might achieve the same behaviour using sigma matcher in composition (At least I believe it is possible).

You might even try (for "x y z" -> "a b"):

replace output labels of the regex with <eps> labels
replace input labels of the final string with <eps> labels
concat two transducers
syncronize
rmepsilon

Best, Dogan

AT&T FsmLib vs openFST for exe ussage

ZhijianOu - 29 Jul 2009 - 11:20

hi,

1) How to easily get the command-line help in openFST's shell-level exe? You know, in FsmLib, an argument "-?" will print help information.

2) If exe is sufficient for my application, which one do you recommend to use? In particular, I have a special emphasis on the performance of acceptor optimization operations, e.g., determinize, minimize, remove-epsilon. As we see, minimizing a transducer is not supported at the command level.

Thanks in advance.

Best regards, ZhijianOu

PaulDixon - 29 Jul 2009 - 23:41

1) fstcommand -help

What semiring is the default in fstminimize, fstdeterminize ?

ZhijianOu - 29 Jul 2009 - 11:07

hi,

I'd like you to point out whether my following understandings are correct.

1) If the input is an acceptor, the implementation of fstminimize is first do weight pushing, and then do (unweighted) acceptor minimization.

2) If 1) is correct, then using different semirings will have different weight pushing results, and different minimization results.

3) In the description of fstminimize in openFST (or fsmminimize in fsmLib), there is no mention of the default semiring used. What semiring is the default?

4) For a deterministic transducer, P(I, x, y, F) contains only one path, using the notation in the paper - Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. In Handbook on Speech Processing and Speech Communication, Springer-Verlag, 2008.

5) fstdeterminize depends on the semiring ?

6) If we create binary fst from text file using fstcompile or fsmcompile, where to specify the semiring ?

It is very appreciated that you could favor me with a reply. Thank you.

Best regards, ZhijianOu

PaulDixon - 29 Jul 2009 - 20:42

6) For fstcompile use the arc_type switch fstcompile --arc_type=log test.txt > test.fst

For fsmcompile use the -s switch fsmcompile -t -s log test.txt > test.fst

DoganCan - 30 Jul 2009 - 16:22

Correct
Correct
No default semiring. Operation is performed in the semiring given by the arc type of the input fst.
Correct
Yes. Say there are distinct paths with the same labels in the input fst.
- Tropical semiring: minimum of the path weights is assigned to the output path.
- Log semiring: sum of the path weights is assigned to the output path.

ZhijianOu - 30 Jul 2009 - 22:04

Thank PaulDixon and DoganCan.

The -s switch for fsmcompile is to give the states textual names.

The output from "fsmcompile -?" is
FSM Version 3.7
Copyright (C) 1998 AT&T Corp. All rights reserved.
Usage: fsmcompile [-opts] [fsm]
Options:
 -i symbols       input arc symbols
 -o symbols       output arc symbols
 -s symbols       state symbols
 -t               transducer
 -V file          input potentials file
 -F file          output FSM file
 -?               info/options

Continue with my question 3. 
In openFST, we can use --arc_type to specify the semiring. But we can run fstcompile without --arc_type, such as
fstcompile --isymbols=isyms.txt --osymbols=osyms.txt text.fst binary.fst
The default --arc_type is "standard", which means tropical semiring. Am I right?

But in fsmLib, if we run fsmcompile as follow:
fsmcompile -t -i i.syms -o o.syms < T.txt > T.fst
It is not clear what semiring is used.

Thanks for your attention.

Best, ZhijianOu

PaulDixon - 30 Jul 2009 - 23:06

It seems the semiring option was added to fsmtoolkit 4.0.

Yes, standard means tropical semiring (Explained in previous post "Why_is_TropicalWeight_considered_standard?")

problem with compiling openfst on Windows with Visual Studio 2008

ZhijianOu - 27 Jul 2009 - 10:49

hi,

As said in the openfstwin-1.1.zip's readme.txt, I open the solution file openfstwin.sln in Visual Studio 2008 and compile. But there are errors as follows:

1>Compiling... 1>fst.cc 1>d:\openfstwin-1.1\openfst-1.1\src\include\fst\unordered_set(19) : fatal error C1083: Cannot open include file: 'unordered_set': No such file or directory
1>properties.cc
1>d:\openfstwin-1.1\openfst-1.1\src\include\fst\unordered_map(19) : fatal error C1083: Cannot open include file: 'unordered_map': No such file or directory
1>Generating Code...
1>Build log was saved at "file://d:\openfstwin-1.1\openfstwin\openfstwinlib\Debug\BuildLog.htm"
1>openfstwinlib - 2 error(s), 0 warning(s)

It is very appreciated that you could favor me with a reply. Thank you.

Best regards, ZhijianOu

CyrilAllauzen - 27 Jul 2009 - 12:31

Hi Zhijian,

First, I'll make sure that you are using Visual Studio 2008 Service Pack 1 as mentioned here.

This looks like a missing TR1 issue. You might not have a working version of TR1 currently installed. As far as I know (I have never used VS2008), an implementation TR1 is included with SP1.

Best, Cyril

ZhijianOu - 29 Jul 2009 - 11:03

Hi, Cyril,

It works. Thank you. After installing VS2008 Service Pack1, I compiled openfstwin-1.1 successfully, though still with warnings, either warning C4396 or warning C4800. For example,

d:\openfstwin-1.1\openfst-1.1\src\include\fst/pair-weight.h(195) : warning C4396: 'fst::operator >>' : the inline specifier cannot be used when a friend declaration refers to a specialization of a function template
d:\openfstwin-1.1\openfst-1.1\src\include\fst/matcher.h(317) : warning C4800: 'uint64' : forcing value to bool 'true' or 'false' (performance warning)

Best regards, ZhijianOu

PaulDixon - 29 Jul 2009 - 20:40

changing fst.Properties(kAcceptor, true) to fst.Properties(kAcceptor, true)!=0 Should stop the C4800 warning.

If you think any of the warnings are bogus you can the warning number to the pragma statement at the top of config.h

HenrikKinnemann - 25 Aug 2009 - 05:02

Hello Paul,

my build of OpenFST-1.1 with Visual Studio 2008 SP1 was successful.

Now I would like to compile/link an example test application which is shipped together with OpenFST in file 'openfst-1.1\src\test\fst_test.cc'.

Visual Studio reports the following compile errors for fst_tes.cc:

1>------ Rebuild All started: Project: TestFst, Configuration: Debug Win32 ------ 1>Deleting intermediate and output files for project 'TestFst', configuration 'Debug|Win32' 1>Compiling... 1>stdafx.cpp 1>Compiling... 1>TestFst.cpp 1>d:\usr\local\include\fst\test-properties.h(24) : error C3083: 'tr1': the symbol to the left of a '::' must be a type 1>d:\usr\local\include\fst\test-properties.h(24) : error C2039: 'unordered_set' : is not a member of 'std' 1>d:\usr\local\include\fst\test-properties.h(24) : error C2873: 'unordered_set' : symbol cannot be used in a using-declaration 1>TestFst - 3 error(s), 1 warning(s)

What do I have to do in order to use 'unordered_set'?

Thanks in advance, Henrik

RolandZimbel - 01 Feb 2010 - 02:31

Hi Paul, I had a problem to compile OpenFst with VS2008 SP1: In fst/compat.h a WIN32 macro is used in the beginning, but it is defined later on by including fst/config.h. The compilation works when I include fst/config.h before using WIN32. As the windows compiler defines the macro _WIN32 I would propose to use this macro to get code that works on all platforms.

OpenFst and AT&T FSM support UTF8 or Unicode Encoding?

AbbasMalik - 24 Jul 2009 - 15:23

Dear All I have question regarding OpenFst and AT&T FSM.

Do OpenFst and AT&T FSM support UTF8 or Unicode Encoding?

Thank you in advance,

Best

CyrilAllauzen - 24 Jul 2009 - 16:45

Hi Abbas,

This question was discussed previously in the forum: link.

Let me know if that answer your questions.

Best, Cyril

Pattern Matching using Mohri 1997 algorithm

PhilSours - 16 Jul 2009 - 16:58

Hi,

I'm interested in pattern matching, i.e. finding occurrences of a set of strings within a text.

M. Mohri [1997] has described an algorithm to generate an efficient pattern search automaton based on failure functions in this paper:

Mohri, Mehryar. String-Matching with Automata. Nordic Journal of Computing, 4(2):217-231, Summer 1997. http://www.cs.nyu.edu/~mohri/postscript/njc.ps

This method has also been referred to by W. Skut [FSMNLP 2008].

Is there anyone who has submitted (or could submit) code implementing this algorithm that would be free for commercial use?

Thank you, Phil Sours

NewestUser - 18 Jul 2009 - 11:11

it's not all that clear what you are aiming to do, but there are quite a few generic string matching, or regular expression matching solutions out there. the following link is to a great tutorial on the subject, which presents a well known solution which is quite easy to implement using the openfst library,

http://swtch.com/~rsc/regexp/regexp1.html

although it is very simple to implement, this algorithm generates transducers which are, in some cases, not terribly well-suited to further optimization or post-processing, especially epsilon removal. the following paper provides an overview of some more efficient ( but more complex ) alternatives which may be better suited to general or more complex inputs,

www.cs.nyu.edu/~mohri/postscript/glush.pdf

dunno if that is what you are looking for, and perhaps it's overkill, but depending on what you mean by 'match strings in a text' it seems like openfst itself might be overkill.

CyrilAllauzen - 20 Jul 2009 - 15:09

Hi Phil,

A version of this algorithm is implemented in the GRM library. The binaries are available for free for non-commercial use, no source code however.

Cyril

Pruning without changing state ID's.

DoganCan - 12 Jul 2009 - 13:32

Hi,

I need to prune an fst while keeping the state ID's unchanged. I need this behaviour since I keep auxiliary information (i.e potentials) about the states of the input fst. Is there a convenient way of doing this?

Thanks, Dogan Can

NewestUser - 12 Jul 2009 - 21:08

can you use the 'keep_state_numbering' flag associated with fstcompile? i have not tested it in this particular situation, but it seems like it might suffice.

DoganCan - 13 Jul 2009 - 16:55

'keep_state_numbering' flag only works when compiling a binary fst from its text representation. it does not help when an fst operation like pruning is applied on that same fst.

CyrilAllauzen - 20 Jul 2009 - 15:19

Unfortunately, there is no way to do that in the current version of the library. A workaround would be "encode" the origin state of each transition in its labels.

Cyril

User-defined Flags

DoganCan - 12 Jul 2009 - 07:01

Hi,

I noticed the following comment in the advanced usage section.

In a user-defined binary, the command line options processing will all also work if the user calls:

SetFlags(usage, &argc, &argv, true);

In that case, the user can set his own flags as well, following the conventions in <fst/flags.h>.

My problem is that when I add a new flag to a custom fst binary, I need to put its definition along with other flag definitions in <src/bin/main.cc> or in <src/lib/flags.cc> and rebuild the library, otherwise SetFlags method does not recognize it.

I have the generic OpenFst binary setup:

fstbinary.cc -- includes binary-main.h, calls SetFlags and Arc templated BinaryMain function

binary-main.h -- includes <fst/main.h>, declares a new flag using DECLARE_type(newflag) and defines BinaryMain function

Defining the new flag using DEFINE_type(newflag,default,help_message) in fstbinary.cc or binary-main.h allows me to use the newflag with its default value, however i cannot pass it from the command line.

Is it possible to to define and use new flags in a user-defined binary without modifying the library source code? If so, could you provide an example?

Many thanks, Dogan Can

PaulDixon - 13 Jul 2009 - 06:36

I've managed to add and use flags without recompiling everything. No 100% sure if this does what you want as it is very similar to what you already tried, but here are minimalistic header and cpp files which allow the value of myflag to be changed. Compiled and tested on a standard OpenFst 1.1 install under Linux using g++ fsttest.cpp -lfstmain -lfst -ldl -o fsttest

#include "test-main.h"

namespace fst 
{
        REGISTER_FST_MAIN(TestMain, StdArc);    
        REGISTER_FST_MAIN(TestMain, LogArc);
}  

DEFINE_double(myflag, 1234, "My Flag");
using fst::CallFstMain;
int main(int argc, char **argv) 
{
  string usage = "Test flags \n\n  Usage: ";
  usage += argv[0];
  usage += "  Flags: myflag\n";
  std::set_new_handler(FailedNewHandler);
  SetFlags(usage.c_str(), &argc, &argv, true);
  if (argc !=1)
  {
    ShowUsage();
    return 1;
  }
  return CallFstMain("TestMain", argc, argv, FLAGS_arc_type);
}

#ifndef TEST_MAIN_H__
#define TEST_MAIN_H__

#include <fst/main.h>
DECLARE_double(myflag);
DECLARE_string(arc_type);

namespace fst 
{

   template <class Arc>
   int TestMain(int argc, char **argv, istream&, const FstReadOptions&)
   {
     
      cerr << "myflag=" << FLAGS_myflag << endl;
        return 0;
   }
}  
#endif

DoganCan - 13 Jul 2009 - 17:32

Hi Paul,

This is exactly what I tried before. Seems like there is a problem with my build settings. Thanks for your help.

DoganCan - 13 Jul 2009 - 19:21

Hi,

After going through my build settings, I discovered that GCC_SYMBOLS_PRIVATE_EXTERN compiler flag was set to YES in default Release settings of XCode, and changing it to NO solved the problem. This flag sets symbol visibility and I suppose it makes flag definitions invisible to the flagregister.

auxiliary symbols and a speech recognition cascade

NewestUser - 10 Jul 2009 - 04:35

Hi, I'm trying to understand the proper usage of auxiliary symbols in the construction of a recognition cascade using.

unfortunately, although I've now read through,

"Speech Recognition with Weighted Finite State Transducers",
"A Generalized Construction of Integrated Speech Recognition Transducers",
"Generalized Optimization Algorithm for Speech Recognition Transducers", and
"Integrated Context-Dependent Networks in Very Large Vocabulary Speech Recognition",

this one point still eludes me, and does not seem to be described or illustrated in any of these papers ( or any others that I was able to find ). Almost every other aspect of the transducer construction and cascade composition seems to be very clearly explained and usually illustrated with simple examples, but the trick with auxiliary phones and composition continues to elude me.

At present my approach consists of the following steps,

construct G, a grammar transducer, which in my case happens to be determinizable. thus i perform no augmentation.
construct L, the closure of a lexicon transducer derived from a pronunciation dictionary.
- i automatically supplement this transducer with auxiliary symbols/phones as necessary, as described in the papers mentioned above
- e.g., red r eh d #1, read r eh d #2, etc.
construct C, an inverted, context-dependent, deterministic triphone transducer.
- at present i am generating this transducer from a phoneme list which also includes the auxiliary symbols/phones so as to guarantee proper composition. i generate all possible logical triphones and rely on the composition process to weed out the superfluous combinations.
perform the composition and optimization of the cascade ( i am not dealing with the H-level at present )
- ```
 min( det( *C* o det( *L* o *G* ) ) ) 
```
replace all auxiliary triphones with the empty string, thus
```
 eh+d-#1 
```
will be replaced with '-'

there are clearly a couple of problems with the last stage of my process. first, each auxiliary symbol at the context-independent level is being mapped to multiple different different auxiliary symbols at the context-dependent level. nevertheless the literature implies that this should not in fact be the case,

For further determinizations at the context-dependent phone level and distribution level, each auxiliary phone must be mapped to a distinct context-dependent phone. Thus, self-loops are added at each state of C mapping each auxiliary phone to a new auxiliary context-dependent phone. The augmented context-dependency transducer is denoted by C.

so i suppose that the mapping should be one-to-one, rather that one-to-many, which is what my current approach yields.

another clearly negative consequence of my current botched approach is that the final auxiliary-symbol removal step results in lost context wherever such auxiliary symbols appear. for example, the following transducer

0 1 SIL+r-eh red
1 2 r+eh-d -
2 3 eh+d-#2 -
3 4 d+#2-b -
4 5 #2+b-ah ball
5 6 b+ah-l -
...
...

defaults to,

0 1 SIL+r-eh red
1 2 r+eh-d -
2 3 - -
3 4 - -
4 5 - ball
5 6 b+ah-l -
...
...

which, although i can actually use it for recognition, is obviously wrong. so finally my question: are there any other papers which provide either diagrams of an augmented component context-dependent triphone transducer, or a more detailed description of the I(M) operation described in "Generalized Optimization...", For any transducer M, we also denote by I(M) the transducer augmented with extra transitions such that each new symbol on its input side is mapped to some new and distinct output symbol.

try as I might still don't quite get this. sorry if this seems a foolish question but im teaching this to myself and dont really know where else i might ask.

NewestUser - 10 Jul 2009 - 04:42

another, perhaps more obvious way of asking this might be, what exactly is a self-loop in the context of this discussion, and what does it look like in practice?

NewestUser - 11 Jul 2009 - 10:34

I believe I figured out a more appropriate solution, which seems to work, but perhaps Im still missing something,

The most trivial, canonical example of a deterministic context-dependent triphone transducer.

The same example, inverted, and augmented with 2 auxiliary phones, #0, and #1

Assuming we have an example grammar transducer of the following form,

0 1 a testa
1 2 b -
2 3 #0 -
3 4 a testb
4 5 b -
5 6 #1 -
6

Then composition with the CD transducer yields,

where the auxiliary triphone symbols can now be safely replaced with epsilons, or short-pauses, or removed without any loss of information.

CyrilAllauzen - 14 Jul 2009 - 18:33

This is indeed the correct construction. In this context, a self-loop is a transition whose destination state is the origin state.

NewestUser - 14 Jul 2009 - 23:25

thanks for the confirmation! i think there may also be at least one more valid construction, this involves augmenting the monophone symbols as needed with auxiliary symbols, then treating these as normal monophones during construction of the context-dependency transducer.

so, in contrast to the above examples where we have a list of monophones which includes auxiliary phones, 'a, b, #0, #1', which are all treated the same. and further in contrast to the self-loop approach, where we have two separate classes of symbols, monophones: 'a, b', and aux. symbols: '#0, #1' we instead of a selective combination depending on the structure of the lexicon transducer, e.g., monophones: 'a#0, b, a#1' which are all treated as normal monophones. this approach has an advantage in that there is no need to treat anything specially, and at the C-level ( assuming we've no need to worry about anything further down in the cascade ), there is no loss of information. instead of 'eps-a+b, #0c:, a-b+a' we get 'epse-a#0-b, a#0-b+a#1'. and instead of epsilon replacement, we have a simple symbol replacement.

this actually seems like a reasonable solution so long as, a. the degree of homophony is small, and b. there is no need to interact with lower levels in the cascade. if either a. or b. do not hold this alternative rapidly becomes intractable due to the increase in monophones and the increase in the size of the resulting C-level transducer.

CyrilAllauzen - 20 Jul 2009 - 15:16

This alternative approach would also work indeed.

However, I want to point out that the first approach does not really require to physically add the self-loops and can be achieved without increasing the size of C by using custom matchers in composition.

Converting a "linearized transducer" into a true transducer FST

KennethRBeesley - 09 Jul 2009 - 12:17

Assume that you have an OpenFst Acceptor in which each string of the encoded Language has an even number of symbols, e.g.

a b c d e f

in which each even-numbered symbol (counting from 0) represents an Input symbol and each odd-numbered symbol represents an Output symbol. (This is the "linearized transducer".)

Question: Is there an easy way to turn this Acceptor (a kind of linearized transducer) into a real Transducer where each path like the one above is converted into a path of half the length, with these labels:

a:b c:d e:f

CyrilAllauzen - 20 Jul 2009 - 15:04

I would do it in two steps:

1) Use composition to turn a b c d e f into a:eps eps:b c:eps eps:d. This can easily be done using two two-state transducers T_1 and T_2 and computing T_1 o A o T_2. T_2 is the transducer erasing any symbol in odd position: for all a in the alphabet,

0 1 a eps
1 0 a a
1

T_1 is the inverse of the reverse of T_2

2) Apply Synchronize() followed by RmEpsilon().

Cyril

KennethRBeesley - 17 Aug 2010 - 12:27

Thanks for this solution. In the general case, for an arbitrary Fst, is there a way to test if Synchronize() will terminate?

A few small bugs

DoganCan - 08 Jul 2009 - 09:08

Hi,

I noticed two small bugs/problems in the library and here are my fixes for them.

1. fstcompile does not give any error message when no input file is specified.

Solution: change line 00218 in <compile-main.h>

< if (!istrm) {

> if (!*istrm) {

2. <lexicographic-weight.h> defines pair weight type using "_<_" symbol between component weight types. However "<" symbol creates problems say when you want to register LexicographicArc for fst binaries since you can neither supply type1_<_type2 as an arc type nor create a dynamic shared object named type1_<_type2-arc.so.

Solution: change line 00075 in <lexicographic-weight.h>

< static const string type = W1::Type() + "_<_" + W2::Type();

> static const string type = W1::Type() + "_L_" + W2::Type();

Best, Dogan Can

CyrilAllauzen - 09 Jul 2009 - 11:00

Thanks a lot. We will fix this in the next release.

Best, Cyril

OpenFST vs. BOOST Graph Library for operations on language models

HenrikKinnemann - 18 Jun 2009 - 08:01

Hello,

as a software developer of an OCR team I'm looking for a library which can

(a) represent certain kind of language models (e.g., OCR error correction rules, n-grams, word dictionaries, morphology and grammar rules for German language) in a unique way, for example, as weighted finite automata

and

(b) has some high-level operations already implemented to process the language models, e.g. functions like 'compose', 'minimize' and 'shortest-path-search'.

The BOOST Graph Library (http://www.boost.org/doc/libs/1_39_0/libs/graph/doc/index.html) does also have weighted finite automata which could be used to represent language models and BOOST Graph offers lot's of functions on them.

My question: Where are the supposed differences between BOOST Graph and OpenFST in relation to my OCR application, and, what are the strenghts and weaknesses of both of them?

Best regards, Henrik

CyrilAllauzen - 19 Jun 2009 - 12:09

Hi Henrik,

The Boost Graph Library focuses on directed and undirected graphs. As far as we know, it does not provide specific operations for weighted automata and transducers such as determinization, minimization or composition. You can probably use the BGL to represent language models, but you'll have to re-implement these algorithms. If you know otherwise, please point us to the relevant documents as there is no mention of automata-specific algorithms at the URL you provided.

The OpenFst library was especially designed for handling weighted automata and transducers. It provides a large selection of algorithms (including minimization and composition), some of them being implemented as on-the-fly (lazy) operations. The library also includes a general semiring framework. OpenFst provides some of the basic graph algorithms offered by the BGL such as generic visitors (depth-first, breadth-first, shortest-first, ...), generic shortest-path algorithms (Dijkstra, Bellman-Ford, ...), topological sort, strongly-connected components, ... However, OpenFst does not offer some other algorithms provided by BGL (spamming tree algorithms for instance).

The OpenFst library has been successful used by several natural language processing applications: speech recognition, speech synthesis, machine translation and OCR among others. It is for instance one of the core components of the speech recognition and speech synthesis efforts at Google. In that particular example, it is used among other things to represent (and combine) n-gram language models, word dictionaries and regular grammars.

To conclude, the Boost Graph library could provide a) but not b). The OpenFst library provides a) and b).

Best regards, Cyril

N-way Composition Support

GrahamNeubig - 25 May 2009 - 08:06

Hello,

Are there any plans to support the three-way composition mentioned in the technical report "N-way composition of weighted finite-state transducers" (by Cyril Allauzen and Mehryar Mohri) in the near future? It would come in handy for one of my projects and I would be willing to help with the implementation if no-one is working on it, but if it's already being worked on I'll wait for the next release.

GregDhuse - 07 Jul 2010 - 12:27

Hello,

Does anyone know of an existing implementation of the generalized n-way composition algorithm from that paper? I'm considering working on my own implementation if not.

Best regards, Greg

New behaviour of SetInputSymbols/SetOutputSymbols ?

ChristopherKermorvant - 07 May 2009 - 08:04

Hi,

I am working with the 1.1 version and it seems that there is a change in behavior of SetInputSymbols/SetOutputSymbols. As from version 1.1, it seems that these two functions make a deep copy of the SymbolTable given in argument. This is due to the addition of a copy constructor in symbol-table.h :

explicit SymbolTableImpl(const SymbolTableImpl& impl)
      : name_(impl.name_),
        available_key_(0),
        dense_key_limit_(0),
        check_sum_finalized_(false) {
    for (size_t i = 0; i < impl.symbols_.size(); ++i) {
      AddSymbol(impl.symbols_[i], impl.Find(impl.symbols_[i]));
    }
  }

* Am I right ?

* if yes, i don't see the use of having a reference counting on a deep copy

* if yes, the following comment in symbol-table.h is not true anymore :

// SymbolTables are reference counted and can therefore be shared across
// multiple machines. For example a language model grammar G, with a
// SymbolTable for the words in the language model can share this symbol
// table with the lexical representation L o G.

Thank you for your answer and for this great library !

CyrilAllauzen - 07 May 2009 - 11:14

Hi,

This is inexact. SetInputSymbols and SetOutputSymbols still do a shallow copy of the SymbolTable in 1.1. They both call SymbolTable::Copy that in turn calls the copy constructor of SymbolTable:

  SymbolTable(const SymbolTable& table) : impl_(table.impl_) {
    impl_->IncrRefCount();
  }

The copy constructor of SymbolTableImpl is not called since impl_ is a pointer to a SymbolTableImpl.

Best, Cyril

Sat vocabulary building software

AllenSam - 05 May 2009 - 06:34

hi all looking for an English student to work on our ...

Off-topic post deleted.
MichaelRiley .

Mindtuning for the Encode/Determinize/Decode workaround

KennethRBeesley - 06 Apr 2009 - 17:22

Cyril,

Back on 14 August 2007, you wrote:

"The library currently only supports the determinization of functional transducers (if two successful paths have the same input label, they need to also have the same output label). The reason for that is that we use the weighted automata determinization algorithm viewing the output labels as weights in the string semiring ....

A workaround is to use Encode/Decode to view the transducer as an acceptor, considering the pair (ilabel, olabel) as one symbol.

1. Encode: EncodeMapper encoder(kEncodeLabels, ENCODE); Encode(fst, &encoder);

2. Determinize.

3. Decode: Decode(fst, encoder);

*** End of Quotation ***

I need a bit of mind-tuning.

Question: Is the Encode/Determinize/Decode workaround recommended ONLY when determinizing a NON-functional Fst? If so, what's the best way to determine if an Fst candidate for determization is functional or not?

Background: In my application, Fsts are created from arbitrarily complex regular expressions, and the resulting Fsts may be functional or non-functional. Without the workaround, statements like

$v = a | a:b ;

currently cause a crash when the result Fst is determinized (without the Encode/Decode workaround), because the Fst maps "a" to "a" and to "b"--i.e. in this case the Fst is not functional.

Thanks,

Ken

MichaelRiley - 08 Apr 2009 - 00:01

You can always determinize and minimize an 'encoded' machine. The result may not be fully deterministic or minimal when 'decoded' but at least redundant transitions and states in the encoded version have been dealt with. Probably the safe thing to do in your application when a transduction is specified.

config.h

PaulDixon - 17 Mar 2009 - 08:14

Hi,

After following the INSTALL instructions using the defaults settings and using the library in my own program as described in the README file. The compiler gives and error looking for config.h (included compat.h). Is config.h copied as part of the installation? Everything compiles fine if I add another include path to the location where the configure script was run from.

MichaelRiley - 21 Mar 2009 - 11:45

Thanks. It's been fixed in version 1.1, now uploaded. Autotools growing pains.

Minor bug in assignment operator

DanEgnor - 16 Mar 2009 - 04:29

MutableFst declares an operator=(const Fst &), but it doesn't declare an operator=(const MutableFst &). You'd think that would be OK because MutableFst is a subclass of Fst, but in fact what it means is that C++ defines an implicit operator=(const MutableFst &), which does a shallow copy, which does nothing (because MutableFst is an abstract base class with no data members).

That means that if you then try to assign one MutableFst to another, nothing happens.

CyrilAllauzen - 16 Mar 2009 - 11:17

Hi Dan,

Indeed, we found out about this bug and fixed it a while back. The latest release of the library (version 1.0) included that fix.

Best, Cyril

DanEgnor - 17 Mar 2009 - 22:00

Zounds, I was three days out of date! Thanks.

When is it safe to RmEpsilon(), Determinize(), Minimize() When is it safe to automatically RmEpsilon(), Determinize() and/or Minimize()

KennethRBeesley - 03 Mar 2009 - 16:59

Sorry to belabor this point, but I'm still struggling to understand when an Fst can be safely and automatically processed by RmEpsilon(), Determinize() and/or Minimize(). The challenge arises in my Kleene programming language, which uses OpenFst as the underlying library. In Kleene you can enter statements like

$myfst  =   a* b+ ((c|d|e|f) - (m|n|d)){2} ((dog|cat|rat) & (elephant|dog|pig))? ;

After the regular expression is parsed, the interpreter walks the parse tree and makes multiple calls to OpenFst library functions--Closure, Compose, Concat, Difference, Intersect, etc--building many intermediate Fsts and combining them using the various operations until the final Fst is created and (in this case) bound to the variable $myfst. Obviously, all this processing gets done automatically, without human intervention. The user is interested only in the final result, $myfst, which then might get used in subsequent regular expressions.

Ideally, the final result would be automatically determinized and minimized.

I understand, with help from Cyril , that if an Fst is tested and found to be 1) acyclic, or 2) both (unweighted AND an Acceptor) then it can be safely Determinized and Minimized. (Corrections would be welcome.) So after each intermediate or final Fst is created by the Kleene interpreter, it might be sent to the following clean-up function

// first-draft pseudo-code
if ( the fst is acyclic || (the fst is unweighted &&  the fst is an acceptor) ) {
      RmEpsilon()
      Determinize()
      Minimize()
} else {
      // this is questionable, see below
      RmEpsilon()
}

However, I'm aware that this could still be problematic. In a response on this forum, Cyril stated

"Indeed, in some situations you cannot remove epsilons because it will lead to a space blow up. In that case, you might still want to optimize the machine with epsilons [i.e. without first invoking RmEpsilon] using determinization and minimization."

In another response, Cyril stated

"Indeed, union introduces epsilon-transitions and epsilon-transitions will slow down composition. Non-determinism will also slow down composition. So, I would recommend you first epsilon-remove and then determinize your fst. However, there is always a risk with determinization (it might blow up or even not terminate). Hence, if what I suggest does not work better, you should investigate using determinize only, or rmepsilon only or determinize followed by rmepsilon."

This puts in question the 'else' clause in the pseudo-code above, where RmEpsilon() is automatically called on Fsts that don't satisfy the requirements for safe determinization and minimization.

Question: Assuming my Kleene scenario, where regular expressions are being interpreted and many Fsts (intermediate and final) are being created automatically, when can RmEpsilon(), Determinize() and Minimize() be safely invoked automatically? I want the interpreter to do all the optimizations that are mathematically safe, but to attempt nothing more.

We can assume that the interpreter can interrogate various features of the candidate Fst (cyclic vs. acyclic, weighted vs. unweighted, acceptor vs. transducer, etc.). The interpreter also knows exactly which operation has just been performed to produce the Fst in question (Union, Concat, Closure, etc.).

Thanks,

Ken

CyrilAllauzen - 03 Mar 2009 - 17:27

Hi Ken,

To clarify, when talking about determinization:

safe means a condition under which the algorithm will always terminate assuming you have enough memory and wait long enough;
unsafe means the algorithm might not terminate ever.

Assuming this definition of safe/unsafe, determinization is safe for acyclic FSTs and for unweighted acceptors and rmepsilon and minimization are always safe.

However, safe does not guarantee that the algorithms will terminate on your machine because you have a finite amount of memory and don't want to wait more than x minutes.

The complexity of determinization in the safe case is in the worst case exponential in the size of the input machine, and quadratic in the size of the result (see DeterminizeDoc).

The complexity of epsilon removal is in the worst case quadratic in the size of the input (see RmEpsilonDoc).

There is no way you can ensure these algorithms will never blowup of memory. However, you can ensure that for most "reasonable" input they will behave "reasonnably".

Due to potentially large cost of these algorithms, it is always recommended you try to reduce the size of the input Fst before hand. This is why I would recommend first optimizing the machine with epsilon, applying epsilon-removal and then optimizing the machine without epsilons, Hence, the "safest" optimization function would be:

// first-draft pseudo-code
if ( the fst is acyclic || (the fst is unweighted &&  the fst is an acceptor) ) {
      Determinize()
      Minimize()
      RmEpsilon()
      Determinize()
      Minimize()
} else {
      // this is questionable, see below
      RmEpsilon()
}

Compiling openfst-1.0 on cygwin

PaulDixon - 26 Feb 2009 - 18:17

I have successfully compiled openfst-1.0 on cygwin. It was necessary to install a newer version of gcc to get TR1 support. I also had to add this line to main.cc

#include <ctime>

I should also have Visual Studio build instructions ready soon.

Paul

PaulDixon - 17 Mar 2009 - 19:37

To compile on the latest verion of cygwin with the cygwin gcc-4 package. Add the #include to the main.cc file and Run configure with the following arguments. ./configure CXX=g++-4.exe CC=gcc-4.exe Build and install as described in the readme

HenrikKinnemann - 11 Aug 2009 - 03:24

fstcompose memory consumption?

DavidHugginsDaines - 19 Jan 2009 - 11:45

Hi,

I'm running into some trouble trying to build static recognition transducers using OpenFst. Specifically, for even a fairly trivial language model G:

# of states                     42980
# of arcs                       183468
# of final states               923
# of input/output epsilons      43867

and (determinized and minimized) dictionary L~:

# of states                     1180
# of arcs                       2424
# of final states               1
# of input/output epsilons      1

when I try to compose these with fstcompose, it rapidly uses all available memory. I haven't let it run long enough to see if it actually terminates.

I'm not sure this is specific to OpenFst, because I find that the FSM library has the same problem. It's possible that I'm building the language model incorrectly, though I'm not sure why this would cause composition to blow up. I am planning to do a sanity check using the GRM library but haven't figured this out yet.

Is it necessary to use on-demand composition when building a transducer of this sort?

CyrilAllauzen - 21 Jan 2009 - 18:10

Hi,

You should try not to determinize/minimize L and make sure that the output labels (i.e words) are on the first non-input epsilon on a path. Also, make sure that L is output label-sorted. (I assumed you do fstcompose L G > LG)

Let me know if this works.

Cyril

DavidHugginsDaines - 26 Jan 2009 - 13:30

Thanks for the advice. It is working quite well now. I had been putting words at the end of words, on the word boundary symbols in L~, under the assumption that epsilon normalization would push them forward eventually (which it does), and also including entirely unnecessary epsilon transitions from the start state to the first state of each word.

I assume the problem with a minimized L is that its maximum out-degree is much higher than an un-minimized one?

CyrilAllauzen - 30 Jan 2009 - 13:37

Hi David,

I'm glad to read that this works quite well now. The issue is with determinization and not minimization. Determinization pushes back the output labels creating a lot of long epsilons paths. In composition, while trying to match a label x at a state in G, all this epsilon paths in L are going to be followed in order to find an x. A lot of this work is wasteful since there is only a few of these paths at the end of which an x can be found.

Cyril

BingbingLeng - 03 Dec 2009 - 11:59

Hi Cryil,

when the output labels (i.e words) are on the first non-input epsilon on a path,I cannot determinize the Lexicon.Could you please tell me how to determinize it?

possibility of facing infinite loop in fstdeterminize

IlbinLee - 18 Jan 2009 - 21:47

Is ever possible for "fstdeterminize" to fall into infinite loop?

I'm trying to determinize an FST and the amount of memory it uses gradually increases and it reaches to the limit, results "memory allocation failed" error. However, after removing 2-3 certain arcs in the FST, the process ends pretty quickly. Those arcs don't have any particular characteristics, so I guess it's due to a structure of the FST which consists of more than 2-3 arcs.

Could you give me any solution or hint?

MarkusD - 20 Jan 2009 - 14:25

Yes, the determinization algorithm does not halt for certain weighted finite-state machines. This is described in Mohri 2004 (http://www.cs.nyu.edu/~mohri/postscript/fla.pdf

). If the machine contains 'sibling states' then they must be 'twins'. Otherwise, the machine is not determinizable (under the commonly used semirings).

Roughly, two states are sibling states if they are reachable from the start state on different paths that have the same label sequence, and there are cycles at these two states that have similar labels. If these two cycles have the same weight then they are twins.

On this machine, for example, fstdeterminize doesn't terminate: 0 1 1 1 0 2 1 2 1 1 2 3 1 3 3 5 2 2 2 4 2 3 4 6 3

MarkusD - 20 Jan 2009 - 14:27

Here is the finite-state machine again (this time in "pre" tags):

0 1  1 1 0 2  1 2 1 1  2 3 1 3  3 5 2 2  2 4 2 3  4 6 3

Minor fixes needed to compile with GCC 4.3

DavidHugginsDaines - 06 Jan 2009 - 13:46

As usual, the C++ "standard" got reinterpreted in GCC 4.3, and a few things need to be changed for OpenFst-20080422 to compile. They are mostly trivial, just adding std:: to C string functions and including to get INT_MAX. A patch is at http://www.cs.cmu.edu/~dhuggins/OpenFst-gcc-4.3.diff

Ignore(A, B, &C) algorithm?

KennethRBeesley - 05 Jan 2009 - 17:30

If you're taking requests: It would be convenient to have an Ignore(A, B, &C) algorithm, creating a network (C) that encodes the language/relation A, ignoring any occurrences of B.

Documentation of OpenFst binary file format?

KennethRBeesley - 05 Jan 2009 - 14:46

Is there any available documentation on OpenFst's binary file format?

Thanks,

Ken

DavidHugginsDaines - 06 Jan 2009 - 14:24

I've been wondering this too, as I'm interested in reading and writing the binary file format without having to link in the OpenFst library (for no good reason aside from not wanting to deal with C++ compiler and library issues). It appears that it's slightly machine specific (but only in terms of endianness), so perhaps it's mainly meant to be used as an internal, intermediate format?

The definition of the header can be found in fst/lib/fst.h, and the symbol table format can be gleaned from fst/lib/symbol-table.cc. How the subsequent state and arc definitions are stored depends entirely on what particular Fst and Arc classes are being used.

In the case of VectorFst, I believe the whole thing looks like this, in native byte order. Strings are stored as a 32-bit length field followed by the raw character data. I may be a bit wrong about this as I haven't actually implemented it yet. Code for this should show up in the CMU Sphinx repository at some point in the near future though.

int32 magic_number = 2125659606;
string fst_type = "vector";
string arc_type = "standard";
int32 file_version = 2; /* file format version */
int32 flags; /* bitfield: HAS_ISYMBOLS = 1, HAS_OSYMBOLS = 2 */
uint64 properties; /* bitfield, defined in fst/lib/properties.h */
int64 start; /* start state ID */
int64 num_states;
int64 num_arcs;

/* input symbol table, if flags & HAVE_ISYMBOLS */
int32 magic_number = 2125658996;
string name;
int64 next_available_key; /* Next available key (number) */
int64 size;
struct {
    string symbol;
    int64 key;
} symbols[size];

/* output symbol table, if flags & HAVE_OSYMBOLS - see above... */

/* states and arcs */
struct {
    float final_weight;
    int64 num_arcs;
    struct {
        int ilabel;
        int olabel;
        float weight;
        int next_state;
    } arcs[num_arcs];
} states[num_states];

DavidHugginsDaines - 06 Jan 2009 - 14:26

That should say VectorFst<StdArc> ... forgot to format it correctly.

Access control:

Set ALLOWTOPICCHANGE = AllAuthUsersGroup

This topic: Forum > WebHome > FstForum > FstForumArchive2009
Topic revision: r1 - 2014-04-21 - CyrilAllauzen