OpenFst Forum 2008 Archive

Patch for GNU autotools including libtool for OpenFst 20080422

DavidShao - 28 Dec 2008 - 00:41

I have been developing patches to enable an open source software stack of OpenFst 20080422, leptonlib 1.58, iulib subversion revision 123, Tesseract subversion revision 205, and OCRopus subversion revision 1307 to all use libtool to create dynamic linked libraries on all of Mac OS X 10.5.6 using Macports, Ubuntu 8.10, FreeBSD 7, and Windows XP using Cygwin. The patches can be found at OCRopus files with filenames beginning with toautotools. As of today, for OpenFst, the corresponding file would be
toautotools_openfst_20080422_20081227.diff
As I have stated in a previous post, to apply and use the patch, say to a user subdirectory busr so that one does not need administrator privileges to install,

patch -p1 -E < ../toautotools_openfst_20080422_20081227.diff
autoreconf --install --force
./configure --prefix=$HOME/busr CPPFLAGS="-I/opt/local/include" LDFLAGS="-L/opt/local/lib"
make
make check
make install

I have enabled make check using tests already provided in OpenFst so that one can test whether modules can be dlopened. All tests pass for all platforms except Cygwin on Windows XP. Even for Cygwin on Windows XP, this patch enables a dynamic linked library to be built such that OCRopus is able to use it to pass its own OpenFst tests. (There is an additional failure of one OCRopus test, test-fst-search2.lua, on Mac OS X and FreeBSD.)

My preliminary investigation for Cygwin on Windows XP appears to show that something is going wrong with reinterpret_cast-ing the read() function so that it can be recovered using a map in both fst/lib/register.h and fst/bin/main.h. The pointer to this function appears to be immediately recovered if the map is accessed right after it is stored, but a later GetEntry() returns a null pointer.

I believe that using GNU autotools to provide dynamic linked libraries is an essential step to helping more wide distribution of OpenFst through standard platform package or port managers. I think the current patch just about finishes this task other than the above problem for Cygwin. I also have not tested this patch on any 64-bit platforms other than an Intel Macbook running Mac OS X.

DavidShao - 28 Dec 2008 - 00:42

Log In

Determinize, Sequentialize, Subsequentialize

KennethRBeesley - 11 Nov 2008 - 14:17

My colleagues and I are trying to understand the relationship between the OpenFst Determinize() algorithm and the sequentialization and subsequentialization discussed in Mohri's 1997 paper "Finite-State Transducers in Language and Speech Processing".

Any mind-tuning that you could offer would be much appreciated.

Ken

CyrilAllauzen - 13 Nov 2008 - 14:15

The Determinize() algorithm is the algorithm described by Mohri in 1997 Computational Linguistic paper.

In the weighted automaton case, the implementation follows exactly Mohri's paper. The weighted transducer is dealt by considering weighted transducer over seminiring K as weighted automaton over the semiring obtained by taking the cross product of the string semiring and the semiring K.

Cyril

Log In

Determinize and functional transducers

KennethRBeesley - 11 Nov 2008 - 14:10

The documentation of the Determinize() function in OpenFst states "The transducer must be functional." and deeper down, "N.B. Current version only determinizes functional transducers". Can we expect a future version of Determinize() that is not limited to functional transducers?

Thanks,

Ken

CyrilAllauzen - 13 Nov 2008 - 16:35

Yes, we indeed plan to have determinization of non-functional transducers in a future version.

Best, Cyril

ThomasDeniau - 11 Nov 2009 - 00:06

Has anybody written a version of determinization which works for non-functional transducers ? Cyril, would you have a preliminary version of that code to give to people looking to implement it a headstart ? Thanks
Log In

Request: Determinize in place

KennethRBeesley - 10 Sep 2008 - 23:44

Minimize(&A) minimizes a network A "in place", but Determinize(A, &B) takes A and produces as output a new determinized Fst B. Would it be possible to provide a new version of Determinize that determinizes in place, i.e. Determinize(&A)?

Thanks,

Ken

CyrilAllauzen - 24 Sep 2008 - 17:27

We could add a Determinize(&A). The reason we chose not to so was that the underlying algorithm would still be constructive: create B as the determinization of A, then delete A and assign B to A. Hence, it won't be any more (memory) efficient than using Determinize(A, &B).

KennethRBeesley - 24 Sep 2008 - 21:46

Whatever goes on underneath, I venture to suggest that it would be (to some of us at least) intuitive and useful to add a new Determinize(&A) that is parallel to RmEpsilon(&A) and Minimize(&A). From a user's point of view, it often makes sense to perform such operations "in place" on an existing Fst object.

In my own work (a programming language built on top of OpenFst) an assignment statement like

$net = a*b+[w-z]? ;

causes the right-hand side to be evaluated (by calling various functions in the OpenFst library) and in a symbol table the name $net is bound to a Java Fst object that stores (as a Java long int) the pointer to the resulting C++ Fst object. That pointer is my handle from the Java world to the Fst object in the C++ universe. If I then type

$net2 = $net ;

then $net2 becomes an alias for the same Java Fst object that contains the pointer to the original C++ Fst object. (If I really want to make a copy, I've got a copy function that relies on the Map() function in the OpenFst library.)

With RmEpsilon(&A), I can remove the epsilons from the original C++ Fst object in place, and both $net and $net2 still reference the same Java Fst object that still contains the pointer to the original C++ Fst object. If I could (at least superficially) also Determinize() that original Fst object in place, life would be a bit easier for me. With the existing Determinize(A, &B), a new C++ Fst object B is created, which usually causes my language to create a new Java Fst object to wrap it; and I've got to worry about changing the pointer in the original Java Fst object (the value, in the symbol table, of $net and any aliases like $net2) and deleting the original C++ object A to prevent a kind of memory leak.

So I now understand why Determinize(A, &B) is as it is, but I do still suggest that I and perhaps others would find a new Determinize(&A) useful and intuitively parallel to RmEpsilon(&A) and Minimize(&A), so that all could be performed "in place".

Best,

Ken

CyrilAllauzen - 14 Oct 2008 - 11:59

Hi Ken,

I understand that what you describe is a rather unsastisfying way of wrapping. However, I not sure I understand why this is the only of wrapping. Anyhow, you can easily add a Determinize(&A) as follows:

template <class Arc>
void Determinize(MutableFst<Arc> *fst) {
  *fst = DeterminizeFst<Arc>(*fst, 
                             DeterminizeFstOptions(CacheOptions(true, 0)));
}

Best, Cyril

Log In

StateIterator and the Start state

KennethRBeesley - 21 Aug 2008 - 14:28

When you iterate through the states in an Fst:

for (StateIterator<StdFst> siter(fst); !siter.Done(); siter.Next()) {
     StateId state_id = siter.Value() ;
     ...
}


is the first state_id returned from the Iterator always the Start state of the Fst?

Thanks,

Ken

CyrilAllauzen - 02 Sep 2008 - 11:39

The first state returned by the iterator is not guaranteed to be the start state. You should always use the fst.Start() method to determinne the start state,

Best, Cyril

Log In

Failure transitions

ChrisHarris - 21 Jul 2008 - 14:56

Hey guys -

Great effort to build this library - there are so many uses - it's very nice to have such knowledgeable people building this for everyone. Thank you!

I'm interested in doing some language modeling with this library which I anticipate will require backoff weights that are commonly going to build paths which will be "better" than the proper path for certain n-grams.

I know you guys implemented failure transitions in the GRM library for LM purposes, and later an algorithm to replace them with epsilon transitions along with some compression heuristics to reduce the size of the final machines to address this exact problem.

My question is: What's the best route to take to implement this for the OpenFST project? If I take this on myself... should I create a new FST class to implement failure transitions? Or can I get away with just building a new Arc class which can trick the search functions for composition?

Thanks again for your help!

Chris

CyrilAllauzen - 21 Jul 2008 - 16:56

Hi Chris,

There is an undocumented support for failure transitions in composition. This feature is not documented and the handling of failure transitions will change significantly in the next version of the library.

For now, you can label failure transitions with kPhiLabel (you will need to use the C++ library for this, Fsts containing transitions labeled by kPhiLabel cannot be written to a file).

To perform composition with failure transitions, you need to use

ComposeFst<Arc>(fst1, fst2, ComposeFstOptions<COMPOSE_FST1_PHI>())
when fst1 contains transitions labeled by kPhiLabel or
ComposeFst<Arc>(fst1, fst2, ComposeFstOptions<COMPOSE_FST2_PHI>())
when fst2 contains transitions labeled by kPhiLabel.

Best, Cyril

ChrisHarris - 21 Jul 2008 - 20:03

Nice... thanks for the head's up... I'm glad I asked!

Out of curiosity, what's the new implementaiton going to be? Something along the lines of a custom Arc/State/Fsm class?

Chris

ChrisHarris - 23 Jul 2008 - 19:39

Cyril -

I noticed that during composition with failure transitions, the composition does not chase down the transitive closure of both epsilon and failure transitions.

You can reproduce the behavior by having a state with a failure transition lead to a state with an epsilon transition - which in turn leads to a state with a real symbol arc on it. For example an FSA with the following makeup: 0 1 -2 -2 1 2 0 0 2 3 1 1 3

When you do composition with an FST containing only that real symbol (symbol 1 in the above case) - I would expect it to yield a connected machine (failure -> epsilon -> symbol) - which it does not. Instead it looks through the state with the failure transition and sees no matching symbol (only an epsilon transition) and aborts composition.

I tried to fix the problem, but couldn't quite figure out a good way to do so. The logic to chase down the transitive closure of failure transitions is making assumptions on the structure of the FST (deterministic w.r.t. failure transitions, or at least that they're all equally good, because it takes the first one it finds which yields a match) which probably means any fix I make will be short lived anyway.

If you'd like more information, please let me know, otherwise I'll just try another workaround - but I don't think fixing it in the current implementation is worthwhile... do you?

Chris

CyrilAllauzen - 31 Jul 2008 - 18:42

Hi Chris,

Indeed, I should have mentioned this: failure and epsilon transitions do not work together. You should also have at most 1 failure transition at a given state.

Cyril

Log In

Interactive manual testing of an FST

KennethRBeesley - 07 Jul 2008 - 11:53

In the context of an interactive GUI, it might sometimes be useful to build an FST and then test it interactively by typing in individual strings to be generated (or analyzed), getting back the output string(s). If the FST to be tested is MainFST, then one obvious way to do this would be the following:

1. Get the input string typed by the user.
2. Build the input string into a one-path acceptor (call it InputACC)
3. StdVectorFst ResultFST ;
Compose(InputACC, MainFST, &ResultFST) ; // compose for generation
Project(&ResultFST, PROJECT_OUTPUT) ; // take the output projection
if (number of paths in ResultFST not infinite or huge) {
// print strings encoded by ResultFST
}

This testing would be put in a loop, allowing the user to manually enter as many test input strings as desired before terminating the loop. To implement analysis, one could 1) invert the MainFST before performing the steps above, or 2) do the following:

Compose(MainFST, InputACC, &ResultFST) ; // compose for analysis
Project(&ResultFST, PROJECT_INPUT) ; // take the input projection

QUESTION: Is there a better, more efficient, way to implement such manual testing? Perhaps using the lazy operations ComposeFst() and ProjectFst() ?

Thanks,

Ken

CyrilAllauzen - 09 Jul 2008 - 18:11

This approach looks good to me. Using lazy operations shouldn't improve the efficiency as is.

You could benefit from lazy operations when the number of strings in ResultFst is large and you don't want to display all these strings (or even don't want to compute all of ResultFst). Instead, you could rewrite your code to generate all paths to work for a non-expanded FST and to stop once a fixed number of strings have been found (you could even ask to the user if he wants to see more strings then).

Best, Cyril

Log In

Easy way to query the number of paths in an FST?

If you have an FST, is there an easy way to query the number of paths?

Thanks,
Ken
KennethRBeesley 13 Jun 2008 - 14:10
The natural way is to use the shortest-distance algorithm:
1. Use Map to turn your FST into an FST over the log semiring setting the weight of every transition and the final weight of final states to LogWeight::One().
2. Call ShortestDistance with reverse == true. Let d be the value return in the distance vector for the initial state. d is then the -log of the number of successful paths in the FST.

Best,
Cyril
CyrilAllauzen 16 Jun 2008 - 12:56
Thanks, Cyril. I suspect that my terminology was wrong and/or my C++ skills are still shaky. What I really need is the number of complete paths in an FST, from the start state to a final state. Following these instructions (as best I can), I seem to get the number of paths from the start state to any state (final or not). I hesitate to post my code here, but I'd be more than happy to send it to you or anyone else.

Ken
KennethRBeesley 17 Jun 2008 - 16:48
The key is to use the reverse option: ShortestDistance(fst, &distance, true). For any state q, we then have that distance[q] is the sum of the weights of all the paths from q to a final states. Hence, distance[fst.Start()] is what you want (but you should always check before that distance.size() > fst.Start()).

Cyril
CyrilAllauzen 18 Jun 2008 - 11:37

Cyril,

I am using the reverse option = true. Perhaps I'm not properly setting the exit weight of the final states to LogWeight::One(). Will the following Map do it?

// Mapper from StdArc to LogArc.
// orig. from lib/map.h
// struct StdToLogMapper {
struct ModifiedStdToLogMapper {
typedef StdArc FromArc;
typedef LogArc ToArc;

// orig
//LogArc operator()(const StdArc &arc) const {
// return LogArc(arc.ilabel, arc.olabel, arc.weight.Value(), arc.nextstate);
//}

// KRB, change weight of all arcs to LogWeight::One()
LogArc operator()(const StdArc &arc) const {
return LogArc(arc.ilabel, arc.olabel, LogWeight::One(), arc.nextstate);
}

// KRB: also need to change the exit weight of all final states to
// LogWeight::One()

// KRB, will this do it?
MapFinalAction FinalAction() const { return MAP_REQUIRE_SUPERFINAL; }

// MAP_NO_SUPERFINAL, the final weight is mapped to a final weight
//
// MAP_ALLOW_SUPERFINAL, final weight mapped to an arc to the
// superfinal state, if the result cannot be represented as a final
// weight (see below?); superfinal state created only if needed)
//
// MAP_REQUIRE_SUPERFINAL, final weight mapped to an arc to the
// superfinal state, unless the weight can be represented as a final
// weight of weight Zero(). superfinal state always created. (N.B. the
// weight of the superfinal state is set to Weight::One())

uint64 Properties(uint64 props) const { return props; }
};



Thanks,

Ken
KennethRBeesley 25 Jun 2008 - 13:17
Following instructions from Cyril, here's a pathCount() function, and a trivial main() for testing, that work for me. If it gets garbled in the posting, I'd be glad to forward it by email to anyone interested.

Ken


#include "fst/lib/fstlib.h"
#include <iostream>
// iostream gives access to cout

using namespace fst ;

// From map.h
// Mapper to map all non-Zero() weights to One().
//template <class A, class B = A>
//struct RmWeightMapper {
//  typedef A FromArc;
//  typedef B ToArc;
//  typedef typename FromArc::Weight FromWeight;
//  typedef typename ToArc::Weight ToWeight;
//  B operator()(const A &arc) const {
//    ToWeight w = arc.weight != FromWeight::Zero() ?
//                   ToWeight::One() : ToWeight::Zero();
//    return B(arc.ilabel, arc.olabel, w, arc.nextstate);
//  }

//  MapFinalAction FinalAction() const { return MAP_NO_SUPERFINAL; }

//  uint64 Properties(uint64 props) const {
//    return props & kWeightInvariantProperties | kUnweighted;
//  }
//}

typedef VectorFst<LogArc> LogVectorFst ;
typedef Fst<LogArc>       LogFst ;
typedef LogArc::StateId StateId ;

// pathCount  return the number of paths in an FST
// (this version assumes that the input is a StdVectorFst)

int pathCount(const StdVectorFst &sfst) {
        // if the fst has loops, then the language or relation 
   // encoded by the FST is infinite, just return -1
   if (sfst.Properties(kCyclic, true)) {
      return -1 ;
   }
   // copy the FST input to a new unweighted network, 
   // over the Log Semiring
   LogVectorFst lfst ;

   // for each original StdArc, if the weight is _not_
   // TropicalWeight::Zero() then change it to LogWeight::One(),
   // else (when the weight is TropicalWeight::Zero()) change it to
   // LogWeight::Zero().  This is handily done with
   // RmWeightMapper(), which is in the library.

   Map(sfst, &lfst, RmWeightMapper<StdArc, LogArc>()) ;

   vector<LogArc::Weight> distance ;
   ShortestDistance(lfst, &distance, true) ;

   // from the distance vector, get the weight for the start state
   LogArc::Weight w = distance[lfst.Start()] ;

   // w.Value() is the -log of the number of paths
   // cout << "Weight from pathCount(): " << w << endl ;
   // so make w positive and get the exp()
   int paths = exp((double)(-1.0 * w.Value())) ;
   // cout << "Paths from pathCount(): " << paths << endl ; 
   return paths ;
}

// a trivial main() that tests pathCount()
//
int main() {
  // Create a little FST over the Tropical Semiring

  StdVectorFst fst ;
  fst.AddState() ;  // will be state 0
  fst.SetStart(0) ;

  fst.AddState() ;  // will be state 1
  fst.AddState() ;  // will be state 2

  // add some arcs
  fst.AddArc(0, StdArc(1, 1, 0.693, 1)) ; // arc from 0 to 1
  fst.AddArc(0, StdArc(2, 2, 0.693, 1)) ; // arc from 0 to 1
  fst.AddArc(0, StdArc(7, 7, 0.0, 1)) ;

  fst.AddArc(1, StdArc(3, 3, 0.693, 2)) ; // arc from 1 to 2
  fst.AddArc(1, StdArc(4, 4, 0.693, 2)) ; // arc from 1 to 2
  fst.AddArc(1, StdArc(5, 5, 0.693, 2)) ;
  fst.AddArc(0, StdArc(8, 8, 0.0, 2)) ;
  // Uncomment the following line to create a cycle/loop,
  // which makes the language or relation infinite.
  //fst.AddArc(1, StdArc(5, 5, 0.0, 1)) ;

  fst.SetFinal(2, 0.693) ;

  int count = pathCount(fst) ;
  cout << "Path Count from main(): " << count << endl << endl ;
  return 0 ;
}
KennethRBeesley 03 Jul 2008 - 16:46
Hi Ken,

I fixed your post.

Cyril
CyrilAllauzen 09 Jul 2008 - 11:41
Cyril,
Thanks again for your help in implementing the path-counting algorithm above.
Could you or anyone help me understand how the code above returns the number of paths? Why does the weight of the start state, in the Log semiring, correspond to the number of paths.? In larger FSTs, some accuracy seems to be lost. In one of our examples, an FST reported (by the algorithm above) to have 1,048,580 paths, has, in fact 4 fewer paths. Not a big error, but it surprised one of our testers who expected to get an exact path count.
Thanks,

Ken
KennethRBeesley 24 Jun 2009 - 19:55
It uses the shortest-distance algorithm describe in FST.ShortestDistanceDoc. In the real semiring, when the weights of all the arcs are 1, the shortest-distance from the initial state to a state q is the number of paths from the initial state to q. However, since the libary does not implement the real semiring, we have to do the computation in the log semiring. In this very special case where the weights are integer and we always multiply by 1, going to the log actually leads to more numerical errors.

There are two things you can try if you want to get better precision:

1. Use a smaller delta. Shortest-distance (like many algorithms in the library), use a delta parameter to determine whether two weights are "equal". By default, this delta is kDelta. This is to make the algorithm converge faster when they are cycles. Here, since the machine is acyclic, you could get away with a much much smaller delta.

2. Use doubles instead of floats. If reducing significantly delta does not help, the imprecisions might be due to floating point errors. You could try using doubles instead of floats by using ArcTpl<LogWeightTpl > instead of LogArc.

Finally, if this is very important to you, the best thing would probably be to implement the (+,x)-semiring over unsigned 64-bit integers.

Cyril
CyrilAllauzen 25 Jun 2009 - 17:02

About Factoring Algorithm of WFST

Hi,

In this paper

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley.
Speech Recognition with Weighted Finite-State Transducers.
In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech
Communication, Part E: Speech recognition. Springer-Verlag, Heidelberg, Germany, 2008.

The factoring algorithm is mentioned but no enough details.

Can you tell me where to find the code or detailed descrption of the factoring algorithm?

Thanks in advance!


Zhenyu Pan
ZyPan 30 May 2008 - 13:08

Here is a patch to use CMake for builds (tested on Linux and Mac OS X)

Performs build, tests, and installation. Could generate for Cygwin and even Visual Studio; might work with the former, but code is of course UNIX-specific, so won't build with the latter.

Can't upload here, so download from: http://lemurz.org/io/wp-content/uploads/2008/05/OpenFst-beta-20080422-cmakep1.patch.gz
TomMurray 12 May 2008 - 17:42
Small update to install header files with 'make install': http://lemurz.org/io/wp-content/uploads/2008/05/OpenFst-beta-20080422-cmakep2.patch.gz TomMurray 12 May 2008 - 19:03

SymbolTable::size()

It's minor, but it would be nice to have a SymbolTable::size() function that returns the number of keys. MarkusD 23 Apr 2008 - 11:06

Efficient Way of Taking the Union of Several FSTs

Hi,

I'm trying to take the union of several thousands of FSTs. After a while the union becomes quite large and the union operation appears to take more and more time to complete. I tried removing epsilon transitions and determinizing after each union. This helps to reduce the time consumed by the union operation, however this time the overall operation (union+rmepsilon+encode+determinize+decode) takes much more time compared to the former case. I wonder if there is an efficient way of taking the union of several FSTs.

Thanks in advance,
Dogan
DoganCan 22 Apr 2008 - 12:57
Hi,

When dealing with the union of a very large number N of FSTs, a good approach is two proceed in log N steps. The first step consists of computing the union of the (2i)-th and (2i+1)-th FSTs for each 0 < i < N/2, leading to a set of N/2 transducer for the next step. The final transducer is obtained as the union of 2 intermediary transducers at the log N step.

Best,
Cyril
CyrilAllauzen 22 Apr 2008 - 17:58

possible bug in fstdeterminize in case of log semiring

Hi,

I noticed that the outputs of fstdeterminize and fsmdeterminize from AT&T FSM toolkit do not match in the case of log semiring. Here is a simple fst in log semiring with the mentioned problem and the results of two executables


cat test.txt
4 0 74 0 0.00274658203
4 1 74 0 5.89324951
4 3 20364 0 0.0282592773
4 2 20364 0 3.57971692
4 5 0 1 -1.0986321
0 5 0 1 -4.69826591e-06
0 2 20364 0 3.68112183
0 3 20364 0 0.0255126953
1 5 0 1
1 2 20364 0
2 5 0 1
3 5 0 1
5

fstcompile --arc_type=log test.txt | fstdeterminize | fstprint
0 1 0 1 -1.0986321
0 2 74 0 -1.51862259e-05
0 3 20364 0 -1.98714133e-05
1
2 1 0 1 0.000163570745
2 3 20364 0 0.000163570745
3 1 0 1 4.95321437e-05

fsmcompile -t -s log test.txt | fsmdeterminize | fsmprint
0 1 0 1 -1.0986321
0 2 74 0 -1.51862259e-05
0 3 20364 0 -1.98714133e-05
1
2 1 0 1 -4.6852183e-06
2 3 20364 0 -4.68526787e-06
3 1 0 1 -7.15111692e-10


I believe a previous discussion titled "fstdeterminize" is about the same issue where fstdeterminize changes the path weights in the log semiring. I tried the suggestion

"The issue might be due to numerical instabilities. One solution would then be to use a smaller delta (in DeterminizeFstOptions)."

with much smaller/larger deltas, but the result of fstdeterminize did not change.

Could you please explain what I'm missing or how to correct if this is actually a bug.

Best,
Dogan
DoganCan 15 Apr 2008 - 10:44
Hi,

The output of fsmdeterminize and fstdeterminize do not match because there are significant differences in the implementation. In that example, both outputs seem valid to me.

However, changing the delta should have had an effect on the result. It appears there was a bug and the delta parameter was ignored when determinizing transducers (delta works as it is supposed to for acceptors).

A new version of the library fixing this issue is available on the download page.

Best,
Cyril
CyrilAllauzen 17 Apr 2008 - 17:28

LM using FST

Hi,
I used to use GRM to build LMs with FSM, is FST compatible with GRM ?
If not, is there an alternative in FST ?

Also, What is the equivalent of farcompilestrings & farprintstrings under FST?

Thanks,
Mohamed Noamany
MohamedNoamany 03 Apr 2008 - 20:50
Hi,
Only the text representation of OpenFst and FSM are compatible. So, if you want to use a LM build by GRM in OpenFst, you need to convert it to the OpenFst format using fsmprint / fstcompile:
 fsmprint lm.fsm | fstcompile > lm.fst 
There are no equivalent of farcompilestrings/farprintstrings in OpenFst.

Best,
Cyril
CyrilAllauzen 04 Apr 2008 - 10:50
  JonathanB 16 Oct 2009 - 11:16
Hi Cyril,
Would the far tool in OpenKernel do the trick? (edit) ..mmhh.. I take it back. I just noticed version 1.2 came out and it does have the far extension. cheers
AMorenoDaniel 13 Aug 2010 - 19:36

compilation gotcha with spaces in directory names

Low priority item for the todo list (which will be likely fixed with autoconf but good to put in the test suite): compiling in a directory with spaces in the name (allowed and quite possible under macos) causes $(PWD) in bin/Makefile to break.

Symptom:

localhost:~/cse733/sp08/794L tutorial materials/fst/OpenFst/fst fosler$ make
( cd lib ; make all )
make[1]: Nothing to be done for `all'.
( cd bin ; make all )
make[1]: * No rule to make target `"/Users/fosler/cse733/sp08/794L', needed by `fstarcsort.o'. Stop.
make: * [all] Error 2

Clear and easy workaround is to replace $(PWD) with a hard coded directory with escaped spaces.
EricFoslerLussier 26 Mar 2008 - 21:35

Input and output alphabets

I've been using non-disjoint but different input and output
alphabets, along with --keep_osymbols and --keep_isymbols.

Two commands in particular are confusing here. fstinvert
does not understand that the input and output
symbol tables need to be swapped if the result is going
to be correct, and fstproject does not understand that
the alphabets in the result are going to be either both
derived from the osymbols or both derived from the
isymbols. Presumably the trouble is that these two
commands are outliers compared to the rest of the
library, so the infrastructure which they get from the
library doesn't quite fit.
ChrisBrew 14 Mar 2008 - 16:40
This is actually a bug. It should do the right thing with the symbol tables for project and invert. The bug has been fixed and you can download the fixed version of the library on the download page. CyrilAllauzen 14 Mar 2008 - 19:24
Thankyou. There's an easy workaround: don't use different alphabets, which actually
works for me. But better not to have to.
ChrisBrew 15 Mar 2008 - 10:49
Thankyou. There's an easy workaround: don't use different alphabets, which actually
works for me. But better not to have to.
ChrisBrew 15 Mar 2008 - 10:49
Actually, a new bug was introduced in fstinvert in the version released on friday. A new version is now available on the download page fixing this bug. CyrilAllauzen 17 Mar 2008 - 11:51

Interpreting complement character ranges

I'm implementing a programming language based on regular expressions, which are interpreted to produce finite-state networks. The current interpreter, which is just starting to work, calls functions in the OpenFst library, though I've kept the design modular enough to use other interpreters based on other libraries.

Interpreting character ranges: If I parse and interpret the following statement:

$foo = b [aeiou] t ;

a character range like [aeiou] or [a-z] can be handled straightforwardly as a union, equivalent to (a
e i o u) and (a b c d ... z), respectively. No problem. But how, in OpenFst, can I interpret complemented character ranges such as

$bar = a [^aeiou] e ;

???

If . is used in the syntax to represent "any possible character" (denoting the set of all single-character strings), then [^aeiou] is equivalent to (. - (a
e i o u)), but that just raises the question, How would one interpret the following:

$fum = a . e ;

denoting the set of all three-character strings where the first character is 'a', the second character is "any possible character" and the third is 'e' ?

Thanks,

Ken
KennethRBeesley 13 Mar 2008 - 13:12
Wow. That last message got garbled wherever I used a vertical-bar to indicate union. The first garbled example, expanding [aeiou], was (a verticalbar e verticalbar i verticalbar o verticalbar u ).
The second, equivalent to [a-z] was
(a verticalbar b verticalbar c verticalbar d verticalbar ... z)

The third, equivalent to [^aeiou] was

(. - ( a verticalbar e verticalbar i verticalbar o verticalbar u))

Ken
KennethRBeesley 13 Mar 2008 - 14:02

Mindtuning: value of separate input and output symbol tables?

OpenFst allows the use of separate input and output symbol tables for mapping symbol names (strings) to the integers used as labels on an Fst. So, for examples, the label 2 on the input side could be mapped to 'a', and the label 2 on the output side could be mapped to 'b'.

1. What's the perceived value of allowing separate input and output mappings?
2. In practice, is it most common to use the same symbol table for both sides?
Thanks,

Ken
KennethRBeesley 04 Mar 2008 - 11:53
Actually, it is very common to have different symbol tables for each side. For instance, if you a lexicon that maps prononciation of words, i.e. sequence of phonemes, to words. The input symbol table would contains "phonemes" and the output table words. This is a real example from speech recognition (cf. this paper) CyrilAllauzen 04 Mar 2008 - 11:59
Yes, that's a good example. Thanks. In my own work, at least, I would tend to want individual alphabetic symbols like 'a', 'b' and 'c' to be stored (by default) on arcs with their standard Unicode code point values, including IPA symbols in a speech application; and any symbols with multicharacter print names (including whole words, as in the speech-recognition example) would be stored with code point values from a Unicode Private Use Area. Unicode currently offers 137,468 Private Use code points--is that enough for all the words and other multicharacter symbols (e.g. ngrams of phoneme characters) used in a typical speech application? KennethRBeesley 05 Mar 2008 - 15:43
Unfortunately, this solution would not be satisfying for speech recognition. Some very large vocabulary speech recognition systems use vocabularies consisting of several million words (i.e. multicharacter symbols).

Allowing users to define their own input and output symbol tables is the most general approach. For your work, the approach you describe may be more than adequate, and the library indeed allows you to have the same symbol table used as both input and output table for all your FSTs.

Cyril
CyrilAllauzen 06 Mar 2008 - 10:55

"Normalizing" arc/final weights on an FST?

I am using FSTs to represent probabilistic finite-state automata and
there are various situations in which I need to "normalize" them--that
is, scale the arc weights such that the probability of arcs leading
out of a state (plus the probability of state finality) sums to 1 for
each state. I am having trouble, however, figuring out how to modify
the weights on arcs of an FST. I have tried the following code, and
while this modifies the final weight appropriately, it doesn't affect the arc weights at all:

void makeProper(VectorFst<StdArc> &fst) {
    using fst::TropicalWeight;
    for(StateIterator<VectorFst<StdArc> > siter(fst); !siter.Done(); siter.Next()) {
      int s = siter.Value();
      float tot = pow(2, -1 * fst.Final(s).Value());
      for(ArcIterator<VectorFst<StdArc> > aiter(fst,s); ! aiter.Done(); aiter.Next()) {
        StdArc arc = aiter.Value();
        float contribution = pow(2, -1 * arc.weight.Value());
        tot += contribution;
      }
      float adjustment = -1 * log(tot) / log(2);
      float newFinal = fst.Final(s).Value() - adjustment;
      fst.SetFinal(s,TropicalWeight(newFinal));
      for(ArcIterator<VectorFst<StdArc> > aiter(fst,s); ! aiter.Done(); aiter.Next()) {
        StdArc arc = aiter.Value();
        float newWeight = arc.weight.Value() - adjustment;
        arc.weight = TropicalWeight(0.0);
      }
    }
  }


How can I modify the arc weights? Or is there some other
functionality I haven't noticed yet that performs this normalization
on its own?

Many thanks once again,

Roger
RogerLevy 02 Mar 2008 - 23:34
Ahh...my naive attempt to use
...
was not entirely successful. My full query with properly-indented code can be found here:

http://idiom.ucsd.edu/~rlevy/normalize_fst_weights

Roger
RogerLevy 02 Mar 2008 - 23:37
Hi Roger,

1. In order to modify an arc, you need to use a MutableArcIterator instead of an ArcIterator. Then do aiter.SetValue(arc) at the end of your last for loop. The constructor of aiter should then also take &fst as an argument instead of fst.

2. There is an operation to normalize the weights in the way you describe. It is call pushing: Push(&fst, REWEIGHT_TO_INITIAL)

However, pushing is defined using the semiring operation, hence in your case you'll need to push in the log semiring.
VectorFst<StdArc> fst;
VectorFst<LogArc> log_fst;
Map(fst, &log_fst, StdToLogMapper());
Push(&log_fst, REWEIGHT_TO_INITIAL);
Map(log_fst, &fst, LogToStdMapper());


The total weight will still be present at the initial state. You can remove it by doing after calling Push:
vector<LogWeight> distance;
ShortestDistance(log_fst, &distance, true);
LogWeight total = distance[log_fst->Start()];
MutableArcIterator< VectorFst<LogArc> > aiter(&log_fst, log_fst->Start());
for (; !aiter.Done(); aiter.Next()) {
 LogArc arc = aiter.Value();
 arc.weight = Divide(arc.weight, total);
 aiter.SetValue();
}


P.S. I didn't compile the code so there might be some typos.
P.P.S. I fixed the formating of your original post.
CyrilAllauzen 04 Mar 2008 - 11:25
Many thanks once more, Cyril. (The last call needed to be aiter.SetValue(arc) but that was easy enough to figure out.) The approach I was taking had an error in it, anyway, so Push is much better.

Best

Roger
RogerLevy 04 Mar 2008 - 22:50
One other question: is there an easy way of making LogArc and LogWeight operate in base 2 logs rather than in natural logs?

Roger
RogerLevy 05 Mar 2008 - 17:52
Unfortunately, they are no easy way to do this.

Cyril
CyrilAllauzen 06 Mar 2008 - 10:44
Got it. Also, I notice that RealWeight and RealArc are not yet implemented, is this correct? (they would be convenient for transparent post-hoc inspection of normalized log-semiring FSTs.)

Best

Roger
RogerLevy 08 Mar 2008 - 15:56
No RealWeight yet. The reason was we never used the real semiring for anything else than toy example with the FSM library. But, we might decide to add it to the next release since people seem to request it.

Best,
Cyril
CyrilAllauzen 11 Mar 2008 - 11:59

Natural code for printing all strings accepted by an FST?

I have written some code for printing all strings accepted by an FST together with their weights. (Clearly a bad idea in general, but useful right now for some tests I'm running with very small FSTs.) Since I am new to both C++ and OpenFst I wonder whether there is a more natural way to write this functionality. In particular I wonder:

1) is there a better way of checking whether a state is final?

2) whether there are better ways to implement the accumulation of weights along the path through the FST, perhaps by using a Weight class directly?

void printAllStrings(VectorFst<StdArc> &fst) {
  vector<int> str(0);
  printAllStringsHelper(fst, 0, str, 0.0);
}
void printAllStringsHelper(VectorFst<StdArc> &fst, int state, vector<int> str, float cost) {
  if(fst.Final(state).Value() < 100000)
    cout << vectorToString(str) << " with cost " << (cost + fst.Final(state).Value()) << endl;
  for(ArcIterator<VectorFst<StdArc> > aiter(fst,state); ! aiter.Done(); aiter.Next()) {
    StdArc arc = aiter.Value();
    str.push_back(arc.ilabel);
    printAllStringsHelper(fst, arc.nextstate, str, cost + arc.weight.Value());
    str.pop_back();
  }
}
string itos(int i) {   // convert int to string; from Bjarne Stroustrup's FAQ
  stringstream s;
  s << i;
  return s.str();
}
string vectorToString(vector<int> v) {
  if(v.size() == 0)
    return "<>";
  string result = "<" + itos(v[0]);
  for(int i = 1; i < v.size(); i++) {
    result +=  "," + itos(v[i]);
  }
  return result + ">";
}

Many thanks in advance for any pointers!

Roger
RogerLevy 01 Mar 2008 - 21:04
hmm...TWiki has done some damage here to my double-equals equality test, to my plus-equals assignments, and also has erased some templates...hopefully the code is still reasonably readable! I've posted it at

http://idiom.ucsd.edu/~rlevy/print_all_strings

Roger
RogerLevy 01 Mar 2008 - 21:09
1. Yes! There is a much better to test if a state is final.
if (fst.Final(state) == Weight::Zero()) // if state is not final 

Where you can replace Weight by the appropriate weight class (even better do a typedef of the current weight to Weight).

2. Indeed, you should try to use the Weight class operations. You should almost never need to use weight.Value(). Even cout << weight and cin >> weight should work since the operators << and >> are defined.

Remember, the operation to combine weight along a path is Times.
So in your code you should have
TropicalWeight cost
and
  Times(cost, arc.weight) 
P.S. I fixed the formatting of your original post
CyrilAllauzen 04 Mar 2008 - 12:12
Many thanks, Cyril, this is great!

Roger
RogerLevy 04 Mar 2008 - 22:47
Roger,
1. Do you have a final version of your code to print all paths with their weights?, and
2. Would you consider sharing it?
Thanks,

Ken
KennethRBeesley 11 Jun 2008 - 11:52
Hi Ken,

It looks like the current version of my code is pretty similar to the earlier version:

// a bad idea if there are loops :)
void printAllStrings(VectorFst &fst) {
vector str(0);
TropicalWeight tw(TropicalWeight::One());
printAllStringsHelper(fst, 0, str, tw);
}

void printAllStringsHelper(VectorFst &fst, int state, vector str, TropicalWeight cost) {
if(fst.Final(state) = TropicalWeight::Zero())
cout << vectorToString(str) << " with cost " << (Times(cost,fst.Final(state))) << endl;
for(ArcIterator<VectorFst > aiter(fst,state); ! aiter.Done(); aiter.Next()) {
StdArc arc = aiter.Value();
str.push_back(arc.ilabel);
printAllStringsHelper(fst, arc.nextstate, str, Times(cost, arc.weight.Value()));
str.pop_back();
}
}

string vectorToString(vector v) {
if(v.size()
0)
return "<>";
string result = "<" + itos(v[0]);
for(int i = 1; i < v.size(); i++) {
result +
"," + itos(v[i]);
}
return result + ">";
}

string itos(int i) { // convert int to string; from Bjarne Stroustrup's FAQ
stringstream s;
s << i;
return s.str();
}

I have a similar but separate function pair for LogArc FSTs. It would be nice to have a single function pair that works for any arc type, and except for the types the code would be the same, but I don't see how to do it because there isn't a common superclass for LogArc and StdArc. If someone else sees how to do this, let me know!

And please feel free to use this code :)

Roger
RogerLevy 11 Jun 2008 - 20:17
Ah, I also forgot -- I believe that arc.weight.Value() can be replaced by arc.weight, as per Cyril's suggestion earlier. RogerLevy 11 Jun 2008 - 20:18
Roger,
Many thanks for posting this code. One comment at this point: the initial call to printAllStringsHelper(fst, 0, str, tw) assumes that the start state is state 0. After combining and optimizing fsts, this may not be the case. Better to use
printAllStringsHelper(fst, fst.Start(), str, tw).

Ken
KennethRBeesley 13 Jun 2008 - 11:52
Roger,
Another possibility: generalization to print Input vs. Output strings. Your code currently uses arc.ilabel. You could do

enum Projection { PROJECT_INPUT, PROJECT_OUTPUT } ;

void printAllStrings(VectorFst &fst, Projection proj) {
...
printAllStringsHelper(fst, fst.Start(), str, tw, proj) ;
}

....

if (proj == PROJECT_INPUT) {
str.push_back(arc.ilabel) ;
} else {
str.push_back(arc.olabel) ;
}

Ken
KennethRBeesley 13 Jun 2008 - 12:15
Roger,

In testing for a final state, I think that your code is incorrect (or garbled in the posting). You have if(fst.Final(state) = TropicalWeight::Zero())
but (as I understand it) a state is final if the exit weight is not equal to Weight::Zero(), so if (fst.Final(state) = TropicalWeight::Zero()) with a not-equal operator, is what you want.
Straighten me out as necessary.

Ken
KennethRBeesley 13 Jun 2008 - 12:37
Roger,
I see that my "correction" got corrupted the same way when posted. The test should be
if (fst.Final(state) notEqualTo TropicalWeight::Zero()) where notEqualTo is the usual exclamation mark followed by an equal-sign.
Ken
KennethRBeesley 13 Jun 2008 - 14:04

Why is TropicalWeight considered standard?

In another word, why does StdArc use TropicalWeight instead of FloatWeight or LogWeight. Is there some material that explains the significance of TropicalWeight?

Also, what's the reason that there are Plus(), Times() and Divide() implemented for FloatWeight()? Because it is too trivial? smile
JiaPu 29 Feb 2008 - 18:57
sorry, type. It should read:

".... there are NOT Plus() ..."
JiaPu 29 Feb 2008 - 18:58
sorry, typo. It should read:

".... there are NOT Plus() ..."
JiaPu 29 Feb 2008 - 18:58
StdArc use TropicalWeight and LogArc use LogWeight. StdArc could have been call TropicalArc, but since it is the most commonly used [it correspond to the Viterbi approximation] a shorter name like StdArc was prefered. If you don't like it, you can still do:
 typedef StdArc TropicalArc;

FloatWeight is just an abstract class from which LogWeight and TropicalWeight are derived. It does not correspond to a semiring, that's why it does not have Plus(), Times(), ...
CyrilAllauzen 04 Mar 2008 - 11:43
For the significance of the tropical semiring, you can check out the papers in FstBackground or this paper or that one. CyrilAllauzen 04 Mar 2008 - 11:50
Cyril,

Please give me an algebra 101. According to the definition of semiring on wiki, float (or rational number) is a semiring. And when authorizing finite state grammar for speech recognition, it is more natural to use float as weight. Although, internally log weight is preferred.
JiaPu 04 Mar 2008 - 15:23
Float in FloatWeight only means than the underlying representation is a C/C++ float type. TropicalWeight and LogWeight both use C/C++ float to represent real numbers, hence they are both derived from the partially abstract type FloatWeight. Of course, the real number with as operation the usual plus and times is also a semiring. It is not implemented in the library for now because it is not very commonly used. We might add it in a future version. If we do, it will also be as a type derived from FloatWeight.

Float just defines the set, it is not enough to define a semiring. Different times and plus operations would lead to different semiring sharing the same set of elements.

Finally, you didn't have to all the way to wikipedia, there is some documentation on here on this website.
CyrilAllauzen 04 Mar 2008 - 19:10

Compiler error on Determinize().

When compiling following code:

StdVectorFst in_fst;
... add some states and arcs to in_fst ...
StdVectorFst out_fst
Determinize(fst, out_fst);

I got compiler error:
/Volumes/Data/Projects/OpenFst/macosx/Corpus2SphinxFsg.cpp:82: error: no matching function for call to 'Determinize(fst::StdVectorFst&, fst::StdVectorFst&)'

Since StdVectorFst is derived from Fst, I don't understand that why this doesn't compile.
JiaPu 28 Feb 2008 - 12:48
Aha, never mind. Just realize that Determinize() expects a pointer instead of a reference on the second argument. But why can't we use reference on both arguments? JiaPu 28 Feb 2008 - 13:18
This is one of our coding conventions: input arguments are declared as const references and output arguments are declared as pointers. CyrilAllauzen 29 Feb 2008 - 14:26

New Release of OpenFst?

Are there any plans for a new release of OpenFst?
If so, I have some requests ...

Ken
KennethRBeesley 06 Feb 2008 - 17:20
There will be a new release of OpenFst. You're welcome to make some suggestions. CyrilAllauzen 08 Feb 2008 - 15:20
Cyril,

Very glad to hear about the new release.

I respectfully request the following:

void Copy(&In, *Out) // create a deep copy of the input automaton

Convenience boolean functions for testing an existing network:

isLabelSorted(&Fst, side)
isFunctional
isAcceptor
isWeighted
isEpsilonFree
isDeterministic

Finally, a new text input-output format that is Unicode-capable. I'll try to
write up some more specific suggestions for an XML-based notation.

Thanks for all your good work,
Ken
KennethRBeesley 13 Feb 2008 - 23:53
Hi Ken,

A deep copy can already be performed using Map() with the IdentityMapper, however this might not be obvious at first so a Copy() could indeed be something useful.
StdVectorFst in;
StdVectorFst out;
Map(in, &out, IdentityMapper<StdArc>());
For the boolean functions, I'm not so sure that that:
if(isAcceptor(fst)) ...
is much simpler than:
if(fst.Properties(kAcceptor, true)) ...

Best,
Cyril
CyrilAllauzen 29 Feb 2008 - 12:48
For functional, the issue is that the best algorithms to test for functionality are quadratic in the number of transitions. This prevented us from using it as a property bits, that's why there are no kFunctional. CyrilAllauzen 29 Feb 2008 - 12:53
Thanks Cyril,

Your fst.Properties(kAcceptor, true) example was what I needed to get oriented. No boolean function isAcceptor(fst) is needed. Other beginners like me might want to look in fst/lib/properties.h for more property names, including:

kAcceptor
kNotAcceptor

kWeighted
kNotWeighted

kEpsilons
kNoEpsilons

kIDeterministic (deterministic on the input side)
kNonIDeterministic

kODeterministic (deterministic on the output side)
kNonODeterministic

kILabelSorted (sorted with regard to input labels)
kNotILabelSorted

kOLabelSorted (sorted with regard to output labels)
kNotOLabelSorted
KennethRBeesley 29 Feb 2008 - 17:06

Another newbie question

I'm having trouble compiling another example from the Quick Tour page,
to get myself going with FST composition. here it is:

#include "fst/lib/fstlib.h"

int main()
{

using fst::StdVectorFst;
using fst::StdFst;
using fst::StdOLabelCompare;
using fst::StdILabelCompare;
using fst::ArcSort;

StdFst *input = StdFst::Read("input.fst");

StdFst *model = StdFst::Read("model.fst");

ArcSort(input, StdOLabelCompare());
ArcSort(model, StdILabelCompare());

StdVectorFst result;

Compose(*input, *model, &result);

}

Trying to compile gives me the following error:


rlevy@truffle:~/src/C++$ g++ -I /local/contrib/cpl/OpenFst Example1.cpp /local/contrib/cpl/OpenFst/fst/lib/libfst.so -lpthread
Example1.cpp: In function ‘int main()’:
Example1.cpp:16: error: no matching function for call to ‘ArcSort(fst::StdFst*&, fst::StdOLabelCompare)’
Example1.cpp:17: error: no matching function for call to ‘ArcSort(fst::StdFst*&, fst::StdILabelCompare)’

It's not obvious to me why it's not matching the function properly -- any ideas?

Many thanks!

Roger
RogerLevy 01 Feb 2008 - 01:14
This does not work because the input of ArcSort needs to be a mutable fst (of type MutableFst). Here, input and model are of type StdFst i.e. Fst and not MutableFst.

There are several solutions for this:
1. Restrict the program to only accepts as input VectorFst which is mutable:
StdVectorFst *input = StdVectorFst::Read("input.fst");
StdVectorFst *model = StdVectorFst::Read("model.fst");
Simple but not general.

2. Copy model and input into a VectorFst before sorting:
StdVectorFst input2(*input);
ArcSort(&input2, StdOLabelCompare());

3. Use on-the-fly sorting:
ArcSortFst StdOLabelCompare> input2(*input, StdOLabelCompare());

Best,
Cyril
CyrilAllauzen 29 Feb 2008 - 13:19
I am getting the same error, except I have already declared mutable FSTs. I would really appreciate some help! My code is of the form:

#include "fst/fstlib.h"

using namespace std;
using namespace fst;

int main()
{
StdVectorFst fst1;
StdVectorFst fst2;
StdVectorFst fst3;

ArcSort(fst1, StdOLabelCompare());
fst3 = ComposeFst(fst1, fst2);

return 0;
}

Specifically, the compile error is:

no matching function for call to 'ArcSort(fst::StdVectorFst&, fst::StdOLabelCompare)'


Thank you very much,
Jonathan
JonathanB 16 Oct 2009 - 12:03

Access control:

Edit | Attach | Watch | Print version | History: r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r1 - 2014-04-21 - CyrilAllauzen
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback