Pronunciations Lab -- Using OpenFST

Overview

In this lab we'll be learning about pronunciation modeling with finite state transducers.

Begin by downloading pronun_lab.tgz. Then unzip the tarball for the lab into your working directory:

  tar -xzvf pronun_lab.tgz 

  cd pronun_lab/

Find the OpenFST tools on your machine:

  which fstcompile

or download them. Add the location of the OpenFST tools to your PATH variable:

  export PATH=/home/$username/bin/OpenFST/fst/bin/:$PATH

Finite-State Transducer Dictionaries

Grapheme Dictionary

For this first part of the lab, we are going to pretend that letters are the same as phonemes, and map (English) words to letters to build a fake-phonetic dictionary (a grapheme dictionary).

Consider the toy wordlist in toy.wordlist. Using this wordlist, and pretending that each letter represents a phone, we can create a fake pronunciation dictionary, toy.lex. Take a look at this dictionary, and note that it can be easily generated from our wordlist. (You may also note that our dictionary contains a few special symbols: <s> and </s>, which indicate the start and end of a sentence, and <unk>, which indicates an unknown token. These special symbols are given special "pronunciation" symbols as well.)
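As a rough illustration of how such a dictionary could be generated, here is a minimal sketch that spells each word out letter by letter. The helper name word_to_lex_line and the "word l1 l2 ..." line layout are assumptions made for this example; check toy.lex for the layout actually used, and note that the special symbols would need their own hand-written entries.

```shell
# Hypothetical helper: spell a word as space-separated letters,
# e.g. "John" -> "John J o h n". The real toy.lex layout may differ.
word_to_lex_line() {
  printf '%s\n' "$1" | awk '{
    line = $1
    n = length($1)
    for (i = 1; i <= n; i++) line = line " " substr($1, i, 1)
    print line
  }'
}

# Demo on a made-up three-word list (not the contents of toy.wordlist).
for w in John Jacob Smith; do
  word_to_lex_line "$w"
done
```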

From this dictionary, we would like to compile a transducer that maps from words to letters and vice versa. First, we can produce a text format for the transducer, toy.fsm.txt, with words on the input side and letters on the output side. Take a look at toy.fsm.txt, and note that it can also be easily generated from our dictionary file.
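The conversion from dictionary lines to transducer arcs might look something like the sketch below. The arc layout is an assumption chosen for illustration (state 0 serves as both start and final state, the word sits on the first arc, and the remaining arcs have <eps> on the input side); the provided toy.fsm.txt may arrange its arcs differently. The function name lex_to_fsm is hypothetical.

```shell
# lex_to_fsm: read "word l1 l2 ..." lines on stdin, write arcs on stdout.
# Assumed layout: state 0 is both start and final state; the first arc of
# each word carries the word as input, later arcs carry <eps>.
lex_to_fsm() {
  awk '
    BEGIN { next_state = 1 }
    NF >= 2 {
      src = 0
      for (i = 2; i <= NF; i++) {
        # the last letter of a word returns to state 0
        dst = (i == NF) ? 0 : next_state++
        ilabel = (i == 2) ? $1 : "<eps>"
        print src, dst, ilabel, $i
        src = dst
      }
    }
    END { print 0 }   # a line with only a state number marks it as final
  '
}

# Demo on two made-up dictionary lines.
printf 'John J o h n\nJacob J a c o b\n' | lex_to_fsm
```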

Compiling & Printing Transducers

For this part of the lab we'll be using the OpenFST tools. Hopefully you'll be able to follow our examples to complete the lab, but you can also consult the OpenFST "Quick Tour" for help in using these tools. In particular, we will be using fstcompile and fstcompose.

Draw the automaton represented by our toy.fsm.txt file. Note that each line in toy.fsm.txt represents a transition from one state to another. States are identified by number: the first column is the source state and the second column is the destination state. The transition's input label is in the third column, and its output label is in the fourth column.
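If it helps with the drawing, a small filter can restate each line of the text format in words. This is only a sketch: the function name describe_arcs is made up, and the sample arcs are invented rather than taken from toy.fsm.txt. It relies on the text-format convention that a line holding just a state number (optionally followed by a weight) marks a final state.

```shell
# Turn each arc line into a readable description; a line with one or two
# columns (state, optional weight) marks a final state.
describe_arcs() {
  awk '
    NF >= 4 { printf "state %s --%s:%s--> state %s\n", $1, $3, $4, $2 }
    NF >= 1 && NF <= 2 { printf "state %s is final\n", $1 }
  '
}

# Made-up sample arcs (not the contents of toy.fsm.txt).
printf '0 1 John J\n1 2 <eps> o\n2\n' | describe_arcs
```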

Compile the transducer using fstcompile, using our toy.wordlist and toy.letterlist files, as follows:

  fstcompile --isymbols=toy.wordlist --osymbols=toy.letterlist --keep_isymbols --keep_osymbols toy.fsm.txt toy.trans.fst

The OpenFST toolkit also includes tools for generating visual representations of automata and transducers. fstdraw writes a Graphviz description of the FST, which dot then renders as PostScript:

  fstdraw --isymbols=toy.wordlist --osymbols=toy.letterlist toy.trans.fst | ./dot -Tps > toy.ps 

  ps2pdf toy.ps

Take a look at toy.pdf.

  acroread toy.pdf &

Does it look right? Is there anything weird about it?

Transducers as Acceptors (Automata)

One thing that can be done with transducers is to project them onto either their input labels or their output labels with fstproject. This turns the transducer into an acceptor that preserves only the labels on the specified side.

First, compose our transducer with an example string. Note that the order of the transducers matters, since fstcompose matches the output labels of the first transducer against the input labels of the second:

  echo "John Jacob Smith" | ./str2fst.pl toy.wordlist | \
    fstcompile --isymbols=toy.wordlist --osymbols=toy.wordlist --keep_isymbols --keep_osymbols | \
    fstcompose - toy.trans.fst string1.trans.fst

Now string1.trans.fst is a tiny transducer just for our example string, and we can use fstproject to turn our transducer into a word-acceptor:

  fstproject string1.trans.fst | fstprint --isymbols=toy.wordlist

or a letter-acceptor:

  fstproject --project_output string1.trans.fst | fstprint --isymbols=toy.letterlist

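by projecting on either the input or the output labels. (Recent OpenFST releases spell the second form --project_type=output instead of --project_output.)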
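To see what projection does at the level of individual arcs, here is a sketch that mimics it on the text format rather than on a compiled FST: every arc keeps one of its two labels and copies it to the other side. The function name project_text is hypothetical, and this is an illustration of the idea, not a replacement for fstproject.

```shell
# project_text SIDE: SIDE is "input" or "output". Reads text-format arcs on
# stdin and copies the chosen label onto both sides of every arc.
project_text() {
  awk -v side="$1" '
    NF >= 4 { lab = (side == "output") ? $4 : $3; $3 = lab; $4 = lab }
    { print }   # final-state lines pass through unchanged
  '
}

# Made-up sample arcs: keep only the letter side.
printf '0 1 John J\n1 2 <eps> o\n2\n' | project_text output
```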

What happens if you fstcompose toy.trans.fst with "John Jacob Jingleheimer Schmidt"?

  echo "John Jacob Jingleheimer Schmidt" | ./str2fst.pl toy.wordlist | \
    fstcompile --isymbols=toy.wordlist --osymbols=toy.wordlist --keep_isymbols --keep_osymbols | \
    fstcompose - toy.trans.fst | fstproject --project_output - | fstdraw | ./dot -Tps > jjjs.ps

  ps2pdf jjjs.ps

  acroread jjjs.pdf &

You should get an FST whose arcs read "J o h n <eps> J a c o b <eps> <unk> <eps> <unk>", because our transducer accepts "John" and "Jacob" as known input tokens but can only accept "Jingleheimer" and "Schmidt" as unknown (<unk>) tokens.
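The replacement of out-of-vocabulary words can be pictured with the sketch below, which maps every word that is missing from a symbol table to <unk>. This is only a guess at the kind of lookup str2fst.pl performs, not its actual code; the function name map_unknown and the "symbol id" table layout are assumptions.

```shell
# map_unknown SYMS: read sentences on stdin and replace every word that is
# not listed in the symbol table SYMS (assumed "symbol id" per line).
map_unknown() {
  awk -v syms="$1" '
    BEGIN {
      while ((getline line < syms) > 0) { split(line, f, " "); known[f[1]] = 1 }
    }
    {
      for (i = 1; i <= NF; i++) if (!($i in known)) $i = "<unk>"
      print
    }
  '
}

# Demo with a made-up two-word symbol table.
tmp=$(mktemp -d)
printf '<eps> 0\nJohn 1\nJacob 2\n' > "$tmp/demo.wordlist"
echo "John Jacob Jingleheimer Schmidt" | map_unknown "$tmp/demo.wordlist"
rm -rf "$tmp"
```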

Editing Transducers

Now, let's improve our transducer a bit. Add "Jingleheimer" and "Schmidt" to our transducer by editing the toy.fsm.txt file. (You could also add your name, or any other words you'd like!) Don't forget to update the toy.wordlist and toy.letterlist files as well. It'll be faster if you use some scripting to update the files, but you could edit them by hand as well...
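Before recompiling, it can save time to check that every label you added to toy.fsm.txt also appears in the matching symbol file. The sketch below does that on the text files alone; the function name missing_symbols is hypothetical, and it assumes the symbol files hold one "symbol id" pair per line.

```shell
# missing_symbols SYMS COL: read text-format arcs on stdin and print each
# label in column COL (3 = input, 4 = output) that SYMS does not list.
missing_symbols() {
  awk -v syms="$1" -v col="$2" '
    BEGIN {
      while ((getline line < syms) > 0) { split(line, f, " "); known[f[1]] = 1 }
    }
    NF >= 4 && !($col in known) && !seen[$col]++ { print $col }
  '
}

# Demo with a made-up symbol table that lacks "Schmidt".
tmp=$(mktemp -d)
printf '<eps> 0\nJohn 1\n' > "$tmp/demo.wordlist"
printf '0 1 John J\n0 2 Schmidt S\n' | missing_symbols "$tmp/demo.wordlist" 3
rm -rf "$tmp"
```

With the lab files, this would be run as, for example, missing_symbols toy.wordlist 3 < toy.fsm.txt; no output means nothing is missing.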

Once you've done that, fstcompile your new transducer and use fstdraw to take a look at it.

  fstcompile ... toy.fsm.txt toy2.trans.fst 

  fstdraw | ./dot -Tps > toy2.ps

(Fill in the blanks on your own, using what you've learned about the OpenFST tools!)

Now what happens when you fstcompose toy2.trans.fst with "John Jacob Jingleheimer Schmidt", then project onto the output labels and draw the result?

  echo "John Jacob Jingleheimer Schmidt" | ./str2fst.pl toy.wordlist | \
    fstcompile ... | fstcompose - toy2.trans.fst | fstproject --project_output - | fstdraw | ./dot -Tps > jjjs2.ps

  ps2pdf jjjs2.ps

  acroread jjjs2.pdf &

(Once again, fill in the blanks on your own...!)

Larger Transducers

We could use a much larger wordlist, wsj.wordlist (extracted from sections 2-21 of the Wall Street Journal portion of the Penn Treebank), to build a word-to-letter transducer for a larger vocabulary:

  fstcompile --isymbols=wsj.wordlist --osymbols=wsj.letterlist --keep_isymbols --keep_osymbols wsj.fsm.txt wsj.trans.fst

How many states are in this FST? (Figure this out from the wsj.fsm.txt file, not by trying to fstdraw the transducer!) How many more states is that than in our toy FST?
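One way to count states from the text format alone is to collect every distinct state number that appears as a source, a destination, or a final state. The sketch below does this; the function name count_states is made up, and the sample arcs are invented.

```shell
# count_states: read text-format arcs on stdin and print the number of
# distinct state ids seen in the source and destination columns or on
# final-state lines.
count_states() {
  awk '
    NF >= 4 { s[$1] = 1; s[$2] = 1 }
    NF >= 1 && NF <= 2 { s[$1] = 1 }
    END { n = 0; for (k in s) n++; print n }
  '
}

# Made-up sample with states 0, 1 and 2.
printf '0 1 John J\n1 2 <eps> o\n2 0 <eps> h\n0\n' | count_states
```

On the lab data this would be run as count_states < wsj.fsm.txt.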

Even though the WSJ FST is much larger, it doesn't help us a lot with our unknown-word problem from before:

  echo "John Jacob Jingleheimer Schmidt" | ./str2fst.pl wsj.wordlist | \
    fstcompile --isymbols=wsj.wordlist --osymbols=wsj.wordlist --keep_isymbols --keep_osymbols | \
    fstcompose - wsj.trans.fst | fstproject --project_output - | fstdraw | ./dot -Tps > jjjs3.ps

  ps2pdf jjjs3.ps

  acroread jjjs3.pdf &

Why do we still have an <unk> in our output?

Phonetic Dictionary

Now we will produce a real pronunciation dictionary!

Take a look at the PronLex Dictionary, pronlex_arpabet.txt.gz. It is in essentially the same format as our toy dictionary toy.lex, except there are phones instead of letters. The phones are the space-delimited symbols starting in the second column.

Find the PronLex dictionary entry for the word 'fulllength.' Note that there are six pronunciations provided for this word. Can you find other kinds of ambiguity in this dictionary that do not exist in our toy dictionary from Part 1 of the lab?
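To look for words with several pronunciations, you could count how many dictionary lines share the same first column. The sketch below assumes one pronunciation per line with the word in the first column, as described above; the function name count_prons is hypothetical and the sample lines are invented, not taken from PronLex.

```shell
# count_prons: read "word phone phone ..." lines on stdin and list every
# word that occurs on more than one line, most pronunciations first.
count_prons() {
  awk '{ n[$1]++ } END { for (w in n) if (n[w] > 1) print n[w], w }' |
    sort -k1,1nr -k2,2
}

# Made-up sample dictionary lines.
printf 'read r iy d\nread r eh d\ncat k ae t\n' | count_prons
```

With the lab dictionary, something like zcat pronlex_arpabet.txt.gz | count_prons | head would show the most ambiguous entries.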

Create a wordlist and a phonelist from your PronLex dictionary. (Don't forget our special symbols, <s>, etc.) Yes, this will require some scripting. If you need help, check out my Unix scripting tutorial.
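As a starting point for that scripting, here is a sketch that numbers each distinct word or phone in order of first appearance, reserving 0 for <eps> as OpenFST symbol tables conventionally do. The function name make_symtab and the "symbol id" output layout are assumptions; compare the result against toy.wordlist and toy.letterlist, and remember that the special symbols still have to be added if the dictionary does not list them.

```shell
# make_symtab WHICH: WHICH is "words" or "phones". Reads dictionary lines
# ("word phone phone ...") on stdin and prints "symbol id" pairs.
make_symtab() {
  awk -v which="$1" '
    BEGIN { print "<eps> 0"; n = 1 }
    {
      first = (which == "words") ? 1 : 2
      last  = (which == "words") ? 1 : NF
      for (i = first; i <= last; i++)
        if (!($i in seen)) { seen[$i] = 1; print $i, n++ }
    }
  '
}

# Made-up sample dictionary lines.
printf 'cat k ae t\ndog d ao g\n' | make_symtab phones
```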

Convert our dictionary into FST text-format:

  zcat pronlex_arpabet.txt.gz | pronlex2fst.pl > pronlex.fsm.txt

Then compile it into a transducer, which we'll call pronlex.trans.fst. (Please don't try to optimize (i.e. invert, push, determinize) pronlex.trans.fst.)

  fstcompile --isymbols=pronlex.wordlist --osymbols=pronlex.phonelist --keep_isymbols --keep_osymbols pronlex.fsm.txt pronlex.trans.fst

Now compose the following string with this new FST:
"july 3rd is the last day of summer school for the 2008 clsp johns hopkins workshop attendees"
(Note that the PronLex dictionary always expects only lowercase letters.)

  echo "<your string>" | ./str2fst.pl pronlex.wordlist | \
    fstcompile --isymbols=pronlex.wordlist --osymbols=pronlex.wordlist --keep_isymbols --keep_osymbols | \
    fstcompose - pronlex.trans.fst | fstproject --project_output - | fstdraw | ./dot -Tps > pronlex1.ps

  ps2pdf pronlex1.ps

  acroread pronlex1.pdf &

(You'll have to zoom in to see the transducer in the pdf file; you may find it more convenient to use fstprint rather than fstdraw to "look at" the transducer.)

Is there more than one resulting phone string? You can tell where the ambiguity lies by looking for the places where there is more than one path through the transducer. You could also try using fstrandgen:

  fstrandgen --help

(Try to figure this one out on your own.)

Could you have predicted, just from the pronlex_arpabet.txt.gz dictionary, that there would be more than one possible phone string? What does that tell us about pronunciation modeling?

Now go back and look at the output from str2fst.pl. Are many words "<unk>"? Which ones? Can you explain for some of them why they wouldn't be in a speech recognizer dictionary?