1.6.3. Speech Recognition Tool

Sphinx 4 is an open source speech recognition toolkit, written in Java.

It can recognize speech from audio files or from live microphone input.

It needs some resources to work, such as speech models or a grammar. You can supply your own or simply use the provided language models.

You will need an acoustic model, a dictionary, and either a language model or a grammar. Using a grammar excludes using a language model. These parameters are set via the configuration type.
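The rule above can be written down as a small check. The following Python sketch is only an illustration: the function name and its parameters are hypothetical and not part of the tool, which takes these settings from its configuration.

```python
def check_resources(acoustic_model, dictionary, language_model=None, grammar=None):
    """Return which decoding resource is in use, or raise ValueError.

    Hypothetical helper: it mirrors the rule from the text, not the tool's code.
    """
    if not acoustic_model:
        raise ValueError("an acoustic model is required")
    if not dictionary:
        raise ValueError("a dictionary is required")
    if language_model and grammar:
        raise ValueError("a grammar excludes a language model; pick one")
    if not language_model and not grammar:
        raise ValueError("either a language model or a grammar is required")
    return "grammar" if grammar else "language model"
```

Calling it with both a language model and a grammar raises an error, which matches the mutual exclusion described above.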

You can also use a modified Sphinx configuration to adjust the recognizer.

1.6.3.2. Interfaces

Input/Output:

Attention

The sound files you want to recognize must be mono .wav files with a sample rate of 16 kHz!
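If you are unsure whether a file meets this requirement, you can check it before starting the recognizer. The following is a minimal sketch using Python's standard wave module. The helper name check_wav_format is hypothetical and not part of the tool.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file fits."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
    except (wave.Error, EOFError) as exc:
        # Raised when the file is not a readable PCM WAV file.
        return [f"not a readable WAV file: {exc}"]
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

For example, check_wav_format("/root/datafolder/file01.wav") returns an empty list if the file is a 16 kHz mono recording.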

  • Sphinx recognizes speech from sound files and stores the result in a text file, which you will find in the same folder as the recognized file.
  • If you use microphone input, the result is stored at the location given by the -d (data-path) property.

The speech-recognition-tool can publish the speech recognition results via RSB:

Scope (Informer) Type
/scope/scope SpeechHypotheses
  • The result sent via RSB holds up to five best results.
  • The best result comes with timestamp tags for the single words and for the whole hypothesis.
  • The timestamps give the position at which the recognized word appears in the source data.
  • If microphone input is used instead of source data, each timestamp is the time at which the word occurred since the program was started, plus the system time at which the tool was started.
  • A grammar tree is generated if the result was produced with a grammar.
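The two timestamp cases can be sketched as follows. This is an illustration only: the function name is hypothetical, and the unit (milliseconds) is an assumption, since the unit is not stated here.

```python
def word_time_ms(offset_ms, tool_start_ms=None):
    """Return the timestamp reported for a recognized word.

    offset_ms:     position of the word relative to the start of the source
                   data (file input) or to the program start (microphone).
    tool_start_ms: system time at which the tool was started; pass it only
                   for microphone input. Milliseconds are an assumed unit.
    """
    if tool_start_ms is None:
        # File input: the timestamp is the position inside the source data.
        return offset_ms
    # Microphone input: offset since program start plus the start time.
    return tool_start_ms + offset_ms
```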

1.6.3.3. Language Model Information

The language of the provided language models is German.

The two speech models “Verbmobil” and “Cocolab” can be found at speechmodel-repo and are installed at <prefix>/etc/speechconfig/. The Verbmobil speech model is the default.

1.6.3.4. Examples

The following scenario is the basis for all examples below. You have a folder /root/datafolder/ with the contents shown in the table below.

/root/datafolder/
file01.wav file02.wav file03.mp3 file04.wav
newer_folder new_folder crypt_folder file05.wav
log.log picture.png file06.wav trash_folder

You have stored sound files containing speech in this folder.

Example 1:

Now you want to use Sphinx to recognize them.

To start Sphinx, you can use ./sphinxrecognizer -d /root/datafolder/. The -d parameter tells it where your audio data is.

Sphinx will then look in /root/datafolder/ for .wav files and try to recognize them one after another. It ignores everything that does not have the suffix .wav.

When it has finished, you will find a new file called parentFileName_dateAndTime that contains the recognized text. Result files are created in the same folder as the recognized data.

Example 2:

You also have some sound files stored in new_folder, newer_folder and crypt_folder. In that case you could use ./sphinxrecognizer -d /root/datafolder/ -r true.

It will then also look for .wav files in all underlying directories.

Example 3:

You also know that trash_folder does not contain any interesting data. So you could extend your command line to ./sphinxrecognizer -d /root/datafolder/ -r true -e trash_folder.
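To make the effect of -d, -r and -e in Examples 1 to 3 easier to picture, here is a rough Python sketch of the file selection they describe. It is not the tool's implementation: the function find_wav_files is hypothetical, and it assumes that excluded folders are matched by their folder name.

```python
import os


def find_wav_files(data_path, recursive=False, excluded=()):
    """Collect .wav files below data_path, roughly like -d, -r and -e."""
    found = []
    for current, dirnames, filenames in os.walk(data_path):
        if recursive:
            # Prune excluded folders in place so os.walk never enters them.
            dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        else:
            # Without recursion, stay in the top-level folder.
            dirnames[:] = []
        for name in sorted(filenames):
            if name.endswith(".wav"):  # everything else is ignored
                found.append(os.path.join(current, name))
    return found
```

Under these assumptions, find_wav_files("/root/datafolder/") corresponds to Example 1, adding recursive=True to Example 2, and adding excluded=("trash_folder",) to Example 3.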

Example 4:

If you want the speech hypotheses to be published via RSB, simply use ./sphinxrecognizer -d /root/datafolder/ -s /scope and the tool will publish its results via RSB on /scope.

Example 5:

You want to recognize live microphone input. To do so, start the tool with ./sphinxrecognizer -d /root/resultFolder/ -i true. You will then see that the tool is waiting for your input. Once it has recognized a whole phrase, it stores the result in a file in your result path. It keeps listening until you close the tool.

1.6.3.5. Things to keep in mind

  • If Sphinx gives you no results, consider the following possible causes:
    • sound file in the wrong format
    • sound file too noisy
    • sound file too long (speech blurred)
  • Using a grammar is notably faster than using a language model.

  • Grammar usage excludes language-model usage.

  • Sphinx has big problems recognizing files that contain noise.

  • The sound files you want to recognize must be mono .wav files with a sample rate of 16 kHz!

  • If microphone input doesn’t work correctly, check the system default audio input device.