1.6.2. Speech Recognition

The speech recognition component is based on CMU Sphinx-4 and is part of the incremental speech processing toolkit InproTK. It produces Dialog Acts or Speech Hypotheses by keyword spotting on the ASR results. At present there is one dialog flow configuration for each interaction island (/citec/csra/home/kitchen/assistance and /citec/csra/home/hallway/entrance).

1.6.2.2. Interfaces

Scope (Listener)                                     Type
<location>/<microphone>/audio/in/16bit/16000Hz/LE    Sound Chunk

The component publishes its recognition results via RSB:

Scope (Informer)              Type
<location>/dialogact          Dialog Act
<location>/speechhypotheses   Speech Hypotheses

1.6.2.3. Speech recognition visualizations

The speech recognition component opens several visualization windows on startup. The first window shows the current speech hypothesis of the running recognition. The second window visualizes the speech state (voice activity detection); the recognition can be paused by clicking the red circle button. The last window shows the prosody monitor.

../../_images/inpro1.png

The current speech hypothesis.

../../_images/inpro2.png

Voice activity detection.

../../_images/inpro3.png

Prosody monitor.

1.6.2.4. Examples

1.6.2.4.1. React to a human greeting (in the hallway)

 // RSB listener on the dialog act scope of the hallway entrance
 Listener listener = Factory.getInstance().
     createListener("/citec/csra/home/hallway/entrance/dialogact");
 listener.activate();

 // Add a local event handler
 listener.addHandler(new TaskHandler(){
     @Override
     public void internalNotify(Event event) {

         // Only handle dialog act types
         if (event.getData() instanceof DialogActType.DialogAct) {
             DialogActType.DialogAct dialogact = (DialogActType.DialogAct) event.getData();

             // Only react to final results
             if (dialogact.getIU().getEdittype().equals(DialogAct.EditType.COMMIT)) {

                 switch(dialogact.getType()){
                     case GREET:
                         System.out.println("Greeting");
                         break;
                     case GOODBYE:
                         System.out.println("Goodbye");
                         break;
                     default:
                         System.out.println("something else");
                 }
             }
         }
     }
 }, true); // true: wait until the handler is installed

1.6.2.4.2. Configure own speech recognition

  1. Check out the configuration project:
git clone -b minimal https://projects.cit-ec.uni-bielefeld.de/git/lsp-csra.inprotk-conf.git
  2. Change the grammar in the config folder:

src/main/resources/de/unibi/agai/inproapp/config/test.gram

  3. Change configurations, e.g., the output scope in the iu-config.xml:
 <component name="speechHypInformer" type="de.unibi.agai.inproapp.module.RSBSpeechHypInformer">
     <property name="should_always_publish" value="true"/>
     <property name="speechhypscope" value="/citec/csra/home/hallway/entrance/speechhypotheses"/>
     <property name="dialogactscope" value="/citec/csra/home/hallway/entrance/dialogacts"/>
     <property name="scope_situation" value="/home/addressee/result"/>
     <property name="grammarLocation" value="${grammarLocation}"/>
     <property name="grammarName" value="${grammarName}"/>
     <propertylist name="hypChangeListeners">
     </propertylist>
 </component>

1.6.2.4.3. Create own JSGF

In this example we will create a “simpleCommand.gram” grammar. The grammar is defined in a file with the .gram extension and consists of two parts, the header and the body. The header itself consists of up to three parts:

self-identification:

  • looks like: #JSGF version char-encoding locale;
  • Example: #JSGF V1.0 UTF-8 de;
  • “#JSGF” is required and “version char-encoding locale” is optional.

grammar-name:

  • looks like: grammar grammarName; or grammar packageName.grammarName;
  • Example: grammar simpleCommand;
  • The grammar-name is required.

imports:

  • looks like: import <fullyQualifiedRuleName>; or import <fullGrammarName.RuleName>;
  • Example: import <com.sun.speech.app.numbers.*>;
  • The grammar header can optionally include import declarations. An import declaration allows one or all of the public rules of another grammar to be referenced locally.
  • In the CSRA all grammars need to be in the same folder for imports!

The complete header would look like this:

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

The body contains the rules of the grammar. Each rule should be defined only once; a duplicate definition overwrites the earlier one. The order in which rules are defined is not significant. We structure the rules by DialogActTypes, so that a single public rule lists these DialogActTypes. Only the first public rule can serve as the entry point! Any further public rules can only be imported into other grammars.

Step by step we will write some rules for this grammar. The patterns for rule definitions are:

<ruleName> = ruleExpansion ;
public <ruleName> = ruleExpansion ;

The components of a rule definition are an optional public declaration, the name of the rule being defined, an equals sign ‘=’, the expansion of the rule, and a closing semicolon ‘;’. The rule expansion defines how the rule may be spoken. It is a logical combination of tokens (text that may be spoken) and references to other rules. Let's define a simple rule to say “Hello Flobi”:

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> ;
<greet> = Hello Flobi ;
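
To make the parts of a rule definition concrete, here is a minimal Java sketch that splits a single definition line into visibility, rule name and expansion. It is only an illustration of the pattern above and is not part of the speech recognition component or of Sphinx; the class name RuleDefinitionSketch and its regular expression are assumptions made for this example, and the expansion is not parsed any further.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: splits one JSGF rule definition into its components. */
final class RuleDefinitionSketch {

    // optional "public", rule name in angle brackets, "=", expansion, closing ";"
    private static final Pattern RULE =
            Pattern.compile("^\\s*(public\\s+)?<([^<>]+)>\\s*=\\s*(.*?)\\s*;\\s*$");

    /** Returns {visibility, ruleName, expansion}, or null if the line is not a rule definition. */
    static String[] parse(String line) {
        Matcher m = RULE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // a rule without the "public" keyword is only visible inside its own grammar
        String visibility = (m.group(1) != null) ? "public" : "private";
        return new String[] {visibility, m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String[] lines = {
            "public <simpleCommand> = <greet> ;",
            "<greet> = Hello Flobi ;",
            "<greet> = Hello Flobi"   // closing semicolon missing: rejected
        };
        for (String line : lines) {
            String[] parts = parse(line);
            System.out.println(parts == null
                    ? "not a rule definition: " + line
                    : String.join(" | ", parts));
        }
    }
}
```

The third line is rejected because the closing semicolon is missing, which is an easy mistake to make when editing a .gram file by hand.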

Next we want to greet more than one robot. A rule can be defined as a set of alternative expansions separated by vertical bar characters ‘|’.

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> ;
<greet> = Hello Flobi | Hello Meka ;

We could also use parentheses and alternatives to make it more elegant.

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> ;
<greet> = Hello (Flobi | Meka) ;

A rule expansion can also refer to another rule. So we could create a rule to contain all the robot-names.

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> ;
<greet> = Hello <robots> ;
<robots> = Flobi | Meka ;

Now we can say either “Hello Flobi” or “Hello Meka”, but not simply “Hello”. To allow that, we can use optional grouping: square brackets may be placed around any rule expansion to indicate that its contents are optional.

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> ;
<greet> = Hello [<robots>] ;
<robots> = Flobi | Meka;
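
The set of utterances this small grammar accepts can be mimicked with a regular expression, which makes the effect of the square brackets easy to check. The following Java sketch is only an analogy: the recognizer does not work with regular expressions, and the class name OptionalGroupingSketch is an assumption made for this example.

```java
import java.util.regex.Pattern;

/** Illustrative only: mimics "<greet> = Hello [<robots>] ;" with a regular expression. */
final class OptionalGroupingSketch {

    // <robots> = Flobi | Meka ;
    private static final String ROBOTS = "(?:Flobi|Meka)";

    // <greet> = Hello [<robots>] ;   the square brackets become an optional group
    private static final Pattern GREET = Pattern.compile("Hello(?: " + ROBOTS + ")?");

    /** True if the whole utterance is covered by the greet rule. */
    static boolean accepts(String utterance) {
        return GREET.matcher(utterance).matches();
    }

    public static void main(String[] args) {
        String[] utterances = {"Hello", "Hello Flobi", "Hello Meka", "Hello Pepper", "Flobi"};
        for (String u : utterances) {
            System.out.println(u + " -> " + accepts(u));
        }
    }
}
```

“Hello”, “Hello Flobi” and “Hello Meka” are accepted; a robot name on its own or an unknown name is not.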

Let's add more rules besides the greeting:

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> | <info_request> ;
<greet> = Hello [<robots>] ;
<robots> = Flobi | Meka;
<info_request> = <name_request>;
<name_request> = <robots> (wer bin ich | wie ist mein Name | wie heiße ich);

Or a little more complex one:

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> | <info_request> | <action_request> ;
<greet> = Hello [<robots>] ;
<robots> = Flobi | Meka;
<info_request> = <name_request>;
<name_request> = <robots> (wer bin ich | wie ist mein Name | wie heiße ich);
<action_request> = <switch_light>;
<switch_light> = [<robots>] [(mach | schalte | stell)] [<location>] ([das] Licht|[die] Lichter |[die] Lampen) [<location>] <status> [<location>];
<location> = (überall | hier | alle | in der Küche | im Bad | im Wohnzimmer);
<status> = (an | aus | heller |dunkler);

Bigger expressions should be used with care since they tend to make the recognition more imprecise.

We can also control how often an expansion may be repeated by using a Kleene star ‘*’ or a plus symbol ‘+’. A rule expansion followed by the Kleene star may be spoken zero or more times, and a rule expansion followed by the plus symbol may be spoken one or more times.

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> | <info_request> | <action_request> ;
<greet> = Hello [<robots>] ;
<robots> = Flobi | Meka;
<info_request> = <name_request>;
<name_request> = <robots> (wer bin ich | wie ist mein Name | wie heiße ich);
<action_request> = <switch_light>;
<switch_light> = [<robots>] (bitte)* [(mach | schalte | stell)] [<location>] ([das] Licht|[die] Lichter |[die] Lampen) [<location>] <status> [<location>];
<location> = (überall | hier | alle | in der Küche | im Bad | im Wohnzimmer);
<status> = (an | aus | heller |dunkler);

This adds a little politeness to the switch_light action request: “bitte” can be said as often as we want, or omitted entirely. With a plus symbol in place of the star, it would have to be said at least once.
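
The difference between the two operators can be tried out with a small Java sketch. It uses a deliberately shortened, hypothetical rule, “(bitte)* Licht (an | aus)”, instead of the full switch_light rule, and again mimics the grammar with a regular expression purely for illustration; the class name RepetitionSketch is an assumption made for this example.

```java
import java.util.regex.Pattern;

/** Illustrative only: compares (bitte)* and (bitte)+ on a shortened light-switch rule. */
final class RepetitionSketch {

    /** Builds a matcher for the shortened rule "(bitte)<operator> Licht (an | aus)". */
    static Pattern switchLight(char operator) {
        if (operator != '*' && operator != '+') {
            throw new IllegalArgumentException("operator must be '*' or '+'");
        }
        return Pattern.compile("(?:bitte )" + operator + "Licht (?:an|aus)");
    }

    /** True if the whole utterance is covered by the rule built with the given operator. */
    static boolean accepts(char operator, String utterance) {
        return switchLight(operator).matcher(utterance).matches();
    }

    public static void main(String[] args) {
        String[] utterances = {"Licht an", "bitte Licht an", "bitte bitte Licht aus"};
        for (String u : utterances) {
            System.out.println(u + "  star: " + accepts('*', u) + "  plus: " + accepts('+', u));
        }
    }
}
```

Only the first utterance distinguishes the two variants: without any “bitte” it is accepted by the star version and rejected by the plus version.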

The grammar format also supports right recursion, so a rule may refer to itself as the last part of its own definition. We can use this to add a filler rule that handles hesitation noises and the like, and that allows fillers to be said multiple times.

#JSGF V1.0 UTF-8 de;
grammar simpleCommand;

public <simpleCommand> = <greet> | <info_request> | <action_request> | <filler> ;
<greet> = Hello [<robots>] ;
<robots> = Flobi | Meka;
<info_request> = <name_request>;
<name_request> = <robots> (wer bin ich | wie ist mein Name | wie heiße ich);
<action_request> = <switch_light>;
<switch_light> = [<robots>] (bitte)* [(mach | schalte | stell)] [<location>] ([das] Licht|[die] Lichter |[die] Lampen) [<location>] <status> [<location>];
<location> = (überall | hier | alle | in der Küche | im Bad | im Wohnzimmer);
<status> = (an | aus | heller |dunkler);
<filler> = (ähm | hm | aha | argh | och | oje | öh) [<filler>];

So a filler must be said once and may then be repeated as often as we like.
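
The right-recursive filler rule maps directly onto a recursive function: one filler token, followed by an optional recursive tail. The Java sketch below illustrates this on an already tokenized utterance. It is not how the recognizer evaluates the grammar; the class name FillerRecursionSketch is an assumption made for this example, and only the filler tokens without umlauts are included to keep the source ASCII-only.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: "<filler> = (hm | aha | ...) [<filler>] ;" as a recursive check. */
final class FillerRecursionSketch {

    // subset of the filler tokens from the grammar above
    private static final List<String> FILLERS =
            Arrays.asList("hm", "aha", "argh", "och", "oje");

    /** True if the words consist of one or more filler tokens and nothing else. */
    static boolean isFiller(List<String> words) {
        if (words.isEmpty() || !FILLERS.contains(words.get(0))) {
            return false; // the first part of the rule is mandatory
        }
        List<String> rest = words.subList(1, words.size());
        // [<filler>]: the recursive tail is optional
        return rest.isEmpty() || isFiller(rest);
    }

    public static void main(String[] args) {
        System.out.println(isFiller(Arrays.asList("hm")));
        System.out.println(isFiller(Arrays.asList("hm", "aha", "och")));
        System.out.println(isFiller(Arrays.asList("hm", "Licht")));
    }
}
```

Because the self-reference is the last element of the rule, each step consumes exactly one token before recursing, so the check always terminates.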

These are the basics; JSGF offers further features such as tags and weights.
See https://www.w3.org/TR/jsgf/ for detailed information about JSGF.
Other tutorials: One and Two

The grammar is read by both the speech recognition tool and the jsgf-parser tool. They should not differ in their features, but some features have not been tested or are not stable. The following features are supported:

Feature                        sphinx                       jsgf-parser
rulename characters:
text                           yes                          yes
numbers only                   no                           no
text and numbers               yes                          yes
special characters             _ $ - : , | @ % ! ^ & ~ #    _ $ - : , | @ % ! ^ & ~ #
rulenames <NULL> and <VOID>    yes                          yes
quoted Tokens                  yes                          no(?)
comments                       yes                          yes
imports                        yes                          yes
rule expansions:
sequences                      yes                          yes
alternatives                   yes                          yes
parentheses                    yes                          yes
optional grouping              yes                          yes
weights                        yes                          no(?)
kleene-star                    yes                          yes
plus-symbol                    yes                          yes
tags                           yes                          no(?)
right recursion                yes                          yes

1.6.2.4.4. Creation of Dialog Acts

Here we have a very simple example grammar:

     grammar example;
     public <example> = (<greet> |<goodbye> | <info_request>);
     <greet> = (hello  | nice to meet you )  Flobi;
     <goodbye> = ( Goodbye | bye | see you soon) Flobi;
     <info_request> = (what time is it | how is the weather);

Assumption: Someone said “hello Flobi” as input for this tool. A SpeechHypothesis will be generated and sent.

The generated SpeechHypothesis contains:

field contains                                                           in this example
list of the words that describe the understood input                     [hello, flobi]
confidence for the speech hypothesis                                     1 (totally sure)
grammar-tree containing grammar rules and spoken tokens in XML format    <EXAMPLE><GREET> hello flobi </GREET></EXAMPLE>
flag if the result is final                                              final

The grammar-tree contains as root the name of the grammar. This tag will be ignored when inspecting the tree, so that we have a rule as root element. A DialogAct can be created from the SpeechHypothesis and its related grammar-tree. The root element of the tree will be the DialogActType. See what DialogActTypes are present.

Attention

In this example the grammar rule and the DialogActType match exactly, and it is highly recommended to create grammars whose rules match these dialog act types! Small deviations can be handled (is the rule a prefix of a type?), but usually the type will be OTHER if the triggered rule cannot be matched with a type!

This means that if there is no grammar-tree for the SpeechHypothesis, the DialogActType will be OTHER. Final and non-final SpeechHypotheses differ at least in their grammar-tree: the tree for a non-final hypothesis contains all possible trees for that hypothesis, so possibly more than one. For non-final hypotheses with more than one grammar-tree, only the first is inspected. The final flag of the hypothesis also determines the EditType: for final SpeechHypotheses the EditType of the DialogAct is COMMIT, otherwise it is ADD.

The generated DialogAct contains:

field contains                                                             in this example
type of the current DialogAct                                              GREET
incremental_unit, contains information about the EditType, id and more    EditType = COMMIT
input SpeechHypotheses                                                     complete SpeechHypothesis from above

The field-lists above are incomplete, see the marked links for more details.
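
The mapping described in this section can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not the component's actual implementation: the class DialogActMappingSketch and its two enums are hypothetical stand-ins for the real types, the list of dialog act types is reduced to the ones used in this example, and the handling of several grammar-trees in a non-final hypothesis is not modeled.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustrative only: derives a dialog act type and edit type from a speech hypothesis. */
final class DialogActMappingSketch {

    /** Hypothetical subset of the dialog act types; the real list is longer. */
    enum ActType { GREET, GOODBYE, INFO_REQUEST, OTHER }

    /** Hypothetical stand-in for the edit type of the incremental unit. */
    enum EditType { ADD, COMMIT }

    /** Maps the first rule below the grammar-name root to a type, falling back to OTHER. */
    static ActType typeOf(String grammarTreeXml) {
        if (grammarTreeXml == null || grammarTreeXml.trim().isEmpty()) {
            return ActType.OTHER; // no grammar-tree at all
        }
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(grammarTreeXml)))
                    .getDocumentElement(); // the grammar name, which is ignored
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element) {
                    return matchType(((Element) n).getTagName());
                }
            }
        } catch (Exception e) {
            // a malformed tree is treated like a missing tree in this sketch
        }
        return ActType.OTHER;
    }

    /** Exact match first, then the "rule is a prefix of a type" fallback. */
    static ActType matchType(String ruleName) {
        String rule = ruleName.toUpperCase(Locale.ROOT);
        for (ActType t : ActType.values()) {
            if (t != ActType.OTHER && t.name().equals(rule)) {
                return t;
            }
        }
        for (ActType t : ActType.values()) {
            if (t != ActType.OTHER && t.name().startsWith(rule)) {
                return t;
            }
        }
        return ActType.OTHER;
    }

    /** Final hypotheses lead to COMMIT, all others to ADD. */
    static EditType editTypeOf(boolean isFinal) {
        return isFinal ? EditType.COMMIT : EditType.ADD;
    }

    public static void main(String[] args) {
        String tree = "<EXAMPLE><GREET> hello flobi </GREET></EXAMPLE>";
        System.out.println(typeOf(tree) + " / " + editTypeOf(true));
    }
}
```

For the grammar-tree from the example above, the sketch yields the type GREET with the edit type COMMIT, matching the table. A rule such as switch_light, which has no corresponding type in this reduced list, ends up as OTHER.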