
Speech Synthesis & Speech Recognition
Using SAPI 4 High Level Interfaces

Brian Long (www.blong.com)


Click here to download the files associated with this article.

If you find this article useful then please consider making a donation. It will be appreciated however big or small it might be and will encourage Brian to continue researching and writing about interesting subjects in the future.


Introduction

This article looks at adding support for speech capabilities to Microsoft Windows applications written in Delphi, using the Microsoft Speech API version 4 (SAPI 4). For an overview on the subject of speech technology please click here. For information on using SAPI 5.1 in Delphi applications click here.

The older SAPI 4 interfaces are defined in two ways. There are high level interfaces, which are intended to make implementation easier at the cost of some control; these are designed for quick results but can be quite effective. There are also low level interfaces, which give full control but involve more work to get going; these are intended for the serious programmer.

The high level interfaces are implemented by Microsoft in COM objects that call the lower level interfaces, taking care of all the nitty-gritty. The low level interfaces themselves are implemented by the TTS and SR engines that you obtain and install.

We will look at the high level interfaces available for TTS and SR in this article. You can find coverage of the low level interfaces by clicking here.

Grammars

Part of the process of speech recognition involves deciding what words have actually been spoken. Recognisers use a grammar to decide what has been said, where possible.

In the case of dictation, a grammar can be used to indicate some words that are likely to be spoken. It is not feasible to try and represent the entire spoken English language as a grammar, so the recogniser does its best and uses the grammar to help out. The recogniser tries to use context information from the text to work out which words are more likely than others. At its simplest, the Microsoft SR engine can use a dictation grammar like this:


[Grammar]
LangID=2057
;2057 = $809 = UK English
type=dictation

With Command and Control, the permitted words are limited to the supported commands. The grammar defines various rules that dictate what can be said and this makes the recogniser's job much easier. Rather than trying to understand anything spoken, it only needs to recognise speech that follows the supplied rules. A Command and Control grammar is typically referred to as a Context-Free Grammar (CFG). A simple CFG that recognises three colours might look like this:


[Grammar]
LangID=2057
;UK English - 2057 = $809
Type=cfg

[<Start>]
<Start> = colour red
<Start> = colour green
<Start> = colour blue

Note: <Start> is the root rule of the grammar.

Grammars support lists to make implementing many similar commands easy. For example:


[Grammar]
LangID=2057
;UK English - 2057 = $809
Type=cfg

[<Start>]
<Start> = colour <Colour>

[<Colour>]
<Colour> = red
<Colour> = green
<Colour> = blue

You can find more details about the supported grammar syntax in the SAPI documentation.

High Level Interfaces

The interfaces are made available to programmers as true COM interfaces, which Delphi's Object Pascal and C++ are both more than able to use. However they are also made available in a simplified form as Automation interfaces for less able languages such as VBA and also through ActiveX controls for use as visual controls in development environments that support them. Additionally, C++ wrapper classes are supplied for Visual C++ programmers familiar with MFC, but we won't need to pay any attention to those in this paper.

Most of the details of how the interfaces work will be covered whilst looking at the COM support and so will not be focused on quite so much during the Automation and ActiveX sections.

Note: speech recognition with the high level interfaces does not require a formal grammar to be supplied. The COM objects that implement these interfaces deal with setting up suitable grammars.

COM

The high level COM APIs are described as the Voice Text API, Voice Command API, Voice Dictation API and Voice Telephony API (we won't be looking at the last one in this article). As mentioned earlier, these high level APIs are implemented to call the lower level APIs and this code resides in an out-of-process COM/Automation server.

This server resides, along with the other main redistributable SAPI elements, in the speech directory under the main Windows directory (for example C:\WINNT\speech). It is called VCmd.exe and is described (by its version information) as Microsoft Voice Commands.

Voice Text API

Let's first look at TTS; the high level COM support for TTS is referred to as the Voice Text API. This API involves working with some interfaces implemented by a single COM object, referred to as the Voice Text Object by the SAPI 4 documentation but described as the Voice Text Manager in the Windows registry.

To get access to the object you can call CreateComObject from the ComObj unit, passing in the ClassID CLSID_VTxt from the speech unit. The created object supports the IVoiceText interface, which is what you use for the most common tasks. It also supports IVTxtAttributes, which allows you to control attributes such as the audio device and the speaking speed and find out if speech is in progress, and IVTxtDialogs, which allows you to invoke dialogs to configure the TTS engine (the dialogs are implemented by the engine).
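
For example, once you have the IVTxtAttributes interface (using the VoiceText reference created as just described) you can query and adjust the speaking speed. The fragment below is only a sketch: it assumes the import unit declares SpeedGet and SpeedSet methods taking a DWord, following the same pattern as the other attribute methods, so check the exact declarations in the unit you use.


var
  TxtAttrs: IVTxtAttributes;
  Speed: DWord;
begin
  TxtAttrs := VoiceText as IVTxtAttributes;
  //Read the current speaking speed, then slow it down by a quarter
  OleCheck(TxtAttrs.SpeedGet(Speed));
  OleCheck(TxtAttrs.SpeedSet(Speed - Speed div 4));
end;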

Making Your Computer Talk

At its simplest level, all you need to do to get your program to speak is to create the COM object, extract the IVoiceText interface, register your application to allow you to use Voice Text and then ask it to speak. A trivial application that does this can be found in the VoiceTextAPISimple.dpr project in the files associated with this paper in the COM directory. The code looks like this:


uses
  Speech, ...

type
  TfrmVoiceTextAPI = class(TForm)
  ...
  private
    VoiceText: IVoiceText;
  end;
...
uses
  ComObj, ActiveX;

procedure TfrmVoiceTextAPI.FormCreate(Sender: TObject);
begin
  VoiceText := CreateComObject(CLSID_VTxt) as IVoiceText;
  OleCheck(VoiceText.Register(nil, PChar(Application.ExeName),
    nil, GUID_NULL, 0, nil));
end;

procedure TfrmVoiceTextAPI.Button1Click(Sender: TObject);
begin
  OleCheck(VoiceText.Speak(PChar(memText.Text), 0, nil));
end;

And there you have it: a speaking application. The call to Register takes a number of parameters: the name of the Voice Text site (nil selects the default site), a string identifying the application, and four parameters describing an optional notification sink (the sink object itself, the ID of the notification interface it implements, a flags value controlling which notifications are sent, and a final pointer parameter). This simple example passes nil, GUID_NULL, 0 and nil for the sink parameters since no notifications are needed.

When requesting some speech, the Speak method takes three parameters: the text to speak, a flags value that can indicate the type and priority of the speech (0 accepts the defaults) and a final pointer parameter that the examples in this article leave as nil.

When the program executes it lets you type some text into a memo and a button renders it into the spoken word.

That's the simple example out of the way, but what can we achieve if we dig a little deeper and get our hands a little dirtier? The next project, which holds the answers to these questions, can be found as VoiceTextAPI.dpr in the COM directory.

This makes use of the notification support and also uses a few more methods of the IVoiceText and IVTxtAttributes interfaces to control the generated speech. As you can see below there are buttons to play, pause and stop the speech as well as to invoke some TTS engine configuration dialogs. These are joined by a memo where a phonetic equivalent (using the TTS engine's phoneme representation) of the spoken text is inserted and also a listbox where the notifications are recorded.

Let's start with the simple stuff. We have a couple of routines that support logging information to the listbox (one takes a string parameter and the other takes the same parameters as Format).


procedure TfrmVoiceTextAPI.Log(const Msg: String);
begin
  if not Assigned(lstProgress) then
    Exit;
  lstProgress.Items.Add(Msg);
  lstProgress.ItemIndex := lstProgress.Items.Count - 1
end;

procedure TfrmVoiceTextAPI.Log(const Msg: String; const Args: array of const);
begin
  Log(Format(Msg, Args))
end;

The form's OnCreate event handler connects to the COM object as before but this time it passes in a freshly created object that will receive the notifications. The object (which we will come back to later) implements the IVTxtNotifySink interface and expects to receive all notifications (as opposed to just two of them).

Additionally the IVTxtAttributes interface is extracted in order to set up the checkbox that shows you whether Voice Text is enabled or not (you can see the checkbox event handler below), and the IVTxtDialogs interface is extracted to allow access to the engine dialogs available through buttons on the form.

The other task performed here is to add a horizontal scrollbar to the logging listbox to ensure strings longer than its current width can be viewed.


procedure TfrmVoiceTextAPI.FormCreate(Sender: TObject);
var
  Enabled: DWord;
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  VoiceText := CreateComObject(CLSID_VTxt) as IVoiceText;
  OleCheck(VoiceText.Register(nil, PChar(Application.ExeName),
    TVTxtNotifySink.Create(Self), IVTxtNotifySink, VTXTF_ALLMESSAGES, nil));
  TxtAttrs := VoiceText as IVTxtAttributes;
  OleCheck(TxtAttrs.EnabledGet(Enabled));
  chkEnabled.Checked := Bool(Enabled);
  TxtDlgs := VoiceText as IVTxtDialogs;
end;

procedure TfrmVoiceTextAPI.chkEnabledClick(Sender: TObject);
begin
  OleCheck(TxtAttrs.EnabledSet(DWord(chkEnabled.Checked)))
end;

The buttons that play, pause and stop simply call IVoiceText methods. The play button can start the speech and the pause button can suspend it. The play button can then continue the speech but it has to do so using a different method than it used to initiate the speech (the BeenPaused flag is used to help manage this). The stop button completely stops the current speech.


procedure TfrmVoiceTextAPI.btnPlayClick(Sender: TObject);
begin
  if not BeenPaused then
    OleCheck(VoiceText.Speak(PChar(reText.Text),
      VTXTST_STATEMENT or VTXTSP_NORMAL, nil))
  else
  begin
    OleCheck(VoiceText.AudioResume);
    BeenPaused := False
  end;
end;

procedure TfrmVoiceTextAPI.btnPauseClick(Sender: TObject);
begin
  if VoiceText.AudioPause = NOERROR then
    BeenPaused := True
end;

procedure TfrmVoiceTextAPI.btnStopClick(Sender: TObject);
begin
  OleCheck(VoiceText.StopSpeaking);
  BeenPaused := False;
end;

Engine Dialogs

The buttons that invoke the various dialogs each use much the same code. Each one makes a call to an appropriate method of the dialogs interface, passing the form window handle and nil for the caption (so the default caption is used).


procedure TfrmVoiceTextAPI.btnAboutClick(Sender: TObject);
begin
  OleCheck(TxtDlgs.AboutDlg(Handle, nil))
end;

procedure TfrmVoiceTextAPI.btnGeneralClick(Sender: TObject);
begin
  OleCheck(TxtDlgs.GeneralDlg(Handle, nil))
end;

procedure TfrmVoiceTextAPI.btnLexiconClick(Sender: TObject);
begin
  OleCheck(TxtDlgs.LexiconDlg(Handle, nil))
end;

procedure TfrmVoiceTextAPI.btnTranslateClick(Sender: TObject);
begin
  OleCheck(TxtDlgs.TranslateDlg(Handle, nil))
end;

In the case of the Microsoft TTS engine, the About and General dialogs are both the same:

The Lexicon dialog offers the user a wizard to add new words to the internal dictionary and specify their correct pronunciation:

The Translation dialog is not implemented by the MS TTS engine.

Voice Text Notifications

The rest of the code is made up of the class that is designed to receive the notifications. Such a class is called a notification sink and must implement the IVTxtNotifySink notification interface as shown below.


type
  TVTxtNotifySink = class(TInterfacedObject, IVTxtNotifySink)
  private
    FForm: TfrmVoiceTextAPI;
  protected
    function AttribChanged(dwAttribute: DWORD): HResult; stdcall;
    function Visual(cIPAPhoneme: WideChar; cEnginePhoneme: AnsiChar;
      dwHints: DWORD; pTTSMouth: PTTSMOUTH): HResult; stdcall;
    function Speak(pszText: PAnsiChar; pszApplication: PAnsiChar;
      dwType: DWORD): HResult; stdcall;
    function SpeakingStarted: HResult; stdcall;
    function SpeakingDone: HResult; stdcall;
  public
    constructor Create(Form: TfrmVoiceTextAPI);
  end;

constructor TVTxtNotifySink.Create(Form: TfrmVoiceTextAPI);
begin
  inherited Create;
  FForm := Form
end;

function TVTxtNotifySink.AttribChanged(dwAttribute: DWORD): HResult;
var
  S: String;
begin
  Result := S_OK;
  case dwAttribute of
    TTSNSAC_REALTIME : S := 'Realtime';
    TTSNSAC_PITCH    : S := 'Pitch';
    TTSNSAC_SPEED    : S := 'Speed';
    TTSNSAC_VOLUME   : S := 'Volume';
  else
    S := 'unknown'
  end;
  FForm.Log('Engine Event AttribChanged: %s changed', [S]);
end;

function TVTxtNotifySink.Speak(pszText, pszApplication: PAnsiChar;
  dwType: DWORD): HResult;
begin
  Result := S_OK;
  FForm.Log('Engine Event Speak');
  FForm.memEnginePhonemes.Clear
end;

function TVTxtNotifySink.SpeakingDone: HResult;
begin
  Result := S_OK;
  FForm.Log('Engine Event SpeakingDone');
end;

function TVTxtNotifySink.SpeakingStarted: HResult;
begin
  Result := S_OK;
  FForm.Log('Engine Event SpeakingStarted');
end;

function TVTxtNotifySink.Visual(cIPAPhoneme: WideChar;
  cEnginePhoneme: AnsiChar; dwHints: DWORD; pTTSMouth: PTTSMOUTH): HResult;
var
  Hint: String;
begin
  Result := S_OK;
  Hint := '';
  if dwHints <> 0 then
  begin
    if dwHints and TTSNSHINT_QUESTION <> 0 then
      Hint := 'Question ';
    if dwHints and TTSNSHINT_STATEMENT <> 0 then
      Hint := Hint + 'Statement ';
    if dwHints and TTSNSHINT_COMMAND <> 0 then
      Hint := Hint + 'Command ';
    if dwHints and TTSNSHINT_EXCLAMATION <> 0 then
      Hint := Hint + 'Exclamation ';
    if dwHints and TTSNSHINT_EMPHASIS <> 0 then
      Hint := Hint + 'Emphasis';
  end
  else
    Hint := 'none';
  FForm.Log('Engine Event Visual: hint = %s', [Hint]);
  if cEnginePhoneme <> #32 then
    FForm.memEnginePhonemes.Text :=
      FForm.memEnginePhonemes.Text + cEnginePhoneme
end;

As you can see, the interface defines a total of five methods that notify your program of various things, such as a speech attribute changing, a speech request being made, and speech starting or finishing. There is also the aforementioned notification that describes the phoneme currently being spoken and allows the creative among us to animate the speech (the Visual method). Don't fret if you lack the creative touch, as later options provide animated mouths anyway.

Note: all five of these notification methods are likely to be called if you pass VTXTF_ALLMESSAGES as the penultimate parameter when registering to use voice text. However if you pass 0 instead, only the SpeakingStarted and SpeakingDone notification methods will be called.

As you can see, most of these notification methods just log a string into the listbox, although the Speak notification clears the phoneme memo. The Visual notification offers the most information. In this case the dwHints parameter is examined to see if it tells us anything and the engine phoneme is added to the memo, but everything else is ignored.
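
If you do fancy animating a mouth, the pTTSMouth parameter of Visual carries the data needed. The fragment below is only a sketch: it assumes the import unit declares the mouth record with byte fields along the lines of bMouthHeight, bMouthWidth and bJawOpen (check the actual field names in your import unit) and that the form has matching fields and paint code of its own.


//Inside TVTxtNotifySink.Visual
if pTTSMouth <> nil then
begin
  //Cache the latest mouth metrics on the form...
  FForm.MouthHeight := pTTSMouth.bMouthHeight;
  FForm.MouthWidth := pTTSMouth.bMouthWidth;
  FForm.JawOpen := pTTSMouth.bJawOpen;
  //...and have its OnPaint handler redraw the mouth
  FForm.Invalidate
end;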


Voice Command API

The Voice Command API allows you to implement command-and-control SR in your application and operates through a COM object referred to as the Voice Command Object by the SAPI 4 documentation but described as the Voice Command Manager in the Windows registry.

You use the ClassID CLSID_VCmd from the Speech unit to initialise it and the created object supports the IVoiceCmd, IVCmdDialogs and IVCmdAttributes interfaces. The application must register itself through the IVoiceCmd interface Register method and must then create a command menu, represented by the IVCmdMenu interface. This menu contains details of all the commands the user can utter and is analogous to a Windows menu in that regard.

Menus And Commands

The concepts of command and control SR support are similar to the commands in normal Windows menus. You define menus containing commands and can then control which are active and which are not. All Voice Command client applications share the Microsoft Voice Commands Automation server and so there is a central "repository" of all the defined command menus.

Depending on how the menus are created, the server may store the menu in a database file (a file matching the spec vcmd*.vcd in the speech directory beneath your main Windows directory). Also, the menu can be specified as local to a particular window (it is automatically activated and deactivated as the window gains and loses focus) or global (it is always active).

When a command from an active menu is recognised you are told about it through another notification interface, IVCmdNotifySink. The primary notification of interest is the CommandRecognize method that tells you which of your commands was spoken. If another application's command is recognised by the Voice Commands Automation server, the CommandOther notification is fired.

Also available is a command menu enumerator interface, IVCmdEnum, which allows you to enumerate all the voice menus in the Voice Menu database.

Speech recognition can be either enabled or disabled for an entire site, but it can also be awake or asleep (temporarily paused). Information about these states can be obtained from the IVCmdAttributes interface. You can send Voice Commands to sleep if you want it to ignore almost everything you say for a while, except for a dedicated sleep menu (that contains a command such as Start Listening or Wake Up). If you disable Voice Commands you cannot reactivate it by voice.
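
As a sketch of what sending Voice Commands to sleep looks like in code (a hypothetical GoToSleep method on a form that, like the sample application below, has extracted IVCmdAttributes into a CmdAttrs field):


procedure TfrmVoiceCommandAPI.GoToSleep;
var
  Awake: DWord;
begin
  OleCheck(CmdAttrs.AwakeStateGet(Awake));
  if Bool(Awake) then
    //Once asleep, only commands on a dedicated sleep menu
    //(such as Wake Up) will be acted upon
    OleCheck(CmdAttrs.AwakeStateSet(0))
end;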

A sample Voice Command API application can be found as VoiceCommandAPI.dpr in the COM directory.

The OnCreate event handler connects to the Voice Command Object and extracts the IVCmdAttributes interface. This is used to ascertain whether speech recognition is awake and enabled for the Voice Command site. Two checkboxes on the form are used to record this information (and they have OnClick event handlers to update these states as well). The IVCmdDialogs interface is also extracted to enable access to the engine dialogs through buttons on the form.


uses
  Speech, ...

type
  TfrmVoiceCommandAPI = class(TForm)
  ...
  private
    VoiceCmd: IVoiceCmd;
    CmdAttrs: IVCmdAttributes;
    CmdMenu: IVCmdMenu;
    CmdDlgs: IVCmdDialogs;
  ...
  end; 
...
procedure TfrmVoiceCommandAPI.FormCreate(Sender: TObject);
var
  Enabled, Awake: DWord;
begin
  VoiceCmd := CreateComObject(CLSID_VCmd) as IVoiceCmd;
  CmdAttrs := VoiceCmd as IVCmdAttributes;
  OleCheck(VoiceCmd.Register(nil, TVCmdNotifySink.Create(Self),
    IVCmdNotifySink, VCMDRF_ALLMESSAGES, nil));
  CreateCommandMenu;
  CmdAttrs.EnabledGet(Enabled);
  chkEnabled.Checked := Bool(Enabled);
  CmdAttrs.AwakeStateGet(Awake);
  chkAwake.Checked := Bool(Awake);
  CmdDlgs := VoiceCmd as IVCmdDialogs;
end;

procedure TfrmVoiceCommandAPI.chkEnabledClick(Sender: TObject);
begin
  OleCheck(CmdAttrs.EnabledSet(Integer(chkEnabled.Checked)))
end;

procedure TfrmVoiceCommandAPI.chkAwakeClick(Sender: TObject);
begin
  OleCheck(CmdAttrs.AwakeStateSet(Integer(chkAwake.Checked)))
end;

Requesting Notifications

The next job is to register the application in order to use Voice Commands. Part of the registration process involves passing a reference to a callback object and (in this case) requesting that all notifications be sent to it.

There are various flags to control how many notifications are sent; this example passes VCMDRF_ALLMESSAGES to request all of them.

Before looking at the callback object we should see what is involved in setting up a command menu. This is all wrapped up in the form's CreateCommandMenu method.


procedure TfrmVoiceCommandAPI.CreateCommandMenu;

  procedure AddCommand(ID: Integer; const Command, Category, 
    Description: String);
  var
    MemReq: Integer;
    CmdCommand: PVCmdCommand;
    SData: TSData;
    Dest: PChar;
    CmdStart: DWord;
  begin
    MemReq := SizeOf(TVCmdCommand) + Succ(Length(Command)) +
      Succ(Length(Category)) + Succ(Length(Description));
    CmdCommand := AllocMem(MemReq);
    try
      SData.pData := CmdCommand;
      SData.dwSize := MemReq;
      CmdCommand.dwSize := MemReq;
      CmdCommand.dwID := ID;

      //Each string is copied immediately after the previous one;
      //the dwXXX fields record each string's offset from the record start
      Dest := PChar(CmdCommand) + SizeOf(TVCmdCommand);
      CmdCommand.dwCommand := SizeOf(TVCmdCommand);
      StrCopy(Dest, PChar(Command));

      Inc(Dest, Succ(Length(Command)));
      CmdCommand.dwCategory :=
        CmdCommand.dwCommand + DWord(Succ(Length(Command)));
      StrCopy(Dest, PChar(Category));

      Inc(Dest, Succ(Length(Category)));
      CmdCommand.dwDescription :=
        CmdCommand.dwCategory + DWord(Succ(Length(Category)));
      StrCopy(Dest, PChar(Description));

      OleCheck(CmdMenu.Add(1, SData, CmdStart));
    finally
      FreeMem(CmdCommand);
    end;
  end;

var
  VCMDName: TVCMDName;
begin
  StrPCopy(VCMDName.szApplication, ExtractFileName(Application.ExeName));
  StrPCopy(VCMDName.szState, 'Main');
  OleCheck(VoiceCmd.MenuCreate(
    @VCMDName, nil, VCMDMC_CREATE_TEMP, CmdMenu));
  AddCommand(1, 'Red', 'FormColour', 'Change form colour to red');
  AddCommand(2, 'Green', 'FormColour', 'Change form colour to green');
  AddCommand(3, 'Blue', 'FormColour', 'Change form colour to blue');
  OleCheck(CmdMenu.Activate(Handle, 0));
end;

A command menu is created by first setting up a TVCMDName record with a unique application name (the actual application name is used here) and a "state" to describe the menu. In this case the state is Main, to indicate that the menu relates to the application's main form.

The menu is then created with the IVoiceCmd.MenuCreate method. The command menu interface is returned in the last parameter (CmdMenu) and the menu is specified as being temporary (it won't be stored in the Voice Menu database). Having got a menu, commands are then added using a helper routine called AddCommand, which expects the command's ID, name, category and description to be passed.

The details of what AddCommand does are not too important, but it will suffice to say that it allocates enough memory for a TVCmdCommand structure plus the strings that specify the command's name, category and description. The results are added to the command menu and the allocated memory is freed.

Note: there are reports of issues with permanent Voice Command menus due to the underlying database sometimes disappearing without warning. Therefore it is recommended to create temporary menus (using the VCMDMC_CREATE_TEMP flag), which may take slightly more time than reusing an existing menu from the database.

Command Options

One option not taken advantage of in AddCommand is command verification. If a command warrants verification (such as Format Disk or Delete File) you should add in the appropriate flag, as in:


CmdCommand.dwFlags := VCMDCMD_VERIFY;

This causes the same flag to be passed along to the CommandRecognize notification method so the verification can be actioned. However, there is a problem here: the JEDI SAPI import unit has an error in one of the parameter definitions in the Voice Command notification sink interface. Unless you need command verification you can ignore it, but we will look at the issue when we get down to the notification methods.

Another possible flag is VCMDCMD_CANTRENAME, which ensures the command cannot be renamed by applications supporting that feature (such as Microsoft Voice).
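
The flags can be combined in the usual way if both behaviours are wanted:


CmdCommand.dwFlags := VCMDCMD_VERIFY or VCMDCMD_CANTRENAME;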

Note: Voice Commands supports both simple commands (such as Red) and commands that use lists. This allows you to define one command, such as Start <App>, where <App> represents a value from the App list, which itself could be a large list of potential programs to start.

The support of lists allows many similar commands to be set up in two steps: add the command with IVCmdMenu.Add, then add the list with IVCmdMenu.ListSet. At the COM level ListSet is a little involved and so, like Add, warrants a helper function. You can use this one if you need to add list commands:


procedure AddList(Menu: IVCmdMenu; const ListName: String;
  const ListItems: array of String);
var
  List: String;
  I: Integer;
  Data: TSData;
begin
  List := '';
  if High(ListItems) < Low(ListItems) then
    Exit;
  //Build a buffer of null-terminated strings, one per list item
  for I := Low(ListItems) to High(ListItems) do
    List := List + ListItems[I] + #0;
  Data.pData := @List[1];
  Data.dwSize := Succ(Length(List));
  OleCheck(Menu.ListSet(PChar(ListName),
    Succ(High(ListItems) - Low(ListItems)), Data));
end;
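
As a hypothetical example of using the two helpers together (the ID and list items are purely illustrative, and both helpers are assumed to be in scope), the Start <App> command mentioned above could be set up like this:


AddCommand(4, 'Start <App>', 'Launcher', 'Start the named program');
AddList(CmdMenu, 'App', ['Notepad', 'Calculator', 'Paint']);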

Once the menu is complete it is activated by calling the Activate method. This takes two arguments: a window handle, which makes the menu local to that window (pass 0 to make the menu global), and a flags parameter (0 here).

Voice Command Notifications

The callback object simply logs any information passed to its methods when they are called by the Voice Command object. The only notable methods are listed below and show how to identify which speech recognition attribute has changed (if the awake or enabled attributes have changed, the checkboxes are updated accordingly), what type of interference was encountered, and a simple way to display the VU meter information (using a progress bar).


type
  TVCmdNotifySink = class(TInterfacedObject, IVCmdNotifySink)
  private
    FForm: TfrmVoiceCommandAPI;
  protected
    function CommandRecognize(dwID: DWORD; pvCmdName: PVCmdNameA;
      pdwFlags: PDWORD;
      dwActionSize: DWORD; pAction: pointer; dwNumLists: DWORD;
      pszListValues: PAnsiChar; pszCommand: PAnsiChar): HResult; stdcall;
    function CommandOther(pName: PVCmdNameA;
      pszCommand: PAnsiChar): HResult; stdcall;
    function CommandStart: HResult; stdcall;
    function MenuActivate(pName: PVCmdNameA;
      bActive: BOOL): HResult; stdcall;
    function UtteranceBegin: HResult; stdcall;
    function UtteranceEnd: HResult; stdcall;
    function VUMeter(wLevel: WORD): HResult; stdcall;
    function AttribChanged(dwAttribute: DWORD): HResult; stdcall;
    function Interference(dwType: DWORD): HResult; stdcall;
  public
    constructor Create(Form: TfrmVoiceCommandAPI);
  end;
...
function TVCmdNotifySink.AttribChanged(dwAttribute: DWORD): HResult;
var
  S: String;
  Enabled, Awake: DWord;
begin
  Result := S_OK;
  if dwAttribute <> 0 then
  begin
    if dwAttribute and IVCNSAC_AUTOGAINENABLE > 0 then
      S := 'automatic gain, ';
    if dwAttribute and IVCNSAC_ENABLED > 0 then
    begin
      S := S + 'enabled, ';
      OleCheck(FForm.CmdAttrs.EnabledGet(Enabled));
      FForm.chkEnabled.Checked := Bool(Enabled);
    end;
    if dwAttribute and IVCNSAC_AWAKE > 0 then
    begin
      S := S + 'awake, ';
      OleCheck(FForm.CmdAttrs.AwakeStateGet(Awake));
      FForm.chkAwake.Checked := Bool(Awake);
    end;
    if dwAttribute and IVCNSAC_DEVICE > 0 then
      S := S + 'audio device, ';
    if dwAttribute and IVCNSAC_MICROPHONE > 0 then
      S := S + 'current microphone, ';
    if dwAttribute and IVCNSAC_SPEAKER > 0 then
      S := S + 'speaker, ';
    if dwAttribute and IVCNSAC_SRMODE > 0 then
      S := S + 'SR mode, ';
    if dwAttribute and IVCNSAC_THRESHOLD > 0 then
      S := S + 'threshold, ';
    if dwAttribute and IVCNSAC_ORIGINAPP > 0 then
      S := S + 'from this app';
  end
  else
    S := 'none';
  FForm.Log('Attribute changed: ' + S)
end;

function TVCmdNotifySink.CommandRecognize(dwID: DWORD;
  pvCmdName: PVCmdNameA; pdwFlags: PDWORD; dwActionSize: DWORD;
  pAction: pointer; dwNumLists: DWORD; pszListValues,
  pszCommand: PAnsiChar): HResult;
begin
  Result := S_OK;
  FForm.Log('Command: app = %s, state = %s, cmd = %s, id = %d',
    [pvCmdName.szApplication, pvCmdName.szState, pszCommand, dwId]);
  case dwID of
    1: FForm.Color := clRed;
    2: FForm.Color := clGreen;
    3: FForm.Color := clBlue;
  end
end;

function TVCmdNotifySink.Interference(dwType: DWORD): HResult;
var
  S: String;
begin
  Result := S_OK;
  case dwType of
    SRMSGINT_NOISE: S := 'background noise too high';
    SRMSGINT_NOSIGNAL:
      S := 'engine cannot detect a signal (mic unplugged?)';
    SRMSGINT_TOOLOUD:
      S := 'speaker is too loud; recognition results may be degraded';
    SRMSGINT_TOOQUIET:
      S := 'speaker is too quiet; recognition results may be degraded';
    SRMSGINT_AUDIODATA_STOPPED,
    SRMSGINT_IAUDIO_STOPPED:
      S := 'engine has stopped receiving audio data from the audio source';
    SRMSGINT_AUDIODATA_STARTED,
    SRMSGINT_IAUDIO_STARTED:
      S := 'engine has resumed receiving audio data from the audio source';
  else
    S := Format('type %d', [dwType])
  end;
  FForm.Log('Interference: %s', [S])
end;

function TVCmdNotifySink.VUMeter(wLevel: WORD): HResult;
begin
  Result := S_OK;
  FForm.ProgressBar.Position := wLevel;
  FForm.lblVU.Caption := IntToStr(wLevel);
end;

The most important method is, of course, CommandRecognize. This fires when the user speaks a command that is recognised by the Voice Command object. As you can see, it gets passed a number of parameters, but the most important one is dwID, which allows you to respond to the commands as you like using a case statement.

Note: If you request command verification, as described earlier, you will need to be aware of a problem in the JEDI SAPI 4 import unit. The CommandRecognize method (defined in the IVCmdNotifySinkA and IVCmdNotifySinkW interfaces) declares the flags parameter as:


pdwFlags: PDWORD;

where it should actually be:


dwFlags: DWORD;

This has been reported and so should be fixed at some point. In the meantime, the simplest way to overcome this is to check DWord(pdwFlags).
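
For example, a verification check at the top of CommandRecognize might look like this sketch (the dialog wording is illustrative, and the Dialogs unit is needed for MessageDlg):


Result := S_OK;
if DWord(pdwFlags) and VCMDCMD_VERIFY <> 0 then
  if MessageDlg(Format('Did you say "%s"?', [pszCommand]),
      mtConfirmation, [mbYes, mbNo], 0) <> mrYes then
    //Treat an unverified command as if it had not been recognised
    Exit;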

Note: CommandRecognize does not get passed the category or description of the command, so there is little point in setting them up here (although some of the other mechanisms for using the Voice Command object do utilise this information).

Note: If a command is recognised and that command is defined in terms of a list, pszCommand will have the full command as spoken by the user, pszListValues refers to the list item on its own and dwNumLists tells you how many bytes the list item takes up (including the null terminator). Since we are looking at an ANSI notification interface, dwNumLists will be one greater than the number of characters in the list item.
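
Continuing the hypothetical Start <App> command from earlier, the extra case arm below sketches how the list item might be picked out in CommandRecognize:


//An additional arm for the case statement in CommandRecognize
4: if dwNumLists > 0 then
     //pszListValues holds just the list item (e.g. Notepad), whilst
     //pszCommand holds the full spoken command (e.g. Start Notepad)
     FForm.Log('Asked to start: %s', [pszListValues]);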

You can see how this application looks after a few commands have been spoken below.

Engine Dialogs

As you can see above there are five potential dialogs available from an SR engine, each being invoked much as with the TTS dialogs.


procedure TfrmVoiceCommandAPI.btnAboutClick(Sender: TObject);
begin
  OleCheck(CmdDlgs.AboutDlg(Handle, nil))
end;

procedure TfrmVoiceCommandAPI.btnGeneralClick(Sender: TObject);
begin
  OleCheck(CmdDlgs.GeneralDlg(Handle, nil))
end;

procedure TfrmVoiceCommandAPI.btnLexiconClick(Sender: TObject);
begin
  OleCheck(CmdDlgs.LexiconDlg(Handle, nil))
end;

procedure TfrmVoiceCommandAPI.btnTrainGeneralClick(Sender: TObject);
begin
  OleCheck(CmdDlgs.TrainGeneralDlg(Handle, nil))
end;

procedure TfrmVoiceCommandAPI.btnTrainMicClick(Sender: TObject);
begin
  OleCheck(CmdDlgs.TrainMicDlg(Handle, nil))
end;

In the case of the Microsoft SR engine, the About dialog gives version information:

The General dialog allows you to set the accuracy of speech recognition:

The Lexicon dialog is much the same as with the TTS engine but the training dialog allows you to read various passages of text to train the SR engine to your voice:

Voice Dictation API

The Voice Dictation API allows you to implement dictation SR in your application and operates through a COM object referred to as the Voice Dictation Object by the SAPI 4 documentation but described as the Voice Dictation Manager in the Windows registry.

You use the ClassID CLSID_VDct from the Speech unit to initialise it and the created object supports numerous interfaces, including IVoiceDictation (which you use to register your application) as well as IVDctAttributes and IVDctDialogs.

You identify what has been spoken by the user through another notification interface, IVDctNotifySink. The primary notification of interest is the PhraseFinish method, which tells you which phrase was spoken (or the engine's best guess).

A sample Voice Dictation API application can be found as VoiceDictationAPI.dpr in the COM directory.

The OnCreate event handler connects to the Voice Dictation Object and registers with it. Then the IVDctDialogs interface is extracted in order to allow access to the dialogs through buttons on the form and the IVDctAttributes interface is also extracted. This is used to set a speaker so the speech recognition training for that speaker can be used. It would be better to store the speaker name with your application state data (registry or an INI file) than to hardcode it as in this example.

The Voice Dictation session is then activated for whenever the form is focused. Finally the SR mode is set to support voice commands and dictation, meaning this program will work and any Voice Command applications will also continue to function. Note that this will only work if the engine supports simultaneous command and control and dictation.


uses
  Speech, ...

type
  TfrmVoiceDictationAPI = class(TForm)
  ...
  private
    VoiceDct: IVoiceDictation;
    DctAttrs: IVDctAttributes;
    DctDlgs: IVDctDialogs;
  ...
  end; 
...
procedure TfrmVoiceDictationAPI.FormCreate(Sender: TObject);
begin
  VoiceDct := CreateComObject(CLSID_VDct) as IVoiceDictation;
  OleCheck(VoiceDct.Register(
    PChar(ExtractFileName(Application.ExeName)), 'My Topic', nil, nil,
    TVDctNotifySink.Create(Self), IVDctNotifySink, VCMDRF_ALLMESSAGES));
  DctDlgs := VoiceDct as IVDctDialogs;
  DctAttrs := VoiceDct as IVDctAttributes;
  OleCheck(DctAttrs.SpeakerSet('blong'));
  OleCheck(VoiceDct.Activate(Handle));
  OleCheck(DctAttrs.ModeSet(VSRMODE_CMDANDDCT));
end;

As with the Voice Command API, a callback object is passed to the registration routine and requests all notifications be passed to it.

Voice Dictation Notifications

The callback object logs information passed to the notification methods.


type
  TVDctNotifySink = class(TInterfacedObject, IVDctNotifySink)
  private
    FForm: TfrmVoiceDictationAPI;
    FPhraseDone: Boolean;
    function PhraseToStr(pSRPhrase: PSRPhraseA): String;
  protected
    function CommandBuiltIn(pszCommand: PAnsiChar): HResult; stdcall;
    function CommandOther(pszCommand: PAnsiChar): HResult; stdcall;
    function CommandRecognize(dwID: DWord; pdwFlags: PDWord;
      dwActionSize: DWORD; pAction: Pointer;
      pszCommand: PAnsiChar): HResult; stdcall;
    function TextSelChanged: HResult; stdcall;
    function TextChanged(dwReason: DWORD): HResult; stdcall;
    function TextBookmarkChanged(dwID: DWORD): HResult; stdcall;
    function PhraseStart: HResult; stdcall;
    function PhraseFinish(dwFlags: DWORD;
      pSRPhrase: PSRPhraseA): HResult; stdcall;
    function PhraseHypothesis(dwFlags: DWORD;
      pSRPhrase: PSRPhraseA): HResult; stdcall;
    function UtteranceBegin: HResult; stdcall;
    function UtteranceEnd: HResult; stdcall;
    function VUMeter(wLevel: WORD): HResult; stdcall;
    function AttribChanged(dwAttribute: DWORD): HResult; stdcall;
    function Interference(dwType: DWORD): HResult; stdcall;
    function Training(dwTrain: DWORD): HResult; stdcall;
    function Dictating(pszApp: PAnsiChar;
      fDictating: BOOL): HResult; stdcall;
  public
    constructor Create(Form: TfrmVoiceDictationAPI);
  end;

The main method is PhraseFinish, which is called when the SR engine has decided what has been spoken. Whilst working it out, it will likely call the PhraseHypothesis method several times.


function TVDctNotifySink.PhraseFinish(dwFlags: DWORD;
  pSRPhrase: PSRPhraseA): HResult;
begin
  Result := S_OK;
  FForm.Log('PhraseFinish: %s', [PhraseToStr(pSRPhrase)]);
  FForm.memText.SelText := PhraseToStr(pSRPhrase);
  //PhraseStart never seems to trigger, so flag that this phrase
  //is complete; the next hypothesis will then clear the old list
  FPhraseDone := True
end;

function TVDctNotifySink.PhraseHypothesis(dwFlags: DWORD;
  pSRPhrase: PSRPhraseA): HResult;
begin
  Result := S_OK;
  //Since PhraseStart never seems to trigger, this
  //clears the old hypothesis list on the first new hypothesis
  if FPhraseDone then
  begin
    FForm.lstHypotheses.Clear;
    FPhraseDone := False
  end;
  FForm.lstHypotheses.Items.Add(PhraseToStr(pSRPhrase));
  FForm.lstHypotheses.ItemIndex := FForm.lstHypotheses.Items.Count - 1
end;

Both these methods are passed a pointer to a TSRPhrase record that contains the words in the phrase being reported. A helper routine is used to turn this into a normal string. Finished phrases are added to a memo on the form and whilst a phrase is being worked out the hypotheses are added to a list box so you can see how the SR engine made its decision. Each time a new phrase is started, the hypothesis list is cleared.

Note: the hypothesis list is actually cleared in the PhraseHypothesis notification method, when the FPhraseDone flag indicates a previous phrase has completed. It would be more sensible to clear it in the PhraseStart method, but that notification method never seems to be called.


function TVDctNotifySink.PhraseToStr(pSRPhrase: PSRPhraseA): String;
var
  ToGo: Integer;
  PSRW: PSRWord;
begin
  Result := '';
  if pSRPhrase = nil then
    Exit;
  ToGo := pSRPhrase.dwSize - SizeOf(pSRPhrase.dwSize);
  PSRW := @pSRPhrase.abWords;
  while ToGo > 0 do
  begin
    Result := Result + PChar(@PSRW.szWord) + #32;
    Dec(ToGo, PSRW.dwSize);
    Inc(PChar(PSRW), PSRW.dwSize)
  end;
end;

You can see the list of hypotheses building up in this screenshot of the program running.

Note: the API routines for invoking the dialogs have a slight issue: the IVDctDialogs.TrainGeneralDlg method actually invokes the microphone training dialog and IVDctDialogs.TrainMicDlg actually invokes the general speech recognition training dialog.

Note: for dictation to work acceptably you should spend the time doing several voice training sessions. You should also invest in a quality microphone (a close-talk headset microphone is best).

Built-In Commands

The application has a popup menu that shows if you right-click on the form. The only menu item causes the Voice Dictation Object's built-in command grammar to be dumped into the memo.


procedure TfrmVoiceDictationAPI.Showbuiltincommands1Click(Sender: TObject);
var
  Buf: PChar;
  BufSize: DWord;
begin
  //Get built-in commands
  with VoiceDct as IVDctCommandsBuiltIn do
  begin
    TextGet(Buf, BufSize);
    memText.Text := Buf;
    CoTaskMemFree(Buf);
  end
end;

You can see a small portion of the grammar that sets up these commands here:

Automation

The high level SAPI objects are also available for control through Automation. The Automation objects themselves are implemented in the Microsoft Voice Commands Automation server along with the high level COM objects. Being Automation objects, their capabilities are described in type libraries.

The Voice Text Object Automation interface is described in the vtxtauto.tlb type library whilst the Voice Command Object Automation interface is described by vcauto.tlb. Both these type libraries can be found in the Windows speech directory.

Note: there is no Automation interface to the Voice Dictation Object.

The speech notifications are still available when using Automation, although they are set up in a different way. Rather than implementing an internal notification sink object and passing that along to a registration method, you must implement a registered Automation object and assign its ProgID to the speech object's Callback string property.

Note: there is an important point about these callback objects regarding application shutdown. The speech Automation object is a client of your voice-enabled application (it instantiates the Automation callback object specified by the ProgID), and this has a consequence when you try to shut your application down.

Because the speech object still holds a reference to your Automation object, a warning dialog is invoked:

You can usually rectify the problem by assigning Unassigned to the Variant that represents the speech object (or objects) in the main form's OnClose event handler.
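
For example (assuming the speech object is held in a Variant field called VTxt):


procedure TForm1.FormClose(Sender: TObject; var Action: TCloseAction);
begin
  //Drop our reference so the speech server can, in turn,
  //drop its reference to our callback object
  VTxt := Unassigned
end;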

An alternative is to prevent the warning dialog from being displayed in the first place. This can be achieved by adding this line to the initialisation section of the unit that implements your Automation callback object (ComServer is declared in the ComServ unit):


ComServer.UIInteractive := False

This is perfectly safe, as the normal destruction of your main form will drop the reference to the speech object, which will then drop the reference to your callback object, allowing it to be destroyed.

Voice Text Automation

Late Bound Automation

To use Automation against the Voice Text Object through a Variant (late bound access) you access it through the ProgID Speech.VoiceText. As with the Voice Text API you must register your application before you can use the TTS functionality.


uses
  ComObj,
...
type
  TfrmVTxtAutoLateBound = class(TForm)
  ...
  private
    VTxt: Variant;
  ...
  end;
...
  VTxt := CreateOleObject('Speech.VoiceText');
  VTxt.Register('', Application.ExeName);

The Automation object can notify you when speaking has started and when it stops through an Automation object that you implement and register. It must implement two parameterless methods: SpeakingStarted and SpeakingDone. You assign its ProgID to the Voice Text object's Callback property.

The sample project VoiceTextAutoVar.dpr in the Automation directory contains an Automation object that implements these methods, and its ProgID is VoiceTextAutoVar.VoiceCallback.


VTxt.Callback := 'VoiceTextAutoVar.VoiceCallback';
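
If you want to build such a callback object yourself, the skeleton below shows the general shape of what Delphi's Automation Object wizard generates. IVoiceCallback and Class_VoiceCallback are assumed to come from your own project's type library, so the names here are illustrative only.


uses
  ComObj, ComServ, VoiceTextAutoVar_TLB; //your project's own type library unit

type
  TVoiceCallback = class(TAutoObject, IVoiceCallback)
  protected
    //The Voice Text object calls these when speech starts and stops
    procedure SpeakingStarted; safecall;
    procedure SpeakingDone; safecall;
  end;
...
initialization
  TAutoObjectFactory.Create(ComServer, TVoiceCallback, Class_VoiceCallback,
    ciMultiInstance, tmApartment);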

The methods available for controlling the speech progress are much the same as with the Voice Text API but there is an additional property, IsSpeaking, which is useful for working out if speech is currently in progress. You can get this information with the Voice Text API, but it involves calling a method of the IVTxtAttributes interface so this Automation object clearly surfaces parts of at least two of the Voice Text API interfaces (IVoiceText and IVTxtAttributes).


procedure TfrmVTxtAutoLateBound.btnPlayClick(Sender: TObject);
begin
  if not BeenPaused then
    VTxt.Speak(memText.Text, 0)
  else
  begin
    VTxt.AudioResume;
    BeenPaused := False
  end
end;

procedure TfrmVTxtAutoLateBound.btnPauseClick(Sender: TObject);
begin
  if VTxt.IsSpeaking then
  begin
    VTxt.AudioPause;
    BeenPaused := True
  end
end;

procedure TfrmVTxtAutoLateBound.btnStopClick(Sender: TObject);
begin
  VTxt.StopSpeaking;
end;

The callback object simply logs messages to the listbox on the main form when its two notification methods are called.

There is another demo of late bound Voice Text Automation in the same directory in the project VoiceTextAutoVarReadWordDoc.dpr. As the name suggests, this sample reads out loud from a Word document. It uses Automation to control Microsoft Word and also to control the Voice Text object.

The demo is inspired by a sample VB application from Reference 1. However the original VB code used the WordBasic Automation interface, which did not work so well with more recent versions of Word, so it has been re-written to use the Word VBA interface. Other changes have also been made.


type
  TfrmVTxtAutoLateBound = class(TForm)
  ...
  private
    VTxt, MSWord: Variant;
  end;
...
procedure TfrmVTxtAutoLateBound.FormCreate(Sender: TObject);
begin
  VTxt := CreateOleObject('Speech.VoiceText');
  VTxt.Register('', Application.ExeName);
  MSWord := CreateOleObject('Word.Application');
end;

procedure TfrmVTxtAutoLateBound.btnReadDocClick(Sender: TObject);
const
// Constants for enum WdUnits
  wdCharacter = $00000001;
  wdParagraph = $00000004;
// Constants for enum WdMovementType
  wdExtend = $00000001;
var
  Moved: Integer;
  Txt: String;
begin
  (Sender as TButton).Enabled := False;
  if dlgOpenDoc.Execute then
  begin
    MSWord.Documents.Open(FileName := dlgOpenDoc.FileName);
    Moved := 2;
    while Moved > 1 do
    begin
      //Select next paragraph
      Moved := MSWord.Selection.EndOf(Unit:=wdParagraph, Extend:=wdExtend);
      if Moved > 1 then
      begin
        MSWord.Selection.Copy;
        Txt := Trim(ClipBoard.AsText);
        if Length(Txt) > 0 then
          VTxt.Speak(pszBuffer := Txt, dwFlags := 0);
        Application.ProcessMessages;
        //Move to start of next paragraph
        MSWord.Selection.MoveRight(Unit:=wdCharacter);
      end
    end;
  end;
  MSWord.ActiveDocument.Close;
  TButton(Sender).Enabled := True;
end;

procedure TfrmVTxtAutoLateBound.btnStopClick(Sender: TObject);
begin
  if VTxt.IsSpeaking then
    VTxt.StopSpeaking
end;

procedure TfrmVTxtAutoLateBound.FormDestroy(Sender: TObject);
begin
  btnStop.Click;
  MSWord.Quit;
  MSWord := Unassigned;
end;

Note: the example uses the clipboard to copy each paragraph from the document to be read. It would be more sensible to simply read the selected text directly, but that causes strange hanging problems with long paragraphs.

Early Bound Automation

To use Automation against the Voice Text Object through interfaces (early bound) requires you to import its type library to get a Pascal representation of all its interfaces and supporting constants and types. In Delphi you do this with Project | Import Type Library..., but since the sought library is not registered you will need to press the Add... button and locate it manually (vtxtauto.tlb in the Windows speech directory).

Pressing the Create Unit button generates a type library import unit called VTxtAuto_TLB.pas.

Note: you might normally press Install... to ensure any generated component wrappers for exposed Automation objects are installed on the Component Palette. However these examples all work with the Automation objects using normal Automation coding and don't make use of component wrapper classes, so there is little point in creating the components (they save you very little).

Ready made packages for Delphi 5, 6 and 7 containing the type library import unit can be found in appropriately named subdirectories under SAPI 4 in the accompanying files.

You access the Automation object using the ClassID CLASS_VTxtAuto_ and the implemented interface is IVTxtAuto. Both the ClassID and the interface are defined in the type library import unit.

The following code comes from the sample project VoiceTextAuto.dpr in the Automation directory.


uses
  VTxtAuto_TLB, ComObj,
...
type
  TfrmVTxtAutoEarlyBound = class(TForm)
  ...
  private
    VTxt: IVTxtAuto;
    BeenPaused: Boolean;
  ...
  end;
...
procedure TfrmVTxtAutoEarlyBound.FormCreate(Sender: TObject);
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  //VTxt := CreateOleObject('Speech.VoiceText') as IVTxtAuto;
  //VTxt := CoVTxtAuto_.Create;
  VTxt := CreateComObject(CLASS_VTxtAuto_) as IVTxtAuto;
  VTxt.Register('', Application.ExeName);
  //The callback object specified by the ProgID
  //below implements the notification interface
  VTxt.Callback := 'VoiceTextAuto.VoiceCallback';
end;

Note: as well as using CreateComObject and passing the ClassID (which requires you to query for the appropriate interface) the code above shows two alternatives. You can use the helper class, CoVTxtAuto_, defined in the type library import unit. This does exactly the same as the code being used (but involves less typing). Alternatively you can call CreateOleObject, passing the ProgID, and query for the interface.

The code for the buttons and the callback object is just the same as for the late bound version.

Speaking Dialogs

As an example of automating the Voice Text API you can make all your VCL dialogs talk to you using this small piece of code.


var
  Voice: Variant;

procedure TForm1.FormCreate(Sender: TObject);
begin
  Screen.OnActiveFormChange := ScreenFormChange;
end;

procedure TForm1.ReadVCLDialog(Form: TCustomForm);
var
  I: Integer;
  ButtonCaptions, LabelCaption, DialogText: string;
begin
  try
    if VarType(Voice) <> varDispatch then
    begin
      Voice := CreateOleObject('Speech.VoiceText');
      Voice.Register('', Application.ExeName);
    end;
    for I := 0 to Form.ComponentCount - 1 do
      if Form.Components[I] is TLabel then
        LabelCaption := TLabel(Form.Components[I]).Caption
      else
        if Form.Components[I] is TButton then
          ButtonCaptions := Format('%s%s, ',
            [ButtonCaptions, TButton(Form.Components[I]).Caption]);
    ButtonCaptions := StringReplace(ButtonCaptions,'&','', [rfReplaceAll]);
    DialogText := Format('%s.%s%s.%s%s',
      [Form.Caption, sLineBreak, LabelCaption, sLineBreak, ButtonCaptions]);
    Memo1.Text := DialogText;
    Voice.Speak(DialogText, 0)
  except
    //pretend everything is okay
  end
end;

procedure TForm1.ScreenFormChange(Sender: TObject);
begin
  if Assigned(Screen.ActiveForm) and
     (Screen.ActiveForm.ClassName = 'TMessageForm') then
    ReadVCLDialog(Screen.ActiveForm)
end;

The form's OnCreate event handler sets up an OnActiveFormChange event handler for the screen object. This is triggered each time a new form is displayed, which includes VCL dialogs. Any call to ShowMessage, MessageDlg or related routines causes a TMessageForm to be displayed so the code checks for this. If the form type is found, a textual version of what's on the dialog is built up and then spoken through the Voice Text API Automation component.

A statement such as:


MessageDlg('Save changes?', mtConfirmation, mbYesNoCancel, 0)

causes the ReadVCLDialog routine to build up and say this text:


Confirm.
Save changes?.
Yes, No, Cancel,

Notice the full stops at the end of each line to briefly pause the speech engine at that point before moving on.

Voice Command Automation

Late Bound Automation

To use Automation against the Voice Command Object through a Variant (late bound access) you access it through the ProgID Speech.VoiceCommand. A sample project called VoiceCommandAutoVar.dpr can be found in the Automation directory.

This example project shows how you can take advantage of the similarities between voice command menus and Windows menus by allowing voice control of your menu system. The COM example above, by comparison, showed how to define arbitrary commands.

Just as with the Voice Text Automation object, the Voice Command Automation object is instantiated and then the SR state is examined. You are not given the global enabled state when using Automation, but you do have access to the awake state through the Awake property, although there is no notification to tell you when it changes (you can use a timer to keep track of changes if needed).


type
  TfrmVoiceCommandAutomation = class(TForm)
  ...
  private
    VCmd, VMenu: Variant;
  ...
  end;
...
procedure TfrmVoiceCommandAutomation.FormCreate(Sender: TObject);
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  VCmd := CreateOleObject('Speech.VoiceCommand');
  chkAwake.Checked := VCmd.Awake;
  VCmd.Register('');
  VCmd.Callback := 'VoiceCommandAutoVar.ListenCallback';
  CreateCommandMenu;
end;

procedure TfrmVoiceCommandAutomation.chkAwakeClick(Sender: TObject);
begin
  VCmd.Awake := chkAwake.Checked;
end;

The next step is to register for use of Voice Commands on the default site. Whilst this example uses a callback, you don't strictly need to in order to respond to voice commands. Instead you can regularly check the VCmd.CommandSpoken property. If a command has been recognised this will give you the command ID, otherwise it returns 0.
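
For illustration, a polling version might look like this sketch (assuming a TTimer called tmrPoll has been dropped on the form):


procedure TfrmVoiceCommandAutomation.tmrPollTimer(Sender: TObject);
var
  ID: Integer;
begin
  //CommandSpoken yields the ID of a recognised command, or 0
  ID := VCmd.CommandSpoken;
  if ID <> 0 then
    Perform(WM_COMMAND, ID, 0)
end;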

In this case a callback object is set up by assigning the relevant ProgID to the Callback property. The Automation object chosen as the callback will receive two notifications and so must implement two methods declared as follows:


procedure CommandRecognize(const sCommand: WideString; ID: Integer); safecall;
procedure CommandOther(const sCommand, sApp, sState: WideString); safecall;

The important method here is CommandRecognize, which is triggered when one of this application's commands is recognised.

Back to the OnCreate event handler now; this goes on to call a helper routine, CreateCommandMenu, to set up the command menu. You will see that the Automation interface makes it much simpler to add commands than with the COM interface; just pass the information directly to a single method (you cannot request command verification through the Automation interface). You are able to set up list commands using the ListSet method but the callback doesn't give you any specific information about them.

CreateCommandMenu creates a temporary Voice Menu and then adds in commands corresponding to each menu item on the main form's main menu. It is careful to recurse down menus and submenus and only adds commands for proper menu items (it ignores separators and submenus); however, it pays no attention to whether menu items are enabled or not.

For a menu item such as Help | About..., the command string in the Voice Menu is set to Help About. This means any menu can be invoked by reading out the path through the menu hierarchy necessary to reach it.


procedure TfrmVoiceCommandAutomation.CreateCommandMenu;

  procedure AddMenuCommands(Item: TMenuItem; const ParentPath: String);
  var
    I: Integer;
    Path: String;
  begin
    Path := ParentPath + StripHotKey(Item.Caption);
    //Recurse through subitems, if any
    if (Item.Count > 0) then
      for I := 0 to Item.Count - 1 do
        AddMenuCommands(Item.Items[I], Path + #32)
    //Otherwise add this item, if appropriate
    else
      if (Item.Caption <> '') and (Item.Caption <> '-') then
        VMenu.Add(Item.Command, Path, ' ', Item.Hint);
  end;

begin
  //Must pass non-blank strings for all WideString parameters
  VMenu := VCmd.MenuCreate(ExtractFileName(Application.ExeName),
    'Main Menu', LANG_ENGLISH, ' ', vcmdmc_CREATE_TEMP);
  AddMenuCommands(Menu.Items, '');
  VMenu.hWndMenu := Handle;
  VMenu.Active := True;
end;

Note: these Automation methods are very sensitive to blank strings. The menu is created with a LANG_ENGLISH parameter to specify the language and the following parameter is an optional string you can use to specify the dialect. If you pass a blank string the Voice Command object will throw an exception so it is important to pass non-blank strings for these string parameters. The same can be seen with the call to add a command to the menu; the category parameter is omitted but a non-empty string is passed to keep things working smoothly.

The command ID for each Voice Menu command is extracted from the menu item's Command property, which is something we can use to trigger the menu item via a message if the command is recognised. The description is taken from the menu item's Hint property.

Once all menu items have had commands added for them the menu is told to restrict itself to the current form (in other words it is only active if the form has focus) and then activated.

The important code now is in the callback object's CommandRecognize method. As you can see, if a command is recognised a message is sent to the form to emulate a menu selection (as well as log the command in the message log).


procedure TListenCallback.CommandRecognize(const sCommand: WideString;
  ID: Integer);
begin
  frmVoiceCommandAutomation.Log('Our command: %s, id: %d', [sCommand, ID]);
  frmVoiceCommandAutomation.Perform(WM_COMMAND, ID, 0)
end;

The menu items on this form allow the user to close the application, minimise, maximise and restore the form, clear the message log, change the form colour and invoke an About dialog. You can check the project for their implementations but the following screenshot shows the application after a few commands have been spoken.

The last job for the OnCreate handler is to use the Voice Command Object's Awake property to set up the checkbox that tells the user if speech recognition is enabled or not. The checkbox OnClick event handler toggles this property for full user control.
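
In the late bound version VCmd is held in an OleVariant, so the property access is resolved at run time. A sketch of what the handler might look like, assuming the checkbox is named chkAwake as in the early bound version shown below:


procedure TfrmVoiceCommandAutomation.chkAwakeClick(Sender: TObject);
begin
  //Late bound: the Awake property access is resolved at run time
  VCmd.Awake := chkAwake.Checked;
end;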

Early Bound Automation

Using Automation against the Voice Command Object through interfaces (early bound) requires you to import its type library, vcauto.tlb, from the Windows speech directory. This generates a type library import unit called VCmdAuto_TLB.pas.

Ready made packages for Delphi 5 and Delphi 6 containing the type library import unit can be found in appropriately named subdirectories under SAPI 4 in the accompanying files.

You access the Automation object using the ClassID CLASS_VCmdAuto_ and the implemented interface is IVCmdAuto. Both the ClassID and the interface are defined in the type library import unit. Much as with the COM APIs, commands are set up using a menu and the IVMenuAuto interface supports this.

A sample project called VoiceCommandAuto.dpr can be found in the Automation directory. The logic is much the same as for the late bound version, other than the types used to access the Voice Command and Voice Menu objects.


uses
  VCmdAuto_TLB, ...
type
  TfrmVoiceCommandAutomation = class(TForm)
    ...
  private
    VCmd: IVCmdAuto;
    VMenu: IVMenuAuto;
    ...
  end;
...
procedure TfrmVoiceCommandAutomation.FormCreate(Sender: TObject);
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  //VCmd := CreateOleObject('Speech.VoiceCommand') as IVCmdAuto;
  //VCmd := CoVCmdAuto_.Create;
  VCmd := CreateComObject(CLASS_VCmdAuto_) as IVCmdAuto;
  VCmd.Register('');
  VCmd.Callback := 'VoiceCommandAuto.ListenCallback';
  CreateCommandMenu;
  chkAwake.Checked := VCmd.Awake;
end;

procedure TfrmVoiceCommandAutomation.chkAwakeClick(Sender: TObject);
begin
  VCmd.Set_Awake(chkAwake.Checked);
end;

Note: as well as using CreateComObject and passing the ClassID (which requires you to query for the appropriate interface), the code above shows two alternatives in the commented-out lines. You can use the helper class, CoVCmdAuto_, defined in the type library import unit; this does exactly the same as the code being used, but involves less typing. Alternatively you can call CreateOleObject, passing the ProgID, and query for the interface. Other than that, the rest of the code is the same as in the late bound Automation example.

ActiveX

When SAPI 4 is installed it performs a two-step installation. First it installs all the normal COM/Automation support along with the help files and so on. When that's done it installs the ActiveX controls that can simplify the process of building SAPI applications. The following sections look at how these ActiveX controls can be used.

Ready made packages for Delphi 5 and Delphi 6 containing the ActiveX units can be found in appropriately named subdirectories under SAPI 4 in the accompanying files.

TextToSpeech Control

The Microsoft TextToSpeech control is an ActiveX that wraps up the high level Voice Text API (and does more besides as we shall see). To use it you must first import the ActiveX into Delphi with Component | Import ActiveX... This presents you with a list of all the registered ActiveX controls on your system. The one you are looking for is described as Microsoft Voice Text (Version 1.0).

Pressing Install... will take you through the process of adding the generated type library import unit (HTTSLib_TLB.pas) to a package and having it compiled and installed. The import unit contains the Object Pascal component wrapper for the ActiveX, which is called TTextToSpeech and this component will by default be installed on the ActiveX page of the Component Palette.

The ActiveX is implemented in Vtext.dll in the Windows speech directory (whose version information describes it as the High-Level Text To Speech Module) and the primary interface implemented is ITextToSpeech.

You can programmatically work with this ActiveX control using the ProgID TextToSpeech.TextToSpeech or the ClassID CLASS_TextToSpeech from the HTTSLib_TLB unit. The Windows registry describes this class as the TextToSpeech Class.
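
If you prefer not to drop the control on a form at design time, the wrapper component can also be constructed at run time. The following is a minimal sketch; the form and method names are illustrative:


uses
  HTTSLib_TLB;

procedure TForm1.CreateTTSControl;
var
  TTS: TTextToSpeech;
begin
  //Construct the ActiveX wrapper at run time; like any windowed
  //control it needs a parent before it can be used
  TTS := TTextToSpeech.Create(Self);
  TTS.Parent := Self;
  TTS.Speak('Hello from a dynamically created control');
end;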

Normally you will want to simply place the ActiveX on a form for use and that is what the sample project VoiceTextControl.dpr in the ActiveX directory does. The pleasant surprise you get when you place the ActiveX component on the form is that it shows up as a colourful mouth.

When the ActiveX is asked to speak, the mouth animates in sync with the spoken phonemes. The effect is rather difficult to get across through the written word and in screenshots so I encourage you to try using this ActiveX control. It's pretty cool (and saves you the trouble of working out how to do it yourself)!

In this project the play, pause and stop buttons do much the same as before, although the methods they call have slightly different names. Just as with the Automation interface, the ActiveX surfaces parts of the IVoiceText and IVTxtAttributes interfaces from the Voice Text API and also the IVTxtDialogs interface (methods are exposed to invoke the four dialogs).


procedure TfrmTextToSpeechControl.btnPlayClick(Sender: TObject);
begin
  if not BeenPaused then
    TextToSpeech.Speak(memText.Text)
  else
  begin
    TextToSpeech.Resume;
    BeenPaused := False
  end
end;

procedure TfrmTextToSpeechControl.btnPauseClick(Sender: TObject);
begin
  if Bool(TextToSpeech.IsSpeaking) then
  begin
    TextToSpeech.Pause;
    BeenPaused := True
  end
end;

procedure TfrmTextToSpeechControl.btnStopClick(Sender: TObject);
begin
  TextToSpeech.StopSpeaking;
end;

procedure TfrmTextToSpeechControl.btnAboutClick(Sender: TObject);
begin
  TextToSpeech.AboutDlg(Handle, '');
end;

procedure TfrmTextToSpeechControl.btnGeneralClick(Sender: TObject);
begin
  TextToSpeech.GeneralDlg(Handle, '');
end;

procedure TfrmTextToSpeechControl.btnLexiconClick(Sender: TObject);
begin
  TextToSpeech.LexiconDlg(Handle, '');
end;

procedure TfrmTextToSpeechControl.btnTranslateClick(Sender: TObject);
begin
  TextToSpeech.TranslateDlg(Handle, '');
end;

The ActiveX also delivers the same notifications as on offer in the IVTxtNotifySink interface, although responding to them is now a no-brainer thanks to them being exposed as normal Delphi events.


procedure TfrmTextToSpeechControl.TextToSpeechAttribChanged(Sender: TObject;
  attrib: Integer);
var
  S: String;
begin
  case attrib of
    TTSNSAC_REALTIME : S := 'Realtime';
    TTSNSAC_PITCH    : S := 'Pitch';
    TTSNSAC_SPEED    : S := 'Speed';
    TTSNSAC_VOLUME   : S := 'Volume';
  else
    S := 'Unknown attribute'; //Guard against an unexpected value leaving S unset
  end;
  Log('OnAttribChanged: %s changed', [S]);
end;

procedure TfrmTextToSpeechControl.TextToSpeechSpeak(Sender: TObject; const Text,
  App: WideString; thetype: Integer);
begin
  Log('OnSpeak');
  memEnginePhonemes.Clear
end;

procedure TfrmTextToSpeechControl.TextToSpeechSpeakingStarted(Sender: TObject);
begin
  Log('OnSpeakingStarted');
end;

procedure TfrmTextToSpeechControl.TextToSpeechSpeakingDone(Sender: TObject);
begin
  Log('OnSpeakingDone')
end;

procedure TfrmTextToSpeechControl.TextToSpeechVisual(Sender: TObject;
  Phoneme, EnginePhoneme: Smallint; hints: Integer; MouthHeight,
  bMouthWidth, bMouthUpturn, bJawOpen, TeethUpperVisible,
  TeethLowerVisible, TonguePosn, LipTension: Smallint);
var
  Hint: String;
begin
  Hint := '';
  if hints <> 0 then
  begin
    if hints and TTSNSHINT_QUESTION <> 0 then
      Hint := 'Question ';
    if hints and TTSNSHINT_STATEMENT <> 0 then
      Hint := Hint + 'Statement ';
    if hints and TTSNSHINT_COMMAND <> 0 then
      Hint := Hint + 'Command ';
    if hints and TTSNSHINT_EXCLAMATION <> 0 then
      Hint := Hint + 'Exclamation ';
    if hints and TTSNSHINT_EMPHASIS <> 0 then
      Hint := Hint + 'Emphasis';
  end
  else
    Hint := 'none';
  Log('OnVisual: hint = %s', [Hint]);
  if Char(EnginePhoneme) <> #32 then
    memEnginePhonemes.Text := memEnginePhonemes.Text + Char(EnginePhoneme)
end;

Note that the OnVisual event offers plenty of information that the ActiveX's own mouth animation already uses. The main use I can see for it is if you were to hide the ActiveX and perform your own animation instead.
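
For example, if you did hide the control you might drive your own animation from the mouth metrics. The following sketch paints a simple ellipse scaled by the supplied values onto a hypothetical TPaintBox called pbMouth, and would be called from the OnVisual handler:


procedure TfrmTextToSpeechControl.DrawMouth(MouthHeight, MouthWidth: Smallint);
var
  CX, CY: Integer;
begin
  //pbMouth is a hypothetical TPaintBox; scale an ellipse from the
  //mouth metrics supplied by the OnVisual event
  CX := pbMouth.Width div 2;
  CY := pbMouth.Height div 2;
  with pbMouth.Canvas do
  begin
    Brush.Color := clBtnFace;
    FillRect(pbMouth.ClientRect);
    Brush.Color := clRed;
    Ellipse(CX - MouthWidth, CY - MouthHeight, CX + MouthWidth, CY + MouthHeight);
  end;
end;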

Voice Commands Control

The Microsoft Voice Commands control is an ActiveX that wraps up the high level Voice Command API. To use it you must first import the ActiveX into Delphi; you will find it described as Microsoft Voice Commands (Version 1.0).

This will generate and install a type library import unit called HSRLib_TLB.pas. The import unit contains the ActiveX component wrapper class called TVcommand.

The ActiveX is implemented in Xcommand.dll in the Windows speech directory (whose version information describes it as the High-Level Speech Recognition Module) and the primary interface implemented is IVcommand. However it also surfaces parts of other interfaces such as IVCmdAttributes and IVCmdDialogs.

You can programmatically work with this ActiveX control using the ProgID Vcommand.Vcommand or the ClassID CLASS_Vcommand from the HSRLib_TLB unit. The Windows registry describes this class as the VCommand Class.

Alternatively (and more typically) you can simply drop the ActiveX component on a form. This is done in the sample project VoiceCommandsControl.dpr in the ActiveX directory. This project does much the same job as the Voice Command Automation examples (although we have access to both the enabled and awake states with the ActiveX):


procedure TfrmVoiceCommandsControl.FormCreate(Sender: TObject);
begin
  Vcommand.Initialized := Integer(True);
  chkEnabled.Checked := Bool(Vcommand.Enabled);
  chkAwake.Checked := Bool(Vcommand.AwakeState);
  CreateCommandMenu;
end;

procedure TfrmVoiceCommandsControl.chkEnabledClick(Sender: TObject);
begin
  Vcommand.Enabled := Integer(chkEnabled.Checked);
end;

procedure TfrmVoiceCommandsControl.chkAwakeClick(Sender: TObject);
begin
  Vcommand.AwakeState := Integer(chkAwake.Checked);
end;

procedure TfrmVoiceCommandsControl.CreateCommandMenu;

  procedure AddMenuCommands(Item: TMenuItem; const ParentPath: String);
  var
    I: Integer;
    Path: String;
  begin
    Path := ParentPath + StripHotKey(Item.Caption);
    //Recurse through subitems, if any
    if (Item.Count > 0) then
      for I := 0 to Item.Count - 1 do
        AddMenuCommands(Item.Items[I], Path + #32)
    //Otherwise add this item, if appropriate
    else
      if (Item.Caption <> '') and (Item.Caption <> '-') then
        //Must pass non-blank strings for all WideString parameters
        Vcommand.AddCommand(CmdMenu, Item.Command, Path, Item.Hint,
          'Menu', VCMDCMD_CANTRENAME, ' ');
  end;

begin
  CmdMenu := Vcommand.MenuCreate[ExtractFileName(Application.ExeName),
    'Main', VCMDMC_CREATE_TEMP];
  AddMenuCommands(Menu.Items, '');
  Vcommand.Activate(CmdMenu);
end;

The call to Vcommand.AddCommand passes a command flag as the penultimate parameter. The VCMDCMD_CANTRENAME flag means that this command cannot be renamed by applications that let users customise Voice Commands (such as Microsoft Voice). Since none of these commands require verification, the VCMDCMD_VERIFY flag is not passed.

Note: you can set up list commands with the ActiveX control, but when a list command is interpreted the OnCommandRecognize event will not tell you which list item was said (the ListValues parameter will be blank). However, the NumLists parameter is set up correctly to tell you the number of bytes taken up by the list item. Since the Voice Commands Control implements the Unicode notification interface, you must divide this by SizeOf(WideChar) and then subtract one (for the null terminator) to get the number of characters. If your list command is simple enough this may be enough to work out which list item was spoken.
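
In other words, inside the OnCommandRecognize event handler the character count can be recovered with a little arithmetic. A sketch of the relevant fragment, using the NumLists parameter described above:


var
  CharCount: Integer;
begin
  //NumLists reports the list item's size in bytes, including the null
  //terminator, so divide by SizeOf(WideChar) and drop the terminator
  CharCount := NumLists div SizeOf(WideChar) - 1;
  Log('Spoken list item was %d characters long', [CharCount]);
end;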

The ActiveX control sends more notifications so we have the chance to reinstate the VU meter this time.


procedure TfrmVoiceCommandsControl.VcommandVUMeter(Sender: TObject; Level: Integer);
begin
  ProgressBar.Position := Level;
  lblVU.Caption := IntToStr(Level);
end;

Note: there appears to be no way to specify whether the Voice Menus should be local or global. Since this Voice Menu should be local, the program itself disables the menu when the application loses focus and re-enables it when the application gains focus. To do this it uses a TApplicationEvents component (a component that surfaces the Application object's events to the Object Inspector) and its OnActivate and OnDeactivate event handlers.


procedure TfrmVoiceCommandsControl.ApplicationEvents1Activate(
  Sender: TObject);
begin
  Vcommand.Activate(CmdMenu);
end;

procedure TfrmVoiceCommandsControl.ApplicationEvents1Deactivate(
  Sender: TObject);
begin
  Vcommand.Deactivate(CmdMenu);
end;

Voice Dictation Control

The Microsoft Voice Dictation control is an ActiveX that wraps up the high level Voice Dictation API. To use it you must first import the ActiveX into Delphi; you will find it described as Microsoft Voice Dictation (Version 1.0).

This will generate and install a type library import unit called DICTLib_TLB.pas. The import unit contains the ActiveX component wrapper class called TVdict.

The ActiveX is implemented in Vdict.dll in the Windows speech directory (whose version information describes it as the Voice Dictation Module) and the primary interface implemented is IVdict. However it also surfaces parts of other interfaces such as IVDctAttributes and IVDctDialogs.

You can programmatically work with this ActiveX control using the ProgID Vdict.Vdict or the ClassID CLASS_Vdict from the DICTLib_TLB unit. The Windows registry describes this class as the Vdict Class.

Alternatively (and more typically) you can simply drop the ActiveX component on a form. A sample project using this control can be found in the accompanying files, VoiceDictationControl.dpr in the ActiveX directory. It offers the same functionality as the Voice Dictation API sample from earlier.

Note: until I had used the Voice Dictation API successfully this control caused Delphi to hang as soon as the control was dropped on a form. It starts up a copy of the Microsoft Voice Commands Automation server and then both that and the Delphi IDE start consuming apparently endless chunks of virtual memory. I never got to the bottom of why this was, but it no longer happens.

Note: the OnPhraseStart event is not triggered with this control (just as it was not with the Voice Command API).

Speech Recognition Troubleshooting

If speech recognition stops (or fails to start) unexpectedly, or you encounter other strange SR behaviour, check that your recording settings have the microphone enabled.

SAPI 4 Deployment

When distributing SAPI 4 applications you will need to supply the redistributable components (available as spchapi.exe from http://www.microsoft.com/speech/download/old). It would be advisable to also deploy the Speech Control Panel application (available as spchcpl.exe from http://www.microsoft.com/msagent/downloads.htm), however this Control Panel applet will not install on any version of Windows later than Windows 2000.

The Microsoft SAPI 4 compliant TTS engine can be downloaded from various sites (although not Microsoft's), such as http://misterhouse.net:81/public/speech or http://www.cs.cofc.edu/~manaris/SUITEKeys.

As well as the Microsoft TTS engine, you can also download additional TTS engines from Lernout & Hauspie (including one with a British English voice) from http://www.microsoft.com/msagent/downloads.htm. Note that if you plan to use any of these engines from applications running under user accounts without administrative privileges, you need to do some registry tweaking, as described at http://www.microsoft.com/msagent/detail/tts3000deploy.htm.

You can download the Microsoft Speech Recognition engine for use with SAPI 4 from http://www.microsoft.com/msagent/downloads.htm.

References/Further Reading

The following is a list of useful articles and papers that I found on SAPI 4 development during my research on this subject.

  1. Using Microsoft OLE Automation Servers to Develop Solutions by Ken Lassesen, MSDN Office Development (General) Technical Articles, October 1995.
    This shows VB Automation against the Voice Text Object and Voice Command Object (no callbacks used).
  2. A High-Level Look at Text-to-Speech via the Microsoft Voice Text Object by Robert Coleridge, MSDN Windows User Interface Technical Articles, October 1995.
    Shows a VB example of using Automation against Voice Text Object.
  3. An Overview of the Microsoft Speech API by Mike Rozak, November 1998.
    Looks briefly at the high level and low level SR and TTS interfaces in the SAPI 4 SDK.
  4. Talk to Your Computer and Have It Answer Back with the Microsoft Speech API by Mike Rozak, Microsoft Systems Journal, January 1996.
    Uses the Voice Command and Voice Text APIs to implement a clock that tells the time when asked.
  5. Making Delphi Talk: Using Speech Technology with your Delphi Apps by Glenn Stephens, Unofficial Newsletter of Delphi Users, January 1999.
    Uses late-bound Automation against the Voice Text Object available through the Speech.VoiceText ProgID, including setting up the callback object.
  6. Making Delphi Listen by Glenn Stephens, Unofficial Newsletter of Delphi Users, January 1999.
    Uses early-bound Automation against the Voice Command Object to implement command and control.

About Brian Long

Brian Long used to work at Borland UK, performing a number of duties including Technical Support on all the programming tools. Since leaving in 1995, Brian has spent the intervening years as a trainer, trouble-shooter and mentor focusing on the use of the C#, Delphi and C++ languages, and of the Win32 and .NET platforms. In his spare time Brian actively researches and employs strategies for the convenient identification, isolation and removal of malware. If you need training in these areas or need solutions to problems you have with them, please get in touch or visit Brian's Web site.

Brian authored a Borland Pascal problem-solving book in 1994 and occasionally acts as a Technical Editor for Wiley (previously Sybex); he was the Technical Editor for Mastering Delphi 7 and Mastering Delphi 2005 and also contributed a chapter to Delphi for .NET Developer Guide. Brian is a regular columnist in The Delphi Magazine and has had numerous articles published in Developer's Review, Computing, Delphi Developer's Journal and EXE Magazine. He was nominated for the Spirit of Delphi award in 2000.


Go to the speech capabilities overview

Go back to the top of this SAPI 4 High Level Interfaces article

Go to the SAPI 4 Low Level Interfaces article

Go to the SAPI 5.1 article