C/C++

Speech Synthesis in C++

By Neil G. Rowland, August 01, 1994

Neil presents a C++ class library for speech synthesis using the Windows 3.1 Multimedia API. With this library, you can write a Windows app that generates speech on any MPC-compatible sound card.

AUG94: Speech Synthesis in C++

A library for generating speech under Windows 3.1

Neil is a programmer at Gradient Technologies. He can be reached on either CompuServe at 72133,426, the Internet as [email protected], or Channel One as Neil Rowland.

You'd think that because human speech is made up of discrete sounds ("phonemes"), all you'd have to do to write a speech-synthesis application is translate text into a string of phonemes, then output each phoneme from a table of sampled waveforms. In practice, however, programs that use this approach sound terrible and are nearly impossible to understand. Obviously, there's much more to speech than just phonemes. Inflection (or change in pitch) of an accented syllable or the end of a sentence is another factor that affects the quality of speech. Speech also involves coarticulation--the short interval of time in which sound gradually changes from one phoneme to another--such that there is no sharp dividing line between one sound and the next.

This article presents a C++ class library for speech synthesis using the Windows 3.1 Multimedia API. Every step in the synthesis process is represented, from parsing English text to the calls into the sound driver. With this library, you can write a Windows app that generates speech on any MPC-compatible sound card.

Coarticulation on the Cheap

The usual method of coarticulation is to store a sample (a "diphone") of every possible pair of sounds and the transition between them. This approach yields good results, but at a price. For one thing, it requires a lot of samples, which means a great deal of storage space. Inflection is also an issue. If you simply sampled the waveform for a phoneme, you could change the pitch by varying the playback rate of the stored waveform. The rate of speech can be regulated by changing how long you repeat the pattern before going on to the next phoneme. Unfortunately, there's no way to separate these two in a stored diphone. As you increase the rate of playback, both the pitch and rate of speech increase. Compounding this problem, each sound in the diphone may be at a different pitch. For instance, consider a word where the first sound is in a stressed syllable and the second is in a nonstressed syllable occurring just after the stressed syllable. A stored diphone can't easily change pitch as it's playing back without complicating the timing issues.

The approach I present here is a cheaper solution. While the results aren't quite as smooth as those you get with diphones, the overhead is much more reasonable. I start with a waveform sample for each phoneme. When playing a phoneme, I play back the sample repeatedly, at the appropriate rate for the desired pitch, until the desired time has elapsed. This way, pitch and rate of speech are individually controllable. For the transition between each phoneme, I insert an in-between sound--an existing phoneme that sounds halfway between the two phonemes--instead of sampling and storing a table of all possible in-between sounds. In other words, I'm coarticulating. If the two phonemes are of a different pitch, I output the bridge waveform at a pitch that is the average of these two pitches.

Synthesizing Speech

I've incorporated my coarticulation method into a class library which will input a buffer of ASCII English text and output intelligible speech. I've divided the sounds into three categories, each of which is handled differently. The first category, "tonals," contains all vowel sounds (and some we don't ordinarily think of as vowels). A tonal has a definite pitch, and the particular pitch it's played back at depends on the inflection of the voice. The most obvious coarticulation occurs between two tonals.

The speech library implements two kinds of inflection--syllable inflection and phrase inflection. The former is an accent or stress on a particular syllable, such as the "a" in "tomato," while the latter is the overall pitch pattern of a sentence (question, exclamation, or ordinary sentence).

The other two types of sounds, "percussive" and "atonal," are much simpler than tonal sounds since they don't have any pitch. I don't implement coarticulation with these because it would be more difficult. Since they're less common than the tonals, the cost-to-benefit ratio is much less favorable. A percussive is a short sound, like p or t. An atonal is any sound that has no tone, but may have an arbitrary duration, such as s or f. How long an atonal lasts depends on the speed at which the speaker is speaking. A percussive is always very short, no matter what the rate of speech.

The four layers of the code are encapsulated into respective classes; see Listings One and Two . These classes are unusual because they each can have only one instance--there's no need for more than one. Therefore, I declare all members and methods static, which results in smaller and more efficient object code. The top level, SPEECH_READER, takes an ASCII text, translates it into a phoneme string and inflection data, and passes this to the next level, SPEECH_PHRASER. SPEECH_PHRASER parses the phoneme string and applies the inflection and coarticulation, using primitives to generate each sound individually. These primitives are entry points to the third layer, SPEECH_SOUNDER, which synthesizes a given moment of speech. The samples for all the phonemes are in this module. WAVEPLAYER, available electronically (see "Availability," page 3), is a front end to the Windows waveform API.

SPEECH_SOUNDER

SPEECH_SOUNDER, the low-level class of the speech-synthesis engine, contains all the sample tables for the various sounds and is rather large. It's split among several modules, including SPEECLIB.CPP (see Listing Three), SPEEATON.CPP (for atonals), SPEECTON.CPP (tonals), and SPEEPERC.CPP (percussives), all of which are available electronically. The application can set SPEECH_SOUNDER's speech rate by changing the value of static member svoicespeed. SPEECH_SOUNDER uses a quantum unit of time called a "tick," and svoicespeed controls the length of the tick. A tick is svoicespeed/11025 seconds long.

The sTonal() method, which plays all tonal sounds, takes three parameters: _code is an index into a table of sampled tonal sounds (see Table 1); _ticks is the duration of the sound in ticks; and _pitch is the pitch of the sound, where a higher number means a higher pitch. Note that SPEECH_SOUNDER has no knowledge of coarticulation or inflection. These are implemented in SPEECH_PHRASER. SPEECH_SOUNDER's sTonal() method repeatedly plays the sample waveform, until the time specified in ticks expires. However, it doesn't play the waveform samples consecutively. It may skip or repeat samples as needed to give the desired pitch.

The sAtonal() method plays all atonal sounds. It takes _code, an index into a table of atonals, and _tick, the duration. Like sTonal, it repeats the sampled waveform until the time is up. It's much simpler than sTonal because it doesn't have to adjust the pitch and can spit out the samples consecutively.

The sPercussive() method takes only _code as an index into a table of percussives. Since a percussive is of short duration, sPercussive() plays through the waveform only once. sSilence() inserts a length of silence (that occurs between words or sentences). It takes one parameter, _tick, to indicate the duration. sFlush() flushes the output stage (WAVEPLAYER) to prevent buffer overflow. Because some sound drivers pause between buffers introducing an unwanted silence, it's best to call this only when you might want a silence. For example, sSilence calls sFlush. sFlushMaybe() will flush the buffer if it is reasonably close to full. Call this whenever an unwanted silence would not be objectionable, such as between words. Note that real speech doesn't usually have a full silence between words, but it doesn't hurt to have one.

SPEECH_PHRASER

SPEECH_PHRASER accepts a string of phonemes and a phrase-inflection code. It parses the string, speaks the phonemes by means of SPEECH_SOUNDER, and implements coarticulation. SPEECH_PHRASER::sPhrase() is the main method for this module. The other methods, sTonal(), sAtonal(), sPercussive(), and sSilence overload the corresponding SPEECH_SOUNDER methods. They implement the pitch changes that go with inflection, and sTonal() also sticks the coarticulation in between sounds, all in a way largely transparent to sPhrase.

sPhrase() takes a pointer to the phoneme string and an inflection code. Each phoneme in the string begins with an uppercase letter, and may also have a second, lowercase letter (see Table 1). A caret (^) before a phoneme means to stress that sound, by raising its pitch slightly (syllable inflection). The inflection code is a single ASCII character indicating how to inflect the phrase as a whole. It is a punctuation mark and is interpreted just like punctuation marks in written English. For example, a question mark gives a rising inflection at the end of the phrase, and a period gives a falling inflection. An exclamation mark raises the overall pitch at which the phrase is spoken. A comma, which is for clauses in a sentence, gives no inflection, but does put a pause at the end of the phrase.

sPhrase stores the inflection in the static member sinflection and sets up scount for the convenience of sTonal(). scount is a rough estimate of how close SPEECH_PHRASER is to the end of the phrase. Finally, sPhrase() applies the syllable accent to a long tonal (vowel), by adjusting the pitch argument it passes to sTonal. The #defines TSHORT and TLONG are the number of ticks for a short and long tonal, respectively. A long tonal is a vowel. A short tonal is a sound normally thought of as a consonant, such as the "r" sound.

sTonal() is where the work of inflection and coarticulation goes on. It derives the right inflection for the current tick from sinflection and scount. It uses svoicepitch as a baseline pitch and applies inflection to it to derive the pitch that it passes on to its counterpart routine in SPEECH_SOUNDER. sTonal() does coarticulation by substituting the in-between sound for the first tick in every call. It also uses a pitch that is halfway between the pitch of the previous tonal tick and the pitch it would otherwise use for this tick. sprevtonal always has the code of the tonal that the previous call to sTonal() spoke, and sprevpitch has its pitch. coart[] is the lookup table, indexed by sprevtonal and the currently passed tonal code.

When two percussives occur consecutively, like the s and t in the word stop, it's difficult to distinguish them unless there's a brief space in between them. SPEECH_PHRASER's sPercussive() method uses the fPrevwasperc flag to catch this and insert the brief space when needed.

SPEECH_READER

SPEECH_READER is the highest-level module of the library. It takes a string of ASCII text, translates it into phoneme strings, and feeds it to SPEECH_PHRASER to be said out loud. Its code is in the module SPEEREAD.CPP (see "Availability," page 3). The entry point sSayText() parses the input text into clauses by looking for the punctuation marks that mark the end of a clause. It then feeds each clause in turn to sText2Phonemes(), which translates the clause into a phoneme string. Finally, it feeds the phoneme string to SPEECH_PHRASER and loops back for the next clause.

sText2Phonemes() first preprocesses the text to get rid of unpronounceable punctuation and lower case. It also pads the beginning and end of the string with spaces to help match against patterns with spaces. The preprocessor stashes its result into pszTemp. Then we do pattern substitution on pszTemp. HardRules[] is the table of pronunciation rules. You go through this from beginning to end, looking for matches and performing substitutions. To ensure correct precedence, this table should be sorted with long patterns first. Many patterns have spaces in them to accommodate the beginnings and ends of words.

WAVEPLAYER

The WAVEPLAYER class is a refined version of the WAVEPLAYER class described in my article "Compressing Waveform Audio Files" (Dr. Dobb's Sourcebook of Multimedia Programming, Winter 1994). This module is not specific to speech synthesis; it can be used in any application that needs to output digitized sounds. You feed this class a string of samples, one at a time. The WAVEPLAYER object handles the buffering. The code calling WAVEPLAYER needn't bother with any part of the Windows waveform API calls.

The Blocking Hook

Every Windows application that does time-consuming calls should have a mechanism to process system messages. Otherwise, all other system activity could come to a halt while the operation goes on. Since this can be annoying, you'll want a Cancel button to halt it. However, Cancel buttons won't work unless the app is dispatching messages. In WAVEPLAY.CPP, I've created a "blocking hook" mechanism that has two functions: It processes messages to allow other apps to run, and it detects a user cancel. When the user selects a Cancel button, it returns True to the caller; otherwise, it returns False.

While waiting for the system to accept more output, the WAVEPLAYER library from time to time calls a blocking-hook routine. The pointer to the function is in sfnBlocking, so the user can change it if necessary. If the user leaves this pointer alone, there is a default blocking hook, sDfltBlocking(), which will work in most cases. The blocking hook is called through sDoBlocking(). This routine also sets a flag, sfBlocking, to indicate when the blocking function is in progress. Since the blocking hook dispatches control messages, it could also dispatch a message that leads to the app using the WAVEPLAYER. But if the WAVEPLAYER is already in use, we have a reentrancy problem. The main program can consult this flag to avoid reentrancy.

Conclusion

You might consider adding longer samples. In my experience, a longer sample usually yields better quality and a less robotic-sounding voice. I've kept my samples short as an economy measure. You may also want to provide the ability to add pronunciation rules on the fly. A program would look in any .INI file for rules to add and insert them into the rules list at the proper points. Either way, this library should serve as a good starting point for an affordable, all-software, speech-synthesis solution. All that is needed is a user interface.

Table 1: Phonemes and their codes. (a) Tonal (usually long); (b) tonal (usually short); (c) atonal; (d) percussive.

(a) 00 Ah                  (c) 00 S
    01 Aw                      01 CH
    02 Ee                      02 SH
    03 Oo                      03 F
                               04 H
(b) 04 A (cat)                 05 TH
    05 Eh (short E)
    06 Ih (it)             (d) 00 T
    07 Uh (short U)            01 K
    08 Ue (foot)               02 B
    09 R                       03 P
    10 Z                       04 D
    11 Jh (g in fudge)         05 G
    12 Zh (pleasure)
    13 M
    14 N
    15 Ng
    16 V
    17:Dh (this)
    18:L
    19:Oe (long O)

Listing One


//***************************** SPEECLIB.H **********************************
// Header file for SPEECH library. Copyright (c) 1994 by Neil G. Rowland, Jr.

#ifndef __SPEECLIB_H
#define __SPEECLIB_H
extern "C"
    {
    #include <windows.h>
    #include <mmsystem.h>
    }
extern int WavelibErrno;                // last error code in WAVELIB
void WavelibErrorBox();                 // report last error to user.
typedef struct
    { // Play waveform output
    static HWAVEOUT    shwaveout;
    static HANDLE      shHdr;
    static LPWAVEHDR   slpHdr;
    static HANDLE      shBuf;           // current waveform buffer.
    // blocking function...
    static BOOL     (*sfnBlocking)();       // return TRUE to cancel output.
    static BOOL     sfBlocking;             // TRUE when in blocking function.
    static BOOL     sfUserCancelled;        // set when we see IDCANCEL.
    static BOOL     sDfltBlocking();        // default blocking function;
    // accum buffer for feeding in a sample at a time...
    static HANDLE   shBufS;
    static LPSTR    slpBufS;
    static unsigned scountS;
    static BOOL     sOpen(PCMWAVEFORMAT* _pFmt);
    static void     sClose();
    static void     sPlaySample(WORD _sample);
    static void     sFlush();
  protected:
    static void     sDoBlocking();
    static inline void iCloseoutBuffer();
    static inline void iCloseoutSampleBuffer();
    static inline void iPlay(MMIOINFO* _pInfo);
    static inline void iPlay(HANDLE _hbuf, LPSTR _lpBuf, int _len);
    }
WAVEPLAYER;
//+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#define MAXTONAL 19
#define MAXATONAL 5
#define MAXPERC 5
class SPEECH_SOUNDER  {
  public:
    SPEECH_SOUNDER();  ~SPEECH_SOUNDER();
    static WAVEPLAYER sOut;
    static unsigned svoicespeed;    // in quanta (1/11025sec) per tick.
    static void sFlush();
    static void sFlushMaybe();
    static void sSilence(BYTE _ticks);
    static void sPercussive(BYTE _code);
    static void sAtonal(BYTE _code, BYTE _ticks);
    static void sTonal(BYTE _code, BYTE _ticks, BYTE _pitch);
    };
class SPEECH_PHRASER : public SPEECH_SOUNDER  {
  public:
    static BYTE svoicepitch;        // overall pitch of voice
    static void sPhrase(const char* _pszPhon, char _inflection);
    static void sSilence(BYTE _ticks);      // overload
    static void sTonal(BYTE _code, BYTE _ticks, BYTE _pitch);   // overload
    static void sPercussive(BYTE _code);    // overload
    static void sAtonal(BYTE _code, BYTE _ticks);   // overload
  protected:
    static char sinflection;
    static int scount;      // of chars to go in current phrase
    static int sinf;        // inflection differential
    static BYTE sprevtonal; // used in co-articulation
    static BYTE sprevpitch; // used in co-articulation
    static BOOL fPrevwasperc; // for dtecting consecutive percussives.
    };
// This is the phrase that read English text...
class SPEECH_READER : public SPEECH_PHRASER  {
 public:
  static void sText2Phonemes(char* _pszPhon,int _lenPhon,const char* _pszText);
  static void sSayText(const char* _pszText);
  static void sSayCliptext();
   };
#pragma hdrstop
#endif

Listing Two


//*************************** SPEECH.CPP **********************************
// Main app for SPEECH speech synthesizer. Copyright (c) 1994 Neil Rowland, Jr.
extern "C"  {
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>
    }
#include "speeclib.h"
#define IDM_ABOUT       11          // menu items

//***************************************************************************
int PASCAL WinMain(HANDLE hInst,HANDLE hPrev, LPSTR lpszCmdLine, int iCmdShow);
BOOL FAR PASCAL AboutDlgProc(HWND hwnd,unsigned wMsg,WORD wParam, LONG lParam);
BOOL FAR PASCAL MainDlgProc(HWND hwnd,unsigned wMsg,WORD wParam, LONG lParam);

//****************************************************************************
char        gszAppName[] = "Speech demo";   // for title bar, etc.
HANDLE      ghInst;                     // app's instance handle
void PhonemeDoIt()
    {
    if (SPEECH_PHRASER::sOut.sfBlocking)  return;   // re-entrancy.
    SPEECH_PHRASER T;
    WORD code, pitch;
    // Test out the SPEECH_PHRASER stage...
    T.svoicespeed = 350;
    T.svoicepitch = 70;
    for (int c=0; c<2; c++)  {
        T.sSilence(4);
        T.sPhrase("W^AhW", '!');
        T.sPhrase("Uh T^AwLKEeNg KUhMPY^OoTEr", '?');
        T.sPhrase("IZ DhIS IMPR^ESIV", ',');
        T.sPhrase("OeR W^AhT", '.');
        T.svoicespeed += 40;
        T.svoicepitch -= 5;
        }
    T.sPhrase("EeN^UhF AwLR^EhDEe", '.');
    }
void TextDoIt()
    {
    if (SPEECH_PHRASER::sOut.sfBlocking)  return;   // re-entrancy.
    SPEECH_READER T;
    // Now take SPEECH_READER out for a spin...
    T.svoicespeed = 350;
    T.svoicepitch = 70;
    T.sSayText("This is a test of the Emergency Broadcasting System. "
               "This is only a test, so calm down already.");
    }
void ClipDoIt()
    {
    if (SPEECH_PHRASER::sOut.sfBlocking)  return;   // re-entrancy.
    SPEECH_READER T;
    // Now take SPEECH_READER out for a spin...
    T.svoicespeed = 350;
    T.svoicepitch = 70;
    T.sSayCliptext();
    }
//****************************************************************************
int PASCAL WinMain(HANDLE hInst, HANDLE hPrev, LPSTR lpszCmdLine, int iCmdShow)
    {
    FARPROC fpfn;
    HWND    hwd;
    MSG     msg;
    // Save instance handle for dialog boxes.
    ghInst = hInst;
    // Display our dialog box.
    fpfn = MakeProcInstance((FARPROC) MainDlgProc, ghInst);
    if (!fpfn)  goto erret;
    hwd = CreateDialog(ghInst, "LOWPASSBOX", NULL, fpfn);
    if (!hwd)  goto erret;
    ShowWindow(hwd, TRUE);  UpdateWindow(hwd);
    while (GetMessage(&msg, NULL, 0, 0))
        if (!IsDialogMessage(hwd, &msg))
            DispatchMessage(&msg);
    DestroyWindow(hwd);
    FreeProcInstance(fpfn);
    return TRUE;
  erret:
    MessageBeep(0);
    return FALSE;
    }
// MainDlgProc - Dialog procedure function.
BOOL FAR PASCAL MainDlgProc(HWND hWnd, unsigned wMsg, WORD wParam, LONG lParam)
    {
    FARPROC     fpfn;
    HMENU       hmenuSystem;    // system menu
    HCURSOR     ghcurSave;      // previous cursor
    switch (wMsg)  {
    case WM_INITDIALOG:
        // Append "About" menu item to system menu.
        hmenuSystem = GetSystemMenu(hWnd, FALSE);
        AppendMenu(hmenuSystem, MF_SEPARATOR, 0, NULL);
        AppendMenu(hmenuSystem, MF_STRING, IDM_ABOUT,
            "&About LowPass...");
        return TRUE;
    case WM_COMMAND:
        switch (wParam)  {
        case IDOK:          // Phomene demo
               // Set "busy" cursor, filter input file, restore cursor.
            ghcurSave = SetCursor(LoadCursor(NULL, IDC_WAIT));
            PhonemeDoIt();
            SetCursor(ghcurSave);
            break;
        case 3:          // Text demo
               // Set "busy" cursor, filter input file, restore cursor.
            ghcurSave = SetCursor(LoadCursor(NULL, IDC_WAIT));
            TextDoIt();
            SetCursor(ghcurSave);
            break;
        case 4:          // Clipboard demo
               // Set "busy" cursor, filter input file, restore cursor.
            ghcurSave = SetCursor(LoadCursor(NULL, IDC_WAIT));
            ClipDoIt();
            SetCursor(ghcurSave);
            break;
        case IDCANCEL:      // "Shut Up"
            SPEECH_SOUNDER::sOut.sfUserCancelled = TRUE;
            break;
        case WM_USER+IDCANCEL:      // "Done"
            SPEECH_SOUNDER::sOut.sfUserCancelled = TRUE;
            PostQuitMessage(0);
            break;
        }
        break;
    }
    return FALSE;
}

Listing Three


//******************************** SPEECLIB.CPP *******************************
// SPEECH speech synthesizer library. Copyright (c) 1994 by Neil G. Rowland, Jr
extern "C"  {
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>
    }
#include "speeclib.h"

//****************************************************************************
BOOL fInited  = FALSE;
//*************************** SPEECH_SOUNDER ***********************
WAVEPLAYER SPEECH_SOUNDER::sOut;
unsigned SPEECH_SOUNDER::svoicespeed = 500;
//------------------------------------------------------------------------------
SPEECH_SOUNDER::SPEECH_SOUNDER()
    {
    static PCMWAVEFORMAT Fmt = {{WAVE_FORMAT_PCM, 1, 11025L, 11025L, 1},8};
    if (fInited)  return;
    if (sOut.sfBlocking)  return;
    if (!sOut.sOpen(&Fmt))  { WavelibErrorBox();  return; }
    fInited = TRUE;
    }
SPEECH_SOUNDER::~SPEECH_SOUNDER()
    {
    if (!fInited)  return;
    sOut.sClose();
    fInited = FALSE;
    }
//****************************************************************************
void SPEECH_SOUNDER::sFlush()
    { sOut.sFlush(); }
void SPEECH_SOUNDER::sFlushMaybe()
    { // Here for between words...
      // Sometimes a flush leads to an audible pause. Put between words.
    if (sOut.scountS > 15000)
        sOut.sFlush();
    }
void SPEECH_SOUNDER::sSilence(BYTE _ticks)
    {
    unsigned t; 
    unsigned len = svoicespeed * _ticks;
    if (_ticks > 1)  sOut.sFlush();
    for (t=0; t < len; t++)
        sOut.sPlaySample(0x8000);
    if (_ticks > 1)  sOut.sFlush();
    }
//****************************** SPEECH_PHRASER ************************************************
BYTE SPEECH_PHRASER::svoicepitch = 30;
char SPEECH_PHRASER::sinflection = '.';
int SPEECH_PHRASER::scount = 0;
int SPEECH_PHRASER::sinf = 0;
BYTE SPEECH_PHRASER::sprevtonal = 255;  // 255 means no immed. prev tonal
BYTE SPEECH_PHRASER::sprevpitch = 30;
BOOL SPEECH_PHRASER::fPrevwasperc = FALSE; 
                                     // for detecting consecutive percussives.
//------------------------------------------------------------------------------
void SPEECH_PHRASER::sPhrase(const char* _szPhon, char _inflection)
    {
    if (!fInited)  return;
    if (!_szPhon)  return;
    if (sOut.sfBlocking)  return;   // re-entrancy.
  #define TSHORT 3      // ticks for a short sound
  #define TLONG 5       // ticks for a long (vowel) sound
    int     accent = svoicepitch/17;    // how much pitch rises on an accent;
    char    cPhon[2];
    const char* szPhon = _szPhon;
    char    cnext;
    int     pitchbase = svoicepitch;
    BOOL    fLongTonal = FALSE;
    BOOL    fAccent = FALSE;        // whether current vowel is being stressed
    int     longtonal;
    // Init variables for benefit of sTonal...
    sinflection = _inflection;
    sprevtonal = 255;
    sprevpitch = svoicepitch;
    fPrevwasperc = FALSE;
    scount = strlen(_szPhon);
    if (accent < 1)  accent = 1;
    while (*szPhon && !sOut.sfUserCancelled)  { // Parse the phoneme string...
        cPhon[1] = '\0';
        cPhon[0] = *szPhon;
        szPhon++;
        cnext = *szPhon;
        if (cnext>='a' && cnext<='z')  { // part 2 of a 2-char phoneme
            cPhon[1] = cnext;
            szPhon++;  scount--;
            }
        // Now look it up...
        switch (cPhon[0])  {
            case ' ':  sSilence(1);  sFlushMaybe();  break; 
                                                       // space between words
            case '^':  fAccent = TRUE;  break;
            case 'A':
                switch (cPhon[1])  {
                    case 'h':  longtonal=0;  fLongTonal=TRUE;  break;
                    case 'w':  longtonal=1;  fLongTonal=TRUE;  break;
                    default:  sTonal(4,TSHORT,pitchbase);  break;
                    }
                break;
            case 'B':  sPercussive(2);  break;
            case 'C':  sAtonal(1, 1);  break;
              if (cPhon[1]=='h')  sAtonal(1, 1);  else  sPercussive(1);  break;
            case 'D':
              if (cPhon[1]=='h')  sTonal(17,TSHORT,pitchbase);  
                                                 else  sPercussive(4);  break;
            case 'E':
                switch (cPhon[1])  {
                    case 'e':  longtonal=2;  fLongTonal=TRUE;  break;
                    case 'r':  longtonal=9;  fLongTonal=TRUE;  break;
                    default: sTonal(5,TSHORT,pitchbase);  break;
                    }
                break;
            case 'F':  sAtonal(3, 1);  break;
            case 'G':  sPercussive(5);  break;
            case 'H':  sAtonal(4, 1);  break;
            case 'I':  longtonal=6;  fLongTonal=TRUE;  break;
            case 'J':  sTonal(11,TSHORT,pitchbase);  break;
            case 'K':  sPercussive(1);  break;
            case 'L':  sTonal(13,TSHORT,pitchbase);  break;
            case 'M':  sTonal(13,TSHORT,pitchbase);  break;
            case 'N':  sTonal((cPhon[1]=='g')?15:14, TSHORT, pitchbase); break;
            case 'O':  
                if (cPhon[1]=='o')   { longtonal=3;  fLongTonal=TRUE; }
                else  if (cPhon[1]=='e')   { longtonal=19;  fLongTonal=TRUE; }
                else  sTonal(1, TSHORT,pitchbase);  break;
            case 'P':  sPercussive(3);  break;
            case 'R':  sTonal(9,TSHORT,pitchbase);  break;
            case 'S':  sAtonal((cPhon[1]=='h')? 2:0, 1);  break;
            case 'T':
                if (cPhon[1]=='h') sAtonal(5, 1); else sPercussive(0); break;
            case 'U':
                if (cPhon[1]=='e') sTonal(8,TSHORT,pitchbase); 
                                       else sTonal(7,TSHORT,pitchbase); break;
            case 'V':  sTonal(16,TSHORT,pitchbase);  break;
            case 'W':  sTonal(3,TSHORT,pitchbase);  break;
            case 'X':  break;
            case 'Y':  sTonal(2,TSHORT,pitchbase);  break;
            case 'Z':  sTonal((cPhon[1]=='h')?12:10, TSHORT, pitchbase); break;
            }
        if (fLongTonal)  { // play long tonal, applying any (syllable) accent.
            sTonal(longtonal, 1, pitchbase + (fAccent?accent/2:0));
            sTonal(longtonal, TLONG-2, pitchbase + (fAccent?accent:0));
            sTonal(longtonal, 1, pitchbase + (fAccent?accent/2:0));
            fLongTonal = FALSE;  fAccent = FALSE;
            }
        scount--;
        }
    sSilence(TLONG);
    }
static BYTE coart[MAXTONAL+1][MAXTONAL+1] =
    { // coarticulation table...  code //
// Ah Aw Ee Oo A Eh Ih Uh Ue  R   Z  Jh  Zh  M  N  Ng V  Dh L  Oe  // prevtonal
   0, 5, 5, 8, 4, 5, 5, 1, 8, 8, 10, 11, 12, 8, 8, 5, 5, 5, 3,  7, // 0 = Ah
   5, 1, 5, 3, 4, 5, 5, 8, 8, 8, 10, 11, 12, 8,  8, 14, 16, 17, 3,  7, // 1=Aw
   5, 5, 2, 7, 4, 5, 5, 5, 8, 8, 10, 11, 12, 8,  8, 14, 16, 17, 6,  8, // 2=Ee
   1, 1, 5, 3, 4, 5, 8, 8, 8, 9, 10, 11, 12,13, 14, 14,  5, 17, 7,  7, // 3=Oo
   4, 1, 2, 3, 4, 5, 6, 8, 8, 8, 10, 11, 12, 8,  8, 14, 16, 17, 3,  5, // 4=A
   5, 1, 2, 3, 4, 5, 6, 8, 8, 8, 10, 11, 12, 8,  8, 14, 16, 17,18,  6, // 5=Eh
   5, 5, 2, 3, 5, 5, 6, 5, 8, 8, 10, 11, 12, 5,  5,  2,  5,  5, 5,  7, // 6=Ih
   1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,13, 14, 14, 16, 17, 8, 19, // 7=Uh
   8, 1, 2, 3, 4, 5, 6, 7, 8, 3, 10, 11, 12, 8,  8, 14, 16, 17,18, 19, // 8=Ue
   5, 7, 5, 3, 6, 5, 2, 8, 3, 9, 10, 11, 12, 8,  8, 14, 16, 17,18,  8, // 9=R
   0, 1, 2, 3, 4, 5, 6, 5, 8, 9, 10, 11, 12, 8,  8, 14, 16, 17,18, 19, // 10=Z
   0, 1, 2, 3, 4, 5, 6, 5, 8, 9, 10, 11, 12, 8,  8, 14, 16, 17,18, 19, // 11=Jh
   0, 1, 2, 3, 4, 5, 6, 5, 8, 9, 10, 11, 12, 8,  8, 14, 16, 17,18, 19, // 12=Zh
   7, 0, 7, 7, 5, 5, 6, 8, 8, 9, 10, 11, 12,13, 14, 14, 16, 17,18,  8, // 13=M
   7, 0, 7, 7, 5, 5, 6, 8, 8, 9, 10, 11, 12,13, 14, 14, 16, 17,18,  8, // 14=N
  14, 1, 5, 3, 4, 5, 5, 8, 8, 9, 10, 11, 12,13, 14, 15, 16, 17,18, 19, // 15=Ng
   7, 1, 2, 3, 4, 5, 6, 5, 8, 9, 10, 11, 12, 8, 14, 14, 16, 17,18, 19, // 16=V
   5, 1, 2, 3, 4, 5, 6, 5, 8, 9, 10, 11, 12, 8, 14, 14, 16, 17,18,  7, // 17=Dh

   0, 1, 5, 7, 4, 5, 5, 7, 7, 8, 10, 11, 12,13, 14, 14, 16, 17,18, 19, // 18=L
   8, 8, 8, 7, 5, 5, 5, 8, 8, 8, 10, 11, 12, 3,  3,  7,  8,  8,18, 19  // 19=Oe
   };
void SPEECH_PHRASER::sTonal(BYTE _code, BYTE _ticks, BYTE _pitch)
    { // sTonal, applying end-of-sentence inflection and co-articulation... 
    BYTE pitch;
    BOOL fFirstTick = TRUE;
    if (sinflection == '!')  _pitch += svoicepitch/20;      // talk excited.
    if (scount > 4)  sinf = 0;
    else  if (sinf == 0)  sinf = 1;         // start phrase inflecting.
    while (_ticks--)  {
        pitch = _pitch;
        if (sinf)  {
            switch (sinflection)  {
                case '?':  pitch += sinf;  break;       // rise on question.
                case '.':  case '!':  pitch -= sinf;  break;
                default:  break;
                }
            sinf++;
            }
        if (fFirstTick && sprevtonal!=255)  { // co-articulate
            SPEECH_SOUNDER::sTonal(coart[sprevtonal][_code], 1, 
                                                         (pitch+sprevpitch)/2);
            fFirstTick = FALSE;
            }
        else  SPEECH_SOUNDER::sTonal(_code, 1, pitch);
        }
    sprevtonal = _code;     // for next time.
    sprevpitch = pitch;     // "
    fPrevwasperc = FALSE;
    }
void SPEECH_PHRASER::sAtonal(BYTE _code, BYTE _ticks)
    { // overload, applying coarticulation...
    SPEECH_SOUNDER::sAtonal(_code, _ticks);
    sprevtonal = 255;
    fPrevwasperc = FALSE;
    }
void SPEECH_PHRASER::sPercussive(BYTE _code)
    { // overload, applying coarticulation...
    if (fPrevwasperc)  sSilence(1);    // keep consecutive percs distinct
    SPEECH_SOUNDER::sPercussive(_code);
    sprevtonal = 255;
    fPrevwasperc = TRUE;
    }
void SPEECH_PHRASER::sSilence(BYTE _ticks)
    { // overload, applying coarticulation...
    SPEECH_SOUNDER::sSilence(_ticks);
    sprevtonal = 255;
    }

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

Speech Synthesis in C++

A library for generating speech under Windows 3.1

Coarticulation on the Cheap

Synthesizing Speech

SPEECH_SOUNDER

SPEECH_PHRASER

SPEECH_READER

WAVEPLAYER

The Blocking Hook

Conclusion

Table 1: Phonemes and their codes. (a) Tonal (usually long); (b) tonal (usually short); (c) atonal; (d) percussive.

Listing One

Listing Two

Listing Three

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

Speech Synthesis in C++

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content