Embedded Systems

Real-Time Music Synthesis & Embedded Applications

, January 01, 2002

Building an interactive musical toy that uses real-time music synthesis to generate on-the-fly music is as much a software as a hardware challenge. From the scoregenerator to the microcontroller, Max and Joe describe the system they built.

Jan02: Real-Time Music Synthesis & Embedded Applications

Real-Time Music Synthesis & Embedded Applications

Dr. Dobb's Journal January 2002

Optimization is the key

By Max I. Fomitchev and Joe Hershberger

Max is a developer for the Ultrason Research Group and Joe an electrical engineering student at Oklahoma State University. Max can be contacted at [email protected].

When we set out to develop an interactive musical toy that used real-time music synthesis to generate music on the fly, we only had a vague idea of what it would actually be. However, we did know that it had to:

Generate score, which then must be synthesized into music for playback.
Provide controls that somehow affect score generation, while lights indicated the notes played.
Have instruments that combine the music encoded digitally and stored in the instrument bank — wavetable memory, in other words.

After laying out the design concepts (Figure 1) that reflected these requirements, we started writing simulation software to implement the details, followed by the synthesis and score-generation code. We then transformed these ideas into hardware and created an embedded system for real-time music synthesis.

Sound Synthesizer

Wavetable-based musical synthesis is straightforward to implement — you pick an appropriate sample from the wavetable memory, resample it according to the currently played note, and add the result to the mixing buffer (Figure 2). Before getting to this point, however, you have to make some important decisions up front.

Perhaps the most important decision affecting the synthesis is the number of channels (or polyphony). The number of channels determines the maximum number of concurrently playing instruments or notes. Large polyphony creates better music, yet requires more computational resources proportionally to the number of channels. Most soundcards allow for 64-note polyphony and employ note stealing to remove the oldest notes in the polyphonic stack when a new note must be played and max polyphony is already reached. The note-stealing technique was too intricate for something as simple as a musical toy and we quickly decided to have four monophonic channels (no note stealing), allocating a channel for drums, bass, lead, and second voice. Thus, in the worst-case scenario, we would have to add only four different samples to the mixing buffer. Furthermore, if the samples are normalized and the sample bit-width is the same as the mixing buffer bit-width, the results must be scaled down (divided by 4) to avoid numerical overflow. Example 1 implements the resulting music-synthesis algorithm.

The code assumes that each sample is recorded at the same base frequency (BASE_FREQ), where second octave C (C2) equals 65 Hz. For correct operation, the samples and mixing buffer must be declared as arrays of signed integers. We used the simplest and crudest method of resampling — the nearest neighbor algorithm — and traditional double-buffering for sound synthesis and playback. When the first buffer finished playing, the synthesis and playback buffers were swapped.

Score Generator

Because the potential success of the toy seemed to be in the quality of the generative music, we next focused on the score-generation algorithm. We wanted the music to be pleasant, not random, so we decided to use cadential chord progressions and generate notes that were consonant to the current playing chord. The concept may look similar to Microsoft's DirectMusic Producer, except that you don't have to specify allowable chord progressions and musical patterns as required by Producer. In this case, there was no prior information. Both chords and musical patterns were generated on-the-fly. After much experimentation, we devised a set of parameters that influenced music generation:

Tempo, cadence, or rhythm.
Time signature (3/4, 4/4, 6/8, and so on) also defines measure note count (3, 4, 6, or 8 notes in the measure).
Mutation probability, which determines how likely the score is to change from measure to measure. Typically, there is some repetition found in any music, while the change corresponds to the development of a theme.
Note density, which specifies probability of filling in the entire measure with notes.
Note spread, which specifies the maximum interval between subsequent notes in the measure.
Transposition key, the key of the currently playing chord.

We assigned slider controls for each parameter on the program's user interface and added provisions to load samples (wave files). At this point, the simulation software was written in C++, ran under Windows, and relied on DirectSound for playback. We were using 16-bit 44.1-KHz samples and did not care how much computational power was required for the synthesis. A 600-MHz Athlon CPU was fast enough for the job.

Hardware

Our hardware requirements for the device were logical. The device had to be:

Inexpensive — less than $20.00.
Easy to develop.
Operate on a 9V battery for several hours.
Play music reasonably loud (0.5 Wt).

Figure 3 is the hardware block diagram we developed. The CPU/ microcontroller selection was critical and dependent upon several factors:

Software performance requirements. Given samples that are S bytes long, re-sampling and mixing of N channels would require N X S X (1 addition + 1 multiplication + 2 division operations) or 4 X N X S arithmetic operations. This number did not include indexing and memory/register movement overhead. Scaling down or division by 4 could be replaced with a shift operation (>>2 in C). Alternately, the samples could be stored in scaled-down format. In this case, the scale-down operation becomes unnecessary. Still, the low limit on the number of operations for four-channel music synthesis is 16 X S. Sample size, S, depends on sample duration and sampling frequency. Given short one-second samples recorded at 44 KHz, the total number of operations becomes 16 X 44100=705.6K. To account for memory movement and index calculation overhead, another factor of 10 was added, arriving at 7 MIPS as a fair performance estimate. The estimate coincided nicely with the old DOS Scream Tracker requirements of a 12-MHz PC AT system for decent sound quality--assuming that the CPU would complete each arithmetic operation in one cycle.
External memory. We wanted to have external ROM and RAM because we envisioned music synthesis on a per-measure basis; for example, each measure consisting of 3-16 notes will be synthesized in a mixing buffer, then played. Given uniform note duration of 0.1 second and 16 notes on the high side, a buffer big enough for 1.6 seconds of audio was needed — at 8 KHz, ~16x2K of RAM (x2 comes from double buffering) was required, plus ~32K of ROM to store four one-second long samples.

We settled on the Microchip PIC 18-series microprocessor (http://www.microchip.com/14010/lit/pline/picmicro/families/18cxxx/index.htm), which supports hardware multiply and table-lookup instructions and runs at 40 MHz/10 MIPS. Unfortunately, it has no EEPROM. Because the Microchip MPLAB-C18 C compiler (http://www.microchip.com/14010/tools/picmicro/code/mplab18/index.htm) was also buggy, we decided to add an additional factor of 10 to the performance estimates because the hardware divide operation was not supported (and to account for the anticipated overhead from C compiler inefficiencies). We also had overlooked the fact that it would take far more than one cycle to fetch data from external ROM and SRAM, meaning that our additional x10 could still be an underestimation. Internally, the 18-series PIC could accommodate up to 32K ROM and only 1536 bytes of RAM.

To satisfy our external RAM and ROM requirements, we added another chip to the design — the Scenix SX microcontroller (http://www.ubicom.com/pdf_files/techdocs/pb18_20_28.pdf), which only complicated matters. Although Scenix could run at 50 or 75 MHz and was capable of 50 and 75 MIPS, respectively, it was cumbersome (it did not have enough I/O pins). This meant we would have to implement SRAM and ROM access serially. It would take approximately 100 instructions just to read a byte from a random location in the wavetable ROM or mixing buffer RAM. Addresses had to be output, then latched before data could be read. To complicate things further, we had to deal with the dual-controller design, which was more costly, more power consuming, and harder to develop (because we would need tools for both chips).

Clearly, this was not working out. Consequently, instead of complicating the design, we simplified it. We would use a single 18-series PIC microcontroller, get rid of external ROM and SRAM altogether, and use only internal PIC resources for all software needs. Performance problems would have to be overcome in software, which meant that we would have to optimize the music synthesis algorithm.

Optimization

Since internal PIC ROM was only 32 KB, all the samples had to be squeezed into 24 KB (we reserved 8 KB for program memory), we converted samples into 8-bit 11-KHz format and cut their length below one second. (Sonic Foundry Sound Forge did a great job converting our original 16-bit 44-KHz sound files into 8-bit format, minimizing the noise.) Shorter samples also meant faster synthesis. Because of the 8-bit format, samples were not stored prescaled because additional loss of resolution resulted in noticeable noise increase. The synthesis then had to be reworked to use no more than 256 bytes of RAM. Since PIC is an 8-bit microcontroller, it can address only 256 bytes of RAM. Bank switching was necessary to access more memory. The idea of synthesizing the entire 3-16 note measure at once was abandoned and it was decided to synthesize as much audio that could fit in 256 bytes (0.02 seconds at 11 KHz). A timer routine responsible for playback was set up. The routine was outputting values from the playback buffer to the ADC. As soon as the entire buffer was played, the routine would switch banks and call the Synthesize() function (available electronically; see "Resource Center," page 5) to fill the finished buffer. We were using two 256-byte mixing buffers located in separate memory banks — one for playback, another for synthesis. The synthesize routine kept track of time and was performing score generation as necessary. Most of the time the Synthesize() routine was calling the doSynth() function to complete the synthesis of the currently playing measure; see Example 2.

Gradually, we converted all of the synthesis code into pure C so we could compile it to the PIC18x C compiler to produce firmware. We then created a new version of the simulation software that modeled exactly how new hardware would work. DirectSound was still used for playback; however, this time we manipulated sound buffers directly — trusting DirectX to play back one buffer while synthesizing another. The new simulation software closely modeled the hardware and we used it throughout the final stages of development to verify music synthesis and score generation changes, as well as to find bugs in the modified code. If the PC simulation worked, so should the firmware version of the software — at least in principle.

Work still remained to optimize the doSynth() routine. We had to keep track of how far along a sample was synthesized and resampled in each channel. For this purpose we created the SNOTE structure in Example 3. Short long is a PIC18x C compiler syntax for 3-byte integers. When a new note was synthesized in the channel, pos was reset to zero. Once the note finished playing, pos equals the sample size.

Unfortunately, our synthesis code was still not fast enough and the device squealed instead of playing music. The problem was in resampling. The index calculation [k*NewFreq/BASE_FREQ] was too expensive. The calculations were streamlined by eliminating multiplication by storing (k*NewFreq) instead and incrementing this pointer by NewFreq with each iteration and by eliminating division by using 64 Hz as the base frequency instead of 65 Hz. This was sufficient to scale the pointer down by 6 bits and arrive at the correct sample index. Listing One is the final synthesis code. Some notes about the code:

The drum channel does not use resampling since drums play at the same pitch all the time.
A 16-bit accumulator S is used.
Four 8-bit signed integers are used to get signed 12-bit integers that are scaled down and converted into unsigned form before writing to the mixing buffer, pBuffer. The accumulator postscaling results in much less noise caused by the loss of resolution in comparison with prescaling. For this reason, we decided not to store samples in prescaled 6-bit form.
When no notes are playing the mixing buffer is filled with zeros (silence).
All the loops were unrolled for better performance. Indexing can be an expensive operation when using a simple microcontroller.
64-Hz base frequency meant slightly detuned notes. C2 on the device sounds a little bit flat. However, this does not make an audible difference. Although each note is a little bit flat, all the intervals in an octave are correct.

Conclusion

To meet the cost and ease of implementation requirements, we completely changed the original mixing algorithm and eliminated most arithmetic operations, simply by optimizing the C code and without resorting to assembler (somehow PIC assembler seemed too hard to understand and master in such a short time). The optimization of software made possible an economical single-chip hardware design that does not rely on external ROM and SRAM. The final version of software requires only 5 MIPS of PIC power. The remaining 5 MIPS can be used to control lights and process user input. A lower speed chip can be used to save power and increase battery life. To further reduce cost, all sliders on the user interface were replaced with buttons. The project is complete and one thing we learned is that optimization pays off.

DDJ

Listing One

void doSynth()
{
    WORD Index;
    short S;
    WORD SampSize[4];
    NOTE ActualNote;
    TNOTE* pTonalityNote;
    rom SAMPLE* pSampData[4];
    SAMPLE* pBuffer;

    // Select buffer
    if ( nBuffer )
        pBuffer = pMixBuffer2;
    else
        pBuffer = pMixBuffer1;
    // Ready to add next note the channel
    if ( !timeCounter )
    {
        // Add notes to drum channel 0
        ActualNote = pPattern[0].Notes[noteCounter];
        if ( ActualNote.key )
        {
            // Add note to channel
            pTonalityNote = &pTonality[ActualNote.key - 1];
            ActualNote.key = pTonalityNote->key;
            ActualNote.octave += pTonalityNote->octave;
            chan[0].newFreq = 0;
            chan[0].pos = 0;
            chan[0].patch = ActualNote.patch;
        }
        else
            lights[0] = 2;
        // Add notes to channel 1
        ...
        // Add notes to channel 2
        ...
        // Add notes to channel 3
        ...
    }
    k = 0;
    pSampData[0] = pRomSample[chan[0].patch].pData + chan[0].pos;
    pSampData[1] = pRomSample[chan[1].patch].pData;
    pSampData[2] = pRomSample[chan[2].patch].pData;
    pSampData[3] = pRomSample[chan[3].patch].pData;
    SampSize[0]  = pRomSample[chan[0].patch].size;
    SampSize[1]  = pRomSample[chan[1].patch].size;
    SampSize[2]  = pRomSample[chan[2].patch].size;
    SampSize[3]  = pRomSample[chan[3].patch].size;
    // Add channel samples to the buffer
    do {
        // Drum channel 0
        if ( chan[0].pos < SampSize[0] )
        {
            S = *pSampData[0];
            pSampData[0]++;
            chan[0].pos++;
        }
        else
            S = 0; // Silence
        // Synthesize channel 1
        Index = chan[1].pos>>6;
        if ( Index < SampSize[1] )
        {
            // Nearest neighbour resampling
            S += pSampData[1][Index];
            chan[1].pos += chan[1].newFreq;
        }
        // Synthesize channel 2
        Index = chan[2].pos>>6;
        if ( Index < SampSize[2] )
        {
            // Nearest neighbour resampling
            S += pSampData[2][Index];
            chan[2].pos += chan[2].newFreq;
        }
        // Synthesize channel 3
       Index = chan[3].pos>>6;
        if ( Index < SampSize[3] )
        {
            // Nearest neighbour resampling
            S += pSampData[3][Index];
            chan[3].pos += chan[3].newFreq;
        }
        // Downscale and make unsigned (higher quality mixing)
        S >>= 2;
        S += 128;
        *pBuffer = (BYTE)S;
        pBuffer++;
        k++;
    } while ( k != 0U );
}

Back to Article

1 2 3 4 5 6 7 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Real-Time Music Synthesis & Embedded Applications

Real-Time Music Synthesis & Embedded Applications