Surround Sound

By Don Morgan, July 01, 2005

Surround sound lets you experience sound coming from all directions.

Don is a senior engineer at Ultra Stereo Labs and the author of Numerical Methods for DSP Systems in C. He can be contacted at [email protected].

Multichannel audio has a long history that varies depending upon who is reciting it. Inventors and/or inventions come and go and are important or unimportant, depending upon the viewpoint, as is the product. A particular format may sound good to one and not so good to another.

One thing is clear: Monophonic sound came first, followed by stereo. Stereo was popular not because people had not imagined more channels, but because two channels were what they could reliably fit on a record. Quadraphonic sound arrived with a special encoding scheme that allowed it to coexist (without loss) with the two main channels, but this format did not receive much support and soon passed. In the cinema, however, another form of multichannel audio took hold: digital Surround Sound, in which you experience sound coming from all directions.

The real magic of digital audio is its noise immunity. The fact that it is digitized and processed on a computing system of any word length is actually a drawback. Audio is part of the continuous spectrum, and as such, has infinite resolution; digital is horribly finite and coarse and does not. But infinite resolution means that every nanovolt of an audio signal has meaning and is amplified, so you can hear it. As a result, even a small amount of noise on this signal can be heard, whereas with a digital data stream, you must actually force a 1 to 0 or vice versa to impress noise on it, or employ terribly bad arithmetic.

In the early years of cinema, there was really only about 20 dB of dynamic range for the audio above the noise floor created by the hiss of the recording techniques involved. It is easy to understand why these film sound tracks sound dull compared to today's sound tracks where 80 dB of dynamic range is common and 120 dB is the goal.

With the boom in home studios, home movie making, and garage bands producing high-quality DVDs for auditions and submissions, interest in good-quality, nonprofessional audio has grown, and these technologies have become important along with it.

In this article, I present a surround-sound technique that requires no license and produces excellent results. It is based on an early technique whereby two extra channels are impressed on the original left and right stereo channels without degrading the quality of those channels. Properly done, you can expect separation of 60 to 70 dB referenced to 0 dBFS. That means the involvement of any one channel with another can be as little as 1/10th of 1 percent relative to full scale (60 dB corresponds to a magnitude ratio of 1/1000). That is pretty good.

You also have some choices. You can choose to process the audio offline in a passive mode or "live" in real time. Perhaps you are capturing your audio/video live and want to encode it for transmission to another location where it is decoded live for an audience. If so, then you will probably opt for a DSP implementation. If you are recording a soundtrack, say of a music group, for later mixing and encoding, then you are free to choose to process the data on a PC of most any speed. And yes, if you encode the audio as described, it sounds excellent on your stereo-only equipment and decodes correctly on a Pro-Logic decoder, though you may want to implement your own decoder to improve the quality.

What Are Decibels, Anyway?

Decibels provide engineers with an almost unitless means of measuring signal level. Instead of needing to understand voltage, current, or even g-forces, we agree to understand a system based on the logarithm of a ratio. Yes, we all agreed, but with a minor difference in the formula: When the ratio is taken between powers, the decibel is defined as dB=10log₁₀(output/input). When the ratio is taken between signal magnitudes, such as voltages, audio engineers are likely to write it as dB=20log₁₀(output/input). The two forms agree because signal power is proportional to the magnitude squared of the signal, and squaring inside the logarithm doubles the multiplier. In this article, I use the audio engineering definition, applied to signal magnitudes.

Left, Right, Center, Subwoofer, and Surround

Figure 1 illustrates the recommended speaker placement for a surround-sound system. It is worth noting that theaters rarely achieve this placement, opting instead for speakers along the wall and behind the screen, making up for the difference with delays in the audio path.

You might suppose that Left and Right are recorded by putting a microphone stage right and stage left, respectively, Center is created somehow as a combination of the two, Subwoofer is a low frequency Center, and Surrounds are captured with microphones to the rear of the reference point. While there are special microphones designed for capturing surround audio, the truth is that on a set or recording environment, there are many microphones, and it is in the mixing that a particular track is placed in the Left or Right, Center, Sub, and Surrounds.

Both DTS and Dolby use proprietary schemes for encoding the many channels required for making movies. Dolby's technique is grosser, opting to throw away more audio information than DTS to attain higher compression levels that let it be written to the film, while DTS distributes its sound tracks on CD or DVD. Neither is lossless. Here, I focus on a virtually lossless approach in which the Surrounds and Center channels are encoded and added to Left and Right, while the Subwoofer is created in the decoder from the existing Left and Right.

How Do You Get Four Channels In the Space of Two?

In Figure 2, Left and Right pass through the encoder without treatment and represent themselves. As for the magic, the Center channel is divided by two and added to both Left and Right. Simple enough, right? It might surprise you to know that in some cinemas and many home systems, there is no Center speaker at all. Instead, engineers rely on something called "phantom center," which is the illusion of a Center channel produced because content that is identical in phase and magnitude on both Left and Right appears to come from a point halfway between the two.

You divide the Surround channel by two, then phase shift it to create two identical signals 180 degrees out of phase with each other, which means that they cancel and go unheard in the Left and Right. You do this hocus-pocus with the help of the Hilbert Transform, which, by itself, gives you a phase shift of -90 degrees over the input signal. This -90 degree signal is subtracted from the Left and added to the Right.

Figure 2 also presents a simplified—but accurate—diagram of how Center and Surround are treated and placed on the Left and Right channels. To be sure, this is not the only phase-oriented encoding scheme, but it was dominant from the 1980s on. If you are willing to do a patent search, you will find many that were designed to impress even more channels on Left and Right, but I know of none that worked quite as well as this one.
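The summing stage described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names EncodedPair and encode_sample are hypothetical, the Surround sample is assumed to have already been band limited, noise reduced, and phase shifted by -90 degrees, and the Left, Right, and Center samples are assumed to be already delay compensated.

```c
/* One encoded stereo sample pair (hypothetical type name). */
typedef struct {
    double lt;  /* encoded Left output  */
    double rt;  /* encoded Right output */
} EncodedPair;

/*
 * left, right, center: input samples, assumed already delay compensated.
 * surround_shifted:    Surround sample after the -90 degree phase shift
 *                      (assumed to be produced elsewhere).
 */
EncodedPair encode_sample(double left, double right,
                          double center, double surround_shifted)
{
    EncodedPair out;
    double half_center = 0.5 * center;
    double half_surround = 0.5 * surround_shifted;

    out.lt = left + half_center - half_surround;   /* subtracted from Left */
    out.rt = right + half_center + half_surround;  /* added to Right       */
    return out;
}
```

Summing the two outputs removes the Surround term entirely, while differencing them removes the Center term, which is what the decoder relies on later.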

Also in Figure 2, the Surrounds are band limited before encoding, which includes a 100-Hz high-pass filter. Band limiting the content aids in noise reduction.

Dolby B

Next in line is a type of noise reduction known as "Modified Dolby B." In the simplest encoders, the encoder boosts the high end of your signal (everything above 7 KHz) before it is written to the film; in a synergistic fashion, the decoder attenuates the high end, thereby returning the actual audio to its previous level and cutting any media noise by the same amount. For example, say you have a 16-KHz tone at -5 dB. If you record it without the noise reduction but with a contribution of media noise at -20 dB, you have a -5-dB tone with a -20-dB hiss in the background when you play it back. Recording the same audio with this noise-reduction technique, the 16-KHz tone is boosted to 0 dB. The media noise is still -20 dB, as before. But on playback, a decoder cuts the signals, so your tone goes back to being its original -5 dB but the noise is cut to -25 dB instead of being -20 dB. This makes the noise almost half of what it was before. This kind of encode may be satisfactory for your purposes and indeed is still used in some encoders.

A more advanced form of this noise reduction is also possible and exists in better equipment. An encoder using modified Dolby B employs a means of tracking where the audio power is in the signal and boosts the remainder of the band beyond that point by 5 dB; that way, it loses as little of the audio as possible while suppressing media noise. Incidentally, this is why it is called "modified Dolby B"—standard Dolby B encoding boosts the signal by 10 dB.

To do this, you use a combination of a shelving filter with a moveable corner frequency, thereby making it a sliding filter. A shelving filter is one that either produces gain or attenuation at a predetermined (by design) point; this specified point is sometimes called the "corner frequency." The shelving filter does not continue to attenuate or apply gain forever; instead, when it hits a predetermined level, it stops and "shelves"—it continues to maintain that amplitude throughout the rest of the spectrum.

A sliding filter has a moveable corner frequency. In a Dolby B encoder, you determine where the energy in the audio is, then move the corner frequency of the filter to enclose that point. The spectrum is then boosted beyond that point by 5 dB, or about 178 percent of the original signal magnitude.

The idea is to slide the filter out to enclose any actual audio energy in the spectrum and move it in when none exists. How do you know when there is energy out there? One way is to design a high-pass filter that has a pass band at the farthest extent of your sliding filter and a long transition band that comes close to intersecting the point the shelf begins when the sliding filter is all the way in, in this case 7 KHz. Square the output of your control filter and proportion it to drive the coefficient generator for your sliding filter, such that when the energy in the spectrum is in the pass band of the control filter, you are generating coefficients for the farthest point of your slider. When there is no energy beyond 7 KHz, your coefficient generator should produce values for 7 KHz.
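The control path can be sketched as two small helpers: one that squares and smooths the output of the control filter, and one that turns the smoothed energy into a corner frequency. Everything here is an assumption made for illustration, including the function names, the one-pole smoother, the linear mapping, and the full_energy tuning value; the text only says the squared output should be proportioned to drive the coefficient generator.

```c
/*
 * Square one control-filter output sample and fold it into a running
 * energy estimate with a one-pole smoother.
 * alpha is a smoothing factor between 0 and 1 (tuning value).
 */
double smooth_energy(double previous, double control_sample, double alpha)
{
    double squared = control_sample * control_sample;
    return previous + alpha * (squared - previous);
}

/*
 * Map the smoothed energy to a corner frequency.
 * full_energy: energy treated as "audio reaches the far end" (tuning value)
 * near_hz:     corner when the slider is all the way in (7 KHz in the text)
 * far_hz:      corner when the slider is all the way out
 */
double corner_from_energy(double energy, double full_energy,
                          double near_hz, double far_hz)
{
    double t = energy / full_energy;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    return near_hz + t * (far_hz - near_hz);
}
```

The returned frequency would then be handed to whatever routine generates the sliding filter's coefficients.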

The Hilbert Transform

The last step in producing the Surround channel is the Hilbert Transform. This transform produces a version of the signal that is exactly -90 degrees out of phase with the original without affecting the magnitude response. Technically, the Hilbert Transform is an ideal (and unrealizable) all-pass filter that passes the signal magnitude without change, but shifts the phase of the signal -90 degrees.

There are two popular methods engineers use to obtain the Hilbert Transform of an audio track—one method involves the Finite Impulse Response (FIR) filter and the other the Fast Fourier Transform (FFT). The FFT is a common replacement for the FIR when the lengths of the filters become overly long and time consuming.

The FIR filter is a special form of filter that does not employ feedback, and as a result, can produce a filter with linear phase. That is, the phase of the output is merely a delayed facsimile of the phase of the input, with a group delay equal to (N-1)/2 samples (where N equals the number of taps or coefficients). By delaying all other processing by the same amount, so that the other signals stay in phase with the filtered one, you can produce the appearance of zero phase.
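A compensating delay of (N-1)/2 samples can be sketched as a small ring buffer. The names DelayLine, delay_init, and delay_step, and the MAX_DELAY bound, are hypothetical and used only for this illustration.

```c
#include <string.h>

#define MAX_DELAY 1024  /* hypothetical upper bound on the delay length */

typedef struct {
    double buf[MAX_DELAY];
    int length;   /* delay in samples, 1..MAX_DELAY   */
    int pos;      /* next slot to read and overwrite  */
} DelayLine;

/* Group delay, in samples, of a linear-phase FIR filter with ntaps taps. */
int fir_group_delay(int ntaps)
{
    return (ntaps - 1) / 2;
}

/* Zero the buffer and set the delay length. */
void delay_init(DelayLine *d, int length)
{
    memset(d->buf, 0, sizeof d->buf);
    d->length = length;
    d->pos = 0;
}

/* Push one sample in and get back the sample from `length` steps ago. */
double delay_step(DelayLine *d, double x)
{
    double y = d->buf[d->pos];
    d->buf[d->pos] = x;
    d->pos = (d->pos + 1) % d->length;
    return y;
}
```

For a 1023-tap filter, fir_group_delay() gives 511 samples, which is the delay the untreated channels would need to stay aligned with the filtered one.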

In simple_fir.c (available electronically; see "Resource Center," page 3), you find the coefficients and a simple model for the FIR Hilbert Transform. The FIR filter is a convolution (or dot product) of an impulse response with an input data stream; this dot product is a series of multiplications between time-reversed coefficients and input samples, each multiply followed by a running accumulation. This is illustrated in simple_fir.c with the lines:

// Dot Product
accum = 0;
for (i = 0; i < ntaps; i++) {
    accum += h[i] * z[i];
}

accum stores the running sum, ntaps contains the number of coefficients in the filter, h[] are the coefficients, and z[] are the samples in the history buffer (this buffer will be the same size as the number of coefficients).

An impulse response is a complete description of a system obtained by a unit impulse that is both infinitely short and of infinite amplitude. The idea is to set the system ringing at its natural frequency without interfering with that natural frequency, then sample the ringing and use it in a dot product with an input stream to obtain the same results as would be obtained if the input stream was somehow processed by the system we are modeling.

The coefficients for this FIR-based Hilbert Transform may be found with Figure 3. I chose 20 Hz and 23.980 KHz as band edges for this filter and used a Kaiser window to attain the greatest suppression of side lobes I could. I also required <= 0.1 percent ripple.

The filter actually has 1023 taps, a daunting thought considering the amount of time available to process each sample at 48 KHz, which is approximately 21 microseconds per sample. At 96 KHz and 192 KHz, the requirements are proportionately higher.

To help overcome some of the difficulties this presents, I placed the lower band edge (20 Hz) and the upper band edge the same distance from the two ends of the spectrum; that is: F_L = 0.5F_S - F_H. F_L is the lower band edge, F_H is the upper band edge, and 0.5F_S represents the Nyquist frequency, which is half the sample rate (the sample rate is 48 KHz in this example, so the Nyquist frequency is 24 KHz). This made the frequency response symmetric, resulting in every other coefficient equaling zero. Thus, you really only need half as many multiply-accumulate operations.

Not only this, but the coefficients of a symmetrical FIR filter reflect identically (except for a sign change) about the center or the coefficient at (N+1)/2, so you really only need 1/2 the storage space for these coefficients, if you need it at all. The example in simple_fir.c does not possess these enhancements. You need to add them if you want to use them. (To find out more about how to calculate these coefficients or the Hilbert Transform, see Theory and Application of Digital Signal Processing, by Lawrence R. Rabiner and Bernard Gold, Prentice-Hall, 1975; ISBN 0139141014.)

DSPs have hardware that handles circular buffers. You will give it a mask representing the size of the buffer and create a pointer, and the DSP takes care of the wrap for you. But C does not have a scheme for this. As far as I am aware, there is no compiler written for a DSP that handles these hardware buffers. It is left to you to come up with what you need if you want to use them. As a result, the FIR filter in simple_fir.c was written for a processor that has no DSP-specific hardware, so that you can implement it quickly to see how it works. Therefore, you make up for the lack of a hardware circular buffer manually:

// shift delay line
for (i = ntaps - 2; i >= 0; i--) {
    z[i + 1] = z[i];
}

Before running this algorithm, initialize all z[] to zero. This is a straightforward transform to implement on a sample-by-sample basis, and it works well for offline as well as real-time applications.
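If the per-sample shift becomes too expensive, the wrap can be done in software with a modulo index instead. The following is a minimal sketch of that idea, not the code in simple_fir.c; FirState, fir_step, and the tiny NTAPS value are hypothetical and chosen to keep the example readable.

```c
#define NTAPS 4   /* tiny hypothetical filter length for illustration */

typedef struct {
    double z[NTAPS];  /* sample history, must start zeroed */
    int head;         /* index of the newest sample        */
} FirState;

/* Store one new sample and return the filter output for it. */
double fir_step(FirState *s, const double *h, double x)
{
    double accum = 0.0;
    int idx;

    /* Step the head backward with wrap instead of shifting the buffer. */
    s->head = (s->head + NTAPS - 1) % NTAPS;
    s->z[s->head] = x;

    /* Walk from newest to oldest sample, wrapping at the buffer end. */
    idx = s->head;
    for (int i = 0; i < NTAPS; i++) {
        accum += h[i] * s->z[idx];   /* h[0] meets the newest sample */
        idx = (idx + 1) % NTAPS;
    }
    return accum;
}
```

Feeding a single 1.0 followed by zeros returns the coefficients in order, which is a quick way to confirm the indexing.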

Of course, the impulse I spoke of earlier, one that is both infinitely narrow and of infinite amplitude, is not available in practical engineering. Consequently, we often use the Fourier Transform to generate this impulse response. The historical basis for this is extensive and threads through Newton, Weierstrass, and Fourier, to name a few.

You model in the frequency domain the features you choose for your impulse response. How do you know what these features are supposed to be? That depends upon the application. Say that you want a band-pass filter that passes 100 Hz to 7 KHz. Figure 4 is the spectrum for this filter. If you sample that spectrum and perform an inverse Fourier Transform on your samples, you have the impulse response for the FIR filter that does that job. That is the simple explanation and you probably guessed that there is more to it, but not much. (For more information, see Introduction to Signal Processing by Sophocles J. Orfanidis, Prentice-Hall, 1995; ISBN 0132091720.)

That said, it comes as no surprise that the Hilbert Transform is also possible directly with the use of the FFT. To implement an FFT Hilbert Transform, you only need to perform the FFT of the data stream, zero the negative frequency bins (those beyond Nyquist), and follow it with an inverse FFT. The signal is available in the imaginary part of the result. And for a high-quality Hilbert Transform, this is really the most computationally efficient way to do it, as you can perform an FFT with far fewer multiplies than a time-domain convolution. By the same token, you must decide whether it is worth the time to write this complex FFT and the accompanying block processing with overlap and add. I don't present code for a Fourier Transform because it is a common tool that is freely available on the Internet.

Next, you perform the transform on the band-passed and noise-reduced signal, then add the output of the transform to the Right and subtract it from the Left. This is important because when you decode the signal, you unwrap it that way, and if you get it backwards, the phase won't match the original.

Referring to Figure 2, you will notice delays in both Left and Right channels immediately preceding the final summing nodes. These delays are to compensate for the delays in the NR and Hilbert Transform. This is the encoder. The only caveats I present are that you must:

Halve the inputs to the Center and Surround channels so that the signals are correctly proportioned.
Maintain the order of processing in the Surround channel to get the highest quality audio out. If you reverse the order of the NR and low-pass filter, you may experience artifacts in the output that cannot be removed.
Pay attention to the phasing of the signals added to the Left and Right when you complete the processing on the Surround channel.

Let's Listen to It

You can listen to the output of this encoder on a stereo-only system without artifact; it also decodes on a standard Pro-Logic decoder. I present here a description of a decoder so that you can better understand what is happening, and perhaps even build one—it is possible to produce better results than the standard Pro-Logic decoder.

Figure 5 is the block diagram of the decode block, sometimes called the "Matrix." One of the first things you notice about this scheme is that there is a delay on every channel. This is due to the Subwoofer.

The Subwoofer is the sum of 1/2 Left and 1/2 Right low-passed at about 125 Hz. For a true cinema decoder, this should be programmable and allow up to 225 Hz. This is largely due to the character of the speakers used in the theater. Regardless of your choice of low-pass frequency, you need to know that every filter, whether it is an Infinite Impulse Response (IIR) filter or Finite Impulse Response (FIR) filter, introduces a delay. The time delays in Figure 5 compensate for this delay. If you are using FIR filters, the delay can compensate completely for delays introduced by the Subwoofer filter because FIR filters produce a linear phase output. If you are using IIR filters, and this is the most common approach, then the delays are set to maintain the phase until the point at which the Subwoofer filter starts to attenuate.

You also see the Dolby B decode block. This works the same way as the encode block, except that the signal is not boosted past the sliding filter corner frequency but attenuated 5 dB.

The Simple Matrix

As you can see, the Left and Right channels pass through the matrix relatively untouched, except for the delays. The Center channel is the addition of Left and Right. There will be some in-phase components related to Left and Right in this output but the Surround data will be completely cancelled because it is 180 degrees out of phase with itself on Left and Right.

Subtracting the Lt channel from the Rt channel develops the Surrounds. Again, there will be some components of Right and Left that are out of phase in this signal along with the dominant Surround encoding—and there will be noise due to the subtraction. This is one of the main reasons you band limited the Surrounds during encoding. Now you can low pass the differenced signal and recover it with a minimum of media noise and numerical noise.

The design equations for the simple matrix are (the "*" signifies convolution):

Left = Lt * Delay
Right = Rt * Delay
Center = (Rt+Lt)/2 * Delay
Surround = (Rt-Lt)/2 * Delay
Subwoofer = (Rt+Lt)/2 * Lp(125Hz)
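The sum and difference terms of these equations can be sketched per sample as follows. The delays and the 125-Hz low pass are convolutions over time, so this sketch leaves them to the caller; DecodedFrame and decode_sample are hypothetical names used only for illustration.

```c
/* One decoded output frame (hypothetical type name). */
typedef struct {
    double left;
    double right;
    double center;
    double surround;
    double sub_in;    /* Subwoofer feed before its low-pass filter */
} DecodedFrame;

/*
 * lt, rt: encoded samples. Delay compensation and the low pass on the
 * Subwoofer feed are assumed to be applied by the caller.
 */
DecodedFrame decode_sample(double lt, double rt)
{
    DecodedFrame out;

    out.left = lt;
    out.right = rt;
    out.center = 0.5 * (rt + lt);    /* out-of-phase Surround cancels  */
    out.surround = 0.5 * (rt - lt);  /* in-phase Center cancels        */
    out.sub_in = 0.5 * (rt + lt);    /* still needs low-pass filtering */
    return out;
}
```

A Center-only input of 1.0, encoded as Lt = Rt = 0.5, comes back as 0.5 on Center and nothing on Surround; a phase-shifted Surround-only input, encoded as Lt = -0.5 and Rt = 0.5, does the reverse.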

A Better Matrix

The problem with the simple matrix is that there is still content in the audio belonging to other channels. This limits the separation between channels, and although many people use this technique, it does not result in the "lifelike" quality many people would like.

The equations that Dolby uses to decode Pro-Logic Surround Sound are:

L' = Lt
C' = C + 0.7L + 0.7R
R' = Rt
S' = S + 0.7L - 0.7R

Not very exciting stuff. So how do you produce a better sense of separation?

An obvious answer is to subtract the content of the Left and Right channels from the Center, and subtract the Center content from Left and Right. You can also subtract any signals common to the Left and Right from the Surrounds. The only caveat is that we must maintain signal power of the output. Signal power can be calculated as a center of gravity value as the square root of the sum of the squares of Left and Right. If you then low pass the result, you have a useable control for maintaining that signal power.

To understand what this means, look again at Figure 1. In a true, well-balanced Surround Sound recreation, if a signal (say, a helicopter) is in the Left channel, you do not hear it in the Center, Surrounds, or Right. You may hear echoes or phase-delayed images of it in those channels if that is what was recorded and that was the way it was mixed. Still, the main audio is only in the Left. As the helicopter moves from Left toward the Right, you hear it moving between Left and Center without any drop in signal power. In other words, what is coming from the Left and Center sums to the full power of the audio you heard only in the Left earlier. This is what produces a real sense of image and position.

Conclusion

Now you can play with it. Make it better; produce your own encoding topology. There are really quite a few possibilities here, including separate Surround Left and Surround Right by using similar techniques to those I've described here.

DDJ
