Dr. Dobb's is part of the Informa Tech Division of Informa PLC

Designing a VoIP Media Phone Framework

Tommy Long and Andrew Duignan are network software engineers for Intel.

The IP phone market is growing rapidly, thanks in part to increased broadband availability and the cost/feature advantages of VoIP (Voice over Internet Protocol) over standard analog and proprietary digital phones. Moreover, growing bandwidth for homes and businesses has allowed an ever-increasing number of services to be delivered over IP -- streaming radio, video streaming, news feeds, and the like.

Given this increase in available services and Internet availability, opportunities have arisen for new converged products that provide a range of services and Internet accessibility in a single system. One of these products is the media phone. In this article, we present the design of a Linux-based VoIP media phone framework that has the ability to scale up to support media gateway and IP PBX applications. In the process, we address the use of a multi-threaded framework to take advantage of multicore IA systems.

Framework Design

Voice processing frameworks can range from IP phone endpoints at the low end, through gateways in the mid range, to IP PBXs at the high end, and each of these applications has a different set of requirements. Endpoint applications generally only need to support a small number of channels, run on systems with a single CPU, and have one type of PCM input. An IP PBX, on the other hand, is expected to support from 10 (SMB market) to 10,000 (enterprise market) channels and to run on multicore systems, with calls generally carried over Ethernet using SIP as the signaling protocol.

The key to supporting all of this in a single framework is scalability and flexibility. The framework needs to be able to scale up or down depending on the targeted application, and making it modular provides the flexibility needed to support the various configurations. Modularity can be considered at two separate levels: the system level and the framework level.

At a system level, the core framework should be made interface-agnostic. All interface-related code (hardware or network) should be kept outside of the core framework. This offers the advantage that the core framework does not need to be changed to support different interfaces. The voice processing framework (VPF) only works on one packet at a time. This means that any queuing of data should occur within the Interface code. The framework should support up- and down-sampling of data inputs as necessary. Figure 1 shows the component interaction and interfaces into the VPF. (A full system setup would include other blocks such as SIP, but we don't discuss them here.)

Figure 1: Voice Processing Framework Block Layout.

Again, all queuing of data occurs in the interface blocks. This means that there are only two options available when selecting the method of communication between the VPF and the interface blocks:

  • Framework Callback. The PCM/IP Interface registers a callback function with the VPF. The framework calls this callback when it requires input data or to present output data.
  • Framework API. The framework provides an API that the PCM/IP Interface calls prior to sending/receiving data from the framework.

Both methods can work equally well, but the Framework Callback method has a slight advantage: the VPF can be used to control the timing of the entire system, which removes the need for the framework to buffer multiple data packets. One strict condition the callback functions must meet is that they must not block; a blocking callback will stall the framework and cause deadline misses.
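
As a concrete sketch, the registration and invocation of such callbacks might look like the following C fragment. All names here (vpf_register_callbacks, vpf_process_one_frame, and so on) are illustrative assumptions, not part of any real API:

```c
#include <stddef.h>

/* Hypothetical callback signatures. The interface supplies input data on
 * request and accepts output data. Callbacks must not block: they return
 * immediately, even when no data is available, so the framework never
 * misses a processing deadline. */
typedef int (*vpf_get_input_cb)(void *ctx, short *pcm, size_t samples);
typedef int (*vpf_put_output_cb)(void *ctx, const short *pcm, size_t samples);

struct vpf_callbacks {
    vpf_get_input_cb  get_input;
    vpf_put_output_cb put_output;
    void             *ctx;            /* interface-private state */
};

static struct vpf_callbacks g_cbs;

/* The PCM/IP interface registers its callbacks with the framework once. */
void vpf_register_callbacks(const struct vpf_callbacks *cbs)
{
    g_cbs = *cbs;
}

/* Each processing tick, the framework pulls input and pushes output. */
int vpf_process_one_frame(short *work, size_t samples)
{
    if (g_cbs.get_input(g_cbs.ctx, work, samples) != 0)
        return -1;                    /* no data: skip the tick, never wait */
    /* ... voice processing on 'work' would happen here ... */
    return g_cbs.put_output(g_cbs.ctx, work, samples);
}

/* Trivial loopback interface used as an example: silence in, output dropped. */
static int silent_input(void *ctx, short *pcm, size_t n)
{ (void)ctx; for (size_t i = 0; i < n; i++) pcm[i] = 0; return 0; }
static int drop_output(void *ctx, const short *pcm, size_t n)
{ (void)ctx; (void)pcm; (void)n; return 0; }
```

Because the framework drives both calls from its own tick, the interface code stays free of timing logic.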

The PCM interface to the framework should be kept agnostic. This gives the advantage that any hardware interface can be plugged into the framework without having to modify the framework. On Linux there are numerous options available when interfacing hardware to the framework:

  • ALSA (Advanced Linux Sound Architecture). Many audio interfaces can be supported simply by changing the hardware ID and changing the sample rate.
  • Custom Interface. When interfacing cards not supported by ALSA (such as FXO/FXS), a custom interface block must be written.

Linux is not a true real-time operating system and there is a possibility that the framework will miss a processing deadline. This can lead to a buildup in latency on the call if the scenario is not handled properly. For example, ALSA contains a read and write software buffer, the size of which can be set on initialization of the ALSA channel. ALSA will fill the read buffer before it passes any samples to the client application. The simplest implementation of the framework inbound callback would just read from the ALSA buffer using the snd_pcm_readi() API every time the callback is called. With this solution, every time a processing deadline is missed by the framework, the number of buffers in the ALSA hardware queue would increase by one and the latency on the call would increase accordingly. One solution to this problem would be to use double buffering for the PCM interface.
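
The drain logic behind that idea can be sketched in C. Here the ALSA calls are modeled by stand-ins (read_frame in place of snd_pcm_readi(), frames_avail in place of the result of snd_pcm_avail_update()) so the latency-control policy is visible on its own:

```c
#include <stddef.h>

/* Stand-in for snd_pcm_readi(): reads one frame of 'samples' samples. */
typedef int (*read_frame_fn)(void *drv, short *frame, size_t samples);

/* If the framework missed a deadline, more than one frame is queued in the
 * driver. Drain everything but the most recent frame, so latency does not
 * accumulate across missed deadlines. Returns the number of stale frames
 * discarded, or -1 if nothing could be read. */
int read_latest_frame(void *drv, read_frame_fn read_frame,
                      size_t frames_avail, short *frame, size_t samples)
{
    size_t dropped = 0;
    if (frames_avail == 0)
        return -1;                    /* nothing queued: skip this tick */
    while (frames_avail-- > 0) {
        if (read_frame(drv, frame, samples) != 0)
            return -1;
        if (frames_avail > 0)
            dropped++;                /* stale frame was overwritten */
    }
    return (int)dropped;
}

/* Toy "driver" for illustration: frames are numbered; each read returns
 * the next one, mimicking a queue that built up behind a missed deadline. */
struct toy_drv { int next; };
static int toy_read(void *d, short *frame, size_t n)
{
    struct toy_drv *t = d;
    for (size_t i = 0; i < n; i++) frame[i] = (short)t->next;
    t->next++;
    return 0;
}
```

In a production build the loop body would call snd_pcm_readi() directly; the point is only that stale frames are consumed and discarded rather than left to inflate call latency.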

The IP interface consists of a Real-time Transport Protocol (RTP) stack. As the framework only handles one packet at a time, the jitter buffer must reside in the IP interface. The benefit of this design is that the jitter buffer implementation can be modified and changed without having to modify the VPF. Either a custom RTP stack or an open source stack such as Open RTP (ORTP), which provides a full RTP/SRTP stack and a jitter buffer implementation, may be used. If Secure Real-time Transport Protocol (SRTP) is needed, IA processors such as the Intel EP80579 Integrated Processor can provide SRTP hardware acceleration support.
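
To illustrate what the IP interface handles per packet, here is a minimal parser for the fixed 12-byte RTP header defined by RFC 3550. A real stack such as ORTP also deals with CSRC lists, header extensions, padding, and the jitter buffer itself:

```c
#include <stdint.h>
#include <stddef.h>

/* Fields of the fixed RTP header (RFC 3550). */
struct rtp_header {
    uint8_t  version;       /* should always be 2 */
    uint8_t  marker;
    uint8_t  payload_type;
    uint16_t seq;           /* used by the jitter buffer for reordering */
    uint32_t timestamp;     /* media clock, e.g. 8 kHz for G.711 */
    uint32_t ssrc;          /* identifies the sending source */
};

/* Parse the fixed header; returns 0 on success, -1 on a short or
 * non-RTPv2 packet. */
int rtp_parse(const uint8_t *buf, size_t len, struct rtp_header *h)
{
    if (len < 12)
        return -1;                                  /* truncated header */
    h->version      = buf[0] >> 6;
    h->marker       = (buf[1] >> 7) & 1;
    h->payload_type = buf[1] & 0x7f;
    h->seq          = (uint16_t)((buf[2] << 8) | buf[3]);
    h->timestamp    = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
                      ((uint32_t)buf[6] << 8)  |  (uint32_t)buf[7];
    h->ssrc         = ((uint32_t)buf[8] << 24) | ((uint32_t)buf[9] << 16) |
                      ((uint32_t)buf[10] << 8) |  (uint32_t)buf[11];
    return (h->version == 2) ? 0 : -1;
}
```

The sequence number and timestamp extracted here are exactly what the jitter buffer in the IP interface sorts and schedules on before handing single packets to the VPF.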

Memory allocation is an expensive operation in a real-time system. Allocating large chunks of memory can take a lot of time and a real time system can be intolerant of such large delays. Allocating memory on-the-fly also increases the length of time it takes to set up a call. The options available for memory allocation are:

  • Static Allocation. Using this method, all memory, buffer pools, call configuration storage, etc., needed for the system is allocated up-front at initialization time. With this system, no memory allocation takes place on the fly and therefore no major delays are introduced during call setup. The disadvantage of this system is that the maximum amount of memory needed by the system is always allocated no matter how busy the system is.
  • Dynamic Allocation. This option is the complete opposite of static allocation. Using this method, all the memory needed by a call is allocated on call setup. With this method, only the necessary amount of memory is allocated by the system at any one time. The disadvantage of this option is that call-setup time can be very long, which can lead to unacceptable delays in large PBX systems. High rates of call setup/teardown can also have an adverse effect on the overall system.
  • Static and Dynamic Allocation. This option combines both static and dynamic allocations. Common system memory is allocated up-front at initialization while call-specific memory is only allocated at call time. This limits the memory usage when the system is running with low call levels and also can help to reduce call setup times.
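
A minimal sketch of the static-allocation option: a fixed pool of per-call contexts is carved out once at initialization, and call setup simply claims a free slot, so no malloc() runs during call setup. MAX_CALLS and the contents of call_ctx are assumptions for illustration:

```c
#include <stddef.h>

#define MAX_CALLS 64              /* sized for the worst case at build time */

struct call_ctx {
    int   in_use;
    short pcm_buf[160];           /* one 20 ms G.711 frame at 8 kHz */
};

/* The entire pool is allocated up-front, at initialization. */
static struct call_ctx g_pool[MAX_CALLS];

/* Call setup: claim a free slot. No heap allocation, so no unbounded delay. */
struct call_ctx *call_alloc(void)
{
    for (int i = 0; i < MAX_CALLS; i++) {
        if (!g_pool[i].in_use) {
            g_pool[i].in_use = 1;
            return &g_pool[i];
        }
    }
    return NULL;                  /* pool exhausted: reject the call */
}

/* Call teardown: release the slot for reuse. */
void call_free(struct call_ctx *c)
{
    c->in_use = 0;
}
```

The hybrid option would keep a pool like this for common system memory and defer only the call-specific pieces to call time.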

Historically, memory was expensive and many hardware platforms had limited memory available, so the memory footprint of an application was an important consideration. Today, however, memory is relatively cheap, and memory footprint is no longer a limiting factor on modern platforms. This means that static allocation should be used, as it gives the shortest call setup time. Avoiding dynamic allocation also reduces the risk of missing real-time processing deadlines.

A timing source is required to maintain synchronization between input and output streams. Linux is not a real-time operating system, so we must make our system approximate real-time behavior. There needs to be some timing source to start the data processing thread(s) at regular fixed intervals. If the data processing starts at a fixed interval, and data processing takes less than the length of the interval, the system should operate well.

There are several options for a timing source:

  • Hardware. This is the most reliable option: an external piece of hardware provides the timing synchronization. The hardware could be a PCM audio device (for example, a sound card), which provides an accurate way to time the system. A dedicated hardware timer could also be used; High Precision Event Timers (HPETs), available on Intel silicon, are designed specifically as media timers and improve the accuracy of the timing source.
  • Software. There are various timing sources available:
    • POSIX RT timers. A signal is sent to the process on expiration of the timer, which can be set with nanosecond granularity. However, as Linux is not an RTOS, this timer may drift depending on kernel activity; for example, under a heavy hardware-interrupt load, the timing signal may be delayed relative to the gettimeofday clock. Note: Linux real-time extensions need to be enabled for nanosecond granularity; otherwise, the granularity is to the nearest millisecond.
    • setitimer. Similar to POSIX RT timer. Granularity is in microseconds.
    • Real Time Clock (RTC). This creates a file descriptor that blocks on a read until an interval expires; on expiration, the read unblocks. A separate on-board hardware clock drives this timer. However, the periodic interrupt rate can only be set to powers of 2 Hz, so arbitrary intervals are not possible.
    • timerfd_create. This creates a file descriptor that blocks on a read until an interval expires; on expiration, the read unblocks. The interval can be set with nanosecond granularity. However, this call is not available in older Linux kernels.

The hardware timing mechanism is preferred over the software timing mechanism. Any IP phone with a PCM interface can use this as the timing source. If there is no PCM interface available (for example, IP-IP calls only), a software timing mechanism can be used. However, if the system is under heavy load, software timers are not sufficiently accurate. The user should be able to define the timing source; it should be external to the IP phone framework software. The framework could call a user-defined callback function, leaving the implementation of the timing up to the user.

Multicore CPUs are now commonplace, and some single-core CPUs, such as select Intel Atom processors, support Intel Hyper-Threading Technology (Intel HT Technology). With hyper-threading, one physical core presents two "virtual cores": two hardware threads running on the same core. Using multiple software threads takes advantage of the parallelism made possible by multiple cores.

For example, if an application is I/O-bound (the CPU sits idle while waiting for I/O to complete), it may be beneficial to make the application multithreaded. Once multithreaded, a thread of execution awaiting I/O can halt, freeing the CPU to process other data and improving throughput. Different functions of the IP phone application should run on different threads. Control and data paths can be kept on separate threads (the control path carries commands from, and messages to, the user; the data path processes voice audio traffic). There may be a thread dedicated to synchronizing the data processing threads. The greatest benefit of parallelization is realized when there are multiple data processing threads.

In the single data processing thread model, there is only one thread dedicated to processing voice traffic. This is the simplest model to implement, but it fails to take advantage of many of the features of modern CPUs. This model is most suitable for a single-endpoint application with one voice channel.

In the multiple data processing threads model, the number of threads can be made configurable by the user. The appropriate number depends on how many channels the system will support and on the type of application -- PBX, conference server, and so on.

It is possible to have a configuration that uses one data thread per channel. But this model is inefficient and leads to excessive thread switching, which affects performance.

There should be at least one data thread of execution per CPU core. If hyperthreading is used, there should be one data thread of execution per virtual core. For example, on a 4-core CPU with hyper-threading, at least eight data threads should be used.
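
A simple way to size the data-thread pool at startup is to query the number of online logical CPUs, which counts each hardware thread when hyper-threading is enabled. A sketch:

```c
#include <unistd.h>

/* Size the data-thread pool from the number of online logical CPUs.
 * With hyper-threading, each hardware thread counts as one CPU, so a
 * 4-core hyper-threaded part reports 8 and gets 8 data threads. */
long data_thread_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n > 0) ? n : 1;   /* fall back to one thread if the query fails */
}
```

Sizing the pool this way lets the same binary scale from a single-core endpoint to a multicore PBX without recompilation.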

The OS should schedule the execution of threads. Note: For an IP phone application, it is important to give the data processing threads a very high priority to ensure quality of service. This prevents a background application running on the platform from taking CPU cycles.

All of the data processing threads should have the same priority.
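
On Linux, one way to give every data thread the same elevated priority is the SCHED_FIFO real-time policy via pthread_setschedparam(). Note that raising priority normally requires root privileges or the CAP_SYS_NICE capability, so an unprivileged process should expect EPERM:

```c
#include <pthread.h>
#include <sched.h>
#include <errno.h>

/* Put a data-processing thread on the SCHED_FIFO real-time policy at the
 * given priority (valid range 1-99). Apply the same value to every data
 * thread. Returns 0 on success, or an errno value such as EPERM when run
 * without sufficient privileges. */
int set_data_thread_priority(pthread_t tid, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    return pthread_setschedparam(tid, SCHED_FIFO, &sp);
}
```

With all data threads at one priority, the kernel schedules them round-robin against each other while still preempting any background application on the platform.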

Lip Synchronization Support

If the VPF is to be used in a video phone application, it must at least provide the capability to support lip synchronization. In a video call, the video and audio streams are transmitted separately, and may not necessarily arrive simultaneously. Lip synchronization is necessary to compensate for any delta between the arrival of the voice data stream and the video data stream; it helps ensure that both streams are played in sync. Without synchronization, a noticeable lag between the video and audio output would be experienced by the user.

The VPF must be designed in such a way that packets can be traced through the framework. This is done by providing a gettimeofday field in the data buffer header; this field remains untouched by the VPF. The PCM interface sets this field when the voice data packet is received from the hardware, and the packet is then processed by the framework. The IP interface can then use this gettimeofday field to match the VPF-generated voice timestamp for the data packet to the corresponding video packet timestamp for RTCP.
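
A sketch of such a buffer header and the stamping step; the struct and field names are illustrative:

```c
#include <sys/time.h>
#include <stdint.h>
#include <stddef.h>

/* Buffer header carrying a wall-clock arrival stamp through the framework.
 * The VPF leaves 'arrival' untouched end to end. */
struct vpf_buf_hdr {
    struct timeval arrival;   /* set by the PCM interface, read by the IP side */
    uint32_t       rtp_timestamp;
    size_t         samples;
};

/* PCM interface: stamp the packet as it arrives from the hardware. */
void stamp_arrival(struct vpf_buf_hdr *h)
{
    gettimeofday(&h->arrival, NULL);
}

/* IP interface: microseconds between two packets' arrival stamps, the kind
 * of delta used to pair a voice packet with its video counterpart for
 * RTCP-based lip synchronization. */
long arrival_delta_us(const struct vpf_buf_hdr *a, const struct vpf_buf_hdr *b)
{
    return (b->arrival.tv_sec - a->arrival.tv_sec) * 1000000L +
           (b->arrival.tv_usec - a->arrival.tv_usec);
}
```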
