Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Bitstream Parsing in C++


October, 2005: Bitstream Parsing in C++

Edward Bishop writes test applications for TVs and set-top boxes for Sony Electronics. He can be reached at [email protected].


Bitstream parsing via low-level mask and shift operations is error prone, tedious, and not conducive to rapid prototyping and development. In this article, I present strategies to make things easier. For the purpose of this discussion, a bitstream protocol is a protocol defined by composite data structures. Fields may be atomic, consisting of some fixed numbers of bits, or may themselves be composite. Although the techniques that I describe apply both to parsing and composition, for simplicity, I concentrate on parsing.

For the examples presented here, I have selected the Master Guide Table (MGT) from the Program Service Information Protocol [1]; see Table 1. I don't want to get bogged down in the details of the MGT; suffice to say that it consists of a number of bit fields and also two variable-length loops, the elements of which are themselves composite structures.

How would you go about parsing an Mgt from an array of bytes? To extract the version_number, for example, the simplistic approach would be to calculate the byte offset, a mask for the bits, and a shift amount. The code might look something like this:

unsigned int version = data[5];
version &= 0x3E;
version = version >> 1;

This sort of code is very error prone. Even careful programmers occasionally calculate the wrong offset, mask, or shift amount. It is also tedious and inflexible. A change in the definition of just one field can trigger a cascade of changes to subsequent lines. This makes such code unsuitable for rapid protocol prototyping and development.

Defining extract_bits()

You can improve on the simplistic approach with the aid of a simple function to calculate the shift and mask based on the the offset and number of bits; see Example 1.

While this code is better than the simplistic approach, it still requires too much handcoding. Having eliminated the need to calculate the shift and mask, it would be nice if you could eliminate having to compute the bit offset as well.

(The source code for extract_bits() is derived from an article that I found while researching this topic. Surprisingly, the source code as published was severely flawed, having been tested on both Little- and Big-endian, but not, apparently, on byte-swapped machines.)

Defining operator>>(ibstream&)

Streams are stateful, so it makes sense to have a stream class to keep track of bit and byte offsets. See the source code for my definition of ibstream, a simple bit-oriented stream class, and also the definition of operator>>(ibstream& , bitset <N> & ) (http://www.cuj.com/code/). The ibstream constructor takes a streambuf parameter to facilitate converting an fstream or an strstream into an ibstream. (The C++ Standard Library definition of operator>>(istream&, bitset<N>) works on character strings of 1s and 0s, so in the unlikely case that your data is represented that way, you don't need the ibstream class.)

To read an MGT from an ibstream, I define class Mgt using bitset<N> for the fixed-length members and vector<MgtDefinedTable> and vector<SIDescriptor> for the variable-length members. Assuming that operator>> has already been defined for those subtypes, it is straightforward to define operator>>(ibstream& , Mgt&); see Listing 1.

The resulting code is neat and flexible. For example, to change version_number from 5 bits to 9 bits, just change bitset<5> to bitset<9>. The rest of the code stays the same.

Still, it is not quite satisfactory. Notice that the format of operator>>() basically mirrors the format of class Mgt itself. Such similarity suggests the possibility of further refactoring and code reduction, which leads us to...

The Extended Composite Pattern

Figure 1 defines a Composite pattern. Component is the base class from which the leaf and composite classes are derived. Component has a virtual Read() method, and the Composite::Read() method delegates to each of its children.

I derive class Mgt from Composite. I define class Bitset for the bit fields. Because tables and descriptors are variable-length arrays, the challenge is to somehow inform their respective read methods of the constraints imposed by the parent Mgt. The challenge is compounded by the fact that the constraints, tables_defined and descriptors_length, are of different types, the first being an enumeration and the second being a length. With that in mind, I define a special kind of Composite called a List. The List constructor takes a "Creator" parameter, which encapsulates the parent pointer along with a parent method. This extension of the standard Composite pattern might best be described as "Composite with Rules for Children" or "Composite with Guidance from the Creator."

Listing 2 shows the definition of class Mgt. The private methods KeepReadingTables() and KeepReadingDescriptors() set the bounds on the number of tables and descriptors to read, respectively. The LIST_INFO() macro hides some of the syntax needed to create an auto_ptr to a CreationFunctor. Individual fields are accessed by name via string lookup.

This design has several nice features. The code is concise, easy to read, and closely mirrors the original specification. The input method is handled automatically by the base class. It's easy to define new types and the Creator gives lots of flexibility in handling subtypes. (Exercise for the reader: Similar to class List, define class Conditional to represent conditional subtypes.)

This design also has some drawbacks. There is no way to extract just one field without reading and parsing the whole structure. The space is allocated on the heap at runtime. Member access is via string lookup, which is likely to be slow compared to other methods.

Preprocessor Metaprogramming

Now here's a completely different approach, in which as much computation as possible is done at compile time (actually, preprocess-time).

The Boost Preprocessor library [2] contains macros for basic programming constructs including loops and conditional statements. It defines operations on sequences, where sequences are defined by parenthesized expressions. Using the Boost macros, I have defined a macro called T() that takes a specification and generates code for the corresponding class definition. In broad terms, given an input specification S, T(S) will:

generate code for the class constructor
foreach subfield {
    generate code for bit and byte offsets
    generate code for get() methods
    if the subfield is a variable-length array
        generate code to access elements by index
}

In Listing 3, I use T to generate class Mgt. The specification consists of the name of the class, Mgt, followed by a sequence of fields, arranged one line per field. Each field has either two or three elements, depending on whether it is a leaf or a composite type. The first element of a field is the name. The second element of a field is either the number of bits if it is a leaf, or the name of the type if it is a composite. If it is a composite type, then the third element specifies how the length of the field is determined.

Listing 4 shows the actual definition of T(). The main workhorse of T() is BOOST_PP_SEQ_FOR_EACH_I(), which loops through the sequence of fields and applies further macros to each field. The complete definition of T depends on several more levels of macros, about 250 lines of code total; see pmeta.h (http://www.cuj.com/code/).

The classes generated by T() are lightweight. Fields are not stored, but are calculated on the fly. The private data pointers do not own the data to which they refer. These features simplify the structure of the code generated by T(). Classes generated by T() can be copied and passed by value with almost no computational overhead.

Listing 5 shows some of the code generated by T(). Although I wanted to keep sizeof(generated class) small, I also wanted to maximize runtime performance, so I decided to cache certain key values. As you can see, descriptors_num_elts_x is only calculated once. Similarly, when accessing an element of an array, say descriptor(n), the offset and index are cached so that descriptor(n+1) can be accessed quickly. The resulting size is sizeof(Mgt) = 28 in my development environment.

Macros and the code that they generate can be difficult to debug. If you need to, you can run the code through the preprocessor (for example, g++ -E -P file.cpp) or through a beautifier (indent or bcpp), and then recompile.

Flavor

"Formal Language for Audio-Visual Object Representation" (Flavor) is a powerful language for describing bitstreams [3]. The syntax is similar to C++ and specifications written in Flavor can be automatically translated into C++ or Java. There are lots of configuration options to control the code generation. For example, you can configure it to generate get() methods, put() methods, or both. The options can be given on the command line or via pragmas within the specification.

There are some drawbacks of code generation; for example, it adds an extra step to the development process and the generated code may be suboptimal, leading engineers to modify generated code rather than the input specification. These drawbacks apply not just to Flavor, but to any code generator, including my preprocessor macro T().

Listing 6 shows an Mgt defined in Flavor. Notice that the constructor and destructor had to be coded by hand, somewhat surprising considering the extent to which everything else is automated. The generated output is in the source files mgt_flavor.cpp and mgt_flavor.h.

Flavor is currently used in the Systems and Structured Audio parts of the ISO/IEC MPEG-4 Standard for the representation of the bitstream syntax.

Spirit

Spirit [4], another Boost library, is an LL parser framework that represents parsers directly as EBNF grammars in inlined C++, and has been applied to bitstream parsing by Thomas Guest in "A Mini-Project to Decode a Mini-Language" [5]. His program, dvbcodec, uses Spirit to parse the section formats at build time—generating a C++ file that is then built into the actual codec. He makes some simplifying assumptions, such as the assumption that loops are specified by number of bytes as opposed to number of elements, which proved fatal to my attempts to parse the MGT. Nevertheless, of all the strategies described in this article, this technique of using a general-purpose parsing library is probably the most powerful.

Conclusion

Should you embark on a project involving bitstream parsing, the strategies I have discussed here should serve you well. Ultimately, the choices you make will involve tradeoffs between simplicity, flexibility, ease of deployment, and performance.

References

  1. [1] Eyer, Mark K. PSIP Program and System Information Protocol, McGraw-Hill, September 2002; ISBN 0071389997.
  2. [2] Boost Preprocessor library; http://www.boost.org/libs/preprocessor/ doc/index.html.
  3. [3] Flavor; http://flavor.sourceforge.net/.
  4. [4] Spirit; http://www.boost.org/libs/spirit/index.html.
  5. [5] Guest, Thomas. "A Mini-Project to Decode a Mini-Language"; http://homepage.ntlworld.com/thomas.guest/dvbcodec/index .html.
CUJ


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.