Christopher Diggins

Dr. Dobb's Bloggers

Parsing C++ Programs using C#

September 08, 2009

Intelligent parsing of C/C++ code can be very tricky. I have started an open-source project to develop a C++ parsing framework in C#.

I have a task ahead of me: to clean up and standardize the comments of a large SDK containing over 600 C++ header files. These files were written over the course of 15 years by a large number of programmers of varying degrees of skill and experience, with no commonly agreed-upon coding guidelines.

This kind of terribly mundane task just begs to be automated, but the problem is that parsing C++ files is hard to do correctly. Part of the problem is that the source files sometimes use non-standard Microsoft compiler extensions and make extensive use of macros.

My first task is to pull out the comments and reformat them. However, I also want to extract the surrounding context. This way I can automatically generate missing comments and verify existing ones.

I have already written a general-purpose parsing framework in C++ called YARD, but I generally don't like programming in C++ anymore. It is just much easier to write and maintain code in C#. You don't have to write so much boilerplate code, and the refactoring and debugging tools are much easier to use.

So I started work last week on an open-source parsing framework in C# called C++ Ripper. It allows a grammar for C++ to be expressed within C#, much as one would write a CFG or a PEG. In fact it is very closely related to the PEG formalism, but I don't follow it to the letter.

Here is an example of code from the project, used to express a C++ structural grammar for the parser.

    node
        = bracketed_group
        | paran_group
        | brace_group
        | type_decl
        | typedef_decl
        | literal
        | symbol
        | label
        | identifier;

    declaration_content
        = Plus(node + Eat(multiline_ws));

    declaration
        = comment_set + pp_directive + Eat(multiline_ws)
        | comment_set + semicolon + Opt(same_line_comment) + Eat(multiline_ws)
        | comment_set + declaration_content + Opt(semicolon) + Opt(same_line_comment) + Eat(multiline_ws);

    file
        = declaration_list + ws + NoFail(EndOfInput());

Note that this is pure C#, with only a small amount of operator overloading. I overload the "+" and "|" operators to represent the PEG sequence and choice operators.
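
To make the overloading idea concrete, here is a minimal sketch of how "+" and "|" could be mapped to sequence and ordered choice in C#. It is not taken from C++ Ripper: the `Rule`, `Literal`, `Sequence`, `Choice`, and `GrammarSketch` names are hypothetical, and the rules only report a match position rather than building a tree.

```csharp
// Hypothetical minimal rule type for illustration; not the C++ Ripper API.
public abstract class Rule
{
    // Returns the index just past the match, or -1 if the rule fails at pos.
    public abstract int Match(string input, int pos);

    // "+" builds a PEG sequence: match a, then b where a stopped.
    public static Rule operator +(Rule a, Rule b) => new Sequence(a, b);

    // "|" builds a PEG ordered choice: try a, fall back to b at the same position.
    public static Rule operator |(Rule a, Rule b) => new Choice(a, b);
}

public sealed class Literal : Rule
{
    private readonly string text;
    public Literal(string text) { this.text = text; }

    public override int Match(string input, int pos)
    {
        bool fits = pos + text.Length <= input.Length;
        return fits && string.CompareOrdinal(input, pos, text, 0, text.Length) == 0
            ? pos + text.Length
            : -1;
    }
}

public sealed class Sequence : Rule
{
    private readonly Rule a, b;
    public Sequence(Rule a, Rule b) { this.a = a; this.b = b; }

    public override int Match(string input, int pos)
    {
        int mid = a.Match(input, pos);
        return mid < 0 ? -1 : b.Match(input, mid);
    }
}

public sealed class Choice : Rule
{
    private readonly Rule a, b;
    public Choice(Rule a, Rule b) { this.a = a; this.b = b; }

    public override int Match(string input, int pos)
    {
        int end = a.Match(input, pos);
        return end >= 0 ? end : b.Match(input, pos);
    }
}

public static class GrammarSketch
{
    // Matches "class Foo" or "struct Foo" (an illustrative toy grammar).
    public static Rule Declaration()
    {
        Rule keyword = new Literal("class") | new Literal("struct");
        return keyword + new Literal(" ") + new Literal("Foo");
    }
}
```

One convenient property of this choice of operators is that C# gives "+" higher precedence than "|", so `a + b | c + d` groups as a choice between two sequences, which is how such grammars are normally read.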

The C++ Ripper parsing framework allows a programmer to express a grammar as a series of basic Rule objects combined with Rule operators (like "+", "|", and "Opt()"), which are used to automatically create an abstract syntax tree (AST). There are several specialized rules (like "Eat()") which prevent too many nodes from being created in the tree. This is important because the grammar expresses rules right down to the individual character sets (no tokenizer is needed), so you can end up with extremely dense and hard-to-manage trees.
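
As a rough illustration of why a node-suppressing rule helps, the sketch below assumes that an Eat-style rule matches its inner rule but throws away any nodes it produced. That behavior is inferred from the description above, not from the C++ Ripper source, and all names here (`Node`, `TreeRule`, `CharRun`, `SequenceRule`, `EatRule`, `EatSketch`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node; illustrative only.
public sealed class Node
{
    public readonly string Label;
    public readonly string Text;
    public readonly List<Node> Children = new List<Node>();
    public Node(string label, string text) { Label = label; Text = text; }
}

public abstract class TreeRule
{
    // On success, appends nodes to parent and returns the new position; otherwise returns -1.
    public abstract int Match(string input, int pos, Node parent);

    public static TreeRule operator +(TreeRule a, TreeRule b) => new SequenceRule(a, b);
}

// Matches one or more characters from a character set and records them as a leaf node.
public sealed class CharRun : TreeRule
{
    private readonly string label;
    private readonly Func<char, bool> accept;
    public CharRun(string label, Func<char, bool> accept) { this.label = label; this.accept = accept; }

    public override int Match(string input, int pos, Node parent)
    {
        int end = pos;
        while (end < input.Length && accept(input[end])) end++;
        if (end == pos) return -1;
        parent.Children.Add(new Node(label, input.Substring(pos, end - pos)));
        return end;
    }
}

public sealed class SequenceRule : TreeRule
{
    private readonly TreeRule a, b;
    public SequenceRule(TreeRule a, TreeRule b) { this.a = a; this.b = b; }

    public override int Match(string input, int pos, Node parent)
    {
        int mark = parent.Children.Count;
        int mid = a.Match(input, pos, parent);
        int end = mid < 0 ? -1 : b.Match(input, mid, parent);
        // Roll back nodes added by a partial match.
        if (end < 0) parent.Children.RemoveRange(mark, parent.Children.Count - mark);
        return end;
    }
}

// Assumed Eat-style behavior: consume the input, but keep no nodes.
public sealed class EatRule : TreeRule
{
    private readonly TreeRule inner;
    public EatRule(TreeRule inner) { this.inner = inner; }

    public override int Match(string input, int pos, Node parent)
    {
        // Nodes go to a scratch parent that is discarded afterwards.
        return inner.Match(input, pos, new Node("discarded", ""));
    }
}

public static class EatSketch
{
    // Parses "word whitespace word" and returns the root of the resulting tree.
    public static Node Parse(string input)
    {
        TreeRule word = new CharRun("identifier", char.IsLetter);
        TreeRule ws = new CharRun("ws", char.IsWhiteSpace);
        TreeRule rule = word + new EatRule(ws) + word;
        var root = new Node("root", "");
        if (rule.Match(input, 0, root) < 0) throw new FormatException("input did not match");
        return root;
    }
}
```

With the whitespace wrapped in `EatRule`, the tree for two words holds only the two identifier nodes; dropping the wrapper would add a third node for the whitespace between them, and that effect multiplies quickly when rules go down to the character level.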

One of the reasons that I created my own parser from scratch is that I wanted to keep comments instead of stripping them, as most parsers do. I also wanted the parser to distinguish between comments that precede a declaration and those that follow a statement or declaration but occur before the next newline.

This project is only in its beginning stages, and I am not really trying to solve anyone else's problems but my own. However, I hope that others might find a simple C++ parsing library useful in their own projects. I can think of several things I'd like to use it for (a C++ file comparison tool, a C++ pretty-printer, a C++ refactoring tool, a Doxygen->JavaDoc converter, a documentation generator, a C++ pre-processor, etc.), if only I had more spare time (and money).

I'd love to hear whether some code from C++ Ripper makes it into your project, so please drop me a line.
