Parsing C++ Programs using C#
Intelligent parsing of C/C++ code can be very tricky. I have started an open-source project to develop a C++ parsing framework in C#.
I have a task ahead of me: cleaning up and standardizing the comments of a large SDK containing over 600 C++ header files. These files were written over the course of 15 years by a large number of programmers of varying skill and experience, with no commonly agreed-upon coding guidelines.
This kind of terribly mundane task just begs to be automated, but parsing C++ files is hard to do correctly. Part of the problem is that the source files sometimes use non-standard Microsoft compiler extensions and make extensive use of macros.
My first task is to pull out the comments and reformat them. However, I also want to extract the surrounding context, so that I can automatically generate missing comments and verify existing ones.
I have already written a general-purpose parsing framework in C++ called YARD, but I generally don't like programming in C++ anymore. It is just much easier to write and maintain code in C#: you don't have to write so much boilerplate code, and the refactoring and debugging tools are much easier to use.
So I started work last week on an open-source parsing framework in C# called C++ Ripper. It allows a grammar for C++ to be expressed within C#, similar to how one would write a context-free grammar (CFG) or a parsing expression grammar (PEG). In fact, it is very closely related to the PEG formalism, but I don't follow it to the letter.
Here is an example of code taken from here used to express a C++ structural grammar for the parser.
Note that this is pure C#, with only a small amount of operator overloading. I overload the "+" and "|" operators to represent the PEG sequence and choice operators.
The C++ Ripper parsing framework allows a programmer to build a grammar from a set of basic Rule objects and Rule operators (like "+", "|", and "Opt()"), which are used to automatically create an abstract syntax tree (AST). There are several specialized rules (like "Eat()") which prevent too many nodes from being created in the tree. This is important because the grammar expresses rules right down to the individual character sets (no tokenizer is needed), so you can end up with extremely dense and hard-to-manage trees.
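To give a feel for how "+" and "|" can stand in for PEG sequence and ordered choice, here is a minimal sketch. It is not the C++ Ripper source: every name in it (Node, CharSet, Seq, Choice, Star, Store, Demo) is made up for illustration, and the behavior I give Eat here (match a rule but throw away the nodes it produces) is only my assumption about one way such a rule could work.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of PEG-style combinators; NOT the C++ Ripper API.
public class Node
{
    public string Label;
    public string Text;
    public List<Node> Children = new List<Node>();
}

public abstract class Rule
{
    // On success: returns true and advances pos. On failure: pos and output are unchanged.
    public abstract bool Match(string input, ref int pos, List<Node> output);
    public static Rule operator +(Rule a, Rule b) { return new Seq(a, b); }     // sequence
    public static Rule operator |(Rule a, Rule b) { return new Choice(a, b); }  // ordered choice
}

public class CharSet : Rule
{
    readonly Func<char, bool> test;
    public CharSet(Func<char, bool> test) { this.test = test; }
    public override bool Match(string input, ref int pos, List<Node> output)
    {
        if (pos >= input.Length || !test(input[pos])) return false;
        pos++;
        return true;
    }
}

public class Seq : Rule
{
    readonly Rule a, b;
    public Seq(Rule a, Rule b) { this.a = a; this.b = b; }
    public override bool Match(string input, ref int pos, List<Node> output)
    {
        int start = pos, count = output.Count;
        if (a.Match(input, ref pos, output) && b.Match(input, ref pos, output)) return true;
        pos = start;                                       // backtrack
        output.RemoveRange(count, output.Count - count);   // drop nodes from the partial match
        return false;
    }
}

public class Choice : Rule
{
    readonly Rule a, b;
    public Choice(Rule a, Rule b) { this.a = a; this.b = b; }
    public override bool Match(string input, ref int pos, List<Node> output)
    {
        return a.Match(input, ref pos, output) || b.Match(input, ref pos, output);
    }
}

public class Star : Rule   // zero or more repetitions
{
    readonly Rule r;
    public Star(Rule r) { this.r = r; }
    public override bool Match(string input, ref int pos, List<Node> output)
    {
        while (true)
        {
            int before = pos;
            if (!r.Match(input, ref pos, output) || pos == before) break;
        }
        return true;
    }
}

public class Store : Rule  // wraps a successful match in a labeled AST node
{
    readonly string label; readonly Rule r;
    public Store(string label, Rule r) { this.label = label; this.r = r; }
    public override bool Match(string input, ref int pos, List<Node> output)
    {
        int start = pos;
        var node = new Node { Label = label };
        if (!r.Match(input, ref pos, node.Children)) return false;
        node.Text = input.Substring(start, pos - start);
        output.Add(node);
        return true;
    }
}

public class Eat : Rule    // assumption: matches r but discards any nodes it creates
{
    readonly Rule r;
    public Eat(Rule r) { this.r = r; }
    public override bool Match(string input, ref int pos, List<Node> output)
    {
        return r.Match(input, ref pos, new List<Node>());
    }
}

public static class Demo
{
    // Parses identifiers and numbers separated by whitespace; "#name" markers are eaten.
    public static Node Parse(string input)
    {
        Rule letter = new CharSet(char.IsLetter);
        Rule digit = new CharSet(char.IsDigit);
        Rule ws = new Star(new CharSet(char.IsWhiteSpace));
        Rule ident = new Store("ident", letter + new Star(letter | digit));
        Rule number = new Store("number", digit + new Star(digit));
        Rule skipped = new Eat(new CharSet(c => c == '#') + ident);
        Rule list = new Store("list", ws + new Star((skipped | ident | number) + ws));
        var output = new List<Node>();
        int pos = 0;
        return list.Match(input, ref pos, output) && pos == input.Length ? output[0] : null;
    }
}
```

Calling Demo.Parse("foo #tmp 42") gives a "list" node with two children, an "ident" for foo and a "number" for 42; the "#tmp" marker is matched but leaves nothing in the tree. One thing to watch for with this style: C# gives "+" higher precedence than "|", so a choice that sits inside a sequence needs parentheses.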
One of the reasons I created my own parser from scratch is that I wanted to keep comments instead of stripping them, as most parsers do. I also wanted the parser to distinguish between comments that precede a declaration and those that follow a statement or declaration but occur before the next newline.
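The distinction between the two kinds of comment can be illustrated with a much cruder tool than a full parser. The sketch below is not how C++ Ripper does it; it is a line-based scan with hypothetical names (CommentKind, Comment, CommentScanner) that handles only "//" comments and assumes "//" never appears inside a string literal or a block comment.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; a real parser would classify comments from the grammar.
public enum CommentKind { Leading, Trailing }

public class Comment
{
    public CommentKind Kind;
    public int Line;      // 1-based line number
    public string Text;   // comment text without the "//" marker
}

public static class CommentScanner
{
    public static List<Comment> Scan(string source)
    {
        var result = new List<Comment>();
        string[] lines = source.Replace("\r\n", "\n").Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            int at = lines[i].IndexOf("//", StringComparison.Ordinal);
            if (at < 0) continue;
            // Code before the comment on the same line means the comment trails that code;
            // otherwise it sits on its own line and leads whatever comes next.
            bool codeBefore = lines[i].Substring(0, at).Trim().Length > 0;
            result.Add(new Comment
            {
                Kind = codeBefore ? CommentKind.Trailing : CommentKind.Leading,
                Line = i + 1,
                Text = lines[i].Substring(at + 2).Trim()
            });
        }
        return result;
    }
}
```

Given the two lines "// Opens the file." and "int Open(); // returns 0 on success", the scan reports the first comment as Leading and the second as Trailing. A grammar-based parser can make the same decision without the string-literal and block-comment blind spots.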
This project is only in its beginning stages, and I am not really trying to solve anyone's problems but my own. However, I hope that others might find a simple C++ parsing library useful in their own projects. I can think of several things I'd like to use it for: a C++ file comparison tool, a C++ pretty-printer, a C++ refactoring tool, a Doxygen->JavaDoc converter, a documentation generator, a C++ pre-processor, and so on. Oh well, if only I had more spare time (and money).
I'd love to hear whether some code from C++ Ripper makes it into your project, so please drop me a line.