Little Languages with Lex, Yacc, and MFC

Jan99: Little Languages with Lex, Yacc, and MFC

Jason is a programmer at Maxis, where he develops simulators and simulation languages. He can be contacted at [email protected]

Whether designed to perform complex mathematical calculations, control specialized equipment, or specify text layout in web-based documents, little languages offer users a great deal of power while sparing nonprogrammers from the complexity of general-purpose languages. In this article, I will describe how to use lex, yacc, and MFC to create integrated Win32 development environments for little languages. More specifically, I'll develop a multidocument-interface application called "Slide" (short for "Small Language Integrated Development Environment") and integrate it with lex and yacc. The source code and related files for Slide, bison, and flex are available electronically (see "Resource Center," page 5).

Overview of Lex and Yacc

Lex and yacc are two of the most powerful utilities in a language designer's toolbox. Lex generates lexical analyzers (lexers) that split streams of symbols into substrings (tokens). Yacc generates grammar parsers that analyze streams of tokens and assemble the tokens into grammatical structures. The two utilities go hand-in-hand, letting you quickly develop code for converting symbol streams into grammatical structures.

Both lex and yacc translate script files into C source-code modules. The C code implements the lexer and the parser. The entry point for the yacc-generated parser is a function called yyparse(), which relies on yylex(), an external function generated by lex.

There are a number of implementations of lex and yacc available for PCs. In this article, I use BSD's flex Version 2.5.4 for lex and GNU's bison Version 1.25 for yacc. For more information on lex and yacc, see "Lex and Yacc," by Ian E. Gorman (DDJ, February 1996).

Overview of MFC

MFC is Microsoft's C++ foundation class library and application framework for Windows. The main value of MFC for projects such as this is not its structure or ease of use as a library, but its connection to Visual C++'s automated code generators -- AppWizard and ClassWizard.

AppWizard generates skeletal Windows applications. Users can customize, among other things, window style, OLE support, control styles, file extensions, Internet capabilities, and help-file support. After AppWizard creates an application, ClassWizard can be used to extend it. Users can create new window classes and add functionality to existing classes with a simple visual interface. The combination of MFC, wizards, and visual resource editors makes Visual C++ a useful tool for rapid application development without preventing serious applications from exploiting the full power of C++ and Win32.

Setting Up the MFC Application

To set up a multidocument text-editing application using MFC's AppWizard, first create a new MFC AppWizard project called Slide. Make Slide a multiple document application and accept the default settings for all the other options. When you build the application and run it, you should see a multiple document interface application with a single blank document, toolbar, menu, and status bar.

Getting the Slide application ready for flex and bison involves two changes: adjusting Slide's project settings to accommodate C source code, and redefining Slide's document and view types to support text editing. By default, AppWizard projects expect all compilation units to use precompiled headers through stdafx.h. Unfortunately, the C code generated by flex and bison cannot accept stdafx.h, so you need to change Slide's precompiled header settings.

To change the precompiled header settings, go to the project settings dialog, select the C/C++ tab, and set precompiled headers to "Automatic use of precompiled headers." This setting prevents Visual C++ from searching for stdafx.h in each compilation unit.

Next, modify Slide's document and view classes to support text editing. The document and view classes generated by AppWizard inherit directly from the generic CDocument and CView classes. They need to be modified to inherit from CRichEditDoc and CRichEditView, respectively.

Add #include <afxrich.h> to the top of SlideView.h and SlideDoc.h. Then replace all occurrences of CDocument in SlideDoc.h and SlideDoc.cpp with CRichEditDoc, and all occurrences of CView in SlideView.h and SlideView.cpp with CRichEditView.

CRichEditDoc has one pure virtual function, CreateClientItem(_reobject *), which CSlideDoc must define. The framework calls CreateClientItem(_reobject *) to create container items for objects embedded in the rich edit control. Because Slide edits plain text and embeds no objects, you can give this function a trivial implementation. Add Example 1 to CSlideDoc in SlideDoc.h.

By default, CRichEditDoc saves and loads text files via Serialize, which is precisely the functionality Slide needs. Change CSlideDoc's Serialize function so it calls CRichEditDoc's implementation (see Example 2).

When you recompile and run, the Slide1 view should be a text edit window, complete with select, cut, and paste capabilities. The Cut and Paste toolbar and menu items should be fully functional and File Load and Save should be working. You now have a multiple-document text editor. The next step is to format Slide's text for the compiler.

The first thing you need to do here is add a Compile item to Slide's menu and to use ClassWizard to add OnCompile() to CSlideView. OnCompile() will be called by MFC whenever the Compile item is selected from the menu.

Select the resource tab in the project view and open the IDR_SLIDETYPE menu for editing. Add a Build submenu between the View and Window submenus, and add a Compile menu item with IDM_COMPILE as its ID to the Build menu. Now open ClassWizard, select the Message Maps tab, make sure CSlideView is in the class name dropdown, select IDM_COMPILE in the Object IDs listbox, and double-click on the COMMAND selection to add OnCompile() to CSlideView. ClassWizard should add OnCompile() to CSlideView at the bottom of SlideView.cpp.

Now that OnCompile() is defined, CSlideView needs to stream CRichEditCtrl's text into a CString to get it ready for the compiler. Right above CSlideView::OnCompile(), add the callback function in Listing One. The rich edit control will use this callback function to stream out its text. The parameters of the callback are:

  • dwCookie, a user-supplied value which OnCompile() will use to hand the callback a pointer to a CString.
  • pbBuff, a pointer to the text to be streamed.
  • cb, the size of pbBuff.
  • pcb, a return value indicating how many bytes were actually copied.

As you can see, the callback simply tacks each character in pbBuff onto the end of the string pointer passed in dwCookie and returns cb in pcb.
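The callback pattern described above can be tried outside of MFC. The following is a minimal, portable sketch rather than the article's code: it assumes std::string in place of CString, and the names StreamCallback, AppendChunk, and StreamOutText are hypothetical stand-ins for the Win32 callback type, the callback in Listing One, and the control's stream-out operation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the rich edit stream callback signature:
// cookie carries a pointer to the destination string, buff/cb describe the
// chunk being delivered, and *pcb reports how many bytes were consumed.
using StreamCallback = unsigned long (*)(std::uintptr_t cookie,
                                         const unsigned char *buff,
                                         long cb, long *pcb);

// Appends the chunk to the std::string whose address is in the cookie.
unsigned long AppendChunk(std::uintptr_t cookie, const unsigned char *buff,
                          long cb, long *pcb)
{
    std::string *dest = reinterpret_cast<std::string *>(cookie);
    dest->append(reinterpret_cast<const char *>(buff),
                 static_cast<std::size_t>(cb));
    *pcb = cb;   // report that every byte was consumed
    return 0;    // zero means no error, so streaming continues
}

// Hypothetical stand-in for the control: hands the text to the callback in
// chunks of at most chunkSize bytes and stops on the first nonzero result.
unsigned long StreamOutText(const std::string &text, std::size_t chunkSize,
                            std::uintptr_t cookie, StreamCallback callback)
{
    if (chunkSize == 0)
        chunkSize = 1;   // avoid an endless loop on a zero chunk size
    for (std::size_t pos = 0; pos < text.size(); pos += chunkSize) {
        std::size_t n = text.size() - pos;
        if (n > chunkSize)
            n = chunkSize;
        long copied = 0;
        unsigned long err = callback(
            cookie,
            reinterpret_cast<const unsigned char *>(text.data() + pos),
            static_cast<long>(n), &copied);
        if (err != 0)
            return err;
    }
    return 0;
}
```

Because the text may arrive in several chunks, the callback appends to the destination string instead of assigning to it. Passing the string's address through the cookie is what lets a plain function pointer reach the caller's data.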

To stream CSlideView's text into a CString, add Example 3 to the body of OnCompile(). If everything goes well, you should be able to set a breakpoint at the end of OnCompile(), type some text into Slide1, select Compile, examine compileString in a watch window, and see the text you typed contained in compileString.

Slide now has everything it needs to send a text stream to a yacc-generated grammar parser.

Setting Up Flex and Bison

Flex and bison use script files to generate C modules. For the purposes of this article, I will implement a rudimentary parser that recognizes an arbitrarily long sequence of comma-delimited strings, where a string is any letter followed by any number of letters or digits.

The first step is to put the appropriate flex and bison utility files into Slide's project directory. These are FLEX.EXE, FLEX.SKL, BISON.EXE, BISON.SIMPLE, and BISON.HAI. BISON.SIMPLE is usually called BISON.SIM thanks to the 8.3 filename legacy, so you may need to rename it.

The next step is to add the script files for the parser and the lexer to the project. The parser's file is called SLPARS.Y (Listing Two), and the lexer's is SLLEX.L (Listing Three).

The idea is to add SLLEX.L and SLPARS.Y to the Slide project and to use the Custom Build feature to tell Visual C++ how to build these files. The build rules for SLLEX.L and SLPARS.Y generate C code files that must also be compiled and linked into the project. Therefore, it is important to make sure that you set up the dependencies and output files in the custom build settings to guarantee that SLLEX.L and SLPARS.Y are processed before their corresponding C files.

Add SLLEX.L and SLPARS.Y to the Slide project and open the settings for SLLEX.L. Add FLEX SLLEX.L to the Build command(s) listbox.

Next, add LEXYY.C to Output file(s) and SLPARS.Y to Dependencies. This informs Visual C++ that SLLEX.L generates LEXYY.C and that SLLEX.L must be rebuilt if SLPARS.Y is modified. Now select SLPARS.Y's settings and add the build command BISON SLPARS.Y -d. Add SLPARS_T.C and SLPARS_T.H to the Output file(s) listbox.

When you build, flex and bison should generate three new files in Slide's project directory: LEXYY.C, SLPARS_T.C, and SLPARS_T.H. LEXYY.C is generated by flex and implements the lexer. Bison generates SLPARS_T.C and SLPARS_T.H. SLPARS_T.C implements the grammar parser, and SLPARS_T.H exports the symbols defined by the parser for use by the lexer. The -d parameter in the bison build rule instructs bison to generate SLPARS_T.H. Add LEXYY.C, SLPARS_T.C, and SLPARS_T.H to the Slide project.

Redirecting Lex's Input

Once both the MFC text editor and grammar parser are working in the application, the next step is to feed the text from the text editor window to the parser code.

The key to interfacing yacc with MFC is redirecting yylex()'s input stream. By default, yylex() reads its input from yyin, which is a FILE *. If yyin is 0, then yylex() takes input from stdin.

Neither of these options is ideal for the Win32 world. You could dump CSlideView's text into a temporary FILE * and hand this to yylex(), but ideally, yylex() should read its symbols from the CString that OnCompile() generates.

Different flavors of lex have different protocols for redirecting the input stream. Flex lets users redirect input by redefining the YY_INPUT macro in the script file. YY_INPUT has the form: YY_INPUT (buffer, result, max_size), where buffer is the pointer to be filled in, max_size is the storage capacity for buffer, and result gets the number of bytes actually read.

In this case, SLLEX.L redefines YY_INPUT as follows: #define YY_INPUT(buf,result,max_size) (result = SlideYYInput (buf,max_size)), where SlideYYInput(char *,int) is defined in the C source-code section of SLLEX.L. It copies data from a static char *, SlideInputStream, into the buffer provided by lex (see Listing Three). SlideCompile(char *) (also in Listing Three) is Slide's compiler entry point. It takes a char * input stream and assigns it to SlideInputStream, calls yyrestart(0) to reset lex's state machine, and calls yyparse() to run the grammar parser.
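The copying behavior of such an input function can be checked on its own, without flex. The sketch below is an illustration under stated assumptions, not the code from SLLEX.L: SetInput, ReadInput, and g_inputCursor are hypothetical names, and the function follows the same contract as a YY_INPUT replacement, where the number of bytes copied is the result and 0 means end of input.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical cursor into the NUL-terminated text still to be scanned.
static const char *g_inputCursor = "";

// Points the reader at a new NUL-terminated string.
void SetInput(const char *text)
{
    g_inputCursor = text;
}

// Copies up to maxSize bytes of the remaining input into buf and returns the
// number copied. A return value of 0 signals end of input.
int ReadInput(char *buf, int maxSize)
{
    if (maxSize <= 0)
        return 0;
    std::size_t remaining = std::strlen(g_inputCursor);
    int n = remaining < static_cast<std::size_t>(maxSize)
                ? static_cast<int>(remaining)
                : maxSize;
    if (n > 0) {
        std::memcpy(buf, g_inputCursor, static_cast<std::size_t>(n));
        g_inputCursor += n;   // the next call resumes where this one stopped
    }
    return n;
}
```

The scanner may ask for less than the whole string at once, so the cursor has to persist between calls. That is why it lives in a file-scope static here.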

Final Details

Finally, lex and yacc require you to define the functions yyerror(char *) and yywrap(). yyerror(char *) gets called when the parser encounters an illegal token sequence. The char * parameter is a string describing the error and it is almost always "parse error."

For my purposes here, yyerror(char *) calls SlideMessage(char *) (see Example 4), which pops a message box. It is important to declare SlideMessage extern "C" because it is called from the C code generated by flex and bison.
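The linkage requirement can be illustrated without a message box. This is a minimal sketch, assuming a message log in place of the real user interface: g_messages and ReportError are hypothetical names, and ReportError stands in for the generated code's error hook.

```cpp
#include <string>
#include <vector>

// Hypothetical log that stands in for the message box.
static std::vector<std::string> g_messages;

// extern "C" gives the function an unmangled name, so object code compiled
// as C (such as a generated lexer or parser) can link against it.
extern "C" void SlideMessage(char *message)
{
    g_messages.push_back(message ? message : "");
}

// Hypothetical stand-in for the error hook in the generated C code, which
// forwards the error text to SlideMessage.
extern "C" void ReportError(char *err)
{
    SlideMessage(err);
}
```

Without extern "C", the C++ compiler would emit a mangled symbol name, and the C modules that declare void SlideMessage(char *) would fail to link.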

yywrap() gets called when lex runs out of input, giving users the opportunity to continue tokenizing with a new stream. In Slide's case, yywrap() should just return 1, indicating that the input stream has terminated.

Putting It All Together

The pieces are all in place. Now all that's left to do is declare the compiler entry point in SlideView.cpp and call SlideCompile(char *) from OnCompile(). Listing Four shows the final form of OnCompile(). That's all. Slide is now ready to go.

Testing Slide

The language implemented in SLLEX.L and SLPARS.Y is extremely simple. It recognizes an arbitrarily long sequence of comma-delimited strings. When it recognizes such a list it displays "StringList Found" in a message box. If it encounters anything else, it displays an error message.

You'll notice that sometimes you get both messages. For example, the sequence little,languages,are,cool ddj generates a "StringList Found" message followed by a parse error message. This happens because the first four words of the sentence are properly comma delimited. Those words form a proper sentence, which the grammar parser recognizes. The trailing "ddj" is not grammatically correct, so it generates an error.
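The two-message behavior can be reproduced with a small hand-written recognizer. The sketch below approximates the behavior described above; it is not bison's table-driven algorithm, and Tok, NextToken, and CheckStringList are hypothetical names. It collects the messages in a vector instead of showing message boxes.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class Tok { End, String, Comma, Other };

// Scans one token starting at pos, skipping spaces, tabs, and newlines.
// A String is a letter followed by any number of letters or digits.
static Tok NextToken(const std::string &in, std::size_t &pos)
{
    while (pos < in.size() &&
           (in[pos] == ' ' || in[pos] == '\t' || in[pos] == '\n'))
        ++pos;
    if (pos >= in.size())
        return Tok::End;
    if (std::isalpha(static_cast<unsigned char>(in[pos]))) {
        while (pos < in.size() &&
               std::isalnum(static_cast<unsigned char>(in[pos])))
            ++pos;
        return Tok::String;
    }
    return in[pos++] == ',' ? Tok::Comma : Tok::Other;
}

// Returns the messages the compile step would report for the input.
std::vector<std::string> CheckStringList(const std::string &in)
{
    std::vector<std::string> messages;
    std::size_t pos = 0;
    Tok tok = NextToken(in, pos);
    if (tok == Tok::End)
        return messages;                      // empty input is accepted silently
    if (tok != Tok::String) {
        messages.push_back("parse error");    // a list must start with a string
        return messages;
    }
    for (;;) {
        tok = NextToken(in, pos);
        if (tok != Tok::Comma)
            break;                            // the list ends here
        tok = NextToken(in, pos);
        if (tok != Tok::String) {
            messages.push_back("parse error"); // comma without a string after it
            return messages;
        }
    }
    messages.push_back("StringList Found");   // the list itself is complete
    if (tok != Tok::End)
        messages.push_back("parse error");    // trailing input after the list
    return messages;
}
```

For the input little,languages,are,cool ddj, the list is recognized as complete as soon as a token other than a comma follows it, so the success message is recorded before the trailing ddj is rejected.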

Parting Thoughts

As consumer software becomes increasingly content-driven, the desire on the part of nonprogrammers to control application behavior and appearance will increase. One of the most striking examples of this phenomenon in recent years is HTML.

Providing content managers with special-purpose languages for modifying applications can not only save development time, but can also help clarify product designs by formalizing a vocabulary and grammar for describing product features.

In this article, I have described how to use MFC, flex, and bison to create an integrated text editor and compiler. But there is more to an integrated development environment than just text editing and compiling. Error reporting, debugging, and multifile project management are typical features of modern development environments. MFC makes extending Slide to support these features relatively easy.

Further Reading

Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1985.

Kaplan, Randy M. Constructing Language Processors for Little Languages. John Wiley & Sons, 1994.

Levine, John R., Tony Mason, and Doug Brown. lex & yacc. O'Reilly & Associates, 1992.


Listing One

/* This callback is used by CSlideView's rich text edit control to stream text
   into a CString
   dwCookie - pointer to a CString
   pbBuff - pointer to text
   cb - size of pbBuff
   pcb - gets number of bytes copied
*/
static DWORD CALLBACK SlideViewEditStreamCallBack(
 DWORD dwCookie, LPBYTE pbBuff, LONG cb, LONG *pcb)
{
 CString *compileString = reinterpret_cast<CString *>(dwCookie);
 for (int i=0;i<cb;i++)
  (*compileString) += static_cast<char>(pbBuff[i]);
 *pcb = cb;
 return 0;
}


Listing Two

%{
/* This bison file implements the grammar rules for recognizing a
   sequence of comma-delimited strings */
void SlideMessage(char *message);
void yyerror(char *);
int yylex(void);
%}

%token TOKEN

%%

start: /*empty*/
 | little_list {SlideMessage("StringList Found");}
 ;
little_list : TOKEN
 | little_list ',' TOKEN
 ;


Listing Three

%{
/* This flex file implements the rules for tokenizing strings */
#include "slpars_t.h"
#include <stdlib.h>
#include <string.h>

/* Redefine input source (see article) */
int SlideYYInput(char *buf,int max_size);
#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \
 (result = SlideYYInput(buf,max_size))
%}
%%
[A-Za-z][A-Za-z0-9]* { return TOKEN; }
[ \t\n] { /* ignore whitespace */ }
. { return yytext[0]; }
%%
void SlideMessage(char *message);
int yyparse();
static char *SlideInputStream;

/* yywrap and yyerror are required by flex and bison */
void yyerror(char *err)
{
 SlideMessage(err);
}
int yywrap()
{
 return 1;
}
/* YY_INPUT redirects lex to get its input from this function.
   It just copies data from a local static char * assigned in SlideCompile. */
int SlideYYInput(char *buf,int max_size)
{
 int n = min(max_size,(int)strlen(SlideInputStream));
 if (n > 0)
 {
  memcpy(buf,SlideInputStream,n);
  SlideInputStream += n;
 }
 return n;
}
/* SlideCompile runs the compiler. Called from OnCompile in CSlideView */
void SlideCompile(char *inputStream)
{
 SlideInputStream = inputStream;
 yyrestart(0);
 yyparse();
}


Listing Four

/* OnCompile streams text from the rich edit control into a local CString and
   sends the CString to SlideCompile for processing by lex and yacc */
extern "C"
void SlideCompile(char *inputStream);
void CSlideView::OnCompile()
{
 CString compileString = "";
 EDITSTREAM es =
  {reinterpret_cast<DWORD>(&compileString), 0, SlideViewEditStreamCallBack};
 GetRichEditCtrl().StreamOut(SF_TEXT, es);
 if (es.dwError == 0)
  SlideCompile(const_cast<char *>(LPCTSTR(compileString)));
}

Copyright © 1999, Dr. Dobb's Journal
