Channels ▼
RSS

JVM Languages

So You Want To Write Your Own Language?


My career has been all about designing programming languages and writing compilers for them. This has been a great joy and source of satisfaction to me, and perhaps I can offer some observations about what you're in for if you decide to design and implement a professional programming language. This is actually a book-length topic, so I'll just hit on a few highlights here and avoid topics well covered elsewhere.

Work

First off, you're in for a lot of work…years of work…most of which will be wandering in the desert. The odds of success are heavily stacked against you. If you are not strongly self-motivated to do this, it isn't going to happen. If you need validation and encouragement from others, it isn't going to happen.

Fortunately, embarking on such a project is not major dollar investment; it won't break you if you fail. Even if you do fail, depending on how far the project got, it can look pretty good on your résumé and be good for your career.

Design

One thing abundantly clear is that syntax matters. It matters an awful lot. It's like the styling on a car — if the styling is not appealing, it simply doesn't matter how hot the performance is. The syntax needs to be something your target audience will like.

Trying to go with something they've not seen before will make language adoption a much tougher sell.

I like to go with a mix of familiar syntax and aesthetic beauty. It's got to look good on the screen. After all, you're going to spend plenty of time looking at it. If it looks awkward, clumsy, or ugly, it will taint the language.

There are a few things I (perhaps surprisingly) suggest should not be considerations. These are false gods:

  1. Minimizing keystrokes. Maybe this mattered when programmers used paper tape, and it matters for small languages like bash or awk. For larger applications, much more programming time is spent reading than writing, so reducing keystrokes shouldn't be a goal in itself. Of course, I'm not suggesting that large amounts of boilerplate is a good idea.
  2. Easy parsing. It isn't hard to write parsers with arbitrary lookahead. The looks of the language shouldn't be compromised to save a few lines of code in the parser. Remember, you'll spend a lot of time staring at the code. That comes first. As mentioned below, it still should be a context-free grammar.
  3. Minimizing the number of keywords. This metric is just silly, but I see it cropping up repeatedly. There are a million words in the English language, I don't think there is any looming shortage. Just use your good judgment.

Things that are true gods:

  1. Context-free grammars. What this really means is the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.
  2. Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages.
  3. Tried and true. Absent a very strong reason, it's best to stick with tried and true grammatical forms for familiar constructs. It really cuts the learning curve for the language and will increase adoption rates. Think of how people will hate the language if it swaps the operator precedence of + and *. Save the divergence for features not generally seen before, which also signals the user that this is new.

As always, these principles should not be taken as dicta. Use good judgment. Any language design principle blindly followed leads to disaster. The principles are rarely orthogonal and frequently conflict. It's a lot like designing a house — making the master closet bigger means the master bedroom gets smaller. It's all about finding the right balance.

Getting past the syntax, the meat of the language will be the semantic processing, which is where meaning is assigned to the syntactical constructs. This is where you'll be spending the vast bulk of design and implementation. It's much like the organs in your body — they are unseen and we don't think about them unless they are going wrong. There won't be a lot of glory in the semantic work, but in it will be the whole point of the language.

Once through the semantic phase, the compiler does optimizations and then code generation — collectively called the "back end." These two passes are very challenging and complicated. Personally, I love working with this stuff, and grumble that I've got to spend time on other issues. But unless you really like it, and it takes a fairly unhinged programmer to delight in the arcana of such things, I recommend taking the common sense approach and using an existing back end, such as the JVM, CLR, gcc, or LLVM. (Of course, I can always set you up with the glorious Digital Mars back end!)

Implementation

How best to implement it? I hope I can at least set you off in the right direction. The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well. I'll close that with the famous quote from Jamie Zawinski:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Somewhat more controversial, I wouldn't bother wasting time with lexer or parser generators and other so-called "compiler compilers." They're a waste of time. Writing a lexer and parser is a tiny percentage of the job of writing a compiler. Using a generator will take up about as much time as writing one by hand, and it will marry you to the generator (which matters when porting the compiler to a new platform). And generators also have the unfortunate reputation of emitting lousy error messages.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video