So You Want To Write Your Own Language?


My career has been all about designing programming languages and writing compilers for them. This has been a great joy and source of satisfaction to me, and perhaps I can offer some observations about what you're in for if you decide to design and implement a professional programming language. This is actually a book-length topic, so I'll just hit on a few highlights here and avoid topics well covered elsewhere.

Work

First off, you're in for a lot of work…years of work…most of which will be wandering in the desert. The odds of success are heavily stacked against you. If you are not strongly self-motivated to do this, it isn't going to happen. If you need validation and encouragement from others, it isn't going to happen.

Fortunately, embarking on such a project is not a major dollar investment; it won't break you if you fail. Even if you do fail, depending on how far the project got, it can look pretty good on your résumé and be good for your career.

Design

One thing abundantly clear is that syntax matters. It matters an awful lot. It's like the styling on a car — if the styling is not appealing, it simply doesn't matter how hot the performance is. The syntax needs to be something your target audience will like.

Trying to go with something they've not seen before will make language adoption a much tougher sell.

I like to go with a mix of familiar syntax and aesthetic beauty. It's got to look good on the screen. After all, you're going to spend plenty of time looking at it. If it looks awkward, clumsy, or ugly, it will taint the language.

There are a few things I (perhaps surprisingly) suggest should not be considerations. These are false gods:

  1. Minimizing keystrokes. Maybe this mattered when programmers used paper tape, and it matters for small languages like bash or awk. For larger applications, much more programming time is spent reading than writing, so reducing keystrokes shouldn't be a goal in itself. Of course, I'm not suggesting that large amounts of boilerplate are a good idea.
  2. Easy parsing. It isn't hard to write parsers with arbitrary lookahead (there's a sketch of one after this list). The looks of the language shouldn't be compromised to save a few lines of code in the parser. Remember, you'll spend a lot of time staring at the code. That comes first. As mentioned below, it still should be a context-free grammar.
  3. Minimizing the number of keywords. This metric is just silly, but I see it cropping up repeatedly. There are a million words in the English language; I don't think there is any looming shortage. Just use your good judgment.
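
On the second point, a hand-written recursive-descent parser gets arbitrary lookahead almost for free: keep the token stream in a buffer and index ahead of the current position. Here is a hypothetical C++ fragment (the token kinds and method names are made up for illustration, not taken from any real compiler):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical sketch: with the whole token stream in a buffer,
    // "arbitrary lookahead" is just indexing ahead of the current position.
    struct Token { std::string kind, text; };   // kind: "ident", "(", "=", ...

    class Parser {
        std::vector<Token> toks;
        std::size_t pos = 0;
    public:
        explicit Parser(std::vector<Token> t) : toks(std::move(t)) {}

        const Token& peek(std::size_t ahead = 0) const {
            static const Token eof{"eof", ""};
            return pos + ahead < toks.size() ? toks[pos + ahead] : eof;
        }

        // Decide between an assignment ("x = ...") and a call ("x(...)")
        // by looking one token past the identifier.
        bool looksLikeAssignment() const {
            return peek(0).kind == "ident" && peek(1).kind == "=";
        }
    };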

Things that are true gods:

  1. Context-free grammars. What this really means is that the code should be parsable without having to look things up in a symbol table. C++ famously does not have a context-free grammar (the snippet after this list shows why). A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.
  2. Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement-terminating semicolons are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: any random sequence of characters would then be a valid program, and no error messages would even be possible. A good syntax needs redundancy in order to diagnose errors and give good error messages.
  3. Tried and true. Absent a very strong reason, it's best to stick with tried and true grammatical forms for familiar constructs. It really cuts the learning curve for the language and will increase adoption rates. Think of how people will hate the language if it swaps the operator precedence of + and *. Save the divergence for features not generally seen before, which also signals the user that this is new.
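
To make the first point concrete, here is a minimal C++ snippet (the classic textbook example, not tied to any particular compiler) showing why the parser must consult the symbol table: the same token sequence is either a declaration or an expression depending on what the first name refers to.

    // "a * b;" cannot be parsed without knowing what "a" is.
    struct a { int x; };   // because of this definition, "a * b;" below is a declaration

    int main() {
        a * b;             // declares b as an (uninitialized) pointer to struct a
        b = nullptr;       // ...so this assignment compiles

        int c = 2, d = 3;
        (void)(c * d);     // same "name * name" shape, but here it's a multiplication
        return 0;
    }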

As always, these principles should not be taken as dicta. Use good judgment. Any language design principle blindly followed leads to disaster. The principles are rarely orthogonal and frequently conflict. It's a lot like designing a house — making the master closet bigger means the master bedroom gets smaller. It's all about finding the right balance.

Getting past the syntax, the meat of the language will be the semantic processing, which is where meaning is assigned to the syntactical constructs. This is where you'll spend the vast bulk of the design and implementation effort. It's much like the organs in your body — they are unseen, and we don't think about them unless something goes wrong. There won't be a lot of glory in the semantic work, but it is the whole point of the language.
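
As a rough illustration (the names and structure here are hypothetical, not taken from any particular compiler), the core of semantic analysis is typically a recursive walk over the syntax tree that resolves identifiers against a symbol table and assigns a type to every node:

    #include <memory>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>

    // Hypothetical sketch: resolve names and assign a type to each expression node.
    enum class Type { Int, Bool };

    struct Expr {
        virtual ~Expr() = default;
        Type type{};                       // filled in by check()
    };

    struct IntLiteral : Expr { long value; };
    struct Ident      : Expr { std::string name; };
    struct Add        : Expr { std::unique_ptr<Expr> lhs, rhs; };

    using SymbolTable = std::unordered_map<std::string, Type>;

    Type check(Expr& e, const SymbolTable& symbols) {
        if (dynamic_cast<IntLiteral*>(&e)) {
            e.type = Type::Int;
        } else if (auto* id = dynamic_cast<Ident*>(&e)) {
            auto it = symbols.find(id->name);
            if (it == symbols.end())
                throw std::runtime_error("undefined identifier: " + id->name);
            e.type = it->second;
        } else if (auto* add = dynamic_cast<Add*>(&e)) {
            if (check(*add->lhs, symbols) != Type::Int ||
                check(*add->rhs, symbols) != Type::Int)
                throw std::runtime_error("operands of + must be integers");
            e.type = Type::Int;
        }
        return e.type;
    }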

Once through the semantic phase, the compiler does optimizations and then code generation — collectively called the "back end." These two passes are very challenging and complicated. Personally, I love working with this stuff, and grumble that I've got to spend time on other issues. But unless you really like it, and it takes a fairly unhinged programmer to delight in the arcana of such things, I recommend taking the common sense approach and using an existing back end, such as the JVM, CLR, gcc, or LLVM. (Of course, I can always set you up with the glorious Digital Mars back end!)
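
To give a feel for the hand-off, here is a toy sketch (it only prints text and does not use any real back-end API) of the textual intermediate form a back end like LLVM consumes for a function that adds two integers:

    #include <cstdio>

    // Toy sketch: print the textual LLVM IR for "int add(int a, int b)".
    // A real compiler would drive the back end through its library API rather
    // than printing strings, but the division of labor is the same: the front
    // end translates its typed tree into the back end's intermediate form, and
    // the back end handles optimization and machine-code generation.
    int main() {
        std::puts("define i32 @add(i32 %a, i32 %b) {");
        std::puts("entry:");
        std::puts("  %sum = add i32 %a, %b");
        std::puts("  ret i32 %sum");
        std::puts("}");
        return 0;
    }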

Implementation

How best to implement it? I hope I can at least set you off in the right direction. The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well. I'll just add the famous quote from Jamie Zawinski:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Somewhat more controversially, I wouldn't bother with lexer or parser generators and other so-called "compiler compilers." They're a waste of time. Writing a lexer and parser is a tiny percentage of the job of writing a compiler. Using a generator will take up about as much time as writing one by hand, and it will marry you to the generator (which matters when porting the compiler to a new platform). Generators also have the unfortunate reputation of emitting lousy error messages.
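
For a sense of scale, here is a rough sketch of a hand-written lexer (identifiers, integer literals, and single-character punctuation only; real ones add keywords, comments, string literals, and line/column tracking for error messages):

    #include <cctype>
    #include <string>
    #include <vector>

    enum class Tok { Ident, Number, Punct, End };

    struct Token {
        Tok kind;
        std::string text;
    };

    std::vector<Token> lex(const std::string& src) {
        std::vector<Token> tokens;
        std::size_t i = 0;
        while (i < src.size()) {
            char c = src[i];
            if (std::isspace(static_cast<unsigned char>(c))) {
                ++i;                                   // skip whitespace
            } else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
                std::size_t start = i;                 // identifier
                while (i < src.size() &&
                       (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                    ++i;
                tokens.push_back({Tok::Ident, src.substr(start, i - start)});
            } else if (std::isdigit(static_cast<unsigned char>(c))) {
                std::size_t start = i;                 // integer literal
                while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                    ++i;
                tokens.push_back({Tok::Number, src.substr(start, i - start)});
            } else {
                tokens.push_back({Tok::Punct, std::string(1, c)});  // operators, braces, ...
                ++i;
            }
        }
        tokens.push_back({Tok::End, ""});
        return tokens;
    }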


Comments:

ubm_techweb_disqus_sso_-c66186002719ab14bb8c9ed1c84c338e
2014-12-19T12:48:15

I don't see the point in semicolons. You can't use them as delimiters, because somebody might have forgotten to put them in, so you have to use some other marker, such as a newline character, and then complain to the programmer if he has failed to put in this thing you have no use for.


AndrewBinstock
2014-08-03T23:43:40

Google is your friend. There are hundreds of articles on the Web about how to do this. And dozens of books, several of them free. Have you looked at any of them? This question has also been posted in numerous forums.


princeeternity
2014-08-01T04:33:38

How do I create a lexer and a parser by hand?


hgad
2014-02-20T23:01:56

Great read. Thanks!


ubm_techweb_disqus_sso_-a086639ebf409b48172b84622ea42947
2014-02-16T01:10:03

very enjoyable read!!!


disqus_DGtR0kqzuM
2014-02-05T22:21:01

Why should it not rely on a symbol table? How are you supposed to implement separate compilation without one? Or do you expect people to put up with C like includes?


ubm_techweb_disqus_sso_-69cccb9a3c6f8fb6d506ea96a0358917
2014-02-04T08:16:15

Ddoc is designed to work with D code, and it relies on semantic information provided by the D compiler to Ddoc. It won't work with C++.


ubm_techweb_disqus_sso_-d61ca45a8c7fed1a423c3901aeec682d
2014-01-26T13:56:22

In general, I agree with your assessment of parser generators. One very popular one is more cult than tool. The problem with recursive descent is that it is never correct and it is difficult to modify, for a large parser. I would like to humbly suggest that SLK at www.slkpg.byethost7.com suffers the shortcomings much less than others. Back in 2005 I did a translator from TRANSACT to COBOL for an auto company. Clearly these ancient languages were not changing the way a language under development would be. Still, there were frequent enough changes to the parser that I am sure that I would still be working on it today if it had not been done using SLK.


ubm_techweb_disqus_sso_-6db56da6c2739c8811bdab6af015e077
2014-01-23T04:21:35

Interesting, including the last section and the elaboration below (on using github). After reading this, looked at http://dlang.org/ddoc.html and decided I have to try out Ddoc for the C++ code I develop daily. I tried doxygen twice before, and gave up since attempting to document for doxygen interfered with my thought process for solving the problem for which I was developing code. But I have not been able to find anything on Ddoc other than the above link; would appreciate any pointers for getting started documenting C++ with Ddoc.


rudmerriam
2014-01-22T16:53:19

Very interesting article. Back in the early 80s I created an IDE (before they were prevalent!) for an embedded system. It ran on a PC and was written in Turbo Pascal. TP was the model for the IDE. The output was downloaded to the embedded system where a byte-interpreter ran the program.

This system allowed end users to create their own programs for this embedded computer. The language was pretty much Basic as it appeared on PCs. The recursive-descent parser was fascinating to write. Following the model of the TP IDE an error took you to the line of code where it appeared.

It was a fascinating project as I am sure D is, also.


GerryRzeppa
2014-01-22T06:54:27

Interesting article, Walter.

Some years ago my elder son and I set about developing a compiler in the interest of answering three specific but rather unusual questions:

1. Is it easier to program when you don’t have to translate your natural-language thoughts into an alternate syntax?

2. Can natural languages be parsed in a relatively “sloppy” manner (as humans apparently parse them) and still provide a stable enough environment for productive programming?

3. Can low-level programs (like compilers) be conveniently and efficiently written in high level languages (like English)?

I'm happy to report that we can now answer each of those three questions, from direct experience, with a resounding “Yes!” Here are some details:

Our parser operates, we think, something like the parsing centers in the human brain. Consider, for example, a father saying to his baby son:

“Want to suck on this bottle, little guy?”

And the kid hears,

“blah, blah, SUCK, blah, blah, BOTTLE, blah, blah.”

But he properly responds because he’s got a “picture” of a bottle in the right side of his head connected to the word “bottle” on the left side, and a pre-existing “skill” near the back of his neck connected to the term “suck”. In other words, the kid matches what he can with the pictures (types) and skills (routines) he’s accumulated, and simply disregards the rest. Our compiler does very much the same thing, with new pictures (types) and skills (routines) being defined -- not by us, but -- by the programmer, as he writes new application code.

A typical type definition looks like this:

A polygon is a thing with some vertices.

Internally, the name “polygon” is now associated with a type of dynamically-allocated structure that contains a doubly-linked list of vertices. “Vertex” is defined elsewhere (before or after this definition) in a similar fashion; the plural is automatically understood.

A typical routine looks like this:

To append an x coord and a y coord to a polygon:
Create a vertex given the x and the y.
Append the vertex to the polygon’s vertices.

Note that formal names (proper nouns) are not required for parameters and variables. This, we believe, is a major insight. My real-world chair and table are never (in normal conversation) called “c” or “myTable” -- I refer to them simply as “the chair” and “the table”. Likewise here: “the vertex” and “the polygon” are the natural names for such things.

Note also that spaces are allowed in routine and variable “names” (like “x coord”). This is the 21st century, yes? And that “nicknames” are also allowed (such as “x” for “x coord”). And that possessives (“the polygon’s vertices”) are used in a very natural way to reference “fields” within “records”.

Note, as well, that the word “given” could have been “using” or “with” or any other equivalent since our sloppy parsing focuses on the pictures (types) and skills (routines) needed for understanding, and ignores, as much as possible, the rest.

At the lowest level, things look like this:

To add a number to another number:
Intel $8B85080000008B008B9D0C0000000103.

Note that in this case we have both the highest and lowest of languages -- English and machine code (in hexadecimal) -- in a single routine. The insight here is that (like a typical math book) a program should be written primarily in a natural language, with appropriate snippets in more convenient syntaxes as (and only as) required.

We hope someday soon to extend the technology to include Plain Spanish, and Plain French, and Plain German, etc.

Anyway, if you're interested, you can download the whole thing here:

www.osmosian.com/cal-3040.zip

It’s a small Windows program, less than a megabyte in size. No installation necessary; just unzip and execute. But it's a complete development environment, including a unique interface, a simplified file manager, an elegant text editor, a handy hexadecimal dumper, a native-code-generating compiler/linker, and even a wysiwyg page layout facility (that we used to produce the documentation). If you start with the "instructions.pdf" in the “documentation” directory, before you go ten pages you won't just be writing "Hello, World!" to the screen: you’ll be re-compiling the whole shebang in itself (in less than three seconds on a bottom-of-the-line machine from Walmart).

If you're wary of downloading a zip file from someone you don't know, you can get just the documentation as a PDF here:

www.osmosian.com/instructions....

Thanks,

Gerry Rzeppa
Grand Negus of the Osmosian Order of Plain English Programmers

Dan Rzeppa
Prime Assembler of the Osmosian Order of Plain English Programmers


ubm_techweb_disqus_sso_-69cccb9a3c6f8fb6d506ea96a0358917
2014-01-22T00:12:39

I want to expand on my remarks on using github.
Keep records of all your emails and the history of your code. I've been accused of stealing other people's code:
1. from a person who forgot he'd licensed me the code
2. from a company that accused me of stealing from them the code they stole from me
3. from a company who refused to pay me for a contract job because they said they developed the code themselves
4. people who were going to "turn me in" to companies I'd licensed the software from
Having good backups and being able to identify where code came from and produce all your licenses will save you an awful lot of grief. Being honest is not good enough, you need to be able to prove it.


