
Al Williams

Dr. Dobb's Bloggers

Irregular Expressions

July 01, 2013

I've been a big fan of UNIX and, now, of Linux. There is a certain logic to how things go together that made sense to me from the first time I saw it. That's why I'm glad to see more and more embedded systems go to some flavor of Linux.


However, I've also noticed a dark side to the wider acceptance of Linux in general. Some of the underlying philosophy has been lost or, if not lost, then at least diluted. Part of this is just due to the nature of GUI systems. Part of it is probably ideas leaking over as people transition from other operating systems.

What kind of philosophy am I talking about? It seems to me that Linux has been moving away from the idea of small modular tools that can be tied together easily. There is also a trend towards more opaque configuration in some newer software. Granted, there are things like D-Bus that try to fill those gaps, but they don't seem to be as widely used or understood as the classic mechanisms of pipes and text configuration files.

If you think about it, classic UNIX (and systems like Linux) had several key tenets: Everything looks like a file; programs operate on their standard input and output; there is a reasonably standard syntax for things like options, globbing, and regular expressions.

Regular expressions, of course, are not specific to UNIX. However, UNIX always embraced them with tools like grep and awk. Because the support is built into the standard library, I've written a lot of regular expression code that runs under UNIX or Linux. Today, there are plenty of similar libraries for other platforms as well.

If you aren't familiar with regular expressions, they are simple text strings that define patterns that can be matched in other strings. They can range from something simple like:

xy?z

which would match the string xyz or xz, to something very complex like this date validation expression from RegExpLib.com:

^((0?[13578]|10|12)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[01]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1}))|(0?[2469]|11)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[0]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1})))$
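To make the simpler of the two patterns above concrete, here is a minimal sketch that tests xy?z with the C++11 std::regex library. The function name matches_xyz is made up for this illustration, and std::regex uses ECMAScript syntax by default, which is only one of several regular expression flavors.

```cpp
#include <regex>
#include <string>

// Returns true when the whole of text matches the pattern xy?z,
// that is, an x, an optional y, and then a z.
// (matches_xyz is a hypothetical name used only for this example.)
bool matches_xyz(const std::string& text) {
    static const std::regex pattern("xy?z");
    // regex_match requires the entire string to match, not just a part of it.
    return std::regex_match(text, pattern);
}
```

Calling matches_xyz("xyz") or matches_xyz("xz") returns true, while strings such as "xyyz" or "xy" are rejected.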

You might be thinking: Regular expressions for an embedded system? Why not? One system I was especially fond of used regular expressions to parse through input data from an external device sent via a USB serial port. Instead of hard coding the particular types of input records, a regular expression matched the record and extracted the data from the fields. If the input formats changed (and they did), it was a simple matter to edit the file that contained the regular expressions and alter the system behavior without so much as a recompile.
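As a rough sketch of that idea, the code below matches one input record against a pattern and pulls the fields out of the capture groups. The record layout ("TEMP,<sensor id>,<reading>"), the Reading struct, and the parse_record function are all assumptions invented for this example; they are not taken from the system described above. Because the pattern is passed in as a std::regex, it could be built from text read out of a configuration file at startup, so changing the record format would not require a recompile.

```cpp
#include <regex>
#include <string>

// Hypothetical record contents, for illustration only.
struct Reading {
    int sensor_id;
    double value;
};

// Matches one line against the supplied pattern. The pattern is expected to
// have two capture groups: the sensor id and the reading.
// Returns false (and leaves out untouched) when the line does not match.
// Note: std::stoi throws if the captured id is too large for an int.
bool parse_record(const std::string& line, const std::regex& pattern,
                  Reading& out) {
    std::smatch fields;
    if (!std::regex_match(line, fields, pattern) || fields.size() < 3) {
        return false;
    }
    out.sensor_id = std::stoi(fields[1].str());
    out.value = std::stod(fields[2].str());
    return true;
}
```

With a pattern such as TEMP,(\d+),(-?\d+(?:\.\d+)?), the line "TEMP,7,21.5" would yield sensor 7 and a reading of 21.5, and a line in any other format would simply be reported as a non-match.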

This leads to another problem with wider Linux adoption, though. If you've been doing UNIX for the last 25 years, you are probably a regular expression wizard. If you have only been using Linux for the last week, trying to get a Raspberry Pi or a Beagle Board to do something, you might not be ready to tackle some of the very hairy regular expressions you might need to create (like the date validation expression above).

One thing I've learned is that sometimes the most effective tools are ones you personally wouldn't use. A friend of mine called me last week wanting help crafting a regular expression, and I realized that while I'm used to the terse nature of regular expressions, that terseness is probably the main obstacle to getting them right for people who haven't used them much. Yet those same people can handle a much more complicated programming language. So why can't regular expressions be more like a programming language?

I fired up emacs and started writing some pretty straightforward code (you can download it here). The idea is very similar to my universal cross assembler. The tool provides some simple functions that can be used to build regular expressions using a more verbose programming language-like syntax.

For example, this input:

start + space + zero_or_more + any_of("ABC") + literal(":") +
group(digit + one_or_more)

results in this regular expression:

^\s*[ABC]:(\d+)

Like the universal cross assembler, all the real work is being done by the C compiler (well, in this case the C++ compiler, g++). The input gets inserted into a C++ program, compiled, and the output of the program is the regular expression. You can then take the expression and use it anywhere you need it.

The whole compile and run process is handled by a shell script (recompile), which is actually more complicated than the C++ program. The preprocessor (which everyone seems to hate these days) allows you to write things like start and have it turn into a function call.
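The general technique can be sketched in a few lines. What follows is not the contents of the real recomp.h: every function name, macro body, and escaping rule here is an assumption made for illustration. Each helper returns a std::string fragment, the macros let bare words such as start stand in for function calls, and ordinary string concatenation with + does the rest.

```cpp
#include <string>

// Hypothetical helpers; the real recomp.h may look quite different.

// Backslash-escape characters that are special outside a bracket expression.
inline std::string escape(const std::string& text) {
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : text) {
        if (special.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Backslash-escape characters that are special inside a bracket expression.
inline std::string classescape(const std::string& text) {
    std::string out;
    for (char c : text) {
        if (c == '\\' || c == ']' || c == '^' || c == '-') out += '\\';
        out += c;
    }
    return out;
}

inline std::string re_start()        { return "^"; }
inline std::string re_space()        { return "\\s"; }
inline std::string re_digit()        { return "\\d"; }
inline std::string re_zero_or_more() { return "*"; }
inline std::string re_one_or_more()  { return "+"; }

inline std::string any_of(const std::string& chars) {
    return "[" + classescape(chars) + "]";
}
inline std::string literal(const std::string& text) { return escape(text); }
inline std::string group(const std::string& inner)  { return "(" + inner + ")"; }

// The preprocessor turns the bare words into function calls.
#define start        re_start()
#define space        re_space()
#define digit        re_digit()
#define zero_or_more re_zero_or_more()
#define one_or_more  re_one_or_more()

// The user's input from the article, pasted in as the body of a function.
std::string build_example() {
    return start + space + zero_or_more + any_of("ABC") + literal(":") +
           group(digit + one_or_more);
}

// Undefine the short names so they cannot clobber any later code.
#undef start
#undef space
#undef digit
#undef zero_or_more
#undef one_or_more
```

Printing the result of build_example() reproduces the expression shown earlier. The macros are undefined again right after use because short lowercase names like space and digit can easily collide with identifiers elsewhere in a program.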

You can skim through recomp.h to see all the syntax available. Keep in mind that there are a few flavors of regular expressions, so your mileage might vary on some items. You might also need to change the escape and classescape functions to suit your target environment. Long term, I probably should add command-line options to handle the different regular expression target environments, but for what I needed, this did the job.
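To give a feel for what targeting different flavors might involve, here is a small sketch. The Flavor enum and the function names are hypothetical and are not part of the downloadable tool. It relies on one well-known difference: Perl-style engines accept \d and \s, while POSIX extended regular expressions spell the same classes as [[:digit:]] and [[:space:]].

```cpp
#include <string>

// Hypothetical flavor selector; not an option of the actual tool.
enum class Flavor { Perl, PosixExtended };

// Shorthand classes differ between flavors, so pick the spelling per target.
std::string digit_class(Flavor f) {
    return f == Flavor::Perl ? "\\d" : "[[:digit:]]";
}

std::string space_class(Flavor f) {
    return f == Flavor::Perl ? "\\s" : "[[:space:]]";
}

// Builds the article's example expression for the chosen flavor.
std::string build_for(Flavor f) {
    return "^" + space_class(f) + "*[ABC]:(" + digit_class(f) + "+)";
}
```

A command-line option could then select the Flavor value, leaving the user's verbose input unchanged.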

Although this tool might be useful in or out of embedded systems development, I think it really highlights the vastly underused technique of using tools like the C or C++ compiler to do work for you. This is another part of the UNIX philosophy of stringing together tools to get a desired result. Hopefully, as Linux continues to grow in embedded systems, we won't lose sight of that original ideology.
