A Universal Cross Assembler


Al Williams straddles the hardware and software line and has authored books ranging from MFC Black Book to Build Your Own Printed Circuit Board. He can be contacted at al.williams@awce.com.


If you read my article The One Instruction Wonder, you know that I like to build my own CPUs. While that's great (if geeky) fun, it also means you have to build your own tools. And while building CPUs gives you some geek street cred, writing another assembler doesn't really do it.

For a while I'd hack together simple assemblers using a crude C program or maybe awk. But eventually I got tired and decided to write one last cross assembler. (The complete source code and related files are available here). I made a few observations:

  • My PC has way more memory than any of my target computers
  • My PC is ridiculously fast so optimizing the assembler too much is pointless
  • CPUs change (at least they do when you build them yourself) so the assembler has to change easily
  • I'm basically lazy

That last point is key. Being lazy means you borrow the best of what you can find to help you avoid doing real work. My previous experience told me that I liked using C or C++ for its output options, but I really liked using awk for its easy string matching. Sure, I can add string matching libraries to C, but then if I were that motivated, I wouldn't be lazy. Or, if you prefer, then I'd have to spend less time working on my CPU and more time working on my assembler.

After thinking about the problem for a few days, I realized one more thing. Pretty much every assembler I've ever seen accepts input like this:


somelabel:   opcode   arg1,arg2   ; comment

Sure, there are some that use slightly different syntax, but that pretty much sums up about 99% of basic assembler syntax. Forget labels for a minute and dump the comment. What if I could get my assembler to look like a C macro? Like this:


opcode(arg1,arg2);

I could easily write some C macros that would fill in an array with the right bit values for each instruction. Remember, I said my PC has way too much memory, so the assembler just assembles into an array image of the target's memory. Then after it's all done, I just have to dump the array out in my format of choice. Simple.

What about the labels? Well that's a little more complicated. Labels are a special case, as you'll see shortly.

The resulting assembler uses four files:

  • soloinc.awk processes "include" files for the assembler
  • solopre.awk converts assembler lines into C macros
  • soloasm.c, the core routines that all assemblers use
  • axasm, a shell script driver that ties it all together

The preprocessor (soloinc.awk; see Listing 1) is simple enough. It just copies files from its input to its output unless it sees a ##include token at the start of a line. When that happens, the program pushes the current file onto a stack and starts printing out the included file (expanding any ##include tokens it finds in the included file, of course). The preprocessor also emits #line directives to the C compiler to help the C compiler identify any errors at the correct location.


#!/usr/bin/gawk -f
#/**********************************************************************
#axasm Copyright 2006, 2007, 2008, 2009 
#by Al Williams (alw@al-williams.com).
#
#
#This file is part of axasm.
#
#axasm is free software: you can redistribute it and/or modify it
#under the terms of the GNU General Public Licenses as published
#by the Free Software Foundation, either version 3 of the License, or
#(at your option) any later version.
#
#axasm is distributed in the hope that it will be useful, but
#WITHOUT ANY WARRANTY: without even the implied warranty of 
#MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 
#GNU General Public License for more details.
#
#You should have received a copy of the GNU General Public License
#along with axasm (see LICENSE.TXT). 
#If not, see http://www.gnu.org/licenses/.
#
#If a non-GPL license is desired, contact the author.
#
#This is the assembler include file expander
#
#***********************************************************************/
# find the path
function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file

    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            # found it
            close(t)
            return t
        }
    }
    return ""
}

BEGIN {
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
    }

# keep a stack of files
    stackptr = 0
    oldsp = -1
    input[stackptr] = ARGV[1]   # ARGV[1] is first file
    linect[stackptr] = 1        # number of the next line to read
    for (; stackptr >= 0; stackptr--) {
        # emit a #line directive whenever we resume a different file
        if (oldsp != stackptr) {
            print "#line " linect[stackptr] " \"" input[stackptr] "\""
            oldsp = stackptr
        }
# copy file while handling includes
        while ((getline < input[stackptr]) > 0) {
            # "getline < file" does not update FNR, so count lines here
            linect[stackptr]++
            if (tolower($1) != "##include") {
                print
                continue
            }
            fpath = pathto($2)
            if (fpath == "") {
                printf("include:%s:%d: cannot find %s\n", \
                    input[stackptr], linect[stackptr] - 1, $2) > "/dev/stderr"
                continue
            }

            processed[fpath] = input[stackptr]
            input[++stackptr] = fpath
            linect[stackptr] = 1
            # tell the C compiler we are now inside the included file
            print "#line 1 \"" fpath "\""
            oldsp = stackptr
        }
        close(input[stackptr])
    }
}

Listing 1: soloinc.awk

The real heart of the assembler is in solopre.awk (Listing 2). This program spits out some boilerplate C code first. Then it converts the assembly language into C macro form.


#/**********************************************************************
#axasm Copyright 2006, 2007, 2008, 2009 
#by Al Williams (alw@al-williams.com).
#
#
#This file is part of axasm.
#
#axasm is free software: you can redistribute it and/or modify it
#under the terms of the GNU General Public Licenses as published
#by the Free Software Foundation, either version 3 of the License, or
#(at your option) any later version.
#
#axasm is distributed in the hope that it will be useful, but
#WITHOUT ANY WARRANTY: without even the implied warranty of 
#MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 
#GNU General Public License for more details.
#
#You should have received a copy of the GNU General Public License
#along with axasm (see LICENSE.TXT). 
#If not, see http://www.gnu.org/licenses/.
#
#If a non-GPL license is desired, contact the author.
#
#This is the assembler preprocessor
#
#***********************************************************************/
# expect -v LFILE=xxx argument (output of label definitions)
# allow -v PROC=xxx  (processor type)

# Note there are a few things the target macros must support
# LABEL, DEFLABEL - Supports the label system
#  DATA, DATA4  - Supports STRING and STRINGPACK
# Granted STRING and STRINGPACK ought to be removable somehow

BEGIN {
  if (LFILE!="") {
    if (PROC=="") {
      print "#include <soloasm.inc>";  # default inc file
    } else {
      print "#include <" PROC ".inc>";
    }
    print "#include \""  LFILE  "\""
    print "" > LFILE
  }
}


# one time init
{
  if (first != 1) { first=1;   print "#line 1 \"" FILENAME "\""; }
}


# pass through any line directives
/^#line / { print; next; }


# pass through C code and C preprocessor
/^[ \t]*##/  {sub("^[ \t]*##","#");  print; next; }
/^[ \t]*#/  { sub("^[ \t]*#",""); print; next; }

   {
# This won't disturb semicolons if there is a quote
# directly after it. This could lead to trouble with
# semicolons in quoted strings, for example
# so we save just in case and string handling gets all of it

     withsemi=$0
     sub(";[^'\"].*$","");  # remove asm comments
     op=1;
   }

# deal with labels
/^[ \t]*[^ \t,]+:/   { 
     label=$1;
     sub(/:$/,"",label);
     print "DEFLABEL("  label  ");" >>LFILE;
     printf "LABEL("  label  "); ";
     $1="";
     op=2;
   }

# blank lines (maybe it used to have just a label)
/^[ \t]*$/ { print; next; } 
   {

# note: the below means your .h file that defines the processor
# must use uppercase names in macros but you are free to
# use mixed case in the assembly
# probably should make this an option somehow
     mac=toupper($op);
     $op="";
# unpacked string
     if (mac=="STRING") {
	 $op="";
	 strng=withsemi 
	 first=0;
# scan each letter. Note first quote, copy until 2nd quote
	 for (i=1;i<length(strng);i++) {
	     if (substr(strng,i,1)=="\"") {
		 if (first==0) { first=1;  continue; }
		 break;
	     }
	     if (!first) continue;
	     v=substr(strng,i,1);
#	     if (v=="\\") { v=substr(strng,i,2); i++; }
# handle \xNN \DDD or \C
	     if (v=="\\") {
		 v1=substr(strng,i+1,1);
		 if (v1=="x"||v1=="X") {
		     v="\\x"
		     i+=2;
		     v=v substr(strng,i,1)
		     v1=substr(strng,i+1,1)
		     if ((v1>="0"&&v1<="9")||(tolower(v1)>="a"&&tolower(v1)<="f")) {
			     v=v v1
			     i++
		     }
		 }
		 else if (v1>="0" && v1<="7") { 

		     while (v1>="0" && v1<="7") {
			 v=v v1;
			 i++;
			 v1=substr(strng,i+1,1);
		     }
		 } else {
		     v=substr(strng,i,2); 
		     i++;
		 }
		 
		 
	     }
	     print "\tDATA('" v "');"

	 }
	 next;
     }
# packed string. Same logic as STRING, but four characters per DATA4
     if (mac=="STRINGPACK") {
	 $op="";
	 strng=withsemi #  $0;
	 first=0;
	 last=0;
	 for (i=1;i<length(strng);) {
	     if (substr(strng,i,1)=="\"") {
		 i++;
		 if (first==0) { first=1;  continue; }
		 break;
	     }
	     if (!first) { i++; continue; }
	     printf "\tDATA4("
	     k=0;   # characters consumed by this group of four
	     for (j=0;j<4;j++) {
		 v=substr(strng,i+k++,1);   # i+k now indexes the next character
# handle \xNN \DDD or \C
		 if (v=="\\") {
		     v1=substr(strng,i+k,1);
		     if (v1=="x"||v1=="X") {
			 v="\\x" substr(strng,i+k+1,1)
			 k+=2;
			 v1=substr(strng,i+k,1)
			 if ((v1>="0"&&v1<="9")||(tolower(v1)>="a"&&tolower(v1)<="f")) {
			     v=v v1
			     k++
			 }
		     }
		     else if (v1>="0" && v1<="7") {
			 while (v1>="0" && v1<="7") {
			     v=v v1;
			     k++;
			     v1=substr(strng,i+k,1);
			 }
		     } else {
			 v=v v1;
			 k++;
		     }
		 }
# pad the rest of the word with zeros once the closing quote is seen
		 if (v=="\"") last=1;
		 if (last) v="\\000";
		 printf("'" v "'")
		 if (j!=3) printf(",");
	     }
	     print ");"
	     if (last) break;   # string is finished; ignore the rest of the line
	     i+=k;
	 }
	 next;
     }
# just some generic monadic macro or one with arguments
     if ($(op+1)=="") print(mac ";"); else  print(mac  "("  $0   ");");
   }

Listing 2: solopre.awk

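The preamble is a short run of #include lines:

  • The first includes soloasm.h (common definitions for all assemblers).
  • The second line includes a .inc file that is specific to the target processor.
  • The final line of the preamble includes the label definition file (passed in on the command line as an argument).

Note that the BEGIN block in Listing 2 prints only the last two of these itself (soloasm.inc is the default when no processor is named), so in that version of the script the common definitions presumably arrive by way of the .inc file.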

The solopre.awk program generates this label definition file as well (although it hasn't created it at the time the preamble is written). However, the file will exist by the time the compiler reads the output. So this is a handy way to let the awk script output information about labels throughout the processing and then have the C compiler read it all up front. The alternative would have been to make multiple passes through the source: one to collect label information and a second pass to do output. Using the includes is simpler and, well, lazier.

When the awk script encounters a label, it does two things. First it outputs a DEFLABEL macro to the label definition file. Then it writes a LABEL macro to the output file. This means the assembler macros can "know" about forward-referenced labels, since every label is defined up front with a DEFLABEL macro (remember, the compiler reads the label definition file as part of the preamble, before any user code appears). This does require a compiler (like gcc) that supports variable declarations that don't appear at the start of a block.

As the program converts assembly to macros, it also converts the opcode to uppercase. That means you can ignore case when writing assembler code, but the macros that define opcodes in the target's .inc file must use uppercase names. The script also takes special note of the STRING and STRINGPACK pseudo operations. These create DATA (or DATA4) statements with the characters of a string either byte by byte (STRING) or with four characters packed into a 32-bit word (STRINGPACK). Obviously, if you want to use these, your assembler definition will need to handle DATA and DATA4.

When the awk script completes, the shell script driver compiles the resulting C program along with soloasm.c and executes it. The soloasm.c file contains a main() and some code to output in different formats. So how does the assembly actually occur?

