Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Tools

Adding an Extension Language to Your Software


SEP91: ADDING AN EXTENSION LANGUAGE TO YOUR SOFTWARE

ADDING AN EXTENSION LANGUAGE TO YOUR SOFTWARE

The little language/application interface

Neville Franks

Neville is the owner of Soft As It Gets, a company specializing in the development of programmer's tools, including a programmer's editor that uses the extension language described in this article. Neville can be contacted at 3 Pullman Crt., East St. Kilda Victoria 3183, Australia. Phone +613-523-0557, fax +613-528-6836.


How often do you wish for a few more thousand bytes of code space, or that you could avoid yet another full edit/compile/link cycle of your complex application?

One way of solving problems such as these is by adding a general-purpose extension language to your software. An extension language is a programming language whose compiled executable code works without actually being linked into your software. In some ways, this is similar to Dynamic Link Libraries (DLLs) in operating systems such as OS/2. Because the linkage step is bypassed, program development is dramatically cut. Extensible languages also open up the possibility of end users writing their own programs to extend or modify your software--all you need to do is clearly document the functions in your application that are callable from the extension language, and provide your end users with the documentation and the extension language compiler. The end user does not need megabytes of disk space for a full-blown compiler, nor does he need to be concerned with the complexities of such an environment.

There are many areas where extension languages can be very useful. In a data entry application, for example, it is common to validate fields as they are entered. In some applications these validation rules can be quite complex and may even vary slightly from one customer to the next. If you write the rules using an extension language, they can be changed quickly and easily, even by the customer if necessary. You could even create a library of programs and select from it on-the-fly. Programmers demand flexibility and extensibility in their program editors. A well-designed extension language fits perfectly into an editor, and offers seamless program addition.

For several reasons, extension languages aren't suited to all programming areas. Firstly, they usually aren't full-blown languages, so some programming tasks may not be possible. Secondly, they are interpreted in one way or another, and are therefore not fast enough for many time critical areas. Finally, the executable code is typically loaded from disk on an as-needed basis; this also takes time.

This article describes the steps to add extension language capabilities to your software.

Choosing a Language

Like any other computer language, an extension language must be carefully designed. You need to consider what type of operations your extension language programs will perform: Are they going to be used in AI-style programs and be Prolog- or Lisp-based? Are they going to work with strings, bits, and bytes and be more like C or Pascal? Or are they so unique that they need a customized language?

There is a lot to be said for basing your language on one programmers are already familiar and comfortable with. The extension language used as an example in this article is a subset of C.

The Compiler

C is a fairly minimal language. Some people, in fact, even refer to it as a high-level assembler. This aside, C is a simple, yet powerful language widely used in commercial application development. C's minimality lends itself well to being an extension language. The task of compiler writing is simplified and the code needed for the runtime environment is not especially overwhelming.

The theory and design of simple C compilers -- in particular Small-C by Ron Cain and James Hendrix and its many derivatives -- has been covered in detail over the years. A compiler such as Small-C can be put into use as an extension language compiler by changing its code generator to one suited for an extension language environment. This is possible because a compiler performs the following sequence of events on a program: lexical analysis, parsing, code generation, and optimization. The main change needed for use with an extension language is in the area of code generation.

Instead of generating machine code for a particular CPU, you generate code for an intermediate language, a bit like the early Pascal P-code system. This intermediate language is then executed by a runtime machine or interpreter.

You could try to execute the extension language source code directly, but this has its limitations. In particular, the program might not be syntactically correct, would probably run slowly, and would most likely have a much larger runtime code overhead.

By compiling into an intermediate form, these problems disappear. Thorough checking is performed during normal lexical analysis and parsing, and the intermediate code is much more compact, so it will execute faster than the original program representation.

Executing Compiled Code

After the source code has been successfully compiled into its intermediate form, it can be executed using a runtime interpreter. Apart from code to load executable programs from disk, the interpreter is all you need to add to your application. All in all, you need less than 2 Kbytes of code space.

The runtime interpreter is implemented as a simple stack machine. The stack is used to hold function call stack frames, automatic variables, and intermediate values during expression evaluation. The stack is represented by an int array, and we assume that a pointer is the same size as an int; see Figure 1. Each time a function call is made, a new stack frame is added. A stack frame occupies four stack entries:

  • RA: the Return Address of the caller
  • SFL: the Stack Frame Link that links stack frames together
  • NA: the Number of Arguments passed to the called function
  • ES: the pointer to the environment's Expression Stack
The design of the intermediate language is fairly straightforward. You start by working out what sort of instructions are needed -- in other words, the machine's instruction set. If the instruction set mimics some of the higher-level extension language semantics, then the compiler's code generator will be easier to design and implement. Execution speed is an important consideration, and with careful design of the intermediate language, a reasonably fast interpreter can be constructed. The intermediate language instruction set for the C extension language falls into the broad categories listed in Table 1 . These instructions fit in well with C and are easy to relate during code generation. If you wanted the interpreter to work with different languages, this close cooperation would most likely be a hindrance and a more generic instruction set should be chosen instead.

Table 1: The intermediate language instruction set categories for a C extension language

  Load stack from memory
  Load memory from stack
  Load stack with memory address
  Load stack with immediate value

  Increment and decrement instructions

  Unconditional and Conditional Jumps

  Call and Return from function
  Call external function

  Logical instructions such as:
  Not, And, Or, Xor

  Relational Instructions such as:
  Gt, Lt, Eq, Ne

  Arithmetic instructions such as:
  Add, Sub, Mult, Div, Mod, Shift

  Bitwise instructions such as:
  And, Or, Xor

Stack Machine Operation

The stack machine is passed a pointer to a block of memory containing the compiled executable code. This block of memory is seen as a stream of bytes containing instructions and their operands. In A Little Smalltalk, Timothy Budd calls these "bytecodes." This book provides detailed information for a Smalltalk interpreter written in C and is recommended reading. Similarities also exist with the threaded code used in Forth interpreters.

To make the intermediate code as compact as possible, instructions vary in length from 1 byte for RET and SUB to 4 bytes for a CALL. Most instructions are either 1 or 3 bytes. Using varyingsized instructions complicates movement of the instruction pointer, but this is warranted, considering the space savings obtained.

The interpreter has four registers: an Instruction Pointer (IP), a Stack Pointer (SP), a Base Pointer (BP), and an Expression Stack pointer (ES).

The instruction pointer moves sequentially through the code until a Jump, Call, or Ret instruction is reached. After changing to its new location, this sequential process continues once more. When the instruction pointer finally gets back to its starting point, program execution ends. The other three registers are used as pointers into the stack.

The Base Pointer register points to the end of the current stack frame and is used to access variables local to the current function. It is also used by the Ret instruction. When a Ret instruction is reached, the sequence in Figure 2(a) occurs. For a Call, the sequence in Figure 2(b) takes place.

Figure 2: (a) When a Ret instruction is reached, this sequence occurs; (b) the sequence for a Call instruction.

  (a) sp = bp - size of stack frame;
      bp = sp[ stack frame link ];
      es = sp[ expression stack ];
      ip = sp[ return address ];
      sp = sp - sp[ argument count ];

  (b) sp[ return address ] = ip;
      sp[ stack frame link ] = bp;
      sp[ expression stack ] = es;
      sp[ argument count ] = arg_count;
      sp = sp + size of stack frame;
      bp = es = sp;
      ip = function address;

The argument count is determined at compile time and included as part of the Call instruction. Prior to performing a call, arguments to be passed to the called function are pushed onto the stack as part of the normal expression evaluation. The called function then has access to these arguments by using negative offsets from the base pointer, taking the stack frame size into account.

The last of the registers is the expression stack pointer which points to the base of the expression stack for the current function. The expression stack starts directly after the area reserved for local variables and is used to hold intermediate expression results; see Figure 3.

The interpreter executes the code by moving the instruction pointer through it, as just mentioned. Each instruction is decoded by case statements inside a big switch. The sample code in Figure 4 gives you the general idea. In the interest of simplicity, the code fragment assumes fixed length instructions. This example clearly shows the use of the ip, bp, and sp registers. The es register is not directly used by any instructions, except the Call and Ret instructions as shown earlier. Its primary purpose is for resetting the stack pointer back to the base of the expression stack once an expression has been completely evaluated.

Figure 4: The interpreter executes the code by moving the instruction pointer through it. Each instruction is decoded by case statements inside a big switch.

  switch( ip->instruction )
  {
      case LODI:        /* load immediate -> ++sp */
          *++sp = ip->immed_val;
          ++ip;
          break;

      case STOR :       /* store sp -> local variable */
          *(bp + ip->addr) = *sp;
          ++ip;
          break;

      case OPR_ADD:     /* add values on stack */
          --sp;
          *sp += *( sp + 1 );
          ++ip;
          break;

      case JPT :        /* conditional jump - true */
          if ( *sp == 0 )
          {
              ++ip;     /* next instr */
              break;
          }   /* note: fall through */

      case JMP :        /* unconditional jump */
          ip = ip->addr;
          break;

      case ....
  }

Apart from local or automatic variables, global variables are also supported. The executable program has a header that carries information such as the size of the global data area, as well as an image of the global data itself. Memory is allocated on the heap, and the global data is copied into it. The heap address is subsequently passed on to the interpreter. Global data includes string constants as well as globally declared variables.

During compilation, variables are determined to be either local or global and the appropriate instructions are generated. At runtime, global variables are accessed much like local variables by using an offset from the base of the global data area, instead of the base pointer register. The offset is encoded into the instruction at compile time.

Calling External Functions (Accessing Application Program Functions

An important feature is the ability of the extension language to call not just functions within the program itself, but external functions residing in the host application program. This gives extension language programs access to each and every function in an application program.

This capability is accomplished by providing both the extension language compiler and the interpreter with access to the application's symbol table. The symbol table is produced during linking and contains the name of every function, along with its address. In Microsoft-compatible languages, the symbol table is in a file with the extension .MAP.

During compilation, the symbol table is used to resolve the addresses of functions that are not in the program itself or have not yet been encountered (forward references). The symbol table (map file) is scanned, and if the function name is found, its address is returned. The compiler then generates the code for an external function call based on this address. The compiler also checks for name clashes between external functions and functions local to the program. If the same name is found in both areas an error is produced.

If you wish to restrict the range of functions accessible by extension language programs, simply strip down the .MAP file accordingly.

During program execution, the interpreter handles these external calls by pushing the parameters from the interpreter stack onto the application program's stack. It then does an indirect function call using the address obtained during compilation as a function pointer.

When the external function returns, its return value is placed on the interpreter stack for use as required.

Programs Can Call Each Other

The interpreter not only allows the use of recursion within a program, but also allows interpreter programs to call other interpreter programs. A program simply calls an external routine which loads the new program from disk, and then calls the interpreter all over again to start it running. This results in nested invocations of the interpreter. When the current program finishes executing, the interpreter returns to the loader routine, which in turn returns to the previous invocation of the interpreter; see Figure 5.

Figure 5: When the current program finishes executing, the interpreter returns to the loader routine, which in turn returns to the previous invocation of the interpreter.

  interpreter(code, global_data, stack)
  |
  -->external call to program loader
     |
     -->interpreter(code, global_data, stack)

Parameters can be passed to the nested interpreter, just as they can be passed to the original invocation. Furthermore, they can be passed either by value or by reference. Passing by reference gives the new program access to variables on the caller's stack, or the global data area.

Calling Extension Language Programs

As described here, extension language programs have a single entry point similar to main() in normal C programs. There is a difference, however, in that you can pass a variable number and type of arguments to a program.

The entry point is named main(), and it is up to the caller and callee to make sure the arguments match in type and number. Included in the runtime system is a program loader, which loads a specified program from disk and then calls the interpreter to run it. The C function prototype for the loader is: pgm_run (char *prog_name, int num_parms, void *parms[]); where prog_name is the name of the compiled program to run; num_parms is the number of parameters being passed to it; and parms[] is a combination of pointers to parameters and possibly parameters themselves.

Let's say you have an extension language program that adds up a set of numbers, and you want to run it from your application. The application code would look like the code in Figure 6(a), while the the extension language program would look like Figure 6(b). This simple example demonstrates parameters being passed by reference and by value.

Figure 6: (a) Sample application code for running an extension language program; (b) sample extension language program executed from an application program.

  (a) add_up_numbers()
      {
      void *parms[4];
      int result;

          parms[0] = &result; /* result passed by reference */
              /* numbers passed by value */
          parms[1] = 23; parms[2] = 69; parms[3] = -78;
          pgm_run( "add_num", 4, parms );
          printf( "Result = %d", result );
      }

  (b) main( result, num1, num2, num3 )
      int *result, num1, num2, num3;
      {
          *result = num1 + num2 + num3;
      }

The num_parms parameter is used by the interpreter to push the correct number of arguments from the parms array onto the interpreter stack before calling main().

The interpreter has a function that allows any extension language function, including main(), to find out exactly how many parameters it was passed. This is particularly useful in main(), allowing different parts of the application to call the same extension language program with different numbers of arguments.

The example above is simple in the extreme; in practice, considerably more complex programs are used. It is not uncommon to write programs in excess of 1000 lines of code.

What's Missing?

So far, the compiler supports a fairly complete subset of the C language. Elements that are missing include structures, unions, bit fields, typedef, casts, enum, extern, static, unsigned types, and data types float, double, and long. For an extension language, I don't think their loss incurs any great deficiencies. Several of these could be added if necessary, but of course they would make the compiler and interpreter larger and more complex. One of the aims of this project was to keep it as simple as possible, yet provide enough) power to handle most tasks.

Possible Extensions

It is possible to virtualize the stack machine so programs run in virtual memory. In a virtual memory environment, code is paged in and out of real memory as necessary, enabling quite large and complex programs to be executed. Virtual memory, along with a least recently used caching system, allows many programs to be loaded concurrently and call each other with minimum overhead.

References

Aho, Alfred V. and Jeffrey D. Ullman. Principles of Compiler Design. First edition. Reading, Mass.: Addison-Wesley, 1977.

Betz, David. "Embedded Languages." Byte (November, 1988).

Bornat, Richard. Understanding and Writing Compilers. Indianapolis, Ind.: Macmillan, 1985.

Brinch Hansen, Per. Brinch Hansen on Pascal Compiler. Englewood Cliffs, N.J.: Prentice Hall, 1985.

Budd, Timothy. A Little Smalltalk. Reading, Mass.: Addison-Wesley, 1987.

Darnell, Peter A. and Philip E. Margolis. Software Engineering In C. New York, N.Y.: Springer-Verlag, 1988.

Hendrix, James E. The Small-C Handbook. Englewood Cliffs, N.J.: Prentice Hall, 1984.

Schildt, Herb. "Building Your Own C Interpreter." Dr. Dobb's Journal (August, 1989).

Tremblay, J. and P.G. Sorenson. The Theory and Practice of Compiler Writing. New York, N.Y.: McGraw-Hill, 1985.

Wirth, Niklaus. Algorithms + Data Structure=Programs. Englewood Cliffs, N.J.: Prentice Hall, 1976.


Copyright © 1991, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.