Text Editors: Algorithms and Architectures

To maintain the view of text as an uninterrupted stream of characters, you can use a number of data structures, such as linked-list structures, the buffer-gap structure, and virtual-memory blocks.

Plain ASCII text is fine for editing program source code, but other uses require additional attributes to be associated with the text stream. These attributes control the way text is formatted or presented on the screen: typeface, point size, justification, and so on.

One implementation choice is to embed these presentation attributes into the text stream, thereby mixing formatting commands with text content. This approach was used in many older-generation typesetting machines that predate WYSIWYG. In general, it is also the approach that the next version of GnuEmacs will use. Version 19 of GnuEmacs, to be released later this year, will view a file as a stream of first-class Lisp objects that can represent text characters, formatting commands, events such as mouse-clicks, or any arbitrary Lisp function.
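As a rough illustration of the embedded approach, the following C sketch models the stream as an array of tagged items, some carrying characters and others carrying formatting commands. The names StreamItem, ItemKind, and extract_text are hypothetical and are not taken from GnuEmacs or any typesetting system; the point is only that every consumer of the text must step over the commands mixed into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical item kinds: one for text, two for formatting commands. */
typedef enum { ITEM_CHAR, ITEM_SET_FONT, ITEM_SET_SIZE } ItemKind;

typedef struct {
    ItemKind kind;
    int      value;   /* character code, font id, or point size */
} StreamItem;

/* Copy only the text characters into out and NUL-terminate it.
   out must hold at least n + 1 bytes. Returns the number of characters. */
size_t extract_text(const StreamItem *items, size_t n, char *out)
{
    size_t len = 0;
    assert(items != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        if (items[i].kind == ITEM_CHAR)
            out[len++] = (char) items[i].value;
    out[len] = '\0';
    return len;
}
```

Searching, counting characters, or moving the cursor all need the same filtering step, which is the main cost of mixing commands with content.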

A different way to deal with text attributes is to maintain parallel streams--one of content, the other of presentation attributes related to content. This approach is used in Microsoft Word. Likewise, the Macintosh environment provides TextEdit, an equivalent to the Windows Edit control, which allows for "runs" or sequences of text styles, a "text style" being defined as a particular combination of font and point-size attributes. Text styles are implemented as an array of style records that point to locations in the text stream where a particular "run" begins. There's an elaborate figure on page 15-38 of Inside Macintosh Volume VI (Addison-Wesley, 1992) that shows how style records are maintained in relation to the text. In System 7.1, the WorldScript facility generalizes the notion of runs even further.
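The idea of style runs kept in a parallel array can be sketched in a few lines of C. This is a minimal illustration only: the StyleRun record and the style_run_at function are hypothetical and do not reproduce the actual TextEdit data layout. The sketch assumes the runs are sorted by starting offset and that the first run begins at offset 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical style record; not the real TextEdit layout. */
typedef struct {
    size_t start;       /* offset in the text stream where this run begins */
    int    font_id;     /* illustrative font identifier */
    int    point_size;  /* illustrative point size */
} StyleRun;

/* Return the index of the run that covers text offset pos.
   Assumes runs are sorted by start, runs[0].start == 0, and count > 0. */
size_t style_run_at(const StyleRun *runs, size_t count, size_t pos)
{
    size_t lo = 0, hi = count;           /* answer lies in [lo, hi) */
    assert(count > 0 && runs[0].start == 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (runs[mid].start <= pos) lo = mid;
        else                        hi = mid;
    }
    return lo;
}
```

Because the text itself stays a plain character array, searching and cursor movement are unaffected by styling; the price is that every insertion or deletion must also adjust the start offsets of the runs that follow it.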

As you step up from ASCII text editors and simple word-processors to more sophisticated document-processing and desktop-publishing programs, text can no longer be regarded as a one-dimensional array of characters (even when associated with a parallel stream of text attributes). In document processors such as FrameMaker, Interleaf, or Ventura Publisher, the text content has its own elaborate structure--words, sentences, paragraphs, subsections, chapters, appendices, and volumes. This is known as a "semantic" or "logical" structure, in contrast to the geometrical or visual structure of the presentation. Document-processing programs have the most complex task of all graphical editors: maintaining a consistent mapping between two tree-like structures, the semantic hierarchy of text content and the geometrical hierarchy of pages, columns, frames, and lines. Dealing with this complexity in an interactive, optimized manner is what leads to million-line-plus programs.

The Machine Representation of Text

In the EditLine() example, text is represented in the machine as a single array of ASCII-encoded bytes. Inserting or deleting a character merely requires calling movmem() to shift every byte after the edit position by one memory location. Depending on the CPU, this brute-force method can work for even medium-to-large amounts of text. At some point, of course, this profligate expenditure of machine cycles becomes unworkable, and the "buffer-gap" approach comes into play.
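The brute-force method can be sketched as follows. This is an illustration and not the EditLine() code itself: it substitutes the standard C memmove() for the compiler-specific movmem(), and the names flat_insert and flat_delete are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Insert ch at index pos in text[0..*len), shifting the tail right by one.
   Returns 0 on success, -1 if the array is already full. */
int flat_insert(char *text, size_t *len, size_t cap, size_t pos, char ch)
{
    assert(pos <= *len);
    if (*len >= cap) return -1;
    memmove(text + pos + 1, text + pos, *len - pos);  /* open a one-byte hole */
    text[pos] = ch;
    (*len)++;
    return 0;
}

/* Delete the character at index pos, shifting the tail left by one. */
void flat_delete(char *text, size_t *len, size_t pos)
{
    assert(pos < *len);
    memmove(text + pos, text + pos + 1, *len - pos - 1);  /* close the hole */
    (*len)--;
}
```

Each keystroke costs a move proportional to the amount of text after the cursor, so typing near the start of a large file is the worst case.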

The buffer-gap approach divides the single array of characters into two parts, separated by a movable gap. The gap is an internal construct, not visible to the user. From the user's point of view, text remains in an unbroken stream. As the user navigates over the text, moving the cursor from one character to the next, the system updates the corresponding pointer in the text data structure, skipping over the buffer gap as necessary. When the user types a run of text, the system shifts the gap over to the point of insertion, then shrinks the gap by one character for each keystroke. This method avoids most of the shuffling and reshuffling of text required by the earlier approach.
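A minimal sketch of the idea, assuming a fixed-capacity buffer and index-based bookkeeping, might look like this. The GapBuf type and the gb_ functions are hypothetical names chosen for illustration; they are simpler than, and distinct from, the pointer-based module excerpted in Listing Two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GB_CAP 64   /* fixed capacity keeps the sketch short */

typedef struct {
    char   data[GB_CAP];
    size_t gap_start;   /* storage index of the first gap byte */
    size_t gap_end;     /* storage index of the first text byte after the gap */
} GapBuf;

void gb_init(GapBuf *g) { g->gap_start = 0; g->gap_end = GB_CAP; }

size_t gb_length(const GapBuf *g) { return GB_CAP - (g->gap_end - g->gap_start); }

/* Move the gap so that it begins at user position pos. */
void gb_move_gap(GapBuf *g, size_t pos)
{
    assert(pos <= gb_length(g));
    if (pos < g->gap_start) {            /* slide text rightward across the gap */
        size_t n = g->gap_start - pos;
        memmove(g->data + g->gap_end - n, g->data + pos, n);
        g->gap_start -= n;
        g->gap_end   -= n;
    } else if (pos > g->gap_start) {     /* slide text leftward across the gap */
        size_t n = pos - g->gap_start;
        memmove(g->data + g->gap_start, g->data + g->gap_end, n);
        g->gap_start += n;
        g->gap_end   += n;
    }
}

/* Insert one character at user position pos; returns -1 if the buffer is full. */
int gb_insert(GapBuf *g, size_t pos, char ch)
{
    if (g->gap_start == g->gap_end) return -1;
    gb_move_gap(g, pos);
    g->data[g->gap_start++] = ch;        /* the gap shrinks by one */
    return 0;
}

/* Copy the text, without the gap, into out (needs gb_length(g) + 1 bytes). */
void gb_contents(const GapBuf *g, char *out)
{
    size_t tail = GB_CAP - g->gap_end;
    memcpy(out, g->data, g->gap_start);
    memcpy(out + g->gap_start, g->data + g->gap_end, tail);
    out[g->gap_start + tail] = '\0';
}
```

Consecutive insertions at the same spot cost no moves at all after the first, since the gap is already in place; only a jump to a distant position triggers a block move.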

Implementing a buffer-gap manager is not difficult, but requires attention to detail to avoid fencepost errors. As Finseth points out, three coordinate systems are in play at the same time. In the user coordinate system, location 0 corresponds to the position before the first character of the text. Note that coordinates label the positions between the characters, rather than the characters themselves. This is similar to a 2-D graphical coordinate system, such as that used by QuickDraw or Windows GDI, which labels the positions between pixels rather than the pixels themselves.

Second, there's the buffer-gap coordinate system, which is the same as the user coordinate system, except that the continuum is broken up by the variable-length gap. The third system is the storage coordinate system, which labels the memory locations where characters are stored (rather than the positions between them) and is the one used by pointers to memory. If you don't scrupulously maintain the distinction between these three coordinate systems, you'll be plagued by an ongoing cascade of fencepost errors. Fortunately, the code available electronically contains buf_gap.c, a module that implements all the basic functions for managing a buffer gap--inserting and deleting text, moving the gap, expanding the gap, searching the buffer for a particular string, and so on. Excerpts from buf_gap.c are shown in Listing Two. This code is heavily based on an example posted by Joseph Allen to the Internet on September 10, 1989. The module is not a stand-alone program, but assumes other modules for input, command dispatching, redisplay, memory allocation, and screen output.
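The relationship between user positions and storage locations described above can be made concrete with a pair of conversion functions. This is a sketch under assumptions: the GapView structure and both function names are hypothetical, indices stand in for the pointers that a real module would use, and a user position that coincides with the start of the gap is mapped to the first character after the gap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index-based view of a gap buffer's layout. */
typedef struct {
    size_t gap_start;  /* storage index of the first gap byte */
    size_t gap_end;    /* storage index of the first text byte after the gap */
    size_t buf_end;    /* storage index one past the last text byte */
} GapView;

/* User position (a point between characters) -> storage index of the
   character that follows that position. */
size_t user_to_storage(const GapView *v, size_t user_pos)
{
    size_t gap = v->gap_end - v->gap_start;
    assert(user_pos <= v->buf_end - gap);      /* at most the text length */
    return (user_pos < v->gap_start) ? user_pos : user_pos + gap;
}

/* Storage index of a text character -> user position just before it. */
size_t storage_to_user(const GapView *v, size_t index)
{
    assert(index < v->gap_start || index >= v->gap_end);  /* not inside the gap */
    return (index < v->gap_start) ? index : index - (v->gap_end - v->gap_start);
}
```

Keeping every conversion inside two small functions like these is one way to confine the fencepost arithmetic to a single place.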

Listing Two

Excerpts from BUF_GAP.C--buffer gap manager module. Derived by Ray Valdes from
code by Joseph H. Allen, who wrote in his post to the comp.editors newsgroup
on 9/10/89: "Do whatever you like with this, just leave my name on it."

private unsigned sizeofBuffer;    /* The size of theBuffer */
private char*    thePoint;        /* The point */
private char*    theBuffer;       /* The buffer */
private char*    theEndOfBuffer;  /* First character not in buffer */
private char*    theStartOfGap;   /* Beginning of the gap */
private char*    theEndOfGap;     /* First character not in the gap */
private bool     isBufferChanged; /* Set when file has been changed */

#define SIZEOF_GAP_INCREMENT   16384  /* Amount that the buffgap grows by */

public bool
bg_InitBuffer(void)  /* name elided in the excerpt; bg_InitBuffer is a placeholder */
{   sizeofBuffer = SIZEOF_GAP_INCREMENT;
    theBuffer = (char* ) mem_AllocMem(sizeofBuffer);
    if(!theBuffer) return FALSE;
    thePoint      = theBuffer;          /* empty buffer: the gap fills it all */
    theStartOfGap = theBuffer;
    theEndOfGap   = theBuffer + SIZEOF_GAP_INCREMENT;
    theEndOfBuffer= theEndOfGap;
    return TRUE;
}
public void
bg_ExpandBuffer(unsigned amount)
{   if( (theEndOfBuffer + amount - theBuffer) > sizeofBuffer)
    {   char* old = theBuffer;
        sizeofBuffer = theEndOfBuffer + amount
                 + SIZEOF_GAP_INCREMENT - theBuffer;
        theBuffer = (char* ) mem_ReallocMem(theBuffer, sizeofBuffer);
        if(!theBuffer) ProgramError("ReallocMem failed!");
        /* The block may have moved: rebase every pointer into it */
        thePoint       += theBuffer - old;
        theEndOfBuffer += theBuffer - old;
        theStartOfGap  += theBuffer - old;
        theEndOfGap    += theBuffer - old;
    }
}
public void
bg_MoveGapToPoint(void)
{   if(thePoint==theStartOfGap) return;
    if(thePoint==theEndOfGap)  { thePoint = theStartOfGap; return; }
    if(thePoint < theStartOfGap)
    {   /* Point is before the gap: slide the text between them up */
        bg_MoveBytes( theEndOfGap - (theStartOfGap-thePoint),
                      thePoint, theStartOfGap - thePoint);
        theEndOfGap   = theEndOfGap-(theStartOfGap-thePoint);
        theStartOfGap = thePoint;
    }
    else
    {   /* Point is after the gap: slide the text between them down */
        bg_MoveBytes(theStartOfGap,theEndOfGap,thePoint-theEndOfGap);
        theStartOfGap += thePoint-theEndOfGap;
        theEndOfGap    = thePoint;
        thePoint       = theStartOfGap;
    }
}
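public void
bg_ExpandGap(unsigned size)
{   if(size > bg_SizeofGap())
    {   size += SIZEOF_GAP_INCREMENT;
        bg_ExpandBuffer(size);       /* presumed: this call is elided in the excerpt */
        bg_MoveBytes(theEndOfGap + size,  /* destination presumed from context */
                     theEndOfGap, theEndOfBuffer - theEndOfGap);
        theEndOfGap    += size;
        theEndOfBuffer += size;
    }
}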
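public bool
bg_FindNewline(void)  /* name elided in the excerpt; bg_FindNewline is a placeholder */
{   /* Advance thePoint to the next newline, hopping over the gap */
    while(((thePoint==theStartOfGap) ? (thePoint=theEndOfGap) : (thePoint))
            != theEndOfBuffer)
        if(*thePoint==NEWLINE_CH) return TRUE;
        else thePoint++;
    return FALSE;
}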
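public void
bg_InsertStringAtPoint(char* string, unsigned size)
{   bg_MoveGapToPoint();
    if(size > bg_SizeofGap())
        bg_ExpandGap(size);                     /* presumed: elided in the excerpt */
    bg_MoveBytes(theStartOfGap, string, size);  /* presumed: elided in the excerpt */
    theStartOfGap += size;
    isBufferChanged = TRUE;
}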
public bool   /* like strcmp(): returns TRUE when the text at thePoint DIFFERS */
bg_CompareString(char* string, unsigned size)
{   char* x;
    if(thePoint==theStartOfGap) thePoint=theEndOfGap;
    if(    (theStartOfGap > thePoint )
        && (theStartOfGap < thePoint + size )
        && (theStartOfGap != theEndOfGap) )
    {    /* The span straddles the gap: compare the two pieces separately */
         if(bg_CompareString(string,theStartOfGap-thePoint)) return TRUE;
         else
         {   x = thePoint;
             thePoint = theEndOfGap;
             if(bg_CompareString(  string + (theStartOfGap-x)
                                 , size - (theStartOfGap-x)))
                  { thePoint=x;  return TRUE;  }
             else { thePoint=x;  return FALSE; }
         }
    }
    else
    {   x = thePoint;
        do { if(*(x++) != *(string++)) return TRUE; } while(--size);
        return FALSE;
    }
}
public bool              /*this routine assumes file is already open*/
bg_InsertFile(FILE* file)
{   unsigned amount;
    long file_size = filelength(fileno(file));

    if(file_size==0L)      return TRUE;
    if(file_size > 32767L) return FALSE;

    isBufferChanged = TRUE;
    /* Lines elided in the excerpt; presumably the gap is positioned and sized: */
    bg_MoveGapToPoint();
    bg_ExpandGap((unsigned) file_size);

    amount = fread(theStartOfGap, 1, file_size, file);
    if(amount != file_size)
    {   ProgramError("I/O Error on reading file.");
        return FALSE;
    }
    theStartOfGap += amount;
    return TRUE;
}
