Design

Text Editors: Algorithms and Architectures

By Ray Valdes, April 01, 1993

From Stallman's GnuEmacs to Microsoft's Word, text editors are among the most taken-for-granted, yet most often used, applications around. The choice of core algorithms, and how they're implemented in the overall architecture, can make the difference between a good editor and a great one.

To maintain the view of text as an uninterrupted stream of characters, you can use a number of data structures, such as linked-list structures, the buffer-gap structure, and virtual-memory blocks.

Plain ASCII text is fine for editing program source code, but other uses require additional attributes to be associated with the text stream. These attributes control the way text is formatted or presented on the screen: typeface, point size, justification, and so on.

One implementation choice is to embed these presentation attributes into the text stream, thereby mixing formatting commands with text content. This approach is used in many older-generation typesetting machines, back before WYSIWYG. In general, this is also the approach that the future version of GnuEmacs will use. Version 19 of GnuEmacs, to be released later this year, will view a file as a stream of first-class Lisp objects that can represent text characters, formatting commands, events such as mouse-clicks, or any arbitrary Lisp function.
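To make the embedded approach concrete, here is a minimal sketch in C. It assumes a hypothetical StreamItem type in which each slot of the stream is either a character or a formatting command; the type and function names are illustrative only and are not taken from GnuEmacs or any typesetting system. The point it shows is that any routine that wants just the text content has to step over the embedded commands.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical item kinds: a slot holds either a text character or a
   formatting command. These names are assumptions for illustration. */
typedef enum { ITEM_CHAR, ITEM_SET_BOLD, ITEM_SET_PLAIN } ItemKind;

typedef struct {
    ItemKind kind;
    char     ch;      /* meaningful only when kind == ITEM_CHAR */
} StreamItem;

/* Copy only the text content into out (NUL-terminated), skipping the
   embedded formatting commands. Returns the number of characters copied.
   Stops early if out would overflow. */
size_t stream_PlainText(const StreamItem *items, size_t count,
                        char *out, size_t outSize)
{
    size_t i, n = 0;
    assert(outSize > 0);
    for (i = 0; i < count && n + 1 < outSize; i++)
        if (items[i].kind == ITEM_CHAR)
            out[n++] = items[i].ch;
    out[n] = '\0';
    return n;
}
```

A search or word-count routine built on such a stream would need the same filtering step, which is one of the costs of mixing commands with content.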

A different way to deal with text attributes is to maintain parallel streams--one of content, the other of presentation attributes related to content. This is used in Microsoft Word. Likewise, on the Macintosh, the environment provides an equivalent to the Windows Edit control called TextEdit that allows for "runs" or sequences of text styles, a "text style" being defined as a particular combination of font and point-size attributes. Text styles are implemented as an array of style records that point to locations in the text stream where a particular "run" begins. There's an elaborate figure on page 15-38 of Inside Macintosh Volume VI (Addison-Wesley, 1992) that shows how style records are maintained in relation to the text. In System 7.1, the WorldScript facility generalizes the notion of runs even further.
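The parallel-stream idea can be sketched as an array of style records sorted by the text offset at which each run begins. The StyleRun layout and the style_FindRun() function below are assumptions made for illustration; they do not reproduce the actual TextEdit or Word data structures. Given a character offset, a binary search finds the run that governs it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical style record; not the real TextEdit layout. */
typedef struct {
    size_t   startOffset;  /* offset in the text where this run begins */
    unsigned fontId;
    unsigned pointSize;
} StyleRun;

/* Return the index of the run that covers text offset pos.
   Assumes runs[] is sorted by startOffset and runs[0].startOffset == 0,
   so every offset is covered by some run. */
size_t style_FindRun(const StyleRun *runs, size_t runCount, size_t pos)
{
    size_t lo = 0, hi = runCount;      /* the answer lies in [lo, hi) */
    assert(runCount > 0 && runs[0].startOffset == 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (runs[mid].startOffset <= pos)
            lo = mid;                  /* run mid starts at or before pos */
        else
            hi = mid;                  /* run mid starts after pos */
    }
    return lo;
}
```

The text stream itself stays free of formatting codes, but every insertion or deletion has to adjust the startOffset of each run that follows the edit.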

As you step up from ASCII text editors and simple word-processors to more sophisticated document-processing and desktop-publishing programs, text can no longer be regarded as a one-dimensional array of characters (even when associated with a parallel stream of text attributes). In document processors such as FrameMaker, Interleaf, or Ventura Publisher, the text content has its own elaborate structure -- words, sentences, paragraphs, subsections, chapters, appendices, and volumes. This is known as a "semantic" or "logical" structure, in contrast to the geometrical or visual structure of the presentation. Document-processing programs have the most complex task of all graphical editors, to maintain a consistent mapping between two tree-like structures: the semantic hierarchy of text content and the geometrical hierarchy of pages, columns, frames, and lines. Dealing with this complexity in an interactive, optimized manner is what leads to million-line-plus programs.

The Machine Representation of Text

In the EditLine() example, text is represented in the machine as a single array of ASCII-encoded bytes. Inserting or deleting a character merely requires calling movmem() to shift every byte after the edit position by one memory location. Depending on the CPU, this brute-force method can work for even medium-to-large amounts of text. At some point, of course, this profligate expense of machine cycles becomes unworkable. Then the "buffer-gap" approach comes into play.
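The brute-force method looks roughly like the sketch below. It is not the EditLine() code; the function names are hypothetical, and it uses the standard memmove() in place of the compiler-specific movmem(). Each edit costs time proportional to the amount of text that follows the edit position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Insert ch at index pos in text[0..*len), shifting the tail right by one.
   capacity is the size of the array. Returns 0 on success, -1 if full. */
int flat_InsertChar(char *text, size_t *len, size_t capacity,
                    size_t pos, char ch)
{
    assert(pos <= *len);
    if (*len >= capacity) return -1;
    memmove(text + pos + 1, text + pos, *len - pos);  /* open a one-byte hole */
    text[pos] = ch;
    (*len)++;
    return 0;
}

/* Delete the character at index pos, shifting the tail left by one. */
void flat_DeleteChar(char *text, size_t *len, size_t pos)
{
    assert(pos < *len);
    memmove(text + pos, text + pos + 1, *len - pos - 1);  /* close the hole */
    (*len)--;
}
```

Typing at the start of a large file is the worst case: every keystroke moves the entire remainder of the text.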

The buffer-gap approach divides the single array of characters into two parts, separated by a movable gap. The gap is an internal construct, not visible to the user. From the user's point of view, text remains in an unbroken stream. As the user navigates over the text, moving the cursor from one character to the next, the system updates the corresponding pointer in the text data structure, skipping over the buffer gap as necessary. When the user types a run of text, the system shifts the gap over to the point of insertion, then shrinks the gap by one character for each keystroke. This method avoids most of the shuffling and reshuffling of text required by the earlier approach.
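Before turning to the full module, the mechanism can be sketched in a few dozen lines. The version below is a simplification made under stated assumptions: it uses a fixed-size array and indices rather than pointers, it never grows the buffer, and the GapBuf type and gb_ function names are hypothetical rather than taken from the article's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GB_CAPACITY 64   /* fixed size for this sketch; a real editor grows it */

typedef struct {
    char   data[GB_CAPACITY];
    size_t gapStart;     /* index of the first byte of the gap */
    size_t gapEnd;       /* index of the first byte after the gap */
} GapBuf;

void gb_Init(GapBuf *g) { g->gapStart = 0; g->gapEnd = GB_CAPACITY; }

/* Number of characters stored (everything except the gap). */
size_t gb_Length(const GapBuf *g)
{
    return GB_CAPACITY - (g->gapEnd - g->gapStart);
}

/* Slide the gap so that it begins at user position pos (0..length). */
void gb_MoveGap(GapBuf *g, size_t pos)
{
    assert(pos <= gb_Length(g));
    if (pos < g->gapStart) {           /* move text [pos, gapStart) up past the gap */
        size_t n = g->gapStart - pos;
        memmove(g->data + g->gapEnd - n, g->data + pos, n);
        g->gapStart -= n;
        g->gapEnd   -= n;
    } else if (pos > g->gapStart) {    /* move text after the gap down into it */
        size_t n = pos - g->gapStart;
        memmove(g->data + g->gapStart, g->data + g->gapEnd, n);
        g->gapStart += n;
        g->gapEnd   += n;
    }
}

/* Insert one character at user position pos.
   Returns 0 on success, -1 if the gap is exhausted. */
int gb_Insert(GapBuf *g, size_t pos, char ch)
{
    if (g->gapStart == g->gapEnd) return -1;
    gb_MoveGap(g, pos);
    g->data[g->gapStart++] = ch;       /* the gap shrinks by one */
    return 0;
}

/* Copy the text without the gap into out, which must hold length+1 bytes. */
void gb_Contents(const GapBuf *g, char *out)
{
    size_t tail = GB_CAPACITY - g->gapEnd;
    memcpy(out, g->data, g->gapStart);
    memcpy(out + g->gapStart, g->data + g->gapEnd, tail);
    out[g->gapStart + tail] = '\0';
}
```

Consecutive keystrokes at the same spot find the gap already in place, so each one costs a single store; bytes move only when the insertion point jumps elsewhere.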

Implementing a buffer-gap manager is not difficult, but requires attention to detail to avoid fencepost errors. As Finseth points out, three coordinate systems are in play at the same time. In the user coordinate system, location 0 corresponds to the position before the first character of the text. Note that coordinates label the positions between the characters, rather than the characters themselves. This is similar to a 2-D graphical coordinate system, such as that used by QuickDraw or Windows GDI, which labels the positions between pixels rather than the pixels themselves.

Second, there's the buffer-gap coordinate system, which is the same as the user coordinate system, except that the continuum is broken up by the variable-length gap. The third system is the storage coordinate system, which labels the memory locations where characters are stored (rather than the positions between them) and is the one used by pointers to memory. If you don't scrupulously maintain the distinction between these three coordinate systems, you'll be plagued by an ongoing cascade of fencepost errors. Fortunately, the code available electronically contains buf_gap.c, a module that implements all the basic functions for managing a buffer gap--inserting and deleting text, moving the gap, expanding the gap, searching the buffer for a particular string, and so on. Excerpts from buf_gap.c are shown in Listing Two. This code is heavily based on an example posted by Joseph Allen to the Internet on September 10, 1989. The module is not a stand-alone program, but assumes other modules for input, command dispatching, redisplay, memory allocation, and screen output.
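One way to keep the coordinate systems apart is to funnel every conversion through a pair of small functions. The sketch below maps a user position (a location between characters) to the storage index of the character that follows it, and back again. The GapLayout type and the co_ names are hypothetical; the module in Listing Two works with pointers rather than indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a buffer with a gap, using storage indices. */
typedef struct {
    size_t gapStart;   /* storage index of the first byte of the gap */
    size_t gapEnd;     /* storage index of the first byte past the gap */
    size_t bufEnd;     /* storage index one past the last stored character */
} GapLayout;

/* Number of characters the user sees. */
size_t co_UserLength(const GapLayout *b)
{
    return b->bufEnd - (b->gapEnd - b->gapStart);
}

/* Storage index of the character just after user position pos.
   A position at the gap maps to the first character beyond the gap;
   pos == user length maps to bufEnd. */
size_t co_UserToStorage(const GapLayout *b, size_t pos)
{
    assert(pos <= co_UserLength(b));
    return (pos < b->gapStart) ? pos : pos + (b->gapEnd - b->gapStart);
}

/* User position just before the character stored at index.
   index must not fall inside the gap. */
size_t co_StorageToUser(const GapLayout *b, size_t index)
{
    assert(index < b->gapStart ||
           (index >= b->gapEnd && index <= b->bufEnd));
    return (index < b->gapStart) ? index : index - (b->gapEnd - b->gapStart);
}
```

With a gap occupying storage indices 3 through 9 of a 15-byte buffer, user position 3 maps to storage index 10, the same skip that the listing performs whenever the point lands on the start of the gap.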

Listing Two

/***************************************************************************
Excerpts from BUF_GAP.C--buffer gap manager module. Derived by Ray Valdes from
code by Joseph H. Allen, who wrote in his post to the comp.editors newsgroup
on 9/10/89: "Do whatever you like with this, just leave my name on it."
***************************************************************************/

private unsigned sizeofBuffer;    /* The size of theBuffer */
private char*    thePoint;        /* The point */
private char*    theBuffer;       /* The buffer */
private char*    theEndOfBuffer;  /* First character not in buffer */
private char*    theStartOfGap;   /* First byte of the gap */
private char*    theEndOfGap;     /* First character past the gap */
private bool     isBufferChanged; /* Set when file has been changed */

#define SIZEOF_GAP_INCREMENT   16384  /* Amount that the buffer gap grows by */

/****************************************************************/
public bool
bg_InitializeModule(void)
{   sizeofBuffer = SIZEOF_GAP_INCREMENT;
    theBuffer = (char* ) mem_AllocMem(sizeofBuffer);
    if(!theBuffer) return FALSE;
    thePoint      = theBuffer;
    theStartOfGap = theBuffer;
    theEndOfGap   = theBuffer + SIZEOF_GAP_INCREMENT;
    theEndOfBuffer= theEndOfGap;
    return TRUE;
}
/****************************************************************/
public void
bg_ExpandBuffer(unsigned amount)
{
   if( (theEndOfBuffer + amount - theBuffer) > sizeofBuffer)
    {   char* old = theBuffer;
        sizeofBuffer = theEndOfBuffer + amount
                 + SIZEOF_GAP_INCREMENT - theBuffer;
        theBuffer = (char* ) mem_ReallocMem(theBuffer, sizeofBuffer);
        if(!theBuffer) ProgramError("ReallocMem failed!");
        thePoint       += theBuffer - old;
        theEndOfBuffer += theBuffer - old;
        theStartOfGap  += theBuffer - old;
        theEndOfGap    += theBuffer - old;
    }
}
/****************************************************************/
public void
bg_MoveGapToPoint(void)
{
    if(thePoint==theStartOfGap) return;
    if(thePoint==theEndOfGap)  { thePoint = theStartOfGap; return; }
    /*else*/
    if(thePoint < theStartOfGap)
    {   bg_MoveBytes( theEndOfGap - (theStartOfGap-thePoint),
                      thePoint, theStartOfGap - thePoint);
        theEndOfGap   = theEndOfGap-(theStartOfGap-thePoint);
        theStartOfGap = thePoint;
    }
    else
    {   bg_MoveBytes(theStartOfGap,theEndOfGap,thePoint-theEndOfGap);
        theStartOfGap += thePoint-theEndOfGap;
        theEndOfGap    = thePoint;
        thePoint       = theStartOfGap;
    }
}
/****************************************************************/
public void
bg_ExpandGap(unsigned size)
{   if(size > bg_SizeofGap())
    {
        size += SIZEOF_GAP_INCREMENT;
        bg_ExpandBuffer(size);
        bg_MoveBytes(theEndOfGap+size,
                     theEndOfGap, theEndOfBuffer - theEndOfGap);
        theEndOfGap    += size;
        theEndOfBuffer += size;
    }
}
/****************************************************************/
public bool
bg_FindNextNewline(void)
{   while(((thePoint==theStartOfGap) ? (thePoint=theEndOfGap) : (thePoint))
            != theEndOfBuffer)
    {
        if(*thePoint==NEWLINE_CH) return TRUE;
        else thePoint++;
    }
    return FALSE;
}
/****************************************************************/
public void
bg_InsertStringAtPoint(char* string, unsigned size)
{   bg_MoveGapToPoint();
    if(size > bg_SizeofGap())
        bg_ExpandGap(size);
    bg_MoveBytes(theStartOfGap,string,size);
    theStartOfGap += size;
    isBufferChanged = TRUE;
}
/****************************************************************/
public bool   /* returns TRUE if text at the point DIFFERS from string; size must be > 0 */
bg_CompareString(char* string, unsigned size)
{   char* x;
    if(thePoint==theStartOfGap) thePoint=theEndOfGap;
    if(    (theStartOfGap > thePoint )
        && (theStartOfGap < thePoint + size )
        && (theStartOfGap != theEndOfGap) )
    {
         if(bg_CompareString(string,theStartOfGap-thePoint)) return TRUE;
         else
         {
             x = thePoint;
             thePoint = theEndOfGap;
             if(bg_CompareString(  string + (theStartOfGap-x)
                                 , size - (theStartOfGap-x)))
                  { thePoint=x;  return TRUE;  }
             else { thePoint=x;  return FALSE; }
         }
    }
    else
    {
        x = thePoint;
        do { if(*(x++) != *(string++)) return TRUE; } while(--size);
        return FALSE;
    }
}
/****************************************************************/
public bool              /*this routine assumes file is already open*/
bg_InsertFile(FILE* file)
{   unsigned amount;
    long file_size = filelength(fileno(file));

    if(file_size==0L)      return TRUE;
    if(file_size > 32767L) return FALSE;

    isBufferChanged = TRUE;

    bg_MoveGapToPoint();
    bg_ExpandGap((unsigned)file_size);   /* parameter is unsigned; size checked above */

    amount = fread(theStartOfGap, 1, (size_t)file_size, file);
    if(amount != (unsigned)file_size)
    {
        ProgramError("I/O Error on reading file.");
        return FALSE;
    }
    theStartOfGap += amount;
    return TRUE;
}
