Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

A Multifield Single-Pass Shell Sort Algorithm


MacGregor is retired from the U.S. Navy and currently resides in the Philippines. He can be contacted at http://www .topsecretcrypto.com/.


When developing the Top Secret Journal procedure for my Top Secret Crypto Gold communications and file-encryption program, I wanted to display the contents of the journal in a tree-view control that users could use to select and display a specific page of the journal. I needed it sorted by two major categories—By Date and By Keyword. Within each major category, I needed it sorted by keywords and then by dates within keywords. Each journal page would have one entry in the By Date category, and up to six entries under the By Keyword category; see Figures 1 and 2.

To do this, I create an index file each time a journal is opened, with one entry for each journal page under the By Date category, and up to six entries for each journal page in the By Keyword category. All By Date category entries have their KeyWord entry set to the single character A for sorting purposes. Each journal page has six KeyWord entries that you can enter when the page is created or edited, so you can group pages with similar content together in the tree-view control. I then have to sort it on three different fields to get it into the order required to create the entries for the tree-view control. Looking at the index file structure (Listing One), I have to sort on the Type_Entry field, which is a word value, then on the KeyWord field, which is a null-terminated string, and finally on the dwCreated field, which consists of two dwords, most significant dword followed by least significant dword. (The dwCreated field contains the date and time the journal page was created as the number of seconds since midnight 1 January 1970, which makes it easy to sort.)

I chose to use the shell sort algorithm and modify it to suit my needs because overall, it usually turns in the best times for worst and average cases, with the worst case beating the average case for sorting time. I also needed to modify the algorithm to sort on byte, word, or dword arrays, and null-terminated strings, and on a single signed byte, word, or dword. And finally, I wanted to be able to sort in ascending or descending order, and forward or backward within a field. To accomplish all of this, the basic shell sort algorithm proved easily modifiable to accomplish the task.

Shell Sort Template Structure

To provide a template for the modified shell sort algorithm to follow, I created a shell sort template structure (Listing Two) to instruct the shell sort algorithm how to sort the data. The first item in the structure is TYPE_SORT, which can be set to SORT_BYTES, SORT_WORDS, SORT_DWORDS for unsigned data; SORT_STRINGS for null-terminated strings; and SORT_SBYTES, SORT_SWORDS, and SORT_SDWORDS for signed data. Sorting signed data is limited to a single byte, word, or dword in a field; otherwise, it will not sort properly. The second item in the structure is COMPARE_DIRECTION, which can be set to FORWARD or BACKWARD and that lets you compare from the low-to-the-high address in a field or the high-to-the-low address. The third item in the structure is RECORD_SIZE, which holds the size of each record to sort. The fourth item in the structure is COMPARE_SIZE. If the field you want to compare is 1 byte long, or one word long, or one dword long, set COMPARE_SIZE to 1. If you have a 30-byte field to compare, set it to 30. If you have a two-word field to compare, set it to 2; if you have a four-dword field to compare, set it to 4. The fifth field in the structure is COMPARE_OFFSET. This is the zero-based offset of the start of the field within the record size you want to compare. If you are comparing the field in the FORWARD direction, this is the zero-based offset of the first byte, word, dword, or string within the record. If you are comparing the field in the BACKWARD direction, this is the offset to the last byte, word, or dword in the field. You cannot compare null-terminated strings in the backward direction. (Refer to Sort.h, available electronically; see "Resource Center," page 5.)

Because I want to sort my index file on three different fields, I extend the shell sort template structure to accommodate three different sort fields. The final item in the structure is the END_MARKER that must be set to SORT_END (-1). This informs the sort algorithm that there are no more fields to sort on. If you have different sorting tasks within a program that call for only one or two fields to be sorted, you can place the end marker in any TYPE_SORT item, which informs the sort algorithm that there are no more fields to sort on. You can extend this structure to accommodate any number of fields.

Filling in a sort template structure is easy. Following the structure of the journal index file (Listing One), I set up a sort template for each of the three fields I want to sort on using the shell sort template structure (Listing Two). The first field to sort on is the TYPE_ENTRY field. Since the type of this field is a word, TYPE_SORT is set to SORT_WORDS, COMPARE_DIRECTION is set to FORWARD, RECORD_SIZE is set to sizeof(DIARYENTRY), COMPARE_SIZE is set to 1, and COMPARE_OFFSET is set to 38 (Listing Three).

The second field to sort on is KeyWord, which is a null-terminated string up to 130 bytes long. Because of this, TYPE_SORT is set to SORT_STRINGS, COMPARE_DIRECTION is set to FORWARD, RECORD_SIZE remains at sizeof(DIARYENTRY), COMPARE_SIZE is set to 130, and COMPARE_OFFSET is set to 40. I know the size of the KeyWord field is 132, but I leave myself a little bit of wiggle room.

The third field to sort on is dwCreated, which contains the date and time the journal entry was created as a 64-bit value in two dwords, the most significant dword first. In the sort template structure TYPE_SORT is set to SORT_DWORDS, COMPARE_DIRECTION is set to FORWARD, RECORD_SIZE remains at sizeof(DIARYENTRY), COMPARE_SIZE is set to 2, and COMPARE_OFFSET is set to 172.

The END_MARKER in the sort template structure is set to SORT_END, which tells the sort algorithm that there are no more fields to sort on.

Shell Sort Algorithm

Most of the shell sort algorithm is written in 80x86 assembly language for 80386 or better processors. The rest is written in Microsoft Visual C 6.0 using Windows API functions for the Win32 environment. Most of the Windows API functions can be converted to Standard C functions to make the sort algorithm more generic.

For the sort algorithm to sort the KeyWord strings in any language, thereby displaying the sorted key words in the proper order according to the sort criteria for each language, I chose to use the Windows API CompareString function with the locale set to LOCALE_USER_DEFAULT and dwCmpFlags set to NORM_IGNORECASE.

The one Windows API function that would be hard to replace is the CompareString function. I say this because, to sort strings for any language, you need a locale that specifies how strings in the user's language are sorted. The CompareString function gives you this, while the Standard C runtime library compare function does not.

I follow the standard shell sort algorithm by dividing the list into two partitions of equal size, comparing each element in the first partition with the corresponding element in the second, and swapping them if necessary. I then divide each of these partitions into two partitions and proceed as above. When the partition size reaches zero, the sort is completed.

What I have added to the shell sort algorithm is the ability to sort on different types of fields, and to sort on more than one field in a single call to the sort procedure. In the heart of the sort procedure I have included a while (True) loop where all of this is accomplished (Listing Four). (See Sort.c, available electronically.)

At the beginning of the while (True) loop, I set up the address of the shell sort template structure in edx, and the offset to the current sort field in the shell sort structure in ebx. The addresses for the two records in the partitions are then placed in esi and edi. The compare offset for the field we want to sort within the record is then added to esi and edi, then these addresses are placed in temporary storage locations. The size of the field we are comparing is then placed in ecx. I then determine if we are sorting a signed field by testing bit 7 of TYPE_SORT in the sort template structure. I use the btr instruction, which copies the bit to the carry flag and resets (clears) the bit in TYPE_SORT. It is necessary to clear this bit so you can easily determine the type of field you are sorting on—bytes, words, dwords, or strings. I then set the signed flag using the setc instruction, which sets it to 1 if the carry bit is set, and 0 if not.

Next, I determine the direction to sort the field in by testing the value in COMPARE_DIRECTION. If the direction to compare is BACKWARD, I use the std instruction to set the direction flag so all subsequent string instructions will process down, from high addresses to low addresses. I then test TYPE_SORT to determine the type of field I am sorting—bytes, words, dwords, or strings. If I am sorting bytes, words, or dwords, I use the standard rep cmpsb, rep cmpsw, or rep cmpsd instructions to compare the fields. Once the comparison is complete, I reset the direction flag using the cld instruction, which places the processor back in its default mode of processing subsequent string instructions up, from low addresses to high addresses. The result of the comparison is then stored in the two temporary variables Above and Below, respectively. If the comparison was performed on unsigned fields, I use the seta and setb instructions; if on signed fields, I use the setg and setl instructions and bit 7 is set in the TYPE_SORT field using the bts instruction for the next comparison.

If the sorting is performed on null-terminated strings, all of the registers are first saved on the stack using the pushad instruction. The CompareString Windows API function is used with the locale set to LOCALE_USER_DEFAULT and dwCmpFlags set to NORM_IGNORECASE, which tells the function to ignore the case of the letters when sorting. The character count variable for each string to compare is set to -1, which tells the function to treat the strings as null terminated, and the length of each string is calculated automatically. The Above and Below variables are then set as determined by the outcome of the comparison, and all of the registers are restored from the stack using the popad instruction.

If the compared items are not equal, the procedure jumps out of the loop to determine if they should be swapped. If the compared items are equal, a check is made to see if there are any more fields to compare. This is done by adding 20 to the ebx register, which is the length of one set of sort parameters in the sort template structure. If the start of the next set of sort parameters contains END_SORT, there are no more fields to sort on; therefore, the function jumps out of the loop to determine if the items should be swapped, which they will not be because they are equal. If it does not contain END_SORT, we have another field to sort on. The sort function then returns to the start of the loop and sets up and compares the next field in the two records. This allows the procedure to sort a set of records on any number of fields you want.

For example, if you had a large data base of customers and you want to sort them by zip code, telephone area code, last name, first name, and middle initial in that order, all you have to do is set up the sort template structure for these five fields and call the sort procedure to sort them. One call is all it takes.

To sort a file, you make a call to the SortMyFile function and supply a pointer to the name of the file, a handle to the opened file, the number of bytes (if any) in the file header, a pointer to the sort template structure, and the direction you want to sort the data in, either ASCENDING or DESCENDING. An example of the call for sorting my journal index file is:

SortMyFile((LPTSTR)&szIndexTemp, hIndexTemp,0,&DiarySort,ASCENDING);

This function determines if you have enough memory to read and sort the file in memory (much faster), or if it will be sorted on disk. The dwHeaderBytes variable lets you tell the sort procedure to ignore the file header when sorting the file. If the file does not have a header, set it to 0. If the file is to be sorted on disk, the SortFileOnDisk function is called. If it is to be sorted in memory, the SortFileInMemory function is called. This function reads the complete file into memory and calls the ShellSort function to sort the data. Once the data is sorted, you return to the SortFileInMemory function and the sorted data is written back to disk.

The ShellSort function can also be called independently if your program needs to sort a table of data that resides solely in memory. (See Sort.c and Support.c; available electronically.)

I have commented out some of the code that contains the error-reporting functions used by my program. I have replaced them with simple MessageBox calls. This allows you to insert your own error procedures as required by the programs you write. This is the reason the SortMyFile function uses the name of the file being sorted. My error procedure displays the name so you will know what file an error occurred on.

Currently, the sort procedure is restricted to handling 4-GB files or less. With the advent of 64-bit computers, it could easily be modified to handle 264-byte files using 64-bit registers. One way around this restriction is the use of index files. While the index files you want to sort would be limited to 4 GB or less, the actual database file it indexes could grow to be much larger.

DDJ



Listing One

// Journal Entry Index File Structure
typedef struct _DIARYENTRY
{
    BYTE                Date_Time[32];  // Date time as a string.
    BYTE                Year[6];        // Year as string.
    WORD                Type_Entry;     // 1 = date entry, 2 = keyword entry.
    BYTE                KeyWord[132];   // Keyword or "A" for date entry.
    DWORD               dwCreated[2];   // Date time created - msd to lsd.
    SYSTEMTIME          st;             // local time created.
    ULARGE_INTEGER      uliOffset;      // Offset of entry in file.
} DIARYENTRY, *LPDIARYENTRY;
Back to article


Listing Two
// Shell Sort Template Structure
typedef struct _SORT_TEMPLATE
{
    DWORD   TYPE_SORT;      // Type of sort.
    DWORD   COMPARE_DIRECTION;  // FORWARD or BACKGROUND.
    DWORD   RECORD_SIZE;    // Record size.
    DWORD   COMPARE_SIZE;   // Sort field size.
    DWORD   COMPARE_OFFSET; // Offset of field in record to sort or -1 for end
    DWORD   TS;             // Start of 2nd sort or -1 for end marker.
    DWORD   CD;
    DWORD   RS;
    DWORD   CS;
    DWORD   CO;
    DWORD   TS1;                // Start of 3rd sort or -1 for end marker.
    DWORD   CD1;
    DWORD   RS1;
    DWORD   CS1;
    DWORD   CO1;
    DWORD   END_MARKER;         // Must be set to -1.
} SORT_TEMPLATE, *LPSORT_TEMPLATE;
Back to article


Listing Three
// Diary Sort Template
SORT_TEMPLATE   DiarySort = {SORT_WORDS,FORWARD,sizeof(DIARYENTRY),1,38,
          SORT_STRINGS,FORWARD,sizeof(DIARYENTRY),130,40,
          SORT_DWORDS,FORWARD,sizeof(DIARYENTRY),2,172,SORT_END};
Back to article


Listing Four
// Sort Algorithm Core.

while(TRUE)
{
    __asm
    {
        // Pointer to sort parameter structure.
        //.....................................
        mov     edx,dwTempEDX
        // Offset to current sort parameter in structure.
        //...............................................
        mov     ebx,dwTempEBX
        // Setup the records to compare.
        //..............................
        mov     esi,dwIndexB    // Bottom index.
        mov     edi,dwIndexC    // Center index.
        // Point to the part of the record to compare.
        //............................................
        add     esi,dword ptr [edx][ebx].COMPARE_OFFSET
        add     edi,dword ptr [edx][ebx].COMPARE_OFFSET
        mov     lpRecordB,esi
        mov     lpRecordC,edi
        // Size of the data to compare. If bytes, the number of bytes; if 
        // words, the number of words; if dwords, the number of dwords.
        //................................................
        mov     ecx,dword ptr [edx][ebx].COMPARE_SIZE
        // See if we are doing a signed comparison.
        //.........................................
        btr     dword ptr [edx][ebx].TYPE_SORT,7
        setc    Signed
        // Setup the direction for the comparision.
        //.........................................
        cmp     dword ptr [edx][ebx].COMPARE_DIRECTION,BACKWARD
        jne     L2
        // Comparision is performed backwards - high address to low.
        //..........................................................
        std
        // Compare bytes.
        //...............
    L2: cmp     dword ptr [edx][ebx].TYPE_SORT,SORT_BYTES
        jne     L3
        repe    cmpsb
        jmp     L5
        // Compare words.
        //...............
    L3: cmp     dword ptr [edx][ebx].TYPE_SORT,SORT_WORDS
        jne     L4
        rep     cmpsw
        jmp     L5
        // Compare dwords.
        //................
    L4: cmp     dword ptr [edx][ebx].TYPE_SORT,SORT_DWORDS
        jne     L7                  // Default to sort strings.
        repe    cmpsd
        // Make sure - reset direction flag to low to high address.
        //.........................................................
    L5: cld         
                        
        // Set the flags depending on if we sorted signed fields or not.
        //...............................................
        pushfd
        cmp     Signed,1
        je      L6
        popfd
        seta    Above
        setb    Below
        jmp     L8
    L6: popfd
        setg    Above
        setl    Below
        // Reset the signed field in the TYPE_SORT parameter for next record.
        //........................................
        bts     dword ptr [edx][ebx].TYPE_SORT,7
        jmp     L8

        // Save all of our registeres.
        //............................
    L7: pushad
    }
    // Compare strings using user default settings.
    //.............................................
    iCompareResult = CompareString(LOCALE_USER_DEFAULT,NORM_IGNORECASE,
                                   lpRecordB,-1,lpRecordC,-1);
    Above = 0;
    Below = 0;
    if (iCompareResult == CSTR_LESS_THAN)
    {
        Below = 1;
    }
    else if (iCompareResult == CSTR_GREATER_THAN)
    {
        Above = 1;
    }
    __asm
    {
        popad
        // Break if the fields are not equal.
        //...................................
        cmp     iCompareResult,CSTR_EQUAL
    L8: jne     CheckSwap
        // Break if the fields are equal and we have no more fields to sort 
        // on; else continue sorting the record on the next field.
        //...........................................
        add     ebx,20          // Size of 1 sort parameter.
        mov     dwTempEBX,ebx
        cmp     dword ptr [edx][ebx].TYPE_SORT,SORT_END
        je      CheckSwap
    }
}   // while TRUE
Back to article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.