OCT93: UNDOCUMENTED CORNER

UNDOCUMENTED CORNER

Documenting Documentation: The Windows .HLP File Format, Part II

Pete Davis

Pete works for a small consulting firm as a programmer/analyst, writing client-server software in OS/2, DOS, Windows 3, and Windows NT. He is working on a book, tentatively titled The Hitchhiker's Guide to Win32 Programming, to be published by Addison-Wesley. Pete can be contacted on CompuServe at 71644,3570, or on his BBS (2400,N,8,1) at 703-503-3021.

Introduction

by Andrew Schulman

Open the manual for almost any product from Microsoft and on the copyright page you will find two patent numbers: 4,955,066 and 5,109,433. No, MS-DOS and Windows aren't patented. Nor are these patents for former Microsoft chief systems architect Gordon Letwin's famous (and patented) technique for mode-switching, nor for those funny plastic boxes that Microsoft products used to come in.

Instead, these two U.S. patents, invented by Leo A. Notenboom (Woodinville, WA) and assigned to Microsoft Corp., cover "Compressing and Decompressing Text Files." Specifically, the two patents (from September 1990 and April 1992) cover a form of multilevel data compression, with "phrase" substitution. As an example, both patents point to help files such as distributed with Microsoft Word, and mention as one embodiment of the patent, Microsoft's HELPMAKE program for producing DOS-based .HLP files.

Microsoft's patents seem to bear on parts of the Windows .HLP file format and its successor, the .MVB (Multimedia Viewer) file format. In last month's "Undocumented Corner" (DDJ, September 1993), Pete Davis showed that Windows .HLP and .MVB files are built upon an internal file system that Pete calls "WHIFS" (WinHelp Internal File System). Every Windows .HLP and .MVB file is actually a collection of "internal" files, with names such as |SYSTEM, |TOPIC, and |Phrases; the initial | distinguishes built-in WinHelp files from any additional files provided as "baggage" or bitmaps.

Last month, Pete discussed two WHIFS files, |SYSTEM and |Phrases. This month, Pete turns to |TOPIC, the WHIFS file where your actual help text is stored. As he shows, this text is compressed along much the same lines as set forward in Microsoft's text-compression patents. WinHelp uses a form of LZ77 compression instead of the Huffman encoding mentioned in the patents, but the multilevel phrase replacement scheme used by WinHelp is otherwise very similar to that claimed by Microsoft.

The help compiler goes through the help text looking for frequently-used chunks of text, and puts these into the |Phrases table; all occurrences of these chunks in the text are replaced with indices into |Phrases. The remaining text is then compressed using an LZ77 sliding window, and placed into the |TOPIC file. The |Phrases table is also compressed.

Recovering help text from a Windows .HLP or .MVB file requires reading in |Phrases, decompressing it, reading in |TOPIC, decompressing it, and substituting each phrase_index with Phrases[phrase_index]. Pete shows exactly how to do this in this month's TOPICDMP.C. Incidentally, thanks are again due to Carl Burke, Ron Burk, Lou Grinzo, and Brian Walker.

An interesting question is the extent to which WinHelp is really covered by the Microsoft patents. In other words, how general are these patents? Do they merely cover the specific multilevel sequence of phrase substitution and Huffman encoding, or do they cover any multilevel data compression involving phrase substitution? This is important because it affects whether third parties can develop independent WinHelp utilities without licensing Microsoft's patented technology.

Certainly, there is a market for alternative WinHelp utilities: a faster help compiler, a full WinHelp decompiler, a WinHelp viewer for DOS, and a library of WinHelp subroutines are all suggestions I've received just in the past month.

My wife has also asked me to give her everything on Donald Sutherland and John Malkovich from Microsoft's Cinemania product. As I mentioned last month, Cinemania is essentially a big .MVB file. Getting Amanda everything on her favorite actors would involve hours of pressing buttons in Cinemania, because the Multimedia Viewer doesn't have any batch-processing capabilities. It would be easier to buy her Roger Ebert Goes to the Movies, or perhaps to just buy her Roger Ebert.

But with the information Pete has presented in this two-part series, it should be possible to build better .HLP and .MVB viewers, including ones that allowed batch queries and extraction of material such as the capsule summary for every movie in which Madonna appeared. (Oh right, I'm supposed to be doing Donald Sutherland.)

The real problem isn't that WinHelp and Viewer are too interactive, but that currently Microsoft has no competition in the WinHelp arena. There are plenty of third-party tools to make .HLP files, but no alternative compilers or viewers. As shown both by Cinemania and by the excellent Microsoft Developer Network (MSDN) CD-ROM, .HLP and .MVB actually provide a new way of building applications. Having Microsoft as essentially the sole supplier is a real problem: WinHelp is a proprietary format, so your text is locked up.

The information Pete provides could enable a whole new generation of Windows utilities: not only help compilers, viewers, and decompilers like Pete's HELPDUMP (available electronically), but also new types of WinHelp-based applications. The question, again, is whether these applications will have to use technology licensed from Microsoft. I'd be interested in hearing from you (especially anyone at Microsoft) about this.

Next month, we'll break away from Windows to look at Novell's proprietary Network Core Protocol (NCP). Other possible future topics include undocumented Pentium instructions (the infamous "Appendix H") and internals of the DOS box in Windows Enhanced mode. Thanks to all of you who have suggested these and other topics for this column. You can reach me on CompuServe at 76320,302.

Last month I covered the basics of the WinHelp .HLP and .MVB file structures. I also discussed how .HLP and .MVB files have their own internal file system I call the WHIFS. The WHIFS system is much like the DOS file system in which there are file names and pointers to files. Some of the files in the WHIFS are created by the user, like Bitmap files and Baggage files, but there are many files that are used internally by WinHelp to provide a hypertext help environment. These built-in files have names such as |SYSTEM, |TOPIC, and |Phrases. I discussed a few of those files last month, such as |SYSTEM, which includes any WinHelp macros.

This month I'll talk about the remaining internal files used by WinHelp, emphasizing the |TOPIC file and the compression method it uses for |TOPIC and |Phrases. |TOPIC is where the actual help text is kept, so obviously it deserves special attention. Understanding the (patented) compression method is important because it is used by almost all .HLP files.

There is far more code this month than could be put in the magazine. An updated version of WHSTRUCT.H, with all the WinHelp structures, is available electronically (see "Availability," page 3), as is HELPDUMP.C, the source code for a simple help dumper, and WHTITLES.C, which displays the |TTLBTREE. There is room, however, for TOPICDMP.C (Listing One, page 67), which displays the help text found in the |TOPIC and |Phrases WHIFS files within a .HLP or .MVB file. Obviously, this simple program (discussed later in this article) could be enhanced to create a WinHelp viewer for DOS.

|TTLBTREE

|TTLBTREE is a B-tree with topic titles. With each topic title is the offset of the topic. This is used simply for getting a list of topics and jumping directly to a given topic based on its title.

The |TTLBTREE is set up much like the WHIFS B-tree discussed last month. The |TTLBTREE file use 2K pages, as opposed to the 1K pages used by the WHIFS B-tree, but otherwise, traversing the |TTLBTREE is the same as traversing the WHIFS (shown last month).

The data in the |TTLBTREE is simply the offset to the topic followed by the topic's title as a null-terminated, varying length, text string. The offset is a little strange, however (see "Topic Offsets" below).

The WHTITLES.C code (available electronically) dumps out |TTLBTREE, providing a handy dump of all the topics in a .HLP or .MVB files. For example, Cinemania turns out to have 33,535 topics, including "Abbott and Costello Go to Mars (1953)," "Abbott and Costello in Hollywood (1945)," "Abbott and Costello in the Foreign Legion (1950)," and even a few films in which Abbot and Costello did not appear.

|TOPIC

The |TOPIC file has the text from the individual topics and because the topic file is built as a linked list, it's easy to go through all the topics. There are some important things to keep in mind, however. In 90 percent of the cases, you can't simply go through the |TOPIC file to get the text. If there's LZ77 compression, you have to decompress the text. If there is a |Phrases file (see DDJ, September 1993), you need to do the phrase replacement within the |TOPIC file. If you want to handle fonts, you need to pull in information from the |FONT file, and so on, and so on.

The |TOPIC file has several layers. Because of the multilayer design, you must be aware of which |TOPIC level you are trying to traverse. The lowest layer is a doubly linked list of records. Each record has record type; inside that record is information based on that record type. For example, a record type of 0x02 means that the data is a topic header. This will have information such as the size of the total topic, where the next topic is, where the previous topic is, and so on. There is also a record type of 0x20 which contains displayable information such as text or bitmap.

Because a single topic may include many such records, something else is needed for traversing the entire help file via browse buttons or in some other sequential manner. This is where the next layer, a linked list of topics, comes in. This is information built from a type 0x02 record. This record type is a linked list within the lower-level linked list. It has pointers to previous and next topics so you can bypass the many nodes of the lower-level linked list between paragraphs.

Topic Offsets

Topic offsets in WinHelp are a mess, plain and simple. There are, essentially, three types of topic offset.

The first type I'll call the Actual Offset. This is the offset relative to the beginning of the |TOPIC file. This is the most straightforward (and, appropriately, least-frequently used) of the offsets.

The second type is the Extended Offset, which is used to compensate for space saved via LZ77 compression. The |TOPIC file is broken into 4K blocks. If compressed, however, a 4K block could expand to a 5K, 6K, or larger. So, to allow you to quickly go to a point in the |TOPIC file, the Extended offset is broken into two parts, the block number and the block offset. The block number is the upper 18 bits of an offset DWORD; the lower 14 bits are the offset within that block.

The last type is the Character offset. This has the same block number/offset form as an Extended offset. The split is different, though: the block number is the upper 17 bits and the block offset is the lower 15 bits. Furthermore, the block offset in a Character offset is the sum of the DataLen2 fields of all the previous TOPICLINK records in the block (see WHSTRUCT.H, available electronically). This is used to be able to provide an exact location within the text.

This is a fine mess. So, which ones are used where? The offsets given in |TTLBTREE, |KWDATA, |CONTEXT, and other files external to |TOPIC, are Character offsets. Extended offsets are used in the TOPICBLOCKHEADER and TOPICHEADER records. They are also used for the hot-link references within the |TOPIC text. The Actual Offsets are mainly used by you, the programmer, for getting around the |TOPIC files. Why both Extended and Character offsets are used, I don't know, since Extended could probably handle all the same functionality as Character.

|TOPIC Structures

The TOPICBLOCKHEADER (see WHSTRUCT.H, available electronically) starts a block of topic data in the |TOPIC file. The TOPICBLOCKHEADER appears at every 4K of the |TOPIC file, starting with the first 12 bytes of a topic file. The TopicData field has the offset of the start of topic data for this block. This address has to be translated (see previous subhead) because of the 13th and 14th bits are Extended offsets.

The TOPICLINK structure is the lower level. This begins immediately after the TOPICBLOCKHEADER and encompasses the rest of the |TOPIC file. The PrevBlock and NextBlock (linked-list pointers) are the offsets of the blocks relative to the beginning of the |TOPIC WHIFS file. The topic links are broken into record types. Type 0x02 are topic headers and the type 0x20 are usually displayable items like paragraphs or bitmaps.

There are two data pointers in the TOPICLINK structure. The contents depends on the record type. The TOPICHEADER file is located in the *LinkData of a type 0x02 record. The block size is the size of the topic, including all the lower level linked list within that topic. Type 0x20 records are essentially paragraphs, bitmaps, and other displayable types of information. The *LinkData1 for a type 0x20 record is broken into records.

LinkData1 consists of a list of variable-length records. Each record is usually delimited by a 0x80 byte. The first record is a number, 2xLength(LinkData1). The second is 2xSizeof(LinkData2). The rest of the codes are font descriptors, hotlinks, and the like. If a paragraph has multiple fonts and hotlinks in a single paragraph, they will be listed in the order in which they appear in the text. If, for example, the paragraph starts with 8 point helvetica normal, and then a word is bold, and then the font goes back to normal, you will have three font descriptors listed, one for 8 point helvetica normal, followed by 8 point helvetica bold, followed by 8 point helvetica normal again. In the text, NULL bytes (0x00) are used to denote changes in font, or the start and end of hot links.

The LinkData2 field is essentially the text for the paragraph. 0x00 is used as a delimiter in the text for font changes or hot links. If compression was used, references to the |Phrases file will be made. In the text, the presence of a byte value between 0x01 and 0x09 indicates that this and the following byte are a reference to the |Phrases file. If, for example, 0x01 0x08 shows up, then this is a reference to the fourth phrase in the |Phrases file. The formula is (((Byte11)*256)+Byte2)/2). This means there are a maximum of about 1,100 phrases in any given .HLP or .MVB file.

WinHelp's Data Compression

The |Phrases and |TOPIC file use an LZ77 compression algorithm to save space. LZ77 is a fairly simple algorithm and its implementation under WinHelp is even simpler (for details on LZ77, see Chapter 8 of The Data Compression Book, by Mark Nelson, M&T Books, 1992). During our work we also noticed that the compression used by COMPRESS.EXE and the LZEXPAND.DLL are very similar to WinHelp's. In this process, we also uncovered the fact that the compression algorithm used by WinHelp and COMPRESS are also covered under two U.S. patents. This means that to actually use this information in any commercial applications would probably require a license from Microsoft.

LZ77 is called a "sliding Window" compression algorithm. As data is read in, it is added to the "window." The window is, essentially, a queue that the data goes in. When the window is full, the data at the end is removed from the window to allow room for new data. When a coded segment of the data shows up, it consists of a pointer to data in the window. In this case, it consists of a pointer to the data and the length of the data. The pointer tells how far back in the window the data is located and the length tells us how much data needs to be copied to the current position.

The WinHelp compression algorithm uses bitmaps for every eight codes in the compressed data. A code is either an actual character or a 2-byte coded distance and length pair. The bitmap is simply a single byte that tells you which of the following eight codes are compression codes and which are actual characters to be copied.

The codes are in the format of a 12-bit distance to the codes and a 4-bit length. For example, encountering the bytes 0x42 0x31 would yield a 12-bit distance of 0x142 and a length of 0x3. Since any length of less than three would be useless (since the codes are two bytes), the 4-bit length is increased by three, so our length would actually be 0x6. Also, since a zero distance would have no meaning, the 12-bit distance bits are increased by one, leaving a final distance 0x143 and length of six.

This might be a little easier if we looked at some data in a hex-dump format. Figure 1 contains three encoded strings--Flower, Phanatical, and Pharmaceutical. Notice that the first byte is a 0. This is, of course, eight 0-bits, meaning that the eight codes to follow are actually just 1-byte characters. So, after reading the flag bit, simply read the following eight characters and add them to the window.

The next byte is, again, a 0. Therefore the next eight codes are actual characters and will simply be added to the window. The window now consists of the words "FlowerPhanatical." So far we haven't run into any codes, but the next byte is a 0x81, which in binary is 10000001. This means that the first two bytes are a code, followed by six bytes of text, followed by another 2-byte code.

The first 2-byte code is 0x09 0x00. This translates into a distance of 0x0A and a length of 3, as per the formula above. Moving back ten (0x0A) characters in the window, we come to the letter "P" in "Phanatical." The length is 3, so we take the "Pha" from "Phanatical" and add it to the window. The next six bytes are the letters "rmaceu." The window now consists of the letters "FlowerPhanaticalPharmaceu." According to our bitmap, we still have one code to go; this is 0x0D 0x20. Again, from our translation formula, the 12-bit distance is 0x0D+1=0x0E and the length is 0x02+3=5. Going 0x0E (14) characters back in the window we come to the "tical" part of Phanatical. After adding these characters to the window, we get "FlowerPhanaticalPharmaceutical." Tada!

How Compression is Applied

The WinHelp compiler applies the LZ77 compression to both |TOPIC and |Phrases. The previous example showed how it would be applied to |Phrases: the compression starts at the beginning of the first phrase and continues, without interruption, to the end of |Phrases.

|TOPIC, on the other hand, is compressed in 4K blocks to support incremental data decompression. If the compression wasn't blocked, then whenever you wanted to decompress a topic, even if it was at the end of |TOPIC, you would have to decompress all the data before it just to have the proper data in your window. By breaking it into 4K blocks, you never need go further back than 4k to decompress the data you're looking for. Thus, if you are decompressing a topic inside a block, you must start at the beginning of the block and continue until you reach the end of your topic. If your topic crosses over to the next block, you must decompress the second block until you have reached the end of your topic.

TOPICDMP

To see how |TOPIC and |Phrases and the LZ77 algorithm fit together, see TOPICDMP.C in Listing One (page 167). This program dumps out the text in a Windows 3.1 .HLP or .MVB file. A calling tree for the program is shown in Figure 2.

Unfortunately, TOPICDMP.C is not self-contained: all the structures used, such as PHRASEHDR, TOPICBLOCKHEADER, and TOPICLINK, are all found in WHSTRUCT.H, as is the GotoWHIFSPage() macro. Even if you downloaded WHSTRUCT.H last month, you will need to get it again this month, since it has changed considerably.

That's It?

Well, we've run out of time. There are several other crucial WHIFS files that I didn't discuss, including |FONT, |KWMAP, |KWBTREE, |KWDATA, and |CONTEXT. If you want to learn more about these, download the text file that we'll also be providing electronically, as well as the source code for HELPDUMP.C.

I know that there are a lot of you out there who want to know more. I will continue to compile, and make available, information about the WinHelp file format. If you have corrections or additions, I'd welcome them.

Figure 1: WinHelp Data Compression

Figure 2: Calling Tree for TOPICDMP.C



_UNDOCUMENTED CORNER_
edited by Andrew Schulman
written by Pete Davis

[LISTING ONE]

/* TOPICDMP.C -- Dumps topic file from a Windows .HLP or .MVB file.
Pete Davis, August 1993 With some modifications by Andrew Schulman,
September 1993. From Dr. Dobb's Journal, October 1993 */

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>
#include <ctype.h>
#include <limits.h>

#pragma pack(1)   /* Make sure we get byte alignment */
#include "whstruct.h"
#include "topicdmp.h"

HELPHEADER        HelpHeader;        /* Header for Help file.       */
WHIFSBTREEHEADER  WHIFSHeader;       /* WHIFS Header record         */
int               WHIFSLeafOne = -1; /* First WHIFS Leaf Node       */
long              FirstPageLoc;      /* Used by macros for b-trees  */
char              *PhrasesPtr;
int               Compressed;        /* Is there compression?       */
#define MSG(s)              { puts(s); return; }
#define FAIL(s)             { puts(s); exit(1); }
#define GET_STRING(f, s) \
    { char *p = (char *)(s); while (*p++ = fgetc(f)) ; *p = 0; }
#define BIT_SET(map, bit)   (((map) & (1 << (bit))) ? 1 : 0)
// Finds the first leaf in the WHIFS B-Tree
void WHIFSGetFirstLeaf(FILE *HelpFile) {
    int               CurrLevel = 1; /* Current Level in B-Tree */
    BTREEINDEXHEADER  CurrNode;      /* Current Node in B-Tree  */
    int               NextPage = 0;  /* Next Page to go to      */
    /* Go to the beginning of WHIFS B-Tree */
    fseek(HelpFile, HelpHeader.WHIFS, SEEK_SET);
    fread(&WHIFSHeader, sizeof(WHIFSHeader), 1, HelpFile);
    FirstPageLoc = HelpHeader.WHIFS + sizeof(WHIFSHeader);
    GotoWHIFSPage(WHIFSHeader.RootPage);  // macro in WHSTRUCT.H
    /* Find First Leaf */
    while (CurrLevel < WHIFSHeader.NLevels) {
       fread(&CurrNode, sizeof(CurrNode), 1, HelpFile);
       /* Next Page is conveniently the first byte of the page */
       fread(&NextPage, sizeof(int), 1, HelpFile);
       GotoWHIFSPage(NextPage);
       CurrLevel++;
    }
    /* First Leaf page is here */
    WHIFSLeafOne = NextPage;
}
// Get a WHIFS file by file number; returns offset and filename
void GetFile(FILE *HelpFile, DWORD Number, long *Offset, char *Name) {
    BTREENODEHEADER CurrentNode;
    DWORD           CurrPage, counter = 0;
    char            c, TempFile[19];
    /* Skip pages we don't need */
    CurrentNode.NextPage = WHIFSLeafOne;
    do {
        CurrPage = CurrentNode.NextPage;
        GotoWHIFSPage(CurrPage);
        fread(&CurrentNode, sizeof(CurrentNode), 1, HelpFile);
        counter += CurrentNode.NEntries;
    } while (counter < Number);

    for (counter -= CurrentNode.NEntries; counter <= Number; counter++) {
        GET_STRING(HelpFile, TempFile);
        fread(Offset, sizeof(long), 1, HelpFile);
    }
    strcpy(Name, TempFile);
}
// Get SysHeader to see if compression used on help file
void SysLoad(FILE *HelpFile, long FileStart) {
   SYSTEMHEADER    SysHeader;
   FILEHEADER      FileHdr;
   fseek(HelpFile, FileStart, SEEK_SET);
   fread(&FileHdr, sizeof(FileHdr), 1, HelpFile);
   fread(&SysHeader, sizeof(SysHeader), 1, HelpFile);
   if (SysHeader.Revision != 21)
       FAIL("Sorry, TOPICDMP only works with Windows 3.1 help files");
   Compressed = (SysHeader.Flags & COMPRESSION_310) ||
                (SysHeader.Flags & COMPRESSION_UNKN);
}
// Decides how many bytes to read, depending on number of bits set
int BytesToRead(BYTE BitMap) {
    int TempSum, counter;
    TempSum = 8;
    for (counter = 0; counter < 8; counter ++)
       TempSum += BIT_SET(BitMap, counter);
    return TempSum;
}
// Decompresses the data using Microsoft's LZ77 derivative.
long Decompress(FILE *HelpFile, long CompSize, char *Buffer) {
   long InBytes = 0;        /* How many bytes read in                    */
   WORD OutBytes = 0;       /* How many bytes written out                */
   BYTE BitMap, Set[16];    /* Bitmap and bytes associated with it       */
   long NumToRead;          /* Number of bytes to read for next group    */
   int  counter, Index;     /* Going through next 8-16 codes or chars    */
   int  Length, Distance;   /* Code length and distance back in 'window' */
   char *CurrPos;           /* Where we are at any given moment          */
   char *CodePtr;           /* Pointer to back-up in LZ77 'window'       */
   CurrPos = Buffer;
   while (InBytes < CompSize) {
      BitMap = (BYTE) fgetc(HelpFile);
      NumToRead = BytesToRead(BitMap);
      if ((CompSize - InBytes) < NumToRead)
          NumToRead = CompSize - InBytes;   // only read what we have left
      fread(Set, 1, (int) NumToRead, HelpFile);
      InBytes += NumToRead + 1;
      /* Go through and decode data */
      for (counter = 0, Index = 0; counter < 8; counter++) {
         /* It's a code, so decode it and copy the data */
         if (BIT_SET(BitMap, counter)) {
            Length = ((Set[Index+1] & 0xF0) >> 4) + 3;
            Distance = (256 * (Set[Index+1] & 0x0F)) + Set[Index] + 1;
            CodePtr = CurrPos - Distance;   // ptr into decompress window
            while (Length)
               { *CurrPos++ = *CodePtr++; OutBytes++; Length--; }
            Index += 2;  /* codes are 2 bytes */
         }
         else
            { *CurrPos++ = Set[Index++]; OutBytes++; }
      }
   }
   return OutBytes;
}
// Prints a Phrase from the Phrase table
void PrintPhrase(char *Phrases, int PhraseNum) {
    int *Offsets = (int *)Phrases;
    char *p = Phrases+Offsets[PhraseNum];
    while (p < Phrases + Offsets[PhraseNum + 1])
        { putchar(*p); p++; }
}
// Build up a table of phrases
void PhrasesLoad(FILE *HelpFile, long FileStart) {
   FILEHEADER      FileHdr;
   PHRASEHDR       PhraseHdr;
   int             *Offsets;
   char            *Phrases;
   long            DeCompSize;
   /* Go to the phrases file and get the headers */
   fseek(HelpFile, FileStart, SEEK_SET);
   fread(&FileHdr, sizeof(FileHdr), 1, HelpFile);
   fread(&PhraseHdr, sizeof(PhraseHdr), 1, HelpFile);
   /* Allocate space and decompress if it's compressed, else read in. */
   if (Compressed) {
      if ((Offsets = malloc((unsigned) (PhraseHdr.PhrasesSize +
          (PhraseHdr.NumPhrases + 1) * 2))) == NULL)
        MSG("No room to decompress |Phrases");
      Phrases = Offsets + fread(Offsets,2,PhraseHdr.NumPhrases+1, HelpFile);

      DeCompSize = Decompress(HelpFile, (long)FileHdr.FileSize -
          (sizeof(PhraseHdr) + 2 * (PhraseHdr.NumPhrases+1)), Phrases);
      if (DeCompSize != PhraseHdr.PhrasesSize) {
         printf("\n");
      }
   }
   else {
      if (!(Offsets=malloc((unsigned)(FileHdr.FileSize-sizeof(PhraseHdr)))))
         MSG("No room to decompress |Phrases");
      /* Backup 4 bytes for uncompressed Phrases (no PhrasesSize) */
      fseek(HelpFile, -4, SEEK_CUR);
      fread(Offsets, (unsigned) (FileHdr.FileSize - 4), 1, HelpFile);
   }
   PhrasesPtr = Phrases = (char *) Offsets;
}
/* Because the topic file is broken into 4k blocks, we'll have to handle
all the reads.  The idea is to filter out the TOPICBLOCKHEADERs and
do any decompression that needs doing. */
long TopicRead(BYTE *Dest, long NumBytes, FILE *HelpFile) {
   static long        CurrBlockLoc = 0;   /* Where we are in the block  */
   static BYTE        *DCmpBlock = NULL;  /* Block of uncompressed data */
   static long        DecompSize;         /* Size of block after decomp */
   static long        TopicStart, BlkNum; /* Start of |TOPIC file       */
   long               BytesLeft;          /* # Bytes left to return     */
   TOPICBLOCKHEADER   BlockHeader;
   TOPICLINK          *TempLink;
   long               EndOffset;
   /* If NumBytes = 0, then we're done and need to free memory */
   if (NumBytes == -1) { free(DCmpBlock); return 0; }
   if (!DCmpBlock) {
      if (Compressed) {
         if (! (DCmpBlock = malloc((unsigned) (4 * TopicBlockSize))))
             FAIL("Not enough memory to decompress |TOPIC file");
         TopicStart = ftell(HelpFile);
         BlkNum = 0;
      }
      else if (! (DCmpBlock = malloc((unsigned) TopicBlockSize)))
          FAIL("Not enough memory to handle |TOPIC file");
      DecompSize = 0;   /* Set initial size to 0 */
      /* Don't really need first block header, so get it out of the way */
      fread(&BlockHeader, sizeof(BlockHeader), 1, HelpFile);
   }
   BytesLeft = NumBytes;
   while (BytesLeft) {
      if (DecompSize == CurrBlockLoc) {
         BlkNum++;
         if (Compressed) {
            DecompSize = Decompress(HelpFile, (long)TopicBlockSize-1,
                (char *)DCmpBlock);
            /* Align ourselves at next 4k block */

            fseek(HelpFile, TopicStart + (4096L * BlkNum), SEEK_SET);
         }
         else
            DecompSize=fread(DCmpBlock,1,(unsigned) TopicBlockSize, HelpFile);
         CurrBlockLoc = 0;
         fread(&BlockHeader, sizeof(BlockHeader), 1, HelpFile);
         // Get offset of last topic link. (Don't need block #, hence 3FFFh)
         EndOffset = BlockHeader.LastTopicLink & 0x3FFF;
         TempLink = (TOPICLINK*)(DCmpBlock + EndOffset-sizeof(BlockHeader));
         /* Actual end of the data (Don't include header) */
         EndOffset += (TempLink->BlockSize - sizeof(BlockHeader));
         // If end shorter than topic block use it; else topic block full
         if (EndOffset > DecompSize) {
             /* Adjust DecompSize if crossing 4k boundary */
             EndOffset = TempLink->BlockSize-((TempLink->NextBlock) & 0x3FFF);
             DecompSize = (BlockHeader.LastTopicLink & 0x3FFF) + EndOffset;
         }
         else DecompSize = EndOffset;
     } /* If */
     *(Dest++) = *(DCmpBlock + (CurrBlockLoc++) );
      BytesLeft--;
   } /* While (BytesLeft) */
   return NumBytes;
}
// Displays a string from a topic link record. Checks for Phrase
// replacement and non-printable chars
void TopicStringPrint(char *String, long Length) {
   BYTE            Byte1, Byte2;
   int             CurChar, PhraseNum;
   long            counter;
   for (counter = 0; counter < Length; counter++) {
      CurChar = * ((char *) (String + counter));
      /* Check for Phrase replacement! */
      if ((CurChar > 0) && (CurChar < 10)) {
         Byte1 = (BYTE) CurChar;
         counter++;
         CurChar = * ((char *) (String + counter));
         Byte2 = (BYTE) CurChar;
         PhraseNum = (256 * (Byte1 - 1) + Byte2);
         /* If there's a remainder, we have a space after the phrase */
         PrintPhrase(PhrasesPtr, PhraseNum / 2);
         if (PhraseNum % 2) putchar(' ');
      }
      else if (isprint(CurChar)) putchar(CurChar);
      else putchar(' ');    // could do newline for 0x00 0x00
   }

}
// Dump |TOPIC file, doing decompression and phrase substitution
void TopicDump(FILE *HelpFile, long FileStart) {
   FILEHEADER      FileHdr;
   TOPICHEADER     *TopicHdr;
   TOPICLINK       TopicLink;
   /* Go to the TOPIC file and get the headers */
   fseek(HelpFile, FileStart, SEEK_SET);
   fread(&FileHdr, sizeof(FileHdr), 1, HelpFile);
   do {
      TopicRead((BYTE *) &TopicLink, sizeof(TopicLink) - 4, HelpFile);

      if (Compressed)
         TopicLink.DataLen2 = TopicLink.BlockSize - TopicLink.DataLen1;
      TopicLink.LinkData1=(BYTE *) malloc((unsigned)(TopicLink.DataLen1-21));
      if(!TopicLink.LinkData1)
          MSG("Error allocating TopicLink.LinkData1");
      TopicRead(TopicLink.LinkData1, TopicLink.DataLen1 - 21, HelpFile);
      if (TopicLink.DataLen2 > 0) {
          TopicLink.LinkData2=(BYTE*)malloc((unsigned)(TopicLink.DataLen2+1));
          if(!TopicLink.LinkData2)
             MSG("Error allocating TopicLink.LinkData2");
          TopicRead(TopicLink.LinkData2, TopicLink.DataLen2, HelpFile);
      }
      /* Display a Topic Header record */
      if (TopicLink.RecordType == TL_TOPICHDR) {
         TopicHdr = (TOPICHEADER *)TopicLink.LinkData1;
         printf("================ Topic Block Data ====================\n");
         printf("Topic#: %ld - ", TopicHdr->TopicNum);
         if (TopicLink.DataLen2 > 0)
            TopicStringPrint(TopicLink.LinkData2, (long) TopicLink.DataLen2);
         else printf("\n");
      }
      /* Show a 'text' type record. */
      else if (TopicLink.RecordType == TL_DISPLAY) {
         printf("-- Topic Link Data\n");
         TopicStringPrint(TopicLink.LinkData2, (long) TopicLink.DataLen2);
      }
      printf("\n\n");
      free(TopicLink.LinkData1);
      if (TopicLink.DataLen2 > 0) free(TopicLink.LinkData2);
   } while(TopicLink.NextBlock != -1);
}
void DumpFile(FILE *HelpFile) {
    long    FileOffset, PhraseOffset, TopicOffset;
    DWORD   i;
    char    FileName[32];

    fread(&HelpHeader, sizeof(HelpHeader), 1, HelpFile);
    if (HelpHeader.MagicNumber != 0x35F3FL)
        MSG("Fatal Error:  Not a valid WinHelp file");
    WHIFSGetFirstLeaf(HelpFile);
    TopicOffset = PhraseOffset = 0;
    for (i=0; i<WHIFSHeader.TotalWHIFSEntries; i++) {
       GetFile(HelpFile, i, &FileOffset, FileName);
       if (! strcmp(FileName, "|SYSTEM")) SysLoad(HelpFile, FileOffset);
       else if (! strcmp(FileName, "|Phrases")) PhraseOffset = FileOffset;
       else if (! strcmp(FileName, "|TOPIC")) TopicOffset = FileOffset;
       }
       if (PhraseOffset) PhrasesLoad(HelpFile, PhraseOffset);
       if (TopicOffset) TopicDump(HelpFile, TopicOffset);
       else MSG("No Topic file found!");
}
int main(int argc, char *argv[]) {
    char filename[40];
    FILE *HelpFile;
    if (argc < 2) {
       printf("Usage: TOPICDMP helpfile[.hlp]\n\n");
       printf("   helpfile      - Name of help file (.HLP or .MVB)\n\n");
       return EXIT_FAILURE;
       }
    if (! strchr(strcpy(filename, strupr(argv[1])), '.'))
       strcat(filename, ".HLP");
    if ((HelpFile = fopen(filename, "rb")) == NULL) {
       printf("Can't open %s!", filename);
       return EXIT_FAILURE;
    }
    DumpFile(HelpFile);
    fclose(HelpFile);
    return EXIT_SUCCESS;
}

.NET

Undocumented Corner

UNDOCUMENTED CORNER

Documenting Documentation: The Windows .HLP File Format, Part II

Pete Davis

by Andrew Schulman

|TTLBTREE

|TOPIC

Topic Offsets

|TOPIC Structures

WinHelp's Data Compression

How Compression is Applied

TOPICDMP

That's It?

Figure 1: WinHelp Data Compression

Figure 2: Calling Tree for TOPICDMP.C

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

.NET Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

.NET

Undocumented Corner

UNDOCUMENTED CORNER

Pete Davis

|TTLBTREE

|TOPIC

|TOPIC Structures

Figure 1: WinHelp Data Compression

Figure 2: Calling Tree for TOPICDMP.C

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

.NET Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content