DataBlade Technology for Web Development

DataBlade modules extend the general-purpose capabilities of the Informix server. To illustrate how you can use the Informix Universal Server and DataBlades, Chris develops a document search engine that can be used on any Internet/intranet where documents are shared.

June 01, 1997
URL:http://www.drdobbs.com/web-development/datablade-technology-for-web-development/184410386

Dr. Dobb's Sourcebook May/June 1997: DataBlade Technology for Web Development

Chris is a systems engineer for Informix's TelCo & Media group, and can be contacted at [email protected].

Sidebar: D-Tree Indexing

Creating a web site is easy: You take a message you want to broadcast, add images and HTML tags, publish on the Web, and let everything gently simmer. This is the model on which most web sites are built. For really powerful web sites, however, you need some form of connectivity between the web client and a database. One approach is to write CGI programs that invoke a query on the database, then write HTML code to format the results for display by the client. The disadvantage of this approach is that you usually must become familiar with yet another programming language, and programs tend to proliferate as the size and complexity of the site increase. Moreover, the site becomes piecemeal: HTML pages are scattered around the disk, coupled with CGI programs and a back-end database server.

In this article, I'll present an alternative approach based on the Informix Universal Server and its DataBlade technology. Figure 1 shows how the components fit together into what is called the "Universal Web Architecture." The software integrates with your existing web-server implementation by providing a single CGI program -- WebDriver -- that manages requests from clients for HTML pages (a native Netscape NSAPI version is also available). The actual pages and other site data (such as images, Java applets, sound, and the like) are stored in the database. This puts an end to the question, "How do I manage my site?" -- storage management is now undertaken by the database. An added bonus is that the backup/recovery mechanism of the site is also provided by the database.

DataBlade modules extend the general-purpose capabilities of the Informix server. Off-the-shelf DataBlade modules are available that provide data storage and management functionality for images, text, visual information retrieval, spatial data, web development, and the like. You can also write DataBlade modules from scratch. (For more information on DataBlade modules, see http://www.informix.com/.)

For instance, assume that the client in Figure 1 has requested the test page, indicated by the value of the MIval variable. WebDriver establishes and manages connections with the database and calls on the Web DataBlade to process the request. This means retrieving the page from the database and parsing it for MIxxxx tags, which provide the interface with the database via embedded SQL and provide mechanisms for handling variables and errors (see Table 1).

The results of the queries are returned to the Web DataBlade, which uses the HTML tags surrounding the column identifiers ($1 and $2 in Figure 1) to format the data. In essence, each HTML page has been transformed from a static document into a dynamic data template where the data is supplied by the database. This is much like painting GUIs in regular client applications, and the process can lead to a reduction in the number of pages needed to support your web site.

Since the interface to the database is via SQL, you can call from HTML pages any Informix Universal Server DataBlade module to extend the functionality of your web site.

The Search Engine Example

To illustrate how you can use the Universal Server and DataBlades, I've developed a document search engine (see Listings One and Two) that can be used on any Internet/intranet where documents are shared.

To build the site, I used the Informix Text DataBlade, which provides support for document management and free text searches. The Text DataBlade implements a single object of type doc, 27 functions (including Contains, which answers the question, "Does the document contain the list of specified keywords?"; and Occurrences, which addresses the question, "How many times does the list of keywords appear in the document?"), and the D-tree access method. (For more information, see the accompanying text box entitled "D-Tree Indexing.") When a document is inserted into a table as an object of type doc, the file is parsed into individual words. Syntactically, a word is a letter followed by zero or more letters, digits, or underscores. Punctuation is discarded. The DataBlade module then looks for the stem of each word in the wordnet tables that are provided as part of the DataBlade module. A stem word can be thought of as the root of the word. For example, plural nouns are usually stemmed to their singular versions. For verbs, suffixes connoting tense are discarded. The Text DataBlade is case insensitive.

The stemword list is then compared with the stopwords list, also provided as part of the Text DataBlade installation. Stopwords represent white-noise words, such as "is" and "it"; any matches are removed. The remaining words are then stored in the object that represents the document in the database. These words form the basis of the document's D-tree index, which is used to support fast free-text queries.
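The preprocessing steps just described (split into words, stem, drop stopwords) can be sketched in a few lines of Python. This is only an illustration, not the Text DataBlade's implementation: the STOPWORDS set and the suffix-stripping stem function are hypothetical stand-ins for the stopword list and wordnet tables that ship with the module.

```python
import re

# Hypothetical stand-in for the stopword list shipped with the Text DataBlade.
STOPWORDS = {"is", "it", "the", "a", "an", "of", "and", "to", "in"}

# A word is a letter followed by zero or more letters, digits, or underscores.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def stem(word: str) -> str:
    """Toy stemmer: strips a few common suffixes (the real module uses wordnet tables)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text: str) -> list[str]:
    """Return the lowercased, stemmed, non-stopword terms of a document."""
    words = [w.lower() for w in WORD_RE.findall(text)]  # case insensitive
    stems = [stem(w) for w in words]
    return [s for s in stems if s not in STOPWORDS]
```

Punctuation and bare numbers never match the word pattern, so they drop out before stemming. The toy stemmer is deliberately crude and will mangle some words; a table-driven stemmer avoids that.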

The Informix Text DataBlade supports the document formats ascii, dvipostscript, nroff, postscript, qwertzgml, tex, and troff.

When inserting documents into the database, you must specify the format of the document so that the appropriate preprocessing can take place. Third-party DataBlades from Verity and PLS extend the formats supported and provide additional search facilities such as thesaurus and fuzzy matches.

The critical part of the search engine is the SQL string used to extract documents from the database:

select GetDocText(doc), title, Occurrences(doc, $keywords) from documents where Contains(doc, $keywords) order by Occurrences(doc, $keywords)

Each of the function calls is implemented by the Text DataBlade. $keywords contains the search text entered in the text field on the input form. The result set is ordered by the number of matches on the search words entered -- a simple form of document ranking.
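To make the ranking idea concrete, here is a rough Python analogue of the query. The semantics are assumptions made for illustration, not taken from the Text DataBlade documentation: contains is assumed to require every keyword, occurrences is assumed to count all keyword hits, and the in-memory docs dictionary is a hypothetical stand-in for the table.

```python
def occurrences(terms: list[str], keywords: list[str]) -> int:
    """Total number of times any keyword appears in the document's terms."""
    wanted = set(keywords)
    return sum(1 for t in terms if t in wanted)

def contains(terms: list[str], keywords: list[str]) -> bool:
    """True when every keyword appears at least once (an assumed semantics)."""
    present = set(terms)
    return all(k in present for k in keywords)

def search(docs: dict[str, list[str]], keywords: list[str]) -> list[tuple[str, int]]:
    """Return (title, occurrences) for matching documents, best match first."""
    if not keywords:
        return []  # nothing to search for
    hits = [(title, occurrences(terms, keywords))
            for title, terms in docs.items()
            if contains(terms, keywords)]
    # Sort by descending count; ties fall back to title for a stable order.
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```

The sketch sorts in descending order so that the best match comes first; that choice is mine and is not stated in the query above.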

Example 1 shows what the query and formatting look like when included in the HTML template. Each column in the result set is identified by its position (starting from 1). Example 1 returns three columns in the result set. For the purposes of display, I use only columns two and three (indicated by $2 and $3). Any HTML formatting can be applied to the result set.
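The way a row template is repeated once per result row, with positional placeholders filled in, can be imitated in Python. The function name expand_rows is hypothetical and this is not Web DataBlade code; it only mirrors the $1, $2, ... convention described above.

```python
import re

PLACEHOLDER_RE = re.compile(r"\$(\d+)")

def expand_rows(template: str, rows: list[tuple]) -> str:
    """Repeat `template` once per row, replacing $1, $2, ... with column values."""
    def render(row: tuple) -> str:
        def column(match: re.Match) -> str:
            index = int(match.group(1))
            # Leave the placeholder untouched if the row has no such column.
            if 1 <= index <= len(row):
                return str(row[index - 1])
            return match.group(0)
        return PLACEHOLDER_RE.sub(column, template)
    # No HTML escaping is done here; a real template engine would need it.
    return "".join(render(row) for row in rows)
```

For instance, calling expand_rows with the template "<TR><TD>$2</TD><TD>$3</TD></TR>" and a list of three-column rows yields one table row per result, using only the second and third columns.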

When the <?MIVAR> tag in the input form (see Example 2) is encountered, the Web DataBlade replaces the tags with the value of the variable specified. In this case, $WEB_HOME is replaced with the URL to WebDriver (on my machine, that's http://holly/cgi-bin/webdriver). The text input field is named "keywords." This name is supplied as an environment variable in the page that is called when the form is submitted. Hidden input fields specify which page in the database executes when the form is submitted. This is illustrated in Example 2 when MIval is set to search: When the page is submitted, WebDriver loads the search page from the database and sets the environment variables according to their values from the input form.
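The dispatch described above (MIval selects the page, and the remaining form fields become variables in it) can be approximated with Python's standard library. This is a sketch under stated assumptions: the PAGES dictionary is a hypothetical stand-in for the pages stored in the database, handle_request is an invented name, and string.Template's $name syntax merely resembles the variable syntax used by the Web DataBlade.

```python
from string import Template
from urllib.parse import parse_qs

# Hypothetical stand-in for the table of pages stored in the database.
PAGES = {
    "search": "<H2>You requested a document search on: $keywords</H2>",
}

def handle_request(form_body: str, pages: dict[str, str] = PAGES) -> str:
    """Pick the page named by MIval and fill in variables from the form."""
    # parse_qs drops fields submitted with an empty value by default.
    fields = {name: values[0] for name, values in parse_qs(form_body).items()}
    page = pages[fields.pop("MIval")]
    # safe_substitute leaves unknown $variables in place instead of raising.
    return Template(page).safe_substitute(fields)
```

A submitted form body such as "keywords=web+index&search=Search&MIval=search" selects the search page and fills in keywords. If the keywords field is submitted empty, the variable is simply absent in this sketch and the placeholder is left unexpanded.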

Consider what happens if you choose not to enter any keywords and simply submit the form. Since keywords will be empty, the search returns an empty table. It would be better to display a message informing users why the search returned nothing. The Web DataBlade lets you do this by supporting logical blocks within an HTML page. Each block is delimited by the <?MIBLOCK>...<?/MIBLOCK> tags. Each block can be associated with a condition, as in Example 3, where the block is displayed only when the variable keywords does not exist. Listing Two pairs it with a second block whose condition states that keywords must exist and not be empty, so the search runs only in that case. Both blocks can be embedded in the same page as the one used to process the search request. In effect, you can specify a set of preconditions that must be satisfied in order for the page to execute successfully. There is no restriction on the number of blocks a page can contain, and blocks can be nested.

What about errors from the database? The Web DataBlade provides a convenient means of trapping errors generated by database interactions. The error handler in Example 4 inserts the error number and current timestamp into the web_errors table whenever an error is returned. The scope of the handler extends from the point in the HTML page where it is defined to the end of the page. Named error handlers can be defined by using the ERR attribute of MIERROR. This is illustrated in Example 5, which specifies that the NO_KEYWORDS error handler is to be executed if the MIVAR line fails. As with MIBLOCK tags, there is no restriction on the number of error handlers a page can include.

DDJ

Listing One

<HTML>
<HEAD>
<TITLE>Document Search Engine

</TITLE>
</HEAD>
<BODY>

<P>
<B>Welcome</B> to the document search engine. This engine will allow you to search for documents using any combination of keywords. Matched documents can be viewed on-line.
</P>

<P>This WEB site was built using the following DataBlades:</P>

<UL>
<LI>Informix Web DataBlade 2.1 </LI>

<LI>Informix Text DataBlade 1.3 </LI>
</UL>

<?MISQL SQL="select img from web_pics where ID = 'marble_hbar';">
<P>
<IMG SRC="$WEB_HOME?LO=$1&type=image/gif" HEIGHT=14 WIDTH=763>
</P>
<?/MISQL>

<P>
<FORM ACTION=<?MIVAR>$WEB_HOME<?/MIVAR> METHOD="Post">
Enter words to search on (space delimited): 
<INPUT NAME="keywords" TYPE="text">
<INPUT NAME="search" TYPE="submit" VALUE="Search">
<INPUT TYPE="hidden" NAME="MIval" VALUE="doc_search_v3">
</FORM>
</P>

<?MISQL SQL="select img from web_pics where ID = 'marble_hbar';">
<P>
<IMG SRC="$WEB_HOME?LO=$1&type=image/gif" HEIGHT=14 WIDTH=763>
</P>
<?/MISQL>

<CENTER>
<P><FONT SIZE=-2>This page <I>needs</I> your help to be successful. If you have a document that you think would benefit from being indexed here then simply drop me a line: <A HREF="mailto:[email protected]">Chris Trueman</A> </FONT></P></CENTER>

</BODY>
</HTML>

Listing Two

<HTML>

<?MIBLOCK COND=($keywords.nxst.)>
<H1>
Error
</H1>
<H2>
You must specify a word or words to be matched against the documents.
</H2>
<?/MIBLOCK>

<?MIBLOCK COND=($keywords.xst.)>
<H2>
You requested a document search on: <?MIVAR>$keywords<?/MIVAR>
</H2>

<P align=center>
<table width=100% border=1>
<tr>
<th>Document</th>
<th>Frequency (%)</th>
<th>Occurrences</th>
</tr>

<?MISQL SQL="select raw_doc, name, Round(100 * WeightContains('documents_index', ctid, '$keywords'))::integer, Occurrences(doc, '$keywords') from documents where Contains(doc, '$keywords');">
<tr>
<td><A HREF="$WEB_HOME?LO=$1&type=text/html">$2</A></td>
<td>$3%</td>
<td>$4</td>
</tr>
<?/MISQL>

</table>
</P>

<?/MIBLOCK>
</html>

By Chris Trueman

Dr. Dobb's Sourcebook May/June 1997

<CENTER>
<TABLE WIDTH=75% BORDER=1 CELLSPACING=3>
<TR>
<TH>Document Title</TH>
<TH>Occurrences</TH>
</TR>
<?MISQL SQL="select GetDocText(doc), title, Occurrences(doc, $keywords) from documents where Contains(doc, $keywords);">
<TR>
<TD>$2</TD>
<TD>$3</TD>
</TR>
<?/MISQL>
</TABLE>
</CENTER>

Example 1: Query and formatting.

<P>
<FORM ACTION=<?MIVAR>$WEB_HOME<?/MIVAR> METHOD="Post">
Enter words to search on (space delimited): 
<INPUT NAME="keywords" TYPE="text">
<INPUT NAME="search" TYPE="submit" VALUE="Search">
<!-- hidden form fields -->
<INPUT TYPE="hidden" NAME="MIval" VALUE="search">
</FORM>
</P>

Example 2: Using the <?MIVAR> tag.

<?MIBLOCK COND=($keywords.nxst.)>
<H1>
Error
</H1>
<H2>
You must specify a word or words to be matched against the documents.
</H2>
<?/MIBLOCK>

Example 3: Associating a block with a condition.

<?MIERROR TAG=MISQL><?MISQL SQL="insert into web_errors values ('$MI_ERRORCODE', current_timestamp);"><?/MISQL><?/MIERROR>

Example 4: The error handler inserts the error number and current timestamp.

<?MIERROR ERR=NO_KEYWORDS TAG=MIVAR><B>KEYWORDS is not defined</B><?/MIERROR>
<?MIVAR ERR=NO_KEYWORDS>$keywords<?/MIVAR>

Example 5: Using the ERR attribute.

Figure 1: Universal Web Architecture.

Figure 2: Typical index.

Figure 3: References from the index.

D-tree Indexing

In a D-tree index, each entry points to multiple tuples in the table. For example, if an index is created as in Figure 2(a), and the table is populated as in Figure 2(b), then the D-tree will contain the word list in Figure 2(c), except that case is ignored.

The references from the index to the table tuples would, assuming the tuples are numbered 1 through 3, look like Figure 3. This list is represented as an ordered list, but in reality, it would be implemented as a B-tree.

A multiword search string requires one index scan per word. Since the index is a B-tree, worst-case performance is proportional to m log(n), where m is the number of words in the search string and n is the total number of words indexed. The value of n is greater than the tuple count but less than the number of words in all the documents. In the above example, the tuple count is 3, there are 16 words in total, and the index contains just 10 entries.
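The structure can be approximated in Python as an inverted index: a sorted word list in which each entry refers to the tuples containing that word. This is a sketch under stated assumptions, not the D-tree implementation. A sorted list searched with bisect stands in for the B-tree, the function names are invented, and combining the per-word results with a logical AND is my assumption.

```python
from bisect import bisect_left

def build_index(rows: dict[int, str]) -> tuple[list[str], list[list[int]]]:
    """Map each distinct lowercased word to the sorted tuple ids that contain it."""
    postings: dict[str, set[int]] = {}
    for tuple_id, text in rows.items():
        for word in text.lower().split():
            postings.setdefault(word, set()).add(tuple_id)
    words = sorted(postings)
    return words, [sorted(postings[w]) for w in words]

def lookup(words: list[str], refs: list[list[int]], word: str) -> list[int]:
    """Binary search for one word: O(log n) comparisons over n index entries."""
    key = word.lower()
    i = bisect_left(words, key)
    if i < len(words) and words[i] == key:
        return refs[i]
    return []

def search_all(words: list[str], refs: list[list[int]], query: str) -> list[int]:
    """One index scan per query word (m scans), keeping tuples that match every word."""
    result = None
    for word in query.split():
        ids = set(lookup(words, refs, word))
        result = ids if result is None else result & ids
    return sorted(result or [])
```

With a hypothetical three-row table such as {1: "The quick brown fox", 2: "the lazy dog", 3: "Quick dog"}, the index holds six entries for nine words in total, which illustrates why n lies between the tuple count and the total word count.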

Thanks to Paul Brown for submitting the original tech note on D-trees to the Informix Knowledgebase.

-- C.T.