DataBlade Technology for Web Development


Dr. Dobb's Sourcebook May/June 1997: DataBlade Technology for Web Development

Chris is a systems engineer for Informix's TelCo & Media group, and can be contacted at [email protected].


Sidebar: D-Tree Indexing

Creating a web site is easy: You take a message you want to broadcast, add images and HTML tags, publish on the Web, and let everything gently simmer. This is the model on which most web sites are built. For really powerful web sites, however, you need some form of connectivity between the web client and a database. One approach is to write CGI programs that invoke a query on the database, then write HTML code to format the results for display by the client. The disadvantage of this approach is that you usually must become familiar with yet another programming language, and programs tend to proliferate as the size and complexity of the site increase. Moreover, the site becomes piecemeal: HTML pages are scattered around the disk, coupled with CGI programs and a back-end database server.

In this article, I'll present an alternative approach based on the Informix Universal Server and its DataBlade technology. Figure 1 shows how the components fit together into what is called the "Universal Web Architecture." The software integrates with your existing web-server implementation by providing a single CGI program -- WebDriver -- that manages requests from clients for HTML pages (a native Netscape NSAPI version is also available). The actual pages and other site data (such as images, Java applets, sound, and the like) are stored in the database. This puts an end to the question, "How do I manage my site?" -- storage management is now undertaken by the database. An added bonus is that the backup/recovery mechanism of the site is also provided by the database.

DataBlade modules extend the general-purpose capabilities of the Informix server. Individual off-the-shelf DataBlade modules that provide data storage and management functionality for images, text, visual information retrieval, spatial data, web development, and the like are available. You can also write DataBlade modules from scratch. (For more information on DataBlade modules, see http://www.informix.com/.)

For instance, assume that the client in Figure 1 has requested the test page, indicated by the value of the MIval variable. WebDriver establishes and manages connections with the database and calls on the Web DataBlade to process the request. This means retrieving the page from the database and parsing it for MIxxxx tags, which provide the interface with the database via embedded SQL and provide mechanisms for handling variables and errors (see Table 1).

The results of the queries are returned to the Web DataBlade, which uses the HTML tags surrounding the column identifiers ($1 and $2 in Figure 1) to format the data. In essence, each HTML page has been transformed from a static document into a dynamic data template where the data is supplied by the database. This is much like painting GUIs in regular client applications, and the process can lead to a reduction in the number of pages needed to support your web site.
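The idea of a page as a data template can be sketched in a few lines of Python. This is only an illustration of positional substitution, not the Web DataBlade's implementation: the function name `expand_rows`, the sample rows, and the decision to HTML-escape each value are all assumptions made for the example.

```python
import re
from html import escape

def expand_rows(row_template, rows):
    """Repeat row_template once per result row, replacing $1, $2, ... with column values."""
    def render(row):
        def column(match):
            index = int(match.group(1)) - 1  # $1 refers to the first column
            return escape(str(row[index]))   # escaping is an assumption of this sketch
        return re.sub(r"\$(\d+)", column, row_template)
    return "".join(render(row) for row in rows)

# Hypothetical result set: (title, occurrence count)
rows = [("Annual report", 7), ("Install guide", 3)]
html = expand_rows("<tr><td>$1</td><td>$2</td></tr>\n", rows)
print(html)
```

The HTML around the placeholders is written once, and the result set decides how many times it is repeated.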

Since the interface to the database is via SQL, you can call from HTML pages any Informix Universal Server DataBlade module to extend the functionality of your web site.

The Search Engine Example

To illustrate how you can use the Universal Server and DataBlades, I've developed a document search engine (see Listings One and Two) that can be used on any Internet/intranet where documents are shared.

To build the site, I used the Informix Text DataBlade, which provides support for document management and free text searches. The Text DataBlade implements a single object of type doc, 27 functions (including Contains, which answers the question, "Does the document contain the list of specified keywords?"; and Occurrences, which addresses the question, "How many times does the list of keywords appear in the document?"), and the D-tree access method. (For more information, see the accompanying text box entitled "D-Tree Indexing.") When a document is inserted into a table as an object of type doc, the file is parsed into individual words. Syntactically, a word is a letter followed by zero or more letters, digits, or underscores. Punctuation is discarded. The DataBlade module then looks up the stem of each word in the wordnet tables that are provided as part of the DataBlade module. A stem word can be thought of as the root of the word. For example, plural nouns are usually stemmed to their singular versions. For verbs, suffixes connoting tense are discarded. The Text DataBlade is case insensitive.

The stemword list is then compared with the stopwords list, also provided as part of the Text DataBlade installation. Stopwords represent white-noise words, such as "is" and "it"; any matches are removed. The remaining words are then stored in the object that represents the document in the database. These words form the basis of the document's D-tree index, which is used to support fast free-text queries.
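The parse, stem, and stopword steps can be approximated in Python as follows. This is a rough sketch under stated assumptions: the `STEMS` and `STOPWORDS` tables are tiny hypothetical stand-ins for the wordnet and stopword tables shipped with the DataBlade, and `index_terms` is an illustrative name, not a DataBlade function.

```python
import re

# Hypothetical stand-ins for the wordnet and stopword tables.
STEMS = {"documents": "document", "searched": "search", "indexes": "index"}
STOPWORDS = {"is", "it", "the", "a"}

# A word is a letter followed by zero or more letters, digits, or underscores.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def index_terms(text):
    """Return the terms that would be kept for indexing, in document order."""
    words = [w.lower() for w in WORD.findall(text)]    # case insensitive
    stems = [STEMS.get(w, w) for w in words]           # unknown words stem to themselves
    return [s for s in stems if s not in STOPWORDS]    # drop white-noise words

terms = index_terms("It searched the documents; 3 indexes, file_2 is new.")
print(terms)
```

Punctuation and the bare digit are dropped by the word pattern, while `file_2` survives because it starts with a letter.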

The Informix Text DataBlade supports the document formats ascii, dvipostscript, nroff, postscript, qwertzgml, tex, and troff.

When inserting documents into the database, you must specify the format of the document so that the appropriate preprocessing can take place. Third-party DataBlades from Verity and PLS extend the formats supported and provide additional search facilities such as thesaurus and fuzzy matches.

The critical part of the search engine is the SQL string used to extract documents from the database:

select GetDocText(doc), title, Occurrences(doc, $keywords) from documents where Contains(doc, $keywords) order by Occurrences(doc, $keywords)

Each of the function calls is implemented by the Text DataBlade. $keywords contains the search text entered in the text field on the input form. The result set is ordered by the number of matches on the search words entered -- a simple form of document ranking.
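The filter-then-rank logic of that query can be mimicked over in-memory term lists. The sketch below makes several assumptions: `contains` requires every keyword to be present (the text does not say whether all or any keywords must match), results are sorted with the most matches first (the query does not state a direction), and the sample documents are invented.

```python
def occurrences(terms, keywords):
    """Count how many times any of the keywords appears in the document's terms."""
    return sum(terms.count(k) for k in keywords)

def contains(terms, keywords):
    """Assumed semantics: the document must contain every keyword."""
    return all(k in terms for k in keywords)

def search(docs, keywords):
    """Return (title, occurrence count) pairs for matching documents, best match first."""
    hits = [
        (title, occurrences(terms, keywords))
        for title, terms in docs.items()
        if contains(terms, keywords)
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical documents, already reduced to index terms.
docs = {
    "guide": ["install", "server", "server", "index"],
    "notes": ["server", "index", "index", "index"],
    "memo": ["meeting", "agenda"],
}
results = search(docs, ["server", "index"])
print(results)
```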

Example 1 shows what the query and formatting look like when included in the HTML template. Each column in the result set is identified by its position (starting from 1). Example 1 returns three columns in the result set. For the purposes of display, I use only columns two and three (indicated by $2 and $3). Any HTML formatting can be applied to the results set.

When the <?MIVAR> tag in the input form (see Example 2) is encountered, the Web DataBlade replaces the tags with the value of the variable specified. In this case, $WEB_HOME is replaced with the URL to WebDriver (on my machine, that's http://holly/cgi-bin/webdriver). The text input field is named "keywords." This name is supplied as an environment variable in the page that is called when the form is submitted. Hidden input fields specify which page in the database executes when the form is submitted. This is illustrated in Example 2 when MIval is set to search: When the page is submitted, WebDriver loads the search page from the database and sets the environment variables according to their values from the input form.
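To see how the submitted fields might become page variables, consider this Python sketch of decoding a POST body. It is not WebDriver's code: `form_variables` is a hypothetical helper, and the request body is a made-up example that reuses the field names from Listing One.

```python
from urllib.parse import parse_qs

def form_variables(body):
    """Decode a URL-encoded form body into a name -> value mapping (first value wins)."""
    return {name: values[0] for name, values in parse_qs(body).items()}

# Hypothetical body produced by submitting the search form.
body = "keywords=informix+datablade&search=Search&MIval=doc_search_v3"
variables = form_variables(body)
page = variables["MIval"]  # the hidden field names the page to load
print(page, variables["keywords"])
```

By default `parse_qs` drops fields with blank values, so in this sketch an empty text field simply does not appear among the variables.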

Consider what happens if you choose not to enter any keywords and simply submit the form. Since keywords will be empty, the search returns an empty table. It would be better to display a message informing users why the search returned nothing. The Web DataBlade lets you do this by supporting logical blocks within an HTML page. Each block is delimited by the <?MIBLOCK>...<?/MIBLOCK> tags. Each block can be associated with a condition, as in Example 3. This condition states that the variable keywords must exist and not be empty. The block can be embedded in the same page as the one used to process the search request. In effect, you can specify a set of preconditions that must be satisfied in order for the page to execute successfully. There is no restriction on the number of blocks a page can contain, and blocks can be nested.
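Conditional blocks amount to pairing each fragment of the page with a predicate over the page variables. A minimal Python sketch of that idea follows; `render_blocks`, `exists`, and `missing` are hypothetical names, and nesting is left out for brevity.

```python
def exists(name):
    """Condition: the variable is present and not empty."""
    return lambda variables: bool(variables.get(name))

def missing(name):
    """Condition: the variable is absent or empty."""
    return lambda variables: not variables.get(name)

def render_blocks(blocks, variables):
    """Emit the body of every block whose condition holds for the given variables."""
    return "".join(body for condition, body in blocks if condition(variables))

# One page holding both the error message and the normal output.
page = [
    (missing("keywords"), "<H2>You must specify a word or words.</H2>"),
    (exists("keywords"), "<H2>Searching...</H2>"),
]

print(render_blocks(page, {}))
print(render_blocks(page, {"keywords": "server"}))
```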

What about errors from the database? The Web DataBlade provides a convenient means of trapping errors generated by database interactions. The error handler in Example 4 inserts the error number and current timestamp into the web_errors table whenever an error is returned. The scope of the handler extends from the point in the HTML page where it is defined to the end of the page. Error handlers can be defined by using the ERR attribute of MIERROR. This is illustrated in Example 5, which specifies that the NO_KEYWORDS error handler is to be executed if the highlight line fails. As with MIBLOCK tags, there is no restriction on the number of error handlers a page can include.
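The pattern of logging a failed query into an error table can be illustrated with Python's built-in sqlite3 module. This is a sketch only: it uses SQLite rather than Informix, stores the error message instead of an error number, and the column names and the `run_query` helper are assumptions; only the table name web_errors comes from the text.

```python
import sqlite3
from datetime import datetime, timezone

def run_query(conn, sql, params=()):
    """Run a query; on a database error, record it in web_errors and return no rows."""
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error as error:
        conn.execute(
            "INSERT INTO web_errors (message, logged_at) VALUES (?, ?)",
            (str(error), datetime.now(timezone.utc).isoformat()),
        )
        return []

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_errors (message TEXT, logged_at TEXT)")

# This query fails because the table does not exist, so the handler logs it.
rows = run_query(conn, "SELECT title FROM no_such_table")
logged = conn.execute("SELECT message FROM web_errors").fetchall()
print(rows, logged)
```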

DDJ

Listing One

<HTML>
<HEAD>
<TITLE>Document Search Engine</TITLE>
</HEAD>
<BODY>

<P>
<B>Welcome</B> to the document search engine. This engine will allow you to search for documents using any combination of keywords. Matched documents can be viewed on-line.
</P>

<P>This WEB site was built using the following DataBlades:</P>

<UL>
<LI>Informix Web DataBlade 2.1 </LI>

<LI>Informix Text DataBlade 1.3 </LI>
</UL>

<?MISQL SQL="select img from web_pics where ID = 'marble_hbar';">
<P>
<IMG SRC="$WEB_HOME?LO=$1&type=image/gif" HEIGHT=14 WIDTH=763>
</P>
<?/MISQL>

<P>
<FORM ACTION=<?MIVAR>$WEB_HOME<?/MIVAR> METHOD="Post">
Enter words to search on (space delimited): 
<INPUT NAME="keywords" TYPE="text">
<INPUT NAME="search" TYPE="submit" VALUE="Search">
<INPUT TYPE="hidden" NAME="MIval" VALUE="doc_search_v3">
</FORM>
</P>

<?MISQL SQL="select img from web_pics where ID = 'marble_hbar';">
<P>
<IMG SRC="$WEB_HOME?LO=$1&type=image/gif" HEIGHT=14 WIDTH=763>
</P>
<?/MISQL>

<CENTER>
<P><FONT SIZE=-2>This page <I>needs</I> your help to be successful. If you have a document that you think would benefit from being indexed here then simply drop me a line: <A HREF="mailto:[email protected]">Chris Trueman</A> </FONT></P></CENTER>

</BODY>
</HTML>


Listing Two

<HTML>

<?MIBLOCK COND=($keywords.nxst.)>
<H1>
Error
</H1>
<H2>
You must specify a word or words to be matched against the documents.
</H2>
<?/MIBLOCK>

<?MIBLOCK COND=($keywords.xst.)>
<H2>
You requested a document search on: <?MIVAR>$keywords<?/MIVAR>
</H2>

<P align=center>
<table width=100% border=1>
<tr>
<th>Document</th>
<th>Frequency (%)</th>
<th>Occurrences</th>
</tr>

<?MISQL SQL="select raw_doc, name, Round(100 * WeightContains('documents_index', ctid, '$keywords'))::integer, Occurrences(doc, '$keywords') from documents where Contains(doc, '$keywords');">
<tr>
<td><A HREF="$WEB_HOME?LO=$1&type=text/html">$2</A></td>
<td>$3%</td>
<td>$4</td>
</tr>
<?/MISQL>

</table>
</P>

<?/MIBLOCK>
</html>


