Enterprise Search: The Next Frontier

In an expanding universe of electronic media, new tools boldly delve into strange new worlds of unstructured documents. Information search and retrieval is as old as the computer itself, but today it's an imperative in an audit-happy, data-heavy climate.


December 01, 2004
URL:http://www.drdobbs.com/enterprise-search-the-next-frontier/184415231

Twenty years ago, relational databases entered the enterprise. Around the same time, corporations and government agencies, overflowing with years of accumulated data from the widespread use of computers, discovered search technology. Today, a more powerful class of information retrieval technology is making a similar foray into the corporate world.

"Enterprise search is emerging as a development platform for a wide range of applications," says Prabhakar Raghavan, chief technology officer for Verity, a search tool vendor in Sunnyvale, Calif. "The volume and value of unstructured data in a typical enterprise is doubling every 12 months. It's this data that, increasingly, contains the information of greatest value to the enterprise." And it's this amorphous content that companies are struggling to shape into a strategic advantage rather than a looming liability.

Today's search engines have surpassed mere keyword matching, embracing artificial intelligence in the form of natural language query, concept extraction, classification and taxonomy. Enterprise search engines differentiate themselves from Web search engines in the way they handle voluminous amounts of data from different sources and in varying formats. The enterprise products can search files regardless of format and sift through multiple repositories. They support many operating systems (some obsolete) and allow users to set up classification, taxonomies, personalization, profiling, agent alert, community/social networking and collaborative filtering, real-time analysis and so on. They also provide more sophisticated SDKs and manage enterprise-wide security concerns.

Nevertheless, you must remember you're dealing with software: No matter how elegant the algorithms, a computer program can never truly understand the concepts involved in a document. Therefore, to a certain degree, every search feature beyond keyword matching is guesswork.

The Search Supernova

In the last few years, the search engine market has consolidated, thanks to a number of mergers and acquisitions. Although hundreds of search engines remain on the market, only about a dozen are enterprise-grade. Furthermore, most search engine vendors have unbundled enterprise and Web search, with one exception: Google's search server appliance serves both the enterprise and Web-search markets.

Ironically, while some vendors, such as FAST and Convera, still call themselves "search engine vendors," others have tried to differentiate their position: for example, IBM's and Entopia's "knowledge management system," Verity's "intellectual capital management system," and Autonomy's "infrastructure software of information processing." Names aside, the search engine is the core technology and foundation of all of these solutions. Most vendors tune their products to particular vertical industries and applications, with their own taxonomy and classification features. Verity, for instance, offers an intellectual capital management system that's widely used in technology industry applications.

What to Look For in a Look-Up Tool?

Enterprise search spans a wide spectrum, from basic search functions to a more advanced development platform for mission-critical applications, and it's a critical field: The average knowledge worker spends nearly a quarter of the day looking for information, according to research firms IDC and Delphi Group. Enterprises face a complicated trade-off: low-priced simplicity versus high-priced performance. They may use Microsoft SharePoint and Google's Search Appliance for simplicity, low cost and speed in their portal sites and intranets; Verity's Ultraseek for departmental-level and small enterprise applications; and Autonomy's Enterprise Search Server and Verity's K2 Enterprise Search for sophisticated enterprise-wide applications.

All search engines boast basically the same features and functionality—the difference lies in how well these features are implemented and how relevant the results are. The challenge for every search vendor is finding truly relevant docs and improving the signal-to-noise ratio. Entopia claims that three things affect search results: semantic analysis, context extraction and user activity. Convera claims that its internal semantic network does the trick, while Verity's search results rest on its context extraction feature. Entopia's newest focus is user activity, an extension of Google's link analysis to determine a document's importance by references to it.
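The idea of judging a document's importance by the references to it can be illustrated with a simplified, PageRank-style iteration. This is only a sketch under stated assumptions: the tiny link graph, the damping factor of 0.85 and the function name are illustrative, and nothing here represents how Google or Entopia actually compute relevance.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score each document by the documents that reference it.

    links maps a document name to the list of documents it references.
    """
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    n = len(docs)
    scores = {doc: 1.0 / n for doc in docs}

    for _ in range(iterations):
        new_scores = {doc: (1.0 - damping) / n for doc in docs}
        for doc in docs:
            targets = links.get(doc, [])
            if targets:
                # Pass this document's score on to the documents it references.
                share = damping * scores[doc] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A document with no outgoing references spreads its score evenly.
                share = damping * scores[doc] / n
                for other in docs:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical intranet: three memos reference the policy document.
example_links = {
    "memo-a": ["policy"],
    "memo-b": ["policy"],
    "memo-c": ["policy", "memo-a"],
    "policy": [],
}
ranked = sorted(link_scores(example_links).items(), key=lambda kv: kv[1], reverse=True)
```

In this toy graph the policy document ends up first in `ranked`, because every memo points to it. User-activity signals of the kind Entopia describes would be additional inputs on top of such a score.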

Autonomy claims to lead in collaboration and expertise networks. Verity says it pulls ahead in classification, taxonomy and concept extraction. Entopia boasts an edge in relevance, analyzing user activity and context around the content. Not surprisingly, given the company name, iPhrase argues that its strength is in natural language query. Google is mainly used in intranets. Convera asserts that it's the best choice for indexing and searching multimedia content. And Microsoft's offering, true to form, is the least pricey.

What are the functional characteristics that you should expect from an enterprise-grade search engine?

Multiple repositories. Enterprise search must deal with multiple repositories, ranging from files on local desktop disks, through e-mail and folders in groupware and data in databases, to files in content management systems.

Data formats. Computers come and go, but data persists—so long as it's still readable. Enterprise search must deal with most, if not all, of the nearly 300 widely used file formats in today's enterprise data warehouse, from the most popular Microsoft Office files (Word, Excel, PowerPoint), through Corel Suite, PDF, HTML, XML and Lotus 1-2-3, to obsolete Microsoft Works (DOS).

Filters and connectors. Filters extract all the text, metadata and other information from a file for search purposes; connectors are gateway interfaces that provide access to the diverse content repositories across the enterprise. Both come in three forms: owned by the vendor, licensed from other vendors, and interfaces provided by the vendor for third-party or customer-made modules. Each has its pros and cons. For commonly used file formats and repositories, vendor-owned filters and connectors offer better control and tighter integration. The possible downside? The search vendor might not be able to keep them updated as quickly as a vendor that focuses on filters and connectors. The interface for third-party or customer-made modules mainly serves special customer needs, such as obsolete file formats or a customer's own file formats.
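An interface for third-party or customer-made filters might look something like the following sketch. It assumes a simple registry keyed by file extension; the names `FILTERS`, `register_filter` and `extract_text` are hypothetical and are not taken from any vendor's SDK.

```python
from pathlib import Path

# Registry of text-extraction filters keyed by file extension (hypothetical design).
FILTERS = {}


def register_filter(extension):
    """Decorator that registers a filter function for one file extension."""
    def decorator(func):
        FILTERS[extension.lower()] = func
        return func
    return decorator


@register_filter(".txt")
def plain_text_filter(data: bytes) -> str:
    # Stands in for a vendor-owned filter: decode the bytes as UTF-8 text.
    return data.decode("utf-8", errors="replace")


@register_filter(".csv")
def custom_csv_filter(data: bytes) -> str:
    # Stands in for a customer-made filter: turn comma-separated cells into words.
    return data.decode("utf-8", errors="replace").replace(",", " ")


def extract_text(filename: str, data: bytes) -> str:
    """Pick the filter by extension; fail loudly for unsupported formats."""
    extension = Path(filename).suffix.lower()
    if extension not in FILTERS:
        raise ValueError(f"no filter registered for {extension!r}")
    return FILTERS[extension](data)
```

With a registry like this, support for an obsolete or in-house format is added by registering one more function, without touching the indexing code that calls `extract_text`.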

Cross-platform support. A typical enterprise IT infrastructure spans a variety of platforms, from Windows NT/2000/XP, to Linux and all sorts of Unix (Solaris, HP-UX, AIX), to Mac OS, as well as legacy platforms such as DEC Unix. Start asking the tough questions: What should you be concerned with? What platforms does the search engine support? On what platforms will the search solutions be deployed? And are they all supported?

Scalability. As the number of users grows and the amount of information accessed increases, the search engine must keep pace by adding more servers or processors. This scalability can be achieved with a multitier parallel computing architecture that takes advantage of the multithreading and concurrency features of symmetric multiprocessing (SMP) hardware and software platforms. Factors that make applications scalable include the capacity to accommodate a large number of documents per collection (up to a terabyte), the number of collections per server, load balancing and distributed search across multiple servers, and geo-efficiency (a single, seamless service that creates and uses local replicas in a geographically distributed environment to improve local performance and reduce resource overhead).
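Distributed search on multiple servers implies a merge step: each server returns its own best hits, and something must combine them into one ranked list. The sketch below assumes that every server returns (score, document ID) pairs sorted by descending score and that scores are comparable across servers; both are simplifying assumptions, and the function name is made up for illustration.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, k):
    """Merge per-server result lists into one global top-k list.

    Each list holds (score, doc_id) tuples already sorted by
    descending score, as each server would return them.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


# Hypothetical results from two search servers.
server_a = [(0.92, "a1"), (0.40, "a2")]
server_b = [(0.88, "b1"), (0.75, "b2")]
top_three = merge_shard_results([server_a, server_b], 3)
```

Because the merge is lazy, only the first `k` hits are ever pulled from the per-server lists.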

Performance. In multiple-server environments, some servers may be overloaded while others are underutilized, leading to inconsistent response times. Concurrent load-balancing brokers distribute the load on multiple servers and disk paths.
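One common way a broker can even out response times is to send each new query to the server with the fewest queries in flight. The class below is a minimal sketch of that policy, not a description of any vendor's broker; the class and method names are hypothetical, and a production broker would also need locking and health checks.

```python
class LeastLoadedBroker:
    """Send each query to the server with the fewest active queries."""

    def __init__(self, servers):
        # Number of queries currently running on each server.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        """Choose the least-loaded server and count the new query against it."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Mark one query on the given server as finished."""
        self.active[server] -= 1


broker = LeastLoadedBroker(["search-1", "search-2", "search-3"])
assigned = [broker.acquire() for _ in range(3)]  # one query per server
broker.release("search-2")                       # search-2 finishes first
next_server = broker.acquire()                   # so it receives the next query
```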

Fault tolerance. Enterprise mission-critical applications may require 24x7 fault-tolerant operations, including duplication of collections across nodes or servers, parallel indexing, concurrent indexing, query failure isolation and automatic server on/off adjustments in real-time.

Linguistics. Third-party linguistics companies have been actively licensing such features as thesauri, stemming (reducing a word to its root form, then matching all forms of the word in a search query to all forms of the same word in context), tokenization (breaking text into words and phrases), word and phrase analysis, parts of speech analysis, noun phrase extraction, language identification, spell-check and so on. Linguistic modules from InXight and Teragram are the two most widely used in enterprise search engines.
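Tokenization and stemming can be shown with a deliberately naive sketch. The suffix list below is a toy assumption that handles only a few regular English endings; it is not the algorithm used by InXight, Teragram or any established stemmer.

```python
import re

# Deliberately tiny suffix list; real stemmers apply many more rules.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, document):
    """True if every stemmed query term appears among the stemmed document terms."""
    doc_stems = {naive_stem(token) for token in tokenize(document)}
    return all(naive_stem(token) in doc_stems for token in tokenize(query))


found = matches("indexing documents", "The crawler indexed each document.")
missed = matches("indexing spreadsheets", "The crawler indexed each document.")
```

Here "indexing" and "indexed" both reduce to "index", so the first query matches even though neither query word appears verbatim in the document. A toy stemmer like this breaks down quickly on irregular forms, which is one reason vendors license linguistic modules instead.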

International Support. Business today is becoming increasingly international, requiring the ability to access information in various languages. Enterprises, big and small, offer information in languages other than English, making non-English content search a vital capability. Pertinent features include major language support, Unicode support, double-byte support, cross-lingual search and interfaces for user-added languages.

Metadata search. Metadata is information about information: more precisely, it's structured information about resources. More and more files, in formats such as PDF, Microsoft Office, HTML and XML, contain metadata. Because metadata in general is more structured, it helps locate useful information with precision. Search engines can index metadata content and use it to improve search rankings.
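As a rough illustration of how indexed metadata can influence ranking, the sketch below counts term hits and gives hits in metadata fields more weight than hits in the body. The weight of 3.0, the document layout and the function name are assumptions chosen for the example.

```python
def score(document, term, metadata_weight=3.0):
    """Count term hits, weighting hits in metadata fields above body hits."""
    term = term.lower()
    body_hits = document["body"].lower().split().count(term)
    metadata_hits = sum(
        value.lower().split().count(term) for value in document["metadata"].values()
    )
    return body_hits + metadata_weight * metadata_hits


# Hypothetical documents: one mentions the term in its title, one in its body.
doc_a = {
    "metadata": {"title": "Audit checklist", "author": "Lee"},
    "body": "Steps for the annual review",
}
doc_b = {
    "metadata": {"title": "Meeting notes", "author": "Kim"},
    "body": "We discussed the audit and the audit schedule",
}
score_a = score(doc_a, "audit")
score_b = score(doc_b, "audit")
```

With these numbers the document whose title contains the term outranks the one that mentions it twice in passing.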

Crawling. Crawlers periodically locate files and update the index, and come in two types: the Web crawler, which locates and updates files to index by following links; and the file indexer, which locates and updates files to index by following the hard drive's directory structure. Crawler issues include performance, real-time capability, selective relevance and configurability.
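The file-indexer style of crawler can be sketched with Python's standard library: walk the directory tree and pick out the files that are new or have changed since the last pass. Tracking modification times in a dictionary is an assumption made for brevity, and the function name is hypothetical.

```python
import os
import tempfile


def find_changed_files(root, indexed_mtimes):
    """Walk the directory tree and return files that are new or modified.

    indexed_mtimes maps a file path to the modification time recorded
    when the file was last indexed; it is updated in place.
    """
    changed = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            mtime = os.path.getmtime(path)
            if indexed_mtimes.get(path) != mtime:
                changed.append(path)
                indexed_mtimes[path] = mtime
    return sorted(changed)


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "reports"))
    for relative in ("readme.txt", os.path.join("reports", "q4.txt")):
        with open(os.path.join(root, relative), "w") as handle:
            handle.write("sample")
    seen = {}
    first_pass = find_changed_files(root, seen)   # both files are new
    second_pass = find_changed_files(root, seen)  # nothing changed since
```

Only the files returned by each pass need to be re-read and re-indexed.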

Security. In an enterprise IT infrastructure, users should access only authorized information. Therefore, enterprise search must offer security management for document access control and encryption of the communication protocol, seamlessly integrating with the repositories' existing access controls. Features to watch for include authentication, single sign-on, integration with repository security mechanisms and encrypted communication protocol.
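Document access control ultimately means that a hit is shown only to users who are allowed to open the underlying document. The sketch below filters a result list against a per-document access-control list of groups; the data layout and names are hypothetical, and a real integration would query each repository's own security mechanism rather than a dictionary.

```python
def authorized_results(results, acl, user_groups):
    """Keep only documents whose access-control list shares a group with the user.

    Documents with no ACL entry are hidden (deny by default).
    """
    groups = set(user_groups)
    return [doc_id for doc_id in results if groups & acl.get(doc_id, set())]


# Hypothetical ACLs: document -> groups allowed to read it.
acl = {
    "budget.xls": {"finance"},
    "handbook.pdf": {"finance", "engineering", "sales"},
    "roadmap.doc": {"engineering"},
}
hits = ["budget.xls", "handbook.pdf", "roadmap.doc", "untracked.txt"]
visible = authorized_results(hits, acl, ["engineering"])
```

An engineer searching this collection sees the handbook and the roadmap but never learns that the budget spreadsheet matched the query.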

Software Development Kit. SDKs let users build applications that incorporate search functionality. Consisting of a series of application programming interfaces that are typically modularized into functional suites, SDKs allow software developers to add search functionality without re-engineering their applications. SDKs should be compatible with standard development languages and component models, such as Java, COM and C/C++. In addition, they should support standard integration technologies, such as Active Server Pages (ASP) and the Java platform.

The Bottom Line

The starting price for an enterprise search engine ranges from a few thousand dollars to more than $100,000. For example, Microsoft SharePoint Portal Server's base price is $3,999 plus $71 per user, Google's GB-1001 costs $32,000 (a two-year license with hardware, software and technical support all included), and Verity's typical deal size for K2 Enterprise Search is in the $150,000 range, while its Ultraseek starts at $6,000. Enterprise solution deployments can climb into the multimillion-dollar range, depending on the pricing scheme (number of servers, number of users, size of collections, number of collections, number of documents and so on) and optional modules. Solutions from Microsoft and Google are considered off-the-shelf and low-cost. Regardless of price, a typical deal includes software licenses, maintenance and professional services.
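Per-user pricing adds up quickly, as a short calculation with the SharePoint figures quoted above shows. The user counts are arbitrary examples, and the function ignores maintenance, services and any volume discounts, none of which are given here.

```python
def sharepoint_license_cost(users, base_price=3999, per_user=71):
    """License cost in dollars: base server price plus a fee for each user."""
    return base_price + per_user * users


cost_100_users = sharepoint_license_cost(100)  # 3,999 + 71 * 100 = 11,099
cost_500_users = sharepoint_license_cost(500)  # 3,999 + 71 * 500 = 39,499
```

At 500 users the license alone exceeds the $32,000 starting price quoted for Google's appliance, although the two offers differ in license term and included hardware, so the comparison is only indicative.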

The accompanying matrix of enterprise-grade search engines is by no means comprehensive, but does represent a spectrum from departmental and intranet portal site level to fully-fledged enterprise applications.

Enterprise Search Engines

Autonomy Enterprise Search Server
Features and Claims: “Finds Similar”: Conceptually related, using Bayesian probabilistic pattern-recognition. Index-structured data for field searching. Provides real-time analysis, personalization and search agents, user profiling and automatic alerts. Supports 60 languages, authentication, encrypted communication protocols. Extracts from document content; optional “Navigator” for hierarchical category structure.
File Formats, Repositories, Database: 200+ file formats. Import module for file formats. Oracle, ODBC, Documentum, FileNet, Lotus Notes, Microsoft Exchange and so on. Supports audio and video files.
System Requirements/Cost: Windows NT/2000/XP; HP-UX, Linux, Solaris, other POSIX-compliant Unixes. Contact vendor for pricing.

Convera Excalibur RetrievalWare
Features and Claims: Semantic network query expansion, fuzzy searching, cross-lingual searches based on internal semantic network for English, French, German, Spanish, Italian and Dutch, profile-based alert. Supports 13 languages and authorization with additional security and authentication interfaces for third-party proxies and cross-repository authentication. Comes with categorization tools with multiple-level taxonomies.
File Formats, Repositories, Database: 200+ document types, including Lotus/Domino, FileNet Panagon, Microsoft Exchange and Documentum. Native bridges to Oracle, Sybase, Informix and MS SQL databases and ODBC. Optional video/image search modules for video and image. Audio search available through consulting.
System Requirements/Cost: Windows NT/2000, Solaris, HP-UX, Compaq Tru64 Unix (formerly Digital Unix), Linux, IBM AIX, NetBSD and Mac OS X. Contact vendor for pricing.

Entopia K-Bus
Features and Claims: Uses dynamic activity and context around content to capture the real-time value of information. Semantic and full-text search on document, people and sources. Localized versions for 4 Western European languages, authentication with dynamic summarization. Implements expertise identification and social network.
File Formats, Repositories, Database: 256 file formats using the Stellent processor. Documentum, Microsoft SharePoint, Open-Text, file shares, local hard disks, websites, Microsoft Exchange, Lotus Notes, IMAP.
System Requirements/Cost: Windows and Unix. Pricing: Starts at $50,000, based on the number of servers and number of seat clients.

FAST Data Search
Features and Claims: Supports synonym lists, manual recommendation and business rules. Search reporting and log analysis. Near-real-time index updating for breaking news and auction sites. 77 languages supported. User/password, cookie, SSL protection with automatic categorization, built-in taxonomy.
File Formats, Repositories, Database: 225+ file formats and connectors for Microsoft Exchange, Lotus Notes, Oracle, DB2, MySQL, Documentum and Vignette.
System Requirements/Cost: AIX 5.1/5.2, HP-UX 11, Solaris 8, Windows 2000/2003, Red Hat Enterprise Linux and Advanced Server 2.1. Contact vendor for pricing.

Google Search Appliance
Features and Claims: Simplicity, low cost and speed. 28 languages supported, form-based authentication, HTTP authentication or NTLM, dynamic page summarization.
File Formats, Repositories, Database: 220+ file formats; Lotus Notes.
System Requirements/Cost: Linux. Pricing: Starts at $32,000 (a two-year license with hardware, software and technical support included).

IBM DB2 Information Integrator
Features and Claims: Uses OmniFind for search relevancy; can infer usage and adjust ranking dynamically based on intent. Scalable to over 20 million documents and thousands of users. Supports dictionary-based search for 20 languages, plus basic support for 50 others. Easily administered by part-time staff.
File Formats, Repositories, Database: 225 file formats with Stellent Outside In Content Access. HTTP/HTTPS, news groups (NNTP), file systems, Domino databases, Microsoft Exchange public folders, DB2 Content Manager, DB2 UDB, Informix, Oracle, Documentum and FileNet (via third-party software).
System Requirements/Cost: AIX 5.2, Red Hat Linux 3, Suse Linux 8 and Windows 2000. Pricing: Starts at $5,000 per processor and $15,000 per data source connector.

iPhrase OneStep Enterprise Search
Features and Claims: Question-answer search, natural language query. Indexes unstructured data such as Web pages.
File Formats, Repositories, Database: 200+ file formats. Oracle, SQL, Interwoven.
System Requirements/Cost: Windows 2000/2003, Solaris 8/9, Red Hat Linux and AIX. Contact vendor for pricing.

Microsoft SharePoint Portal Server
Features and Claims: An embedded search engine. Subscription for alerts for new information.
File Formats, Repositories, Database: File servers, websites, Lotus Notes servers, e-mail archives from Microsoft Exchange public folders, and interface for modules for other data types.
System Requirements/Cost: Windows 2000. Server pricing: $3,999 + $71 per user.

Verity Ultraseek
Features and Claims: A downloadable enterprise search engine targeted for the SME and departmental applications, easy query, rapid deployment, low ongoing administration and overhead, “set-and-forget.” 16 major languages, security interfaces on both collection and document level with conceptual summarization. Supports basic and NT challenge/response authentication.
File Formats, Repositories, Database: 295 file formats. Microsoft Exchange, Lotus Notes, Documentum, FileNet.
System Requirements/Cost: Windows NT, 2000; Sun Solaris 2.5 and above, Linux, HP-UX 11.0. Pricing: Starts at $6,000, based on the number of documents.

Verity K2 Enterprise Search
Features and Claims: Sophisticated taxonomy and social networking. Queries use Boolean search with options for stemming, fuzzy search and concept extraction. 26 major languages and Unicode, auto language detection. Security interfaces on both collection and document level with dynamic file summarization. Classification: adaptive ranking, expert location, community.
File Formats, Repositories, Database: 295 file formats. Microsoft Exchange, Lotus Notes, Documentum, FileNet.
System Requirements/Cost: Windows NT 4, 2000; Solaris, HP-UX, AIX, IRIX, DEC Unix, Linux. Pricing: Typical deal size at $150,000, based on number of CPUs and number of users.

Merger Mania

A brief history of search market consolidation and technical differentiation.

December 2000:
Excalibur and Intel's Interactive Media Services division merge to form a new company, Convera.

November 2002:
Verity acquires Inktomi's enterprise search engine Ultraseek.

December 2002:
Yahoo! acquires Inktomi's Web portal search engine.

February 2003:
Overture buys AltaVista Corporation, which provided pay-for-performance search services on public websites and AltaVista Enterprise Search products.

February 2003:
FAST (Fast Search and Transfer) sells its Web portal AlltheWeb to Overture.

June 2003:
Overture sells the AltaVista Enterprise Search business to FAST.

July 2003:
Yahoo! acquires Overture.


Roland Wang is a consultant with more than 15 years' experience working in the IT industry in Silicon Valley, with a focus on Asia Pacific IT infrastructure project management, professional services, technical management, software development, and international sales and business development. Reach him at [email protected].
