Dr. Dobb's is part of the Informa Tech Division of Informa PLC


Indexing and Searching on a Hadoop Distributed File System

Kashyap Santoki works for Infosys Technologies Limited and can be contacted at [email protected]

In today's information-saturated world, the huge growth of geographically distributed data necessitates a system that facilitates fast parsing for the retrieval of meaningful results. A searchable index for distributed data would go a long way toward speeding the process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create an index on data residing in HDFS, and how to search those indexes. The development environment consists of Java 1.6, Eclipse 3.4.2, Lucene 2.4.0, and Hadoop 0.19.1 running on Microsoft Windows XP SP3.

To tackle this task, I've turned to Hadoop. The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.

Creating an Index on a Local File System

The first step is to create an index on the data stored in a local file system. Start by creating a project in Eclipse, creating a class in it, then adding all the required JAR files to the project. Take this example of data found in the web server log file of an application:

2010-04-21 02:24:01 GET /blank 200 120

This data is mapped to some fields:

  • 2010-04-21 -- Date field
  • 02:24:01 -- Time field
  • GET -- Method field (GET or POST) -- we will denote it as "cs-method"
  • /blank -- Requested URL field -- we will denote it as "cs-uri"
  • 200 -- Status-code for the request -- we will denote it as "sc-status"
  • 120 -- Time-taken field (time required to complete request)
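The mapping above can be sketched in plain Java before any Lucene code is involved. This is only an illustrative sketch: the class name `LogLineParser` and its `parse` method are hypothetical helpers, not part of Lucene or Hadoop, and the field names "date", "time", and "time-taken" are assumed to match the names used later when the documents are built.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (an assumption for illustration): maps one
// space-delimited log line to the field names used in this article.
class LogLineParser {
    // Field names in the order they appear in a log line.
    static final String[] FIELD_NAMES = {
        "date", "time", "cs-method", "cs-uri", "sc-status", "time-taken"
    };

    // Splits a line on single spaces and pairs each value with its field name.
    static Map<String, String> parse(String line) {
        String[] values = line.trim().split(" ");
        if (values.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_NAMES.length + " fields but found "
                + values.length + ": " + line);
        }
        // LinkedHashMap keeps the fields in log-line order.
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            fields.put(FIELD_NAMES[i], values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("2010-04-21 02:24:01 GET /blank 200 120");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Rejecting lines that do not have exactly six fields is a design choice of this sketch; a real log file may need more tolerant handling of blank or malformed lines.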

The sample data is in a file named "Test.txt" in the folder "E:\DataFile" and is as follows:

2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:01 GET /US/registrationFrame 200 605
2010-04-21 02:24:02 GET /US/kids/boys 200 785
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /US/search 200 658
2010-04-21 02:24:05 POST /US/shop 200 768
2010-04-21 02:24:05 GET /blank 200 347

We want to create an index for the data in this "Test.txt" file and save the index to the local file system. The following Java code does this. (Note the comments for details on what each part of the code does.)

// Imports needed by this snippet.
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Creating an IndexWriter object and specifying the path where the index
// files are to be stored.
IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);
// Creating a BufferedReader object and specifying the path of the file
// whose data is to be indexed.
BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));
String row = null;
// Reading each line present in the file.
while ((row = reader.readLine()) != null) {
    // Splitting the row into an array of fields; the file is space-delimited.
    String Arow[] = row.split(" ");
    // For each row, creating a document and adding the data to it under the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Adding the document to the index.
    indexWriter.addDocument(document);
}
// Closing the reader and the writer so that the index is written to disk.
reader.close();
indexWriter.close();

Once the Java code is executed, index files will be created and stored at the location "E://DataFile/IndexFiles".
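To give a rough intuition for what an index makes possible, the following is a simplified in-memory inverted index in plain Java. It is a conceptual sketch only and is not Lucene's implementation or on-disk format: the class `TinyInvertedIndex` and its methods are hypothetical, it skips analysis entirely and treats each field value as a single term, and it holds everything in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only (an assumption for illustration): this is NOT
// Lucene's implementation or file format.
class TinyInvertedIndex {
    // Field names in the order they appear in a log line.
    static final String[] FIELD_NAMES = {
        "date", "time", "cs-method", "cs-uri", "sc-status", "time-taken"
    };

    // Maps "field:value" to the 0-based row numbers that contain that value.
    private final Map<String, List<Integer>> postings =
        new HashMap<String, List<Integer>>();
    private int rowCount = 0;

    // Records every field value of one space-delimited row under its field name.
    void addRow(String line) {
        String[] values = line.split(" ");
        for (int i = 0; i < FIELD_NAMES.length && i < values.length; i++) {
            String key = FIELD_NAMES[i] + ":" + values[i];
            List<Integer> rows = postings.get(key);
            if (rows == null) {
                rows = new ArrayList<Integer>();
                postings.put(key, rows);
            }
            rows.add(rowCount);
        }
        rowCount++;
    }

    // Returns the row numbers whose given field has exactly this value.
    List<Integer> lookup(String field, String value) {
        List<Integer> rows = postings.get(field + ":" + value);
        return rows == null ? Collections.<Integer>emptyList() : rows;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addRow("2010-04-21 02:24:01 GET /blank 200 120");
        index.addRow("2010-04-21 02:24:02 POST /blank 304 56");
        index.addRow("2010-04-21 02:24:04 GET /blank 500 567");
        System.out.println(index.lookup("sc-status", "500")); // prints [2]
        System.out.println(index.lookup("cs-method", "GET")); // prints [0, 2]
    }
}
```

The point of the sketch is that a lookup goes straight from a term to the matching rows instead of scanning every line of the file.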
