Dr. Dobb's is part of the Informa Tech Division of Informa PLC

Indexing and Searching Image files


Adelene Ng is a senior staff software engineer with Motorola. She holds a Ph.D. in Computer Science from the University of London. Adelene can be contacted at [email protected].


Lucene is a high-performance, full-featured text retrieval library. Originally written in Java, it has since been ported to C++, C#, Perl, and Python. In this article, I show how Lucene.NET can be used to index and search image files captured by digital cameras. What makes this possible is that digital photos embed the camera settings and scene information as metadata. The specification for this metadata is the Exchangeable Image File Format, or EXIF (www.exif.org). Examples of stored information include shutter speed, exposure settings, date and time, focal length, metering mode, and whether the flash was fired. Here I show how to extract EXIF information from the images and add it to a Lucene index, which can then be searched using criteria supplied by the user. To keep the example simple, I limit the search criteria to a date range and the user comments field. All images that satisfy the search criteria are displayed as thumbnails.

ImageSearcher

The ImageSearcher utility I present here was developed in C# using Visual Studio 2008. (The complete source code for ImageSearcher is available online at www.ddj.com/code/.) It also makes use of a number of open-source libraries and components such as Lucene.NET 2.0 (incubator.apache.org/lucene.net), NLog 1.0 (www.nlog-project.org), and ExifExtractor (www.codeproject.com/KB/graphics/exifextractor.aspx).

Although other open-source search libraries such as DotLucene (www.dotlucene.net) and NLucene (sourceforge.net/project/showfiles.php?group_id=57306) are available, I selected Lucene.NET because the DotLucene project has been closed since May 2007 and NLucene has not been updated since 2002.

In addition to Lucene, I use the NLog logging library, which has a programming interface similar to Apache log4net (logging.apache.org/log4net). I use NLog to write diagnostic messages to a file for logging and tracing purposes. To extract EXIF information, I use the ExifExtractor library. Although .NET already has utilities to extract EXIF information, they return raw data and tags, which would need further processing before they could be used in this application. For example, to extract shutter speed information, I would need to know the tag number, extract the tag, and then convert the data from ASCII to a number. ExifExtractor abstracts all these steps. To display the thumbnail images that have been returned by the search engine, I make use of Marc Lievin's Image Thumbnail Viewer (www.codeproject.com/KB/cs/thumbnaildotnet2.aspx).

As Figure 1 illustrates, the ImageSearcher application is made up of six main classes:

  • ImageSearcherForm
  • ImageDialog
  • ImageViewer
  • ThumbnailController
  • ThumbnailImageEventHandler
  • BuildIndex

ImageSearcherForm is the main point of entry into the application. It lets users enter the directory where the images are stored, where the index directory is to be created, and what search parameters (start and end dates, user comments, and so on) to use. BuildIndex creates and searches the index. The remaining classes control the display of the thumbnail images in the status window. This portion makes extensive reuse of code from Lievin's Image Thumbnail Viewer.

Figure 1: ImageSearcher classes and their relationships.

The BuildIndex class is where index creation and searching take place. To use Lucene, I first create an index by instantiating an IndexWriter using the following constructor:


IndexWriter idxWriter = new IndexWriter(indexDir, 
    new StandardAnalyzer(), true);


where indexDir represents the path to the index directory. Text is analyzed with the StandardAnalyzer. The last argument is a Boolean that, if true, creates a new index or overwrites the existing one; if false, new documents are appended to the existing index.

Analyzers in Lucene can be used to tokenize text, extract relevant words, remove common words, stem the words (that is, reduce them to the root form; for example, "edits," "editor," and "editing" are condensed to "edit"), and perform any other processing before storing it into the index. The common analyzers provided by Lucene are:

  • WhitespaceAnalyzer, which separates tokens based on whitespace.
  • SimpleAnalyzer, which tokenizes the string to a set of words and converts them to lowercase.
  • StandardAnalyzer, which tokenizes the string to a set of words while recognizing acronyms, e-mail addresses, host names, and so on, converts them to lowercase, and discards the basic English stop words (a, an, the, to). It does not stem words.
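The difference between the first two analyzers can be illustrated without Lucene at all. The following is a minimal sketch in plain C#, not Lucene's own code; the class and method names (TokenizerSketch, WhitespaceTokens, SimpleTokens) and the sample comment are hypothetical. It only mimics the splitting rules described above: whitespace-only splitting versus splitting at every non-letter and lowercasing.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TokenizerSketch
{
    // Splits on whitespace only; case and punctuation are preserved.
    public static string[] WhitespaceTokens(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Splits at every non-letter character and lowercases each token.
    public static string[] SimpleTokens(string text)
    {
        List<string> tokens = new List<string>();
        StringBuilder current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Length = 0; // start a new token
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens.ToArray();
    }

    public static void Main()
    {
        string comment = "Sunset at Lake-Tahoe, 2008";
        // prints: Sunset|at|Lake-Tahoe,|2008
        Console.WriteLine(string.Join("|", WhitespaceTokens(comment)));
        // prints: sunset|at|lake|tahoe
        Console.WriteLine(string.Join("|", SimpleTokens(comment)));
    }
}
```

The choice of analyzer therefore determines which terms end up in the index, and the same analysis normally has to be applied to query text for searches to match.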

A Lucene index is a sequence of files, and all searching is done on this index. The raw EXIF metadata therefore has to be read and extracted from the image files and passed to Lucene, where it can be indexed and searched. The IndexWriter object is created in the BuildIndex constructor, which takes two arguments: the first is the directory containing your image files, and the second is the directory in which the index files are created.

Next, the IndexDocs() method is called. This method has one argument, which is the name of the directory containing your image files. It runs through each file in the specified directory, checks that it is a JPEG file, and creates an Image object from the file, using the call Image.FromFile(file) from the System.Drawing namespace:


Image img = Image.FromFile(file);
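The article does not say how the JPEG check is done. One plausible approach, shown here purely as an assumption, is to test the file extension before loading the image. The helper name HasJpegExtension and the class JpegFilterSketch are hypothetical and are not taken from the ImageSearcher source.

```csharp
using System;
using System.IO;

public static class JpegFilterSketch
{
    // Hypothetical helper: treats a file as a JPEG when its extension is
    // .jpg or .jpeg, ignoring case. It does not inspect the file contents.
    public static bool HasJpegExtension(string path)
    {
        string ext = Path.GetExtension(path);
        return string.Equals(ext, ".jpg", StringComparison.OrdinalIgnoreCase)
            || string.Equals(ext, ".jpeg", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main(string[] args)
    {
        // Assumed default for illustration; pass a real image directory as args[0].
        string imageDir = args.Length > 0 ? args[0] : ".";
        foreach (string file in Directory.GetFiles(imageDir))
        {
            if (HasJpegExtension(file))
            {
                // In the real application this is where the image would be
                // loaded and its EXIF data extracted.
                Console.WriteLine(file);
            }
        }
    }
}
```

An extension test is cheap but can be fooled by misnamed files, so code that relies on it should still be prepared for Image.FromFile to fail on a file that is not a valid image.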

Next, the ExifExtractor library is used to extract EXIF information from the image files. To keep the application simple, I extract only "Date Time" and "User Comment" EXIF data. EXIF data is extracted as follows:

First, create the EXIFextractor instance:


Goheer.EXIF.EXIFextractor er =
    new Goheer.EXIF.EXIFextractor(ref img, "\n");

Next, retrieve Date/Time EXIF data:


string s1 = (String)er["Date Time"];

Likewise, to extract the user comments EXIF information, we access the er object as follows:


string s2 = (String)er["User Comment"];
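The EXIF specification stores the date and time as text in the form YYYY:MM:DD HH:MM:SS. Since the index is later searched by date range using string terms, it can help to convert this value to a fixed-width key before indexing. The following sketch shows one way to do that; the helper name ToSortableKey, the key layout yyyyMMddHHmmss, and the trimming of stray terminator characters are assumptions for illustration, not something the ImageSearcher source is known to do.

```csharp
using System;
using System.Globalization;

public static class ExifDateSketch
{
    // Hypothetical helper: converts an EXIF "Date Time" string such as
    // "2008:07:04 18:30:15" into the fixed-width key "20080704183015".
    // Returns null when the input does not match the EXIF layout.
    public static string ToSortableKey(string exifDateTime)
    {
        if (exifDateTime == null)
        {
            return null;
        }
        // Assumption: the extracted value may carry a trailing NUL or newline.
        string cleaned = exifDateTime.Trim('\0', ' ', '\r', '\n');
        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            cleaned,
            "yyyy':'MM':'dd HH':'mm':'ss",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out parsed);
        return ok
            ? parsed.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture)
            : null;
    }

    public static void Main()
    {
        // prints: 20080704183015
        Console.WriteLine(ToSortableKey("2008:07:04 18:30:15"));
        // prints: True
        Console.WriteLine(ToSortableKey("not a date") == null);
    }
}
```

Because every key has the same width and runs from the most significant unit (year) to the least (second), comparing two keys as strings gives the same result as comparing the underlying dates.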

The aforementioned EXIF tags are extracted from each image file. For each image file processed, a Document() object is created. This is created using the Document constructor as follows:


Document d = new Document();

Documents are the primary retrievable items from a Lucene query. Each Document object is made up of one or more field objects.

Fields represent a section of the Document. Each field holds the name of the section and the data associated with it, and contains information that you query against or display in your search results. Because I use the filename, the date and time the picture was taken, and the user comments in searches and search results, each of these is added to the Document object as a field. Apart from the filename, the field values are the EXIF data extracted from the image file. Field values are a sequence of terms. I construct the Field object using the constructor:


new Field(String name, String value, Field.Store store, Field.Index index)

where the first argument is the name of the field, the second argument is the value associated with this field name, the third argument indicates how the field is stored, and the last argument denotes how the field is indexed. In this application, the store is set to Field.Store.YES and the index is set to Field.Index.UN_TOKENIZED. The former stores the original field value in the index so that it can be returned with the search results. The latter indexes the field's value as a single term, without running it through an Analyzer, so it can be searched by its exact value. Fields are added to the Document object using the Add method:


d.Add(new Field("path", filename, Field.Store.YES, Field.Index.UN_TOKENIZED));

The Document object is then added to the IndexWriter instance using the following method:


idxWriter.AddDocument(d);

When indexing is complete, we optimize the index for search:


idxWriter.Optimize();

Finally, we close the index:


idxWriter.Close();

Once the index has been built, it can be searched. In this application, the search is activated by the user. After users have entered all the search parameters, they activate the search by clicking on the search button. To do the search, we first create an IndexSearcher instance that points to the directory containing the indices that have been created previously:


IndexSearcher searcher = new IndexSearcher(idxDir.FullName);

I use the RangeQuery object to search for documents whose datetime field falls within a given range. An instance of RangeQuery is created as follows:


Term sTerm = new Term("datetime", startDateSearchString);
Term eTerm = new Term("datetime", endDateSearchString);
RangeQuery query = new RangeQuery(sTerm, eTerm, true);

The last argument to the RangeQuery constructor is a Boolean flag, which is set to true if the range includes its two end points, or false if it excludes them. The query instance is then passed as an argument to the Search method of the IndexSearcher instance:


Hits oHitColl = searcher.Search(query);
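A range over terms is evaluated by comparing strings, not dates, so the effect of the inclusive flag can be sketched in plain C# without Lucene. The code below is an illustration only: the InRange helper and the fixed-width yyyyMMddHHmmss keys are assumptions, and the article does not state the format of startDateSearchString and endDateSearchString.

```csharp
using System;

public static class RangeSketch
{
    // Hypothetical helper: returns true when value lies between lower and
    // upper under ordinal string comparison. When inclusive is true the two
    // end points themselves count as matches.
    public static bool InRange(string value, string lower, string upper, bool inclusive)
    {
        int vsLower = string.CompareOrdinal(value, lower);
        int vsUpper = string.CompareOrdinal(value, upper);
        return inclusive
            ? (vsLower >= 0 && vsUpper <= 0)
            : (vsLower > 0 && vsUpper < 0);
    }

    public static void Main()
    {
        string start = "20080701000000";
        string end = "20080704183015";
        string photo = "20080704183015"; // taken exactly at the end point

        // prints: True
        Console.WriteLine(InRange(photo, start, end, true));
        // prints: False
        Console.WriteLine(InRange(photo, start, end, false));
    }
}
```

String comparison only agrees with chronological order when the indexed values and the search bounds share one fixed-width layout that runs from year down to second.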

This returns the documents that match the query. I extract the Document objects from the Hits object by calling the Doc method as follows:


string[] filesFound = null;
if (oHitColl.Length() > 0)
   filesFound = new string[oHitColl.Length()];
for (int i = 0; i < oHitColl.Length(); i++) {
   Document oDoc = oHitColl.Doc(i);
   filesFound[i] = (string)oDoc.Get("path");
}

In addition to the date/time fields, if I want to add the comments field to the search, I create a new TermQuery containing the user comments data:


TermQuery tQuery = new TermQuery(new Term("comments", comments));

Then I create a BooleanQuery object (query is the RangeQuery created earlier):


BooleanQuery bQuery = new BooleanQuery();
bQuery.Add(query, BooleanClause.Occur.MUST);
bQuery.Add(tQuery, BooleanClause.Occur.MUST);

I add both the TermQuery and RangeQuery objects to the BooleanQuery and pass this to the Search method of the IndexSearcher instance. This returns the documents that match the query.
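Marking both clauses as MUST means a document is returned only if it satisfies the date range and the comment term together. The sketch below mimics that logic over an in-memory list in plain C#; it is not Lucene code, and the PhotoRecord class, the Search helper, the sample data, and the fixed-width date keys are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an indexed document with three stored fields.
public class PhotoRecord
{
    public string Path;
    public string DateKey;  // assumed fixed-width key, e.g. yyyyMMddHHmmss
    public string Comment;
}

public static class CombinedSearchSketch
{
    // Returns the paths of photos that satisfy BOTH conditions, which is
    // what two MUST clauses in a Boolean query require.
    public static string[] Search(IEnumerable<PhotoRecord> photos,
                                  string startKey, string endKey, string comment)
    {
        return photos
            // Range clause: inclusive ordinal comparison on the date key.
            .Where(p => string.CompareOrdinal(p.DateKey, startKey) >= 0
                     && string.CompareOrdinal(p.DateKey, endKey) <= 0)
            // Term clause: an untokenized field matches only on its exact value.
            .Where(p => p.Comment == comment)
            .Select(p => p.Path)
            .ToArray();
    }

    public static void Main()
    {
        List<PhotoRecord> photos = new List<PhotoRecord>
        {
            new PhotoRecord { Path = "a.jpg", DateKey = "20080702120000", Comment = "beach" },
            new PhotoRecord { Path = "b.jpg", DateKey = "20080702130000", Comment = "city" },
            new PhotoRecord { Path = "c.jpg", DateKey = "20080801090000", Comment = "beach" }
        };

        // prints: a.jpg
        Console.WriteLine(string.Join(",",
            Search(photos, "20080701000000", "20080731235959", "beach")));
    }
}
```

Here b.jpg fails the comment clause and c.jpg fails the date clause, so only a.jpg is returned.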

