Dr. Dobb's is part of the Informa Tech Division of Informa PLC

Augmented Reality on Mobile Internet Devices

In the past few years, various methods have been suggested for presenting augmented content to users through mobile devices. Many of the latest mobile Internet devices (MIDs) feature consumer-grade cameras, WAN and WLAN network connectivity, location sensors (such as Global Positioning System, or GPS, receivers), and various orientation and motion sensors. Recently, several reality-augmenting applications, such as Wikitude and similar applications for the iPhone, have been announced. Although similar in nature to our proposed system, these solutions rely solely on location and orientation sensors and therefore require detailed location information about points of interest to correctly align augmenting information with visible objects. Our system extends this approach by using image-matching techniques and location sensors together, both for object recognition and for precise placement of augmenting information. The new Google Goggles application also combines image matching with location information. However, Google's approach is to send the query image to remote servers for processing, which can incur long response times due to network latency. Powered by the Intel Atom processor, our system can perform image matching on the MID itself instead of shifting all computation to remote servers.

In this article, we demonstrate a complete, end-to-end, mobile augmented reality (MAR) system that consists of MIDs, powered by the Atom processor, and a Web-based MAR service hosted on a server. On the server, we store a large database of images automatically extracted from geo-tagged Wikipedia pages that we update on a regular basis. Figure 1 shows a snapshot of the actual client interface running on the MID. In this example the user has taken a picture of the Golden Gate Bridge in San Francisco. The MAR system uses the location of the user along with the camera image to return the top five candidate images from the database on the server that match the image taken on the handheld device. The user has the option of selecting the image of interest, which links to a corresponding Wikipedia page with information about the Golden Gate Bridge. A transparent logo is then added to the live camera view, virtually "pinned" to the object (Figure 1). The user sees the tag whenever the object is in the camera view and can click on it to bring up the retrieved information.

Figure 1: Illustration of the MAR System Showing Results when Querying the Golden Gate Bridge (Source: Intel Corporation, 2010).

System Overview

The MAR system (Figure 2) is partitioned into two components: a client MID and a server. The MID communicates with the server through a WAN or WLAN network.

Figure 2: MAR System Diagram with Image Acquisition, Feature Extraction, Tracking, Rendering, and Image Matching Running on the Client and the Large Database Residing on the Server (Source: Intel Corporation, 2010).

The client performs the following operations:

  • Query acquisition. The client MID contains a camera and orientation and location sensors (such as a GPS receiver). The client continuously acquires live video from the camera and waits for the user to take a picture of the object of interest. As soon as a picture is taken, data from the location and orientation sensors are saved to the image's Exchangeable image file format (EXIF) fields.
  • Feature extraction. Visual features are then extracted from the picture. We compared several well-known feature-point techniques, such as the Scale-Invariant Feature Transform (SIFT) [6] and Speeded-Up Robust Features (SURF) [7]. We decided to use 64-dimensional SURF image features because their compute efficiency is higher than that of SIFT features, at comparable recognition rates. If users know their destination beforehand, they can pre-cache a small database of image features, thumbnails, and related metadata for the corresponding neighborhood. With a pre-cached database, image matching can run on the client MID itself. Otherwise, the client sends the extracted feature vectors and recorded sensor data to the server to search for matching images and related information. We work with compact visual features instead of full-resolution images to reduce network bandwidth, and hence latency, when sending image data to the server, and to use the MID's memory resources efficiently when pre-caching the data.
  • Rendering and overlay. Once a matching image is found, related data, including the image thumbnail and the Wikipedia page link, are immediately read from the pre-cached database if available; otherwise, they are downloaded to the client via the WAN/WLAN network. The client renders the augmented information, such as the Wikipedia tag of the query object, and overlays it on the live video. In addition to constraining image recognition based on location information, the orientation sensors (compass, accelerometer, and gyroscope) of the device are used to track the positions of objects so that augmented information can be overlaid on the live camera view. As the user moves the device, the augmented information representing the query stays pinned to the position of the object. This way, the user can interact separately with multiple queries by simply pointing the camera toward different locations.
  • Tracking. When a query is made, the pointing direction of the MID is recorded by using the orientation sensors on the device. The client then continuously tracks the device's movement with these sensors [8, 9]. However, tracking with orientation sensors alone is not very precise, so we extend our tracking method to also include visual information. We use image-based stabilization, which aligns neighboring frames in the input image sequence by means of a low-parametric motion model. The motion estimation algorithm is based on a multi-resolution, iterative, gradient-based strategy [10], optionally made robust in a statistical sense [11]. Two motion models have been considered: pure translation (two parameters) and pure camera rotation (three parameters).
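The sensor-based overlay placement described above can be sketched with a simple pinhole-camera mapping from a compass bearing to a screen position. This is an illustrative sketch, not the article's implementation; the horizontal field of view and image width below are assumed values, since the article does not specify the device's camera parameters:

```python
import math

def tag_screen_x(object_azimuth_deg, camera_azimuth_deg,
                 hfov_deg=60.0, width_px=800):
    """Map the compass bearing of a pinned object to a horizontal
    pixel position in the live camera view.

    Assumes a simple pinhole model; hfov_deg and width_px are
    illustrative values, not measured device parameters.
    """
    # Signed angular offset, wrapped into (-180, 180] degrees.
    delta = (object_azimuth_deg - camera_azimuth_deg + 180.0) % 360.0 - 180.0
    if abs(delta) > hfov_deg / 2.0:
        return None  # object currently outside the camera view
    # Focal length in pixels, derived from the horizontal field of view.
    f = (width_px / 2.0) / math.tan(math.radians(hfov_deg / 2.0))
    return width_px / 2.0 + f * math.tan(math.radians(delta))
```

An object dead ahead lands at the image center; as the user pans, the tag slides across the view and disappears once its bearing leaves the field of view.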
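The two-parameter translation model used in the stabilization step can be sketched as follows. This is a minimal single-resolution sketch of the iterative, gradient-based strategy the article cites [10], without the multi-resolution pyramid or statistical robustness [11]; integer shifts stand in for real subpixel interpolation:

```python
import numpy as np

def estimate_translation(prev, curr, iters=10):
    """Estimate a pure-translation (two-parameter) motion between two
    grayscale frames with an iterative, gradient-based least-squares
    scheme. A simplified sketch; a real implementation would use
    subpixel warping and a multi-resolution pyramid."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    gy, gx = np.gradient(prev)           # spatial gradients of the reference
    p = np.zeros(2)                      # running (dx, dy) estimate
    for _ in range(iters):
        # Warp the current frame back by the (rounded) running estimate.
        shifted = np.roll(curr, (-int(round(p[1])), -int(round(p[0]))),
                          axis=(0, 1))
        gt = shifted - prev              # temporal difference
        # Solve the 2x2 normal equations A @ dp = b for the update.
        A = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                      [np.sum(gx * gy), np.sum(gy * gy)]])
        b = -np.array([np.sum(gx * gt), np.sum(gy * gt)])
        dp = np.linalg.solve(A, b)
        p += dp
        if np.linalg.norm(dp) < 1e-3:    # converged
            break
    return p
```

Aligning each incoming frame this way lets the client correct the drift of the orientation sensors, so the overlaid tag stays locked to the object rather than lagging behind it.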

The server performs the following operations:

  • Database collection. We automatically extracted Wikipedia pages, together with their GPS information [8], from the Wikipedia database. We downloaded the images from these pages, extracted visual features from them, and added the images and features to our image database. The database is constantly growing and now contains more than 500,000 images, which further emphasizes the need for applications that can explore large numbers of images.
  • Image match. As mentioned earlier, in the absence of a pre-cached database on the MID, the server receives the features of the query image from the client and performs the image match.
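The server-side search can be sketched as follows. The article does not detail the matching algorithm, so this sketch uses a standard nearest-neighbor ratio test over descriptor vectors as a stand-in, combined with the location constraint described earlier; the database record fields (`gps`, `desc`, `title`), the search radius, and the ratio threshold are all illustrative assumptions:

```python
import math
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS fixes."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def match_score(query_desc, image_desc, ratio=0.7):
    """Count query descriptors that pass a nearest-neighbor ratio test
    against one database image's descriptors (a standard technique;
    the article does not specify its matcher)."""
    good = 0
    for q in query_desc:
        d = np.linalg.norm(image_desc - q, axis=1)
        if len(d) < 2:
            continue
        d1, d2 = np.partition(d, 1)[:2]   # two smallest distances
        if d1 < ratio * d2:
            good += 1
    return good

def rank_candidates(query_desc, query_gps, database,
                    radius_km=10.0, top_k=5):
    """Rank database images by descriptor-match score, considering only
    images whose geo-tag lies within radius_km of the query location.
    Each database entry is assumed to be a dict with 'gps', 'desc',
    and 'title' fields (illustrative schema)."""
    lat, lon = query_gps
    scored = []
    for entry in database:
        elat, elon = entry["gps"]
        if haversine_km(lat, lon, elat, elon) > radius_km:
            continue  # location constraint prunes the search
        scored.append((match_score(query_desc, entry["desc"]),
                       entry["title"]))
    scored.sort(key=lambda t: -t[0])
    return [title for score, title in scored[:top_k]]
```

Pruning by location first is what keeps a 500,000-image database tractable: only the descriptors of geographically plausible candidates are ever compared against the query.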
