In the past few years, various methods have been suggested for presenting augmented content to users through mobile devices. Many of the latest mobile Internet devices (MIDs) feature consumer-grade cameras, WAN and WLAN network connectivity, location sensors (such as Global Positioning System, or GPS, receivers), and various orientation and motion sensors. Recently, several reality-augmenting applications, such as Wikitude and similar applications for the iPhone, have been announced. Although similar in nature to our proposed system, these solutions rely solely on location and orientation sensors and therefore require detailed location information about points of interest to correctly align augmenting information with visible objects. Our system extends this approach by using image-matching techniques and location sensors together, both for object recognition and for precise placement of augmenting information. The new Google Goggles application also combines image matching with location information. However, Google's approach is to send the query image to remote servers for processing, which might incur long response times due to network latency. Powered by the Intel Atom processor, our system can perform image matching on the MID itself instead of shifting all computation to remote servers.
In this article, we demonstrate a complete, end-to-end, mobile augmented reality (MAR) system that consists of MIDs, powered by the Atom processor, and a Web-based MAR service hosted on a server. On the server, we store a large database of images automatically extracted from geo-tagged Wikipedia pages that we update on a regular basis. Figure 1 shows a snapshot of the actual client interface running on the MID. In this example the user has taken a picture of the Golden Gate Bridge in San Francisco. The MAR system uses the location of the user along with the camera image to return the top five candidate images from the database on the server that match the image taken on the handheld device. The user has the option of selecting the image of interest, which links to a corresponding Wikipedia page with information about the Golden Gate Bridge. A transparent logo is then added to the live camera view, virtually "pinned" to the object (Figure 1). The user will see the tag whenever the object is in the camera view and can click on it to return the retrieved information.
The MAR system (Figure 2) is partitioned into two components: a client MID and a server. The MID communicates with the server through a WAN or WLAN network.
The client performs the following operations:
- Query acquisition. The client MID contains a camera and orientation and location sensors (such as a GPS receiver). The client continuously acquires live video from the camera and waits for the user to take the picture of interest. As soon as a picture is taken, data from the location and orientation sensors are saved to the image's Exchangeable image file format (EXIF) fields.
- Feature extraction. Visual features are then extracted from the picture. We compared several well-known feature-point techniques, such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF). We decided to use 64-dimensional SURF image features because they are more efficient to compute than SIFT features, at comparable recognition rates. If users know their destination beforehand, they can pre-cache a small database of image features, thumbnails, and related metadata for the corresponding neighborhood. With a pre-cached database, image matching can run on the client MID itself. Otherwise, the client sends the extracted feature vectors and recorded sensor data to the server, which searches for matching images and related information. We work with compact visual features instead of full-resolution images to reduce network bandwidth, and hence latency, when sending image data to the server, and to use the MID's memory efficiently when pre-caching the data.
- Rendering and overlay. Once a matching image is found, related data, including the image thumbnail and the Wikipedia page link, are immediately read from the pre-cached database if available, or else downloaded to the client over the WAN/WLAN network. The client renders the augmented information, such as the Wikipedia tag of the query object, and overlays it on the live video. Location information constrains the image recognition; in addition, the device's orientation sensors (compass, accelerometer, and gyroscope) are used to track the positions of objects so that the augmented information stays aligned with the live camera view. As the user moves the device, the augmented information representing the query remains pinned to the position of the object. This way, the user can interact separately with multiple queries simply by pointing the camera toward different locations.
- Tracking. When a query is made, the pointing direction of the MID is recorded using the orientation sensors on the device. The client then continuously tracks the device's orientation [8, 9]. However, tracking with orientation sensors alone is not very precise, so we extend our tracking method to also use visual information. We apply image-based stabilization, which aligns neighboring frames of the input image sequence using a low-parametric motion model. The motion-estimation algorithm follows a multi-resolution, iterative, gradient-based strategy, optionally made robust in a statistical sense. Two motion models have been considered: pure translation (two parameters) and pure camera rotation (three parameters).
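The query-acquisition step above records the GPS fix in the image's EXIF fields. The EXIF GPS tags store each coordinate as degree/minute/second rationals rather than a decimal number, so the sensor reading must be converted before it is written. A minimal sketch of that conversion (the helper name and the centi-second precision are illustrative, not the system's actual code):

```python
def to_exif_dms(decimal_deg):
    """Convert a decimal GPS coordinate to the (numerator, denominator)
    degree/minute/second rationals stored in the EXIF GPS IFD.
    Returns (is_positive, dms); is_positive selects the N/S or E/W ref tag."""
    is_positive = decimal_deg >= 0
    v = abs(decimal_deg)
    degrees = int(v)
    minutes = int((v - degrees) * 60)
    # Store seconds with two decimal places as a rational over 100.
    centi_seconds = round((v - degrees - minutes / 60) * 3600 * 100)
    return is_positive, ((degrees, 1), (minutes, 1), (centi_seconds, 100))
```

For example, the Golden Gate Bridge's latitude, 37.8199, becomes 37 degrees, 49 minutes, 11.64 seconds with a north reference.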
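To give a sense of the bandwidth saving behind the feature-extraction step, here is a hedged sketch of how the 64-dimensional SURF descriptors might be serialized for the uplink. The wire format (a count followed by little-endian float32 values) is purely illustrative; the article does not specify the client's actual protocol.

```python
import struct

def pack_descriptors(descriptors):
    """Serialize N 64-dimensional descriptors as a 32-bit count followed
    by little-endian float32 values (illustrative wire format)."""
    payload = [struct.pack('<I', len(descriptors))]
    for d in descriptors:
        if len(d) != 64:
            raise ValueError('expected 64-dimensional descriptors')
        payload.append(struct.pack('<64f', *d))
    return b''.join(payload)
```

At 256 bytes per descriptor, 500 descriptors occupy about 125 KB, well below the size of a full-resolution photo from a multi-megapixel camera.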
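The rendering-and-overlay step pins a tag to an object using the compass bearing recorded with the query. Under a pinhole camera model, the bearing difference between the object and the current view direction maps to a horizontal pixel position. A sketch of that mapping, with hypothetical parameters (a 60-degree horizontal field of view and an 800-pixel-wide preview; the real device's values are not given in the article):

```python
import math

def tag_screen_x(object_heading_deg, camera_heading_deg,
                 hfov_deg=60.0, width_px=800):
    """Map an object's compass bearing to a horizontal pixel position in
    the live view, or None if the object is outside the field of view."""
    # Signed angular offset folded into (-180, 180].
    delta = (object_heading_deg - camera_heading_deg + 180.0) % 360.0 - 180.0
    if abs(delta) >= hfov_deg / 2:
        return None
    # Pinhole projection: x = f * tan(delta), with f derived from the FOV.
    f = (width_px / 2) / math.tan(math.radians(hfov_deg / 2))
    return width_px / 2 + f * math.tan(math.radians(delta))
```

Redrawing the tag at this position on every frame, as the compass heading changes, is what keeps it visually attached to the object.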
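The pure-translation case of the tracking step's gradient-based motion estimator can be sketched as an iterative least-squares alignment of two frames. This is a single-resolution, non-robust toy version of the cited approach (no pyramid, no statistical robustness), using NumPy only:

```python
import numpy as np

def fourier_shift(img, dx, dy):
    """Return img resampled at (x + dx, y + dy) via the Fourier shift
    theorem (periodic boundary); used here as the warping operator."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    spec = np.fft.fft2(img) * np.exp(2j * np.pi * (fx * dx + fy * dy))
    return np.real(np.fft.ifft2(spec))

def estimate_translation(ref, cur, iters=40):
    """Iteratively estimate the two-parameter translation (dx, dy)
    aligning cur to ref from image gradients (Gauss-Newton steps)."""
    dx = dy = 0.0
    for _ in range(iters):
        warped = fourier_shift(cur, dx, dy)
        gy, gx = np.gradient(warped)          # gradients along rows, cols
        err = ref - warped
        # Normal equations of the linearized brightness-constancy term.
        a11, a12, a22 = (gx * gx).sum(), (gx * gy).sum(), (gy * gy).sum()
        b1, b2 = (gx * err).sum(), (gy * err).sum()
        det = a11 * a22 - a12 * a12
        dx += (a22 * b1 - a12 * b2) / det
        dy += (a11 * b2 - a12 * b1) / det
    return dx, dy
```

The full estimator additionally runs this loop coarse-to-fine over an image pyramid, which extends the small-displacement assumption of the linearization to larger inter-frame motions.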
The server performs the following operations:
- Database collection. We automatically extracted Wikipedia pages with associated GPS coordinates from the Wikipedia database, downloaded the images on those pages, extracted visual features from them, and added the images and features to our image database. The database is constantly growing and now contains more than 500,000 images, which further underscores the need for applications that can explore large numbers of images.
- Image match. As described above, in the absence of a pre-cached database on the MID, the server receives the features of the query image from the client and performs the image matching.
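Because the query carries a GPS fix, the server can restrict matching to database images geotagged near the user, keeping the search over half a million images tractable. A hedged sketch of such a pre-filter (the record layout and the 2 km default radius are hypothetical choices, not the system's documented values):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def nearby_candidates(records, lat, lon, radius_km=2.0):
    """Keep only records whose geotag lies within radius_km of the query
    location; records are (image_id, lat, lon) tuples."""
    return [r for r in records
            if haversine_km(lat, lon, r[1], r[2]) <= radius_km]
```

Only the surviving candidates need their feature vectors compared against the query, so the visual search runs over hundreds of images rather than the whole database.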
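Matching the query's SURF descriptors against a candidate image can be sketched as brute-force nearest-neighbor search with a distance-ratio test, a common way to reject ambiguous correspondences; the article does not specify the server's exact matcher, so this is an assumed approach:

```python
import numpy as np

def match_descriptors(query, candidate, ratio=0.7):
    """Return (query_index, candidate_index) pairs for which the nearest
    candidate descriptor is clearly closer than the second nearest."""
    matches = []
    for i, q in enumerate(query):
        d = np.linalg.norm(candidate - q, axis=1)  # distance to every candidate
        j, k = np.argsort(d)[:2]                   # two nearest neighbors
        if d[j] < ratio * d[k]:                    # ratio test rejects ambiguity
            matches.append((i, int(j)))
    return matches
```

Ranking candidate images by their number of surviving matches yields the top-five list returned to the client; on large databases the linear scan here would be replaced by an approximate nearest-neighbor index.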