Technical Design Challenges
Three fundamental steps must be followed in order for the user to hear the text that is printed on a piece of paper: the image must be captured, the image must be converted to text, and finally the text must be rendered as audible speech. In some cases, one or two of the above steps can be skipped. For example, if the user already has the text in electronic form, then the capture stage can be skipped. This is common for e-book files. Even the text-to-speech step can be skipped if the text has already been recorded by a human. These are commonly known as books on tape or audio books, with MP3 being the most popular audio format.
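The way stages drop out depending on the starting material can be sketched in a few lines. This is only an illustration of the paragraph above: the `Source` categories, the stage names, and the `stages_for` function are hypothetical and are not taken from the Intel Reader software.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical categories of starting material."""
    PRINTED_PAGE = "printed page"        # needs all three stages
    ELECTRONIC_TEXT = "electronic text"  # e.g. an e-book file
    RECORDED_AUDIO = "recorded audio"    # e.g. an audio book in MP3 format


def stages_for(source: Source) -> list[str]:
    """Return the stages needed before the user hears the content."""
    if source is Source.PRINTED_PAGE:
        return ["capture image", "convert image to text", "render text as speech"]
    if source is Source.ELECTRONIC_TEXT:
        # Text already exists, so capture and conversion are skipped.
        return ["render text as speech"]
    # A human narration already exists, so text-to-speech is skipped as well.
    return ["play recorded audio"]


for source in Source:
    print(source.value, "->", stages_for(source))
```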
The first design and usability challenge centered on camera placement. Initial prototypes placed the camera on the back of the device (Figure 3) similar to the design of most general-purpose cameras.
When the prototypes were placed in the hands of end users, their intuitive response was revealing: they stood up in order to capture pages in a book placed on a desk. Standing was the only way for them to see what was displayed on the device screen, as the camera was pointed at the text lying flat on the table. When we asked how they would use the device while sitting down, we observed a very awkward twist of the wrist to position the camera above the text, and a strain in the neck as they peered over the device to view the display. We knew we needed to develop a different solution to make the product ergonomically friendly for end users of all age groups. Interaction designers and engineers worked together to implement a solution to the problem: a vertically mounted camera placed at the bottom of the device. With the camera in this location, users can wrap both hands around the device and view the display, which sits at eye level when the user is seated, with the camera positioned directly above the printed text (see Figure 4).
Unanticipated benefits followed from this design choice. For most users, placing their elbows on a table assured them that the unit was level to the surface, and such a posture provided stability while holding the units, important for accurate imaging. For seniors, this was especially important, enabling them to rest while holding the device. For blind users, the physical location of the document between their elbows and directly in front of them while seated allowed for more accurate framing of the image.
In the midst of the human factors and usage research, we evaluated a number of end-user requirements. One significant challenge concerned the size of the device. Teens wanted the thinnest, lightest device we could design, while low-vision seniors needed buttons to be placed far enough apart to allow easy use. Blind users had limited use for the screen on the device, though it would turn out that most blind users wanted to have a visual display so they could share information with sighted people. Low-vision users wanted to have as large a screen as possible. Each of these needs pushed the industrial design in a different direction.
Placement of Buttons
There are many considerations in the placement of buttons, among which are these:
- Placement of the buttons influences the user's experience and learning curve. The buttons must be close enough together such that a user does not have to stretch uncomfortably; yet, they must be large enough and far enough apart to accommodate users with large fingers and limited dexterity, including users with limited range of motion due to arthritis and other rheumatoid conditions.
- The main function of the device, capturing images, pointed to making the image-capturing button the most prominent button on the device. The large button, at the cut corner on the right, also oriented the device for a blind person, with the camera pointing downward. For the teen or the senior, being able to push the button while aiming the device with one or two hands was an important consideration. This placement was also found to reduce motion blur induced by the user's finger movement during image capture.
- The team tested two-handed and one-handed usage models, determining that having one hand free to hold content or keep a book page flat would be instrumental to success. Following from this, the main keypad on the device (see Figure 5) was designed to be within a thumb's reach while holding the device with the right hand, as almost 90 percent of the population is right-handed.
- Human factors research also pointed to the importance of being able to move up or down in menu structures easily. A four-way directional pad with variable function, depending on context, met this need. Consumer electronic devices typically include a fifth navigation button for Select or OK, and therefore we located this button in the center of the four-way directional pad.
- For advanced usages, these five navigation buttons have multiple functions depending on context. While a document is playing, the Enter button functions as a play/pause button. This functionality, not originally planned, was added after user testing showed that many users expected this behavior. Similarly, the right and left buttons have multiple functions depending on context and usage. Holding the right button down while paused in a document allows the user to jump one page. If the user is playing back text, the right and left buttons accelerate playback progressively or jump back to recently read text.
- For the Intel Reader, a sixth button, called the "back button", was added to allow users to jump from lower-level menus to higher-level menus. In subsequent software versions, buttons would take on additional functions. For example, the back button, when held down for an extended duration, would return the user all the way to the Home menu, the top level of the user interface, rather than to just go up one level.
- Grouping of the other buttons on the device by function improved discoverability.
- Low-vision-related buttons, of no use to a blind person, and moderately useful to a dyslexic user, were placed together on the lower left. These buttons allowed for font magnification and for toggling between image view and text view.
- Voice speed changes were important to power users, those who are very comfortable with using assistive technology: these users listen to content at speeds of over 400 words per minute (four times common speech) and slow down for complex content. Similarly, all users wanted to have Favorites and Help easily accessible, which made dedicated buttons critical.
- A tactile adjuster for volume allowed one handed use, with a rocker-type button on the right side.
- Design considerations also included responsiveness of buttons, durability, and tactile feedback. For example, the button on the lower left for decreasing font size on the screen is indented, while the one for increasing font size is outdented, signaling zoom out and zoom in to the user. Similarly, each of the buttons is shaped uniquely by function, allowing users to distinguish one button from another.
- In some cases, buttons were eliminated. For example, initial designs included a voice recorder for audio notes.
The resulting placement of the buttons is shown in Figure 5.
The first major step is to capture the image of the target, which in this case is printed text. As with any camera, the target must be in focus, properly illuminated, and the capture time must be fast enough to prevent motion blur. The Intel Reader's camera subsystem was designed to be fully automatic, including auto-focus, auto-exposure, and auto-flash mechanisms. The major considerations in image capture are discussed next.
The image sensor has minimum illumination requirements. In general, a sensor that has higher sensitivity will be more expensive, and a sensor that can operate at very low-light conditions will not have sufficient dynamic range to function at very high illumination. For the usages we considered, ambient illumination will vary across a tremendous range, from as low as 100 lux in a typical residential setting, to over 10,000 lux outside on a sunny day.
The standard industry solution has been to use an integrated Xenon flash strobe, with an exposure time of a few tens of microseconds. The Xenon strobe is effective, because it can produce a large amount of illumination over a short time, with a reasonably uniform light distribution. For the Intel Reader's usage models, the target is generally 10 cm to 1 meter from the sensor, so the flash strobe can be significantly lower in power than what might be found on a standard camera that might be two to three meters from the target.
In selecting the strobe, we also had to consider the useful life of the product. A consistently used Intel Reader could easily work through 50 pages of text a day, requiring 17,000+ strobe flashes a year. Given a useful life of three years for the device, 50,000+ flashes would be a likely usage scenario. This drove designers to consider long-life strobes to ensure product durability.
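The strobe-life estimate can be checked with simple arithmetic. The sketch below assumes one flash per captured page and use on every day of the year. Neither assumption is stated in the text, and under them the totals come out slightly above the rounded figures quoted.

```python
DAYS_PER_YEAR = 365  # assumption: the device is used every day


def strobe_flashes(pages_per_day, years, flashes_per_page=1):
    """Estimate strobe firings, assuming one flash per captured page."""
    return pages_per_day * DAYS_PER_YEAR * years * flashes_per_page


per_year = strobe_flashes(pages_per_day=50, years=1)
lifetime = strobe_flashes(pages_per_day=50, years=3)

print(per_year)   # 18250, consistent with "17,000+" per year
print(lifetime)   # 54750, consistent with "50,000+" over three years
```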
The lens in front of the sensor determines the field of view. This field of view can be considered as a pyramid with its apex at the lens, as shown in Figure 6. Often this is thought of as a stack of planes, where the area of the planes increases with the distance from the sensor. However, not all possible positions within the field of view are in sharpest focus. For any given distance between the lens and the sensor, there will be a plane of ideal focus, as in Figure 6.
Since the distance between the Intel Reader and the target is adjusted by the user, the lens must be moved relative to the sensor in order to have the target in the sharpest focus. The auto-focus algorithm calculates the required lens position. This algorithm must also account for slight hand movements by the user, especially for our low-vision senior users.
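The text does not describe the auto-focus algorithm itself, but the reason the lens must move can be illustrated with the standard thin-lens equation, 1/f = 1/d_o + 1/d_i. The sketch below is an idealized model, not the Intel Reader's method, and the 5 mm focal length is a hypothetical value chosen only for illustration. It computes the lens-to-sensor distance needed at the near and far ends of the 10 cm to 1 meter working range mentioned earlier.

```python
def lens_to_sensor_mm(focal_length_mm, target_distance_mm):
    """Thin-lens image distance d_i for a target at distance d_o.

    Derived from 1/f = 1/d_o + 1/d_i, so d_i = f * d_o / (d_o - f).
    """
    if target_distance_mm <= focal_length_mm:
        raise ValueError("target must be farther away than the focal length")
    return focal_length_mm * target_distance_mm / (target_distance_mm - focal_length_mm)


FOCAL_LENGTH_MM = 5.0  # hypothetical focal length, not from the source

near = lens_to_sensor_mm(FOCAL_LENGTH_MM, 100.0)    # target 10 cm away
far = lens_to_sensor_mm(FOCAL_LENGTH_MM, 1000.0)    # target 1 m away
travel = near - far                                 # lens travel needed

print(f"near: {near:.3f} mm, far: {far:.3f} mm, travel: {travel:.3f} mm")
```

Under these assumed numbers the lens moves only a fraction of a millimeter across the whole working range, which suggests why small hand movements matter to the focus calculation.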
While the region of sharpest focus is shaped as a plane, the planes immediately in front of and behind that plane are still in reasonably sharp focus. The distance between the nearest and farthest planes of acceptable focus is known as the depth of field. For usage models associated with the Intel Reader, the target material will often not be perfectly flat. For example, with a thick book opened to a page near the beginning, the distance from the sensor to the left side of the left-hand page may be several centimeters greater than the distance to the right side of the right-hand page. The lens must thus have enough depth of field to account for that difference.
Preventing Motion Blur
As the user holds the device, some natural motion will be imparted to the camera. The complete image must be captured while the image is stationary on the sensor; therefore, the exposure time must be set correctly relative to the potential motion of the camera. For some users, especially elderly users with tremors, the motion may be significant. If the exposure time is too long, the image will be blurred, resulting in poor accuracy when the optical character recognition is attempted.
To prevent this blurring, the exposure must be fast. Of course, as the exposure time is reduced, the amount of light hitting the sensor will also be reduced. This drives sensitivity requirements on the image sensor, as well as performance requirements on the illumination source.
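The tradeoff between exposure time and blur can be made concrete with a simple linear model: the smear on the sensor equals the speed at which the image drifts across it multiplied by the exposure time. The function names and the numbers used below (drift speed and blur budget) are hypothetical and serve only to show the relationship.

```python
def blur_pixels(image_speed_px_per_s, exposure_s):
    """Smear length on the sensor for a given drift speed and exposure time."""
    return image_speed_px_per_s * exposure_s


def max_exposure_s(image_speed_px_per_s, allowed_blur_px):
    """Longest exposure that keeps the smear within the allowed blur."""
    return allowed_blur_px / image_speed_px_per_s


# Hypothetical example: hand tremor moves the image at 2000 pixels per second
# and the OCR stage tolerates at most 1 pixel of smear.
limit = max_exposure_s(image_speed_px_per_s=2000.0, allowed_blur_px=1.0)
print(f"maximum exposure: {limit * 1000:.2f} ms")
```

Halving the exposure time halves the blur, but it also halves the light collected, which is why the sensor sensitivity and the illumination source have to make up the difference.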
Even with the relatively low weight of the device, when capturing a large number of images, such as an entire magazine, newspaper, or book, the user may become fatigued. In addition, the book might not naturally stay flat on its own. The user might need some significant dexterity to hold the Intel Reader in one hand while holding the target open with the other hand, and at the same time not obscure the text of the target with his or her hand.
To address the user requirement of avoiding fatigue, especially for our low-vision senior group, and to allow the bound printed material to be held open, the team developed an accessory called the Intel Portable Capture Station, as in Figure 7. The Intel Portable Capture Station has a tray that is large enough to keep magazines or books open, with two pages exposed in the view angle of the Intel Reader's camera. The Intel Reader is placed into a holder that determines the exact height above the tray and positions the camera's field of view.
Design of the Intel Portable Capture Station included a number of considerations. User focus groups identified durability as a key requirement, since this device is likely to be used in a public school setting. Portability is another key requirement, since the user may choose to bring it to work, school, or home.
A key element of the design of the Intel Portable Capture Station and the Intel Reader included the ability to transition from mobile to fixed usages. When the Intel Reader is used in conjunction with the Intel Portable Capture Station, bulk capturing of text becomes easy. However, for single worksheets and news handouts, capturing the text by using the Intel Reader in hand-held mode is quicker and more convenient.
Initial design included the possibility of an illumination system in the Intel Portable Capture Station itself. The additional complexity and associated cost of the system, including the electrical subsystems and the cost of the illumination elements, made using the lighting system on the Intel Reader a preferable choice.
Other user design requirements included an indented tray on the Intel Portable Capture Station to give users a physical guide to the Intel Reader's viewable area when docked in the Intel Portable Capture Station. This allows both sighted and non-sighted users to orient the target material and be confident that all text is within the camera's field of view.
The Intel Portable Capture Station was designed to have a separate button to initiate image capture by using the Intel Reader. Because the capture button is placed on the bottom of the tray, the user does not have to reach up high to activate the shoot button on the Intel Reader that is placed in the holder, about two feet above the text. Users with reduced mobility in their arms find this feature valuable. In addition, seniors often have difficulty raising their arms above their shoulders, and the decision to place the capture button on the tray base was based on end-user feedback.
Optical Character Recognition (OCR)
Once the image has been captured, the next major step is to convert the raw image to text. This process is known as Optical Character Recognition (OCR). OCR software packages are commercially available and one was selected for incorporation into the Intel Reader.
The OCR algorithm must examine the pixels associated with the image and derive the corresponding text. A key step involves finding the natural boundaries: the spaces between columns, paragraphs, words, and characters. Individual characters may only be a few pixels apart.
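One classic way to find boundaries between characters is a projection profile: count the dark pixels in each column and treat empty columns as gaps. The sketch below applies this to a tiny hand-made binary image. It illustrates the general idea only. The commercial OCR package used in the Intel Reader is not described in the text and may work quite differently.

```python
def column_profile(rows):
    """Count dark pixels ('#') in each column of a binary image."""
    width = len(rows[0])
    return [sum(row[x] == "#" for row in rows) for x in range(width)]


def segment_columns(profile):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # ink begins: open a segment
        elif count == 0 and start is not None:
            segments.append((start, x))    # blank column: close the segment
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


# A 7x5 image containing the letters "H" and "I" separated by one blank column.
image = [
    "#.#.###",
    "#.#..#.",
    "###..#.",
    "#.#..#.",
    "#.#.###",
]

profile = column_profile(image)
print(profile)                   # [5, 1, 5, 0, 2, 5, 2]
print(segment_columns(profile))  # [(0, 3), (4, 7)]
```

The single blank column between the two letters shows how little separation there may be to work with, which is why resolution and focus matter so much to recognition accuracy.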
Key challenges to a highly accurate OCR process include having a sufficient number of pixels per character, having clear focus on the characters, and minimizing interference such as glare. The differences between individual characters, such as the lowercase letter i and the lowercase letter l, may be very subtle; therefore, acquiring a very accurate image of each character is critically important.
When capturing multiple pages, such as chapters of a book, the software is designed to perform background OCR processing and allow users to listen to the first few pages that have been processed, if they choose to. Typically, by the time a user captures three or four pages of a multi-page document, the first page is available to be read aloud. This was a design tradeoff between optimizing for time to first sentence versus optimizing for capturing multi-page documents; we chose the latter based on our usage models.
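Background processing of this kind is commonly built as a producer and a consumer joined by a queue: the capture step adds pages to the queue while a worker recognizes them in order. The sketch below shows that pattern with Python's standard `queue` and `threading` modules. The `fake_ocr` function is a hypothetical stand-in for the real engine, and nothing here is taken from the Intel Reader software.

```python
import queue
import threading


def fake_ocr(page_image):
    """Hypothetical stand-in for the OCR engine: returns text for a page."""
    return f"text of {page_image}"


def ocr_worker(captured, recognized):
    """Recognize captured pages in order until a None sentinel arrives."""
    while True:
        page = captured.get()
        if page is None:
            break
        recognized.append(fake_ocr(page))


captured = queue.Queue()   # pages waiting for OCR
recognized = []            # pages ready to be read aloud, in capture order

worker = threading.Thread(target=ocr_worker, args=(captured, recognized))
worker.start()

for n in range(1, 5):      # the user keeps capturing while OCR runs
    captured.put(f"page {n}")

captured.put(None)         # signal that no more pages are coming
worker.join()

print(recognized[0])       # the first page is the first one available
```

Because the queue preserves capture order, the first page becomes available for playback while later pages are still being captured or processed.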
Text to Speech
After capturing the image of a page of text, the final stage is reading it aloud to the user, often called "text to speech". This requirement also drives a wide variety of design decisions. Naturally, as users start to listen to more and more content, their ability to comprehend that content will also improve. For example, a new user may start out being comfortable with a speed of 110 words per minute, roughly the rate of common human speech, but may later find that speed frustratingly slow. However, even users who are comfortable with 300 words per minute for simple or familiar content may need to go much slower for complex or unfamiliar content, such as a technical journal. Thus, the ability to scale to much higher rates will be of great importance to the long-term appeal of any device that reads aloud to users. The Intel Reader operates at rates as slow as 70 words per minute and allows playback of up to 500 words per minute, with increments at appropriate rates in between.
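A speed control of this kind amounts to stepping a rate up or down and clamping it to the supported range. In the sketch below, the 70 and 500 words-per-minute limits come from the text, while the 10 words-per-minute step is an assumption, since the actual increments are not specified.

```python
MIN_WPM = 70    # slowest rate stated in the text
MAX_WPM = 500   # fastest rate stated in the text
STEP_WPM = 10   # assumption: the real increment sizes are not given


def change_rate(current_wpm, steps):
    """Move the speech rate by a number of steps, clamped to the valid range."""
    requested = current_wpm + steps * STEP_WPM
    return max(MIN_WPM, min(MAX_WPM, requested))


def listening_minutes(word_count, wpm):
    """Time needed to hear a passage at a given speech rate."""
    return word_count / wpm


print(change_rate(110, +1))   # 120
print(change_rate(495, +3))   # 500, clamped at the maximum
print(change_rate(70, -1))    # 70, clamped at the minimum
print(listening_minutes(3000, 300))  # 10.0
```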
Rendering Existing Electronic Content
In designing the Intel Reader, we focused on the ecosystem of other consumer electronic devices within which the device would function, as well as the environment of electronic files for reading materials, such as plain text files or DAISY format (described later). Connecting to a personal computer, playing MP3 audio content, or even generating MP3 to play on other devices were critical to the usage model. To enable this transfer, the designs included the ability to access files on the system and transfer them on or off, by using a USB connection to a PC or by using USB flash drives.
Imported text files can be played in the same manner as text that has been derived by using the OCR capabilities. However, files in MP3 and other digital audio formats require an entirely different playback mechanism. Instead of using the text-to-speech engine, a digital audio decode engine is required.
While MP3 is a popular format for audio files, it is not the default for accessible text. The DAISY Consortium has established the Digital Accessible Information System (DAISY) standards for Digital Talking Books. The DAISY format includes both digital audio files with human narration and the corresponding text of that narration. The Intel Reader was designed to play both text-only and audio versions of these files.
Over time, the user may accumulate a very large number of files. A familiar usage model is a filing cabinet, where related files are stored in the same drawer or folder within a drawer. As a user captures files, the files must be identified. To simplify this identification process, we set a default where the title of a document is identified, based on text size and placement on the page. The first 25 characters of this title become the title of the book. In a case where there is no clear title, the first 25 characters identified become the default title. The user may rename the file and place it in one of several categories. To distinguish between documents with similar content, the file created also records the date and time.
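The default naming rule can be sketched as follows. The 25-character limit and the fallback to the first recognized characters come from the text. Title detection itself (based on text size and placement) is not modeled here, and the function names, the whitespace normalization, and the record layout are assumptions made for illustration.

```python
from datetime import datetime

TITLE_LENGTH = 25  # from the text: the first 25 characters become the title


def default_title(recognized_text, detected_title=None):
    """Use the detected title if there is one, else the start of the text."""
    source = detected_title if detected_title else recognized_text
    # Assumption: runs of whitespace are collapsed before truncating.
    return " ".join(source.split())[:TITLE_LENGTH]


def document_record(recognized_text, captured_at, detected_title=None):
    """Pair the default title with the capture date and time."""
    return {
        "title": default_title(recognized_text, detected_title),
        "captured": captured_at.isoformat(timespec="seconds"),
    }


record = document_record(
    "Minutes of the weekly design review meeting held in the lab",
    datetime(2009, 11, 10, 9, 30, 0),
)
print(record)
```

Storing the timestamp alongside the truncated title is what lets two documents that begin with the same words remain distinguishable.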
Sorting and managing content can also be done through a computer-based architecture, by attaching the device to a computer via USB. Using a standard Windows or Mac interface to do this allows users with preferred screen-reading software to navigate the device content through the tools they use on a daily basis. Our team made a specific decision not to develop a separate user interface for the computer, allowing users to navigate the Intel Reader through the tools they already use.
For users with vision, even limited vision, an electronic display, such as an LCD, can provide significant information. Some uses are obvious, such as display of functional menus or a preview of the text to be captured. A less obvious but very important use is to display the text as it is being read aloud by the device. For users with limited vision, seeing the text along with hearing the words yields a significantly improved reading experience. The same holds true for users with specific learning disabilities: multiple forms of input increase comprehension. Given that users have differing levels of sight, the font size for the displayed text must be adjustable, even up to the point where a single word covers the entire displayed area.
A number of factors drive the decision on the size of the LCD panel to include. First, the LCD panel must be large enough to be useful. Users with limited vision will require very large fonts. For some users, reading is easier when many lines of text are on the screen; low-vision users may have enough sight to shift between words while reading, thereby gaining context as they go. For users with specific learning disabilities, stochastic reading, placing one word at a time on the screen, can be useful as this allows them to avoid scanning across a line of text, a common challenge for readers with specific learning disabilities.
Overall, having a larger display incurs many burdens on the system: the larger the display, the higher the cost, weight, and power consumption. One also needs to consider the business implications. Settling on a size that conforms to industry norms, for example, those commonly used for gaming systems, allows the design to benefit from the economies of scale for these adjacent markets.
The display will be viewed under a wide variety of lighting conditions, ranging from dark rooms to bright sunlight. Given this requirement, the LCD panel must include integrated backlighting, with LED-based illumination being the most power efficient. In order to have sufficient intensity, the voltage must be raised beyond the 12 volts provided by the main power supply, and pulse-width modulation must be applied to allow the intensity to be varied.
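Pulse-width modulation varies brightness by switching the backlight fully on and fully off very quickly, so that the average light output follows the fraction of time the light is on (the duty cycle). The sketch below shows that relationship. The 1 kHz switching frequency and the function names are hypothetical, and a real backlight driver would involve more than this idealized model.

```python
def average_intensity(duty_cycle, peak_intensity=1.0):
    """Average light output for a given duty cycle (idealized linear model)."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty cycle must be between 0 and 1")
    return duty_cycle * peak_intensity


def on_time_us(duty_cycle, pwm_frequency_hz):
    """Microseconds per PWM period during which the backlight is on."""
    period_us = 1_000_000 / pwm_frequency_hz
    return duty_cycle * period_us


PWM_FREQUENCY_HZ = 1000  # hypothetical switching frequency

for duty in (0.25, 0.5, 1.0):
    print(duty, average_intensity(duty), on_time_us(duty, PWM_FREQUENCY_HZ))
```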
A wide viewable angle is also important: it allows users to hold the device in a lap or observe it in the Intel Portable Capture Station while seated or standing 30+ degrees above or below the screen.
For our blind users, having a screen is not critical. Their typical usage is to turn the screen off to save battery power and predominantly use menu voicing to determine device status; however, a number of blind users indicated that they would prefer to have a viewable screen to share content with sighted people when necessary.
Framing printed text is a challenge without audio cues to indicate whether the text is within the field of view of the camera. When specific usage tips are provided, blind users have more success framing the document. For example, a tip could be advising users to hold the Intel Reader a distance from the paper equal to the largest dimension of the page being imaged: that is, for an 8x11-inch piece of paper, the user is advised to hold the device 11 inches above the document to ensure it is captured. Placing the Intel Reader in the center of the document to be imaged and raising it directly above the document also proved useful for blind and low-vision users in testing. When capturing documents at home or in the office, having the Intel Reader docked in the Intel Portable Capture Station removes the height variable, and the user is confident that anything placed within the defined borders of the base will be captured by the Intel Reader's camera.
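The framing tip reduces to a one-line rule: hold the device at a height equal to the page's largest dimension. The helper below simply encodes that rule, and its name is hypothetical. It does not model the camera's actual field of view.

```python
def suggested_hold_height(page_width, page_height):
    """Height above the page, in the same units as the page dimensions."""
    return max(page_width, page_height)


print(suggested_hold_height(8, 11))    # 11 (inches), as in the example above
print(suggested_hold_height(21, 14))   # 21, for a page in landscape orientation
```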
Storage requirements were defined based upon usage. For a multi-megapixel color image, the raw images may be over 10 megabytes each. Text is a very compact format, with one character generally requiring only 1 byte of memory. That same 10 megabytes can store over 5,000 pages of text, and if the text is compressed, that capacity can typically be increased by a factor of 2 to 4. For text capture, therefore, the memory has to be large enough to handle capture of a few dozen raw images.
Some of the storage space must be allocated for audio files and text files. Audio data can be tightly compressed due to gaps between words, the low base frequency, and the relatively small dynamic range of the human voice. MP3-type compression can yield files as small as 500 kbytes per minute. By far, the least amount of storage is required for text, as one minute of text at 160 words per minute might require only 800 bytes.
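The storage figures in the two preceding paragraphs can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text and assumes decimal units (1 megabyte = 1,000,000 bytes, 1 kbyte = 1,000 bytes). The derived per-page and per-word sizes are implications of those figures, not values stated by the source.

```python
RAW_IMAGE_BYTES = 10_000_000        # "over 10 megabytes" per raw image
PAGES_IN_SAME_SPACE = 5_000         # pages of text that fit in that space
AUDIO_BYTES_PER_MINUTE = 500_000    # "500 kbytes per minute" for MP3-type audio
TEXT_BYTES_PER_MINUTE = 800         # one minute of text at 160 words per minute
WORDS_PER_MINUTE = 160

bytes_per_page = RAW_IMAGE_BYTES // PAGES_IN_SAME_SPACE
bytes_per_word = TEXT_BYTES_PER_MINUTE / WORDS_PER_MINUTE
audio_to_text_ratio = AUDIO_BYTES_PER_MINUTE / TEXT_BYTES_PER_MINUTE
audio_kbit_per_s = AUDIO_BYTES_PER_MINUTE * 8 / 60 / 1000

print(bytes_per_page)               # 2000 bytes of text per page
print(bytes_per_word)               # 5.0 bytes per word
print(audio_to_text_ratio)          # 625.0 times more storage for audio
print(round(audio_kbit_per_s, 1))   # 66.7 kbit/s
```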
Testing showed that power users were likely to move content between the device and an external drive or a computer. This led the team to weigh cost when specifying the size of the internal drive. Compression of images was also a critical element: the device obtains good OCR accuracy from full-resolution images and then, once the content has been read, stores it in a low-resolution JPEG format.
Creating a Complementary Environment
To assist in moving content into and out of the Intel Reader, a USB interface allows direct connection to any modern computer. The Intel Reader appears to the computer as would any storage-class device, such as is done with many digital cameras or external disk drives. This allows direct re-use of existing tools for movement of data files. Users can thus download content from remote servers to their computer, and then move it to the Intel Reader. Conversely, text captured by using the Intel Reader can be archived on the computer or sent to other systems.
In some cases, the user may not have access to a computer. In examining alternative data transfer methods, the USB flash drive appeared to be a great choice. No cable is required, as the flash drive is plugged directly into the corresponding USB host port on the Intel Reader. In addition to data transfers, the USB host port also allows some useful USB peripherals to be connected, such as a keyboard. This proved important in user testing for people who use alternative input devices, such as pointer or breath-activated keyboards. The keyboards generally function according to standard USB interfacing, allowing people with limited mobility and special-needs computing devices to use the Intel Reader. As an extension of this functionality, the Intel Portable Capture Station also allows for USB connectivity, enabling the user to drive the device while docked in the station.