Channels ▼
RSS

Parallel

Amazon SimpleDB: A Simple Way to Store Complex Data


Beyond Trivial Data

The trivial data we used for the examples was sufficient to get a feel for SimpleDB. To see how suitable SimpleDB might be for real world use, we need real world data. One of the sites that my search for large data sets led me to was the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, where I found the data set that was used in a contest that was run by Netflix. One of the files, movie_titles.txt, contains a unique identifier, title and release data for each of 17,770 movies. At .5MB it is undoubtedly not the largest data set you have ever worked with but is sufficient for the task at hand. Also available are 17,770 data files, one for each movie, containing an anonymized customer identifier, the number of stars that customer gave the movie and the date of the rating. The total size of these files is 2GB. My original thought was to store all of the title/release date data and all of the review data in SimpleDB. Table 1, which gives the limitations of SimpleDB, shows why I settled for something a little less ambitious.

Table 1: Amazon SimpleDB Limits.

Although a total of 10GB per domain times 100 domains per account seems a lot, 256 attribute name-value pairs per item proved to be an obstacle that I just could not overcome no matter whether I made each movie, each customer or each review date an item. Since my original intention was to write a simple demo program, rather than agonize over ways I might arrange the data, I settled for less and wrote NetflixDataLoader.java (available here). It reads each of the ratings files and derives the following data, which it assigns as attributes to each movie:

  • Minimum number of stars
  • Maximum number of stars
  • Average number of stars
  • Number of times movie was reviewed
  • Earliest review date
  • Latest review date

The program also assigns the title and release date as attributes. I use the movie's unique identifier as the item. NOTE: I originally tried to use the movie's title as the item rather than an attribute only to find that the title of both movies 15846 and 16045 was "A Raisin in the Sun".

If you examine the program, you will see that it consists of a pool of worker threads, each of which pops a file name from a list of data files, calculates the ratings data, creates a PutAttributesRequest and uses the client's putAttributes() method to write to SimpleDB.

The code that creates the PutAttributesRequest is in the updateSimpleDB() method and looks like this:


List<ReplaceableAttribute> movieAttributes = new ArrayList<ReplaceableAttribute>();
String title = movieTitleMap.get(titleKey).getTitle();
movieAttributes.add(new ReplaceableAttribute("title", title, false));
String releaseDate = movieTitleMap.get(titleKey).getReleaseDate();
movieAttributes.add(new ReplaceableAttribute("release_date", releaseDate,
       false));
movieAttributes.add(new ReplaceableAttribute("times_reviewed",
       Integer.toString(ri.getnReviews()), false));
movieAttributes.add(new ReplaceableAttribute("earliest_review",
       sdf.format(ri.getEarliestReview()), false));
movieAttributes.add(new ReplaceableAttribute("latest_review",
       sdf.format(ri.getLastReview()), false));
movieAttributes.add(new ReplaceableAttribute("min_stars",
       Integer.toString(ri.getMinStars()), false));
movieAttributes.add(new ReplaceableAttribute("max_stars",
       Integer.toString(ri.getMaxStars()), false));
movieAttributes.add(new ReplaceableAttribute("average_stars",
       ri.getAvgStars().toPlainString(), false));
PutAttributesRequest par = new PutAttributesRequest(DOMAIN_NAME, titleKey,
       movieAttributes);

This line in the sendRequest() method sends the request:


PutAttributesResponse r = client.putAttributes(par);

If you look at the code in the sendRequest() method in detail, you will see that any time the putAttributes() method returns an error, the program calculates an amount of time that it should sleep before retrying. If n is the number of retries that have been attempted, the sleep interval is calculated as a random number between 0 and 4n * 100 milliseconds. This is referred to as "exponential backoff" and its use is requested by Amazon.

If you run the program specifying a worker thread pool size of 1, it is the equivalent of running a non-threaded version of the program . Here is what the program reports:


17769 files processsed - elapsed = 5259773
Thread # 1[TID = 10] processing file mv_0017770.txt
Thread # 1[TID = 10]  processed file mv_0017770.txt in 366 milliseconds
17770 files processsed - elapsed = 5260139
total time: 1 hours, 25 minutes, 23 seconds, 444 milliseconds

Having written numerous programs that process far more than 17,000 files, I knew that an hour and 25 minutes was a lot longer than I would have figured so I started to explore what might be responsible. My first thought was to comment out the call to the sendRequest() method, which makes the putAttributes() call to SimpleDB and rerun the program. Now it displayed the following:


17769 files processsed - elapsed = 694787
Thread # 1[TID = 10] processing file mv_0017770.txt
Thread # 1[TID = 10]  processed file mv_0017770.txt in 33 milliseconds
17770 files processsed - elapsed = 694820
total time: 11 minutes, 34 seconds, 861 milliseconds

So, it appears that most of the time was attributable to the call to SimpleDB. Now, before you write off SimpleDB, let me point out that it was designed to thrive on processing parallel requests; so I uncomment the call to sendRequest() and re-ran the program specifying a pool size of 10. Here is what I saw:


17769 files processsed - elapsed = 680912
Thread # 7[TID = 16]  processed file mv_0017764.txt in 480 milliseconds
17770 files processsed - elapsed = 680974
total time: 11 minutes, 21 seconds, 129 milliseconds

So, how many threads should we use? I experimented with different values only to discover that there is a point of diminishing returns but was unable to define that point. I will say that 1000 threads took in excess of 20 minutes. Feel free to experiment for yourself. NOTE: These timings were observed when I ran the program on my 2.8-GHz Intel Core Duo MacBook Pro and my network bandwidth was 3 Mb/s DSL. When I ran the program using a pool of 10 threads on Amazon's Elastic Compute Cloud (EC2) with the files stored on Elastic Block Store (EBS), it took 58 minutes, 12 seconds.

Besides sending the PutAttributeRequests in parallel, there is another way you can improve performance. You can use a single BatchPutAttributesRequest to send attributes for up to 25 items in a single call. The program NetflixAlternateDataLoader.java (available here) does that. Each of the worker threads assembles all of the attributes for a single item but instead of sending it to SimpleDB, it simply adds it to a ConcurrentLinkedQueue. A separate thread, BatchRequestProcessorThread, dequeues the items and each time 25 have been dequeued, it builds a BatchPutAttributesRequest and sends it to SimpleDB. Here is output from that program using a thread pool size of 10:


total time: 6 minutes, 42 seconds, 542 milliseconds
enqueued 17770
dequeued 17770
transmitted 17770
# transmissions = 711

Notice that the number of transmissions is 711, which is only about 4% that was required when data for one item at a time was sent.

Conclusion

The code I've examined proves that SimpleDB is easy to use. Some developers find that libraries other than the one I chose make it even easier. How you choose to store the data is entirely up to you and you do not have to involve a DBA. Adding one or more attributes does not have a disruptive effect. If, for example, we decided to include standard deviation in addition to min, max and average in our ratings attributes, we just do it on a going forward basis without have to go back and retrofit existing items with default or null values for the new attribute. We discovered that you do have to be aware of and willing to live with characteristics such as "eventually consistent". So, is SimpleDB for you? Only you can answer that question.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video