Beyond Trivial Data
The trivial data we used for the examples was sufficient to get a feel for SimpleDB. To see how suitable SimpleDB might be for real-world use, we need real-world data. My search for large data sets led me to the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, where I found the data set used in the contest run by Netflix. One of the files, movie_titles.txt, contains a unique identifier, title, and release date for each of 17,770 movies. At 0.5 MB it is undoubtedly not the largest data set you have ever worked with, but it is sufficient for the task at hand. Also available are 17,770 data files, one for each movie, each containing an anonymized customer identifier, the number of stars that customer gave the movie, and the date of the rating. The total size of these files is 2 GB. My original thought was to store all of the title/release-date data and all of the review data in SimpleDB. Table 1, which gives the limitations of SimpleDB, shows why I settled for something a little less ambitious.
Although a total of 10 GB per domain times 100 domains per account seems like a lot, the limit of 256 attribute name-value pairs per item proved to be an obstacle I just could not overcome, no matter whether I made each movie, each customer, or each review date an item. Since my original intention was to write a simple demo program, rather than agonize over ways I might arrange the data, I settled for less and wrote NetflixDataLoader.java (available here). It reads each of the ratings files and derives the following data, which it assigns as attributes to each movie:
- Minimum number of stars
- Maximum number of stars
- Average number of stars
- Number of times movie was reviewed
- Earliest review date
- Latest review date
The program also assigns the title and release date as attributes. I use the movie's unique identifier as the item name. NOTE: I originally tried to use the movie's title as the item name rather than an attribute, only to find that the title of both movies 15846 and 16045 is "A Raisin in the Sun".
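The per-movie aggregation described above can be sketched as a single accumulator class. This is a minimal sketch, assuming Netflix rating lines of the form customerId,stars,yyyy-MM-dd; the class and accessor names echo the getters used later in the article, but the field types are assumptions (for instance, a double average here, where the real loader appears to use BigDecimal, and String dates rather than Date):

```java
// Sketch of a per-movie ratings accumulator; field types are assumptions.
public class RatingsInfo {
    private int minStars = Integer.MAX_VALUE;
    private int maxStars = Integer.MIN_VALUE;
    private long totalStars = 0;
    private int nReviews = 0;
    private String earliest = null;
    private String latest = null;

    // Consumes one rating line of the form "customerId,stars,yyyy-MM-dd".
    public void addRating(String line) {
        String[] parts = line.split(",");
        int stars = Integer.parseInt(parts[1]);
        String date = parts[2];
        minStars = Math.min(minStars, stars);
        maxStars = Math.max(maxStars, stars);
        totalStars += stars;
        nReviews++;
        // ISO dates compare correctly as strings
        if (earliest == null || date.compareTo(earliest) < 0) earliest = date;
        if (latest == null || date.compareTo(latest) > 0) latest = date;
    }

    public int getMinStars() { return minStars; }
    public int getMaxStars() { return maxStars; }
    public int getnReviews() { return nReviews; }
    public double getAvgStars() { return (double) totalStars / nReviews; }
    public String getEarliestReview() { return earliest; }
    public String getLastReview() { return latest; }
}
```

One pass over a movie's ratings file, feeding each line to addRating(), yields all six derived attributes listed above.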
If you examine the program, you will see that it consists of a pool of worker threads, each of which pops a file name from a list of data files, calculates the ratings data, creates a PutAttributesRequest and uses the client's putAttributes() method to write to SimpleDB.
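The worker-pool structure just described can be sketched as follows. This is an illustration only, assuming a shared queue of file names; processFile() is a hypothetical stand-in for the real ratings calculation and putAttributes() call:

```java
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a pool of workers popping file names from a shared queue.
public class WorkerPool {
    public static int run(Queue<String> files, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger processed = new AtomicInteger();
        for (int i = 0; i < poolSize; i++) {
            pool.execute(() -> {
                String name;
                // poll() is thread-safe; each file is taken by exactly one worker
                while ((name = files.poll()) != null) {
                    processFile(name);
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    // Placeholder for: read the ratings file, compute the stats, putAttributes(...)
    private static void processFile(String name) { }
}
```

Because each worker simply polls the queue until it is empty, the pool size can be varied without any other change to the program.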
The code that creates the PutAttributesRequest is in the updateSimpleDB() method and looks like this:
List<ReplaceableAttribute> movieAttributes = new ArrayList<ReplaceableAttribute>();
String title = movieTitleMap.get(titleKey).getTitle();
movieAttributes.add(new ReplaceableAttribute("title", title, false));
String releaseDate = movieTitleMap.get(titleKey).getReleaseDate();
movieAttributes.add(new ReplaceableAttribute("release_date", releaseDate, false));
movieAttributes.add(new ReplaceableAttribute("times_reviewed", Integer.toString(ri.getnReviews()), false));
movieAttributes.add(new ReplaceableAttribute("earliest_review", sdf.format(ri.getEarliestReview()), false));
movieAttributes.add(new ReplaceableAttribute("latest_review", sdf.format(ri.getLastReview()), false));
movieAttributes.add(new ReplaceableAttribute("min_stars", Integer.toString(ri.getMinStars()), false));
movieAttributes.add(new ReplaceableAttribute("max_stars", Integer.toString(ri.getMaxStars()), false));
movieAttributes.add(new ReplaceableAttribute("average_stars", ri.getAvgStars().toPlainString(), false));
PutAttributesRequest par = new PutAttributesRequest(DOMAIN_NAME, titleKey, movieAttributes);
This line in the sendRequest() method sends the request:
PutAttributesResponse r = client.putAttributes(par);
If you look at the code in the sendRequest() method in detail, you will see that any time the putAttributes() method returns an error, the program calculates an amount of time that it should sleep before retrying. If n is the number of retries that have been attempted, the sleep interval is calculated as a random number between 0 and 4^n * 100 milliseconds. This is referred to as "exponential backoff," and Amazon requests that clients use it.
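The delay calculation can be sketched in a few lines. This is a minimal sketch of the interval formula only, with a random interval between 0 and 4^n * 100 milliseconds; the method name is hypothetical, and the real sendRequest() also caps the number of retries:

```java
import java.util.Random;

// Sketch of the exponential-backoff delay: a random number of milliseconds
// between 0 and 4^n * 100, where n is the number of retries so far.
public class Backoff {
    private static final Random random = new Random();

    public static long nextDelayMillis(int retries) {
        long upperBound = (long) Math.pow(4, retries) * 100L;
        return (long) (random.nextDouble() * upperBound);
    }
}
```

Before retry n, the worker would call Thread.sleep(Backoff.nextDelayMillis(n)), so the average wait quadruples with each failed attempt while the randomness keeps competing threads from retrying in lockstep.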
If you run the program specifying a worker thread pool size of 1, it is the equivalent of running a non-threaded version of the program. Here is what the program reports:
17769 files processsed - elapsed = 5259773
Thread # 1[TID = 10] processing file mv_0017770.txt
Thread # 1[TID = 10] processed file mv_0017770.txt in 366 milliseconds
17770 files processsed - elapsed = 5260139
total time: 1 hours, 25 minutes, 23 seconds, 444 milliseconds
Having written numerous programs that process far more than 17,000 files, I knew that an hour and 25 minutes was far longer than I would have expected, so I started to explore what might be responsible. My first step was to comment out the call to the sendRequest() method, which makes the putAttributes() call to SimpleDB, and rerun the program. Now it displayed the following:
17769 files processsed - elapsed = 694787
Thread # 1[TID = 10] processing file mv_0017770.txt
Thread # 1[TID = 10] processed file mv_0017770.txt in 33 milliseconds
17770 files processsed - elapsed = 694820
total time: 11 minutes, 34 seconds, 861 milliseconds
So, it appears that most of the time was attributable to the calls to SimpleDB. Now, before you write off SimpleDB, let me point out that it was designed to thrive on parallel requests; so I uncommented the call to sendRequest() and re-ran the program specifying a pool size of 10. Here is what I saw:
17769 files processsed - elapsed = 680912
Thread # 7[TID = 16] processed file mv_0017764.txt in 480 milliseconds
17770 files processsed - elapsed = 680974
total time: 11 minutes, 21 seconds, 129 milliseconds
So, how many threads should we use? I experimented with different values only to discover that there is a point of diminishing returns, though I was unable to pin it down precisely. I will say that 1,000 threads took in excess of 20 minutes. Feel free to experiment for yourself. NOTE: These timings were observed when I ran the program on my 2.8-GHz Intel Core Duo MacBook Pro over a 3-Mb/s DSL connection. When I ran the program using a pool of 10 threads on Amazon's Elastic Compute Cloud (EC2), with the files stored on Elastic Block Store (EBS), it took 58 minutes, 12 seconds.
Besides sending the PutAttributesRequests in parallel, there is another way you can improve performance: a single BatchPutAttributesRequest can send attributes for up to 25 items in one call. The program NetflixAlternateDataLoader.java (available here) does that. Each worker thread assembles all of the attributes for a single item, but instead of sending them to SimpleDB, it simply adds the item to a ConcurrentLinkedQueue. A separate thread, BatchRequestProcessorThread, dequeues the items and, each time 25 have been dequeued, builds a BatchPutAttributesRequest and sends it to SimpleDB. Here is output from that program using a thread pool size of 10:
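The batching logic in BatchRequestProcessorThread amounts to draining the shared queue in groups of 25. Here is a sketch of that grouping, written generically; the element type stands in for the per-item attribute bundle, and each returned group is what would become one BatchPutAttributesRequest:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of draining a shared queue into batches of at most 25 items,
// the SimpleDB limit for one BatchPutAttributesRequest.
public class Batcher {
    public static final int BATCH_SIZE = 25;

    public static <T> List<List<T>> drainInBatches(Queue<T> queue) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(BATCH_SIZE);
        T item;
        while ((item = queue.poll()) != null) {
            current.add(item);
            if (current.size() == BATCH_SIZE) {
                batches.add(current);
                current = new ArrayList<>(BATCH_SIZE);
            }
        }
        // send whatever is left as a final, partial batch
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```

With 17,770 items this yields 710 full batches plus one partial batch of 20, for 711 in all, which matches the transmission count the program reports below.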
total time: 6 minutes, 42 seconds, 542 milliseconds
enqueued 17770
dequeued 17770
transmitted 17770
# transmissions = 711
Notice that the number of transmissions is 711, which is only about 4% of the number required when data for one item at a time was sent.
The code we've examined shows that SimpleDB is easy to use, and some developers find that libraries other than the one I chose make it even easier. How you choose to store the data is entirely up to you, and you do not have to involve a DBA. Adding one or more attributes does not have a disruptive effect: if, for example, we decided to include the standard deviation in addition to the min, max, and average in our ratings attributes, we could just do it on a going-forward basis, without having to go back and retrofit existing items with default or null values for the new attribute. We did discover that you have to be aware of, and willing to live with, characteristics such as eventual consistency. So, is SimpleDB for you? Only you can answer that question.