Amazon Web Services has announced the public beta of Amazon Elastic MapReduce, a web service that lets businesses, researchers, data analysts, and developers process large volumes of data. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). With Amazon Elastic MapReduce, customers can provision as much or as little capacity as they need for data-intensive tasks such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. As with all AWS services, Amazon Elastic MapReduce customers pay only for what they use, with no up-front payments or commitments.
Prior to Amazon Elastic MapReduce, running Hadoop or other MapReduce-based clusters required time-consuming setup, management, and cluster tuning. Amazon claims that the service, which builds on the on-demand, resizable compute capacity of Amazon EC2, makes running parallel compute jobs more affordable and less time-consuming.
Using this service, customers can spin up and tear down Hadoop clusters on Amazon EC2 at a moment's notice. To help customers run these highly distributed applications, AWS is providing sample applications and tutorials for getting started with Amazon Elastic MapReduce.
"Some researchers and developers already run Hadoop on Amazon EC2, and many of them have asked for even simpler tools for large-scale data analysis," said AWS's Adam Selipsky. "Amazon Elastic MapReduce makes crunching in the cloud much easier as it dramatically reduces the time, effort, complexity and cost of performing data-intensive tasks."
Amazon Elastic MapReduce creates data processing job flows that are executed by Hadoop software on the web-scale infrastructure of Amazon EC2. The service automatically launches and configures the number and type of Amazon EC2 instances specified by the customer. It then kicks off a Hadoop implementation of the MapReduce programming model, which loads large amounts of user input data from Amazon S3 and subdivides it for parallel processing on the Amazon EC2 instances. As processing completes, the data is recombined and reduced into a final result, and the results are deposited back into Amazon S3. Users can configure, manipulate, and monitor job flows through web service APIs or via the AWS Management Console.
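
For readers unfamiliar with the MapReduce programming model described above, the split, map, and reduce stages can be sketched in a few lines of Python. This is only a local, single-process illustration of the model using a word count; it does not call Amazon Elastic MapReduce, Hadoop, or Amazon S3, and the function names (split_input, map_words, shuffle, reduce_counts, run_job) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def split_input(lines, num_splits):
    """Subdivide the input into chunks, one per hypothetical worker."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key so each key is reduced exactly once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values for each key into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def run_job(lines, num_splits=2):
    """Run the whole flow locally; on a real cluster the map calls run in parallel."""
    chunks = split_input(lines, num_splits)
    mapped = [map_words(chunk) for chunk in chunks]
    grouped = shuffle(chain.from_iterable(mapped))
    return reduce_counts(grouped)


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(sample))
```

In a hosted setting, the chunks would be read from storage and each map call would run on a separate instance; the sketch keeps everything in one process so the data flow is easy to follow.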


