Pydoop: Writing Hadoop Programs in Python


It is time to start the single-node Hadoop cluster. Run the following command from the Hadoop bin directory (/home/hduser/hadoop/bin) to execute the script that starts the NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons:

./start-all.sh

You can check whether the Hadoop processes are running by opening another terminal window and running jps. You should see the following process names listed (see Figure 8):

  • DataNode
  • SecondaryNameNode
  • TaskTracker
  • JobTracker
  • NameNode
  • Jps

Figure 8: Checking whether the Hadoop processes are running with jps.

It is also important to check that the Hadoop processes are listening on the previously configured ports: 54310 and 54311. Open a terminal window and run the following command. If the output shows one java process listening on 54310 and another listening on 54311, the Hadoop processes are running with the previously set configuration.

sudo netstat -plten | grep 543

You can use your favorite Web browser to visit the following addresses and monitor the Hadoop daemons through their Web interfaces:

  • http://localhost:50030/ — Web interface for the Hadoop JobTracker daemon (Figure 9, for example).
  • http://localhost:50060/ — Web interface for the Hadoop TaskTracker daemon.
  • http://localhost:50070/ — Web interface for the Hadoop NameNode daemon.

Figure 9: Web interface for the Hadoop JobTracker daemon.

Installing Pydoop

Now that you have a single-node Hadoop 1.1.2 cluster up and running, you can install Pydoop. There are two options: install the prebuilt Debian package, which requires a few specific dependencies (including some outdated versions) that make the installation a bit complicated, or build Pydoop from source, which avoids most of those dependencies. If you decide to build from source, skip the dependency steps described next.

Because Ubuntu currently ships libboost-python1.50.0 and libboost-python1.49.0, you need an older version to satisfy the prebuilt Pydoop package's dependency on libboost-python1.46.1. If you aren't building from source, download the package from http://packages.ubuntu.com/precise/libboost-python1.46.1, open it with Ubuntu Software Center, and click Install.

Another problematic dependency is hadoop-client, because it requires CDH4 (Cloudera's Distribution Including Apache Hadoop). If you want to install CDH4, execute the following commands in a new terminal window. They add the CDH4 repositories, import Cloudera's signing key, and install hadoop-client:

sudo sh -c "echo 'deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib' > /etc/apt/sources.list.d/cloudera.list"
sudo sh -c "echo 'deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib' >> /etc/apt/sources.list.d/cloudera.list"
sudo apt-get install curl
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get install hadoop-0.20-conf-pseudo hadoop-client

You will see a list of missing dependencies, so run the following command and answer Y to each confirmation question:

sudo apt-get install -f

If you don't want to build from source, the next step is to download the Pydoop Debian package and install it. Download the latest version of Pydoop available as a Debian package from the Pydoop project's downloads page. Once you have downloaded python-pydoop_0.9.0-1_amd64.deb, install the package by running the following command (replace /home/gaston/Downloads/python-pydoop_0.9.0-1_amd64.deb with the full path to your downloaded file):

sudo dpkg -i /home/gaston/Downloads/python-pydoop_0.9.0-1_amd64.deb

If you want to build from source, the first step is to install the prerequisites to perform the build. Run the following command to install build-essential, python-all-dev, libboost-python-dev, and libssl-dev:

sudo apt-get install build-essential python-all-dev libboost-python-dev libssl-dev

Download pydoop-0.9.1.tar.gz. Then, go to the directory where you downloaded pydoop-0.9.1.tar.gz and run the following commands to decompress the TAR file and build Pydoop 0.9.1 from source:

tar xzf pydoop-0.9.1.tar.gz
cd pydoop-0.9.1
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
python setup.py build

If you want to perform a system-wide installation, just run the following command:

sudo JAVA_HOME=/usr/lib/jvm/java-7-oracle HADOOP_HOME=/home/hduser/hadoop python setup.py install --skip-build

Testing the Pydoop Installation

Python 2.7 includes both argparse and importlib. If you are working with Python 2.6, install those modules by running the following commands:

sudo apt-get install python-pip
pip install argparse
pip install importlib

Go to the directory in which you decompressed pydoop-0.9.1.tar.gz and then enter the test subdirectory. For example, in my case, I executed cd /home/gaston/Downloads/pydoop-0.9.1/test. Run the following commands to execute the basic Pydoop tests that don't require HDFS:

export LD_PRELOAD=/lib/x86_64-linux-gnu/libssl.so.1.0.0
python test_basics.py

When you installed Hadoop, you defined the value for the fs.default.name property in /home/hduser/hadoop/conf/core-site.xml. Use the port number configured in that property as the value for HDFS_PORT. Because the value is hdfs://localhost:54310, I set HDFS_PORT to 54310. Run the following commands to execute the full set of 134 Pydoop tests, which includes the HDFS tests (see Figure 10):

export LD_PRELOAD=/lib/x86_64-linux-gnu/libssl.so.1.0.0
export HDFS_PORT=54310
python all_tests.py

Figure 10: Checking the results of running 134 Pydoop tests.
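
If all 134 tests pass, you can also perform a quick interactive check that Pydoop reaches HDFS on the configured port. The following lines are a minimal sketch; run them in a Python interpreter started from the same shell (with LD_PRELOAD still set):

import pydoop.hdfs as hdfs

# If this prints a list of paths without raising an error, Pydoop is
# talking to the NameNode configured at hdfs://localhost:54310.
print(hdfs.ls("/"))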

Running Pydoop MapReduce Scripts

Once you checked that all Pydoop tests executed without problems, you can run some of the sample Pydoop MapReduce scripts before creating your own scripts. This way, you will learn how you can interact with HDFS and check the execution status of Pydoop MapReduce scripts.

Go to the directory in which you decompressed pydoop-0.9.1.tar.gz and then go to the examples/pydoop_script subdirectory. In my case, I executed cd /home/gaston/Downloads/pydoop-0.9.1/examples/pydoop_script.

The basic structure for a Pydoop MapReduce script is the following:

def mapper(input_key, input_value, writer):
    # Code that performs the computation…
    writer.emit(intermediate_key, intermediate_value)

def reducer(intermediate_key, value_iterator, writer):
    # Code that performs the computation…
    writer.emit(output_key, output_value)
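
For example, a minimal word-count script built on this structure could look like the following sketch. Note that the values reach the reducer as strings, so they must be converted back to integers before summing:

def mapper(key, value, writer):
    # key is the position of the line in the input; value is the line itself
    for word in value.split():
        writer.emit(word, "1")

def reducer(word, counts, writer):
    # counts iterates over the string values emitted by the mappers
    writer.emit(word, sum(int(c) for c in counts))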

The transpose.py sample script transposes a text matrix, producing tab-separated output. The code is easy to follow: it defines the mapper and reducer functions, and the calls to writer.emit produce the intermediate results of the mapper and the final output of the reducer.

import struct

def mapper(key, value, writer):
    value = value.split()
    for i, a in enumerate(value):
        writer.emit(struct.pack(">q", i), "%s%s" % (key, a))

def reducer(key, ivalue, writer):
    vector = [(struct.unpack(">q", v[:8])[0], v[8:]) for v in ivalue]
    vector.sort()
    writer.emit(struct.unpack(">q", key)[0], "\t".join(v[1] for v in vector))
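
If the struct calls look opaque, this quick interpreter sketch shows what they do: each column index is packed as a fixed-width, big-endian 8-byte integer, so the byte-wise ordering of the intermediate keys matches their numeric ordering, and unpack reverses the operation when the reducer needs the plain integer back.

import struct

packed = struct.pack(">q", 1)
print(repr(packed))                     # '\x00\x00\x00\x00\x00\x00\x00\x01'
print(struct.unpack(">q", packed)[0])   # 1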

The pydoop_script folder includes a sample text file, matrix.txt, with a valid input for the transpose.py Pydoop MapReduce script.

00 01 02
10 11 12
20 21 22
30 31 32
40 41 42

It is necessary to upload your input data to HDFS. Run the following commands to upload matrix.txt. The ls command displays the directory listing for /user/hduser in HDFS, where the new matrix.txt file should appear.

/home/hduser/hadoop/bin/hadoop fs -put matrix.txt matrix.txt
/home/hduser/hadoop/bin/hadoop fs -ls /user/hduser
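
If you prefer to stay in Python, Pydoop's hdfs module can perform the same upload and listing. The following is a minimal sketch; it assumes you run it as hduser from the directory that contains matrix.txt:

import pydoop.hdfs as hdfs

# Copy the local matrix.txt into the HDFS home directory.
hdfs.put("matrix.txt", "/user/hduser/matrix.txt")

# List the HDFS home directory; matrix.txt should appear in the output.
print(hdfs.ls("/user/hduser"))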

You can also go to localhost:50075/browseDirectory.jsp?dir=%2Fuser%2Fhduser&namenodeInfoPort=50070 by following these steps:

  • Go to localhost:50070.
  • Click on Browse the filesystem.
  • Click user under Name.
  • Click hduser under Name. The Web browser will display the files for /user/hduser in HDFS. In this case, you will see matrix.txt.

Now that you have the HDFS input (matrix.txt), you can run the Pydoop script to perform the MapReduce job. Run the following command, which specifies matrix.txt as the HDFS input and t_matrix as the HDFS output:

pydoop script transpose.py matrix.txt t_matrix

Go to localhost:50030 to use the Web interface for the Hadoop JobTracker daemon and check the status for the new job. You will see a new entry for transpose.py with details about the progress of both the Map and Reduce processes (see Figure 11).

Figure 11: Checking the details about the new transpose.py Pydoop MapReduce script job with the Web interface for the Hadoop JobTracker daemon.

Once transpose.py finishes the MapReduce job, you can retrieve the results from HDFS. Run the following commands to retrieve the output (t_matrix):

/home/hduser/hadoop/bin/hadoop fs -get t_matrix t_matrix
sort -mn -k1,1 -o t_matrix.txt t_matrix/part-0000*
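
Alternatively, you can read the reducer output directly from HDFS with Pydoop's hdfs module. This is a sketch that assumes the job wrote its part files under /user/hduser/t_matrix:

import pydoop.hdfs as hdfs

# Print every part file that the reducers wrote under t_matrix.
for path in sorted(hdfs.ls("/user/hduser/t_matrix")):
    if "part-" in path:
        f = hdfs.open(path)
        print(f.read())
        f.close()

With the sample matrix.txt shown earlier, each output line should start with a column index, followed by the values of that column joined by tabs.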

Conclusion

By writing just a few lines of code, you can easily create simple Pydoop Script MapReduce programs and execute them in your single-node Hadoop cluster. When Pydoop scripts aren't enough, you can start working with the more complete object-oriented Pydoop API and take full advantage of its features.

In this article, I've focused on installing, running, and exploring Pydoop in a single-node Hadoop cluster with the latest available Ubuntu version. If you have basic Python skills, you will be able to take full advantage of Pydoop by diving deeper into its features and creating more complex MapReduce jobs.


Gaston Hillar is a frequent contributor to Dr. Dobb's.

