It is time to start the single-node Hadoop cluster. Run the following command to execute the script that starts NameNode, DataNode, JobTracker, and TaskTracker on the system:
./start-all.sh
It is easy to check whether the Hadoop processes are running by opening another terminal window and running jps. You should see the following names listed as a result of running jps (see Figure 8):
DataNode
SecondaryNameNode
TaskTracker
JobTracker
NameNode
Jps
Figure 8: Checking whether the Hadoop processes are running with jps.
It is also important to check whether the Hadoop processes are listening on the previously configured ports: 54310 and 54311. Open a terminal window and run the following command. If the results show one java process listening on 54310 and another one listening on 54311, the Hadoop processes are running according to the previously set configuration.
sudo netstat -plten | grep 543
You can use your favorite Web browser to visit the addresses for the different Hadoop Web interfaces and monitor the following daemons:
- http://localhost:50030/: Web interface for the Hadoop JobTracker daemon (Figure 9, for example).
- http://localhost:50060/: Web interface for the Hadoop TaskTracker daemon.
- http://localhost:50070/: Web interface for the Hadoop NameNode daemon.
Figure 9: Web interface for the Hadoop JobTracker daemon.
Installing Pydoop
Now that you have a single-node Hadoop 1.1.2 cluster up and running, you can install Pydoop. If you don't want to build Pydoop from source, you will have to install specific dependencies that make the installation a bit complicated and force you to install outdated packages. Building from source lets you avoid most of those dependencies, so skip the unnecessary ones if you decide to go that route.
Because Ubuntu includes libboost-python1.50.0 and libboost-python1.49.0, you need an earlier version to satisfy the Pydoop dependency on libboost-python1.46.1. If you decide to avoid building from source, download the package from http://packages.ubuntu.com/precise/libboost-python1.46.1 and click Install when Ubuntu Software Center opens it.
Another problematic dependency is hadoop-client because it requires CDH4 (Cloudera's Distribution Including Apache Hadoop). If you want to install CDH4, execute the following commands in a new terminal window; they add the CDH4 repositories that allow you to install hadoop-client.
sudo sh -c "echo 'deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib' > /etc/apt/sources.list.d/cloudera.list"
sudo sh -c "echo 'deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib' >> /etc/apt/sources.list.d/cloudera.list"
sudo apt-get install curl
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install hadoop-0.20-conf-pseudo hadoop-client
You will see a list of missing dependencies. Run the following command and answer Y to each confirmation question:
sudo apt-get install -f
If you don't want to build from source, the next step is to download the Pydoop Debian package and install it. You can download the latest version of Pydoop available as a Debian package here. Once you have downloaded python-pydoop_0.9.0-1_amd64.deb, install the package by running the following command (replace /home/gaston/Downloads/python-pydoop_0.9.0-1_amd64.deb with the full path to your downloaded file):
sudo dpkg -i /home/gaston/Downloads/python-pydoop_0.9.0-1_amd64.deb
If you want to build from source, the first step is to install the build prerequisites. Run the following command to install build-essential, python-all-dev, libboost-python-dev, and libssl-dev:
sudo apt-get install build-essential python-all-dev libboost-python-dev libssl-dev
Download pydoop-0.9.1.tar.gz. Then, go to the directory where you downloaded pydoop-0.9.1.tar.gz and run the following commands to decompress the TAR file and build Pydoop 0.9.1 from source:
tar xzf pydoop-0.9.1.tar.gz
cd pydoop-0.9.1
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
python setup.py build
If you want to perform a system-wide installation, just run the following command:
sudo JAVA_HOME=/usr/lib/jvm/java-7-oracle HADOOP_HOME=/home/hduser/hadoop python setup.py install --skip-build
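Whichever installation path you chose, you can run a quick sanity check before moving on. The following minimal sketch only verifies that the pydoop package imports correctly and prints the location it was installed to:
# Sanity check: confirm that the pydoop package can be imported
# and show where it was installed.
import pydoop
print(pydoop.__file__)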
Testing the Pydoop Installation
Python 2.7 includes both argparse and importlib. If you are working with Python 2.6, you need to install those modules by running the following commands:
sudo apt-get install python-pip
pip install argparse
pip install importlib
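If you are not sure which interpreter you are running, the following quick check (a minimal sketch) prints the Python version and confirms that both modules are importable:
# Print the Python version and verify that argparse and importlib are
# available; on Python 2.6 they must be installed with pip as shown above.
import sys
print(sys.version)
import argparse
import importlib
print("argparse and importlib are available")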
Go to the directory in which you decompressed pydoop-0.9.1.tar.gz and then go to the test subdirectory. For example, in my case, I executed cd /home/gaston/Downloads/pydoop-0.9.1/test. Run the following commands to execute the basic Pydoop tests that don't involve HDFS:
export LD_PRELOAD=/lib/x86_64-linux-gnu/libssl.so.1.0.0
python test_basics.py
When you installed Hadoop, you defined the value for the fs.default.name property in /home/hduser/hadoop/conf/core-site.xml. Use the port number configured in that property to set the value for HDFS_PORT. Because the value is hdfs://localhost:54310, I set HDFS_PORT to 54310. Run the following commands to execute the 134 Pydoop tests that include HDFS tests (see Figure 10):
export LD_PRELOAD=/lib/x86_64-linux-gnu/libssl.so.1.0.0
export HDFS_PORT=54310
python all_tests.py
Figure 10: Checking the results of running 134 Pydoop tests.
Running Pydoop MapReduce Scripts
Once you checked that all Pydoop tests executed without problems, you can run some of the sample Pydoop MapReduce scripts before creating your own scripts. This way, you will learn how you can interact with HDFS and check the execution status of Pydoop MapReduce scripts.
Go to the directory in which you decompressed pydoop-0.9.1.tar.gz and then go to the examples/pydoop_script subdirectory. In my case, I executed cd /home/gaston/Downloads/pydoop-0.9.1/examples/pydoop_script.
The basic structure for a Pydoop MapReduce script is the following:
def mapper(input_key, input_value, writer):
    # Code that performs the computation…
    writer.emit(intermediate_key, intermediate_value)

def reducer(intermediate_key, value_iterator, writer):
    # Code that performs the computation…
    writer.emit(output_key, output_value)
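For example, a minimal word-count sketch following this structure could look like the code below (a hypothetical illustration, not one of the bundled examples); the mapper emits each word with a count of "1", and the reducer adds up the counts received for each word:
def mapper(input_key, input_value, writer):
    # input_value is a line of text; emit ("word", "1") for every word
    for word in input_value.split():
        writer.emit(word, "1")

def reducer(word, counts_iterator, writer):
    # counts_iterator yields the string counts emitted for this word
    writer.emit(word, sum(map(int, counts_iterator)))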
The transpose.py script transposes a tab-separated text matrix. The code is very easy to understand and defines the mapper and reducer functions. The calls to writer.emit generate the results for both the mapper and the reducer.
import struct

def mapper(key, value, writer):
    # value is one row of the matrix; split it into its elements
    value = value.split()
    for i, a in enumerate(value):
        # emit the column index as a packed big-endian key and prepend the
        # input key (the line's offset) to the element so the reducer can
        # restore the original row order
        writer.emit(struct.pack(">q", i), "%s%s" % (key, a))

def reducer(key, ivalue, writer):
    # rebuild (row offset, element) pairs, sort them by row, and emit the
    # column index together with the tab-joined elements of that column
    vector = [(struct.unpack(">q", v[:8])[0], v[8:]) for v in ivalue]
    vector.sort()
    writer.emit(struct.unpack(">q", key)[0], "\t".join(v[1] for v in vector))
The pydoop_script folder includes a sample text file, matrix.txt, with valid input for the transpose.py Pydoop MapReduce script.
00 01 02
10 11 12
20 21 22
30 31 32
40 41 42
It is necessary to upload your input data to HDFS. Run the following commands to upload matrix.txt. The ls command displays the directory listing for /user/hduser in HDFS, where the new matrix.txt file should appear.
/home/hduser/hadoop/bin/hadoop fs -put matrix.txt matrix.txt
/home/hduser/hadoop/bin/hadoop fs -ls /user/hduser
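If you prefer to stay in Python, Pydoop can perform the same upload through its HDFS API. The following sketch assumes the put and ls helpers in the pydoop.hdfs module are available in your Pydoop version:
# Upload matrix.txt to the HDFS home directory and list its contents
# (assumes pydoop.hdfs provides put and ls helpers).
import pydoop.hdfs as hdfs

hdfs.put("matrix.txt", "matrix.txt")
for entry in hdfs.ls("/user/hduser"):
    print(entry)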
You can also go to localhost:50075/browseDirectory.jsp?dir=%2Fuser%2Fhduser&namenodeInfoPort=50070 by following these steps:
- Go to localhost:50070.
- Click Browse the filesystem.
- Click user under Name.
- Click hduser under Name.
The Web browser will display the files for /user/hduser in HDFS. In this case, you will see matrix.txt.
Now that you have the HDFS input (matrix.txt), you can run the Pydoop script to perform the MapReduce job. Run the following command, which specifies matrix.txt as the HDFS input and t_matrix as the HDFS output:
pydoop script transpose.py matrix.txt t_matrix
Go to localhost:50030 to use the Web interface for the Hadoop JobTracker daemon and check the status of the new job. You will see a new entry for transpose.py with details about the progress of both the Map and Reduce processes (see Figure 11).
Figure 11: Checking the details of the new transpose.py Pydoop MapReduce script job with the Web interface for the Hadoop JobTracker daemon.
Once transpose.py finishes the MapReduce job, you can retrieve the results from HDFS. Run the following commands to retrieve the output (t_matrix); the sort command merges the retrieved part-0000* files into a single t_matrix.txt file, ordered numerically by the index in the first column:
/home/hduser/hadoop/bin/hadoop fs -get t_matrix t_matrix
sort -mn -k1,1 -o t_matrix.txt t_matrix/part-0000*
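Alternatively, you can read the output directly from HDFS without copying it locally. The following sketch assumes the ls and load helpers in the pydoop.hdfs module:
# Read the reducer output files straight from HDFS and print their contents
# (assumes pydoop.hdfs provides ls and load helpers).
import pydoop.hdfs as hdfs

for path in sorted(hdfs.ls("t_matrix")):
    if "part-" in path:
        print(hdfs.load(path))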
Conclusion
By writing just a few lines of code, you can easily create simple Pydoop Script MapReduce programs and execute them in your single-node Hadoop cluster. When Pydoop scripts aren't enough, you can start working with the more complete object-oriented Pydoop API and take full advantage of its features.
In this article, I've focused on installing, running, and discovering Pydoop in a single-node Hadoop cluster with the latest available Ubuntu version. If you have basic Python skills, you will be able to take full advantage of Pydoop by diving deeper into its features and creating more complex MapReduce jobs.
Gaston Hillar is a frequent contributor to Dr. Dobb's.