Earlier this year, Dr. Dobb's presented a three-part tutorial on handling so-called "big data" using Hadoop. In this article, I explore Pydoop, which provides a simple Python API for Hadoop.
If you are new to Hadoop or need updates about its latest version, I suggest you read two excellent articles written by Tom White in the Dr. Dobb's tutorial series: "Hadoop: The Lay of the Land" and "Hadoop: Writing and Running Your First Project." I will assume that you understand the basics of Hadoop and MapReduce (those tutorials provide all the necessary information to continue reading this article).
Pydoop Script enables you to write simple MapReduce programs for Hadoop with mapper and reducer functions in just a few lines of code. When Pydoop Script isn't enough, you can switch to the more complete Pydoop API, which provides the ability to implement a Python Partitioner, RecordReader, and RecordWriter. Pydoop might not be the best API for all Hadoop use cases, but its unique features make it suitable for specific scenarios, and it is being actively improved. In this article, I'll explain how to install a single-node cluster with Pydoop and execute simple Pydoop Script MapReduce programs.
Working with a Python MapReduce and HDFS API
Researchers at the CRS4 research center recently released Pydoop 0.9.1. The product offers a programming model similar to Java's for MapReduce: you define the classes that the MapReduce framework instantiates and whose methods it calls. Because Pydoop is a CPython package, it gives you access to all standard Python libraries and third-party modules. Pydoop Script makes it even easier to write MapReduce modules: you simply code the necessary mapper and reducer functions.
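To make the model concrete, here is a minimal word-count script in the shape that Pydoop Script expects: a mapper function that emits each word with a count of 1, and a reducer function that sums those counts. The signatures shown follow the Pydoop 0.9 documentation; treat this as a sketch and check the exact names against the version you install (the wordcount.py file name is just an example).

# wordcount.py -- minimal Pydoop Script sketch (verify signatures against your Pydoop version)
def mapper(key, value, writer):
    # value holds a line of input text; emit (word, 1) for each word
    for word in value.split():
        writer.emit(word, 1)

def reducer(word, counts, writer):
    # counts is an iterable with the values emitted for this word (delivered as strings)
    writer.emit(word, sum(map(int, counts)))

Once the single-node cluster described below is up and running, a script like this is typically submitted with the pydoop script command-line tool, passing the module together with the HDFS input and output paths.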
Pydoop wraps Hadoop Pipes and allows you to access the most important MapReduce components, such as Partitioner, RecordReader, and RecordWriter. In addition, Pydoop makes it easy to interact with HDFS (Hadoop Distributed File System) through the Pydoop HDFS API (pydoop.hdfs), which allows you to retrieve information about directories, files, and several file system properties. The Pydoop HDFS API makes it possible to read and write files within HDFS from Python code. In addition, the lower-level API provides features similar to the Hadoop C HDFS API, so you can use it to build statistics of HDFS usage.
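As a quick illustration of the HDFS side, the following snippet writes a small text file into HDFS, reads it back, and lists a directory. The open and ls calls are the ones documented in pydoop.hdfs, but consider this a sketch: the paths are placeholders, and you should confirm the API details against your installed Pydoop version.

import pydoop.hdfs as hdfs

# Write a small text file into HDFS (the path is only an example)
out_file = hdfs.open("/user/hduser/hello.txt", "w")
out_file.write("hello from pydoop\n")
out_file.close()

# Read the file back
in_file = hdfs.open("/user/hduser/hello.txt")
print(in_file.read())
in_file.close()

# List the contents of the directory
for entry in hdfs.ls("/user/hduser"):
    print(entry)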
Creating a Single-Node Hadoop Cluster Ready for Pydoop
The installation instructions for Pydoop are a bit outdated, so I'll provide complete step-by-step instructions for building the latest Pydoop version from source on the most recent Ubuntu 12.10 (also known as Quantal Quetzal), 64-bit desktop version. This way, you will be able to run Pydoop MapReduce jobs on a single-node Hadoop configuration. I'll start with a basic Ubuntu 12.10 64-bit installation with just the default OpenJDK installed, and I'll provide details on all the prerequisites that Pydoop 0.9.1 requires to run on top of the most recent stable Hadoop version, 1.1.2. The installation consists of:
- installing Java 7
- creating a Hadoop user
- disabling IPv6
- installing Hadoop
- installing Pydoop dependencies and Pydoop itself.
I want to use Oracle Java 7 JDK instead of OpenJDK in Ubuntu. Oracle Java 7 JDK isn't included in the official Ubuntu repositories, and therefore, it is necessary to add a PPA with this JDK. Run the following command in a terminal window:
sudo add-apt-repository ppa:webupd8team/java
The terminal will display a confirmation. Press Enter to continue and the system will add the PPA (see Figure 1).
Figure 1: Oracle Java JDK PPA added to your system.
Now, run the following commands and press Y when the system asks you whether you want to continue:
sudo apt-get update
sudo apt-get install oracle-java7-installer
The Oracle Java 7 JDK package configuration will display a dialog box asking you to accept the license; you must agree if you want to use the JDK. Press Enter, select the Yes button in the new dialog box that appears, then press Enter again.
Once the installation has finished, you should see the following lines in the terminal:
Oracle JDK 7 installed
Oracle JRE 7 browser plugin installed
Check the Java version by running the following command:
java -version
You should see something similar to the following lines. The minor version and build numbers might be different because there are frequent updates to Oracle Java JDK and its related PPA:
java version "1.7.0_21"
Java (TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
If the output of this command doesn't show 1.7 as the first numbers of the version, you can run the following command to make Oracle Java 7 JDK the default Java:
sudo update-java-alternatives -s java-7-oracle
Run the following commands to create a Hadoop group named hadoop and a user named hduser:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
Enter and then repeat a password for the new hduser user, enter the full name (Hadoop User), and press Y when the system asks you to confirm whether the information is correct (see Figure 2).
Figure 2: The results of creating the new hduser user added to the hadoop group.
Now, it is necessary to configure SSH. Run the following commands to generate a public/private RSA key pair. You must enter the password defined for the hduser user.
su - hduser
ssh-keygen -t rsa -P ""
Run the following command to check the SSH connection to localhost.
ssh localhost
If you receive a Connection refused error (see Figure 3), run the following command to install openssh-server. Press Y when the system asks you whether you want to continue with the package installation.
sudo apt-get install openssh-server
Figure 3: The results of creating a public/private RSA key pair.
Open a new terminal window and run the following commands to add localhost to the list of known hosts. You must enter the password defined for the hduser user. Then, enter yes when the system asks you whether you want to continue connecting.
su - hduser
ssh localhost
Edit the sudoers file by using GNU nano. You must be extremely careful with the following steps because an error in the sudoers file will make it impossible for you to use sudo, and you will have to use recovery mode to edit the file again. Open a new terminal window and run the following command:
pkexec visudo
GNU nano will show the editor for the /etc/sudoers.tmp file. Add the following line at the end of the file to add hduser to sudoers (see Figure 4):
hduser ALL=(ALL) ALL
Figure 4: Editing /etc/sudoers.tmp with GNU nano.
Then, press Ctrl+X to exit. Press Y and then Enter because you want to overwrite /etc/sudoers.tmp with the changes. Then, you will see a What now? question in the terminal. Press Q because you want to quit and save changes to the /etc/sudoers.tmp file. You will notice the word DANGER in the Quit and Save Changes to sudoers file option. As I mentioned before, this can be a dangerous configuration change, and therefore, you must be careful while editing the file.
Next, disable IPv6 to avoid problems with Hadoop and Pydoop in your single-node configuration. Run the following command to launch GEdit and open the /etc/sysctl.conf file:
sudo gedit /etc/sysctl.conf
Add the following lines at the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Select File | Save and close the GEdit window.
Reboot the operating system. Once Ubuntu starts again, open a terminal window and check whether IPv6 is disabled. The following command should now display 1, indicating that IPv6 is disabled:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
Installing Hadoop
It is time to install Hadoop. Download the tarball from http://hadoop.apache.org/releases.html. I will use Release 1.1.2, made available on February 15, 2013. Download the file hadoop-1.1.2-bin.tar.gz from http://apache.dattatec.com/hadoop/common/hadoop-1.1.2/ and run the following commands. Replace /home/gaston/Downloads with the path to the downloaded hadoop-1.1.2-bin.tar.gz file and enter the hduser password when required:
su - hduser
cd /home/hduser
sudo cp /home/gaston/Downloads/hadoop-1.1.2-bin.tar.gz .
tar -xvzf hadoop-1.1.2-bin.tar.gz
There is going to be a new hadoop-1.1.2 folder within /home/hduser; therefore, the Hadoop home directory is /home/hduser/hadoop-1.1.2. Run the following commands to rename the /home/hduser/hadoop-1.1.2 directory to /home/hduser/hadoop, make it the new Hadoop home, and give hduser ownership of it:
sudo mv hadoop-1.1.2 hadoop
sudo chown -R hduser:hadoop hadoop
Ubuntu uses Bash as the default shell. It is necessary to update $HOME/.bashrc for hduser to set the Hadoop-related environment variables (Pydoop requires HADOOP_HOME). Run the following command to open /home/hduser/.bashrc with GEdit:
Add the following lines to the aforementioned file. Then, select File | Save and close the GEdit window. This way, you will set HADOOP_HOME and JAVA_HOME, and add the Hadoop bin directory to PATH:
export HADOOP_HOME=/home/hduser/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$HADOOP_HOME/bin
Now, it is necessary to configure a Hadoop-specific environment variable in hadoop-env.sh. Run the following command to open /home/hduser/hadoop/conf/hadoop-env.sh with GEdit:
sudo gedit /home/hduser/hadoop/conf/hadoop-env.sh
Search for the following commented line (see Figure 5):
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
Figure 5: The commented line that you must replace in hadoop-env.sh to set JAVA_HOME.
Replace the line with the following uncommented line:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Then, select File | Save and close the GEdit window. This way, you will set JAVA_HOME for Hadoop.
I will use the following ports for the different services:
- 54310: HDFS.
- 54311: MapReduce Job tracker.
Run the following commands to create the directory where Hadoop will store its data files (/home/hduser/hadoop/tmp):
sudo mkdir /home/hduser/hadoop/tmp
sudo chown hduser:hadoop /home/hduser/hadoop/tmp
sudo chmod 750 /home/hduser/hadoop/tmp
Run the following command to open the /home/hduser/hadoop/conf/core-site.xml configuration file with GEdit to set the value of hadoop.tmp.dir to the previously created temporary directory (/home/hduser/hadoop/tmp) and the value of fs.default.name.
sudo gedit /home/hduser/hadoop/conf/core-site.xml
Add the following lines between the <configuration> and </configuration> tags:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop/tmp</value>
<description>The base for other Hadoop temporary directories and files.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The URI of the default file system.
</description>
</property>
And save the file.
Then, run the following command to open the /home/hduser/hadoop/conf/mapred-site.xml configuration file with GEdit to set the value of mapred.job.tracker to the host and port where the MapReduce job tracker runs at (localhost:54311):
sudo gedit /home/hduser/hadoop/conf/mapred-site.xml
Add the following lines between the <configuration> and </configuration> tags (see Figure 6):
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port where the MapReduce job tracker runs at.</description>
</property>
And save the file.
Figure 6: Configuration of the mapred.job.tracker property in mapred-site.xml.
Run the following command to open the /home/hduser/hadoop/conf/hdfs-site.xml configuration file with GEdit to set the value of dfs.replication to 1:
sudo gedit /home/hduser/hadoop/conf/hdfs-site.xml
Add the following lines between the <configuration> and </configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>
And save the file. Now, run the following commands to format the HDFS filesystem with the NameNode. The NameNode will initialize the filesystem in /home/hduser/hadoop/tmp/dfs/name. Press Y when the NameNode asks you whether you want to reformat the filesystem.
cd /home/hduser/hadoop/bin
./hadoop namenode -format
Figure 7: Reformatting the filesystem in /home/hduser/hadoop/tmp/dfs/name.