Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


Pydoop: Writing Hadoop Programs in Python

Earlier this year, Dr. Dobb's presented a three-part tutorial on handling so-called "big data" using Hadoop. In this article, I explore Pydoop, which provides a simple Python API for Hadoop.

If you are new to Hadoop or need updates about its latest version, I suggest you read two excellent articles written by Tom White in the Dr. Dobb's tutorial series: "Hadoop: The Lay of the Land" and "Hadoop: Writing and Running Your First Project." I will assume that you understand the basics of Hadoop and MapReduce (those tutorials provide all the necessary information to continue reading this article).

Pydoop Script enables you to write simple MapReduce programs for Hadoop with mapper and reducer functions in just a few lines of code. When Pydoop Script isn't enough, you can switch to the more complete Pydoop API, which provides the ability to implement a Python Partitioner, RecordReader, and RecordWriter. Pydoop might not be the best API for all Hadoop use cases, but its unique features make it suitable for specific scenarios and it is being actively improved. In this article, I'll explain how to install a single-node cluster with Pydoop and execute simple Pydoop Script MapReduce programs.

Working with a Python MapReduce and HDFS API

The researchers at the CRS4 research center recently released Pydoop 0.9.1. The product offers a programming model similar to Java's for MapReduce. You simply need to define classes that the MapReduce framework requires to instantiate, and then you call the methods. Because Pydoop is a CPython package, it makes it possible to access all standard Python libraries and third-party modules. Pydoop Script makes it even easier to write MapReduce modules by simply coding the necessary mapper and reducer functions.

Pydoop wraps Hadoop pipes and allows you to access the most important MapReduce components, such as Partitioner, RecordReader, and RecordWriter. In addition, Pydoop makes it easy to interact with HDFS (Hadoop Distributed File System) through a Pydoop HDFS API (pydoop.hdfs), which allows you to retrieve information about directories, files, and several file system properties. The Pydoop HDFS API makes it possible to easily read and write files within HDFS by writing Python code. In addition, the lower-level API provides features similar to the Hadoop C HDFS API, and so you can use it to build statistics of HDFS usage.

Creating a Single-Node Hadoop Cluster Ready for Pydoop

The installation instructions for Pydoop are a bit outdated, so I'll provide complete step-by-step instructions for building the latest Pydoop version from source in the most recent Ubuntu 12.10 (also known as Quantal Quetzal), 64-bit desktop version. This way, you will be able to work with Pydoop MapReduce jobs in a single-cluster Hadoop configuration. I'll start with a basic Ubuntu 12.10 64-bit installation with just the default OpenJDK installed, and I'll provide details on all the prerequisites that Pydoop 0.9.1 requires to run it on top of the most recent and stable Hadoop version 1.1.2. The installation consists of:

  • installing Java 7
  • creating a Hadoop user
  • disabling IPv6
  • installing Hadoop
  • installing Pydoop dependencies and installing Pydoop.

I want to use Oracle Java 7 JDK instead of OpenJDK in Ubuntu. Oracle Java 7 JDK isn't included in the official Ubuntu repositories, and therefore, it is necessary to add a PPA with this JDK. Run the following command in a terminal window:

sudo add-apt-repository ppa:webupd8team/java

The terminal will display a confirmation. Press Enter to continue and the system will add the PPA (see Figure 1).

Figure 1: Oracle Java JDK PPA added to your system.

Now, run the following commands and press Y when the system asks you whether you want to continue:

sudo apt-get update
sudo apt-get install oracle-java7-installer 

The Oracle Java 7 JDK package configuration will display a dialog box asking you to accept the license, you must agree if you want to use the JDK. Press Enter, select the Yes button in the new dialog box that appears, then press Enter again.

Once the installation has finished you should see the following lines appearing in the terminal:

Oracle JDK 7 installed
Oracle JRE 7 browser plugin installed

Check the Java version by running the following command:

java -version

You should see something similar to the following lines. The minor version and build numbers might be different because there are frequent updates to Oracle Java JDK and its related PPA:

java version "1.7.0_21"
Java (TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

If the result of these commands don't display 1.7 as the first numbers of the build, you can run the following command to make Oracle Java 7 JDK the default Java:

sudo update-java-alternatives -s java-7-oracle

Run the following commands to create a Hadoop group named hadoop and a user named hduser.

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

Enter and then repeat a password for the new hduser user, the full name (Hadoop User) and press Y when the system requires you to confirm whether the information is correct (see Figure 2).

results of hadoop group
Figure 2: The results of the creation of a new hduser added to the hadoop group.

Now, it is necessary to configure SSH. Run the following commands to generate a public/private RSA key pair . You must enter the password defined for the hduser user.

su - hduser
ssh-keygen -t rsa -P ""

Run the following command to check the SSH connection to localhost.

ssh localhost

If you receive a Connection refused error (see Figure 3), run the following command to install openssh-server. Press Y when the system asks you whether you want to continue with the package installation.

ssh localhost

creating a public/private RSA key pair
Figure 3: The results of creating a public/private RSA key pair.

Open a new terminal window and run the following commands to add localhost to the list of known hosts. You must enter the password defined for the hduser user. Then, enter yes when the system asks you whether you want to continue connecting.

su - hduser
ssh localhost

Edit the sudoers file by using GNU nano. You must be extremely careful with the following steps because an error in the sudoers file will make it impossible for you to use sudo and you will have to use recovery mode to edit the file again. Open a new terminal window and run the following command:

pkexec visudo

GNU nano will show the editor for the /etc/sudoers.tmp file. Add the following line at the end of the file to add hduser into sudoers (see Figure 4):

hduser ALL=(ALL) ALL

Editing /etc/sudoers.tmp with GNU nano
Figure 4: Editing /etc/sudoers.tmp with GNU nano.

Then, press Ctrl + X to exit. Press Y and then Enter because you want to overwrite /etc/sudoers.tmp with the changes. Then, you will see a What now? question in the terminal. Press Q because you want to quit and save changes to the /etc/sudoers.tmp file. You will notice the word DANGER in the Quit and Save Changes to sudoers file option. As I mentioned before, this can be a dangerous configuration change, and therefore, you must be careful while editing the file.

Next, disable IPv6 to avoid problems with Hadoop and Pydoop in your single-cluster configuration. Run the following command to launch GEdit and open the /etc/sysctl.conf file:

sudo gedit /etc/sysctl.conf

Copy the following lines at the end of the text document:

#disable ipv6 
net.ipv6.conf.all.disable_ipv6 = 1 
net.ipv6.conf.default.disable_ipv6 = 1  
net.ipv6.conf.lo.disable_ipv6 = 1 

Select File | Save and close the GEdit window.

Reboot the operating system. Once Ubuntu starts again, open a terminal window and check whether IPv6 is disabled. The following command should now display "1" indicating IPv6 is disabled:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6


Installing Hadoop

It is time to install Hadoop. Download the tarball from http://hadoop.apache.org/releases.html. I will use Release 1.1.2, made available in February 15, 2013. Download the file hadoop-1.1.2-bin.tar.gz from http://apache.dattatec.com/hadoop/common/hadoop-1.1.2/ and run the following commands. Replace /home/gaston/Downloads with the path to the downloaded hadoop-1.1.2-bin.tar.gz file and enter the hduser password when required:

su - hduser
cd /home/hduser
sudo cp /home/gaston/Downloads/hadoop-1.1.2-bin.tar.gz .
tar -xvzf hadoop-1.1.2-bin.tar.gz

There is going to be a new hadoop-1.1.2 folder within /home/hduser; therefore, the Hadoop home directory is /home/hduser/hadoop-1.1.2. Run the following command to move the /home/hduser/Hadoop-1.1.2 directory to /home/hduser/hadoop and make it the new Hadoop home.

sudo mv hadoop-1.1.2 hadoop
sudo chown -R hduser:hadoop hadoop

By default, Ubuntu uses Bash as the default shell. It is necessary to update $HOME/.bashrc for hduser to set Hadoop-related environment variables (Pydoop requires HADOOP_HOME). Run the following command to open /home/hduser/.bashrc with GEdit.

sudo gedit /home/hduser/.bashrc

Add the following lines to the aforementioned file. Then, select File | Save and close the GEdit window. This way, you will set HADOOP_HOME, JAVA_HOME, and add the Hadoop bin directory to PATH:

export HADOOP_HOME=/home/hduser/hadoop 
export JAVA_HOME=/usr/lib/jvm/java-7-oracle 

Now, it is necessary to configure a Hadoop-specific environment variable in hadoop-env.sh. Run the following command to open /home/hduser/hadoop/conf/hadoop-env.sh with GEdit.

sudo gedit /home/hduser/hadoop/conf/hadoop-env.sh

Search the following commented line (see Figure 5):

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

replacing line
Figure 5: The commented line that you must replace in hadoop-env.sh to set JAVA_HOME.

Replace the line with the following uncommented line:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Then, select File | Save and close the GEdit window. This way, you will set JAVA_HOME for Hadoop.

I will use the following ports for the different services:

  • 54310: HDFS.
  • 54311: MapReduce Job tracker.

Run the following commands to create the directory where Hadoop will store the data files (home/hduser/tmp):

sudo mkdir /home/hduser/tmp
sudo chown hduser:hadoop /home/hduser/tmp
sudo chmod 750 /home/hduser/tmp

Run the following command to open the /home/hduser/hadoop/conf/core-site.xml configuration file with GEdit to set the value of hadoop.tmp.dir to the previously created temporary directory (home/hduser/tmp) and the value of fs.default.name.

sudo gedit /home/hduser/hadoop/conf/core-site.xml

Add the following lines between the <configuration> and </configuration> tags:

  <description>The base for other Hadoop temporary directories and files.</description>
  <description>The URI of the default file system.

And save the file.

Then, run the following command to open the /home/hduser/hadoop/conf/mapred-site.xml configuration file with GEdit to set the value of mapred.job.tracker to the host and port where the MapReduce job tracker runs at (localhost:54311):

sudo gedit /home/hduser/hadoop/conf/mapred-site.xml

Add the following lines between the <configuration> and </configuration> tags (see Figure 6):

  <description>The host and port where the MapReduce job tracker runs at.</description>

And save the file.

configuring tracker property
Figure 6: Configuration of the mapred.job.tracker property in mapred-site.xml.

Run the following command to open the /home/hduser/hadoop/conf/hdfs-site.xml configuration file with GEdit to set the value of dfs.replication to 1:

sudo gedit /home/hduser/hadoop/conf/hdfs-site.xml

Add the following lines between the <configuration> and </configuration> tags:

  <description>Default block replication.</description>

And save the file. Now, run the following commands to format the HDFS filesystem with NameNode. NameNode will initialize the filesystem in /home/hduser/hadoop/tmp/dfs/name. Press Y when NameNode asks you whether you want to reformat the filesystem.

cd /home/hduser/hadoop/bin
./hadoop namenode -format

Reformatting the filesystem
Figure 7: Reformatting the filesystem in /home/hduser/hadoop/tmp/dfs/name.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.