Using the Hadoop MapReduce API

Hadoop’s MapReduce subsystem runs jobs, where a job is a MapReduce run of a single map phase followed by a single reduce phase; an application may chain together one or more jobs, and each job is in turn split into map and reduce tasks that run across the cluster. A large amount of the complexity in Hadoop is the (very poorly documented) set-up and configuration of jobs and tasks.

ToolRunner and Tool

The ToolRunner is a helper class which contains code to parse the command line of a MapReduce job started via the hadoop jar command. It parses “standard” Hadoop command line options, and can use them to modify the configuration of the job (see “Configuration”, below).

ToolRunner is used to run classes implementing the Tool interface, which consists of a single run() method that you must supply. It is in the run() method that you should place the code to set up your MapReduce jobs.

Your class should inherit from Configured, which simply gives a location to store a Configuration object (see below).

You also need to write a stub main() method which uses ToolRunner to parse the common options and set up the configuration:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyClass(), args);
        System.exit(res);
    }
    public int run(String[] args) throws Exception {
        // Set up Job(s) and run them, then return an exit code (0 for success)
        return 0;
    }
}

Configuration

A Configuration is a key/value store which is used to hold configuration details. This includes all of the Hadoop configuration, read in when the tool starts up. You can also add your own configuration keys to it; these will be available to mappers and reducers, and this appears to be the best way of passing parameters to them. Within your tool, you can get hold of the Configuration object through the Configured superclass’s getConf() method:

        Configuration conf = getConf();
        conf.set("speedtrap.junctions.file", args[0]);

Job

The Job is the fundamental unit of work in Hadoop. It represents a MapReduce run with a single map phase followed by a single reduce phase. You can safely ignore most if not all of the online documentation about the Job class, as it is either outdated (new Job() is deprecated) or simply wrong (Hadoop Javadoc, I'm looking at you).

Create a new Job with the Job.getInstance() method:

        Job getTrips = Job.getInstance(conf, "Speed Trap");

Note that we pass the Configuration object to the job -- so the objects and methods (mappers, reducers) in the job also get access to any configuration you may want to pass around. The job also has a name set on it, which is used to identify the job type in things like the logs.

The Job class defines a large number of methods for setting job parameters. These include things like the data types used for output from the mapper and reducer, the actual mapper and reducer (and combiner) classes to use, sort keys for sorting and grouping, and input and output file formats. It also defines methods for running the job and monitoring it as it executes.

Jar file

A Job needs to know (so that it can distribute the code to the compute nodes) which jar file the code for the job lives in. You can either specify this explicitly with the setJar() method, or, much easier, have it determined automatically by Java's classloader, with the setJarByClass() method:

        getTrips.setJarByClass(SpeedTrap.class);

Job functions: map, combine, reduce

The next most important piece of configuration for a Job is to set the functions which will be used for the map, combine and reduce phases. These are actually classes, encapsulating set-up, tear-down and operation functions in a single class (see below for how to write one of these classes). The Job takes one each of a mapper, combiner and reducer class. If the mapper or reducer is not set, a simple identity "pass-through" is used by default, emitting the same records it receives; if no combiner is set, the combine step is simply skipped.

        getTrips.setMapperClass(GTMapper.class);
        getTrips.setCombinerClass(GTCombiner.class);
        getTrips.setReducerClass(GTReducer.class);

Note that in the configuration above, we are passing the class (with .class) to the configuration functions. Sometimes, particularly with the Hadoop library classes, you need to specify a class with a particular set of generic parameters. In this case, you generally need to create a throwaway instance of the class first:

        LongSumReducer<JctPair> lsr = new LongSumReducer<JctPair>();
        getTrips.setReducerClass(lsr.getClass());

Output data types

By default, Hadoop assumes that the input and the output of every phase have a LongWritable key and a Text value. If this is not the case, then you need to specify the output key and value classes for the relevant phase. The output from the mapper is specified with the setMapOutputKeyClass() and setMapOutputValueClass() methods:

        getTrips.setMapOutputKeyClass(IdDate.class); // Defaults to LongWritable.class
        getTrips.setMapOutputValueClass(Trap.class); // Defaults to Text.class

Both the input and output of any combiner should match these key/value types as well.

The input to the reducer should also match these mapper output types. The output from the reducer can, again, be any key/value types, and is specified by setOutputKeyClass() and setOutputValueClass():

        getTrips.setOutputKeyClass(Text.class);   // Defaults to LongWritable.class
        getTrips.setOutputValueClass(Trip.class); // Defaults to Text.class

If you do not set these, and your output types are not LongWritable and Text, then you will get runtime errors -- typically an exception complaining of a type mismatch between the declared key or value class and the class actually received.

Sorting and grouping

If you want to sort or group your data differently (e.g. using only part of a key to decide which reducer a value is sent to, or using a secondary sort technique to ensure that the reducer receives its data in a given order), then you can control this by specifying the comparator to use for grouping and sorting. A comparator is a class implementing the RawComparator interface (usually by extending WritableComparator), and defines a compare() method which returns a value less than, equal to, or greater than zero, depending on whether its first parameter is less than, equal to, or greater than its second. Such a comparator takes precedence over the default compareTo() method of the key class.

For grouping, all records with keys that are equal according to the comparator will be sent to the same reducer process. For sorting, the comparator is used to impose the sort order on the records.

        getTrips.setGroupingComparatorClass(IdDate.CompareId.class);
        getTrips.setSortComparatorClass(IdDate.CompareIdDate.class);

If a sort or grouping comparator is not defined, then the default compareTo() method of the key is used (see WritableComparable, below). To implement a secondary sort, the grouping comparator should operate on a subset of the fields that the sort comparator uses.

Note that when using a secondary sort, in order for the secondary sort criterion to be visible to the reducer, it must be copied into both the key and the value. For example, to group temperatures by year, and sort by year and month, you could use the following configuration:

key: <year, month>
value: <month, temperature>

sort comparator: compare on year first, and then on month if the year is equal
grouping comparator: compare on year

The reducer will receive a single <year, month> key, and a list of <month, temperature> values. The month from the key should be ignored, as it will only show the first month in the list; if you need to use the month value for each record during processing, you should use the month from each value in the list as you read it.
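
As a sketch of what such comparators might look like for the year/month example, assume a hypothetical YearMonth key class with getYear() and getMonth() accessors (the key class itself is not shown; see "Custom data structures" below for how to write one). Both comparators extend WritableComparator:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class YearMonthComparators {
    // Sort comparator: order records by year, then by month within each year
    public static class SortYearMonth extends WritableComparator {
        public SortYearMonth() {
            super(YearMonth.class, true);   // true: create key instances to compare
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            YearMonth left = (YearMonth) a;
            YearMonth right = (YearMonth) b;
            int cmp = Integer.compare(left.getYear(), right.getYear());
            return (cmp != 0) ? cmp : Integer.compare(left.getMonth(), right.getMonth());
        }
    }

    // Grouping comparator: keys with the same year all go to one reduce() call
    public static class GroupYear extends WritableComparator {
        public GroupYear() {
            super(YearMonth.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return Integer.compare(((YearMonth) a).getYear(), ((YearMonth) b).getYear());
        }
    }
}

These would then be registered with setSortComparatorClass(YearMonthComparators.SortYearMonth.class) and setGroupingComparatorClass(YearMonthComparators.GroupYear.class), in the same way as the IdDate comparators above.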

Partitioning

File locations and formats

You will need to specify where your job reads its data from, and where to write its output to. The configuration for this has a slightly peculiar API, in that the format to use is passed the Job it should be used for, rather than telling the job what format it should be using:

        FileInputFormat.addInputPath(getTrips, new Path(args[0]));
        FileOutputFormat.setOutputPath(getTrips, new Path(args[1]));

Note the difference between input and output: addInputPath() adds an input path to the list (allowing a simple concatenation of inputs), while setOutputPath() sets a single output path (a directory into which the job writes its output files). Also see that we're configuring the job (getTrips) by calling static methods on the input/output format classes.

Hadoop can read and write data in many different formats. By default, input is read as plain text files (one record per line), and output is written as tab-separated text. However, this causes overhead in pipelines of jobs, in that each job has to parse its input data from a string, and then serialise it back to a string on output. It is possible to use binary formats, or alternative text formats, instead.

Most of the pre-built input formats are in the org.apache.hadoop.mapreduce.lib.input package, and most of the output formats are in org.apache.hadoop.mapreduce.lib.output. There are other formats elsewhere in the org.apache.hadoop.mapreduce.lib... packages too.

By default, the I/O classes are TextInputFormat and TextOutputFormat (both built on the FileInputFormat and FileOutputFormat base classes). On input, each record is one line of Text, with a LongWritable key holding the byte offset of the line within the file; on output, the key and value are written as plain tab-separated Text -- this means that you have to parse and serialise any structured data multiple times within a processing pipeline. If you have several jobs in an internal pipeline, you could end up spending much of your processing time converting to and from strings (and much of your developer time writing the code to do so).

Instead, you could use sequence files, which are flat files of binary key/value pairs written with Hadoop's (fast) native Writable serialisation. You need to do two things to use a sequence file: tell the job which I/O class it should be using, with setInputFormatClass() and setOutputFormatClass(), and tell the relevant class where it should be reading from or writing to:

        getTrips.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(getTrips, new Path(args[0]));

There is also a SequenceFileAsBinaryInputFormat (and a matching output format). The data in the file is binary in both cases; the difference is that the "as binary" variants present the keys and values to your code as raw bytes (BytesWritable), without deserialising them into their original Writable types.

The text formats can also be set explicitly; since TextInputFormat and TextOutputFormat are the defaults, this simply makes the choice visible in the job set-up:

        getTrips.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(getTrips, new Path(args[0]));

Compression can also be applied to the output formats, by enabling output compression, choosing a compression codec, and (for sequence files) choosing whether compression is applied per record or per block.
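
A minimal sketch, assuming a block-compressed, gzipped sequence file output (GzipCodec lives in org.apache.hadoop.io.compress and SequenceFile.CompressionType in org.apache.hadoop.io; other codecs such as Snappy may be available depending on the cluster):

        getTrips.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Turn compression on and choose the codec (inherited from FileOutputFormat)
        SequenceFileOutputFormat.setCompressOutput(getTrips, true);
        SequenceFileOutputFormat.setOutputCompressorClass(getTrips, GzipCodec.class);
        // For sequence files, compression can be applied per record or per block
        SequenceFileOutputFormat.setOutputCompressionType(getTrips,
            SequenceFile.CompressionType.BLOCK);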

Other formats can be used to read from or write to SQL databases, or multiple files with different semantics (e.g. one stream of data to one location; a different stream to somewhere else). See the "Advanced Usage" section below for more on these.

Data types

Hadoop uses its own internal data types, all implementing the Writable interface. These have the property of fast value-setting for reuse of objects (Java's object creation is relatively expensive, and should be avoided in performance-sensitive inner loops), and of fast serialisation via the interface's own write() and readFields() methods. Most of the built-in Hadoop data types are defined in the org.apache.hadoop.io package.

Custom data structures

Often, you will need to use a composite data type as a key or a value. Writing these composite types is highly formulaic, and can largely be automated (see the mkmrstruct tool). The main constraint is that the data structure should implement the Writable interface, or the WritableComparable interface if it is going to be a key. When writing a composite data type, you should provide at minimum a default constructor, accessor methods (get*() and put*() for each element), the readFields() and write() methods for the Writable interface, and a toString() method so that it will render properly when serialised to a TextOutputFormat. If you are only going to be serialising to SequenceFileOutputFormat, you can omit the toString() method.

Or... just use the mkmrstruct tool.
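
For illustration, a minimal hand-written composite key might look like the following sketch. It assumes a key made of two int fields; the IdDate used in the examples above is only a stand-in here, and its real definition may differ:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class IdDate implements WritableComparable<IdDate> {
    private int id;
    private int date;                       // e.g. days since the epoch

    public IdDate() { }                     // a default constructor is required

    public int getId()   { return id; }
    public int getDate() { return date; }
    public void set(int id, int date) { this.id = id; this.date = date; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(id);                   // serialise the fields in a fixed order...
        out.writeInt(date);
    }

    public void readFields(DataInput in) throws IOException {
        id = in.readInt();                  // ...and read them back in the same order
        date = in.readInt();
    }

    public int compareTo(IdDate other) {    // default sort order: id, then date
        int cmp = Integer.compare(id, other.id);
        return (cmp != 0) ? cmp : Integer.compare(date, other.date);
    }

    @Override
    public String toString() {              // used when writing to a TextOutputFormat
        return id + "\t" + date;
    }

    @Override
    public int hashCode() {                 // used by the default hash partitioner
        return 31 * id + date;
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof IdDate)
            && ((IdDate) o).id == id && ((IdDate) o).date == date;
    }
}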

Writables

Comparables and Comparators

WritableComparables

Writing a function

The core concepts in MapReduce are the mapper and the reducer: these are functions which operate on subsets of the input data, receiving input records one at a time, and producing zero or more output records as a result. For any particular MapReduce Job, a mapper and a reducer are normally supplied by the user, through the setMapperClass() and setReducerClass() methods of the Job. In addition, a combiner class can be supplied, which acts as a reducer over a subset of the records output by a mapper.

Since Java does not have support for functions as first class objects, these functions must be wrapped up in a class. Two base classes are provided: Mapper and Reducer. In both cases, the class has four generic parameters: <KEYIN, VALUEIN, KEYOUT, VALUEOUT>, which give, respectively, the key and value data types for the input and output records.

The KEYOUT and VALUEOUT types for the mapper must match the types set by the setMapOutputKeyClass() and setMapOutputValueClass() methods of the job. Similarly, the KEYOUT and VALUEOUT types for the reducer must match the types set by the setOutputKeyClass() and setOutputValueClass() methods of the job. If these do not match, you will get runtime errors.

A typical mapper definition might look something like this:

[...]
       // Job set-up
        getTrips.setMapperClass(GT.Map.class);
        getTrips.setMapOutputKeyClass(IdDate.class);
        getTrips.setMapOutputValueClass(Trap.class);
[...]

    public static class Map
        extends Mapper<LongWritable, Text, IdDate, Trap> {
        private IdDate key;
        private Trap value;
        private DateFormat date_parser;
        private ParsePosition pos;

        @Override
        public void setup(Context context) {
            // Create the reusable output objects and the date parser once per task
            key = new IdDate();
            value = new Trap();
            date_parser = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
            pos = new ParsePosition(0);
        }

        @Override
        public void map(LongWritable offset, Text text, Context context)
            throws IOException, InterruptedException
        {
            // offset is the byte offset of this line within the input file
            // Process the input record into an output key and value
[...]
            // Output the parsed record with context.write(key, value)
[...]
        }
    }
Fixme: make this a nicer example

Mappers and reducers share some other common features: Contexts, and the setup() and cleanup() methods.

The setup() and cleanup() methods can be used to perform one-time initialisation and clean-up of a mapper or reducer. This can be used, for example, to load side-data from a file into structures in memory (e.g. to perform fast lookups in small data sets), or to perform other kinds of initialisation. Each process will have its setup() method called once, and then the map() or reduce() method called multiple times, and finally the cleanup() method called once at the end.

The other important feature shared between mappers and reducers is the Context object. This is passed as a parameter to all of the methods of the class, and, as the name suggests, holds the complete context of the computation. The fundamental use of the Context object is for output. It has a write() method, taking two parameters: the key and the value of the record to output. Each call to write() outputs a single record; within one call to the map() or reduce() method, write() may be called once, many times, or not at all, depending on how many records need to be generated.

The Context object also includes the Configuration that was used for the job (accessible with the getConfiguration() method), all of the Job parameters (many get*() methods), and the capability to retrieve user-defined job counters (getCounter() -- see below).

The interfaces for the Context classes can be found in the documentation under MapContext and ReduceContext.
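
As an illustrative sketch (not the GTReducer from the example job above: it uses the standard Text and IntWritable types, and the example.min.count parameter is invented for the example), a reducer that sums counts per key shows the main uses of the Context:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();
    private int minCount;

    @Override
    protected void setup(Context context) {
        // Read a parameter passed in through the Configuration
        minCount = context.getConfiguration().getInt("example.min.count", 1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        if (sum >= minCount) {
            // Zero or one output records for this key, written via the Context
            context.write(key, total);
        }
    }
}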

Mappers

Library mappers

Combiners and reducers

Library reducers

Running a Job

With a set of jobs fully configured in your code, the only remaining thing is to run them. The main Job interface can run jobs either synchronously or asynchronously -- allowing for sequential or parallel execution.

Sequential jobs

The simplest approach is just to submit each job and wait for it to complete before submitting the next. This is the synchronous interface, and consists simply of the waitForCompletion() method of the Job. For simple MapReduce tasks, this is probably the easiest approach to take.

        getTrips.waitForCompletion(true);
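
waitForCompletion() returns true if the job succeeded (the boolean argument simply controls whether progress is printed while the job runs), so it is worth propagating the result as the tool's exit code:

        return getTrips.waitForCompletion(true) ? 0 : 1;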

Parallel jobs

Sometimes, you will have several jobs which can be performed in parallel, in any order. In this instance, you can use the asynchronous interface. This consists of a submit() method, which starts the job and returns immediately. After submitting a job, the job's progress can be tracked with the setupProgress(), mapProgress() and reduceProgress() methods, and its overall status with the isComplete(), isRetired() and isSuccessful() methods.

        jctStats.submit();
        totDist.submit();
        spdList.submit();

        while(!jctStats.isComplete()
              || !totDist.isComplete()
              || !spdList.isComplete()) {
            Thread.sleep(1000);
            System.out.println(
                "jctStats map: " + (int)(jctStats.mapProgress()*100)
                + "% red: " + (int)(jctStats.reduceProgress()*100)
                + "% totDist map: " + (int)(totDist.mapProgress()*100)
                + "% red: " + (int)(totDist.reduceProgress()*100)
                + "% spdList map: " + (int)(spdList.mapProgress()*100)
                + "% red: " + (int)(spdList.reduceProgress()*100)
                + "%");
        }
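
Once the loop exits, isComplete() only tells you that each job has finished, not that it worked, so it is worth checking isSuccessful() on each job before deciding the tool's exit code. A sketch:

        if (!jctStats.isSuccessful() || !totDist.isSuccessful() || !spdList.isSuccessful()) {
            return 1;   // at least one of the parallel jobs failed
        }
        return 0;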

Job dependencies

If you have a complex set of jobs with many dependencies between them, you can manage the jobs and ensure that they are started in the correct order using the JobControl class, which runs a set of ControlledJob wrappers, each of which can declare the other jobs it depends on.
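
A minimal sketch of how this fits together, assuming two of the jobs above where spdList must wait for getTrips (that dependency is invented purely for the example; JobControl and ControlledJob live in org.apache.hadoop.mapreduce.lib.jobcontrol, and java.util.Collections supplies the dependency list):

        // Wrap each Job in a ControlledJob and declare the dependencies between them
        ControlledJob cGetTrips = new ControlledJob(getTrips, null);
        ControlledJob cSpdList = new ControlledJob(spdList,
            Collections.singletonList(cGetTrips));  // runs only after getTrips succeeds

        JobControl control = new JobControl("speedtrap-pipeline");
        control.addJob(cGetTrips);
        control.addJob(cSpdList);

        // JobControl is a Runnable; drive it from its own thread and poll for completion
        Thread controlThread = new Thread(control);
        controlThread.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
        return control.getFailedJobList().isEmpty() ? 0 : 1;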

Advanced usage

RecordReaders

Database interface

Multiple input/output files