Thanks to the "named outputs" feature, Pangool offers the ability to have more than one output per TupleMapper or TupleReducer. Note that named outputs don't replace the main job output; they are extra outputs added to the job. Named outputs are declared in the TupleMRBuilder in the following way:
mr.addNamedOutput("urls", new HadoopOutputFormat(SequenceFileOutputFormat.class), Text.class, Text.class);
You can add as many named outputs as you want. It will then be possible to write to each of the named outputs from the mapper or the reducer in the following way:
collector.getNamedOutput("urls").write(new Text("http://www.datasalt.com"), new Text("..."));
If you want to avoid looking up the named output on every write, you can cache it in the setup() method.
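For example, a reducer could cache the writer like this. This is a minimal sketch, assuming a TupleReducer whose setup() receives the Collector and assuming getNamedOutput() returns a Hadoop RecordWriter; check the actual return type in your Pangool version:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import com.datasalt.pangool.io.ITuple;
import com.datasalt.pangool.tuplemr.TupleMRException;
import com.datasalt.pangool.tuplemr.TupleReducer;

public class UrlsReducer extends TupleReducer<Text, Text> {

  // Cached once per task so it isn't looked up on every write().
  // RecordWriter is an assumption here; adjust to the actual return type.
  private transient RecordWriter<Object, Object> urlsOutput;

  @Override
  public void setup(TupleMRContext context, Collector collector)
      throws IOException, InterruptedException, TupleMRException {
    urlsOutput = collector.getNamedOutput("urls");
  }

  @Override
  public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException, TupleMRException {
    for (ITuple tuple : tuples) {
      // Write through the cached writer instead of calling getNamedOutput("urls") each time.
      urlsOutput.write(new Text(tuple.get("url").toString()), new Text("..."));
    }
  }
}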
Pangool allows us to persist tuples efficiently in binary format from a named output. The code below shows how easy it is to add a named output for persisting tuples:
Schema urlsSchema = new Schema("schema", Fields.parse("url:string, content:string"));
mr.addNamedTupleOutput("urls", urlsSchema);
Then it is possible to write tuples from the mapper or the reducer in the following way:
Tuple tuple = new Tuple(urlsSchema);
tuple.set("url", "http://www.datasalt.com");
tuple.set("content", "...");
collector.getNamedOutput("urls").write(tuple, NullWritable.get());
Note that you need to pass NullWritable as the second parameter of the write() method. This is a small piece of bureaucracy needed in order to be compatible with the Hadoop "key, value" API.
You can avoid declaring a huge number of named outputs by using the setDefaultNamedOutput(...) methods. By declaring a default "spec", you can open an arbitrary named output from your mapper or reducer and its format will default to the one defined there.
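For instance (a minimal sketch; the exact setDefaultNamedOutput(...) signature is an assumption here, mirroring the addNamedOutput(...) call shown above without the name):

// Declare the default "spec" once on the builder:
mr.setDefaultNamedOutput(new HadoopOutputFormat(SequenceFileOutputFormat.class), Text.class, Text.class);
// Later, any named output can be opened from the mapper or reducer without
// having been declared; it falls back to the default spec. "urls-" + day is
// a hypothetical dynamic name used only for illustration:
collector.getNamedOutput("urls-" + day).write(new Text("http://www.datasalt.com"), new Text("..."));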
Some Hadoop-based OutputFormats may require properties in the Hadoop Configuration in order to operate. Because Pangool can have an arbitrary number of named outputs open at a time, it can't use the job's Configuration object for that purpose. To overcome this limitation, some of the methods that configure named outputs accept a Map<String, String> that can be used to pass specific key/value configuration pairs to each named output.
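For example (a minimal sketch; the overload of addNamedOutput(...) taking a Map<String, String> is an assumption based on the description above, and the compression property is just an illustrative Hadoop setting):

Map<String, String> specificContext = new HashMap<String, String>();
specificContext.put("mapred.output.compress", "true"); // applies only to this named output
mr.addNamedOutput("urls", new HadoopOutputFormat(SequenceFileOutputFormat.class), Text.class, Text.class, specificContext);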
Let's imagine that you have a Pangool job with two named outputs: "urls" and "other_named_output".
Your job uses the urls named output for writing from both the mapper and the reducer, but it only writes to other_named_output from the reducer. Let's also imagine that the job has 2 map tasks and 3 reduce tasks. The output folder will then look like this:
── job_output
   ├── urls
   │   ├── part-r-00000
   │   ├── part-r-00001
   │   ├── part-r-00002
   │   ├── part-m-00000
   │   └── part-m-00001
   ├── other_named_output
   │   ├── part-r-00000
   │   ├── part-r-00001
   │   └── part-r-00002
   ├── part-r-00000
   ├── part-r-00001
   └── part-r-00002
The part-m files correspond to the mapper output, while the part-r files correspond to the reducer output. The part-r files in the root folder contain the records emitted through the main job's output, that is, the ones emitted by calling collector.write().
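To make the correspondence concrete, here is a minimal sketch from inside a reducer, assuming the job's main output is declared as (Text, Text); the destinations in the comments follow the layout above:

collector.write(new Text("key"), new Text("value")); // lands in job_output/part-r-*
collector.getNamedOutput("urls").write(new Text("http://www.datasalt.com"), new Text("...")); // lands in job_output/urls/part-r-*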