Thanks to the "named outputs" feature, Pangool offers the ability to have more than one output per TupleMapper or TupleReducer. Note that named outputs don't replace the main job output; they are extra outputs added to the job. Named outputs are declared in the TupleMRBuilder in the following way:
mr.addNamedOutput("urls", new HadoopOutputFormat(SequenceFileOutputFormat.class), Text.class, Text.class);
You can add as many named outputs as you want. It will then be possible to write to each of the named outputs from the mapper or the reducer in the following way:
collector.getNamedOutput("urls").write(new Text("http://www.datasalt.com"), new Text("..."));
If you want to avoid looking up the named output on every write, you can cache it in the setup() method.
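For example, a reducer could cache the writer like this. This is a minimal sketch, assuming a TupleReducer whose setup() receives the Collector and assuming getNamedOutput() returns a Hadoop RecordWriter; check the actual return type in your Pangool version:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import com.datasalt.pangool.io.ITuple;
import com.datasalt.pangool.tuplemr.TupleMRException;
import com.datasalt.pangool.tuplemr.TupleReducer;

public class UrlsReducer extends TupleReducer<Text, Text> {

  // Cached once per task so it isn't looked up on every write().
  // RecordWriter is an assumption here; adjust to the actual return type.
  private transient RecordWriter<Object, Object> urlsOutput;

  @Override
  public void setup(TupleMRContext context, Collector collector)
      throws IOException, InterruptedException, TupleMRException {
    urlsOutput = collector.getNamedOutput("urls");
  }

  @Override
  public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException, TupleMRException {
    for (ITuple tuple : tuples) {
      // Write through the cached writer instead of calling getNamedOutput("urls") each time.
      urlsOutput.write(new Text(tuple.get("url").toString()), new Text("..."));
    }
  }
}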
Pangool allows us to persist tuples efficiently in binary format from a named output. The code below shows how easy it is to add a named output for persisting tuples:
Schema urlsSchema = new Schema("schema", Fields.parse("url:string, content:string"));
mr.addNamedTupleOutput("urls", urlsSchema);
Then it is possible to write tuples from the mapper or the reducer in the following way:
Tuple tuple = new Tuple(urlsSchema);
tuple.set("url", "http://www.datasalt.com");
tuple.set("content", "...");
collector.getNamedOutput("urls").write(tuple, NullWritable.get());
Note that you need to pass NullWritable as the second parameter of the write() method. This is a small piece of bureaucracy needed in order to be compatible with the Hadoop "key, value" API.
You can avoid declaring a huge number of named outputs by using the setDefaultNamedOutput(...) methods. By declaring a default "spec", you can open an arbitrary named output from your mapper or reducer and its format will default to the one defined there.
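For instance (a minimal sketch; the exact setDefaultNamedOutput(...) signature is an assumption here, mirroring the addNamedOutput(...) call shown above without the name):

// Declare the default "spec" once on the builder:
mr.setDefaultNamedOutput(new HadoopOutputFormat(SequenceFileOutputFormat.class), Text.class, Text.class);
// Later, any named output can be opened from the mapper or reducer without
// having been declared; it falls back to the default spec. "urls-" + day is
// a hypothetical dynamic name used only for illustration:
collector.getNamedOutput("urls-" + day).write(new Text("http://www.datasalt.com"), new Text("..."));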
Some Hadoop-based OutputFormats may require properties in the Hadoop Configuration in order to operate. Because Pangool can have an arbitrary number of named outputs open at a time, it can't use the job's Configuration object for that purpose. To overcome this limitation, some of the methods that configure named outputs accept a Map<String, String> that can be used to pass specific key/value configuration pairs to each named output.
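For example (a minimal sketch; the overload of addNamedOutput(...) taking a Map<String, String> is an assumption based on the description above, and the compression property is just an illustrative Hadoop setting):

Map<String, String> specificContext = new HashMap<String, String>();
specificContext.put("mapred.output.compress", "true"); // applies only to this named output
mr.addNamedOutput("urls", new HadoopOutputFormat(SequenceFileOutputFormat.class), Text.class, Text.class, specificContext);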
Let's imagine that you have a Pangool job with two named outputs: "urls" and "other_named_output".
Your job uses the urls named output for writing from both the mapper and the reducer, but it only writes to other_named_output from the reducer. Let's also imagine that the job has 2 map tasks and 3 reduce tasks. The output folder will then look like this:
── job_output
   ├── urls
   │   ├── part-r-00000
   │   ├── part-r-00001
   │   ├── part-r-00002
   │   ├── part-m-00000
   │   └── part-m-00001
   ├── other_named_output
   │   ├── part-r-00000
   │   ├── part-r-00001
   │   └── part-r-00002
   ├── part-r-00000
   ├── part-r-00001
   └── part-r-00002
The part-m files correspond to the mapper output, while the part-r files correspond to the reducer output. The part-r files in the root folder contain the records emitted through the main job's output, that is, the ones emitted by calling collector.write().
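To make the correspondence concrete, here is a minimal sketch from inside a reducer, assuming the job's main output is declared as (Text, Text); the destinations in the comments follow the layout above:

collector.write(new Text("key"), new Text("value")); // lands in job_output/part-r-*
collector.getNamedOutput("urls").write(new Text("http://www.datasalt.com"), new Text("...")); // lands in job_output/urls/part-r-*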