The way Pangool partitions data by default is sufficient for almost all cases.
For jobs that don't use rollup, Pangool partitions by a combined hash of the group-by fields.
For information on how rollup is handled, check the rollup section of this guide.
However, there are some cases where you'd want to partition by other fields:
for instance, when you want to make your job more parallelizable by shuffling the
data and aggregating partial results in a second job afterwards.
In these cases you can use the convenience method:
grouper.setCustomPartitionFields(fields)
This method partitions by a combined hash of the fields you pass. These fields must be present in every intermediate schema defined, unless field aliases are defined for them.
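For example, here is a minimal sketch, assuming an already-created TupleMRBuilder named builder, a schema containing the illustrative fields "url" and "date" (hypothetical names), and a job that groups by both fields but partitions only by "url":

// "url" and "date" are hypothetical field names, for illustration only.
builder.addIntermediateSchema(schema);
builder.setGroupByFields("url", "date");
// All records with the same "url" now reach the same reducer regardless of
// "date", so a follow-up job can combine per-url partial aggregates.
builder.setCustomPartitionFields("url");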
However, if you want a custom strategy other than the combined hash,
you'll need to implement your own org.apache.hadoop.mapreduce.Partitioner.
You can mimic the implementation of TupleHashPartitioner.
For instance:
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

import com.datasalt.pangool.io.DatumWrapper;
import com.datasalt.pangool.io.ITuple;

public class MyRangePartitioner extends Partitioner<DatumWrapper<ITuple>, NullWritable>
    implements Configurable {

	private Configuration conf;

	@Override
	public int getPartition(DatumWrapper<ITuple> key, NullWritable value, int numPartitions) {
		ITuple tuple = key.datum();
		int year = (Integer) tuple.get("year");
		if(year < 1900) {
			return 0; // everything before 1900 goes to the first partition
		} else if(year < 2000) {
			return 1; // the 20th century goes to the second partition
		} else {
			// spread the remaining years over the rest of the partitions
			return ((year - 2000) % (numPartitions - 2)) + 2;
		}
	}

	// Required because the class implements Configurable
	@Override
	public void setConf(Configuration conf) {
		this.conf = conf;
	}

	@Override
	public Configuration getConf() {
		return conf;
	}
}
And then assign it to your job:
TupleMRBuilder builder = new TupleMRBuilder("my_job");
/*
 * .. configure it here:
 * builder.addIntermediateSchema(..);
 * builder.setGroupByFields(..);
 * ...
 */
Job job = builder.createJob();
job.setPartitionerClass(MyRangePartitioner.class);
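Because MyRangePartitioner sends pre-1900 years to partition 0, the 20th century to partition 1 and spreads the rest over the remaining partitions, the job needs at least three reducers. A minimal sketch (the value 10 is arbitrary):

// Must be >= 3, otherwise (numPartitions - 2) above is zero or negative.
job.setNumReduceTasks(10);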