Pangool User Guide

Mapred API

The Pangool MapReduce API consists mainly of the following classes:

  • TupleMapper :

    Subclasses of this class can be used as Mappers in Pangool jobs.
    This class takes two generic types: the key and value types of the input format. Because a TupleMapper always emits tuples as intermediate output, only the input types need to be declared. It provides three methods: setup(), map() and cleanup().

  • TupleReducer :

    Subclasses of this class can be used as Reducers in Pangool jobs.
    This class takes two generic types: the key and value types of the output format.
    Because a TupleReducer always receives ITuple groups and values from the intermediate output, only the output types need to be declared. It provides three methods: setup(), reduce() and cleanup(). A TupleReducer can also be used as a Combiner, as long as its output types are (ITuple, NullWritable).

  • TupleMRContext :

    An instance of this class is received by both TupleMapper and TupleReducer. The user can get the standard Hadoop Context object through getHadoopContext() to use counters, progress(), etc.

  • TupleRollupReducer :

    Reducer to use when rollup is enabled. It adds the extra methods onOpen() and onClose(). For information on rollup, see the rollup section of the user guide.

  • MapOnlyJobBuilder :

    Use this class to conveniently create map-only jobs (jobs that have a Mapper step and no Reducer step).

  • TupleMRBuilder :

    Use this class to create job instances that use the Pangool API. The most important methods are:

    addIntermediateSchema: Allows the user to define intermediate Schemas. At least one must be defined. When performing joins, usually more than one schema will be defined (see the joins section for more information).
    addInput: Allows the user to add an input Path with an associated input format and TupleMapper. You can add an arbitrary number of inputs with this same method.
    addTupleInput: This method must be used when reading tuple inputs (files that were generated by Pangool jobs that wrote tuples as output).
    setOutput: Allows the user to define the job’s main output Path and format.
    setTupleOutput: This method must be used when writing tuples as the main output of the Job. It takes an associated Schema so that Pangool knows how to write the Tuples.
    addNamedOutput: See named outputs.
    addNamedTupleOutput: See named outputs.
    setDefaultNamedOutput: See named outputs.
    setTupleReducer: Sets the TupleReducer instance to be used.
    setTupleCombiner: Sets the TupleReducer instance to be used as Combiner for the Job.
    setGroupByFields / setOrderBy: Configure how Pangool sorts and groups the intermediate tuples. For more info, check the “Group & Sort by” section.
    createJob(): Returns the job instance, ready to be run.
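
The effect of grouping and sorting the intermediate tuples can be pictured with a small in-memory simulation. This is only a sketch in plain Java: it does not use any Pangool class, and the Visit type, its url and timestamp fields, and the groupAndSort helper are hypothetical names introduced for illustration. It also assumes that the sort order starts with the group-by field, so that each group arrives as one contiguous, ordered run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: none of these types are Pangool classes.
class GroupSortSketch {

  // Hypothetical stand-in for an intermediate tuple with two fields.
  static final class Visit {
    final String url;
    final long timestamp;

    Visit(String url, long timestamp) {
      this.url = url;
      this.timestamp = timestamp;
    }

    @Override
    public String toString() {
      return url + "@" + timestamp;
    }
  }

  // Sorts by (url, timestamp), then splits the sorted run into groups by url.
  // This imitates "group by url, order by url and timestamp".
  static Map<String, List<Visit>> groupAndSort(List<Visit> tuples) {
    List<Visit> sorted = new ArrayList<>(tuples);
    sorted.sort(Comparator.comparing((Visit v) -> v.url)
        .thenComparingLong(v -> v.timestamp));

    // LinkedHashMap keeps the groups in the order they first appear,
    // which after sorting is ascending url order.
    Map<String, List<Visit>> groups = new LinkedHashMap<>();
    for (Visit v : sorted) {
      groups.computeIfAbsent(v.url, k -> new ArrayList<>()).add(v);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Visit> input = Arrays.asList(
        new Visit("b.com", 30), new Visit("a.com", 20),
        new Visit("a.com", 10), new Visit("b.com", 5));
    // Each entry corresponds to one reduce call: a group key plus its ordered values.
    System.out.println(groupAndSort(input));
  }
}
```

In this sketch each map entry plays the role of one reduce call: the key is the group and the list holds the values in their sorted order. In a real job the sorting and grouping happen in the shuffle phase rather than in memory.
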

  • TupleTextInputFormat :

    Use this input format for reading text files into Pangool's Tuples. See Text I/O for more info.

  • TupleTextOutputFormat :

    Use this output format for writing text files out of Pangool's Tuples. See Text I/O for more info.

  • IdentityTupleMapper :

    Use this Mapper implementation when your Mapper only needs to emit each Tuple exactly as it is read (this applies when using Tuple inputs).

  • IdentityTupleReducer :

    Use this Reducer implementation when your Reducer only needs to emit the Tuples exactly as it receives them (that is, every Tuple in the values’ Iterator).
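
The setup(), map() and cleanup() methods listed for TupleMapper follow a per-task lifecycle, which the following plain-Java sketch imitates. It is not Pangool code: SketchMapper, WordMapper and runTask are hypothetical stand-ins, the output collector is reduced to a plain list of strings, and the call order (setup once, map once per record, cleanup once) is assumed to match the usual Hadoop task order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only: SketchMapper is a hypothetical stand-in,
// not Pangool's TupleMapper.
class LifecycleSketch {

  // Mirrors the three hooks; K and V are the input key and value types.
  abstract static class SketchMapper<K, V> {
    void setup(List<String> out) {}

    abstract void map(K key, V value, List<String> out);

    void cleanup(List<String> out) {}
  }

  // Drives one simulated task: setup once, map per record, cleanup once.
  static <V> List<String> runTask(SketchMapper<Long, V> mapper, List<V> records) {
    List<String> out = new ArrayList<>();
    mapper.setup(out);
    long index = 0;
    for (V record : records) {
      mapper.map(index++, record, out);
    }
    mapper.cleanup(out);
    return out;
  }

  // Example mapper: emits every word and reports the line count at cleanup.
  static final class WordMapper extends SketchMapper<Long, String> {
    private int lines;

    @Override
    void setup(List<String> out) {
      lines = 0; // per-task state is initialised here, not in map()
    }

    @Override
    void map(Long key, String value, List<String> out) {
      lines++;
      for (String word : value.split("\\s+")) {
        if (!word.isEmpty()) {
          out.add(word);
        }
      }
    }

    @Override
    void cleanup(List<String> out) {
      out.add("#lines=" + lines);
    }
  }

  public static void main(String[] args) {
    System.out.println(runTask(new WordMapper(), Arrays.asList("a b", "c")));
  }
}
```

The same pattern applies to the reducer side, with reduce() called once per group instead of once per record.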