Pangool is an open-source implementation of what we call Tuple Map/Reduce based on the Hadoop Java MapReduce API.
Pangool is a low-level Java MapReduce API that aims to be a replacement for the Hadoop Java MapReduce API. By introducing an intermediate Tuple-based schema and a more convenient way to configure a Job, it removes many of the accidental complexities that arise when using the Hadoop Java MapReduce API. Things like secondary sort and reduce-side joins become much easier to implement and understand. Pangool's performance is comparable to that of the Hadoop Java MapReduce API. Pangool also augments Hadoop's API by making multiple inputs and outputs first-class and by allowing instance-based configuration.
By using Tuples instead of (key, value) pairs, users are not forced to write their own custom data types (e.g. Writables) or to use external serialization libraries when working with more than two fields.
However, Pangool's Tuples can still contain arbitrary data types through custom serialization.
In Pangool you can say groupBy("user", "country"), sortBy("user", "country", "name"). Pangool will use an intelligent and efficient Partitioner, Sort and Group Comparator underneath, just like an advanced user would do with the plain Hadoop MapReduce API.
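To make the group-by / sort-by idea concrete, here is a minimal plain-Java sketch of what such a configuration means semantically. It does not use Pangool or Hadoop at all: the class GroupSortSketch, the map-based "tuple" and the helper methods are hypothetical names invented for this illustration, and the whole thing runs in memory. It assumes the group-by fields are a prefix of the sort-by fields, as in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupSortSketch {

    // A toy tuple: named fields mapped to string values.
    // Hypothetical stand-in for illustration, not Pangool's Tuple class.
    static Map<String, String> tuple(String user, String country, String name) {
        Map<String, String> t = new LinkedHashMap<>();
        t.put("user", user);
        t.put("country", country);
        t.put("name", name);
        return t;
    }

    // Compares two tuples field by field, in the order the fields are listed.
    static int compare(Map<String, String> a, Map<String, String> b, List<String> fields) {
        for (String field : fields) {
            int c = a.get(field).compareTo(b.get(field));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    // Sorts by the sortBy fields, then starts a new group whenever the
    // groupBy fields change. Only meaningful when groupBy is a prefix of sortBy.
    static List<List<Map<String, String>>> groupSorted(
            List<Map<String, String>> input, List<String> groupBy, List<String> sortBy) {
        List<Map<String, String>> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> compare(a, b, sortBy));

        List<List<Map<String, String>>> groups = new ArrayList<>();
        Map<String, String> previous = null;
        for (Map<String, String> t : sorted) {
            if (previous == null || compare(previous, t, groupBy) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(t);
            previous = t;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map<String, String>> input = Arrays.asList(
                tuple("bob", "US", "zoe"),
                tuple("alice", "ES", "mia"),
                tuple("bob", "US", "amy"),
                tuple("alice", "ES", "eva"));
        // Each printed line is one group, with its tuples ordered by name.
        for (List<Map<String, String>> group : groupSorted(
                input,
                Arrays.asList("user", "country"),
                Arrays.asList("user", "country", "name"))) {
            System.out.println(group);
        }
    }
}
```

In a real job the sorting happens during the shuffle and the grouping decides which tuples reach a single reduce call; the sketch only mirrors that behaviour on a small list.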
Doing reduce-side joins with Pangool is straightforward. By using Tuples and configuring your MapReduce jobs properly, you can easily join various datasets and perform arbitrary business logic on them. Again, Pangool will know how to partition, sort and group efficiently underneath.
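For readers unfamiliar with the technique, the following plain-Java sketch shows the general shape of a reduce-side join: tag each record with its source, sort by join key and then by source, and combine records that share a key. It is an in-memory illustration only, not Pangool code; the class ReduceSideJoinSketch, the Tagged record and the users/orders datasets are all hypothetical, and an inner join with at most one user per key is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReduceSideJoinSketch {

    // A record tagged with the dataset it came from (0 = users, 1 = orders).
    static final class Tagged {
        final String key;
        final int source;
        final String value;

        Tagged(String key, int source, String value) {
            this.key = key;
            this.source = source;
            this.value = value;
        }
    }

    // Joins users (userId, name) with orders (userId, amount) on userId.
    static List<String> join(List<String[]> users, List<String[]> orders) {
        // "Map" step: emit every record under its join key, tagged by source.
        List<Tagged> all = new ArrayList<>();
        for (String[] u : users) {
            all.add(new Tagged(u[0], 0, u[1]));
        }
        for (String[] o : orders) {
            all.add(new Tagged(o[0], 1, o[1]));
        }

        // "Shuffle" step: sort by join key, then by source, so that within
        // each key the user record arrives before that key's orders.
        all.sort(Comparator.comparing((Tagged t) -> t.key)
                .thenComparingInt(t -> t.source));

        // "Reduce" step: remember the user for the current key and pair it
        // with each order that follows. Orders without a user are dropped.
        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String currentUser = null;
        for (Tagged t : all) {
            if (!t.key.equals(currentKey)) {
                currentKey = t.key;
                currentUser = null;
            }
            if (t.source == 0) {
                currentUser = t.value;
            } else if (currentUser != null) {
                joined.add(currentKey + "," + currentUser + "," + t.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"u1", "Ana"},
                new String[] {"u2", "Ben"});
        List<String[]> orders = Arrays.asList(
                new String[] {"u2", "30"},
                new String[] {"u1", "10"},
                new String[] {"u3", "99"},
                new String[] {"u1", "20"});
        for (String row : join(users, orders)) {
            System.out.println(row);
        }
    }
}
```

Sorting by source within each key is what lets the reduce step hold only one user record in memory instead of buffering the whole group.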
Mappers, Combiners, Reducers, Input / Output Formats and Comparators can be passed as object instances. Pangool will serialize each instance into the DistributedCache and reinstantiate the object when needed. This way, boilerplate configuration code is no longer needed.
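The idea behind instance-based configuration can be sketched with standard Java serialization: an object carries its settings in its fields, is turned into bytes, and is rebuilt elsewhere with the same state. The example below is only an illustration under that assumption. It does not touch the DistributedCache or any Pangool class, and ThresholdFilter and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class InstanceConfigSketch {

    // Hypothetical mapper-like object: its configuration is just a field
    // set in the constructor, with no external configuration keys.
    static final class ThresholdFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int threshold;

        ThresholdFilter(int threshold) {
            this.threshold = threshold;
        }

        boolean accept(int value) {
            return value >= threshold;
        }
    }

    // Serializes the instance to bytes and rebuilds it, standing in for
    // "ship the object to a task and reinstantiate it there".
    static ThresholdFilter roundTrip(ThresholdFilter original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (ThresholdFilter) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not copy instance", e);
        }
    }

    public static void main(String[] args) {
        ThresholdFilter copy = roundTrip(new ThresholdFilter(10));
        // The rebuilt object still knows its threshold.
        System.out.println(copy.accept(12));
        System.out.println(copy.accept(3));
    }
}
```

With the plain Hadoop API, a value like the threshold would typically be written into the job Configuration under a string key and read back during task setup; passing an instance avoids that extra code.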
Multiple inputs & outputs in Pangool are part of its standard API.
Tuples may be persisted and used as input to other Jobs by using TupleOutputFormat / TupleInputFormat.
Pangool is an alternative to the Java Hadoop MapReduce API. The same things can be achieved with either one. Pangool's performance is quite close to that of Hadoop's MapReduce API (see our benchmark with other tools for a reference). Pangool just makes life easier for those who require the efficiency and flexibility of the plain Java Hadoop MapReduce API.