Frequently Asked Questions (F.A.Q.)

1 - Does the world need another API on top of Hadoop? / How does Pangool differ from e.g. Cascading, Pig, Hive or Crunch?

Pangool is a low-level Java MapReduce API. Because Pig and Hive are query languages rather than Java APIs, they are not directly comparable. That leaves Crunch and Cascading.

Although Pangool shares some goals (simplicity, removal of accidental complexity) and some design patterns (Tuples) with these tools, it is not directly comparable to any of them, because none of them completely replaces the Hadoop Java MapReduce API.

Using any of the above tools always involves a tradeoff. Many problems will be easier to solve with such tools, but others will remain easier to reason about and program using plain MapReduce. These tools also impose a certain performance penalty - and if you really care about performance, you will end up coding with the plain MapReduce API.

Pangool’s performance is very close to that of the plain MapReduce API (see our benchmark for reference).

2 - Ok, but... Why would I use Pangool instead of e.g. Cascading, Pig, Hive or Crunch?

Pangool aims to be a replacement for the plain Hadoop MapReduce Java API. That means you would use Pangool for the same reasons you would use Hadoop Java MapReduce; the difference is that Pangool makes everything much easier and smoother while allowing you to do the same things with about the same performance - so you no longer need the Hadoop Java MapReduce API.

3 - Can I chain two or more Pangool Jobs in a flow easily?

Pangool is not a flow management library. You can chain Pangool Jobs just like you would chain plain Java Hadoop MapReduce Jobs. While this may be sufficient for some cases, in other cases it can be more convenient to use a flow management library, or higher-level tools.
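The idea of chaining can be sketched in plain Java. This is a hypothetical, self-contained illustration (the names and stand-in job functions are ours, not Pangool's or Hadoop's API): each "job" is a transformation over a dataset, and chaining simply means running the jobs in sequence, with the first job's output becoming the second job's input.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of chaining two jobs sequentially. In real Hadoop or
// Pangool code the "dataset" would be files on HDFS and each step a full
// MapReduce job, but the control flow is the same: run job 1, then job 2.
public class JobChainSketch {

    // Stand-in for a MapReduce job: a transformation applied to every record.
    static List<String> runJob(List<String> input, Function<String, String> step) {
        return input.stream().map(step).collect(Collectors.toList());
    }

    static List<String> chain(List<String> input) {
        // Job 1's output feeds job 2's input.
        List<String> intermediate = runJob(input, String::toUpperCase); // job 1
        return runJob(intermediate, s -> s + "!");                      // job 2
    }
}
```

A flow management library adds value on top of this pattern by handling dependencies between many such jobs and running independent ones in parallel.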

We have been working on an experimental library called "Pangool-flow" that adds higher-level constructs and operations, flow management and parallel Job execution. It is still in development and not ready for production use, but you can try it at your own risk.

4 - Is Pangool a Serialization API? Doesn't it all look like Avro?

Although Pangool’s Tuples resemble Avro Records, they are not exactly the same. Pangool is not a serialization library.

Pangool uses Tuples as an extension of the (key, value) model imposed by traditional MapReduce - see our post about Tuple MapReduce for a reference on this. Although Pangool implements mappings for basic types, Tuple fields are serialization-agnostic: they may contain arbitrary data types that Hadoop serializes using custom serialization.
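The relationship between the two models can be sketched in plain Java. This is a hypothetical illustration, not Pangool's actual Tuple or Schema API: a tuple is a named, ordered set of fields, and the classic (key, value) pair is just the special case of a two-field tuple. With named fields, grouping and sorting can be declared over any subset of fields instead of a single opaque key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (names are ours, not the real Pangool API): an ordered
// field-name -> value mapping stands in for a schema-backed tuple.
public class TupleSketch {

    // Build a tuple from alternating field names and values, preserving order.
    static Map<String, Object> tuple(Object... fieldsAndValues) {
        Map<String, Object> t = new LinkedHashMap<>();
        for (int i = 0; i < fieldsAndValues.length; i += 2) {
            t.put((String) fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return t;
    }

    // The traditional MapReduce model is the two-field special case.
    static Map<String, Object> keyValue(Object key, Object value) {
        return tuple("key", key, "value", value);
    }
}
```

Under this view, a job that today emits (url, count) pairs could instead emit tuples like tuple("url", u, "day", d, "count", c) and group by (url, day) directly, which is the kind of convenience the Tuple MapReduce model is after.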

5 - How does using Pangool compare to using the plain Hadoop Java MapReduce API? Is there a performance penalty?

Our initial benchmark shows that the penalty of using Pangool compared to the plain Hadoop MapReduce API is between 5% and 8%. Higher-level tools may impose penalties above 100%. That is why we say Pangool aims to be a replacement for the Hadoop Java API.