Introduction to Pangool: Topical Word Count (2/3)

This introduction guides you through the basics of Pangool with a Word-Count-like example.

Now you'll see how to leverage Pangool's instance-based configuration to manage simple state information.

Managing state

We’ll modify the previous example slightly to show another of Pangool's key features: an augmented Hadoop API that accepts instances instead of static classes.

A common use case when dealing with textual content is stop-word filtering, so let's extend the example to filter words against a stop-word list. See the code below:

You can check the full code of this example on GitHub.

 public static class StopWordMapper extends TokenizeMapper {

   private Set<String> stopWords = new HashSet<String>();

   public StopWordMapper(List<String> stopWords) {
     this.stopWords.addAll(stopWords);
     // Wrap the set so the state can't be mutated after construction
     this.stopWords = Collections.unmodifiableSet(this.stopWords);
   }

   @Override
   protected void emitTuple(Collector collector) throws IOException, InterruptedException {
     // Skip the tuple entirely if the current word is a stop word
     if(stopWords.contains(tuple.get("word"))) {
       return;
     }
     super.emitTuple(collector);
   }
 }

As you can see, this Mapper extends the one we created in the first part of the introduction and receives a List of stop words through its constructor.

It then uses this list to filter words before writing the tuples to the intermediate output.

Let’s see how we can use this Mapper when creating a Pangool Job:

 List<String> stopWords = Files.readLines(new File(args[2]), Charset.forName("UTF-8"));

 TupleMRBuilder cg = new TupleMRBuilder(conf, "Pangool Topical Word Count With Stop Words");
 cg.addIntermediateSchema(TopicalWordCount.getSchema());
 cg.setGroupByFields("topic", "word");
 // The Mapper instance carries the stop-word list as serialized state
 StopWordMapper mapper = new StopWordMapper(stopWords);
 cg.addInput(new Path(args[0]), new HadoopInputFormat(TextInputFormat.class), mapper);
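
The rest of the job wiring is unchanged from the first part of this tutorial. As a reminder, a minimal sketch of those remaining lines, assuming part one's CountReducer and output schema, might look like this:

 // Sketch of the remaining wiring from part one: reducer, output, submission
 cg.setTupleReducer(new CountReducer());
 cg.setTupleOutput(new Path(args[1]), TopicalWordCount.getSchema());
 cg.createJob().waitForCompletion(true);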

That’s it! Pangool will serialize the instance and restore it when needed. Remember that all state in your classes must be Serializable; if some of your fields are not, instantiate them in the setup() method instead of directly in the class definition.
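
To make this rule concrete, here's a minimal, self-contained sketch in plain Java (no Pangool involved; the class, field, and method names are invented for illustration) of what happens to instance state during a serialize/restore round trip:

 import java.io.*;
 import java.security.MessageDigest;

 // Hypothetical stand-alone class (not part of Pangool) showing the rule:
 // Serializable fields travel with the instance; non-Serializable ones
 // must be transient and rebuilt in a setup-style method after recovery.
 public class StatefulExample implements Serializable {

   private String prefix;                  // Serializable: survives the round trip
   private transient MessageDigest digest; // MessageDigest is not Serializable

   public StatefulExample(String prefix) {
     this.prefix = prefix;
   }

   // In a Pangool Mapper, this initialization would belong in setup()
   public void setup() throws Exception {
     digest = MessageDigest.getInstance("MD5");
   }

   public static void main(String[] args) throws Exception {
     StatefulExample original = new StatefulExample("topic-");

     // Simulate what Pangool does: serialize the configured instance...
     ByteArrayOutputStream bytes = new ByteArrayOutputStream();
     ObjectOutputStream out = new ObjectOutputStream(bytes);
     out.writeObject(original);
     out.flush();

     // ...and restore it later. The transient field comes back as null,
     // which is why it has to be re-created in setup() before use.
     StatefulExample restored = (StatefulExample) new ObjectInputStream(
         new ByteArrayInputStream(bytes.toByteArray())).readObject();
     restored.setup();
     System.out.println(restored.prefix + "works"); // prints "topic-works"
   }
 }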

You can run this example by doing:

 hadoop jar $PANGOOL_EXAMPLES_JAR topical_word_count_with_stop_words [input] [output] [stop-words-file]

You can also use an input data generator to create random input for this example:

 hadoop jar $PANGOOL_EXAMPLES_JAR topical_word_count_gen_data [out-file] [nRegisters] [nTopics]
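
For instance, a full run might look like the following (the file names are just examples; the stop-words file is plain text with one word per line, since the code above reads it line by line):

 hadoop jar $PANGOOL_EXAMPLES_JAR topical_word_count_gen_data input.txt 100000 5
 hadoop jar $PANGOOL_EXAMPLES_JAR topical_word_count_with_stop_words input.txt output stopwords.txt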

What's next? Secondary sort & Named outputs!