In the Pangool release there is a Hadoop-executable JAR with Pangool examples. We'll show how to use it to execute the "Moving average" example.
The "Moving average" example calculates the moving average of unique visitors for different URLs for each date in a specific timeframe.
In the input file we'll have something like:
url1 2011-10-28 10
url1 2011-10-29 20
url1 2011-10-30 30
url1 2011-10-31 40
For a 3-day moving average we'll have as output:

url1 2011-10-28 10.0
url1 2011-10-29 15.0
url1 2011-10-30 20.0
url1 2011-10-31 30.0
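As the sample shows, the window is trailing: for the first dates of the series, the average is taken over the days available so far. The same computation can be sketched outside Hadoop with a short awk pipeline (a standalone illustration, not part of the Pangool example; it assumes the input holds a single URL, sorted by date):

```shell
# Trailing 3-day moving average of the third column (unique visitors).
# Assumes one URL and records sorted by date.
printf 'url1 2011-10-28 10\nurl1 2011-10-29 20\nurl1 2011-10-30 30\nurl1 2011-10-31 40\n' |
awk '{
  v[NR] = $3
  from = (NR > 3) ? NR - 2 : 1          # window start: at most 3 days back
  sum = 0
  for (i = from; i <= NR; i++) sum += v[i]
  printf "%s %s %.1f\n", $1, $2, sum / (NR - from + 1)
}'
```

This prints the same four averages shown above.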
For each Pangool example there is an associated random input data generator. We can execute the Moving average input data generator like this:
hadoop jar pangool-examples-*-hadoop.jar moving_average_gen_data url_regs.txt 10000 200
(If you omit the parameters, a meaningful help line will appear to guide you.)
The generated file will be in the local filesystem. You can copy it to HDFS:
hadoop fs -put url_regs.txt .
Now you can execute the example against the fresh "url_regs.txt" file:
hadoop jar pangool-examples-*-hadoop.jar moving_average url_regs.txt out-moving-average 5
After this has finished, you'll have generated 5-day moving averages for your input data file.
If you are working from an IDE such as Eclipse, you can execute the following code in a new class of your choice:
ToolRunner.run(new MovingAverageGenData(), new String[] { "url_regs.txt", "10000", "200" });
ToolRunner.run(new MovingAverage(), new String[] { "url_regs.txt", "out-moving-average", "5" });
Create a folder with the following structure:
oozie-app/
├── lib
│   ├── antlr-2.7.7.jar
│   ├── antlr-3.0.1.jar
│   ├── [many more]
│   ├── jetty-util-6.1.14.jar
│   └── jline-0.9.94.jar
├── joblibs
│   ├── antlr-2.7.7.jar
│   ├── antlr-3.0.1.jar
│   ├── [many more]
│   ├── jetty-6.1.14.jar
│   ├── jetty-util-6.1.14.jar
│   ├── xml-apis-1.3.04.jar
│   ├── xz-1.0.jar
│   └── zookeeper-3.4.3.jar
└── workflow.xml
The libraries in the lib and joblibs folders must be the same. These folders must contain all the dependencies of the examples plus pangool-examples.jar itself. In other words, they must contain every library you can find in the lib folder when you decompress pangool-examples-*-hadoop.jar, plus pangool-examples.jar.
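One way to populate both folders is sketched below. This is an assumption, not an official procedure: it supposes the Hadoop-executable JAR carries its dependencies in a lib/ directory (a JAR is a ZIP archive, so unzip works on it) and that pangool-examples-0.60.3.jar sits in the current directory; adjust the names to your release.

```shell
mkdir -p oozie-app/lib oozie-app/joblibs
# Extract the bundled dependencies out of the Hadoop-executable JAR
# (-j drops the lib/ path prefix so the JARs land flat in the target folder).
unzip -j -o pangool-examples-0.60.3-hadoop.jar 'lib/*.jar' -d oozie-app/lib
# Both folders must hold the same libraries...
cp oozie-app/lib/*.jar oozie-app/joblibs/
# ...plus the examples JAR itself.
cp pangool-examples-0.60.3.jar oozie-app/lib/ oozie-app/joblibs/
```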
The file workflow.xml should look like the following:

<workflow-app xmlns='uri:oozie:workflow:0.1' name='java-main-wf'>
    <start to='java1' />
    <action name='java1'>
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>com.datasalt.pangool.examples.Driver</main-class>
            <arg>game_of_life</arg>
            <arg>-libjars</arg>
            <arg>
joblibs/json-20090211.jar,joblibs/protostuff-api-1.0.1.jar,joblibs/avro-mapred-1.6.3.jar,joblibs/slf4j-api-1.6.4.jar,joblibs/jackson-mapper-lgpl-1.7.9.jar,joblibs/hive-service-0.10.0.jar,joblibs/lucene-highlighter-4.0.0-BETA.jar,joblibs/commons-digester-1.8.jar,joblibs/mockito-all-1.8.2.jar,joblibs/lucene-analyzers-common-4.0.0-BETA.jar,joblibs/velocity-1.7.jar,joblibs/commons-codec-1.6.jar,joblibs/servlet-api-2.5-20081211.jar,joblibs/lucene-queryparser-4.0.0-BETA.jar,joblibs/datanucleus-rdbms-2.0.3.jar,joblibs/stop.jar,joblibs/lucene-memory-4.0.0-BETA.jar,joblibs/datanucleus-core-2.0.3.jar,joblibs/wstx-asl-3.2.7.jar,joblibs/httpclient-4.1.3.jar,joblibs/commons-io-2.1.jar,joblibs/lucene-analyzers-morfologik-4.0.0-BETA.jar,joblibs/libfb303-0.9.0.jar,joblibs/lucene-misc-4.0.0-BETA.jar,joblibs/JavaEWAH-0.3.2.jar,joblibs/paranamer-2.3.jar,joblibs/protobuf-java-2.3.0.jar,joblibs/jetty-util-6.1.14.jar,joblibs/commons-compress-1.4.1.jar,joblibs/xercesImpl-2.9.1.jar,joblibs/stringtemplate-3.2.1.jar,joblibs/httpcore-4.1.4.jar,joblibs/lucene-analyzers-kuromoji-4.0.0-BETA.jar,joblibs/avro-ipc-1.6.3.jar,joblibs/commons-logging-api-1.0.4.jar,joblibs/xz-1.0.jar,joblibs/solr-core-4.0.0-BETA.jar,joblibs/libthrift-0.6.1.jar,joblibs/httpmime-4.1.3.jar,joblibs/commons-beanutils-1.7.0.jar,joblibs/hive-common-0.10.0.jar,joblibs/hcatalog-core-0.5.0-incubating.jar,joblibs/hive-cli-0.10.0.jar,joblibs/commons-beanutils-core-1.8.0.jar,joblibs/hive-builtins-0.10.0.jar,joblibs/pangool-examples-0.60.3.jar,joblibs/morfologik-stemming-1.5.3.jar,joblibs/commons-cli-1.2.jar,joblibs/jackson-mapper-asl-1.8.8.jar,joblibs/hive-metastore-0.10.0.jar,joblibs/commons-fileupload-1.2.1.jar,joblibs/protostuff-model-1.0.1.jar,joblibs/hive-shims-0.10.0.jar,joblibs/commons-logging-1.0.4.jar,joblibs/jline-0.9.94.jar,joblibs/jdo2-api-2.3-ec.jar,joblibs/protostuff-parser-1.0.1.jar,joblibs/jsr305-1.3.9.jar,joblibs/avro-1.6.3.jar,joblibs/lucene-queries-4.0.0-BETA.jar,joblibs/spatial4j-0.2.jar,joblibs/hive-pdk-0.10.0.jar,joblibs/lucene-spatial-4.0.0-BETA.jar,joblibs/commons-pool-1.5.4.jar,joblibs/morfologik-polish-1.5.3.jar,joblibs/hive-exec-0.10.0.jar,joblibs/javolution-5.5.1.jar,joblibs/hive-serde-0.10.0.jar,joblibs/jetty-6.1.14.jar,joblibs/slf4j-log4j12-1.5.8.jar,joblibs/antlr-runtime-3.0.1.jar,joblibs/lucene-analyzers-phonetic-4.0.0-BETA.jar,joblibs/jcsv-1.4.0.jar,joblibs/zookeeper-3.4.3.jar,joblibs/xml-apis-1.3.04.jar,joblibs/guava-11.0.2.jar,joblibs/asm-3.1.jar,joblibs/datanucleus-connectionpool-2.0.3.jar,joblibs/snappy-java-1.0.4.1.jar,joblibs/log4j-1.2.16.jar,joblibs/lucene-core-4.0.0-BETA.jar,joblibs/protostuff-core-1.0.1.jar,joblibs/jackson-core-lgpl-1.7.9.jar,joblibs/lucene-suggest-4.0.0-BETA.jar,joblibs/commons-io-1.3.2.jar,joblibs/commons-lang-2.5.jar,joblibs/antlr-2.7.7.jar,joblibs/opencsv-2.3.jar,joblibs/solr-solrj-4.0.0-BETA.jar,joblibs/commons-collections-3.2.jar,joblibs/jackson-core-asl-1.8.8.jar,joblibs/pangool-core-0.60.3.jar,joblibs/jackson-jaxrs-1.7.9.jar,joblibs/commons-configuration-1.6.jar,joblibs/derby-10.4.2.0.jar,joblibs/antlr-3.0.1.jar,joblibs/netty-3.2.7.Final.jar,joblibs/joda-time-2.0.jar,joblibs/morfologik-fsa-1.5.3.jar,joblibs/datanucleus-enhancer-2.0.3.jar,joblibs/servlet-api-2.5.jar,joblibs/commons-dbcp-1.4.jar,joblibs/lucene-grouping-4.0.0-BETA.jar,joblibs/protostuff-compiler-1.0.1.jar
            </arg>
            <arg>gameoflife-out</arg>
            <arg>2</arg>
            <arg>4</arg>
        </java>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
As you can see, we have to include the list of JARs the job depends on. You can obtain this list with the following command:

find joblibs|grep jar|sed -r 's/\.\///'|xargs|tr ' ' ','
Finally, you need a file job.properties with the following content:
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/oozie-app
Now you are ready to upload the oozie-app folder to HDFS and run the Oozie workflow:
hadoop fs -put oozie-app oozie-app
oozie job -oozie http://localhost:11000/oozie -config oozie-app/job.properties -run