Pangool User Guide

SOLR Integration

Pangool integrates smoothly with SOLR. This integration is made possible by a stateful OutputFormat, TupleSolrOutputFormat. Let's see how to use it.

Configuring a TupleSolrOutputFormat

First, if you are using Maven, you have to add SOLR to your project dependencies. You will probably need to exclude some third-party dependencies, as they can conflict with Hadoop:

<dependency>
	<groupId>org.apache.solr</groupId>
	<artifactId>solr-core</artifactId>
	<version>4.0.0-BETA</version>
	<exclusions>
		<exclusion>
			<artifactId>jcl-over-slf4j</artifactId>
			<groupId>org.slf4j</groupId>
		</exclusion>
		<exclusion>
			<groupId>org.apache.zookeeper</groupId>
			<artifactId>zookeeper</artifactId>
		</exclusion>
		<exclusion>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-jdk14</artifactId>
		</exclusion>
		<exclusion>
			<groupId>org.apache.commons</groupId>
			<artifactId>commons-io</artifactId>
		</exclusion>				
	</exclusions>	
</dependency>

The following code instantiates a TupleSolrOutputFormat with a given Hadoop Configuration and a File pointing to the "SOLR Home". "SOLR Home" must be a folder containing a "conf" sub-folder with, at least, the files "schema.xml" and "solrconfig.xml". These files will be used to create the SOLR index. For those who have used it, the idea is the same as that of SOLR-1301.

 Configuration conf = new Configuration();
 File solrHome = new File("my-solr-home");
 TupleSolrOutputFormat outputFormat = new TupleSolrOutputFormat(solrHome, conf);

TupleSolrOutputFormat can be used either as the main output of the Job or as a "named output":

 Configuration conf = new Configuration();
 Path jobOutput = new Path("job-output");
 job.addNamedOutput("namedOutput1", new TupleSolrOutputFormat(new File("solr-home1"), conf), ITuple.class, NullWritable.class);
 job.addNamedOutput("namedOutput2", new TupleSolrOutputFormat(new File("solr-home2"), conf), ITuple.class, NullWritable.class);
 job.setOutput(jobOutput, new TupleSolrOutputFormat(new File("solr-home3"), conf), ITuple.class, NullWritable.class);
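Tuples are then written to these named outputs from the reducer's Collector. The following is only a sketch: it assumes a TupleReducer whose Tuples already match the SOLR schema, and it assumes the Collector exposes named outputs through getNamedOutput(); the output names match the configuration above.

 // Sketch: a TupleReducer routing Tuples to the SOLR outputs configured above.
 // The exact Collector API for named outputs is an assumption.
 public static class IndexingReducer extends TupleReducer<ITuple, NullWritable> {

 	public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
 			throws IOException, InterruptedException, TupleMRException {

 		for(ITuple tuple : tuples) {
 			// Write to one of the SOLR named outputs ("solr-home1" index)
 			collector.getNamedOutput("namedOutput1").write(tuple, NullWritable.get());
 			// Write to the main output ("solr-home3" index)
 			collector.write(tuple, NullWritable.get());
 		}
 	}
 }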

TupleDocumentConverter

In order to index Tuples, they must be converted to SolrInputDocument objects. This conversion is done by a TupleDocumentConverter. Pangool comes with a default converter, DefaultTupleDocumentConverter, which will be fine for most cases: it maps Pangool primitive fields (INT, LONG, STRING, DOUBLE, FLOAT, BOOLEAN) to SOLR primitive fields. More advanced converters can be implemented and passed as a parameter to the OutputFormat:

 Configuration conf = new Configuration();
 File solrHome = new File("my-solr-home");
 TupleDocumentConverter converter = new MyCustomTupleDocumentConverter();
 TupleSolrOutputFormat outputFormat = new TupleSolrOutputFormat(solrHome, conf, converter);
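As an illustration, such a custom converter could look like the sketch below. The method signature of TupleDocumentConverter is assumed here (converting the Tuple key and its NullWritable value into a SolrInputDocument), and the field names "id" and "text" are hypothetical:

 // Hypothetical custom converter: builds a SolrInputDocument field by field.
 // The exact TupleDocumentConverter method signature is an assumption.
 public class MyCustomTupleDocumentConverter implements TupleDocumentConverter {

 	public SolrInputDocument convert(ITuple key, NullWritable value) throws IOException {
 		SolrInputDocument document = new SolrInputDocument();
 		// Copy the (hypothetical) "id" field as-is and lower-case the "text" field
 		document.addField("id", key.get("id").toString());
 		document.addField("text", key.get("text").toString().toLowerCase());
 		return document;
 	}
 }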

Advanced configuration

It is possible to configure the OutputFormat further with the following extra constructor parameters (default values in parentheses); see the sketch after the list:

  • outputZipFile (false): Whether to produce a ZIP file with the index.
  • batchSize (20): Number of documents that will go in each indexing batch.
  • threadCount (2): Number of threads in a pool that will be used for indexing.
  • queueSize (100): Maximum number of batches that can be queued for the batch indexing thread pool.
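For example, the extended constructor could be used along these lines. This is a sketch that assumes a constructor taking these parameters, in this order, after the converter:

 // Sketch of the extended constructor; parameter order is assumed to follow the list above.
 Configuration conf = new Configuration();
 File solrHome = new File("my-solr-home");
 TupleDocumentConverter converter = new DefaultTupleDocumentConverter();
 TupleSolrOutputFormat outputFormat = new TupleSolrOutputFormat(
 	solrHome, conf, converter,
 	true, // outputZipFile: also produce a ZIP with the index
 	100,  // batchSize: documents per indexing batch
 	4,    // threadCount: indexing threads in the pool
 	200   // queueSize: maximum batches queued for indexing
 );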

Use case

For an example use case, see MultiShakespeareIndexer.

Next: Tuple MapReduce API »