com.splout.db.hadoop
Class TablespaceGenerator

java.lang.Object
  extended by com.splout.db.hadoop.TablespaceGenerator
All Implemented Interfaces:
java.io.Serializable

public class TablespaceGenerator
extends java.lang.Object
implements java.io.Serializable

A process that generates the SQL data stores needed to deploy a tablespace in Splout, given a file-set table specification as input.

The input to this process will be:

The output of the process is a Splout deployable path with a PartitionMap. The output is laid out as follows: outputPath + / + OUT_PARTITION_MAP for the partition map, outputPath + / + OUT_SAMPLED_INPUT for the list of sampled keys, and outputPath + / + OUT_STORE for the folder containing the generated SQL store.
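As a minimal sketch of the layout described above, the snippet below joins an output path with the three names. The constant values are placeholders invented for illustration (the real values are listed under Constant Field Values), and the class name OutputLayoutSketch and the helper child are hypothetical, not part of Splout.

```java
// Hypothetical illustration of the output layout; not part of the Splout API.
final class OutputLayoutSketch {

    // Placeholder values (assumptions). The real constants are documented
    // under "Constant Field Values" for TablespaceGenerator.
    static final String OUT_PARTITION_MAP = "PLACEHOLDER_PARTITION_MAP";
    static final String OUT_SAMPLED_INPUT = "PLACEHOLDER_SAMPLED_INPUT";
    static final String OUT_STORE = "PLACEHOLDER_STORE";

    // Builds outputPath + "/" + name, as described in the class comment.
    static String child(String outputPath, String name) {
        return outputPath + "/" + name;
    }

    public static void main(String[] args) {
        String outputPath = "/tmp/tablespace-out"; // example path, an assumption
        System.out.println(child(outputPath, OUT_PARTITION_MAP));
        System.out.println(child(outputPath, OUT_SAMPLED_INPUT));
        System.out.println(child(outputPath, OUT_STORE));
    }
}
```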

To create the store, we first sample the input dataset with TupleSampler and then run a Hadoop job that distributes the data according to the resulting partition map. The Hadoop job uses TupleSQLite4JavaOutputFormat.
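The sample-then-distribute idea can be illustrated with a generic range-partitioning sketch: sort a sample of keys, pick evenly spaced split points, and route each key to the range it falls in. This is a common technique shown under stated assumptions, not the actual logic of TupleSampler or PartitionMap; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Generic sketch of sample-based range partitioning (not Splout's implementation).
final class SampledRangePartitionSketch {

    // Picks nPartitions - 1 split keys at evenly spaced positions of the sorted sample.
    static List<String> splitPoints(List<String> sampledKeys, int nPartitions) {
        if (sampledKeys.isEmpty() || nPartitions < 1) {
            throw new IllegalArgumentException("need a non-empty sample and nPartitions >= 1");
        }
        List<String> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < nPartitions; i++) {
            splits.add(sorted.get(i * sorted.size() / nPartitions));
        }
        return splits;
    }

    // Partition i holds keys k with splits[i-1] <= k < splits[i].
    // Assumes the split keys are sorted and distinct.
    static int partitionFor(String key, List<String> splits) {
        int pos = Collections.binarySearch(splits, key);
        // Exact match: the key starts the next range. Otherwise use the insertion point.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("h", "a", "g", "b", "f", "c", "e", "d");
        List<String> splits = splitPoints(sample, 4);
        for (String key : Arrays.asList("a", "d", "g", "z")) {
            System.out.println(key + " -> partition " + partitionFor(key, splits));
        }
    }
}
```

A larger sample generally yields split points that balance the partitions better, at the cost of a longer sampling phase.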

See Also:
Serialized Form

Nested Class Summary
static class TablespaceGenerator.TablespaceGeneratorException
           
 
Field Summary
static java.lang.String OUT_INIT_STATEMENTS
           
static java.lang.String OUT_PARTITION_MAP
           
static java.lang.String OUT_SAMPLED_INPUT
           
static java.lang.String OUT_STORE
           
protected  PartitionMap partitionMap
           
protected  TablespaceSpec tablespace
           
 
Constructor Summary
TablespaceGenerator(TablespaceSpec tablespace, org.apache.hadoop.fs.Path outputPath, java.lang.Class callingClass)
           
 
Method Summary
protected  com.datasalt.pangool.tuplemr.TupleMRBuilder createMRBuilder(int nPartitions, org.apache.hadoop.conf.Configuration conf)
          Create TupleMRBuilder for launching generation Job.
protected  void executeViewGeneration(com.datasalt.pangool.tuplemr.TupleMRBuilder builder)
           
 void generateView(org.apache.hadoop.conf.Configuration conf, TupleSampler.SamplingType samplingType, TupleSampler.SamplingOptions samplingOptions)
          This is the public method which has to be called when using this class as an API.
 int getBatchSize()
           
protected static java.lang.String getPartitionByKey(com.datasalt.pangool.io.ITuple tuple, TableSpec tableSpec, JavascriptEngine jsEngine)
          Returns the partition key either by using partition-by-fields or partition-by-javascript as configured in the Table Spec.
 PartitionMap getPartitionMap()
          Returns the generated PartitionMap.
 int getRecordsToSample()
           
protected  void prepareOutput(org.apache.hadoop.conf.Configuration conf)
           
protected  PartitionMap sample(int nPartitions, org.apache.hadoop.conf.Configuration conf, TupleSampler.SamplingType samplingType, TupleSampler.SamplingOptions samplingOptions)
          Samples the input, if needed.
 void setBatchSize(int batchSize)
           
 void setRecordsToSample(int recordsToSample)
           
protected  void writeOutputMetadata(org.apache.hadoop.conf.Configuration conf)
          Write the partition map and other metadata to the output folder.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tablespace

protected final transient TablespaceSpec tablespace

partitionMap

protected PartitionMap partitionMap

OUT_SAMPLED_INPUT

public static final java.lang.String OUT_SAMPLED_INPUT
See Also:
Constant Field Values

OUT_PARTITION_MAP

public static final java.lang.String OUT_PARTITION_MAP
See Also:
Constant Field Values

OUT_INIT_STATEMENTS

public static final java.lang.String OUT_INIT_STATEMENTS
See Also:
Constant Field Values

OUT_STORE

public static final java.lang.String OUT_STORE
See Also:
Constant Field Values
Constructor Detail

TablespaceGenerator

public TablespaceGenerator(TablespaceSpec tablespace,
                           org.apache.hadoop.fs.Path outputPath,
                           java.lang.Class callingClass)
Method Detail

generateView

public void generateView(org.apache.hadoop.conf.Configuration conf,
                         TupleSampler.SamplingType samplingType,
                         TupleSampler.SamplingOptions samplingOptions)
                  throws java.lang.Exception
This is the public method which has to be called when using this class as an API. The business logic has been split into several protected methods to make it easier to understand and to allow subclasses to extend its functionality.
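The split into protected methods follows the template-method pattern, which the self-contained sketch below illustrates. GeneratorSketch is a hypothetical stand-in, not the real class. Its step names mirror the protected methods listed on this page, but the order in which they are called is an assumption, since this page does not document it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the real TablespaceGenerator.
class GeneratorSketch {

    private final List<String> steps = new ArrayList<>();

    // Public entry point: runs the protected steps in a fixed order.
    // The order used here is an assumption made for this sketch.
    public void generateView() {
        prepareOutput();
        sample();
        createMRBuilder();
        executeViewGeneration();
        writeOutputMetadata();
    }

    protected void prepareOutput()         { record("prepareOutput"); }
    protected void sample()                { record("sample"); }
    protected void createMRBuilder()       { record("createMRBuilder"); }
    protected void executeViewGeneration() { record("executeViewGeneration"); }
    protected void writeOutputMetadata()   { record("writeOutputMetadata"); }

    protected void record(String step) { steps.add(step); }

    public List<String> steps() { return steps; }

    // A subclass overrides a single step and inherits the rest unchanged.
    static class CustomSamplingSketch extends GeneratorSketch {
        @Override
        protected void sample() { record("sample (overridden)"); }
    }

    public static void main(String[] args) {
        GeneratorSketch generator = new CustomSamplingSketch();
        generator.generateView();
        System.out.println(generator.steps());
    }
}
```

Because each step is a separate protected method, a subclass can replace one phase without touching the others.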

Throws:
java.lang.Exception

prepareOutput

protected void prepareOutput(org.apache.hadoop.conf.Configuration conf)
                      throws java.io.IOException
Throws:
java.io.IOException

writeOutputMetadata

protected void writeOutputMetadata(org.apache.hadoop.conf.Configuration conf)
                            throws java.io.IOException,
                                   JSONSerDe.JSONSerDeException
Write the partition map and other metadata to the output folder. They will be needed for deploying the dataset to Splout.

Throws:
java.io.IOException
JSONSerDe.JSONSerDeException

getPartitionByKey

protected static java.lang.String getPartitionByKey(com.datasalt.pangool.io.ITuple tuple,
                                                    TableSpec tableSpec,
                                                    JavascriptEngine jsEngine)
                                             throws java.lang.Throwable
Returns the partition key either by using partition-by-fields or partition-by-javascript as configured in the Table Spec.

Throws:
java.lang.Throwable

sample

protected PartitionMap sample(int nPartitions,
                              org.apache.hadoop.conf.Configuration conf,
                              TupleSampler.SamplingType samplingType,
                              TupleSampler.SamplingOptions samplingOptions)
                       throws TupleSampler.TupleSamplerException,
                              java.io.IOException
Samples the input, if needed.

Throws:
TupleSampler.TupleSamplerException
java.io.IOException

createMRBuilder

protected com.datasalt.pangool.tuplemr.TupleMRBuilder createMRBuilder(int nPartitions,
                                                                      org.apache.hadoop.conf.Configuration conf)
                                                               throws com.datasalt.pangool.tuplemr.TupleMRException,
                                                                      TupleSQLite4JavaOutputFormat.TupleSQLiteOutputFormatException
Create TupleMRBuilder for launching generation Job.

Throws:
com.datasalt.pangool.tuplemr.TupleMRException
TupleSQLite4JavaOutputFormat.TupleSQLiteOutputFormatException

executeViewGeneration

protected void executeViewGeneration(com.datasalt.pangool.tuplemr.TupleMRBuilder builder)
                              throws java.io.IOException,
                                     java.lang.InterruptedException,
                                     java.lang.ClassNotFoundException,
                                     TablespaceGenerator.TablespaceGeneratorException,
                                     com.datasalt.pangool.tuplemr.TupleMRException
Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException
TablespaceGenerator.TablespaceGeneratorException
com.datasalt.pangool.tuplemr.TupleMRException

getPartitionMap

public PartitionMap getPartitionMap()
Returns the generated PartitionMap. It is also written to HDFS. This is mainly used for testing.


getRecordsToSample

public int getRecordsToSample()

setRecordsToSample

public void setRecordsToSample(int recordsToSample)

getBatchSize

public int getBatchSize()

setBatchSize

public void setBatchSize(int batchSize)


Copyright © 2012-2013 Datasalt Systems S.L. All Rights Reserved.