com.splout.db.hadoop
Class TupleSampler
java.lang.Object
com.splout.db.hadoop.TupleSampler
- All Implemented Interfaces:
- java.io.Serializable
public class TupleSampler
- extends java.lang.Object
- implements java.io.Serializable
Samples a list of TableInput files that produce a given table Schema. Two sampling methods are supported:
- DEFAULT: Inspired by Hadoop's TeraInputFormat. No Hadoop Job is needed; consecutive records are read from each InputSplit.
- RESERVOIR: Uses a map-only Pangool Job to perform reservoir sampling over the dataset.
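To make the RESERVOIR idea concrete, the following is a minimal, single-process sketch of classic reservoir sampling (Algorithm R). It is not the Pangool Job that this class runs; the class name ReservoirSketch and its sample method are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of com.splout.db.hadoop.
final class ReservoirSketch {

    // Returns a uniform random sample of at most sampleSize records
    // from a stream whose length is not known in advance.
    static <T> List<T> sample(Iterator<T> records, int sampleSize, Random random) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        List<T> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill phase: the first sampleSize records are always kept.
                reservoir.add(record);
            } else {
                // Replace phase: pick a slot in [0, seen); the record survives
                // only if the slot falls inside the reservoir, which happens
                // with probability sampleSize / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        // Prints three of the eight keys; which ones depends on the seed.
        System.out.println(sample(keys.iterator(), 3, new Random(7L)));
    }
}
```

Because only the reservoir is held in memory, each record is examined once and the input size never needs to be known up front, which is what makes the technique a natural fit for a map-only pass.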
TablespaceGenerator can use the resulting sample to determine a PartitionMap based on the approximate distribution of the keys.
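As an illustration of how a key sample can approximate the key distribution, the sketch below derives split points by taking evenly spaced quantiles of the sorted sample. This is an assumption about the general technique, not a description of how TablespaceGenerator or PartitionMap is implemented; PartitionSketch and splitPoints are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of com.splout.db.hadoop.
final class PartitionSketch {

    // Returns up to nPartitions - 1 ascending split keys chosen at evenly
    // spaced positions of the sorted sample. Duplicate split keys are dropped,
    // so heavily skewed samples may yield fewer partitions than requested.
    static <K extends Comparable<K>> List<K> splitPoints(List<K> sampledKeys, int nPartitions) {
        if (nPartitions < 1) {
            throw new IllegalArgumentException("nPartitions must be >= 1");
        }
        List<K> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<K> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < nPartitions; i++) {
            // Position of the i-th quantile; always a valid index because i < nPartitions.
            int index = (int) ((long) i * sorted.size() / nPartitions);
            K candidate = sorted.get(index);
            if (splits.isEmpty() || candidate.compareTo(splits.get(splits.size() - 1)) > 0) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("m", "c", "x", "a", "q", "f", "t", "k");
        System.out.println(splitPoints(sample, 4));
    }
}
```

With a sample that reflects the real key distribution, each range between consecutive split keys should receive a roughly equal share of the records.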
- See Also:
- Serialized Form
Method Summary
void sample(java.util.List<TableInput> inputFiles,
com.datasalt.pangool.io.Schema tableSchema,
org.apache.hadoop.conf.Configuration hadoopConf,
long sampleSize,
org.apache.hadoop.fs.Path outFile)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TupleSampler
public TupleSampler(TupleSampler.SamplingType samplingType,
TupleSampler.SamplingOptions options)
sample
public void sample(java.util.List<TableInput> inputFiles,
com.datasalt.pangool.io.Schema tableSchema,
org.apache.hadoop.conf.Configuration hadoopConf,
long sampleSize,
org.apache.hadoop.fs.Path outFile)
throws TupleSampler.TupleSamplerException
- Throws:
- TupleSampler.TupleSamplerException
Copyright © 2012-2013 Datasalt Systems S.L. All Rights Reserved.