A Pangool Tuple is an ordered list of Objects with an associated Schema. Schemas have an associated name and a list of Fields.
We can create a Schema in a strongly-typed fashion:

    List<Field> fields = new ArrayList<Field>();
    fields.add(Field.create("number1", Type.INT));
    fields.add(Field.create("string1", Type.STRING));
    fields.add(Field.create("string2", Type.STRING));
    Schema schema1 = new Schema("schema1", fields);
Alternatively, we can create it with the Fields.parse() method:

    Schema schema1 = new Schema("schema1", Fields.parse("number1:int, string1:string, string2:string"));
Pangool supports any data type which is "Hadoop serializable", including Writables, Thrift, ProtoStuff and Avro objects. It natively supports some primitive types as well.
The following table summarizes the available field types:
Field Type | Accepted Java types | Deserialized Java type | Serialization |
---|---|---|---|
Type.INT | Integer | Integer | 32-bit integer (variable length encoded) |
Type.LONG | Long | Long | 64-bit integer (variable length encoded) |
Type.FLOAT | Float | Float | 32-bit floating-point number |
Type.DOUBLE | Double | Double | 64-bit floating-point number |
Type.STRING | String, Utf8 | Utf8 | Size as a variable-length integer + UTF-8 bytes |
Type.BOOLEAN | Boolean | Boolean | 1-byte boolean type |
Type.ENUM | Enum | Enum | As Type.INT |
Type.BYTES | byte[], ByteBuffer | ByteBuffer | length encoded as Type.INT + bytes |
Type.OBJECT | Any serializable object | As defined in Hadoop's (or custom) serialization | length encoded as Type.INT + serialized bytes |
Variable-length integers and longs are encoded using fewer bytes for smaller numbers than for bigger ones. The smallest numbers fit in a single byte.
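To make the idea of variable-length encoding concrete, here is a minimal sketch of one common scheme (7 payload bits per byte, with the high bit flagging that more bytes follow). The class `VarIntSketch` and its methods are hypothetical names for illustration only, and the byte layout Pangool actually writes may differ from this one.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative variable-length integer codec; NOT Pangool's actual wire format. */
final class VarIntSketch {

    /** Encodes a non-negative int, 7 payload bits per byte; the high bit means "more bytes follow". */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation flag set
            value >>>= 7;
        }
        out.write(value); // final byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 127, 128, 35000, Integer.MAX_VALUE}) {
            byte[] encoded = encode(n);
            System.out.println(n + " -> " + encoded.length + " byte(s), decoded " + decode(encoded));
        }
    }
}
```

With this scheme, values up to 127 take one byte, while a fixed-width int always takes four. The trade-off is that the largest values need five bytes instead of four.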
For non-primitive types you can use Type.OBJECT and set arbitrary objects.
There are two ways of serializing OBJECT fields:

Using Hadoop's serialization, as registered in io.serialization:

    List<Field> fields = new ArrayList<Field>();
    fields.add(Field.createObject("obj", IntWritable.class));
    Schema schema1 = new Schema("schema1", fields);

Using a custom serialization set on the field itself, for example Avro:

    List<Field> fields = new ArrayList<Field>();
    Field field = Field.createObject("my_avro_field", Record.class);
    field.setSerialization(AvroFieldSerialization.class);
    field.addProps("avro.schema", myAvroSchema.toString());
    field.addProps("avro.reflection", false);
    fields.add(field); // the field must be added to the list before building the schema
    Schema schema1 = new Schema("schema1", fields);
To learn more about custom serialization, see the section on custom serialization.
Thrift / ProtoStuff
You can enable Pangool's native support for Thrift / ProtoStuff objects with:

    ThriftSerialization.enableThriftSerialization(conf);
    ProtoStuffSerialization.enableProtoStuffSerialization(conf);

Here, conf is the Hadoop job configuration.
Tuples implement the ITuple interface and can be instantiated via the Tuple class. You can set tuple values by field name or field index.

    ITuple tuple = new Tuple(schema1);
    tuple.set("number1", 35);
    tuple.set("string1", "foo");
The following methods are available in Pangool Tuples in order to avoid explicit casting in users' code:

    public Integer getInteger(int pos);
    public Integer getInteger(String field);
    public Long getLong(int pos);
    public Long getLong(String field);
    public Float getFloat(int pos);
    public Float getFloat(String field);
    public Double getDouble(int pos);
    public Double getDouble(String field);
    public Boolean getBoolean(int pos);
    public Boolean getBoolean(String field);
    public String getString(int pos);
    public String getString(String field);
Tuples are just an ordered list of objects. You can access the elements in the tuple by the index of the field as it was defined in the schema:

    tuple.set(1, "foo");
    tuple.get(1); // Will return "foo"

That is, if the first field of your schema is "name", then you can get or set name by accessing position 0.
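The relationship between field names and positions can be pictured with a small stand-in. The class `MiniTuple` below is a hypothetical illustration, not Pangool's Tuple implementation: it assumes values live in an array and that each field name maps to its position in declaration order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a tuple; NOT Pangool's Tuple class. */
final class MiniTuple {
    private final Map<String, Integer> positions = new HashMap<>();
    private final Object[] values;

    MiniTuple(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            positions.put(fieldNames.get(i), i); // position = declaration order
        }
        values = new Object[fieldNames.size()];
    }

    void set(int pos, Object value) { values[pos] = value; }

    Object get(int pos) { return values[pos]; }

    void set(String field, Object value) { values[positionOf(field)] = value; }

    Object get(String field) { return values[positionOf(field)]; }

    private int positionOf(String field) {
        Integer pos = positions.get(field);
        if (pos == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return pos;
    }

    public static void main(String[] args) {
        MiniTuple tuple = new MiniTuple(Arrays.asList("number1", "string1", "string2"));
        tuple.set(1, "foo");                       // write by position
        System.out.println(tuple.get("string1"));  // read the same slot by name
    }
}
```

In this sketch, access by name is just a lookup followed by access by position, so both styles always see the same value.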
Users should always read strings with the getString() method, as strings may sometimes come back as Utf8 objects.
An alternative way of setting strings is the following:

    tuple.set("string1", new Utf8("foo"));
The following code is correct:

    Utf8 foo = new Utf8("foo");
    tuple.set("string1", foo);
    assertTrue(foo == tuple.get("string1"));
    tuple.set("string1", "foo");
    assertTrue("foo".equals(tuple.get("string1")));
In a TupleMapper or TupleReducer, strings will always be wrapped by a Utf8 object.
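The reason for preferring getString() over a plain get() can be shown without any Pangool classes. In the sketch below, `StringBuilder` stands in for Utf8 (both are character sequences that are not `String`), and `asString` is a hypothetical helper that only approximates the normalization getString() is assumed to perform.

```java
/** Illustrates why comparing a raw field value against a String can fail. */
final class StringFieldSketch {

    /** Hypothetical helper: normalizes any textual value to a String, keeping null as null. */
    static String asString(Object value) {
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        // StringBuilder stands in for Utf8 here: a CharSequence that is not a String.
        Object fromFramework = new StringBuilder("foo");

        System.out.println("foo".equals(fromFramework));           // false: different class
        System.out.println("foo".equals(asString(fromFramework))); // true after normalizing
    }
}
```

Code that compares the raw value against a String literal passes in tests that set plain strings and then fails once values arrive wrapped, which is why normalizing on every read is the safer habit.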
Since 0.70 it is possible to easily mutate schemas with the Mutator class. As an example, the following code creates a mutated schema which contains only some fields of the original schema:

    Schema schema = new Schema("a_schema", Fields.parse("a:int, b:int, c:double, e:string"));
    Schema mutated = Mutator.subSetOf("mutated", schema, "b", "e");
Pangool supports null values in tuples since version 0.60.0. Any field can hold null values, but it must be declared in the schema as nullable. The following example declares two fields, nullableField and nullableObj, as nullable:

    List<Field> fields = new ArrayList<Field>();
    fields.add(Field.create("nullableField", Type.INT, true));
    fields.add(Field.createObject("nullableObj", IntWritable.class, true));
    Schema schema2 = new Schema("schema2", fields);
It is enough to pass true as the last parameter of the Field.createX() method calls to declare a field as nullable. Nullable fields can also be declared with the parse method by appending the ? character to the type declaration:

    Schema schema2 = new Schema("schema2", Fields.parse("nullableField:int?, nullableObj:org.apache.hadoop.io.IntWritable?"));
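To illustrate how the `name:type?` notation reads, here is a minimal sketch of a parser for such declarations. `FieldSpecSketch` and `FieldSpec` are hypothetical names, and this is not how Fields.parse() is implemented; it only shows one way the trailing `?` could be split off from the type.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for "name:type?" declarations; NOT Pangool's Fields.parse(). */
final class FieldSpecSketch {

    /** Hypothetical value type holding one parsed declaration. */
    static final class FieldSpec {
        final String name;
        final String type;
        final boolean nullable;

        FieldSpec(String name, String type, boolean nullable) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
        }
    }

    static List<FieldSpec> parse(String declaration) {
        List<FieldSpec> specs = new ArrayList<>();
        for (String part : declaration.split(",")) {
            String[] nameAndType = part.trim().split(":", 2);
            if (nameAndType.length != 2) {
                throw new IllegalArgumentException("expected name:type but got '" + part.trim() + "'");
            }
            String type = nameAndType[1].trim();
            boolean nullable = type.endsWith("?"); // a trailing '?' marks the field as nullable
            if (nullable) {
                type = type.substring(0, type.length() - 1).trim();
            }
            specs.add(new FieldSpec(nameAndType[0].trim(), type, nullable));
        }
        return specs;
    }

    public static void main(String[] args) {
        for (FieldSpec spec : parse("nullableField:int?, nullableObj:org.apache.hadoop.io.IntWritable?, plain:string")) {
            System.out.println(spec.name + " type=" + spec.type + " nullable=" + spec.nullable);
        }
    }
}
```

Keeping nullability as a flag separate from the type name mirrors the strongly-typed form, where it is passed as a separate boolean argument.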