Pangool User Guide

Tuples & Schemas & Types

A Pangool Tuple is an ordered list of Objects with an associated Schema. Schemas have an associated name and a list of Fields.

We can create an Schema in a strongly-typed fashion:

 List fields = new ArrayList();
 fields.add(Field.create("number1", Type.INT));
 fields.add(Field.create("string1", Type.STRING));
 fields.add(Field.create("string2", Type.STRING));
 Schema schema1 = new Schema("schema1", fields);

Or we can also create it leveraging the Fields.parse() method:

 Schema schema1 = new Schema("schema1", Fields.parse("number1:int, string1:string, string2:string"));

Types

Pangool supports any data type which is "Hadoop serializable", including Writables, Thrift, , ProtoStuff and Avro objects. It natively supports some primitive types as well.

The following table summarizes the available field types:

Field Type	Accepted java types	Deserialized java type	Serialization
Type.INT	Integer	Integer	32-bit integer (variable length encoded)
Type.LONG	Long	Long	64-bit integer (variable length encoded)
Type.FLOAT	Float	Float	32-bit floating number
Type.DOUBLE	Double	Double	64-bit floating number
Type.STRING	String, Utf8	Utf8	Integer encoded in variable length (size) + uft8 bytes
Type.BOOLEAN	Boolean	Boolean	1-byte boolean type
Type.ENUM	Enum	Enum	As Type.INT
Type.BYTES	byte[],ByteBuffer	ByteBuffer	length encoded as Type.INT + bytes
Type.OBJECT	Any serializable object	As defined in Hadoop's (or custom) serialization	length encoded as Type.INT + serialized bytes

Variable length integers and longs are encoded using less bytes for smaller numbers than for bigger ones. The smallest numbers can fit in just one byte.

Objects

For non-primitive types you can use Type.OBJECT and set arbitrary objects. There are two ways of serializing OBJECT fields:

Using Hadoop's serialization: The associated object's class will be used by Hadoop's serialization factory to choose the right serializers and deserializers defined in io.serialization:

  List fields = new ArrayList();
  fields.add(Field.createObject("obj", IntWritable.class));
  Schema schema1 = new Schema("schema1", fields);

Setting custom serialization: Pangool will use the serialization factory specified in the field's properties:

  List fields = new ArrayList();
  Field field = Field.createObject("my_avro_field", Record.class);
  field.setSerialization(AvroFieldSerialization.class);
  field.addProps("avro.schema", myAvroSchema.toString());
  field.addProps("avro.reflection", false);
  Schema schema1 = new Schema("schema1", fields);

To know more about custom serialization take a look at this section.

Thrift / ProtoStuff

You can enable Pangool's native support for Thrift / ProtoStuff objects with:

 ThriftSerialization.enableThriftSerialization(conf);
 ProtoStuffSerialization.enableProtoStuffSerialization(conf);

Where conf is the Hadoop job configuration.

Tuples

Tuples meet the ITuple interface and can be instantiated via the Tuple class. You can set tuple values by field name or field index.

 ITuple tuple = new Tuple(schema1);
 tuple.set("number1", 35);
 tuple.set("string1", "foo");

Special getters

The following methods are available in Pangool Tuples in order to avoid explicit casting in users' code:

 public Integer getInteger(int pos);
 public Integer getInteger(String field);
	
 public Long getLong(int pos);
 public Long getLong(String field);
	
 public Float getFloat(int pos);
 public Float getFloat(String field);
	
 public Double getDouble(int pos);
 public Double getDouble(String field);
	
 public Boolean getBoolean(int pos);
 public Boolean getBoolean(String field);
	
 public String getString(int pos);
 public String getString(String field);

Access by index

Tuples are just an ordered list of objects. You can access the elements in the tuple by the index of the field as it was defined in the schema:

 tuple.set(1, "foo");
 tuple.get(1); // Will return "foo"

That is, if the first field of you schema is "name", then you can get/set name by accessing the position 0.

Strings

Users should always use getString() method, as strings may come sometimes as Utf8 objects.

An alternative way of setting strings is the following:

 tuple.set("string1", new Utf8("foo"));

The following code is right:

 Utf8 foo = new Utf8("foo");
 tuple.set("string1", foo);
 assertTrue(foo == tuple.get("string1");
	
 tuple.set("string1", "foo");
 assertTrue("foo".equals(tuple.get("string1"));

Important: When Tuples come as an input of TupleMapper or TupleReducer, strings will always be wrapped by an Utf8 object.

Mutating schemas

Since 0.70 it is possible to easily mutate schemas with the Mutator class. As an example, the following code creates a mutated schema which contains only some fields of the original schema:

 Schema schema = new Schema("a_schema", Fields.parse("a:int, b:int, c:double, e:string"));
 Schema mutated = Mutator.subSetOf("mutated", schema, "b", "e");

Nulls support

Pangool supports null values in tuples since version 0.60.0. Any field can support null values, but it must be declared in the schema as nullable. The following example declares two fields, nullableField and nullableObj as nullable:

 List fields = new ArrayList();
 fields.add(Field.create("nullableField", Type.INT, true));
 fields.add(Field.createObject("nullableObj", IntWritable.class, true));
 Schema schema2 = new Schema("schema2", fields);

It is enough to add a true value as parameter at the end of the Field.createX() method calls to declare a field as nullable. Also, nullable fields can be declared using the parse method by adding the ? character to the type declaration:

 Schema schema2 = new Schema("schema2", Fields.parse("nullableField:int?, nullableObj:org.apache.hadoop.io.IntWritable?"));

Advanced: Nulls support may have some effect on serialization length of Tuples. If none of the schema fields are nullable then there is no serialization overhead. But if at least one field is nullable then Pangool adds a bit field to the serialized form. The bit field length varies depending on the number of nullable fields in the schema. 1 byte bit field is enough for schemas with less than 8 nullable fields, 2 bytes for schemas between 7 and 14 nullable fields, and so on. When using more than one intermediate schema, the number of bytes used depends on the number of intermediate schemas with non-common nullable fields.

Next: Creating Pangool Jobs »