Pangool User Guide

Text Input/Output

Pangool comes with a convenient InputFormat and OutputFormat for text files. They extend Hadoop's text handling to read and write TSV / CSV files, with configurable separator, quote and escape characters and support for "null strings".

Using TupleTextInputFormat

TupleTextInputFormat's constructor has several available options:

schema: The target schema to use. A Tuple with this schema will be instantiated and fed into the Mapper.
hasHeader: If true, the reader will skip the first line of every file.
strictQuotes: If true, the reader will provide null values for any value which is not quoted.
separator: The character used to separate fields in the input files.
quoteCharacter: The character used to quote fields in the input files. Use NO_QUOTE_CHARACTER if the file has no quotes.
escapeCharacter: The character used to escape characters inside quoted fields in the input files. Use NO_ESCAPE_CHARACTER if the file has no escape character.
fieldSelector: An optional 0-based selector that lets you read only some columns of the input files. The selector must match the target schema: no matter how many columns the original file has, if the selector picks 5 fields then the target schema must have 5 fields. Use null if no explicit field selection is needed.
nullString: A string that is read as a null value (e.g. \N in MySQL dumps). Use null if no nullString is needed.

The following line instantiates an InputFormat which reads a simple TSV with neither quotes nor escape characters:

 InputFormat inputFormat = new TupleTextInputFormat(schema, false, false, '\t', 
  TupleTextInputFormat.NO_QUOTE_CHARACTER, TupleTextInputFormat.NO_ESCAPE_CHARACTER, null, null);

The following line instantiates an InputFormat which reads a header-less CSV which uses " as quote character and \ as escape character:

 InputFormat inputFormat = new TupleTextInputFormat(schema, false, false, ',', '"', '\\', null, null);

The following lines instantiate an InputFormat which reads strictly quoted CSV files with a header and null values identified as \N, and selects only columns 1, 4 and 6 (remember that indexes are 0-based):

 FieldSelector selector = new FieldSelector(1, 4, 6);
 InputFormat inputFormat = new TupleTextInputFormat(schema, true, true, ',', '"', '\\', selector, "\\N");
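To make the effect of the selector and the null string more concrete, here is a minimal plain-Java sketch of what happens to a single row that has already been split and unquoted. It does not use Pangool and is not its implementation; the class FieldSelectionSketch and the method select are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;

// Illustration only: a hypothetical helper, not part of Pangool.
final class FieldSelectionSketch {

    /**
     * Keeps only the columns at the given 0-based indexes and turns any
     * value equal to nullString into a Java null.
     */
    static String[] select(String[] columns, int[] selected, String nullString) {
        String[] out = new String[selected.length];
        for (int i = 0; i < selected.length; i++) {
            String value = columns[selected[i]];
            boolean isNull = nullString != null && nullString.equals(value);
            out[i] = isNull ? null : value;
        }
        return out;
    }

    public static void main(String[] args) {
        // A row with seven columns; column 4 holds the null string \N.
        String[] row = { "a", "b", "c", "d", "\\N", "f", "g" };
        String[] picked = select(row, new int[] { 1, 4, 6 }, "\\N");
        System.out.println(Arrays.toString(picked)); // prints [b, null, g]
    }
}
```

The result has three values, which is why the target schema in the example above must have exactly three fields.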

Using TupleTextOutputFormat

TupleTextOutputFormat's constructor has several available options, similar to those in TupleTextInputFormat:

schema: The schema of the Tuples to be written.
addHeader: If true, the writer will add a header line to every produced file.
separatorCharacter: The character to use for separating fields.
quoteCharacter: The character used to quote fields. If provided, output will be strictly quoted. Use NO_QUOTE_CHARACTER to not use quotes.
escapeCharacter: The character used to escape characters inside quoted fields. Used if quoteCharacter is provided. Use NO_ESCAPE_CHARACTER to not use it.
nullString: An optional string which can be used to serialize null values (e.g. \N as in MySQL dumps). Use null if no nullString is needed. If provided, this string won't be quoted even if the writer uses quotes.

The following line instantiates a TupleTextOutputFormat which serializes tuples as a simple TSV:

 OutputFormat outputFormat = new TupleTextOutputFormat(schema, false, '\t', 
  TupleTextOutputFormat.NO_QUOTE_CHARACTER, TupleTextOutputFormat.NO_ESCAPE_CHARACTER);

The following line instantiates a TupleTextOutputFormat which serializes tuples as strictly quoted CSV with null strings as \N :

 OutputFormat outputFormat = new TupleTextOutputFormat(schema, true, ',', '"', '\\', "\\N");
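As a rough illustration of what "strictly quoted, with an unquoted null string" means for one output line, the following plain-Java sketch serializes an array of values. It is not Pangool's writer: QuotedLineSketch and toLine are hypothetical names, and the escaping rule (prefixing quote and escape characters with the escape character) and the empty output for a null value when no nullString is given are assumptions made for this example.

```java
// Illustration only: a hypothetical helper, not part of Pangool.
final class QuotedLineSketch {

    /** Serializes one row, quoting every non-null value. */
    static String toLine(String[] values, char separator, char quote,
                         char escape, String nullString) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            String value = values[i];
            if (value == null) {
                // The null string is written as-is, without quotes.
                // Assumption: nothing is written when nullString is null.
                if (nullString != null) {
                    sb.append(nullString);
                }
                continue;
            }
            sb.append(quote);
            for (char c : value.toCharArray()) {
                // Assumed rule: escape embedded quote and escape characters.
                if (c == quote || c == escape) {
                    sb.append(escape);
                }
                sb.append(c);
            }
            sb.append(quote);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] row = { "Jane", null, "say \"hi\"" };
        // prints "Jane",\N,"say \"hi\""
        System.out.println(toLine(row, ',', '"', '\\', "\\N"));
    }
}
```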

Advanced: Fixed-width fields reader

Some input text files have fixed-width fields and cannot be parsed using delimiters. Pangool supports such files via a special constructor in TupleTextInputFormat. The following code instantiates an InputFormat that reads a fixed-width file with two fields, no header, and "-" as the null string:

 int[] fieldsPos = new int[] { 0, 3, 4, 5 };
 InputFormat inputFormat = new TupleTextInputFormat(schema, fieldsPos, false, "-");

The fieldsPos array is read as pairs of consecutive numbers; each pair gives the absolute character positions where a field starts and ends. In this example, there are two fields: the first one is read from chars 0 to 3 and the second one from chars 4 to 5.
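The pairing of positions can be sketched in plain Java as follows. This is not Pangool's reader: FixedWidthSketch and slice are hypothetical names, and treating the end position as inclusive is an assumption based on the "chars 0 to 3" and "chars 4 to 5" wording above.

```java
import java.util.Arrays;

// Illustration only: a hypothetical helper, not part of Pangool.
final class FixedWidthSketch {

    /** Cuts a line into fields using (start, end) position pairs. */
    static String[] slice(String line, int[] fieldsPos) {
        if (fieldsPos.length % 2 != 0) {
            throw new IllegalArgumentException("positions must come in (start, end) pairs");
        }
        String[] fields = new String[fieldsPos.length / 2];
        for (int i = 0; i < fields.length; i++) {
            int start = fieldsPos[2 * i];
            int end = fieldsPos[2 * i + 1]; // assumed to be inclusive
            fields[i] = line.substring(start, end + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        int[] fieldsPos = new int[] { 0, 3, 4, 5 };
        // prints [ABCD, 12]
        System.out.println(Arrays.toString(slice("ABCD12", fieldsPos)));
    }
}
```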
