Pangool comes with a convenient InputFormat and OutputFormat for text files. They extend Hadoop's functionality to read and write TSV / CSV files, with configurable parsing options (separator, quote and escape characters) and "null strings", that is, string values that are interpreted as null.
TupleTextInputFormat's constructor has several available options:
The following line instantiates an InputFormat which reads a simple TSV with neither quotes nor escape characters:
InputFormat inputFormat = new TupleTextInputFormat(schema, false, false, '\t', TupleTextInputFormat.NO_QUOTE_CHARACTER, TupleTextInputFormat.NO_ESCAPE_CHARACTER, null, null);
The following line instantiates an InputFormat which reads a header-less CSV that uses " as the quote character and \ as the escape character:
InputFormat inputFormat = new TupleTextInputFormat(schema, false, false, ',', '"', '\\', null, null);
The following code instantiates an InputFormat which reads strictly quoted CSV files with a header, in which null values may be identified as \N, and selects only columns 1, 4 and 6 (remember that indexes are 0-based):
FieldSelector selector = new FieldSelector(1, 4, 6);
InputFormat inputFormat = new TupleTextInputFormat(schema, true, true, ',', '"', '\\', selector, "\\N");
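To make the 0-based column selection and the null string more concrete, here is a minimal plain-Java sketch that does not use Pangool at all. The class ColumnSelectionSketch and its select method are hypothetical names introduced only for illustration. The sketch assumes the row has already been split into fields and that a field equal to the null string becomes null; how Pangool implements this internally may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Pangool code.
class ColumnSelectionSketch {

  // Keeps only the fields at the given 0-based indexes and maps any field
  // equal to nullString to a Java null.
  static List<String> select(String[] fields, int[] indexes, String nullString) {
    List<String> selected = new ArrayList<>();
    for (int index : indexes) {
      String value = fields[index];
      selected.add(value.equals(nullString) ? null : value);
    }
    return selected;
  }

  public static void main(String[] args) {
    String[] row = { "a", "b", "c", "d", "\\N", "f", "g" };
    // Columns 1, 4 and 6 are the 2nd, 5th and 7th fields of the row.
    System.out.println(select(row, new int[] { 1, 4, 6 }, "\\N")); // prints [b, null, g]
  }
}
```

Note that the Java literal "\\N" is the two-character string \N, which is why the constructor call above passes it with a doubled backslash.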
TupleTextOutputFormat's constructor has several available options, similar to those in TupleTextInputFormat:
The following line instantiates a TupleTextOutputFormat which serializes tuples as a simple TSV:
OutputFormat outputFormat = new TupleTextOutputFormat(schema, false, '\t', TupleTextOutputFormat.NO_QUOTE_CHARACTER, TupleTextOutputFormat.NO_ESCAPE_CHARACTER);
The following line instantiates a TupleTextOutputFormat which serializes tuples as strictly quoted CSV, using \N as the null string:
OutputFormat outputFormat = new TupleTextOutputFormat(schema, true, ',', '"', '\\', "\\N");
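As a rough illustration of what strictly quoted output with a null string can look like, the following plain-Java sketch serializes a single row by hand. It is not Pangool code: the class StrictQuoteSketch and its serialize method are hypothetical, and the sketch assumes that the null string is written without quotes and that quote and escape characters inside a field are prefixed with the escape character. The exact output of TupleTextOutputFormat may differ in these details.

```java
// Hypothetical illustration only; this is not Pangool code.
class StrictQuoteSketch {

  // Serializes one row: every non-null field is quoted, quote and escape
  // characters inside a field are escaped, and nulls become nullString.
  // ASSUMPTION: the null string is written without quotes.
  static String serialize(String[] fields, char separator, char quote,
                          char escape, String nullString) {
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) {
        line.append(separator);
      }
      String field = fields[i];
      if (field == null) {
        line.append(nullString);
        continue;
      }
      line.append(quote);
      for (char c : field.toCharArray()) {
        if (c == quote || c == escape) {
          line.append(escape);
        }
        line.append(c);
      }
      line.append(quote);
    }
    return line.toString();
  }

  public static void main(String[] args) {
    String[] row = { "id1", null, "say \"hi\"" };
    // prints "id1",\N,"say \"hi\""
    System.out.println(serialize(row, ',', '"', '\\', "\\N"));
  }
}
```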
Some input text files have fixed-width fields and cannot be parsed using delimiters. Pangool also supports such files via a special constructor in TupleTextInputFormat. The following code instantiates an InputFormat that reads a fixed-width file with two fields:
int[] fieldsPos = new int[] { 0, 3, 4, 5 };
InputFormat inputFormat = new TupleTextInputFormat(schema, fieldsPos, false, "-");
The fieldsPos array is interpreted as pairs of consecutive numbers, each pair giving the absolute character positions where a field starts and ends. In this example there are two fields: the first one is read from chars 0 to 3 and the second one from chars 4 to 5.
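The pair-wise interpretation of the positions array can be sketched in plain Java as follows. This is not Pangool's implementation: the class FixedWidthSketch and its split method are hypothetical, and the sketch assumes that both the start and the end position of each pair are inclusive, which matches the "chars 0 to 3" and "chars 4 to 5" reading of the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Pangool code.
class FixedWidthSketch {

  // Splits a line using consecutive (start, end) pairs of absolute character
  // positions. ASSUMPTION: both positions are inclusive.
  static List<String> split(String line, int[] fieldsPos) {
    if (fieldsPos.length % 2 != 0) {
      throw new IllegalArgumentException("fieldsPos must contain (start, end) pairs");
    }
    List<String> fields = new ArrayList<>();
    for (int i = 0; i < fieldsPos.length; i += 2) {
      int start = fieldsPos[i];
      int end = fieldsPos[i + 1];
      // substring's end index is exclusive, hence end + 1.
      fields.add(line.substring(start, end + 1));
    }
    return fields;
  }

  public static void main(String[] args) {
    int[] fieldsPos = { 0, 3, 4, 5 };
    System.out.println(split("abcdef", fieldsPos)); // prints [abcd, ef]
  }
}
```

With four positions the array describes exactly two fields, so its length should always be twice the number of fields in the schema.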