Parquet

Parquet is an efficient file format from the Hadoop ecosystem. Its main characteristics are:

  • Column-oriented, even for nested complex types
  • Block-based compression
  • Ability to “push down” filtering predicates to avoid unnecessary reads

Using Parquet or another efficient file format is strongly recommended when working with Hadoop data (rather than CSV data). Speedups can reach 100x on some queries.
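
As an illustration of column pruning and predicate pushdown, here is a minimal Python sketch using pyarrow (outside of DSS); the file name and column names are hypothetical:

    import pyarrow.parquet as pq

    # Only the two requested columns are read from disk (columnar layout),
    # and row groups whose statistics cannot match the filter are skipped
    # entirely (predicate pushdown).
    table = pq.read_table(
        "events.parquet",                  # hypothetical file
        columns=["event_date", "amount"],  # column pruning
        filters=[("amount", ">", 100)],    # predicate pushed down
    )
    print(table.num_rows)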

Applicability

  • Parquet datasets can only be stored on HDFS. If the data is on S3 or Azure Blob Storage, access needs to be set up through Hadoop, using HDFS connections
  • Parquet datasets can be used as inputs and outputs of all recipes
  • Parquet datasets can be used in the Hive and Impala notebooks

Limitations and issues

Case-sensitivity

Due to differences in how Hive and Parquet treat identifiers, it is strongly recommended that you only use lowercase identifiers when dealing with Parquet files.

Date logical type

The “date” logical type annotates the “int32” physical type in Parquet, and is used by Hive 2 to store “DATE” columns in Parquet (prior to Hive 2, it was not possible to store Hive DATE columns in Parquet).

DSS does not currently read this “date” logical type. A DATE column from Hive 2 (or from a manually created Parquet file) will appear as an integer in DSS, representing the number of days since Epoch. You can manually convert this integer using a Preparation recipe:

  • Apply a Formula step: column * 24 * 3600, which converts days since Epoch to seconds since Epoch
  • Then use a “Convert UNIX timestamp” processor to turn the result into a proper date
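
For reference, the arithmetic behind these two steps can be sketched in plain Python (days since Epoch × 86400 gives a UNIX timestamp in seconds, which then maps to a calendar date); the sample value is only illustrative:

    from datetime import date, timedelta

    def days_since_epoch_to_date(days: int) -> date:
        """Convert a Parquet 'date' value (days since 1970-01-01) to a date."""
        return date(1970, 1, 1) + timedelta(days=days)

    days = 18262                      # illustrative value as read by DSS
    seconds = days * 24 * 3600        # what the Formula step computes
    print(days_since_epoch_to_date(days))  # 2020-01-01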

Misc

  • Due to various differences in how Pig and Hive map their data types to Parquet, you must select a writing Flavor when DSS writes a Parquet dataset. Reading a Parquet dataset written by Pig with Hive (and vice versa) leads to various issues, most of them related to complex types.