Working with partitions¶

Partitioning refers to the splitting of a dataset along meaningful dimensions. Each partition contains a subset of the dataset.

For a general introduction to partitioning, see DSS concepts

The two partitioning models¶

There are two major models for partitioning datasets: files-based partitioning and column-based partitioning.

This partitioning method is used for all datasets based on a filesystem hierarchy. This includes FS, HDFS, S3, RemoteFiles datasets.

In this method:

The partitioning is given by the organization of files in folders
The actual data in the files is NOT used to decide which records belong to which partition.

This partitioning method is used for datasets based on structured storage engines:

In this method, the partitioning is derived from information (generally a column) which is part of the data.

A very important point is that in this method, the schema of the dataset does contain the partitioning data.