Working with partitions¶
Partitioning refers to the splitting of a dataset along meaningful dimensions. Each partition contains a subset of the dataset.
For a general introduction to partitioning, see DSS concepts
The two partitioning models¶
There are two major models for partitioning datasets: files-based partitioning and column-based partitioning.
Files-based partitioning¶
This partitioning method is used for all datasets based on a filesystem hierarchy. This includes FS, HDFS, S3, RemoteFiles datasets.
In this method:
- The partitioning is given by the organization of files in folders
- The actual data in the files is NOT used to decide which records belong to which partition.
For more information, see Partitioning files-based datasets
Column-based partitioning¶
This partitioning method is used for datasets based on structured storage engines:
- All SQL databases
- NoSQL databases: MongoDB and Cassandra
In this method, the partitioning is derived from information (generally a column) which is part of the data.
A very important point is that in this method, the schema of the dataset does contain the partitioning data.