Execution engines

Design of the preparation

The design of a data preparation is always done on an in-memory sample of the data. See Sampling for more information.
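To illustrate what designing on an in-memory sample means, here is a minimal sketch that loads only the first rows of a CSV stream. The `head_sample` helper and the sample size are hypothetical and are not part of DSS; the sampling methods DSS actually offers are described in Sampling.

```python
import csv
import io
from itertools import islice


def head_sample(lines, max_rows):
    """Return at most max_rows records from an iterable of CSV lines.

    Hypothetical helper: only the sampled rows are kept in memory,
    the rest of the input is never read.
    """
    reader = csv.DictReader(lines)
    return list(islice(reader, max_rows))


# Stand-in for a large dataset; only two rows are loaded.
data = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
sample = head_sample(data, 2)
```

Preparation steps are then designed and previewed against `sample`, while the full dataset is only processed when the recipe runs.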

Execution in analysis

In an analysis, execution on the whole dataset happens in two cases:

Exporting the prepared data
Running a machine learning model

In both cases, this uses a streaming engine: all data goes through the DSS server but does not need to be in memory.

Execution of the recipe

For execution of the recipe, DSS provides three execution engines:

Streaming

All data goes through the DSS server but does not need to be in memory.
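The idea of streaming can be sketched as follows: rows are read, transformed, and written one at a time, so memory use does not grow with the size of the dataset. This is only an illustration of the principle, not DSS's implementation; `stream_prepare` and `upper_name` are hypothetical names.

```python
import csv
import io


def stream_prepare(src, dst, transform):
    """Apply transform to each row of src and write it to dst.

    Only one row is held in memory at a time. Returns the row count.
    """
    reader = csv.DictReader(src)
    writer = None
    count = 0
    for row in reader:
        out = transform(row)
        if writer is None:
            # Create the writer lazily so the header matches the transformed row.
            writer = csv.DictWriter(
                dst, fieldnames=list(out.keys()), lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(out)
        count += 1
    return count


def upper_name(row):
    """Example preparation step: upper-case the 'name' column."""
    return {**row, "name": row["name"].upper()}


src = io.StringIO("id,name\n1,ada\n2,bob\n")
dst = io.StringIO()
n = stream_prepare(src, dst, upper_name)
```

In this sketch the in-memory buffers stand in for the input and output datasets; with real files or network streams, the same loop processes data of any size.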

Hadoop MapReduce

When both the input and output datasets of a data preparation recipe are supported HDFS datasets, the recipe can run fully on Hadoop, as a MapReduce job.

To enable this behavior, go to the Settings / Build tab of the data preparation recipe and check “Run on Hadoop”. You do not need to fill in the “Split size” parameter.

Spark

When Spark is installed (see: DSS and Spark), preparation recipe jobs can run on Spark.

We recommend that you use the Spark engine only on HDFS or S3 datasets.
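The rules on this page can be summarized in a short sketch. The helpers below are hypothetical and not a DSS API; dataset types are represented as plain strings, and the Spark recommendation is assumed to apply to both the input and the output dataset.

```python
def eligible_engines(input_type, output_type, spark_installed=False):
    """List the engines that could run a preparation recipe (hypothetical helper)."""
    # Streaming is always available: data goes through the DSS server.
    engines = ["streaming"]
    # MapReduce requires both input and output to be HDFS datasets.
    if input_type == "HDFS" and output_type == "HDFS":
        engines.append("hadoop_mapreduce")
    # Spark requires Spark to be installed.
    if spark_installed:
        engines.append("spark")
    return engines


def spark_recommended(input_type, output_type):
    """True when both datasets are HDFS or S3 (assumed reading of the recommendation)."""
    return {input_type, output_type} <= {"HDFS", "S3"}


engines = eligible_engines("HDFS", "HDFS", spark_installed=True)
```

For example, a recipe reading from and writing to SQL tables would only list `streaming`, or `streaming` and `spark` when Spark is installed, with Spark not recommended in that case.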