Execution engines¶
Design of the preparation¶
A data preparation is always designed on an in-memory sample of the data. See Sampling for more information.
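To illustrate the idea of designing on a sample, here is a minimal Python sketch. It is not DSS code: the `head_sample` and `prepare` functions, the column names, and the "first N rows" sampling method are all assumptions made for illustration. The point is only that the preparation step is tried on a small list held in memory, while the rest of the data is never read.

```python
import csv
import io
from itertools import islice


def head_sample(rows, max_rows):
    """Materialize at most max_rows rows from an iterator into a list (hypothetical helper)."""
    return list(islice(rows, max_rows))


def prepare(row):
    """Hypothetical preparation step: trim and uppercase the country code."""
    return {**row, "country": row["country"].strip().upper()}


# Stand-in for a large dataset; only the first rows are ever loaded.
raw = io.StringIO("id,country\n1, fr\n2,us \n3,de\n4, jp\n")

sample = head_sample(csv.DictReader(raw), max_rows=2)
preview = [prepare(row) for row in sample]
```

Here `preview` contains only the two sampled rows, so the effect of a step can be checked quickly regardless of the size of the full dataset.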
Execution in analysis¶
When in an analysis, execution on the whole dataset happens when:
- Exporting the prepared data
- Running a machine learning model
In both cases, this uses a streaming engine: all data goes through the DSS server but does not need to be in memory.
Execution of the recipe¶
For execution of the recipe, DSS provides three execution engines:
Streaming¶
All data goes through the DSS server but does not need to be in memory.
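The streaming principle can be sketched in a few lines of Python. This is not the DSS implementation: `stream_prepare`, `prepare`, and the CSV layout are hypothetical names chosen for illustration. The sketch shows that each row is read, transformed, and written in turn, so only one row needs to be held in memory at a time even though all rows pass through the process.

```python
import csv
import io


def prepare(row):
    """Hypothetical preparation step: trim and uppercase the country code."""
    return {**row, "country": row["country"].strip().upper()}


def stream_prepare(src, dst):
    """Read rows from src one at a time, transform each, and write it to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:  # the reader yields rows lazily; nothing is accumulated
        writer.writerow(prepare(row))
        count += 1
    return count


# In-memory files stand in for the input and output datasets.
source = io.StringIO("id,country\n1, fr\n2,us \n3,de\n")
sink = io.StringIO()
n_rows = stream_prepare(source, sink)
```

With real files or network streams in place of the `io.StringIO` objects, memory use stays roughly constant as the input grows, but throughput is bounded by the single process that all rows pass through.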
Hadoop MapReduce¶
When both the input and output datasets of a data preparation recipe are supported HDFS datasets, the recipe can run fully on Hadoop, as a MapReduce job.
To enable this behavior, go to the Settings / Build tab of the data preparation recipe and check “Run on Hadoop”. You do not need to fill in the “Split size” parameter.
Spark¶
When Spark is installed (see DSS and Spark), preparation recipe jobs can run on Spark.
We recommend that you use this engine only on HDFS or S3 datasets.