DSS 3.1 Release notes¶
Migration notes¶
Migration paths to DSS 3.1¶
- From DSS 3.0: Automatic migration is supported, with the following restrictions and warnings
- From DSS 2.X: In addition to the following restrictions and warnings, you need to pay attention to the restrictions and warnings applying to your previous versions: see 2.0 -> 2.1, 2.1 -> 2.2, 2.2 -> 2.3, and 2.3 -> 3.0
- Migration from DSS 1.X is not supported. You must first upgrade to 2.0. See DSS 2.0 Release notes
Limitations and warnings¶
- The usual limitations on retraining models and regenerating API node packages apply (see Upgrading a DSS instance for more information). Note that DSS 3.1 includes a vast overhaul of the machine learning stack, so machine learning models trained with previous versions of DSS will not work in DSS 3.1
How to upgrade¶
It is strongly recommended that you perform a full backup of your Data Science Studio data directory prior to starting the upgrade procedure.
For automatic upgrade information, see Upgrading a DSS instance
External libraries upgrades¶
Several external libraries bundled with DSS have been upgraded to new major revisions. Some of these upgrades introduce backwards-incompatible changes, so you might need to update your code.
Notable upgrades:
- ggplot 0.6 -> 0.9
- pandas 0.17 -> 0.18
- numpy 1.9 -> 1.10
- requests 2.9 -> 2.10
Version 3.1.5 - November 21st 2016¶
Data preparation¶
- Fix selection of partial column content
- Fix removal of a value in a “Delete matching rows” step
- Improve explanations for “Filter on invalid meaning” processor
- Fix error when removing a column which was used for coloring cells
- Fix unsaved changes to design sample in preparation recipes
- Add a reference of all processors to the documentation
Flow & Recipes¶
- Fix timezone issues on group and join recipes on Filesystem datasets
- Fix disabling of pre-filter in visual recipes
Charts¶
- Fix flickering and reset of zoom in map charts
- Fix disappearing smallest bubble in scatter plot
- Display an error message when trying to plot 100% stacked columns with negative values
Version 3.1.4 - October 3rd 2016¶
Hadoop & Spark¶
- Add support for HDP 2.5
- Add support for EMR 4.7 and 4.8
- Spark writing: Faster write for Parquet by using native Spark code
- Spark writing: don’t fail on invalid dates
- Pig: Fix PigStorage (for CSV files) on Pig 0.14+
- Fix possible hang when aborting Hive+Tez queries
- Improve logging inside the hproxy process
Datasets¶
- Fix Redshift support (bug introduced in 3.1.3)
- Add ability to load AWS credentials from environment
- Fix “COUNT” metric on Oracle
- Make fetch size configurable for all SQL datasets
- Several fixes for Teradata support
Machine learning¶
- Fix MROC AUC computation on Jupyter export of multiclass model
- H2O: bump version and fix out-of-the-box support on CDH’s Spark
Misc¶
- Fix dataset export from dashboard
- Add support for Markdown on custom “Homepage” messages
- SQL notebook: show aborted status immediately when aborting a query
- Add API to read metrics on managed folders (see the sketch after this list)
- Create the underlying folder of a managed folder upon addition
- Fix scrolling on API keys page
- Add ability to use case-insensitive logins on LDAP
- LDAP users will now be imported as readers by default
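Regarding the managed folder metrics API above: a minimal sketch using the public Python client. The URL, API key, project key and folder id are placeholders, and get_last_metric_values() is assumed from current client conventions rather than confirmed for this release.

```python
import dataikuapi

# Placeholders: use your instance URL, an API key and your project key
client = dataikuapi.DSSClient("http://localhost:11200", "some-api-key")
project = client.get_project("MYPROJECT")

# "folder_id" is the managed folder's identifier as shown in its URL.
# get_last_metric_values() is assumed from current client conventions.
folder = project.get_managed_folder("folder_id")
metrics = folder.get_last_metric_values()
print(metrics.get_all_ids())  # ids of the metrics that have computed values
```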
Version 3.1.3 - September 19th 2016¶
DSS 3.1.3 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.
Hadoop & Spark¶
- Add support for MapR 5.2
- Add partial support for Hive 2.1
- Add ability to pass arbitrary arguments to Spark, useful for --packages
Data preparation¶
- Fix random failure occurring in the “Holidays computer” processor
- Fix output data of the JSONPath extractor processor
- Fix date diff (reversed order)
Visual recipes¶
- Fix date filtering
Data visualization¶
- Add ability to use shapes in scatter plot
- Minor improvements in tooltip handling
Machine Learning¶
- Fix “Impute with Median” in MLlib on CDH 5.7/5.8
- Fix possible failure in clustering results
- Fix error in clustering recipe when filtering columns
- Add configurability of max features in random forest algorithms
Misc¶
- Metrics & Checks: Fix multiple SQL probes on the same datasets
- Performance improvements for custom exporters
- Performance improvements for Data Catalog
- Performance improvements on home page
- Small UI fixes in themes
- Small UI improvements here and there
- Update PostgreSQL driver (fixes result sets with more than 2B results)
Version 3.1.2 - August 22nd 2016¶
DSS 3.1.2 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.
ML¶
- Fixed “red/green” indicator for MAPE
- Improved visualization of decision trees
- Warn when trying to use numerical features for Naive Bayes
- Make GBT regression exportable to notebook
- Fixed clustering scoring recipes migrated from 3.0
- Add Impute with median on MLlib
- Don’t fail when rejected features are not present in the scoring recipe input
Datasets¶
- Configurable batch size for writing to Elasticsearch
- Fixed editing of columns on editable datasets
Automation¶
- Fix attachment of a dataset in the “Send message” step
- Fix intermittent failures with “Make API node package” step
- Add ability to directly use get_custom_variables in a custom check
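A minimal sketch of a custom Python check that reads a project variable through get_custom_variables. The process() signature follows the custom-check convention; the variable name is purely illustrative, and the outcome tuple format should be verified against your DSS version.

```python
import dataiku

def process(last_values, dataset, partition_id):
    # get_custom_variables() returns the project's variables as a dict
    variables = dataiku.get_custom_variables()

    # "deployment_env" is an illustrative variable; define it in project settings
    env = variables.get("deployment_env", "dev")
    if env != "prod":
        return 'WARNING', 'check ran outside prod (env=%s)' % env
    return 'OK', 'environment is prod'
```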
Installation & Admin¶
- Fixed R integration, following changes in IRKernel
- Fixed “radial” layout on home page
- Optional reporting of internal metrics to Graphite
- Fixed “Cluster tasks” and “Per-connection data” views on Hadoop
Misc¶
- Major performance improvements in various areas, especially with large numbers of projects, datasets, or users
- Improved copy/paste of code from diff viewer
- Tighten permissions on managed folders
- Fixes for custom Scala recipe in plugin development environment
- Fixed get_config call on Python API
- Don’t fail on homepage with broken Jupyter notebooks
- Fixed small UI issue on custom aggregations in grouping recipe
- Fixed extension of export filenames
- Fixed small UI issues with Chrome 52
- Don’t allow the custom formula processor’s edit form to overflow
Version 3.1.1 - August 10th 2016¶
DSS 3.1.1 is a bugfix release. For more information about 3.1.X, see the release notes for 3.1.0.
ML¶
- Fixed various errors in models status
- Fixed deployment of Vertica ML models when the target is not in the dataset to score
- Improved the autocomputed schema as output of scoring recipes
- Fixed bug when a custom evaluation function is partially defined
- Improved resiliency and error messages for custom evaluation functions
Version 3.1.0 - July 27th 2016¶
DSS 3.1.0 is a major upgrade to DSS with exciting new features.
For a summary of the major new features, see: https://www.dataiku.com/learn/whatsnew
New features¶
Scala recipe and notebook¶
You can now interact with Spark using Scala, the most native language for Spark processing.
This release brings to DSS:
- Spark-Scala recipes
- Spark-Scala notebooks
- Custom recipes (plugins) written in Scala
For more information, please see Spark-Scala recipes
H2O integration (through Sparkling-Water)¶
H2O is a distributed machine-learning library, with a wide range of algorithms and methods.
DSS now includes full support for H2O (in its “Sparkling Water” variant) in its visual machine learning interface.
For more information, please see H2O (Sparkling Water) engine
Advanced users can also leverage H2O through all Spark-based recipes and notebooks of DSS.
New DSS home page & workflow¶
The DSS home page now features:
- The ability to set a customizable “status” to projects, in order to materialize your workflow (draft, production, archived, …) in DSS
- The ability to filter projects by tags, status, owner, …
- The ability to sort projects
- A new “list” view with advanced details (contents of the project, activity monitoring, …)
- A new “flow” view to study the dependencies between projects
- Useful “Tips and Tricks”
Prebuilt notebooks¶
You can now use prebuilt templates for notebooks when creating a notebook from a dataset. This allows for reusable interactive analyses.
DSS 3.1 comes with 4 prebuilt notebooks for analyzing datasets:
- PCA
- Correlations between variables
- Time series visualization and analytics
- Time series forecasting
New data sources¶
DSS can now connect to the following SQL databases:
- IBM Netezza
- SAP HANA
- Google BigQuery (read-only)
- Microsoft Azure DWH
Machine learning visualizations¶
DSS now includes the following new visualizations in Machine Learning:
- Visualization of individual decision trees for Decision Tree, Random Forest and Gradient Boosting models
- Partial dependence plots for Gradient Boosting
More custom algorithms support¶
Custom algorithms are now supported in:
- Clustering (Python)
- Spark MLlib Prediction (Scala)
- Spark MLlib Clustering (Scala)
Custom Formats and Export¶
A brand new export mechanism has been introduced. It provides easier configuration and expands what can be supported.
It is now possible to write custom format extractors and exporters, either in Python or Java. See our plugins library for examples.
This notably provides much improved support for export to Tableau (TDE files or Tableau Server): open any data from DSS in Tableau in just two clicks!
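As an illustration of the Python side, here is a minimal exporter sketch following the documented exporter lifecycle (open, write_row, close). The class name and the fixed output path are illustrative; a real plugin would take its destination from the export configuration.

```python
from dataiku.exporter import Exporter

class TabSeparatedExporter(Exporter):
    """Illustrative exporter writing rows as tab-separated values."""

    def __init__(self, config, plugin_config):
        self.config = config                # per-export settings from the UI
        self.plugin_config = plugin_config  # plugin-level settings
        self.f = None

    def open(self, schema):
        # schema["columns"] describes the columns of the exported data
        self.f = open("/tmp/export.tsv", "w")  # fixed path, for illustration only
        self.f.write("\t".join(c["name"] for c in schema["columns"]) + "\n")

    def write_row(self, row):
        # row is a tuple of values, in schema order
        self.f.write("\t".join(str(v) for v in row) + "\n")

    def close(self):
        self.f.close()
```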
Other notable enhancements¶
Data preparation¶
- New processor: date filter
- New processor: compute distance between geopoints
Machine learning¶
- Handling of data types has been extensively overhauled, resulting in better reliability in machine learning
- Additional algorithms have been added in Spark MLlib
- DSS now supports clustering in the Spark MLlib implementation
- You can now export variable importances and coefficients directly from the machine learning UI
- When doing dummy-encoding, DSS can now remove the last dummy to avoid collinearity (especially useful for regression models); by default, DSS automatically picks the proper behavior for each algorithm (see the sketch after this list)
- When doing dummy-encoding, DSS has more options for handling features with large cardinalities (clip above a number of dummies, clip after a cumulative distribution, clip below a threshold in number of records)
- Much faster scoring in MLlib multiclass
- In scoring recipe, it is now possible to select the input columns to retain in output
- In scoring recipe, it is now possible to “unplug” the output schema from the input. This is especially useful in corner cases where the data type is incorrect
- Added support for in-database machine learning on Vertica, through the Vertica 7 Advanced Analytics package
- Added links to the original analysis from training recipes & saved models
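To illustrate why dropping one dummy matters (this shows the concept, not DSS’s internal code): with a full dummy-encoding, the dummy columns always sum to 1, so together with a regression intercept they are perfectly collinear. A small pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full dummy-encoding: blue + green + red == 1 on every row, which is
# collinear with a regression intercept
full = pd.get_dummies(df["color"])

# Dropping one dummy removes the redundancy; the dropped level becomes
# the all-zeros baseline (pandas drops the first level, DSS drops the
# last; the principle is identical)
reduced = pd.get_dummies(df["color"], drop_first=True)
print(full.columns.tolist())     # ['blue', 'green', 'red']
print(reduced.columns.tolist())  # ['green', 'red']
```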
Visual recipes¶
- The join recipe now has support for more join types: Inner, Left, Right, Outer, Cross, Natural and Advanced (left with optional dedup)
- The join recipe now has support for various kinds of inequality joins
Datasets & formats¶
- Very large Excel files can now be opened with small memory overhead
- New option for CSV and SQL: normalize doubles (i.e. always add .0 to doubles). This makes operations between doubles and integers generally more reliable
- Add support for newer AWS S3 regions (like eu-central-1)
Automation (scenarios, bundles, metrics, checks)¶
- Counting records on small datasets will not use Hive anymore
- Custom checks (in a plugin) can now be used
Hadoop & Spark¶
- It is now possible to import Hive tables as HDFS datasets from the DSS UI
- You can now validate SparkSQL recipes without having to run them
Installation and setup¶
- The most standard Java options can now be set directly from the install.ini file (a sketch follows this list). See /installing/java_env
- DSS can now use Conda for managing its internal Python environment instead of virtualenv/pip
- Enhanced the content of DSS diagnosis reports
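As a hedged example of the install.ini item above, setting the backend heap size might look as follows; the [javaopts] section and key name follow the documented conventions, but verify the exact keys against /installing/java_env for your version.

```ini
# DATADIR/install.ini
[javaopts]
backend.xmx = 4g
```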
Misc¶
- You can now expose a folder or a file in a folder on the Dashboard
- Error handling has been improved in numerous places. DSS will now more prominently display the actual errors, especially when using code recipes
- DSS now includes a public API for interacting with recipes (see the sketch after this list)
- New interaction features in plugins
- The schema of a dataset can be exported (to any supported formatter) from the settings screen
- Access to datasets from Python and R is much faster, especially for small datasets (see the sketch after this list)
- SQL connectors can now use custom JDBC URLs for advanced customization
- Custom variables are now available in Webapps
- New default pictures for users
- Lots of performance improvements, both in the backend and frontend
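Two sketches of the points above: reading a dataset from Python inside DSS is the code path that got faster, and recipe interaction goes through the public client. The URL, API key, project key and dataset name are placeholders, and list_recipes() is assumed from current client conventions.

```python
import dataiku
import dataikuapi

# Inside DSS (recipe or notebook): read a dataset as a pandas DataFrame.
# "my_dataset" is a placeholder name.
df = dataiku.Dataset("my_dataset").get_dataframe()

# From outside DSS: the public Python client can interact with recipes.
client = dataikuapi.DSSClient("http://localhost:11200", "some-api-key")
project = client.get_project("MYPROJECT")
for recipe in project.list_recipes():  # assumed entry point
    print(recipe["name"])
```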
Notable bug fixes¶
- Machine Learning: Imputation with Unicode values has been fixed
- Visual preparation: much faster drag & drop with Firefox
- Fixed a number of JavaScript errors
- Visual recipes running on Hive or Impala will properly take into account the case-insensitivity of these DBs and not generate case-mismatched Parquet files anymore
- Fixed possible job failures in Kerberos-secured clusters
- Add multi-schema support to S3 -> Redshift syncing
- Properly clear the target dataset before doing a redispatch-sync
- Switched to CartoDB tiles for maps