The Python environment
Adding Python packages to the default environment
The default Data Science Studio installation builds a Python virtual environment (virtualenv) which contains all packages required for Data Science Studio operation.
If you need to install additional third-party Python packages (to make them available to notebooks and recipes), you must use the command DATA_DIR/bin/pip, where DATA_DIR is the Studio data directory:
$ DATA_DIR/bin/pip list
$ DATA_DIR/bin/pip install PKG
As usual with Python package installation on Linux, you may need to install additional system dependencies if the target Python packages include native code. In particular, you may need the system development tools (“build-essential” on Debian/Ubuntu, “@Development tools” on RedHat/CentOS) and the Python interpreter header files (“python-dev” on Debian/Ubuntu, “python27-devel” on RedHat/CentOS 6.x, “python-devel” on RedHat/CentOS 7.x).
Warning
Using the system’s pip command will not work. Data Science Studio’s Python environment is fully isolated.
In addition to the above, you can add locally-managed Python code and resources in the directory DATA_DIR/lib/python. This directory is created but left empty by the Data Science Studio installer, and is included in the Python search path for both notebooks and recipes. You can use it to deploy additional Python modules that your code uses but that are not managed by pip.
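To confirm that this directory is picked up, you can inspect the module search path from a notebook or from DATA_DIR/bin/python. The following is a minimal sketch, not part of DSS: the helper name is_on_search_path is hypothetical, and the directory to check is passed on the command line rather than discovered automatically.

```python
import os
import sys


def is_on_search_path(directory, search_path=None):
    """Return True if `directory` is an entry of the module search path.

    Hypothetical helper for illustration; it defaults to the sys.path of
    the interpreter running it.
    """
    if search_path is None:
        search_path = sys.path
    target = os.path.realpath(directory)
    # An empty sys.path entry stands for the current directory.
    return any(os.path.realpath(entry or os.curdir) == target
               for entry in search_path)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: DATA_DIR/bin/python check_path.py DATA_DIR/lib/python
    lib_dir = sys.argv[1]
    print("%s on sys.path: %s" % (lib_dir, is_on_search_path(lib_dir)))
```

Paths are compared after resolving symbolic links, so a data directory reached through a symlink should still be recognized.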
Note
The additional Python packages installed by DATA_DIR/bin/pip or added to DATA_DIR/lib/python are preserved by DSS upgrades.
Warning
Using this mechanism to upgrade or locally reinstall one of the standard Python packages shipped with DSS is not supported, and is likely to break DSS code in subtle ways, because backwards compatibility between package versions is more often than not incomplete. This is especially the case with the heavyweight packages of the Scientific Python suite (numpy, scipy, scikit-learn, pandas).
The default Python environment setup
Data Science Studio requires a Python 2.7 interpreter. The standard DSS installation checks that the distribution default packages for Python 2.7 are present and, if necessary, pulls them in during the dependency installation phase.
Note
On CentOS and RedHat 6.x, where the system’s version of Python is 2.6, Python 2.7 is pulled from the additional repository IUS (http://iuscommunity.org/pages/Repos.html).
The installation script locates the Python interpreter to use by looking up python2.7 in the standard PATH. It then proceeds to build a Python virtual environment on top of this interpreter, containing the standard Python packages shipped with Data Science Studio.
Data Science Studio uses this virtual environment to run all Python code, including IPython notebooks and Python dataset manipulation recipes.
The DATA_DIR/bin/pip command can be used to list or otherwise manage the contents of this virtual environment, as described above.
For testing purposes, the Python virtual environment used by DSS can be launched with DATA_DIR/bin/python.
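Once inside that interpreter, a few lines of Python are enough to see which installation is actually running. This is only a sketch: describe_interpreter is a hypothetical helper, and the virtualenv detection relies on the sys.real_prefix and sys.base_prefix attributes set by the virtualenv and venv tools respectively.

```python
import sys


def describe_interpreter():
    """Summarize which Python installation is running this code."""
    # virtualenv sets sys.real_prefix; the venv module sets sys.base_prefix.
    base_prefix = getattr(sys, "real_prefix",
                          getattr(sys, "base_prefix", sys.prefix))
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "prefix": sys.prefix,
        "in_virtualenv": base_prefix != sys.prefix,
    }


for key, value in sorted(describe_interpreter().items()):
    print("%-14s %s" % (key, value))
```

When started through DATA_DIR/bin/python on a default installation, one would expect the reported prefix to point inside the DSS data directory and in_virtualenv to be True.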
Note
If several Python 2.7 installations are available on your server, you can control which one DSS uses by adjusting the PATH environment variable of the DSS Unix user account so that the desired interpreter is the one found by the command python2.7. You should NOT use the environment variable DKUPYTHONBIN for this, as doing so would switch to a different, advanced installation mode, described below.
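Because the lookup follows the order of the PATH entries, it can help to see which python2.7 wins for the DSS user account. The sketch below walks PATH the way a shell lookup would; find_on_path is a hypothetical helper written for this illustration, not a DSS tool.

```python
import os


def find_on_path(command, path=None):
    """Return the first executable file named `command` found in `path`.

    `path` is a PATH-style string; it defaults to the PATH of the current
    process. Returns None when nothing matches.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory.
        candidate = os.path.join(directory or os.curdir, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print("python2.7 resolves to: %s" % find_on_path("python2.7"))
```

Run it as the DSS Unix user, with the same PATH that the installer will see, so that the answer reflects what the installation script would find.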
Warning
The native libraries of the standard Python packages shipped with DSS are built using UCS-4 Unicode characters. Make sure the default Python interpreter used by DSS has been built with --enable-unicode=ucs4. This is the default on most recent Linux distributions, but it is not the default when building Python interpreters directly from source.
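The Unicode width of a given interpreter can be checked from Python itself through sys.maxunicode. The following sketch (the function name unicode_build is hypothetical) is meant to be run with the python2.7 interpreter that DSS will use:

```python
import sys


def unicode_build(maxunicode=None):
    """Classify a Python build as 'ucs4' (wide) or 'ucs2' (narrow)."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UCS-2) builds of Python 2.7 report 0xFFFF,
    # wide (UCS-4) builds report 0x10FFFF.
    return "ucs4" if maxunicode > 0xFFFF else "ucs2"


print("This interpreter is a %s build" % unicode_build())
```

A result of ucs2 would indicate an interpreter that does not match the native libraries shipped with DSS.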
Rebuilding the Python environment
It is possible to rebuild the Python virtual environment, if necessary. This is the case if you moved or renamed Data Science Studio’s data directory, as Python virtual environments embed their full directory name. This may also be the case if you want to reset the virtualenv to a pristine state following installation or uninstallation of additional packages.
The Python virtualenv is automatically created by the installer when it is not present. Reinitializing it thus consists of removing the virtualenv and reinstalling DSS, keeping track of any local packages which you want to reinstall afterwards (DATADIR below denotes the Studio data directory):
# Stop Data Science Studio
DATADIR/bin/dss stop
# Save the list of locally-installed packages
DATADIR/bin/pip freeze -l >dss-local-packages.txt
# Remove the virtualenv, keeping backup
mv DATADIR/pyenv DATADIR/pyenv.backup
# Reinstall Data Science Studio (upgrade mode)
dataiku-dss-VERSION/installer.sh -d DATADIR -u
# Review and possibly edit the list of locally-installed packages
vi dss-local-packages.txt
# Reinstall local packages
DATADIR/bin/pip install -r dss-local-packages.txt
# Start Data Science Studio
DATADIR/bin/dss start
# When everything is considered stable, remove the backup
rm -rf DATADIR/pyenv.backup
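The review step can be partly automated. Since reinstalling the standard packages shipped with DSS is not supported, one option is to filter such entries out of the saved list before reinstalling. The sketch below is an assumption-laden illustration: split_requirements and normalize are hypothetical helpers, the exclusion list is yours to supply, and the sample lines are made up rather than taken from a real installation.

```python
def normalize(name):
    """Normalize a project name for comparison (case, '_' and '.' vs '-')."""
    return name.strip().lower().replace("_", "-").replace(".", "-")


def split_requirements(lines, excluded):
    """Split `pip freeze` lines into (kept, dropped).

    Lines of the form NAME==VERSION whose NAME is in `excluded` are dropped.
    Anything else (blank lines, comments, editable installs) is kept as-is
    so that it can be reviewed by hand.
    """
    excluded = set(normalize(name) for name in excluded)
    kept, dropped = [], []
    for line in lines:
        name, sep, _version = line.partition("==")
        if sep and normalize(name) in excluded:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped


# Illustrative input only; real content would come from dss-local-packages.txt.
sample = ["numpy==1.0", "mylib==0.1", "Scikit_Learn==1.0"]
kept, dropped = split_requirements(sample, ["numpy", "scikit-learn"])
print("keep: %s" % kept)
print("drop: %s" % dropped)
```

Whether the saved list contains any DSS-shipped packages at all depends on your installation, so treat the result as a starting point for the manual review rather than a replacement for it.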
Advanced: using a fully custom Python environment
For non-standard needs, you can force Data Science Studio to use an externally-maintained Python 2.7 installation by defining the DKUPYTHONBIN environment variable for the Linux user account running the Studio.
Warning
Using this mode is not officially supported.
This variable points to the Python binary to use. It should be defined before running the installer, and for all subsequent runs of the Studio startup or management scripts. You would typically define it as follows:
$ echo "export DKUPYTHONBIN=/usr/local/bin/python" >>$HOME/.profile
When this variable is defined, the precompiled third-party Python packages shipped with DSS are not used. You must make sure that the interpreter started by $DKUPYTHONBIN provides all packages required by DSS. For this purpose, please refer to the script INSTALL_DIR/scripts/install/install-python-packages.sh, found in the Data Science Studio installation directory.
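As a quick sanity check, you can ask the custom interpreter which modules it fails to import. This is a minimal sketch: missing_modules is a hypothetical helper, and the module list below is only an illustrative subset based on the Scientific Python packages named earlier on this page. The authoritative list is the one in install-python-packages.sh.

```python
import importlib


def missing_modules(module_names):
    """Return the names in `module_names` that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative subset only: import names for numpy, scipy,
    # scikit-learn and pandas. Not the full DSS requirement list.
    candidates = ["numpy", "scipy", "sklearn", "pandas"]
    print("Missing modules: %s" % (missing_modules(candidates) or "none"))
```

Run it with the interpreter that DKUPYTHONBIN points to. Note that a successful import says nothing about whether the installed version is one that DSS expects.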
Using Anaconda Python
DSS supports using Anaconda Python instead of the standard system-provided Python. In that mode, the DSS installer builds an Anaconda environment containing the standard set of packages required by DSS, instead of a virtualenv-based environment, and uses it for all Python-based tasks.
As with virtualenv-based installations, it is possible to manually add supplementary packages to this environment, for use in recipes and notebooks.
Prerequisites
- You must have a 64-bit version of Anaconda (https://www.continuum.io/downloads) or Miniconda (http://conda.pydata.org/miniconda.html) installed on the DSS host.
- Anaconda/Miniconda are supported in versions 4.0 and 4.2.
- The binary directory for Anaconda must be in the PATH for the DSS user account. In particular, the conda command must be accessible to this user.
- You must have access to a repository of standard Anaconda packages, either through an outgoing Internet connection (direct or using a proxy), or through a local mirror.
Installation
The DSS installer switches to Anaconda mode when given the -C flag:
dataiku-dss-VERSION/installer.sh -d DATADIR -p PORT -C
It will then download all required packages/versions from the Anaconda repository (plus a few custom ones which are provided directly from the DSS installation directory), and build an Anaconda environment from them in directory DATADIR/condaenv.
Once an Anaconda environment is built in DATADIR/condaenv, it is used instead of the standard virtualenv in DATADIR/pyenv.
Upgrading an Anaconda-based DSS installation installs the new set of required packages/versions in the DSS Anaconda environment, preserving manually-installed additional packages, or upgrading them in case of versioning conflicts.
Further operations
Adding, removing, or listing additional packages in the DSS-managed Anaconda environment can be done using the standard conda commands:
conda list -p DATADIR/condaenv
conda install -p DATADIR/condaenv PACKAGE
Warning
Uninstalling / upgrading / downgrading the standard packages installed by DSS is not supported and may lead to subtle compatibility problems.
When the required packages are not available as conda packages, additional packages may instead be added, removed, or listed through the pip command:
DATADIR/bin/pip list
DATADIR/bin/pip install PACKAGE
For testing purposes, it is possible to run the DSS Anaconda environment outside DSS using:
DATADIR/bin/python
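From inside that interpreter, you can check that you really are in a conda-managed environment. The sketch below relies on the conda-meta directory that conda keeps in each environment it manages; is_conda_environment is a hypothetical helper, not a DSS or conda API.

```python
import os
import sys


def is_conda_environment(prefix=None):
    """Return True if `prefix` looks like the root of a conda environment.

    Relies on the `conda-meta` directory that conda keeps in every
    environment it manages. Defaults to the running interpreter's prefix.
    """
    if prefix is None:
        prefix = sys.prefix
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


print("prefix: %s" % sys.prefix)
print("conda environment: %s" % is_conda_environment())
```

On an installation in Anaconda mode, one would expect the prefix to be DATADIR/condaenv and the second line to report True.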
It is possible to migrate a virtualenv-based DSS installation to Anaconda mode by running the installer in “upgrade” mode and adding the -C flag:
dataiku-dss-VERSION/installer.sh -d DATADIR -u -C
It is possible to migrate an Anaconda-based DSS installation back to standard virtualenv mode by moving away the conda environment and re-running the installer in “upgrade” mode:
mv DATADIR/condaenv DATADIR/condaenv.BAK
dataiku-dss-VERSION/installer.sh -d DATADIR -u