ElasticSearch
Data Science Studio can both read and write datasets on ElasticSearch versions 1.1 to 5.6.
Append mode (appending to an ElasticSearch dataset instead of replacing its contents) is not supported.
Define an ElasticSearch connection
- Go to Administration > Connections
- Click the “New connection” button and pick ElasticSearch
- Enter a name for the new connection, and the required connection parameters, then test and save the new connection
Note
The port parameter should be ElasticSearch’s HTTP API port (9200 by default), not the Java API port.
Managed ElasticSearch datasets
If you allow DSS to write managed datasets into the ElasticSearch connection, you can use this connection to create output datasets for recipes.
Creating such a dataset creates a new index on your ElasticSearch server, named
after the dataset by default, and stores the data in a type that is also named
after the dataset by default. For example, if your ElasticSearch server is
hosted on localhost:9200, a managed dataset named Articles stores its data in
localhost:9200/articles/articles. These names do not change when you rename the
dataset, in case you are relying on their presence. If you rename the dataset
and want the names to stay similar, edit the index and type names after
renaming the dataset, then rebuild it and manually delete the previous index.
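To make the default naming concrete, here is a minimal sketch of how the location in the example is derived. The helper name managed_dataset_location is hypothetical, and lowercasing the dataset name is an assumption inferred from the Articles example rather than a documented DSS rule.

```python
def managed_dataset_location(server, dataset_name):
    """Return the default index/type location for a managed dataset.

    Hypothetical helper: lowercasing is an assumption taken from the
    Articles -> articles example; DSS may apply other naming rules.
    """
    name = dataset_name.lower()
    # By default the index and the type are both named after the dataset.
    return "{}/{}/{}".format(server, name, name)


print(managed_dataset_location("localhost:9200", "Articles"))
```

Because the location is fixed when the dataset is created, calling the helper again with a new dataset name shows what the index and type would need to be edited to after a rename.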
Warning
Do not create other types in an index that is managed by DSS: they might be deleted or altered.
By default, fields get the default ElasticSearch mapping, e.g. strings are
analyzed and indexed (mapped to text in ElasticSearch 5+). If you want access
to a non-analyzed version (mapped to keyword in ElasticSearch 5+) of some or
all of your columns, you can list those columns (comma-separated, or * for all
string columns) in the dataset settings. You can also specify your own
complete type mapping.
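As an illustration of what such a mapping can look like in ElasticSearch 5+, the sketch below builds a properties mapping from a list of string columns and a comma-separated (or *) selection. The function name build_type_mapping and the sub-field name raw are assumptions made for this example; the mapping that DSS actually generates may be named or structured differently.

```python
import json


def build_type_mapping(string_columns, raw_columns):
    """Build an ElasticSearch 5+ style properties mapping for string columns.

    Assumption: the non-analyzed version is exposed as a keyword
    sub-field called "raw"; the name DSS really uses may differ.
    """
    if raw_columns == "*":
        wanted = set(string_columns)
    else:
        wanted = {c.strip() for c in raw_columns.split(",") if c.strip()}

    properties = {}
    for column in string_columns:
        field = {"type": "text"}  # analyzed and indexed (the default)
        if column in wanted:
            # Add a non-analyzed keyword version next to the text field.
            field["fields"] = {"raw": {"type": "keyword"}}
        properties[column] = field
    return {"properties": properties}


mapping = build_type_mapping(["title", "body", "author"], "title, author")
print(json.dumps(mapping, indent=2, sort_keys=True))
```

A dictionary of this shape is also a reasonable starting point if you prefer to write your own complete type mapping.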
If your dataset is partitioned, one index is created per partition (each prefixed by the index name), and the index name itself becomes an ElasticSearch alias that points to all of the partitions’ indices. You can still search or delete from the alias normally.
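The relationship between the alias and the per-partition indices can be sketched as a request body for the ElasticSearch _aliases endpoint. This is only an illustration: the function name alias_actions is hypothetical, and the "<index name>_<partition id>" naming is an assumption, since all that is stated here is that partition indices are prefixed by the index name.

```python
def alias_actions(index_name, partition_ids):
    """Build an _aliases request body pointing one alias at many indices.

    Assumption: partition indices are named "<index_name>_<partition id>".
    """
    return {
        "actions": [
            {
                "add": {
                    "index": "{}_{}".format(index_name, partition_id),
                    "alias": index_name,
                }
            }
            for partition_id in partition_ids
        ]
    }


body = alias_actions("articles", ["2017-01-01", "2017-01-02"])
for action in body["actions"]:
    print(action["add"]["index"], "->", action["add"]["alias"])
```

DSS maintains the alias itself for managed datasets, so a body like this is mainly useful for understanding what a search against the alias covers.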
External ElasticSearch datasets
You can also import existing data from ElasticSearch into DSS. Simply create an ElasticSearch dataset and specify the index and type name of the data. If the connection is writable, DSS can also overwrite that data, but DSS will not modify the type mapping, and it will not create the index or type if they don’t already exist.
Your index may be an alias. An alias can always be used for reading, but it can only be used for writing if it points to a single index (otherwise ElasticSearch refuses the write operation).
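The single-index rule for writes can be checked before you point a writable dataset at an alias. The sketch below assumes the usual shape of the response to a GET on the _alias endpoint for one alias name, which is a JSON object with one key per index that the alias points to; the function name alias_accepts_writes and the sample index names are hypothetical.

```python
def alias_accepts_writes(alias_response):
    """Return True if the alias resolves to exactly one index.

    alias_response is assumed to be the parsed JSON of a GET on the
    _alias endpoint for a single alias: one top-level key per index.
    """
    return len(alias_response) == 1


# Sample responses written by hand for illustration.
single = {"articles_v1": {"aliases": {"articles": {}}}}
several = {
    "articles_2017-01-01": {"aliases": {"articles": {}}},
    "articles_2017-01-02": {"aliases": {"articles": {}}},
}

print(alias_accepts_writes(single))
print(alias_accepts_writes(several))
```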
You can partition your external dataset in DSS: simply specify the partitioning column and the type of partitioning (value or time-based). You can only partition on one column for external datasets.
Note
The partitioning column must have fielddata enabled, which is the case by default for keyword fields in ElasticSearch 5+ but not for text fields.