Schemas, storage types and meanings

  • Definitions
    • Storage types
      • Why use precise storage types?
    • Meanings
  • Basic usage
    • Changing meaning and storage type
      • Changing the storage type
      • Changing the meaning
      • Editing advanced schema
  • Schema for data preparation
    • Schema in visual analysis
    • Schema in prepare recipe
  • Creating schemas of datasets
    • Schema of new external datasets
      • SQL and Cassandra datasets
      • Text-based files datasets
      • “Typed” files datasets
    • Schema of managed datasets
    • Modifying the schema
  • Handling of schemas by recipes
    • Sample, Filter, Group, Window, Join, Split, Stack
    • Sync
    • Prepare
    • Hive, Impala, Pig, SQL
    • Python, R, PySpark, SparkR
    • Machine Learning (scoring)
    • SparkSQL
    • Shell
  • List of recognized meanings
    • Basic meanings
      • Text
      • Decimal
      • Integer
      • Boolean
      • Date / Dates (needs parsing)
      • Object / Array
      • Natural language
    • Geospatial meanings
      • Latitude / Longitude
      • Geopoint
      • Geometry
      • Country
      • US State
    • Web-specific meanings
    • Other meanings
  • User-defined meanings
    • Kinds of user-defined meanings
      • Declarative
      • Values list
      • Values mapping
      • Pattern
    • Autodetecting user-defined meanings
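
As an illustration of the topics listed under “Handling of schemas by recipes”, the following is a minimal sketch of how a Python recipe might inspect an input dataset’s schema (column names and storage types) and write an output dataset with a schema inferred from a pandas dataframe, using the dataiku package. The dataset names used here are hypothetical placeholders, not datasets from this documentation.

    # Minimal sketch: inspect an input schema and write an output with an inferred schema.
    # The dataset names "customers" and "customers_prepared" are hypothetical examples.
    import dataiku

    input_ds = dataiku.Dataset("customers")

    # read_schema() returns the columns with their storage types,
    # e.g. {"name": "age", "type": "bigint"}
    for column in input_ds.read_schema():
        print(column["name"], column["type"])

    # Load the data, transform it, then write the result;
    # write_with_schema() infers the output schema from the dataframe's dtypes.
    df = input_ds.get_dataframe()
    # ... transformations on df would go here ...
    output_ds = dataiku.Dataset("customers_prepared")
    output_ds.write_with_schema(df)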
