Formula language¶
DSS includes a language to write formulas, much like a spreadsheet.
Formulas can be used:
- In data preparation, to create new columns, filter rows or flag rows
- More generally, to filter rows in many places of DSS:
- In the Filtering recipe, to filter rows
- In Machine Learning, to define the extracts to use for train and test set
- In the Python and Javascript APIs, to obtain partial extracts from the datasets
- In the Public API, to obtain partial extracts from the datasets
- In the grouping, window, join and stack recipes, to perform pre and post filtering
Basic usage¶
Note
We also have a Howto that lists common use cases with examples
Formulas define an expression, that applies row per row.
Assuming that you have a dataset with columns N1 (numeric), N2 (numeric) and S (string) , here are a few example formulas:
2 + 2
N1 + N2
min(N1, N2)
# Returns the smallest of N1 and N2replace(S, 'old', 'new')
# Returns the value of S with ‘old’ replaced by ‘new’if (N1 > N2, 'big', 'small')
# Returns big is N1 > N2, small otherwise
Boolean values¶
The formula language uses “True” and “False”, with quotes, as boolean values for true and false.
Reading column values¶
In almost all formulas, you’ll need to read the values of the columns for the current row.
When the column has a “simple” name (i.e: starting by a letter, contains only letters, numbers and underscores), you just need to use the name of the column in the fomula: N1 + 4
For other cases, you can use:
strval("column with spaces")
: returns the value of “column with spaces” as a stringnumval("column with spaces")
: returns the value of “column with spaces” as a number
Variables typing¶
Variables in the formula language can have one of the following types: string, integer, decimal, array, object, boolean, date
Regardles of the schema of the dataset, or the meanings of the columns (in the case of working on dynamically-generated columns), the following rules are applied:
- If column values are “standard” decimal, they are automatically parsed to decimal
- If column values are “standard” integer, they are automatically parsed to integer
- Else, they are kept as string
Columns containing dates, arrays, objects or booleans are not automatically converted to the matching Formula type. However:
- all functions that require an array will automatically attempt to convert a string input to array (using the regular DSS JSON syntax)
- all functions that require a date will automatically attempt to convert a string input to date (using the ISO-8601 format)
In other words, if the column “begin_date” is a date (and thus contains properly-formatted ISO-8601):
type(begin_date)
returns “string”inc(begin_date, 2, "days")
works as expected because the inc function performed automatic conversion to date- To manually obtain an array or object, use the
parseJson
function - To manually obtain a date, use the
asDate
function
Operators¶
The formula language supports the classical arithmetic operators:
- Regular math operators:
+
,-
,*
,/
- Comparison operators (evaluate to booleans):
>
,>=
,<
,<=
,==
,!=
- Arithmetic operators:
//
(integer division) and%
(modulo) - Boolean operators:
&&
,||
- The
+
operator also performs string concatenation. - The comparison operators (and only them) can operate on dates. For dates arithmetic, see the
diff
andinc
functions.
Array and object operations¶
Formula support accessing array elements and object keys using the traditional Python/Javascript syntax:
array[0]
object["key"]
object.key
(only valid if ‘key’ is a valid identifier, i.e. matches[A-Za-z0-9_]*
)
Note: this requires that you actually have an array or object. You might need to use the parseJson
function (see above paragraph about typing)
Object notations¶
For all functions, you can use them the “regular” way: replace(str, 'a', 'b')
or in the “object” way: str.replace('a', 'b')
In object notation, the first argument to the function is replaced by its ‘context’.
For example, the two syntaxes are equivalent:
length(trim(replace(foo, 'a', 'b')))
foo.replace('a', 'b').trim().length()
Array functions¶
arrayContains(array a, item) boolean¶
Returns whether the array a
contains item
arrayDedup(array a) array¶
Returns array a with duplicates removed
arrayIndexOf(array a, item) int¶
the index (starting from 0) of item
in the array a
. Return -1 if item
is not in a
arrayLen(array a) int¶
Returns the length of the array a
arrayReverse(array a) array¶
Reverses array a
arraySort(array a) array¶
Sorts array a
get(o, number or string from, optional number to) Depends on actual arguments¶
If o has fields, returns the field named ‘from’ of o. If o is an array, returns o[from, to]. if o is a string, returns o.substring(from, to)
join(array a, string sep) string¶
Returns the string obtained by joining the elements of array a
with the separator sep
For example, the expression join(date_elements, '-')
on a column date_elements
produces:
date_elements | output |
---|---|
[2007, 7, 15] | 2007-7-15 |
[2016, 1, 8] | 2016-1-8 |
objectKeys(object o) array¶
Returns the keys of an object as an array
objectValues(object o) array¶
Returns the keys of an object as an array
slice(o, number from, optional number to) Depends on actual arguments¶
If o
is an array, returns o[from, to]
. If o
is a string, returns o.substring(from, to)
substring(o, number from, optional number to) Depends on actual arguments¶
If o
is an array, returns o[from, to]
. If o
is a string, returns o.substring(from, to)
Boolean functions¶
and(boolean a, boolean b) boolean¶
Returns a AND b
not(boolean b) boolean¶
Returns the opposite of b
or(boolean a, boolean b) boolean¶
Returns a OR b
Date functions¶
asDate(o, format1, format2, … (all optional)) date¶
Returns o converted to a date. If you don’t give a format, ISO-8601 is used. You can give an ordered list of possible formats using this syntax: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
datePart(date d, string part) number¶
Returns part of a date. The available parts are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’), ‘weekday’ (the name of the day of the week, capitalized), ‘time’ (number of milliseconds since epoch)
Returned parts are always in local timezone and english
diff(o1, o2, optional string unit) number¶
Returns the difference between two dates in given time units. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)
inc(date d, number value, string unit) date¶
Returns a date incremented or decremented by the given amount in the given unit of time. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)
now() date¶
Returns the current time
trunc(date d, string unit) date¶
Truncates a date to a given unit. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)
Math functions¶
abs(number d) number¶
Returns the absolute value of a number
acos(number d) number¶
Returns the arc cosine of an angle, in the range 0 through PI
asin(number d) number¶
Returns the arc sine of an angle in the range of -PI/2 through PI/2
atan(number d) number¶
Returns the arc tangent of an angle in the range of -PI/2 through PI/2
atan2(number x, number y) number theta¶
Converts rectangular coordinates (x, y) to polar (r, theta)
ceil(number d) number¶
Returns the ceiling of a number
combin(number n, number k) number¶
Returns the number of combinations for n
elements as divided into k
cos(number d) number¶
Returns the trigonometric cosine of an angle
cosh(number d) number¶
Returns the hyperbolic cosine of a value
dec2hex(long) string¶
Returns an hexadecimal representation of the input number
degrees(number d) number¶
Converts an angle from radians to degrees.
even(number d) number¶
Rounds the number up to the nearest even integer
exp(number n) number¶
Returns the exponential of a number
fact(number i) number¶
Returns the factorial of a number
factn(number i) number¶
Returns the factorial of a number
floor(number d) number¶
Returns the floor of a number
gcd(number d, number e) number¶
Returns the greatest common denominator of the two numbers
hash(string) long¶
Returns a 64 bits numerical hash of the input (not crypto-secure)
hex2dec(string) long¶
Returns a decimal representation of an hexadecimal string
lcm(number d, number e) number¶
Returns the greatest common denominator of the two numbers
ln(number n) number¶
Returns the natural log of a number
log(number n) number¶
Returns the base 10 log of a number
max(number a, number b) number¶
Returns the greater of two numbers
min(number a, number b) number¶
Returns the smaller of two numbers
mod(number a, number b) number¶
Returns a modulus b
multinomial(number d1, number d2 …) number¶
Calculates the multinomial of a series of numbers
odd(number d) number¶
Rounds the number up to the nearest odd integer
pow(number a, number b) number¶
Returns a
to the power of b
quotient(number numerator, number denominator) number¶
Returns the integer portion of a division
radians(number d) number¶
Converts an angle in degrees to radians
rand(optional int min, optional int max) double or int¶
Without arguments, returns a random float between 0 and 1. With min
and max
arguments, returns a random integer between min
(inclusive) and max
(exclusive)
round(number n) number¶
Returns the rounding of number to the nearest integer
sin(number d) number¶
Returns the trigonometric sine of an angle
sinh(number d) number¶
Returns the hyperbolic sine of an angle
sum(array a) number¶
Sums the numbers of an array. Skips non-number elements from the array.
tan(number d) number¶
Returns the trigonometric tangent of an angle
tanh(number d) number¶
Returns the hyperbolic tangent of a value
toNumber(o) number¶
Returns o converted to a number
Object functions¶
hasField(o, string name) boolean¶
Returns whether o has field name
htmlAttr(Element e, String s) string¶
Selects a value from an attribute on an Html Element
htmlText(Element e) string¶
Selects the text from within an element (including all child elements)
innerHtml(Element e) string¶
The innerHtml of an HTML element
jsonize(value) JSON literal value¶
Quotes a value as a JSON literal value
objectDel(key, [key…]) object¶
Removes one or several keys from an object and returns it. The keys must not be null
objectNew(k1, v1, k2, v2, …) object¶
Creates a new object, optionally pre-filled with key/values. Must get an even number of arguments, as successive key-value pairs. Giving 0 arguments is possible and will return an empty object (you can use objectPut to add to it)
objectPut(object o, key, value) object¶
Adds a key/value pair to an object and returns it. key must not be null
ownText(Element e) string¶
Gets the text owned by this HTML element only; does not get the combined text of all children.
parseHtml(string s) HTML object¶
Parses a string as HTML
parseJson(string s) object or array¶
Parses a JSON string as an object or array
select(Element e, String s) HTML Elements¶
Selects an element from an HTML elementn using selector syntax
type(object o) string¶
Returns the type of o
String functions¶
chomp(string str, string tail) string¶
Removes tail
from the end of str
if it’s there, otherwise leave it alone.
contains(string s, string frag) boolean¶
Returns whether s
contains frag
endsWith(string s, string sub) boolean¶
Returns whether s
ends with sub
escape(string s, string mode) string¶
Escapes a string depending on the given escaping mode. Supported modes: ‘html’,’xml’,’csv’,’url’,’javascript’
format(string format, object… args) string¶
Formats a string using printf-like formatting.
For example '%4d-%02d'.format(2004,2)
gives '2004-02'
.
This uses the Java Formatter syntax
fromBase64(string s, optional string charset) string¶
Returns the string whose Base64 representation is given. By default, the string is read using the UTF-8 charset.
indexOf(string s, string sub) number¶
Returns the index of the first ocurrence of sub
in s
. Returns -1 if there is no such occurrence.
lastIndexOf(string s, string sub) number¶
Returns the index of the last ocurrence of sub
in s
. Returns -1 if there is no such occurrence.
length(array or string o) number¶
Returns the length of o
match(string, string or regexp) array of strings¶
Returns an array of the groups matching the given regular expression
For example, match('hello world', 'he(.*).*(rl)d')
returns ['ll', 'rl']
md5(string s) string¶
Returns the MD5 hash of a string
partition(string s, string or regex frag, optional boolean omitFragment) array¶
Returns an array of strings [a,frag,b]
where a
is the part before the first occurrence of frag
in s
and b
is the part after the occurrence. If omitFragment
is true, frag
is not returned in the array.
replace(string s, string or regex f, string replacement) string¶
Replaces all occurrences of a substring or regex in a string
replaceChars(string s, string f, string r) string¶
Returns the string obtained by replacing all chars in f
with the char in s
at that same position
For example, replaceChars('abcba', 'bc', 'yz')
returns ayzya
The function can be used to delete characters, by replacing them by nothing, like in: replaceChars('abcba', 'bc', 'y')
returns ayya
rpartition(string s, string or regex frag, optional boolean omitFragment) array¶
Returns an array of strings [a,frag,b]
where a
is the part before the last occurrence of frag
in s
and b
is the part after the occurrence. If omitFragment
is true, frag
is not returned in the array.
sha1(string s) string¶
Returns the SHA-1 hash of a string
split(string s, string or regex sep, optional boolean preserveAllTokens) array¶
Returns the array of strings obtained by splitting s
with separator sep
. If preserveAllTokens
is true, then empty segments are preserved.
splitByCharType(string s) array¶
Returns an array of strings obtained by splitting s grouping consecutive chars by their unicode type
splitByLengths(string s, number length1, …) array¶
Returns the array of strings obtained by splitting a string into substrings with the given lengths
For example, splitByLengths('abcdefgh', 2, 3, 1)
returns ['ab','cde','f']
The function generates one element in the array for each length passed as parameter. If the input string is not long enough, the last elements are empty.
startsWith(string s, string sub) boolean¶
Returns whether s
starts with sub
strip(string s) string¶
Returns copy of the string, with leading and trailing whitespace omitted.
toBase64(string s, optional string charset) string¶
Returns the Base64 representation of a string. By default, the string is written using the UTF-8 charset.
toLowercase(string s) string¶
Converts a string to lowercase
toString(o, string format (optional)) string¶
Returns o converted to a string
toTitlecase(string s) string¶
Converts a string to titlecase
toUppercase(string s) string¶
Converts a string to uppercase
trim(string s) string¶
Returns copy of the string, with leading and trailing whitespace omitted.
unescape(string s, string mode) string¶
Unescapes all escaped parts of the string depending on the given escaping mode. Available modes: ‘html’,’xml’,’csv’,’url’,’javascript’
unicode(string s) string¶
Returns the input string as an array of the unicode codepoints (numbers)
unicodeType(string s) string¶
Returns an array of strings describing each character of the input string in their full unicode notation
Value access functions¶
numval(object o) number¶
Returns the numerical value of a given cell or expression. If the value of the cell or expression is not numerical, an empty value is returned.
For example, with columns type
(with values “high” and “low”), cost_high
and cost_low
, the expression numval('cost_' + type)
produces:
cost_low | cost_high | type | output |
---|---|---|---|
200 | 500 | low | 200.0 |
100 | 400 | high | 400.0 |
xx | 400 | low | |
strval(object o, optional string defaultValue) string¶
Returns the numerical value of a given cell or expression, or the default value if the value of the cell or expression is empty.
For example, with columns type
(with values “high” and “low”), code_high
and code_low
, the expression strval('code_' + type, 'N/A')
produces:
code_low | code_high | type | output |
---|---|---|---|
200 | 500 | low | 200 |
100 | 4x0 | high | 4x0 |
400 | low | N/A | |
Control structures¶
Control structures allow you to perform advanced operations.
Beware: control structures cannot use object notation !
filter(array a, variable v, expression e) (returns: array)¶
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.
Note
filter(myarray, v, v < 2)
is equivalent to this Python syntax: [v for v in myarray if v < 2]
and returns the array with only the elements below 2
forEeach(array a, variable v, expression e) (returns: array)¶
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.
Note
forEach(myarray, v, v + 2)
is equivalent to this Python syntax: [v+2 for v in myarray]
forEachIndex(array a, variable i, variable v, expression e) (returns: array)¶
Evaluates expression a to an array. Then for each array element, binds its value to variable name v and its index to variable name i, evaluates expression e, and pushes the result onto the result array.
Note
forEachIndex(myarray, i, v, v + i)
is equivalent to this Javascript syntax: myarray.map(function(v, i) { return v+i ;})
forRange(from, to, step, variable v, expression e) (returns: array)¶
Iterates, tarting at from, incrementing by step each time while less than to. At each iteration, binds the variable v to the iteration value, evaluates expression e, and pushes the result onto the result array.
Note
forRange(0, 100, 3, v, v * 2)
is equivalent to this Python syntax: [v * 2 for v in xrange(0, 100, 3)]
and returns [0, 6, 12, 18 …. 198]
if(boolean, expression_true, expression_false) (returns: anything)¶
Evaluates to expression_true if the condition is true, to expression_false otherwise
objectFilter(expression a, variable k, variable v, expression test) object¶
Evaluates expression a to an object. Then for each element (k, v) of this object, binds its key to variable name k, its value to variable name v, evaluates expression test which should return a boolean. If the boolean is true, pushes (k, v) onto the result object.
with(expression o, variable v, expression e) (returns: any)¶
Evaluates expression o and binds its value to variable name v. Then evaluates expression e and returns that result.
The with
control allows you to “split” a very big expression into more manageable chunks. It also makes reusing the result of a complex computation easier to read and faster to process.
Tests¶
Beware: tests cannot use object notation
isBlank(expression o) boolean¶
Returns whether o is null or an empty string
isError(expression o) boolean¶
Returns whether o is an error
isNonBlank(expression o) boolean¶
Returns whether o is not null and not an empty string
isNotNull(expression o) boolean¶
Returns whether o is not null
isNull(expression o) boolean¶
Returns whether o is null
isNumeric(expression o) boolean¶
Returns whether o can represent a number