MASARYK
U N I V E R S I T Y
FACULTY OF INFORMATICS
Orchestration tool for the modular
computing subsystem of the
ANALYZA platform
Master's Thesis
BC. RICHARD KYČERKA
Brno, Spring 2022
MASARYK
U N I V E R S I T Y
FACULTY OF INFORMATICS
Orchestration tool for the modular
computing subsystem of the
ANALYZA platform
Master's Thesis
BC. RICHARD KYČERKA
Advisor: RNDr. Tomas Rebok, Ph.D.
Department of Computer Systems and Communications
Brno, Spring 2022
Declaration
Hereby I declare that this paper is my original authorial work, which
I have worked out on my own. A l l sources, references, and literature
used or excerpted during elaboration of this work are properly cited
and listed in complete reference to the due source.
Be. Richard Kyčerka
Advisor: RNDr. Tomáš Rebok, Ph.D.
iii
Acknowledgements
I would like to thank my supervisor, RNDr. Tomas Rebok, Ph.D., for his
advice, help and support during my work on this thesis.
iv
Abstract
In recent years, the concept of 'Big Data' gained the attention of many
analysts, since the ability to collect and process data on a massive scale
is getting more and more accessible. Many projects focused on Big Data
emerged and one of those projects is the A N A L Y Z A platform, which
focuses on cross-domain analysis of heterogeneous data. Analysis
steps are isolated mini-applications, composed into data workflows.
The aim of this thesis was to review the field of workflow orchestration
tools mostly operating in the Kubernetes platform and pick the most
suitable one to adopt by the A N A L Y Z A platform. The choice fell
on Argo Workflows, which has been proven to be the most suitable
tool to use in A N A L Y Z A platform. The thesis justifies the choice by
detailed analysis of Argo Workflows capabilities and demonstrates it
on a real world data workflow use case.
Keywords
workflow, orchestration, Docker, Kubernetes, data analysis, Big Data,
Airflow, Prefect, Argo Workflows, Snakemake, A N A L Y Z A
v
Contents
Introduction 1
1 Big Data analysis 3
1.1 Cloud computing 3
1.2 Workflows 3
2 Airflow 5
2.1 Steps 5
2.1.1 Pre-TaskFlow era tasks 5
2.1.2 TaskFlow era tasks 6
2.1.3 XComs 6
2.2 Workflows 7
2.2.1 Pre-TaskFlow era DAGs 7
2.2.2 TaskFlow era DAGs 8
2.3 Deployment 8
2.3.1 Executors 9
2.4 Documentation and community 10
2.5 Summary 10
3 Prefect 11
3.1 Tasks 11
3.2 Workflows 12
3.3 Deployment 12
3.3.1 Prefect Cloud 13
3.3.2 Prefect Core 13
3.3.3 Prefect Cloud vs Server 14
3.3.4 Agents 14
3.4 Documentation and community 14
3.5 Prefect Orion 15
3.6 Summary 15
4 Snakemake 16
4.1 Tasks 16
4.2 Workflows 17
4.2.1 Workflow execution 17
vi
4.3 Deployment 18
4.3.1 API 18
4.4 Documentation and community 18
4.5 Summary 20
5 Argo Workflows 21
5.1 Tasks 21
5.2 Workflows 22
5.3 Deployment 23
5.4 API 23
5.5 Documentation and community 24
5.6 Summary 24
6 Honorable mentions 25
6.1 Luigi 25
6.2 Flyte 26
6.3 Dagster 27
6.4 Nextflow 28
7 ANALYZA platform and new orchestrator service requirement analysis
29
7.1 Orchestrator service 30
7.1.1 Analytical modules 30
7.1.2 Analytical operations 31
7.1.3 Deployment 32
7.2 Requirement analysis 33
7.3 New orchestration tool choice 34
8 Argo Workflows fundamentals 35
8.1 Workflows 35
8.1.1 Input and Outputs 37
8.1.2 Workflow management 38
8.1.3 Workflow execution 38
8.1.4 Workflow templates 40
8.2 UI 41
8.3 Auth 42
8.4 User management 42
8.5 Argo API 44
vii
8.6 Argo deployment 44
8.6.1 Scaling 45
8.7 Secrets 47
9 Requirements implementation in Argo 48
9.1 Public services 48
9.1.1 Kubernetes resources revised 48
9.1.2 Create service in Argo 49
9.1.3 Garbage collection 55
9.2 Custom API 55
9.2.1 Changes to the API 56
9.2.2 Implementation 57
10 Argo features that might be used in future 58
10.1 Workflow definition synchronization 58
10.2 Usage of volume claims 59
10.3 Workflow notifications 59
10.4 Prometheus metrics 60
11 Demonstration and evaluation of results 61
11.1 Demonstration environment 61
11.2 Demonstration scenario 62
11.3 Evaluation 65
12 Conclusion 66
A Workflow Examples 67
A . l Container 67
A.2 Script 68
A.3 Resource 69
A.4 Suspend 70
A.5 Steps 71
A.6 D A G 72
A.7 Parameters 73
A.8 Artifacts 74
A.9 Test scenario 75
A.9.1 Service creation 75
A.9.2 Download csv 78
viii
A.9.3 Shorten csv
A.9.4 MySQL write
A.9.5 Print csv
List of Tables
9.1 Custom Argo API Contract
List of Figures
2.1 Airflow architecture schema 9
4.1 Snakemake workflow example 19
5.1 Argo architecture schema 23
6.1 Luigi workflow example 25
6.2 Flyte workflow example 26
6.3 Dagster workflow example 27
6.4 Nexflow workflow example 28
7.1 ANALÝZA platform architecture 29
7.2 Analytical operation definition example 31
8.1 Argo workflow controller architecture 40
8.2 Argo UI displaying the state of a workflow run 41
11.1 MySQL demo step 62
11.2 MySQL DB demo step 63
11.3 Print step demo logs 64
11.4 Print demo step outputs 64
xi
Introduction
In recent years, Big Data has gained considerable attention from governments,
academia, and enterprises. The term "big data" refers to collecting,
analyzing, and processing voluminous data. Having large data
sets and reasonable computational power, many projects emerged that
took the chance and implemented various tools to analyze these data
sets.
One such project is ANALYZA[1] platform originated as research
at Masaryk University for the Police of Czech republic with a focus
on cross-domain data analysis. Project ANALÝZA use-cases include
image processing or network data analysis. ANALÝZA platform covers
the whole process of data analysis, from storage in specialized
distributed data lakes, through the complex analysis steps to visualization
in specialized component providing various analytic functions
and view on analysis results.
Individual analysis steps are stand-alone mini-applications, where
one complex analysis may combine multiple steps into data workflows.
The result of such analysis may be various views on the data and their
relationships. Workflow steps need to be executed in the correct order
to respect dependencies between steps. This composition may be nontrivial
and specialized orchestrator service, as a part of the ANALÝZA
platform, is taking care of this scheduling.
The goal of this research is to analyze the field of workflow orchestration
tools and pick one to adapt by ANALÝZA platform to succeed
its in-house built orchestration solution. The research puts emphasis
on functionality portfolio, scalability, security, and usability in production
environments. The new orchestration tool functional requirements
are then demonstrated in various examples based on the workflow
use case style used in ANALÝZA project.
The thesis can be divided into three parts. The first part, i.e. chapters
1 to 6, is focused on research in the field of well-known workflow
orchestration tools and provides a brief overview of the most common
ones. The overview shows how are the workflows defined, what is
the deployment and execution strategy and discussion about community
and documentation.
1
The second part, chapter 7, introduces A N A L Y Z A project architecture
together with deployment and workflow execution strategy.
Architecture analysis leads to a list of requirements that the new orchestration
engine must fulfill.
The final part, chapter 8 and onwards, introduces the orchestration
tool choice, describes its functionalities in detail with examples,
and is finalized by a demonstration on a realistic use case.
2
1 Big Data analysis
Big data represents the large, diverse sets of information that grows
at an exponential rate. Unfortunately, big data is so large that none
of the traditional data management tools can store or process it efficiently.
More than the volume of data, the way organizations utilize
data matters.
1.1 Cloud computing
One of the main challenges in developing Big Data processing solutions
is to define the right architecture to deploy Big Data software
in production systems. Big Data systems, are large-scale applications
that handle online and batch data that is growing exponentially.
For that reason, a reliable, scalable, secure and easy to administer platform
is needed to bridge the gap between the massive volumes of data
to be processed, software applications and low-level infrastructure.
Kubernetes1
is one of the best options available to deploy applications
in large-scale infrastructures. Using Kubernetes, it is possible
to handle all the online and batch workloads required to feed analytics
and machine learning applications. Kubernetes works with virtualized
containerized applications, where the most popular is virtualization
technology is Docker2
. Application containerizing provides portability
and scalability needed in order to fully utilize cloud computing.
1.2 Workflows
In the data science field, when working with Big Data, the measurements
often need to undergo some kind of transformation to provide
the analysts with various views on the data.
These transformations can be called data workflows or pipelines.
Workflows are composed of set of tasks, where tasks are single units
of work in data processing chain.
https://kubernetes.io/
https://docs.docker.com/get-started/overview/
3
i . BIG DATA ANALYSIS
Responsibility of a task can be for example downloading a dataset
from a website into a local data warehouse. Another task can take this
data and do some corrections, like unifying formats of records in data
sets and saving results into a local database. The next task can take
the data from the database and run machine learning model training.
A l l these tasks together form a workflow.
Intuitively, these steps need to be executed in the correct order,
since one can't run machine learning model training before downloading
data sets from the website. This implies natural dependencies
between tasks.
For easier representation, an abstraction of Directed Acyclic Graphs
(DAG) [2] has been introduced, where graph nodes represent tasks
involved in the process and edges represent dependencies between
tasks.
Workflow management tools are here to help data scientists with
data transformations. Each tool has its own implementation of tasks
and workflows, but the idea of DAGs remains the same.
4
2 Airflow
Apache Airflow is a workflow management tool founded by Airbnb
in October 2014, later open-sourced under Top-Level Apache Software
Foundation1
project. Airflow was one of the first widely used tools
of the workflow management kind. In December 2020, the massive
upgrade to Airflow 2.0 was released, targeting the biggest issues, like
scheduler performance or the introduction of TaskFlow API allowing
task definition in a functional way.
2.1 Steps
Workflow steps, or tasks in Airflow terminology, can be arbitrary
pieces of Python code. There are two ways to define tasks, depending
on the Airflow version. Both task types can be combined together
within one workflow.
2.1.1 Pre-TaskFlow era tasks
The original Airflow implementation of tasks was heavily relying
on Operator2
usage. Operators are pre-defined templates for tasks
and Airflow comes with a set of pre-defined Operators, but user can
write his own custom Operators. A n example of an Operator can be
PythonOperator, which takes the Python function as an input argument
and transforms the function into a workflow step.
https://www.apache.org/
2
https://airflow.apache.org/docs/apache-airflow/stable/concepts/
operators.html
5
2. AIRFLOW
1 def p r i n t _ f u n c t i o n ( x ) :
2
o
p r i n t x
J
4 t l == P y t h o n O p e r a t o r (
5 t a s k _ i d = ' p r i n t ' ,
6 p y t h o n _ c a l l a b l e = p r i n t _ f u n c t i o n ,
7 op_kwargs = {"x" : " H e l l o u W o r l d " } ,
8 dag=dag,
9 )
Listing 2.1: Airflow task definition in pre-TaskFlow era.
2.1.2 TaskFlow era tasks
The TaskFlow API introduced the @dag and ©task concepts, creating
the abstraction, where steps are ©task decorated functions, and workflows
are @dag decorated functions calling multiple ©task functions
inside. This approach has greatly reduced the verbosity of definitions
and automatically dependency calculations.
1 Q t a s k Q
2 def p r i n t _ f u n c t i o n ( x ) :
3 p r i n t x
Listing 2.2: Airflow task definition in TaskFlow era.
2.1.3 XComs
Airflow's XComs3
are the inter-task communication mechanism. It can
be thought of as a shared board, where tasks push to and pull information
from. Having two tasks A and B, where A calculates some value,
push it to the shared board and task B pulls the value and adjusts its
own calculation based on this information.
It's important to remember that XComs are designed to transfer
just small amounts of metadata rather than big volumes of data.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/
xcoms.html
6
2. AIRFLOW
2.2 Workflows
D A G 4
is a collection of tasks, with some additional flow properties,
like schedules or parameters. DAG definition again differs depending
on whether TaskFlow API is used or not. DAGs are stored in the metadata
database, which is part of the Airflow deployment. Submission
of DAGs can be done via either REST API, UI, or CLI.
2.2.1 Pre-TaskFlow era DAGs
DAGs defined in the traditional way is a D A G object with a collection
of Operators and DAG properties. Dependencies are set as upstream/downstream
between operators, signalized by » sign, e.g. 11 »t2
signalizes tl is dependent on tl.
1 d e f p r i n t _ f u n c t i o n ( x ) :
2 p r i n t x
3
4 w i t h DAG(
5 ' e x a m p l e ' ,
6 d e f a u l t _ a r g s — {
' d e p e n d s _ o n _ p a s t ' : F a l s e ,
' e m a i l ' : [' a i r f l o w @ e x a m p l e . c o m ' ] ,
' e m a i l _ o n _ f a i l u r e ' : F a l s e ,
10 ' e m a i l _ o n _ r e t r y ' : F a l s e ,
' r e t r i e s ' : 1 ,
12 ' r e t r y _ d e l a y ' : t i m e d e l t a ( m i n u t e s — 5 ) ,
13 },
14 d e s c r i p t i o n — ' H e l l o , _ , w o r l d J D A G ' ,
15 s c h e d u l e _ i n t e r v a l = t i m e d e l t a ( d a y s = l ) ,
s t a r t _ d a t e = d a t e t i m e (2022, 1 , 1 ) ,
17 c a t c h u p — F a l s e ,
18 t a g s = [ ' e x a m p l e ' ] ,
19 ) as d a g :
20 t l = B a s h O p e r a t o r (
21 t a s k _ i d — ' p r i n t _ d a t e ' ,
22 bash_command—' d a t e ' ,
23 d a g - d a g ,
24 )
25 t2 = P y t h o n O p e r a t o r (
26 t a s k _ i d = ' p r i n t ' ,
27 p y t h o n _ c a l l a b l e = p r i n t _ f u n c t i o n ,
28 o p _ k w a r g s = {"x" : " H e l l o ^ W o r l d " },
29 d a g - d a g ,
30 ti » t2
31 )
Listing 2.3: Airflow DAG definition, pre-TaskFlow
https://airflow.apache.org/docs/apache-airflow/stable/concepts/
dags.html
7
2. AIRFLOW
2.2.2 TaskFlow era DAGs
Utilizing the @dag and ©task decorators, the DAG definition becomes
a lot clearer, better readable, and easier to develop. Dependencies can
be set manually, or when possible, they are calculated by Airflow.
1 @dag
2 def example () :
3 t l = BashOperator (
4 task_id='print_date ' ,
5 bash_command=' date ' ,
6 )
7 ©task
8 def p r i n t _ f u n c t i o n (x) :
9 print x
10 ©task
11 def generate_number () :
12 return random. randint (0 , 10)
13
14 random_number = generate_number ()
15 random_message = p r i n t _ f u n c t i o n (random_number)
16 t l » random_message
17 )
Listing 2.4: DAG example using TaskFlow API
2.3 Deployment
Airflow deployment is quite complex, so the preferred type of deployment
is using Helm5
charts to deploy in the Kubernetes cluster
or managed installation, where the user pays someone else to host
and maintain the installation in the SaaS model.
For development purposes, local deployment can be done via DockerCompose
or running Airflow in a Python environment.
'https://helm.sh/
8
2. AIRFLOW
Metadata
Database
User
Interface
Figure 2.1: Airflow architecture schema
Source: https://airflow.apache.org/docs/apache-airflow/stable/
concepts/overview.html
2.3.1 Executors
DAGs are processed by executors6
, which are processes running inside
the Scheduler process and can be divided into local and remote
types. Local executors keep the task execution under the scheduler
process. Remote executors delegate the work to Dask7
, Celery8
backend
or Kubernetes. Executor choice is installation-wide and can be set
in the Airflow configuration file.
Tasks from a single flow can, but don't necessarily need to be
executed on the same machine.
Kubernetes executor
Kubernetes executor is a process running under the Scheduler process
that has access to a Kubernetes cluster. When a DAG is invoked via Kubernetes
Executor, the executor requests Kubernetes API for a worker
https://airflow.apache.org/docs/apache-airflow/stable/executor/
index.html#executor
7
Dask is an open-source library for parallel computing - https: //dask. org/
8
https://docs.celeryq.dev/en/latest/index.html#
9
2. AIRFLOW
Pod where the task is executed and after execution and reporting
result, the pod is dismissed.
2.4 Documentation and community
Airflow documentation is quite extensive since there are many concepts
to be explained. Diagrams and code snippets are here to help
with the understanding of individual concepts. However, some documentation
sections are quite confusing and somewhat hard to under-
stand.
One of the biggest advantages of Airflow is its massive community
built over the last years. The community is also very active in contributing
to the Airflow codebase9
with hundreds of discussed issues
on GitHub and also many open pull requests fixing bugs or suggesting
improvements.
2.5 Summary
Apache Airflow is a mature, heavy-armed orchestrator with tradition
and many users accumulated during the last years.
Users considering using Airflow should be aware of a steep learning
curve and somewhat overwhelming installation process. This is
mainly caused by many architectural changes over the years. In the near
past, it feels like it's just trying to catch up on the competitors that
started later than Airflow with fresh and better-designed architecture
(taking advantage of Airflow design flows knowledge), rather than
coming up with new ideas.
For those who can overcome the initial struggles, Airflow can
deliver a reliable and robust service.
https://github.com/apache/airflow
10
3 Prefect
Prefect is an open-source workflow management system. The main
motivation for Prefect to rise was dissatisfaction with some of the Airflow's
concepts (at that time the go-to workflow management system),
like writing DAGs in an imperative way, weak DAG scheduling mechanism,
or inability of passing big data chunks between tasks. Prefect
chose Python as the language of choice for both, writing workflows
and the platform itself.
Prefect allows user to define tasks and flows in two ways; the imperative
and functional. The more elegant and more convenient is the functional
way; it brings the advantage of having less boilerplate code and
better readability. On the other hand, the imperative declaration offers
more fine-grained control over tasks and flows. In later sections, only
the functional task API will be considered.
3.1 Tasks
In Prefect, tasks are defined as Python functions annotated by @task
decorator1
. Task level properties like the number of retries or task
identification are passed as arguments to the decorator.
Prefect comes with a decent library of mostly community-driven
pre-defined tasks providing the ability to integrate with third-party
services2
. For reference, such tasks could execute a shell script, create
a Jira issue or post a message via Slack.
https://docs.prefect.io/core/concepts/tasks
https://docs.prefect.io/core/task_library
11
3. PREFECT
3.2 Workflows
Flows3
are Prefect's implementation of DAGs. Tasks are instantiated
inside of a flow object, possibly using the return value of one task
as input for other tasks, creating a natural dependency between these
tasks. Dependencies are then automatically detected by Prefect.
The development and debugging can be done locally4
on the user's
machine without any available Prefect deployment.
In order to run a workflow on running deployment, it needs to be
registered to the Prefect backend using the Prefect CLI5
.
1 @ t a s k ( n a m e = " n u m b e r L J g e n e r a t o r " )
2 d e f g e n e r a t e _ n u m b e r () :
3 r e t u r n r a n d i n t ( 0 , 2 2 )
4
5 @task(name=" s a y u t a s k " )
6 d e f s a y ( x ) :
7 p r i n t ( x )
8
9
10 w i t h F l o w ( ' S i m p l e ^ P r e f e c t ^ f l o w ') as f l o w :
n number — g e n e r a t e _ n u m b e r ()
12 a n o t h e r _ n u m b e r — g e n e r a t e _ n u m b e r ()
13 say (number + a n o t h e r _ n u m b e r )
Listing 3.1: Prefect workflow example
3.3 Deployment
Prefect architecture is divided into two parts, backend, and agents.
Backend stores metadata about registered flows, provides UI, and takes
care of task scheduling. Flows are executed on agents, which can be
deployed in several ways, depending on the use case and available
infrastructure. Prefect comes with two implementations of backends,
the cloud-managed called Prefect Cloud and the on-premise one called
Prefect Core.
'https://docs.prefect.io/core/concepts/flows
\https://docs.prefect.io/core/getting_started/basic-core-flow.html
'https://docs.prefect.io/api/latest/cli/register
12
3. PREFECT
3.3.1 Prefect Cloud
Prefect Cloud encapsulates all the backend components into a cloudmanaged
solution. This means the user doesn't need to deploy in his
infrastructure anything but one or more of the agent variants. Prefect
Cloud is easy to set up and enables user to run his first workflow
in a few moments. Cloud backend is hosted on Prefect side and there
are several pricing options including a free one, although with some
restrictions like a limited number of registered flows or a limited
number of flow runs per month.
3.3.2 Prefect Core
The core version of Prefect is basically a lightweight on-premise version
of Prefect Cloud. Depending on the desired scale, Prefect can
be deployed on a local machine using docker-compose or scaling large
in Kubernetes cluster using helm charts6
.
The docker-compose option is more suitable for evaluating and testing
whether Prefect suits the user's needs rather than running in production.
For production deployment, the Kubernetes deployment
seems to be the only viable option.
Helm charts are provided by the Prefect team which makes the charts
more trustworthy, documentation related to Prefect core Helm charts
is well written and covers modifications to the default configuration.
6
https://github.com/PrefectHQ/server/tree/master/helm/
prefect-server
13
3. PREFECT
3.3.3 Prefect Cloud vs Server
The cloud solution provides many advantages compared to Server7
:
• Authorization and Permissions - the ability to manage users
authenticated through AuthO8
, their roles and team assignment
• Available to access from everywhere - as long as the user has
a valid API key
• Better performance and scaling
• Effortless maintenance and upgrades
• Premium support
3.3.4 Agents
Agents are processes, where flows are being executed. A l l agents
need to be registered either in Prefect Cloud or Prefect core backend.
Registered agents periodically poll the backend for work scheduled
to be done.
Agents can be deployed on user's local machine, executing flows
as processes or Docker containers, but also in Kubernetes, where flows
are executed as Kubernetes jobs or AWS ECS, where flows are executed
as AWS ECS Tasks.
While flows are being executed on agents, the work can be delegated
further to Dask using Executors9
.
3.4 Documentation and community
Prefect documentation is very well written, enriched by many code
snippets, and easy to understand. Its content can be divided into two
parts; one looking at the concepts with higher abstraction and the second
explaining concepts in more detail. User can find code examples,
video and written tutorials. Many companies wrote on their blogs1 0
https://docs.prefect.io/orchestration/server/overview.html#
prefect-server-vs-prefect-cloud-which-should-i-choose
8
https://authO.com/
9
https://docs.prefect.io/api/latest/executors.html#executors
1 0
https://www.prefect.io/why-prefect/case-studies/
14
3. PREFECT
about their experiences with Prefect and why they chose it or migrated
from other tools.
Prefect has grown a big community reflecting in quick answers
on Slack or contributions on GitHub.
3.5 Prefect Orion
On October 5th, 2021, Prefect announced the new generation of their
product, Prefect Orion[3] along with technical alpha, expected to release
in early 2022. The objective of Orion is to introduce dynamic,
DAG-free workflows; a better development experience, and transparent
and observable orchestration rules.
3.6 Summary
Prefect is a great workflow orchestration tool for users who can afford
to use the Prefect Cloud license model and are familiar with Python
programming. Especially, if the user can make use of Dask integration.
The transition from pure Python code to Prefect workflow is intuitive
and doesn't require many changes to the existing code.
The advantage of being able to run workflows without taking care
of backend infrastructure is huge and unique in the field of orchestration
tools. The setup process is clear and straightforward, the development
process is user-friendly with the option to test and debug
workflow code locally.
The concept documentation is brief, yet it tells the user everything
he needs to know.
Prefect team is actively pushing forward the feature development
giving the promise of future proof investment of time and effort
in adopting Prefect as the go-to workflow orchestrator.
15
4 Snakemake
Snakemake1
[4] [5] is a tool to create reproducible and scalable data
analyses created by bioinformatics, but aiming to be understandable
by everyone. The implementation of DAGs is highly influenced
by the G N U Make[6] paradigm; workflow is a set of rules applied
in a chain to transform inputs to outputs executed in isolated Conda2
environments.
4.1 Tasks
Snakemake's implementation of tasks is called rules3
. Rules are written
in a special file named Snakefile using custom, Python-based lan-
guage4
, aiming to be clear and human-readable without boilerplate
code around.
The main rule components are name, input, output, and action.
Some additional rule properties are available too, like specifying
Conda environment, output log file location, or parameters.
1 r u l e NAME:
2 i n p u t : " p a t h / t o / i n p u t f i l e "
3 o u t p u t : " p a t h / t o / o u t p u t f i l e "
4 s h e l l : " s o m e c o m m a n d u { i n p u t } u { o u t p u t } "
Listing 4.1: Snakemake rule template, executing shell command
The rule applies the action by taking the input and storing the output
in specified locations. The action can be a shell command, Python
script, or scripts in other languages like R, Julia, or Rust. Some users
may appreciate the integration with Jupyter5
notebooks too.
^ttps://snakemake.github.io/
2
https://docs.conda.io/en/latest/
3
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
4
https://snakemake.readthedocs.io/en/stable/snakefiles/writing_
snakefiles.html#grammar
5
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#
jupyter-notebook-integration
16
4. SNAKEMAKE
Input and output files don't need to be on the local host machine,
it could point to remote storages6
like S3, FTP or HTTP endpoint.
This remote storage is accessed via special functions called wrappers.
Wrappers7
encapsulate general actions that can be reused. There is
also a central wrapper repository where users can share their wrappers
with others.
Inputs and outputs can use wildcards8
and regular expressions
to make rules generic e.g. by processing all files with common pattern.
Each rule can have its own isolated run environment provided
by Conda or taking the isolation even further, each rule can be run
in its own container.
4.2 Workflows
The user doesn't specify what the workflow looks like, he just specifies
the target file and Snakemake determines rules that need to be applied
to generate the output. Alternatively, the user can specify the rule to be
executed. Invocation of workflows is done via CLI.
Snakemake workflows offer high customization in terms of resources
that can be provided to rule execution. Such options include
the number of threads, amount of memory, and disk usage.
4.2.1 Workflow execution
Workflow execution can be held locally on the user's machine or the execution
could be delegated to the cloud or cluster. When executing
remotely, inputs and outputs are stored on a shared filesystem or remote
storage like S3.
Workflows also support caching to avoid repeated workflow evaluations
on the same inputs. When executing in the cloud, these subresults
are stored on external storage.
https://snakemake.readthedocs.io/en/stable/snakefiles/remote_
files.html#remote-files
7
https://snakemake.readthedocs.io/en/stable/tutorial/additional_
features.html?highlight=wrapper#tool-wrappers
8
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#
wildcards
17
4. SNAKEMAKE
Workflow runs can be examined via generated H T M L report that
includes information like runtime statistics or visualization of the workflow
topology.
4.3 Deployment
Snakemake doesn't require any kind of long-running backend. Workflow
executions are invoked by the user on demand. The execution
environment is specified as an option to the CLI invocation.
4.3.1 API
Since Snakemake doesn't offer any centralized server thus no API
is exposed by default. However, there's an option to run workflows
with special flag to send execution details to a Panoptes9 , 1 0
server.
However, this feature is still under development.
4.4 Documentation and community
Snakemake's documentation1 1
is detailed and well written. It includes
tutorials with examples precisely explaining the workflow lifecycle.
The community is built mainly by bioinformatics and many scientific
papers reference Snakemake as their choice for data processing
tool used in the research. Questions about Snakemake are mainly discussed
on StackOverflow, currently capturing more than a thousand
threads.
https://getpanoptes.io/
1 0
https://snakemake.readthedocs.io/en/stable/executing/monitoring.
html
n
https://snakemake.readthedocs.io/
18
4- SNAKEMAKE
1 # configfile: "config.yamI"
2
3 • rule ail:
input:
expand(
"results/plots/(country}.hist .pdf"
country=config['countries"]
)
Legend
; domain knowledge
Q technical knowledge
0 Snakemake knowledge
# trivial
rule download data:
output:
"data/worIdcitiespop.csv"
log:
"logs/download.log"
conda:
"envs/curl.yaml"
shell:
"curl -L https://burntsushi.net/stuff/worldcitiespop.csv > {output} 2> {log}
rule select._by..country:
6
:
9
6
6
0
1
6
6
O
input: c
r
'data/worldcitiespop. csv" import sys
output: sys.stderr = open{snakemake.log[9], "w")
"resuIts/by-count ry/{country}.csv"
log: impart matplotlib.pyplot as pit
"logs/select-by-ecuntry/{country}.log" import pandas as pd
conda:"envs/xsv.yaml" cities = pd.read_csv(snakemake.input[9])
shell:
"xsv search -s Country '{wildcards.country}1 1
pit .hist(cities["Population"], bins=59)
" {input} > {output} 2> {log}"
pit.savefig(snakemake.output[9])
rule plot histogram:
input:
"resuits/by-country/{country}.csv"
output:
"result5/plots/{country}.hist.svg"
container:
"docker;//faizanbashir/python-data5cience:3. 5"
log:
"logs/plot-hist/{country}.log"
script:
"scripts/plot - hist.py"
rule convert_to_pdf:
input:
"{prefix}.svg"
output:
"{prefix},pdf"
log:
"logs/convert-to-pdf/{prefix}.log"
wrapper:
"0.47.o/utiis/cairosvg"
knowledge
trivial
• snakemake
• technics!
domain
r-j m m ID !--
category
Figure 4.1: Snakemake workflow example
(a) section shows how a workflow looks like, with each line labeled with
color depending on user knowledge, (b) DAG of jobs, the color of nodes are
corresponding with colors of rule names, (c) content of plot-hist.py script
referenced in 'plot_histogram' rule, (d) knowledge requirements for readability
by statement category.
Source:https://fl000research.eom/articles/10-33/vl#f3
19
4. SNAKEMAKE
4.5 Summary
Snakemake has earned its place in mostly bioinformatic circles by ensuring
the highly demanded reproducibility, robustness, but also scalability
via integration with high-performance clusters.
At first, it might be challenging for the user to get his head around
the rule definition grammar. Once the user gets hang of it, Snakemake
becomes a powerful and effective orchestration tool with strong synergy
with Jupyter notebooks. Adopting Snakemake might be a big
step forward for analysts using plain Python or bash scripts as data
processing pipelines.
Snakemake is aiming for the data science audience, rather than
mainstream companies. This is reflected by the absence of central
server and UI, rather focusing on effective task scheduling and individual
task execution performance on computation clusters and
grids.
20
5 Argo Workflows
Argo is a Kubernetes oriented orchestration tool, standing out by defining
workflows in a fully declarative way; extending Kubernetes Custom
Resource Definition2
. The design of the tool allows to run workflows
in massive scale, roughly thousands of workflows per day.
Argo is developed under Cloud Native Computing Foundation3
project. For anybody to use Argo effectively, one needs to be familiar
with Docker and have a really good understanding of Kubernetes
concepts.
5.1 Tasks
Argo tasks are represented as units called templates4
" and executed
exclusively in Kubernetes pods. Templates are closely tied to container
execution; running a container with arguments or having script
wrapped up and executed inside of a container. In the latter case,
script source code is stored in the Y A M L workflow definition file.
Other template types allow performing cluster resource operations.
Container and script templates are defined as Kubernetes container
specs with all related properties.
Tasks can share data using the combination of artifacts and Persistent
Volume Claims that is also used for storing workflow results.
Workflows and templates can be parametrized, also allowing to feed
the output of one task into inputs to other tasks.
https://axgoproj.github.io/
2
https://kubernetes.io/docs/concepts/extend-kubernetes/
api-extension/custom-resources/
3
https://www.cncf.io/
4
https://argoproj.github.io/argo-workflows/workflow-concepts/
#template-types
21
5. A R G O WORKFLOWS
5.2 Workflows
Workflows are declared in yaml files as Workflow,spec5
, the Kubernetes
specification of a workflow. The building blocks of a workflow
are templates and an entrypoint; the template executed first.
Submission of workflows to the server is done via Argo CLI, UI,
or REST API. Workflows are then stored as Kubernetes resources,
in EtcD6
, by default. Alternatively, database storage can also be con-
figured.
Workflows are represented by two types of template invocators7
:
• steps executing task templates sequentially
• DAG executing tasks respecting the dependencies between them.
Dependencies must be set manually.
Workflows also support caching to avoid redundant work and artifact
producing/consuming mechanisms.
Steps in a workflow may or may not be executed based on conditions,
like some exact value of a workflow parameter.
1 a p i V e r s i o n : a r g o p r o j . i o / v l a l p h a l
2 k i n d : W o r k f l o w
3 m e t a d a t a :
g e n e r a t e N a m e : h e l l o — w o r l d
5 spec:
6 e n t r y p o i n t : w h a l e s a y
7 t e m p l a t e s :
8 - name: w h a l e s a y
9 c o n t a i n e r :
image: d o c k e r / w h a l e s a y
11 command: [ c o w s a y ]
12 a r g s : [ " h e l l o u w o r l d " ]
Listing 5.1: Argo workflow running the whalesay container
5
https://argoproj.github.io/argo-workflows/fields/#workflowspec
6
https://kubernetes.io/docs/tasks/administer-cluster/
configure-upgrade-etcd/
7
https://argoproj.github.io/argo-workflows/workflow-concepts/
#template-invocators
22
5- A R G O WORKFLOWS
5.3 Deployment
The official and Argo-supported installation and upgrade process is
applying Kubernetes manifests located in the Argo GitHub repository
User can see what is being deployed and manifests can be customized
if needed. The main functionality components are the workflow controller,
which takes care of scheduling, and the Argo server, which
runs the API server and UI.
ARGO Workflow Overview
aitio namespace
/ workflow ^ (. , ,,\
( controller J ( * , 0
° "')
User Namespace
IrVfi wfl wfl
step step step
pcd pod pod
Main Container
Figure 5.1: Argo architecture schema
Source: https://argoproj.github.io/argo-workflows/architecture/
5.4 API
Argo server comes with exposed REST API allowing user to communicate
with the server via authenticated HTTP requests, which may
come in handy to use with other automation tools.
There are also officially supported client libraries to wrap API calls
for the user, written in Java, Python, and Golang.
23
5. A R G O WORKFLOWS
5.5 Documentation and community
Documentation is in some places brief and some concepts could be
described in more detail. The documentation is complemented by a really
nice user interactive tutorial covering the most important concepts
in the prepared Argo playground.
Argo GitHub repository contains many workflow examples and also
a very informative tutorial by examples.
5.6 Summary
Argo Workflows is a great orchestrating tool for anyone who already
invested in the Kubernetes environment and has workflow use-cases,
where steps are more complex, containerized units in order to utilize
Argo optimally. Workflow definitions might sometimes get tricky,
especially when the user is not used to writing Y A M L files but in return,
Argo's capabilities are quite unique.
24
6 Honorable mentions
This chapter is dedicated to workflow orchestration tools that earned
their place in the workflow orchestration tool circles but didn't make
it into the final comparison. The reason for this is that these tools were
too far from ANALÝZA use-case, or didn't provide any exceptional
functionalities compared to selected, more popular tools.
6.1 Luigi
Developed by Spotify Luigi1
uses Python to orchestrate long-running
batch jobs. Tasks are classes, where functions are used to define inputs,
outputs, dependencies and action. Flows can be executed locally or using
the central scheduler deployment. Flows are not defined explicitly,
Luigi creates them based on task requirements.
import l u i g i
class M y T a s k ( l u i g i . T a s k ) :
[param • l u i g i . Parameter(defauft=42'j !
| d e £ r e q u i r e s ) s e l f ) : ^ ^ ^ ^ j
i r e t u r n S o m e O t h e r T a s k ( s e l r > p a r a i n ) J
[ ~ d e f r u n t s e l f } : }
| f = s e l f . o u t p u t ( ) . o p e n ( ' w ' ) 1
i p r i n t » t t " h e l l o , w o r l d " I
I f . c l o s e < } i
i d e f o u t p u t * s e l f ) : /
I r e t u r n l u i g i . L o c a l T a r g e t ( ' / t m p / f r i o / b a r - % s . t x t ' % s e l f . ( i a r a m ) !
i f \ n a m e = = m a i n
\ l u i g i . r u n
The business logic of the task Whereat writes output What other tasks it depends on
Figure 6.1: Luigi workflow example
Source: https://luigi.readthedocs.io/en/stable/tasks.html
^ttps://luigi.readthedocs.io/en/stable/index.html
25
6. HONORABLE MENTIONS
6.2 Flyte
Flyte2
, a workflow orchestrator developed by Lyft, writing workflows
in Python and defining workflows and tasks by using ©workflow
and ©task decorators. Workflows are then executed in a Flyte cluster,
deployed locally, or in Kubernetes.
1 import t y p i n g
2 import pandas as pd
3 import numpy as np
4
5 from f l y t e k i t import t a s k , workflow
6
7 @task
8» def generate_normal_df ( m i n t , mean: f l o a t , slgma: f l o a t ) pd.DacaFrame:
r e t u r n pd.DataFrame({"numbers": np.random.normal(mean, sigma,size=n)})
10
11 @task|
12- d e f compute_stats(df: pd.DataFrame) -=• t y p i n g . T u p l e [ f l o a t , f l o a t ] :
13 r e t u r n f l o a t ( d f [ " n u m b e r s " ] . m e a n ( ) ) , f l o a t ( d f [ " n u m b e r s " ] . s t d ( ) }
14
15 ^workflow
16- d e f wf(n: i n t = 200, mean: f l o a t = O.O, Sigma: f l o a t = 1.0) -> t y p i n g . T u p l e [ f l o a t , f l o a t ] :
r e t u r n compute_stats(df=generate_normal_df(n=n, mean=mean, sigma=sigma))
Figure 6.2: Flyte workflow example
Source: https://docs.flyte.org/en/latest/getting_started/index.html
2
h t t p s : / / f l y t e . o r g /
26
6. H O N O R A B L E MENTIONS
6.3 Dagster
Dagster3
is a Python orchestration tool combining computation steps,
or ops as they call it, into jobs and further into graphs annotating Python
functions by @op, ©graph and ©job operators. Dagster's workflow
model is close to writing natural Python code, where nested function
calls create implicit dependencies. The core, long-running Dagster
deployment is composed of Daemon, Web Server, and Repositories.
Workflows are run by Executors, e.g. Docker, Kubernetes, or Celery
executors.
• addons
•
A single outout can be passed to multiple hputs on downstream ops. In this example, the outpjt from the tirst op is passed to two different ops. Tie outputs ot those ops are combined and passed to the final op.
Figure 6.3: Dagster workflow example
Source: https://docs.dagster.io/concepts/ops-jobs-graphs/jobs-graphs
'https://dagster.io/
27
6. HONORABLE MENTIONS
6.4 Nextflow
Nextflow4
, similarly to Snakemake, defines workflows using custom
Domain Specific Language as an extension to the Groovy5
programming
language, which is a super-set of Java. Workflows are composed
of channels and processes, where each process has inputs, outputs,
and action. Workflows are executed locally or remotely via Nextflow
driver, living in a Kubernetes cluster, which emits workflow jobs.
// Declare syntax v e r s i o n
nextflow.enable.ds1=2
// S c r i p t parameters
params,query = '7sorne/data/sample,f a"
params.db = "/some/path/pdb"
process blastSearch {
input:
path query
path db
output:
path "top h i t s . t x t "
biastp -db $db -query $query -outfmt 6 > biostresult
cat biast_resuit j head -n 1& j cut -f 2 > top_hits.txt
y
process e x t r a c t T o p H i t s {
input:
path t o p h i t s
outputi
path "sequences.txt"
blastdbcmd -db $db -er\try_batch $top_hits > sequences.txt
}
workflow {
def query ch = Channel.fromPath(params.query)
b l a s t S e a r c h ( q u e r y _ c h J params.db) | e x t r a c t T o p H i t s | view
}
Figure 6.4: Nexflow workflow example
Source: https://www.nextflow.io/docs/latest/basic.html
4
https://www.nextflow.io/
5
https://groovy-lang.org/
28
7 ANALÝZA platform and new orchestrator service
requirement analysis
Project ANALYZA[1] is a complex computational platform designed
to handle the whole process of various large-scale data analyses,
from storage, through computation to visualization. A N A L Y Z A places
emphasis on scalability, flexibility, and reliability.
The system architecture is composed of three core components:
• Data warehouse - where all the data to be analyzed are centralized
on one place.
• Computation subsystem - set of data analyzing and transforming
steps wrapped into data workflows and managed by the orchestrator
service.
• Visualisation component - provides multiple views on results
of data analysis and initiates analytical operation executions.
O
Cloud
infrastructure Local data warehouse
s .2
OH ä
55 3
•n 'ji
P >
Metadata
storage
Distributed
storage
Interactive rfo^
analytical modules <-u%
^r)
Remote data
warehouse
Remote data
warehouse
Automatic
analytical modules
Orchestrating service
Figure 7.1: ANALÝZA platform architecture
Source: [1]
This chapter is dedicated to the analysis of the Computation subsystem,
especially the orchestrator service and its functionalities.
The computational subsystem is a Kubernetes service running ondemand
analytical modules (tasks), composed by the orchestrator
29
7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS
service. Individual analytical modules are composed into analytical
operations (workflows), transforming data for further data analysis.
The orchestrator service runs module containers as Pods in the Kubernetes
namespace in correct order with all needed inputs.
7.1 Orchestrator service
The orchestrator service was created as a part of the A N A L Y Z A platform
since at that time the field of data orchestration tools was lacking
in tools fulfilling the project requirements. As time moved on, many
new workflow orchestration tools emerged and there is now an option
to swap from the current managed solution to a community-driven
open-source service, lifting the burden of developing and maintaining
the current orchestrator by the A N A L Y Z A team.
7.1.1 Analytical modules
Tasks, represented by analytical modules, are Docker containers, further
composed into analytical operations. Two types of analytical modules
are available:
• Job - a short-running unit of work performing some kind of data
processing action with a strictly bounded lifecycle. Jobs are executed
as Kubernetes Jobs.
• Service - long-running module, whose execution is not blocking
other modules and is running until the end of the entire analytical
operation. Services are executed as Deployments and depending
on the required internal/external ports, a Service of type
ClusterlP/NodePort respectively.
The analytical module is defined by its type (Job or Service), Docker
image, exposed container ports, parameters, and Kubernetes resource
requirements. Property IsBlocking determines whether the module is
running as a daemon1
.
daemon - service running in the background
30
7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS
7.1.2 Analytical operations
Operation definitions are written in JSON format and stored in the definition
database (PostgreSQL, MySQL, or SQLite). The operation contains
a list of nodes and edges, representing dependencies between individual
modules. Nodes are wrappers for analytical modules. Edges
define dependencies between Nodes (modules) in a similar manner
as in directed acyclic graphs.
{
"modules": [
{
"name": " m o d u l - t e s t - 1 " ,
"type": " s e r v i c e " ,
" d e s c r i p t i o n " : " t e s t i n g s e r v i c e " ,
"image" : " t e s t ^ " ,
"tag": "1.0",
" i s B l o c k i n g " : true,
" h e a l t h P o r t " : 6069,
" p o r t s " : [
"name": " p a r t i " ,
"number": 8380,
" i s P u b l i c " : t r u e
}
),
"parameters": [
-t
"name" : "paratnl",
" d e f a u l t V a l u e " : " v a l u e l " ,
" d e s c r i p t i o n " : "paraml d e s c r i p t i o n " ,
" i s S t r i c t " : t r u e
" r e s o u r c e s " : {
"cpus1 1
: 1,
"ram": 250
]
Figure 7.2: Analytical operation definition example
Source: [1]
31
7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS
Orchestrator service comes with 2 CLI tools to operate on opera-
tions:
• Client CLI - provides communication with deployed orchestrator
service with start/stop /get resource commands.
• Designer CLI - serves CRUD operations on definition database.
Additional to the client CLI, and the preferred way, to invoke operations
is the REST API. The REST API exposes endpoints to manage
all the orchestrator resources, like starting and stopping operations
or listing running operations and modules.
Orchestrator service tries to optimize workflow execution by analyzing
the operation to be run and if possible, reusing already running
modules started by other operations.
7.1.3 Deployment
Orchestrator service deployment is simple containing:
• ConfigMap2
- contains input parameters for the service like
database credentials, data warehouse address or kubernetes
config used further when executing operations (this is needed
because orchestrator is managing Kubernetes resources).
• orchestrator deployment - specifies the orchestrator image and tag
to be deployed. Serves the A P I and takes care of operation
scheduling.
• orchestrator-svc service - orchestrator service is exposed as a single
Kubernetes Service to expose REST API.
https://kubernetes.io/docs/concepts/configuration/configmap/
32
7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS
7.2 Requirement analysis
Based on the orchestrator analysis, requirements for its successor can
be formulated:
• Kubernetes support - individual analytical modules are encapsulated
in virtualized containers for easier reusability and ensuring
scalability and resource management.
• parametrized workflows - analysis steps may be experimental
and is required to run workflows with different input parameters
to achieve the most accurate outputs.
• dependencies between tasks - analysis may be a simple chain
of steps, but also a complex computation tree with some steps
strongly dependent on outputs of previous steps. The orchestrator
service must be able to respect such dependencies.
• inter-task communication - dependencies between modules can
be of many types, one such can be a step requiring normalized
data stored somewhere in the data warehouse by the previous
workflow step, another may be just using computed parameter
from previous step. Orchestrator service should provide wide
range of possibilities to ensure such communication.
• publicly available services - it is a common use case in A N A LÝZA
project to run various types of analysis on large datasets
where, as a result of the analysis, is a service endpoint exposed
to the internet; outside of the Kubernetes cluster the analysis
was run on. This use case introduces a non-trivial configuration
that chosen workflow engine must fulfill.
• isolation to secure data - subject data to be analyzed may be
of a sensitive character and it may be vulnerable to sending
data to some cloud storage. It is required that the orchestrator
service is able to run in isolated environments that may even be
disconnected from the internet.
• security - not only processed data should remain secure, but also
the service itself and also communication with the service
33
7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS
• open-source - the open-source character of the solution can
bring many advantages, like the ability to view the source code,
suggest improvements and bug fixes, or being free of charge.
• actively developed - when introducing any new technology
into the existing project stack, it is appropriate to have a vision,
that the chosen technology has a strong and stable developer
base (especially in open-source projects) and will be supported
in the future to prevent the need to replace the tool with new
solution in near future.
• scalability - new orchestration solution should be able to scale
well while maintaining high availability.
• API - analytical operations are in most use-cases invoked via REST
API. new orchestration solution should keep this contract.
7.3 New orchestration tool choice
From the list of researched tools, most of them are to some extent
meeting the requirements. One of them, Argo, has exceptional support
for Docker and Kubernetes, while the rest of reviewed tools offer
Docker/Kubernetes support just as one of the execution options; that
means taking workflow source code and wrapping it in a container
and running it in the Kubernetes cluster. On the other side, Argo is fully
focusing on Kubernetes, and workflow definition is close to the current
orchestrator. Taking these facts into account, Argo seems to be the best
candidate for the new orchestrator role.
34
8 Argo Workflows fundamentals
Argo was introduced on a high level in chapter 5. This chapter explains
Argo Workflows fundamentals in detail in order to explain Argo's
application in the ANALÝZA orchestrator service. From now on, Argo
Workflows will be shortened to Argo, to avoid confusion with other
projects from Argo portfolio1
.
8.1 Workflows
Workflows, and other workflow derivatives like WorkflowTemplate
(see section 8.1.4), CronWorkflows2
, etc., are implemented as Kubernetes
Custom Resources3
, thus their definitions follow Kubernetes
object manifest format.
1 apiVersion: argoproj. io / v1 alphal
2 kind: Workflow
3 metadata:
generateName: hello-world -
spec
5 spec:
6 entrypoint: whalesay
template
7 templates:
- name: whalesay
9 container:
image: docker/whalesay
command: [ cowsay ]
12 args: [ " h e l l o u w o r l d " ]
# new type of k8s spec
# name of the workflow
# invoke the whalesay
# name of the template
Listing 8.1: Argo workflow example
^ttps://axgoproj.github.io/
2
https://axgoproj.github.io/argo-workflows/cron-workflows/
#cron-workflows
3
https://kubernetes.io/docs/concepts/overview/
working-with-objects/kubernetes-objects/
35
8. A R G O WORKFLOWS FUNDAMENTALS
The spec part is where Argo-specific configuration takes place. The core
structure of spec is defined as templates and an entrypoint.
Entrypoint is the root of the workflow, and there can be multiple
entrypoints of one workflow definition (in the same manner as DAGs
can have multiple roots, i.e. multiple vertices with in-degree equal
to zero). Default entrypoint can then be overridden on submission.
Every step of a workflow is a template and there are several types
of templates, further divided into two groups:
1. Template Definitions - steps from which the workflow is com-
posed:
(a) container4
(A.1) - the most common template type. It specifies
the Docker container to be run as a workflow step.
Container input arguments, image, and others are configured
here.
(b) script5
(A.2) - extends the container template by injecting
script source code to the container to be executed. This may
come in handy for simple tasks, where it's not necessary
to build a complete Docker image to run a few lines of code.
The source code is stored directly in the definition Y A M L
file, so for the sake of readability, the script should stay
short and clear.
(c) resource6
(A.3) - manages Kubernetes objects directly. This
is basically reduced kubectf tool functionality built directly
into the workflow. Supported operations are to get, create,
apply, delete, replace, or patch cluster resources. Kubernetes
manifest to be applied is specified directly as one of the template
configuration fields.
(d) suspend8
(A.4) - pauses the workflow execution for a specified
amount of time or indefinitely until user input is prohttps://axgoproj.github.io/axgo-workflows/f
ields/#container
5
https://argoproj.github.io/argo-workflows/f ields/#scripttemplate
6
https://argoproj.github.io/argo-workflows/f i e l d s /
#resourcetemplate
7
https://kubernetes.io/docs/reference/kubectl/
8
https://argoproj.github.io/argo-workflows/f ields/#suspendtemplate
36
8. A R G O WORKFLOWS FUNDAMENTALS
vided. The workflow can then be resumed via the CLI, API
call, or UI.
2. Template Invocators - compose Template Definitions into workflows
based on the complexity of workflow dependencies:
(a) Steps(A.5) - execute a serie of tasks in the order they are
defined. Steps can run sequentially or in parallel.
(b) DAG(A6) - the more complex invocator. Composes tasks
into Directed Acyclic Graph, respecting all the dependen-
cies.
Templates can be nested together, it is even possible to run a workflow
as a part of another workflow.
8.1.1 Input and Outputs
In most cases, templates will have specified inputs and outputs (these
can be chained together; one task can take the output of the previous
task as its own input).
There are two types of arguments:
• Parameters(A.7) - string/integer values.
• Artifacts (A.8) - arbitrary resources like files or archives loaded
from remote storage like S3, and then mounted on the specified
location of a container.
When talking about input arguments, it's important to distinguish
between workflow-level and template-level arguments.
• workflow arguments - usually the user input to the workflow invocation,
defined at the root of the Argo definition spec; on the same
level as entrypoint
• template arguments - defined for each template (step). These arguments
can be referencing values of workflow-level arguments
or other workflow variables.
37
8. A R G O WORKFLOWS FUNDAMENTALS
Workflow level parameters, as one of the workflow argument options,
offer an option to alter workflow behavior without changing
the workflow source. Parameters have their default values hardcoded
in the workflow source and can be overridden when submitting workflow.
Parameters, also as other workflow variables9
can be referenced
anywhere in the Y A M L workflow source file, using special double
curly bracket syntax, e.g. {{workflow.parameters.size]} when size is defined
as workflow parameter.
8.1.2 Workflow management
Argo resources can be controlled in several ways:
• Argo CLI - requires access to the Kubernetes cluster Argo is
running on.
• REST API - connects to Argo server exposed API. Most suitable
for automation or integration with other tools.
• UI - together with workflow management, it gives the user insight
into the state of workflows or view workflow visualization.
Best suitable for workflow development and debugging.
• kubectl - since Argo resources are Kubernetes CRD, they can be
implicitly managed by kubectl. However, this option is not recommended
since is doesn't provide syntax validation the other
options do.
8.1.3 Workflow execution
Naming
Each workflow step is executed in a separate Pod whose name follows
specific format:
<workflow_name>-<workflow_id>-<pod_id>
https://axgoproj.github.io/argo-workflows/variables/
#workflow-variables
38
8. A R G O WORKFLOWS FUNDAMENTALS
Namespace
If not specified otherwise, workflows are executed in the default Kubernetes
namespace. This setting can be overridden in the workflow definition
file, specifying the namespace field under metadata, in the same
manner as the rest of the Kubernetes resources. Another way to choose
the target namespace is to set it as an invocation parameter in UI,
or as a command line parameter when using Argo CLI.
Service account
By default, all workflow Pods run under the default service account
in the namespace where workflow is executed. However, for security
reasons, it's recommended to override this setting and create service
accounts with respective roles for each type of workflow, depending
on the scope and requirements. There is a defined minimum set
of permissions required for all executing service accounts:
1 a pi Vers ion: rbac. a u t h o r i z a t i o n . k8s. io /v1
2 kind: Role
3 metadata:
4 name: executor
5 rules:
6 - apiGroups:
7 - argoproj.io
8 resources:
9 - workflowtaskresult
10 verbs:
11 - create
12 - patch
Listing 8.2: Minimum Role persmissions required for Service Account
to run workflows
A good idea on how to divide service accounts is by provided
secrets. For example, the user may want to create a service account
's3-executor' with secrets needed to authenticate to S31 0
. This account
would then be used for all workflows that need to connect to S3.
https://aws.amazon.com/s3/
39
8. A R G O WORKFLOWS FUNDAMENTALS
ARGO WORKFLOW CONTROLLER DESIGN
Figure 8.1: Argo workflow controller architecture
Source: https://argoproj.github.io/argo-workflows/architecture/
8.1.4 Workflow templates
Workflow templates are a vital part of the effective usage of Argo Workflows.
By default, the workflow's lifecycle is consisting of submitting
and running the workflow on demand. For better reusability, workflow
templates were introduced and they allow the user to upload
and store workflow definitions directly on the Argo server, removing
the need to send the whole definition on each submit. It's even possible
to reference workflow templates from other workflows.
Workflow templates will play important role in ANALÝZA orchestrator
requirement implementation.
It's crucial to understand the difference between template and WorkflowTemplate.
WorkflowTemplate is the definition of the whole workflow
stored in the cluster, but a template is a step of a workflow
of which the whole workflow is composed. This naming collision
is quite unfortunate and the user needs to get used to it.
40
8. A R G O WORKFLOWS FUNDAMENTALS
8.2 Ul
Argo UI serves most of the operations managed by the user, like submitting
workflows, updating workflow definitions, or viewing past
workflow runs. Workflow run visualization comes in handy when
debugging new workflows or investigating failed runs. Logs from individual
workflow steps, together with inputs/outputs are available
to view too. There are also tabs managing integration with other Argo
projects (when applicable).
Workflows argo da g-dia mon d-2zp7l
V3.2.4 Q RESUBMIT T f DELETE T = LOGS T <SHARE
Tv i Q, ^ n r [ ] Search
should-
exeojte-1
Ojag-diamond-
2zp7t
©A
oC
oshould-riot-
execute
©should-
execute-2
©should-
execute-3
WORKFLOW DETA _3
m * H A
SUMMARY CONTAINERS INPUTSfOUTPUTS
NAME dag-diamond-2zp7l_C 1
TYPE Pod
POD NAME dag-diamond-2zp7l-4G 37132611 B
HOST NODE NAME docker-desktop •
PHASE 0 Failed
MESSAGE Error (exit code 1)
START TIME 2m ago
EMC "II'.'IE 1m ago
DURATION 2s
MEVIC-IZA"IC;S
RESOURCES ls*(1 cpu).l5*(100Mi memory) O
Figure 8.2: Argo UI displaying the state of a workflow run
41
8. A R G O WORKFLOWS FUNDAMENTALS
8.3 Auth
Argo server configuration offers three authentication modes:
• server - no authentication required; anyone who can access UI
or API, can operate with workflows
• client - in order to access the server, the user needs to provide
his service account bearer token associated with the role he is
supposed to have
• SSO - having configured the OIDC provider correctly (this includes
having set Kubernetes secrets to the provider and referenced
in the 'workflow-controller' ConfigMap together with
the OIDC provider address), the user is able to authenticate using
the OAuth2 protocol. RBAC is then managed using groups1 1
.
Argo server supports encrypted communication via TLS, enabled
by the server passing a security flag to the server initialization com-
mand.
8.4 User management
When the Argo server is set to run in client auth mode, the user needs
to provide an authentication token in order to make any kind of action.
This token is the Kubernetes bearer token of a service account.
This means each user needs to have a service account created. The role
associated with the given service account determines what the user
can do in the system. Since all Argo resources are implemented as Kubernetes
resources, the usual Kubernetes access restriction1 2
applies.
User can have restricted access just to some kinds of Argo resources,
like Workflows or WorkflowTemplates, or even allowed resources can
be specified by concrete names.
Users can also be restricted to only reduced set of actions, for example,
user can be restricted just to view workflows, not create new
ones.
https://axgoproj.github.io/argo-workflows/argo-server-sso/
#sso-rbac
1 2
https://kubernetes.io/docs/reference/access-authn-authz/
authentication/
42
8. A R G O WORKFLOWS FUNDAMENTALS
1 # The r e s t r i c t i o n c o u l d be t i g h t e n e d f u r t h e r by u s i n g
' resourceNam.es'
2 apiVersion: rbac . a u t h o r i z a t i o n . k8s . io/v1
3 kind: Role
4 metadata:
name: submit-workflow-template
6 rules:
7 - apiGroups:
8 - argoproj.io
9 resources:
io - workfloweventbindings
n verbs:
12 - l i s t
13 - apiGroups:
14 - argoproj.io
15 resources:
16 - workflowtemplates
17 verbs:
18 - get
19 - apiGroups:
20 - argoproj.io
21 resources:
22 - workflows
23 verbs:
24 - create
Listing 8.3: Role definition example with enough permissions to submit
a workflow template.
43
8. A R G O WORKFLOWS FUNDAMENTALS
8.5 Argo API
Argo API can be controlled in two ways:
• REST API - Argo server exposes rich RESTful API allowing
the user to fully control Argo workflows using HTTP requests.
When the Argo server is run in client mode, requests need to have
filled the Authorization header with the respective bearer token.
• client libraries - Argo is also controllable using client libraries,
officially supported by the Argo team. Supported languages
are Go, Java and Python. When managing Argo via code, it's arguably
more comfortable to use the client library instead of REST
API, since it makes the request construction, response parsing,
and authentication much easier. Client libraries are basically
a wrapper around GRPC communication between the client
and Argo server. Argo CLI is under the hood using the Go client
library.
8.6 Argo deployment
Argo is deployed to the cluster by applying a set of Kubernetes manifests
provided by the Argo team as a part of each release. Manifests
can be found in the Argo GitHub repository in the 'Releases' section.
These manifests define service accounts and related roles required
to run the server, the workflow controller, and some custom resource
definitions.
Server configurations are stored in ConfigMap passed to the workflow
controller Deployment. Available configuration options1 3
need
to be set in deployment manifests before applying to the cluster.
https://axgoproj.github.io/argo-workflows/
workflow-controller-configmap.yaml
44
8. A R G O WORKFLOWS FUNDAMENTALS
The deployment consists of two main services, each is required
to run under a service account with sufficient permission role:
• argo-server - hosting UI and API
• workflow-controller - scheduling workflows and communicating
with Kubernetes API
Currently, there are three variants of deployment, depending on the scope
Argo operates on:
• cluster - Argo manages workflows on all cluster namespaces
• namespaced - Argo manages workflows only in the namespace
Argo is deployed in
• managed namespace - Argo manages workflows in the namespace
Argo is deployed in and those specified in workflow-controller
ConfigMap
It's common practice to have multiple Argo deployments across Kubernetes
namespaces, for performance, security, or use-case distribution.
The bigger the scope Argo operates on, the more permissions must
be elevated for workflow-controller.
For the UI to be available publicly, the argo-server Service needs
to be patched to be of type NodePort to use the node's address or LoadBalancer
to get an IP address from the cluster load balancer provider.
This can be patched on runtime, or even better to specify this in deployment
manifests before Argo is installed.
8.6.1 Scaling
Argo can be scaled up in several ways:
• increase QPS value - the bigger QueriesPerSecond, the more
requests can Argo send to the Kubernetes API.
• increase burst value - allows to temporarily exceed QPS value.
• increase number of workers - event workers are processing
scheduled workflows/pods. The more workers, the faster is
the scheduler queue being processed.
45
8. A R G O WORKFLOWS FUNDAMENTALS
• increase CPU/memory for controller - increasing the number
of event workers may result in higher CPU/Memory demands.
• run multiple workflow controllers - when running a single workflow
controller replica, in case of failure, service disruption may
occur until a new Pod is created. When running multiple replicas,
when the main replica fails, it can be replaced by backup
replicas.
• run multiple Argo servers - in case of really high API/UI request
volume, some requests may get lost. To prevent this, run multiple
Argo server replicas.
• utilizing Pod disruption budget1 4
on workflow controller.
https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
#pod-disruption-budgets
46
8. A R G O WORKFLOWS FUNDAMENTALS
8.7 Secrets
Argo works with secrets via standard Kubernetes secret mechanism1 5
.
Each workflow runs under the specified service account (or default,
if not set explicitly) and each service account has secrets associated
with it. The workflow template can then reference these secrets in
the workflow definition and secrets are then mounted as environment
variables to the container.
1 ...
2 templates:
3 - name: p r i n t - s e c r e t
4 container:
5 image: alpine:3.7
command: [sh, -c]
7 args: [ ,
echou "secretu fromu env:u PASSWORD";']
8 env:
9 - name: PASSWORD
10 valueFrom:
n secretKeyRef:
name: my-secret
13 key: mypassword
Listing 8.4: Template snippet, demonstrating secret referencing from
template definition
'https://kubernetes.io/docs/concepts/configuration/secret/
47
9 Requirements implementation in Argo
Most of the requirements are fulfilled trivially just by using Argo's
built-in functionalities described in chapter 8. However, some are more
complex and need some more smart compositions of Argo features.
9.1 Public services
The most complex requirement to implement is to expose the service
endpoint as a part of a workflow step to the internet. This requires
correct Kubernetes configuration together with keeping this configuration
simple and easy to reuse. Before the demonstration of how
this can be achieved, some Kubernetes theory needs to be reminded.
9.1.1 Kubernetes resources revised
Pod
The most basic Kubernetes unit is a Pod, which is in fact an encapsulation
of one or more containers, mostly of Docker runtime. When
multiple containers are running in a Pod, they share a common set
of Linux namespaces, groups, and other facets of isolation. A lifecycle
of a pod is tightly tied to the state of the containers running within.
Nodes
A node is a worker machine that runs Kubernetes workloads. It can
be a physical (bare metal) machine or a virtual machine (VM)
Workloads
Although it is possible to define manifests for Pods, it's more common
to manage Pods using Workloads like Deployments, StatefulSets,
or Jobs. These workloads manage Pods to meet the desired state defined
in the Workload definition. Such state can consist of a number
of available Pods running containers in the latest versions, managing
stateful and stateless applications, or running Jobs.
48
9. REQUIREMENTS IMPLEMENTATION IN A R G O
Service
Pods are logically grouped in namespaces1
are part of an internal
network using internal DNS. From the Kubernetes documentation:
"In Kubernetes, a Service is an abstraction that defines a logical set ofPods
and a policy by which to access them (sometimes this pattern is called
a micro-service). The set of Pods targeted by a Service is usually determined
by a selector."
9.1.2 Create service in Argo
Now, with the knowledge of how networking in a Kubernetes cluster
works and what needs to be done, the goal is to find a way how to create
a proper Kubernetes service as a part of a workflow. Fortunately, Argo
comes with an easy way of managing Kubernetes resources, a resource
template. A n example of such template is shown in listing 9.1.
https://kubernetes.io/docs/concepts/overview/
working-with-objects/namespaces/
49
9. REQUIREMENTS IMPLEMENTATION IN A R G O
1 - name: p i — t m p l
2 r e s o u r c e :
a c t i o n : c r e a t e
4 m a n i f e s t : |
5 a p i V e r s i o n : v l
k i n d : P o d
m e t a d a t a :
name: n g i n x
l a b e l s :
app . k u b e r n e t e s . io/name: p r o x y
11 s p e c :
12 c o n t a i n e r s :
13 - name: n g i n x
14 i m a g e : n g i n x : 1 1 . 1 4 . 2
15 p o r t s :
16 - c o n t a i n e r P o r t : 80
17 name: http—web—svc
18
19 a p i V e r s i o n : v l
20 k i n d : S e r v i c e
21 m e t a d a t a :
22 name: n g i n x — s e r v i c e
23 s p e c :
24 t y p e : N o d e P o r t
25 s e l e c t o r :
26 app . k u b e r n e t e s . io/name: p r o x y
27 p o r t s :
28 - name: n a m e — o f — s e r v i c e — p o r t
29 p r o t o c o l : T C P
30 p o r t : 80
31 t a r g e t P o r t : http—web—svc
32 n o d e P o r t : 30007
Listing 9.1: Kubernetes Service creation as a workflow step
Specifying service in each workflow like shown above would work,
but it makes the template less readable and brings complexity to team
members writing workflows, but not that familiar with the rest of Kubernetes
mechanisms. Fortunately, Argo's WorkflowTemplate feature
can easily solve this problem.
Template creating new service can be turned into parametrized
WorkflowTemplate. Usable parameters to this template might be opened
ports, container to run in the deployment, or resource labels.
Create service template breakdown
A service in Kubernetes is composed of Deployment (or other workload)
and Service.
The parametrized template example to create a Service is shown
in listing 9.2
50
9. REQUIREMENTS IMPLEMENTATION IN A R G O
1 - name: e x t e r n a l - s v c
2 inputs:
3 parameters:
4 - name: app
5 - name: port
6 resource:
7 action: create
8 setOwnerReference true
9 manifest:
10 apiVersion v1
11 kind: Service
12 metadata:
13 name: s e r v i c e - external - {{inputs, parameters .
app}}
14 spec:
15 type: NodePort
16 selector:
1 7 name: {{inputs . parameters . app}}
18 ports:
19 - port: {{inputs . parameters . port}}
Listing 9.2: A reusable, parametrized template (step) that creates new
Service of type NodePort
The template definition is pretty straightforward, parameters are
referenced using double curly braces syntax as has been shown in
8.1.1.
Kubernetes Service needs Pods to handle service calls and this can
be managed e.g. by Deployment. The Deployment creation template is
defined in a similar manner as the Service template. The selector field
in the Service definition needs to match the name label in the metadata
section of the Deployment manifest. A n example of such template is
shown in listing 9.3.
51
9. REQUIREMENTS IMPLEMENTATION IN A R G O
- name: d e p l o y m e n t
2 i n p u t s :
3 p a r a m e t e r s :
4 - name: app
5 - name: i m a g e
6 - name: t a g
7 - name: env
- name: p o r t
9 - name: a r g s
10 r e s o u r c e :
11 a c t i o n : c r e a t e
12 s e t O w n e r R e f e r e n c e : t r u e
13 m a n i f e s t : |
14 a p i V e r s i o n : a p p s / v l
15 k i n d : D e p l o y m e n t
16 m e t a d a t a :
17 name: { { i n p u t s . p a r a m e t e r s . a p p } }
18 s p e c :
19 s e l e c t o r :
20 m a t c h L a b e l s :
21 name: { { i n p u t s . p a r a m e t e r s .app}}
22 t e m p l a t e :
23 m e t a d a t a :
24 l a b e l s :
25 name: { { i n p u t s . p a r a m e t e r s .app}}
26 s p e c :
27 c o n t a i n e r s :
28 - name: { { i n p u t s . p a r a m e t e r s . app}}
29 i m a g e : { { i n p u t s . p a r a m e t e r s . i m a g e } } : { { i n p u t s . p a r a m e t e r s . tag}}
30 p o r t s :
31 - c o n t a i n e r P o r t : { { i n p u t s . p a r a m e t e r s . p o r t } }
32 a r g s : [ " { { i n p u t s . p a r a m e t e r s . a r g s » " ]
33 e n v F r o m :
34 - c o n f i g M a p R e f :
35 name: { { i n p u t s . p a r a m e t e r s . env }}
Listing 9.3: A reusable template that creates new Deployment
Using two templates defined above is enough to create a Service
available from public internet. However, notice the Service only exposes
a single port, which may not fit all the use cases for Services.
A n easy fix for this problem is to use another template, which applies
patch2
to the existing Service opening even more ports, if possible.
https://kubernetes.io/docs/tasks/manage-kubernetes-objects/
update-api-object-kubectl-patch/
52
9. REQUIREMENTS IMPLEMENTATION IN A R G O
1 - name: patch
2 inputs:
3 parameters:
4 - name: names
5 - name: app
6 - name: portValue
7 - name: portName
8 resource:
9 action: patch
10 manifest:
11 apiVersion v1
12 kind: Service
13 metadata:
14 name: s e r v i c e - external - {{inputs, parameters .
names}}
15 spec:
16 type: NodePort
1 7 ports:
18 - port: {{inputs . parameters . portValue }}
19 targetPort: {{inputs . parameters .
portValue}}
2 0 name: {{inputs . parameters . portName}}
2 1 selector:
2 2 name: {{inputs . parameters . app}}
Listing 9.4: A reusable template that patches existing service
The number of ports can be variable so some kind of iteration
through the parameter list may come in handy and Argo allows that
using the WithParatn clause.
Invoking the patch template defined above with variable number
of parameters is shown on the code shown in listing 9.5:
53
9- REQUIREMENTS IMPLEMENTATION IN A R G O
1 steps:
2
- - name: patch
4 template: patch
5 arguments:
6 parameters:
- name: app
8 value: '{{inputs.parameters.app}}'
9 - name: portValue
10 value: '{{item.value}}'
n - name: portName
12 value: '{{item.name}}'
13 withParam: '{{inputs .parameters .ports}}'
Listing 9.5: A n invocation of reusable template that patches existing
service
Notice the usage of WithParam field. It iterates through given input
parameter and individual configurations can be then referenced
as items. Such invocation of complex input parameters can be defined
as follows:
1
2 arguments:
3 parameters:
4 - name: ports
5 value: |
6 [
7 { "name": "kafka", "value": 9092 } ,
8 { "name": "https" , "value": 443 }
9 ]
Listing 9.6: Example of passing structured items as parameters
to template
Parameters then behave as key-value pairs assigned to each item.
A complete example of WorkflowTemplate creating Service can be
found in the attachment. The beauty of these templated solutions is
in a variety of modifications. Such modifications could be adjustments
54
9- REQUIREMENTS IMPLEMENTATION IN A R G O
to naming, labeling, or changing the garbage collection strategies,
depending on the conventions established in the Kubernetes cluster
where workflow executions will be held.
9.1.3 Garbage collection
When the Service is up and running, it may run for some period of time
that may not be known prior to definition/execution. Such situation
may be solved by a few different approaches. One may be to leave
the service running and then run some periodic jobs in Kubernetes
cluster that would clear Services that are no longer used. This would
require defining rules, on when to delete the Service.
Another approach may be taken utilizing Argo functionalities:
• utilizing workflow property onExit, which specifies template
executed at the very end of the workflow (similar to the defer
mechanism used in Go3
). Each workflow then could define
a cleanup template of type resource that would remove all resources
created during the workflow execution. The catch here
is to keep the workflow running for some time, so the cleanup
template is not executed too early. This can be achieved by setting
template of type to suspend as the last template executed.
• utilizing setOwnerReference property of resource template. Setting
this value to true, Argo instruments Kubernetes garbage collection
to destroy created resource, when the whole workflow ends
and is being garbage collected.
9.2 Custom API
Although Argo provides REST API by default with installation, it's
convenient to create another layer of API between Argo and the user.
The main reason for this is that requests and responses of Argo are
quite verbose and might be hard to navigate in. A n API with a reduced
set of requests/responses was created to better fit the needs
of the ANALÝZA project. This means the set of available requests has
been cut down to basic operations only with mandatory information
'https://go.dev/tour/flowcontrol/12
55
9- REQUIREMENTS IMPLEMENTATION IN A R G O
and actions. These actions include mainly workflow management like
creating, stopping, or viewing workflows. The API contract is shown
in the table 9.1.
Apart from reduction of complexity, the additional API layer also
creates an abstraction of the orchestration tool currently used. This means,
that in case of a decision to move into another workflow orchestrator
tool happens, the contract between API and consumer will be able
to stay the same. This is ensured because API hides the implementation
details of the used orchestration tool and its contract between
API and consumer is reduced to the minimum set of functions with
parameters always present in each orchestration tool.
Method URL Description
Get /v2/workflow/{name} Gets workflow status
Get /v2 /workflow/running
Gets all running
workflows
Get /v2 /workflow / {definitions }
Gets all runnable
workflow definitions
Get /v2/workflow/definition/ {name}
Gets single runnable
workflow definition
Post /v2 /workflow /start Invokes workflow
Delete /v2 /workflow/ {name}
Stops workflow
execution
Table 9.1: Custom Argo API Contract
9.2.1 Changes to the API
Some API contract changes were required when compared to the original
A N A L Y Z A orchestrator API. The reason for changes is that the old
API[1] was designed to reflect the old orchestrator implementation
and its details (e.g. concept of nodes, modules,..). Even though the API
has changed, its functionality remained mostly unchanged. The only
breaking change is the loss of control of workflows on the module/step
level. However, it's possible to adjust the custom API to parse this
information from Argo, if the future shows that this functionality is
really required.
56
9. REQUIREMENTS IMPLEMENTATION IN A R G O
9.2.2 Implementation
API is implemented in Go and the project structure follows standard
Go project layout4
. The connection between API and Argo server is
ensured using Argo's Go client library.
Code is compiled as a part of a Docker multi-stage build, where
source code is compiled in an intermediate image, and result binaries
are then copied to the final image. The base of the result image is
gcr.ioI'distroless/base-debianll; which is an image with almost no binaries
in it. The used base image size has roughly 2 MB in size. Apart
from size, another advantage is reduced attack surface in case of a hack
attempt.
\https://github.com/golang-standards/proj ect-layout
57
10 Argo features that might be used in future
This chapter describes Argo features that are not necessary for writing
workflows but might be beneficial.
10.1 Workflow definition synchronization
Over time, workflow definitions may change and new ones can be
created. As has already been shown, it's convenient to store commonly
used workflows as WorkflowTemplates. However, it's not user-friendly
to go through the manual submission process over and over again,
so a little bit of automation can step in. Since workflow definitions are
text files, it's convenient to store them in Git. This introduces better
control over workflow definitions, versioning, and history. Synchronization
between Git and Argo needs to be established using some
kind of pipeline, e.g. Github Actions1
when using Github.
A n example of such pipeline using Github Actions may look like this:
0. Prerequisite is to have Argo API address and Argo access token
stored in Github secrets.
1. User pushes new workflow template to a Git branch.
2. A n automated on-push action is triggered. A Docker container
from argoproj/argocli image is created. In the container, argo template
lint command is executed, performing syntax check of committed
workflow definition, to prevent merging invalid files
into the master branch.
3. After the syntax lint check and pull request review from peer
developers, the branch is merged to master.
https://docs.github.com/en/actions
58
io. A R G O FEATURES THAT MIGHT BE USED IN FUTURE
4. Another automated action is triggered, again creating the argoproj/argocli
container with Argo credentials pulled. Currently,
Argo doesn't support updating a WorkflowTemplate using Argo
CLI, but there are workarounds:
• Using Argo CLI: Deleting WorkflowTemplate and then
recreating with the same name.
• Using kubectl: Since WorkflowTemplate is just a Kubernetes
CustomResourceDefinition, kubectl apply could be used
to update the template. Beware, this option doesn't include
syntax validation
10.2 Usage of volume claims
Currently, ANALÝZA workflows are designed to share data through
the data warehouse. This may create unnecessary traffic to the warehouse
just to pass data between steps of a workflow. A n optimization
of this may be to use Kubernetes persistent volume claims as temporary
data storage in the context of a workflow run.
10.3 Workflow notifications
User may want to be notified about workflow events as workflow
failed, or workflow succeeded. This can be achieved by editing the default
workflow spec2
, and running workflow using onExit property.
This workflow can be sending results of the workflow, or doing any
kind of cleanup.
Alternatively, user can collect Kubernetes events about workflows
emitted by Argo. However, Kubernetes events live in the cluster for only
limited amount of time, so some kind of collector needs to be set
up to collect these events and send them to some persistent storage.
A n example of such a tool can be Kubeivatch3
, which is unfortunately
no longer maintained by original authors but is externally maintained
by the community.
https://axgoproj.github.io/argo-workflows/default-workflow-specs/
#default-workflow-spec
3
https://github.com/vmware-archive/kubewatch
59
io. A R G O FEATURES THAT MIGHT BE USED IN FUTURE
10.4 Prometheus metrics
When executing workflows, Argo can emit metrics4
collectable by
Prometheus5
, quite popular tool used to watch Kubernetes metrics. Such
metrics could be average workflow execution time, failure or success
rate. O n top of Prometheus, user can build a Grafana6
dashboards
to visualize these metrics.
https://axgoproj.github.io/argo-workflows/metrics/
#prometheus-metrics
5
https://prometheus.io/
6
https://grafana.com/
60
11 Demonstration and evaluation of results
The following chapter demonstrates Argo's capabilities on real usecase,
describes testing environment and demonstration scenario. Individual
scenario steps are linked to Appendix, where lie actual template
definitions used in the demonstration. The final section evaluates observed
results.
11.1 Demonstration environment
All the tools evaluation was executed on desktop PC with Intel Core i5-
9600K C P U and 32 GB R A M running Windows 11 and Ubuntu WSL1
.
Two separate options of Kubernetes deployment were used in the testing
process:
• kind2
(kubernetes in docker) tool
• Docker Desktop's built-in Kubernetes feature3
Both tools, when enabled, automatically create a kubeconfig4
and con-
text5
configuration, that is used by kubectl tool to communicate with the clus-
ter.
To communicate with the Argo server via either UI or REST API,
the port-forwarding6
on Argo deployment was applied. Port-forwarding
is quite common practice when developing Kubernetes application; it
brings the comfort of accessing Pods from the local machine without
the need to set up the ingress7
. Port-forwarding can also be applied
to access all Pods, even those not exposed as a Service.
https://ubuntu.com/wsl
2
https://kind.sigs.k8s.io/
3
https://docs.docker.com/desktop/kubernetes/
4
https://kubernetes.io/docs/tasks/access-application-cluster/
access-cluster/
5
https://kubernetes.io/docs/tasks/access-application-cluster/
configure-access-multiple-clusters/
6
https://kubernetes.io/docs/tasks/access-application-cluster/
port-forward-access-application-cluster/
7
https://kubernetes.io/docs/concepts/services-networking/ingress/
61
I i . DEMONSTRATION A N D EVALUATION OF RESULTS
11.2 Demonstration scenario
Argo capabilities were tested on scenario based on real data analysis
use-case. Workflow consists of following steps:
1. A.9.1 - Spin up MySQL database using Deployment, Service
of type NodePort (to allow access outside of cluster), and Service
of type ClusterIP (to allow access from inside of cluster). A l l
these steps are encapsulated into a reusable WorkflowTemplate.
The Deployment loads environment variables from ConfigMap
supplied to the service creation WorkflowTemplate. This step
demonstrated the usage of resource templates executed in parallel,
parameter passing, and usage of WorkflowTemplates.
Workflows argo workflcv WORKFLOW DETAILS
m * E A
SUMMARY CONTAINERS INPUTS/OUTPUTS
Inputs
—defaj li-QuthenticetionDIug
in=mysql_nalive_password
Figure 11.1: MySQL demo step
2. A.9.2 - Download CSV dataset using curl command executed
in container template and save it to persistent volume shared between
steps. This step demonstrated container execution with arguments
and usage of persistent volume claims.
62
I i . DEMONSTRATION A N D EVALUATION OF RESULTS
3. A.9.3 - Load CSV dataset from a persistent volume, cut the content
to first 5 lines of the dataset and write back to persistent volume.
This step demonstrated the usage of artifacts, where CSV
processing Go code was downloaded as git artifact from a public
repository and executed by Golang Docker image.
4. A.9.4 - After the database is up and the CSV dataset is preprocessed
(this is enforced by using DAG template and dependencies),
the source template is invoked, running Python script
connecting to deployed MySQL and loading CSV dataset from
persistent volume. Then it fills the database with the dataset.
The database schema is created dynamically from the dataset
structure. If the script attempts to write to the database before it's
fully initialized (remember that Service/Deployment creation
success doesn't imply it's fully initialized. However, it's possible
to create guards to ensure this property), the execution fails
and retry policy steps in to try again. This step demonstrated
the usage of inline script used in the workflow definition file,
retry policies, and dependency enforcement.
Cj s i ? y r A <& 0 uirnitoioooraws - & •<# o , (J] @
1 • SELECT * FROM CSV.CSV;
I Result Grid | Ü T^V Filter Rows: | Export: | Wrap Cell Content: JÄ
name country subcountry geonameid
• les Escaldes Andorra Escaldes-Engordany 3040051
Andorra la Vella Andorra Andorra la Vella 3041563
Umm al Qaywayn United Arab Emirates Umm al Qaywayn
Ras al43iairnah United Arab Emirates Ra's al Khaymah 291074
Figure 11.2: MySQL DB demo step
63
I i . DEMONSTRATION A N D EVALUATION OF RESULTS
5. A.9.5 - In parallel with databse writing step, there's a step that
reads pre-processed dataset from persistent volume and writes
it into standard output.
X
Logs
T [Filter (regexp)... ~
name, country/ subcountry, geonajneid
les Escaldes,Andorra,Escaldes-Engordany,304 0 051
Andorra la Vella,Andorra,Andorra la Vella,3041563
(Jim al Qaywayn, United Arab Emi rates, Umm al Qaywayn, 2 9 0 594
Has al-Khaimah,United Arab Emirates,Ra's al Khaymah,291074
tir a e="2022-05-14rL9:40:03.037Z" level=info rasg="no need to save parameter - on overlapping volume: /worlc/lineCount.txt" argo=true
I print (workflow-demo-gkl8q-|y| main
Figure 11.3: Print step demo logs
Workflows argo workflow-demo-gklSq
T v l Q. Q. } [] Search
JO
o o
p o p p
o o
o o
WORKFLOW DETAILS
m * a A
SUMMARY CONTA NERS INPUTS/OLTPUTS
Inputs
Outputs
Parameters
line-count 5 /work/worksites a
Figure 11.4: Print demo step outputs
64
l i . DEMONSTRATION A N D EVALUATION OF RESULTS
The result of the demonstration workflow is MySQL database
accessible from outside of Kubernetes cluster (ensured by Service
of type NodePort). User with access to the network where Kubernetes
Node takes place, could then query the database. Since all resources
(Services and Deployments) have configured setOwnerReference to true,
all of these will be destroyed when garbage collection will destroy
workflow pods. This behavior can be adjusted by adding the final
workflow step of type suspend that would keep the workflow running
until user input to resume workflow is provided.
Complete test scenario definitions can be found in the thesis at-
tachment.
11.3 Evaluation
After the Argo Workflows' capabilities theoretical review and practical
demonstration, it's appropriate to evaluate Argo Workflows as fullfeatured
orchestration engine suitable to succeed the A N A L Y Z A ' s
orchestration tool. Argo proved its exceptional Kubernetes support,
task dependencies, parameters and abilities to manage Kubernetes
resources to fulfill all ANALÝZA platform requirements.
There is one function that Argo didn't manage to fulfill, and that is
reusing of running modules the way the ANALÝZA orchestrator do.
However, this is by Argo's design, where workflow steps are isolated
units. However, Argo is able to mimic this behavior with some advanced
usages of workflow properties, but there's no one-line solution
for this case as it is in ANALÝZA orchestrator.
65
12 Conclusion
The goal of this thesis was to review the current market of workflow
orchestration tools, mostly those operating on Kubernetes, then pick
one of them to succeed the in-house built workflow orchestration
solution in the A N A L Y Z A project.
The research part of the thesis introduced four workflow orchestration
tools, described workflow definition syntax, deployment, and execution
model, and discussed the documentation and community
around the tool.
The next part analyzed the current A N A L Y Z A orchestration tool,
the architecture, and workflow definition style. Based on this analysis,
requirements for the orchestration tool successor were formalized.
Based on these requirements, Argo Workflows was selected as the tool
of choice.
The final part demonstrated Argo Workflow's capabilities from the
user point of view, from workflow definition syntax, through deployment
and execution to implementation of more complex requirements
defined in the requirements. Each of the most important workflow
properties was demonstrated in examples. Some advanced workflow
properties Argo offers are suggested to use in A N A L Y Z A workflows
that may improve performance and usability together with integrations
with 3 rd party tools.
Further improvements could be applied to the custom API, depending
on the feedback from production experiences. There are a lot
of features Argo API provides and are not yet included in the custom
API on purpose, to keep things simple and add functionality when
it's really needed. Even though the API complies with the required
contract, the request and response payload can be adjusted based
on the production needs. Since the API is running in the Kubernetes
cluster, it can help supervise Kubernetes resources created by workflows.
For example, it could have the ability to list all publicly exposed
services, or close the no longer used ones.
Another area of improvements is to build on top of service creation
WorkflowTemplate used in the demonstration scenario, and extend
it to be more flexible in terms of service and deployment properties,
or labelling strategy.
66
A Workflow Examples
A.1 Container
1 apiVersion: argoproj . io / v l a l p h a l
2 kind: Workflow
3 metadata:
generateName: container —demo-
5 spec:
6 entrypoint: whalesay
7 templates:
8 - name: whalesay
9 container:
image: docker/whalesay:latest
11 command: [ cowsay ]
12 args: [ " h e l l o u w o r l d " ]
Listing A.1: Argo workflow example running the whalesay container
67
A . WORKFLOW EXAMPLES
A.2 Script
1 apiVersion: argoproj . io / v l a l p h a l
2 kind: Workflow
3 metadata:
generateName: script—demo-
5 spec:
6 entrypoint: generate
7 templates:
8 - name: generate
9 script:
image: python:alpine3 .6
11 command: [python]
12 source:
13 import random
14 i = random . randint (1, 100)
print (i)
Listing A.2: Argo workflow example running Python script
68
A . WORKFLOW EXAMPLES
A.3 Resource
1 apiVersion: argoproj . io / v l a l p h a l
2 kind: Workflow
3 metadata:
generateName: resource—demo—
5 spec:
6 entrypoint: prepare—config
7 templates:
- name: prepare—config
9 resource:
10 action: create
11 setOwnerReference: true
12 manifest:
13 apiVersion: v l
kind: ConfigMap
15 metadata:
16 name: config—map—mysql
17 data:
18 MYSQL_ROOT_PASSWORD: root
Listing A.3: Workflow example running the executing resource
manifest
69
A . WORKFLOW EXAMPLES
A.4 Suspend
1 kind: Workflow
2 metadata:
generateName: suspend—demo-
4 spec:
entrypoint: suspend
6 templates:
- name: suspend
8 steps:
- — name: suspend—temporary
template: suspend—20
11 - — name: suspend—forever
template: suspend—forever
13
14 - name: suspend—forever
15 suspend: {}
16
17 - name: suspend—20
18 suspend:
19 duration: "20"
Listing A.4: Workflow example running the suspend template
70
A . WORKFLOW EXAMPLES
A.5 Steps
1 apiVersion: argoproj . io / v l a l p h a l
2 kind: Workflow
3 metadata:
4 generateName: steps—demo-
5 spec:
6 entrypoint: generate
7 templates:
8 - name: generate
9 script:
image: python:alpine3 .6
11 command: [python]
12 source:
13 import random
14 i = random . randint (1, 100)
print (i)
Listing A.5: Workflow example running steps template
71
A . WORKFLOW EXAMPLES
A.6 DAG
1 a p i V e r s i o n : a r g o p r o j . i o / v l a l p h a l
2 k i n d : W o r k f l o w
3 m e t a d a t a :
4 g e n e r a t e N a m e : dag—demo-
5 spec:
6 e n t r y p o i n t : d i a m o n d
7 t e m p l a t e s :
8 - name: d i a m o n d
dag:
1 0 tasks:
11 - name: A
12 t e m p l a t e : echo
13 a r g u m e n t s :
14 p a r a m e t e r s :
15 - name: message
16 v a l u e : " I ' m u r u n n i n g u f i r s t "
1 7 - name: B
18 d e p e n d s : " A "
19 t e m p l a t e : echo
2 0 a r g u m e n t s :
21 p a r a m e t e r s :
2 2 - name: message
2 3 v a l u e : " I ' m u r u n n i n g u a f t e r u A "
2 4 - name: C
2 5 d e p e n d s : " A "
2 6 t e m p l a t e : echo
2 7 a r g u m e n t s :
2 8 p a r a m e t e r s :
2 9 - name: message
3 0 v a l u e : " I ' r a u r u n n i n g u a f t e r u A "
31 - name: D
3 2 d e p e n d s : " B u & & u C "
3 3 t e m p l a t e : echo
3 4 a r g u m e n t s :
3 5 p a r a m e t e r s :
3 6 - name: message
3 7 v a l u e : " I ' r a u r u n n i n g u a f t e r u B u a n d u C "
3 8
3 9 - name: echo
4 0 i n p u t s :
41 p a r a m e t e r s :
4 2 - name: message
4 3 c o n t a i n e r :
4 4 i m a g e : a l p i n e : 3 . 7
4 5 command: [echo , " { { i n p u t s . p a r a m e t e r s . m e s s a g e } } " ]
Listing A.6: Workflow example running D A G template
72
A . WORKFLOW EXAMPLES
A.7 Parameters
1 a p i V e r s i o n : a r g o p r o j . i o / v l a l p h a l
2 k i n d : W o r k f l o w
3 m e t a d a t a :
4 g e n e r a t e N a m e : parameters—demo—
5 spec:
6 s e r v i c e A c c o u n t N a m e : a r g o
e n t r y p o i n t : h i
8 a r g u m e n t s :
9 p a r a m e t e r s :
1 0 - name: message # w o r k f l o w - l e v e l p a r a m e t e r c o n f i g u r a b l e a t s u b m i s s i o n
11 v a l u e : " w o r k f l o w u i n p u t u m e s s a g e "
1 2 t e m p l a t e s :
1 3 - name: h i
1 4 i n p u t s :
1 5 p a r a m e t e r s :
1 6 - name: message # r e f e r e n c i n g w o r k f l o w - l e v e l p a r a m e t e r
1 7 s t e p s :
1 8 - — name: h e l l o l
1 9 t e m p l a t e : w h a l e s a y T o F i l e
2 0 a r g u m e n t s :
2 1 p a r a m e t e r s :
2 2 - name: message
2 3 v a l u e : " •{{ i n p u t s . p a r a m e t e r s . message}}" # r e f e r e n c i n g ' message ' p a r a m e t e r
v a l u e o f t e m p l a t e ' h i '
2 4 - — name: h e l l o 2
2 5 t e m p l a t e : w h a l e s a y
a r g u m e n t s :
p a r a m e t e r s :
lump m e s s e t
2 6
2 7
2 8
2 9 v a l u e : " { { s t e p s . h e l l o l . o u t p u t s . p a r a m e t e r s . h e l l o } } " # r e f e r e n c i n g ' h e l l o ' o u t p u t
p a r a m e t e r v a l u e o f t e m p l a t e ' h e l l o l '
3 0
3 1 - name: w h a l e s a y
3 2 i n p u t s :
3 3 p a r a m e t e r s :
3 4 - name: message
3 5 c o n t a i n e r :
3 6 i m a g e : a l p i n e : l a t e s t
3 7 command: [ e c h o ]
3 8 a r g s :
3 9 - " { - { i n p u t s . p a r a m e t e r s . message}}"
4 0 - name: w h a l e s a y T o F i l e
4 1 i n p u t s :
4 2 p a r a m e t e r s :
4 3 - name: message
4 4 c o n t a i n e r :
4 5 i m a g e : d o c k e r / w h a l e s a y : l a t e s t
4 6 command: [ s h , —c]
4 7 a r g s : [ " c o w s a y u { { i n p u t s . p a r a m e t e r s . m e s s a g e } } u > u / t m p / h e l l o _ w o r l d . t x t " ]
4 8 o u t p u t s :
4 9 p a r a m e t e r s :
5 0 - name: h e l l o
5 1 v a l u e F r o m :
5 2 p a t h : / t m p / h e l l o _ w o r l d . t x t
Listing A.7: Workflow example running parameters passing demo
73
A . WORKFLOW EXAMPLES
A.8 Artifacts
1 # t h i s demo r e q u i r e s c o n f i g u r e d a r t i f a c t s t o r a g e
2 a p i V e r s i o n : a r g o p r o j . i o / v l a l p h a l
3 k i n d : W o r k f l o w
4 m e t a d a t a :
5 g e n e r a t e N a m e : a r t i f a c t —demo-
6 spec:
7 e n t r y p o i n t : a r t i f a c t — e x a m p l e
8 t e m p l a t e s :
9 - name: a r t i f a c t — e x a m p l e
10 s t e p s :
- — name: g e n e r a t e — a r t i f a c t
12 t e m p l a t e : w h a l e s a y
13 - — name: c o n s u m e — a r t i f a c t
14 t e m p l a t e : p r i n t — m e s s a g e
15 a r g u m e n t s :
16 a r t i f a c t s :
17 - name: message
f r o m : " { { s t e p s . g e n e r a t e - a r t i f a c t . o u t p u t s . a r t i f a c t s . h e l l o - a r t > > "
19
20 - name: w h a l e s a y
21 c o n t a i n e r :
22 image: d o c k e r / w h a l e s a y : 1 a te s t
23 command: [ s h , —c]
24 a r g s : [ " s l e e p u l ; u c o w s a y u h e l l o u w o r l d u I u t e e u / t m p / h e l l o _ w o r l d . t x t " ]
25 o u t p u t s :
26 a r t i f a c t s :
27 - name: h e l l o — a r t
28 p a t h : / t m p / h e l l o _ w o r l d . t x t
29
30 - name: p r i n t — m e s s a g e
31 i n p u t s :
32 a r t i f a c t s :
33 - name: message
34 p a t h : /tmp/message
35 c o n t a i n e r :
36 image: a l p i n e : l a t e s t
37 command: [ s h , —c]
38 a r g s : [ " c a t u / t m p / m e s s a g e " ]
Listing A.8: Workflow example running artifact producer/consumer
demo
74
A . WORKFLOW EXAMPLES
A.9 Test scenario
A.9.1 Service creation
1 a p i V e r s i o n : a r g o p r o j . i o / v l a l p h a l
2 k i n d : W o r k f l o w T e m p l a t e
3 m e t a d a t a :
4 name: c r e a t e - s e r v i c e
5 namespace: a r g o
6 spec:
7 e n t r y p o i n t : m a i n
8 t e m p l a t e s :
9 - name: m a i n
1 0 i n p u t s :
n pa ra m etc is:
12 - name: app
13 - name: p o r t
14 - name: i m a g e
15 - name: tag
16 - name: env
- name: i n t e r n a l
- name: e x t e r n a l
19 - name: args
2 0 dag:
2 1 t a s k s :
2 2 - name: c r e a t e — d e p l o y m e n t
2 3 t e m p l a t e : d e p l o y m e n t
2 4 a r g u m ents:
2 5 p a r a m e t e r s :
2 6 - name: app
2 7 v a l u e : " { { i n p u t s . p a r a m e t e r s . a p p } } "
28 - name: i m a g e
2 9 v a l u e : " {{ i n p u t s . p a r a m e t e r s . image }} "
3 0 - name: t a g
3 1 v a l u e : " {{ i n p u t s . p a r a m e t e r s . t a g } } "
32 - name: env
3 3 v a l u e : " {{ i n p u t s . p a r a m e t e r s . e n v } } "
3 4 - name: p o r t
3 5 v a l u e : " { { i n p u t s . p a r a m e t e r s . p o r t } } "
3 6 - name: a r g s
3 7 v a l u e : " { { i n p u t s . p a r a m e t e r s . a r g s } } "
- name: c r e a t e — i n t e r n a l —svc
when: ' { { i n p u t s . p a r a m e t e r s . i n t e r n a l }},j== LI t r u e '
4 0 t e m p l a t e : i n t e r n a l —svc
4 1 a r g u m e n t s :
4 2 p a r a m e t e r s :
4 3 - name: app
4 4 v a l u e : " {{ i n p u t s . p a r a m e t e r s . app}} "
4 5 - name: p o r t
4 6 v a l u e : " { { i n p u t s . p a r a m e t e r s . p o r t } } "
- name: c r e a t e - e x t e r n a l - s v c
when: J
{ { i n p u t s . p a r a m e t e r s . e x t e r n a l } } u
= =
u t r u e '
t e m p l a t e : e x t e r n a l - s v c
5 0 a r g u m e n t s :
51 p a r a m e t e r s :
52 - name: app
5 3 v a l u e : " { { i n p u t s . p a r a m e t e r s . app}} "
5 4 - name: p o r t
5 5 v a l u e : " { { i n p u t s . p a r a m e t e r s . p o r t } } "
5 6
75
A . WORKFLOW EXAMPLES
5 7 - name: d e p l o y m e n t
5 8 i n p u t s :
5 9 p a r a m e t e r s :
6 0 - name: app
61 - name: image
62 - name: t a g
6 3 - name: env
64 - name: p o r t
6 5 - name: args
6 6 r e s o u r c e :
6 7 a c t i o n : c r e a t e
6 8 s e t O w n e r R e f e r e n c e : true
6 9 m a n i f e s t : |
7 0 a p i V e r s i o n : a p p s / v l
71 k i n d : D e p l o y m e n t
72 m e t a d a t a :
7 3 name: { { i n p u t s . p a r a m e t e r s . a p p } }
7 4 s p e c :
7 5 s e l e c t o r :
76 m a t c h L a b e l s :
7 7 name: { { i n p u t s . p a r a m e t e r s .app}}
78 t e m p l a t e :
7 9 m e t a d a t a :
8 0 l a b e l s :
81 name: { { i n p u t s . p a r a m e t e r s .app}}
8 2 s p e c :
8 3 c o n t a i n e r s :
8 4 - name: { { i n p u t s . p a r a m e t e r s . app}}
8 5 i m a g e : { { i n p u t s . p a r a m e t e r s . i m a g e } } : { { i n p u t s . p a r a m e t e r s . tag}}
8 6 p o r t s :
- c o n t a i n e r P o r t : { { i n p u t s . p a r a m e t e r s . p o r t } }
ar gs: [ " { { i n p u t s . p a r a m e t e r s . a r g s } } " ]
8 9 envFrom:
9 0 - c o n f i g M a p R e f :
9 1 name: { { i n p u t s . p a r a m e t e r s . env }}
9 2 - name: i n t e r n a l — s v c
9 3 i n p u t s :
9 4 p a r a m e t e r s :
9 5 - name: app
9 6 - name: p o r t
9 7 r e s o u r c e :
9 8 a c t i o n : c r e a t e
9 9 s e t O w n e r R e f e r e n c e : true
1 0 0 m a n i f e s t : |
101 a p i V e r s i o n : v l
1 0 2 k i n d : S e r v i c e
1 0 3 m e t a d a t a :
1 0 4 name: s e r v i c e — i n t e r n a l — { { i n p u t s . p a r a m e t e r s . a p p } }
1 0 5 s p e c :
1 0 6 t y p e : C l u s t e r I P
1 0 7 s e l e c t o r :
1 0 8 name: { { i n p u t s . p a r a m e t e r s . a p p } }
1 0 9 ports:
1 1 0 - p o r t : { { i n p u t s . p a r a m e t e r s . p o r t } }
in
76
A . WORKFLOW EXAMPLES
1 1 2 - name: e x t e r n a l — s v c
1 1 3 i n p u t s :
1 1 4 p a r a m e t e r s :
1 1 5 - name: app
1 1 6 - name: p o r t
1 1 7 r e s o u r c e :
1 1 8 a c t i o n : c r e a t e
1 1 9 s e t O w n e r R e f e r e n c e : true
1 2 0 m a n i f e s t : |
121 a p i V e r s i o n : v l
1 2 2 k i n d : S e r v i c e
1 2 3 m e t a d a t a :
1 2 4 name: s e r v i c e — e x t e r n a l — { { i n p u t s . p a r a m e t e r s . a p p } }
1 2 5 s p e c :
1 2 6 t y p e : N o d e P o r t
1 2 7 s e l e c t o r :
1 2 8 name: { { i n p u t s . p a r a m e t e r s . a p p } }
1 2 9 p o r t s :
1 3 0 - p o r t : { { i n p u t s . p a r a m e t e r s . p o r t } }
Listing A.9: WorkflowTemplate creating Kubernetes Services and
Deployment
77
A . WORKFLOW EXAMPLES
A.9.2 Download csv
1 - name: download—csv
2 container:
3 image: alpine / c u r h l a t e s t
4 volumeMounts:
- name: workdir
6 mountPath: /work
command: [ sh , —c ]
args: ["curlu {{workflow.parameters.url}}u >u /
work/{{workflow.parameters.file-name}}"]
Listing A.10: Workflow template downloading CSV and storing into
persistent volume
78
A . WORKFLOW EXAMPLES
A.9.3 Shorten csv
1 - name: shorten—csv
2 i n p u t s :
3 a r t i f a c t s :
4 - name: code
5 path: /go/src /github . com/445455 rk / a r g o workf
lows—demo
6 g i t :
7 repo: h t t p s : / / g i t h u b .com/445455rk/argo-
workflows—demo
8 c o n t a i n e r :
9 image: golang:1.18
10 command: [ s h , —c ]
11 args: ["cdu /go/src/github.com/445455rk/argow
o r k f l o w s - d e m o u & & u g o u r u n u . u / w o r k / { {
w o r k f l o w . p a r a m e t e r s . f i l e - n a m e } ] - " ]
12 volumeMounts:
13 - name: w o r k d i r
14 mountPath: /work
Listing A.11: Workflow template trimming CSV content
79
A . WORKFLOW EXAMPLES
A.9.4 MySQL write
- name: w r i t e — t o — m y s q l
2 r e t r y S t r a t e g y :
3 l i m i t : " 1 0 "
4 i n p u t s :
5 p a r a m e t e r s :
6 - name: d b i p
7 - name: u s e r
8 - name: p a s s w o r d
9 - name: database—name
10 - name: f i l e n a m e
11 s c r i p t :
12 v o l u m e M o u n t s :
13 - name: w o r k d i r
14 m o u n t P a t h : /work
15 i m a g e : 4 4 5 4 5 5 / m y s q l - h e l p e r : l . G
16 command: [ p y t h o n ]
17 s o u r c e : |
18 i m p o r t p y m y s q l
19 i m p o r t csv
2 0
21 db — p y m y s q l . c o n n e c t (host—" { { i n p u t s . p a r a m e t e r s . d b i p » " , p o r t — 3 3 0 6 , u s e r — " { { i n p u t s .
p a r a m e t e r s . u s e r } } " , p a s s w o r d — " { { i n p u t s . p a r a m e t e r s . p a s s w o r d } } " )
2 2 c u r s o r - d b . c u r s o r ( )
2 3 c u r s o r . e x e c u t e ( ' c r e a t e u d a t a b a s e u i f u n o t u e x i s t s u { { i n p u t s . p a r a m e t e r s . d a t a b a s e - n a m e } } ' )
24 d b . s e l e c t _ d b ( " { { i n p u t s . p a r a m e t e r s . d a t a b a s e - n a m e } } " )
25 w i t h open( ' /work/{{ i n p u t s . p a r a m e t e r s . f i l e n a m e } } ' , e n c o d i n g — " u t f 8" ) as c s v _ f i l e :
26 c s v _ d a t a = csv . r e a d e r ( c s v _ f i l e )
27 c o l u m n s — n e x t ( c s v _ d a t a )
2 8 c o l u m n s _ w i t h _ t y p e s — [ s t r ( c o l ) + " u v a r c h a r (255) " f o r c o l i n c o l u m n s ]
2 9 c r e a t e _ t a b l e _ q u e r y — ' c r e a t e u t a b l e u i f u n o t u e x i s t s u { { i n p u t s . p a r a m e t e r s . d a t a b a s e n
a m e } } u ({0}) ; ' . f o r m a t ( ' , ' . j o i n ( c o l u m n s _ w i t h _ t y p e s ))
3 0 c u r s o r . e x e c u t e ( c r e a t e _ t a b l e _ q u e r y )
31 d b . c o m m i t ()
32 q u e r y — ' i n s e r t u i n t o u { { i n p u t s . p a r a m e t e r s . d a t a b a s e - n a m e } } u ( { 0 } ) u v a l u e s u ( { 1 } ) '
3 3 q u e r y = q u e r y . f o r m a t ( ' , ' . j o i n ( c o l u m n s ) , ' , ' . j o i n ([ '"/,s" ] * l e n ( c o l u m n s ) ) )
34 f o r d a t a i n c s v _ d a t a :
35 c u r s o r . e x e c u t e ( q u e r y - q u e r y , a r g s - d a t a )
3 6 d b . c o m m i t ()
3 7 c u r s o r . c l o s e ()
Listing A.12: Workflow template reading CSV content and storing into
MySQL DB
80
A . WORKFLOW EXAMPLES
A.9.5 Print csv
- name: cat—csv
container:
image: a l p i n e / c u r k l a t e s t
volumeMounts:
- name: workdir
mountPath: /work
command: [ sh, —c ]
args: [ "catu /work/{{workflow.parameters.filename
} } u & & u w c u - l u / work/{{workflow.parameters
.file-name}}u >u /work/lineCount.txt" ]
outputs:
parameters:
- name: line—count
valueFrom:
path: /work/lineCount. txt
Listing A.13: Workflow template printing CSV content to stdout
81
Bibliography
[1] Tomas Rebok et al. ANALYZA - Vypocetnia orchestrami subsystem.
cze. Masarykova univerzita, 2020. URL: https : //is .muni . cz/
publication/1736351.
[2] Rizos Sakellariou et al. "Scheduling Workflows with Budget
Constraints". In: Integrated Research in GRID Computing: CoreGRID
Integration Workshop 2005 (Selected Papers) November 28-30,
Pisa, Italy. Ed. by Sergei Gorlatch and Marco Danelutto. Boston,
M A : Springer US, 2007, pp. 189-202. ISBN: 978-0-387-47658-2.
DOI: 10.1007/978-0-387-47658-2_14. URL: https : //doi . org/
10.1007/978-0-387-47658-2_14.
[3] Jeremiah Lowin. Oct. 2021. URL: https : //www . prefect. i o /
blog/announcing-prefect-orion/.
[4] F Mlder et al. "Sustainable data analysis with Snakemake [version
2; peer review: 2 approved]". In: FlOOOResearch 10.33 (2021).
DOI: 10.12688/f lOOOresearch. 29032.2.
[5] Johannes Köster and Sven Rahmann. "Snakemake—a scalable
bioinformatics workflow engine". In: Bioinformatics 28.19 (Aug.
2012), pp. 2520-2522. ISSN: 1367-4803. DOI: 10.1093/bioinf ormatics/
bts480. eprint: https : //academic. oup. com/bioinf ormatics/
article-pdf/28/19/2520/819790/bts480 .pdf. URL: https:
//doi.org/10.1093/bioinformatics/bts480.
[6] Richard M Stallman, Roland McGrath, and Paul D Smith. GNU
make., 2001.
[7] B.P. Harenslak and J. de Ruiter. Data Pipelines with Apache Airflow.
Manning, 2021. ISBN: 9781617296901. URL: https: //books.
google.cz/books?id=8EwnEAAAQBAJ.
[8] Brendan Burns, Joe Beda, and Kelsey Hightower. Kubernetes.
Dpunkt Heidelberg, Germany, 2018.
[9] Rabi Padhy and Manas Patra. "Evolution of Cloud Computing
and Enabling Technologies". In: International Journal of Cloud
Computing and Services Science (If-CLOSER) 1 (Oct. 2012). DOI:
10.11591/closer.vli4.1216.
[10] Ibrahim Abaker Targio Hashem et al. "The rise of "big data" on
cloud computing: Review and open research issues". In: Information
Systems 47 (2015), pp. 98-115. ISSN: 0306-4379. DOI: https:
82
BIBLIOGRAPHY
//doi . org/10 . 1016/j . i s . 2014 . 07 . 006. URL: https : //www.
sciencedirect.com/science/article/pii/S0306437914001288
[11] Michael Jackson, Kostas Kavoussanakis, and Edward W. J. Wallace.
"Using prototyping to choose a bioinformatics workflow
management system". In: PLOS Computational Biology 17.2 (Feb.
2021), pp. 1-13. DOI: 10 . 1371/journal . pcbi . 1008622. URL:
https://doi.org/10.1371/journal.pcbi.1008622.
[12] Pramod Singh. "Airflow". In: Learn PySpark: Build Python-based
Machine Learning and Deep Learning Models. Berkeley, CA: Apress,
2019, pp. 67-84. ISBN: 978-1-4842-4961-1. DOI: 10 .1007/978-1-
4842-4961-1_4. URL: https://doi.org/10.1007/978-1-4842-
4961-1_4.
[13] Scott Haines. "Workflow Orchestration with Apache Airflow".
In: Modern Data Engineering with Apache Spark: A Hands-On Guide
for Building Mission-Critical Streaming Applications. Berkeley, CA:
Apress, 2022, pp. 255-295. ISBN: 978-1-4842-7452-1. DOI: 10.1007/
978-1-4842-7452-1_8. URL:https://doi.org/10.1007/978-
1-4842-7452-1_8.
[14] Paolo D i Tommaso et al. "Nextflow enables reproducible computational
workflows". In: Nature Biotechnology 35.4 (Apr. 2017),
pp. 316-319. ISSN: 1546-1696. DOI: 10.1038/nbt. 3820. URL: https:
//doi.org/10.1038/nbt.3820.
[15] Yared Dejene Dessalk et al. "Scalable Execution of Big Data
Workflows Using Software Containers". In: Proceedings of the
12th International Conference on Management of Digital EcoSystems.
MEDES '20. Virtual Event, United Arab Emirates: Association
for Computing Machinery, 2020, pp. 76-83. ISBN: 9781450381154.
DOI: 10 . 1145/3415958 . 3433082. URL: https : //doi . org/10 .
1145/3415958.3433082.
[16] Shyam BV. Airflow vs. prefect-workflow managementfor Data Projects.
Oct. 2021. URL: https : //towardsdatascience . com/airf lowvs
- prefect - workflow - management - for - data - projects -
5dla0c80f2e3.
[17] Shubhnoor Gill on Apache Airflow, Osheen Jain on Data Ingestion,
and Abhinav Chola on Data Ingestion. 7 best airflow
alternatives for 2022. Mar. 2022. URL: https : //hevodata . com/
learn/airflow-alternatives/.
83
BIBLIOGRAPHY
[18] Pedram Navid. Airflow, Prefect, and Dagster: An inside look. Jan.
2022. URL:https://towardsdatascience.com/airflow-prefect-
and-dagster-an-inside-look-6074781c9b77.
[19] Markus Schmitt. Airflow vs Luigi vs Argo vs Kubeflow vs mlflow.
URL: https : //www . datarevenue . com/en-blog/airf low-vs-
luigi-vs-argo-vs-mlflow-vs-kubeflow.
[20] Samadrita Ghosh. Prefect vs. airflow: Censius blogs. URL: https :
//censius.ai/blogs/prefect-vs-airflow.
[21 ] JORDAN SEGALL. The Rise of Workflow Orchestration Tools, URL:
https://assets-global.website-files.com/5e46eb90c58el7cafba804e9/
5f 8f885195 ca2b64eb6d462 c _ 200027 % 5C'/. 200N °/„ 5C % 20UV'/. 5C °/„
20Workflow°/„5C0
/„200rchestration0
/„5C0
/„20White0
/„5C°/„20Paper.
pdf.
[22] Ian McGraw. Picking a kubernetes orchestrator: Airflow, Argo, and
prefect. Dec. 2020. URL: https: //medium. com/arthur-engineering/
picking - a - kubernetes - orchestrator - airflow - argo - and-
prefect-83539ecc69b.
[23] Aleksey Bilogur. Orchestrating spell model pipelines using prefect.
Sept. 2021. URL: https : / / s p e l l . ml/blog/orchestrating -
spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp.
[24] Kelsey Taylor. The best data orchestration tools that businesses should
be aware of. July 2021. URL: https : //www . hitechnectar . com/
blogs/the-best-data-orchestration-tools-that-businesses-
should-be-aware-of/.
[25] Mihhail Matskin et al. " A Survey of Big Data Pipeline Orchestration
Tools from the Perspective of the DataCloud Project *".
In: Dec. 2021.
[26] Tom Yedwab. Why we switched to airflowfor Pipeline Orchestration
I Khan Academy blog. URL: https: //blog. khanacademy. org/why-
we-switched-to-airflow-for-pipeline-orchestration/.
[27] Xingwei Wang, Hong Zhao, and Jiakeng Zhu. "GRPC: A Communication
Cooperation Mechanism in Distributed Systems".
In: SIGOPS Oper. Syst. Rev. 27.3 (July 1993), pp. 75-86. ISSN: 0163-
5980. DOI: 10.1145/155870.155881. URL: https : //doi . org/10.
1145/155870.155881.
84