MASARYK UNIVERSITY FACULTY OF INFORMATICS Orchestration tool for the modular computing subsystem of the ANALYZA platform Master's Thesis Bc. RICHARD KYČERKA Advisor: RNDr. Tomáš Rebok, Ph.D. Department of Computer Systems and Communications Brno, Spring 2022 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Bc. Richard Kyčerka Advisor: RNDr. Tomáš Rebok, Ph.D. iii Acknowledgements I would like to thank my supervisor, RNDr. Tomáš Rebok, Ph.D., for his advice, help and support during my work on this thesis. iv Abstract In recent years, the concept of 'Big Data' has gained the attention of many analysts, since the ability to collect and process data on a massive scale is becoming more and more accessible. Many projects focused on Big Data have emerged, and one of them is the ANALYZA platform, which focuses on cross-domain analysis of heterogeneous data. Its analysis steps are isolated mini-applications, composed into data workflows. The aim of this thesis was to review the field of workflow orchestration tools, mostly those operating on the Kubernetes platform, and to pick the most suitable one for adoption by the ANALYZA platform. The choice fell on Argo Workflows. The thesis justifies this choice by a detailed analysis of Argo Workflows' capabilities and demonstrates them on a real-world data workflow use case.
Keywords workflow, orchestration, Docker, Kubernetes, data analysis, Big Data, Airflow, Prefect, Argo Workflows, Snakemake, A N A L Y Z A v Contents Introduction 1 1 Big Data analysis 3 1.1 Cloud computing 3 1.2 Workflows 3 2 Airflow 5 2.1 Steps 5 2.1.1 Pre-TaskFlow era tasks 5 2.1.2 TaskFlow era tasks 6 2.1.3 XComs 6 2.2 Workflows 7 2.2.1 Pre-TaskFlow era DAGs 7 2.2.2 TaskFlow era DAGs 8 2.3 Deployment 8 2.3.1 Executors 9 2.4 Documentation and community 10 2.5 Summary 10 3 Prefect 11 3.1 Tasks 11 3.2 Workflows 12 3.3 Deployment 12 3.3.1 Prefect Cloud 13 3.3.2 Prefect Core 13 3.3.3 Prefect Cloud vs Server 14 3.3.4 Agents 14 3.4 Documentation and community 14 3.5 Prefect Orion 15 3.6 Summary 15 4 Snakemake 16 4.1 Tasks 16 4.2 Workflows 17 4.2.1 Workflow execution 17 vi 4.3 Deployment 18 4.3.1 API 18 4.4 Documentation and community 18 4.5 Summary 20 5 Argo Workflows 21 5.1 Tasks 21 5.2 Workflows 22 5.3 Deployment 23 5.4 API 23 5.5 Documentation and community 24 5.6 Summary 24 6 Honorable mentions 25 6.1 Luigi 25 6.2 Flyte 26 6.3 Dagster 27 6.4 Nextflow 28 7 ANALYZA platform and new orchestrator service requirement analysis 29 7.1 Orchestrator service 30 7.1.1 Analytical modules 30 7.1.2 Analytical operations 31 7.1.3 Deployment 32 7.2 Requirement analysis 33 7.3 New orchestration tool choice 34 8 Argo Workflows fundamentals 35 8.1 Workflows 35 8.1.1 Input and Outputs 37 8.1.2 Workflow management 38 8.1.3 Workflow execution 38 8.1.4 Workflow templates 40 8.2 UI 41 8.3 Auth 42 8.4 User management 42 8.5 Argo API 44 vii 8.6 Argo deployment 44 8.6.1 Scaling 45 8.7 Secrets 47 9 Requirements implementation in Argo 48 9.1 Public services 48 9.1.1 Kubernetes resources revised 48 9.1.2 Create service in Argo 49 9.1.3 Garbage collection 55 9.2 Custom API 55 9.2.1 Changes to the API 56 9.2.2 Implementation 57 10 Argo features that might be used in future 58 10.1 Workflow definition synchronization 58 10.2 Usage of volume claims 59 10.3 Workflow notifications 59 10.4 Prometheus metrics 60 11 Demonstration and evaluation of results 61 11.1 Demonstration environment 61 11.2 Demonstration scenario 62 11.3 Evaluation 65 12 Conclusion 66 A Workflow Examples 67 A . l Container 67 A.2 Script 68 A.3 Resource 69 A.4 Suspend 70 A.5 Steps 71 A.6 D A G 72 A.7 Parameters 73 A.8 Artifacts 74 A.9 Test scenario 75 A.9.1 Service creation 75 A.9.2 Download csv 78 viii A.9.3 Shorten csv A.9.4 MySQL write A.9.5 Print csv List of Tables 9.1 Custom Argo API Contract List of Figures 2.1 Airflow architecture schema 9 4.1 Snakemake workflow example 19 5.1 Argo architecture schema 23 6.1 Luigi workflow example 25 6.2 Flyte workflow example 26 6.3 Dagster workflow example 27 6.4 Nexflow workflow example 28 7.1 ANALÝZA platform architecture 29 7.2 Analytical operation definition example 31 8.1 Argo workflow controller architecture 40 8.2 Argo UI displaying the state of a workflow run 41 11.1 MySQL demo step 62 11.2 MySQL DB demo step 63 11.3 Print step demo logs 64 11.4 Print demo step outputs 64 xi Introduction In recent years, Big Data has gained considerable attention from governments, academia, and enterprises. The term "big data" refers to collecting, analyzing, and processing voluminous data. Having large data sets and reasonable computational power, many projects emerged that took the chance and implemented various tools to analyze these data sets. One such project is ANALYZA[1] platform originated as research at Masaryk University for the Police of Czech republic with a focus on cross-domain data analysis. 
Project ANALÝZA use-cases include image processing or network data analysis. ANALÝZA platform covers the whole process of data analysis, from storage in specialized distributed data lakes, through the complex analysis steps to visualization in specialized component providing various analytic functions and view on analysis results. Individual analysis steps are stand-alone mini-applications, where one complex analysis may combine multiple steps into data workflows. The result of such analysis may be various views on the data and their relationships. Workflow steps need to be executed in the correct order to respect dependencies between steps. This composition may be nontrivial and specialized orchestrator service, as a part of the ANALÝZA platform, is taking care of this scheduling. The goal of this research is to analyze the field of workflow orchestration tools and pick one to adapt by ANALÝZA platform to succeed its in-house built orchestration solution. The research puts emphasis on functionality portfolio, scalability, security, and usability in production environments. The new orchestration tool functional requirements are then demonstrated in various examples based on the workflow use case style used in ANALÝZA project. The thesis can be divided into three parts. The first part, i.e. chapters 1 to 6, is focused on research in the field of well-known workflow orchestration tools and provides a brief overview of the most common ones. The overview shows how are the workflows defined, what is the deployment and execution strategy and discussion about community and documentation. 1 The second part, chapter 7, introduces A N A L Y Z A project architecture together with deployment and workflow execution strategy. Architecture analysis leads to a list of requirements that the new orchestration engine must fulfill. The final part, chapter 8 and onwards, introduces the orchestration tool choice, describes its functionalities in detail with examples, and is finalized by a demonstration on a realistic use case. 2 1 Big Data analysis Big data represents the large, diverse sets of information that grows at an exponential rate. Unfortunately, big data is so large that none of the traditional data management tools can store or process it efficiently. More than the volume of data, the way organizations utilize data matters. 1.1 Cloud computing One of the main challenges in developing Big Data processing solutions is to define the right architecture to deploy Big Data software in production systems. Big Data systems, are large-scale applications that handle online and batch data that is growing exponentially. For that reason, a reliable, scalable, secure and easy to administer platform is needed to bridge the gap between the massive volumes of data to be processed, software applications and low-level infrastructure. Kubernetes1 is one of the best options available to deploy applications in large-scale infrastructures. Using Kubernetes, it is possible to handle all the online and batch workloads required to feed analytics and machine learning applications. Kubernetes works with virtualized containerized applications, where the most popular is virtualization technology is Docker2 . Application containerizing provides portability and scalability needed in order to fully utilize cloud computing. 1.2 Workflows In the data science field, when working with Big Data, the measurements often need to undergo some kind of transformation to provide the analysts with various views on the data. 
These transformations can be called data workflows or pipelines. Workflows are composed of a set of tasks, where a task is a single unit of work in the data processing chain. 1 https://kubernetes.io/ 2 https://docs.docker.com/get-started/overview/ 3 1. BIG DATA ANALYSIS The responsibility of a task can be, for example, downloading a dataset from a website into a local data warehouse. Another task can take this data and apply some corrections, like unifying the format of records in the data sets and saving the results into a local database. The next task can take the data from the database and run machine learning model training. All these tasks together form a workflow. Intuitively, these steps need to be executed in the correct order, since one can't run machine learning model training before downloading the data sets from the website. This implies natural dependencies between tasks. For easier representation, the abstraction of Directed Acyclic Graphs (DAG) [2] has been introduced, where graph nodes represent the tasks involved in the process and edges represent dependencies between tasks. Workflow management tools are here to help data scientists with data transformations. Each tool has its own implementation of tasks and workflows, but the idea of DAGs remains the same. 4 2 Airflow Apache Airflow is a workflow management tool founded by Airbnb in October 2014, later open-sourced as a Top-Level Apache Software Foundation1 project. Airflow was one of the first widely used tools of the workflow management kind. In December 2020, the massive upgrade to Airflow 2.0 was released, targeting the biggest issues, like scheduler performance, and introducing the TaskFlow API, which allows task definition in a functional way. 2.1 Steps Workflow steps, or tasks in Airflow terminology, can be arbitrary pieces of Python code. There are two ways to define tasks, depending on the Airflow version. Both task types can be combined together within one workflow. 2.1.1 Pre-TaskFlow era tasks The original Airflow implementation of tasks was heavily relying on Operator2 usage. Operators are pre-defined templates for tasks; Airflow comes with a set of pre-defined Operators, but the user can also write his own custom Operators. An example of an Operator is PythonOperator, which takes a Python function as an input argument and transforms it into a workflow step. 1 https://www.apache.org/ 2 https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html 5 2. AIRFLOW

def print_function(x):
    print(x)

t1 = PythonOperator(
    task_id='print',
    python_callable=print_function,
    op_kwargs={"x": "Hello World"},
    dag=dag,
)

Listing 2.1: Airflow task definition in pre-TaskFlow era. 2.1.2 TaskFlow era tasks The TaskFlow API introduced the @dag and @task concepts, creating an abstraction where steps are @task-decorated functions and workflows are @dag-decorated functions calling multiple @task functions inside. This approach greatly reduces the verbosity of definitions and automates dependency calculation.

@task()
def print_function(x):
    print(x)

Listing 2.2: Airflow task definition in TaskFlow era. 2.1.3 XComs Airflow's XComs3 are the inter-task communication mechanism. They can be thought of as a shared board that tasks push information to and pull information from.
Having two tasks A and B, where A calculates some value and pushes it to the shared board, task B pulls the value and adjusts its own calculation based on this information. It's important to remember that XComs are designed to transfer just small amounts of metadata rather than big volumes of data. 3 https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html 6 2. AIRFLOW 2.2 Workflows A DAG4 is a collection of tasks, with some additional flow properties, like schedules or parameters. The DAG definition again differs depending on whether the TaskFlow API is used or not. DAGs are stored in the metadata database, which is part of the Airflow deployment. Submission of DAGs can be done via either the REST API, the UI, or the CLI. 2.2.1 Pre-TaskFlow era DAGs A DAG defined in the traditional way is a DAG object with a collection of Operators and DAG properties. Dependencies are set as upstream/downstream between operators, signalled by the >> operator; e.g. t1 >> t2 means that t2 depends on t1.

def print_function(x):
    print(x)

with DAG(
    'example',
    default_args={
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Hello world DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 1, 1),
    catchup=False,
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag,
    )
    t2 = PythonOperator(
        task_id='print',
        python_callable=print_function,
        op_kwargs={"x": "Hello World"},
        dag=dag,
    )
    t1 >> t2

Listing 2.3: Airflow DAG definition, pre-TaskFlow 4 https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html 7 2. AIRFLOW 2.2.2 TaskFlow era DAGs Utilizing the @dag and @task decorators, the DAG definition becomes a lot clearer, better readable, and easier to develop. Dependencies can be set manually, or, when possible, they are calculated by Airflow.

@dag
def example():
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )

    @task
    def print_function(x):
        print(x)

    @task
    def generate_number():
        return random.randint(0, 10)

    random_number = generate_number()
    random_message = print_function(random_number)
    t1 >> random_message

Listing 2.4: DAG example using TaskFlow API 2.3 Deployment Airflow deployment is quite complex, so the preferred type of deployment is using Helm5 charts to deploy into a Kubernetes cluster, or a managed installation, where the user pays someone else to host and maintain the installation in the SaaS model. For development purposes, local deployment can be done via Docker Compose or by running Airflow in a Python environment. 5 https://helm.sh/ 8 2.
AIRFLOW Metadata Database User Interface Figure 2.1: Airflow architecture schema Source: https://airflow.apache.org/docs/apache-airflow/stable/ concepts/overview.html 2.3.1 Executors DAGs are processed by executors6 , which are processes running inside the Scheduler process and can be divided into local and remote types. Local executors keep the task execution under the scheduler process. Remote executors delegate the work to Dask7 , Celery8 backend or Kubernetes. Executor choice is installation-wide and can be set in the Airflow configuration file. Tasks from a single flow can, but don't necessarily need to be executed on the same machine. Kubernetes executor Kubernetes executor is a process running under the Scheduler process that has access to a Kubernetes cluster. When a DAG is invoked via Kubernetes Executor, the executor requests Kubernetes API for a worker https://airflow.apache.org/docs/apache-airflow/stable/executor/ index.html#executor 7 Dask is an open-source library for parallel computing - https: //dask. org/ 8 https://docs.celeryq.dev/en/latest/index.html# 9 2. AIRFLOW Pod where the task is executed and after execution and reporting result, the pod is dismissed. 2.4 Documentation and community Airflow documentation is quite extensive since there are many concepts to be explained. Diagrams and code snippets are here to help with the understanding of individual concepts. However, some documentation sections are quite confusing and somewhat hard to under- stand. One of the biggest advantages of Airflow is its massive community built over the last years. The community is also very active in contributing to the Airflow codebase9 with hundreds of discussed issues on GitHub and also many open pull requests fixing bugs or suggesting improvements. 2.5 Summary Apache Airflow is a mature, heavy-armed orchestrator with tradition and many users accumulated during the last years. Users considering using Airflow should be aware of a steep learning curve and somewhat overwhelming installation process. This is mainly caused by many architectural changes over the years. In the near past, it feels like it's just trying to catch up on the competitors that started later than Airflow with fresh and better-designed architecture (taking advantage of Airflow design flows knowledge), rather than coming up with new ideas. For those who can overcome the initial struggles, Airflow can deliver a reliable and robust service. https://github.com/apache/airflow 10 3 Prefect Prefect is an open-source workflow management system. The main motivation for Prefect to rise was dissatisfaction with some of the Airflow's concepts (at that time the go-to workflow management system), like writing DAGs in an imperative way, weak DAG scheduling mechanism, or inability of passing big data chunks between tasks. Prefect chose Python as the language of choice for both, writing workflows and the platform itself. Prefect allows user to define tasks and flows in two ways; the imperative and functional. The more elegant and more convenient is the functional way; it brings the advantage of having less boilerplate code and better readability. On the other hand, the imperative declaration offers more fine-grained control over tasks and flows. In later sections, only the functional task API will be considered. 3.1 Tasks In Prefect, tasks are defined as Python functions annotated by @task decorator1 . Task level properties like the number of retries or task identification are passed as arguments to the decorator. 
Prefect comes with a decent library of mostly community-driven pre-defined tasks providing the ability to integrate with third-party services2 . For reference, such tasks could execute a shell script, create a Jira issue or post a message via Slack. https://docs.prefect.io/core/concepts/tasks https://docs.prefect.io/core/task_library 11 3. PREFECT 3.2 Workflows Flows3 are Prefect's implementation of DAGs. Tasks are instantiated inside of a flow object, possibly using the return value of one task as input for other tasks, creating a natural dependency between these tasks. Dependencies are then automatically detected by Prefect. The development and debugging can be done locally4 on the user's machine without any available Prefect deployment. In order to run a workflow on running deployment, it needs to be registered to the Prefect backend using the Prefect CLI5 . 1 @ t a s k ( n a m e = " n u m b e r L J g e n e r a t o r " ) 2 d e f g e n e r a t e _ n u m b e r () : 3 r e t u r n r a n d i n t ( 0 , 2 2 ) 4 5 @task(name=" s a y u t a s k " ) 6 d e f s a y ( x ) : 7 p r i n t ( x ) 8 9 10 w i t h F l o w ( ' S i m p l e ^ P r e f e c t ^ f l o w ') as f l o w : n number — g e n e r a t e _ n u m b e r () 12 a n o t h e r _ n u m b e r — g e n e r a t e _ n u m b e r () 13 say (number + a n o t h e r _ n u m b e r ) Listing 3.1: Prefect workflow example 3.3 Deployment Prefect architecture is divided into two parts, backend, and agents. Backend stores metadata about registered flows, provides UI, and takes care of task scheduling. Flows are executed on agents, which can be deployed in several ways, depending on the use case and available infrastructure. Prefect comes with two implementations of backends, the cloud-managed called Prefect Cloud and the on-premise one called Prefect Core. 'https://docs.prefect.io/core/concepts/flows \https://docs.prefect.io/core/getting_started/basic-core-flow.html 'https://docs.prefect.io/api/latest/cli/register 12 3. PREFECT 3.3.1 Prefect Cloud Prefect Cloud encapsulates all the backend components into a cloudmanaged solution. This means the user doesn't need to deploy in his infrastructure anything but one or more of the agent variants. Prefect Cloud is easy to set up and enables user to run his first workflow in a few moments. Cloud backend is hosted on Prefect side and there are several pricing options including a free one, although with some restrictions like a limited number of registered flows or a limited number of flow runs per month. 3.3.2 Prefect Core The core version of Prefect is basically a lightweight on-premise version of Prefect Cloud. Depending on the desired scale, Prefect can be deployed on a local machine using docker-compose or scaling large in Kubernetes cluster using helm charts6 . The docker-compose option is more suitable for evaluating and testing whether Prefect suits the user's needs rather than running in production. For production deployment, the Kubernetes deployment seems to be the only viable option. Helm charts are provided by the Prefect team which makes the charts more trustworthy, documentation related to Prefect core Helm charts is well written and covers modifications to the default configuration. 6 https://github.com/PrefectHQ/server/tree/master/helm/ prefect-server 13 3. 
PREFECT 3.3.3 Prefect Cloud vs Server The cloud solution provides many advantages compared to Server7 : • Authorization and Permissions - the ability to manage users authenticated through AuthO8 , their roles and team assignment • Available to access from everywhere - as long as the user has a valid API key • Better performance and scaling • Effortless maintenance and upgrades • Premium support 3.3.4 Agents Agents are processes, where flows are being executed. A l l agents need to be registered either in Prefect Cloud or Prefect core backend. Registered agents periodically poll the backend for work scheduled to be done. Agents can be deployed on user's local machine, executing flows as processes or Docker containers, but also in Kubernetes, where flows are executed as Kubernetes jobs or AWS ECS, where flows are executed as AWS ECS Tasks. While flows are being executed on agents, the work can be delegated further to Dask using Executors9 . 3.4 Documentation and community Prefect documentation is very well written, enriched by many code snippets, and easy to understand. Its content can be divided into two parts; one looking at the concepts with higher abstraction and the second explaining concepts in more detail. User can find code examples, video and written tutorials. Many companies wrote on their blogs1 0 https://docs.prefect.io/orchestration/server/overview.html# prefect-server-vs-prefect-cloud-which-should-i-choose 8 https://authO.com/ 9 https://docs.prefect.io/api/latest/executors.html#executors 1 0 https://www.prefect.io/why-prefect/case-studies/ 14 3. PREFECT about their experiences with Prefect and why they chose it or migrated from other tools. Prefect has grown a big community reflecting in quick answers on Slack or contributions on GitHub. 3.5 Prefect Orion On October 5th, 2021, Prefect announced the new generation of their product, Prefect Orion[3] along with technical alpha, expected to release in early 2022. The objective of Orion is to introduce dynamic, DAG-free workflows; a better development experience, and transparent and observable orchestration rules. 3.6 Summary Prefect is a great workflow orchestration tool for users who can afford to use the Prefect Cloud license model and are familiar with Python programming. Especially, if the user can make use of Dask integration. The transition from pure Python code to Prefect workflow is intuitive and doesn't require many changes to the existing code. The advantage of being able to run workflows without taking care of backend infrastructure is huge and unique in the field of orchestration tools. The setup process is clear and straightforward, the development process is user-friendly with the option to test and debug workflow code locally. The concept documentation is brief, yet it tells the user everything he needs to know. Prefect team is actively pushing forward the feature development giving the promise of future proof investment of time and effort in adopting Prefect as the go-to workflow orchestrator. 15 4 Snakemake Snakemake1 [4] [5] is a tool to create reproducible and scalable data analyses created by bioinformatics, but aiming to be understandable by everyone. The implementation of DAGs is highly influenced by the G N U Make[6] paradigm; workflow is a set of rules applied in a chain to transform inputs to outputs executed in isolated Conda2 environments. 4.1 Tasks Snakemake's implementation of tasks is called rules3 . 
Rules are written in a special file named Snakefile using custom, Python-based lan- guage4 , aiming to be clear and human-readable without boilerplate code around. The main rule components are name, input, output, and action. Some additional rule properties are available too, like specifying Conda environment, output log file location, or parameters. 1 r u l e NAME: 2 i n p u t : " p a t h / t o / i n p u t f i l e " 3 o u t p u t : " p a t h / t o / o u t p u t f i l e " 4 s h e l l : " s o m e c o m m a n d u { i n p u t } u { o u t p u t } " Listing 4.1: Snakemake rule template, executing shell command The rule applies the action by taking the input and storing the output in specified locations. The action can be a shell command, Python script, or scripts in other languages like R, Julia, or Rust. Some users may appreciate the integration with Jupyter5 notebooks too. ^ttps://snakemake.github.io/ 2 https://docs.conda.io/en/latest/ 3 https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html 4 https://snakemake.readthedocs.io/en/stable/snakefiles/writing_ snakefiles.html#grammar 5 https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html# jupyter-notebook-integration 16 4. SNAKEMAKE Input and output files don't need to be on the local host machine, it could point to remote storages6 like S3, FTP or HTTP endpoint. This remote storage is accessed via special functions called wrappers. Wrappers7 encapsulate general actions that can be reused. There is also a central wrapper repository where users can share their wrappers with others. Inputs and outputs can use wildcards8 and regular expressions to make rules generic e.g. by processing all files with common pattern. Each rule can have its own isolated run environment provided by Conda or taking the isolation even further, each rule can be run in its own container. 4.2 Workflows The user doesn't specify what the workflow looks like, he just specifies the target file and Snakemake determines rules that need to be applied to generate the output. Alternatively, the user can specify the rule to be executed. Invocation of workflows is done via CLI. Snakemake workflows offer high customization in terms of resources that can be provided to rule execution. Such options include the number of threads, amount of memory, and disk usage. 4.2.1 Workflow execution Workflow execution can be held locally on the user's machine or the execution could be delegated to the cloud or cluster. When executing remotely, inputs and outputs are stored on a shared filesystem or remote storage like S3. Workflows also support caching to avoid repeated workflow evaluations on the same inputs. When executing in the cloud, these subresults are stored on external storage. https://snakemake.readthedocs.io/en/stable/snakefiles/remote_ files.html#remote-files 7 https://snakemake.readthedocs.io/en/stable/tutorial/additional_ features.html?highlight=wrapper#tool-wrappers 8 https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html# wildcards 17 4. SNAKEMAKE Workflow runs can be examined via generated H T M L report that includes information like runtime statistics or visualization of the workflow topology. 4.3 Deployment Snakemake doesn't require any kind of long-running backend. Workflow executions are invoked by the user on demand. The execution environment is specified as an option to the CLI invocation. 4.3.1 API Since Snakemake doesn't offer any centralized server thus no API is exposed by default. 
However, there's an option to run workflows with a special flag that sends execution details to a Panoptes9,10 server. However, this feature is still under development. 4.4 Documentation and community Snakemake's documentation11 is detailed and well written. It includes tutorials with examples precisely explaining the workflow lifecycle. The community is built mainly by bioinformaticians, and many scientific papers reference Snakemake as the data processing tool of choice used in their research. Questions about Snakemake are mainly discussed on StackOverflow, currently counting more than a thousand threads. 9 https://getpanoptes.io/ 10 https://snakemake.readthedocs.io/en/stable/executing/monitoring.html 11 https://snakemake.readthedocs.io/ 18 4. SNAKEMAKE Figure 4.1: Snakemake workflow example. (a) shows what a workflow looks like, with each line labeled with a color depending on the kind of user knowledge it requires, (b) the DAG of jobs, where node colors correspond to the colors of the rule names, (c) the content of the plot-hist.py script referenced in the 'plot_histogram' rule, (d) knowledge requirements for readability by statement category. Source: https://f1000research.com/articles/10-33/v1#f3 19 4. SNAKEMAKE 4.5 Summary Snakemake has earned its place mostly in bioinformatic circles by ensuring the highly demanded reproducibility and robustness, but also scalability via integration with high-performance clusters. At first, it might be challenging for the user to get his head around the rule definition grammar. Once the user gets the hang of it, Snakemake becomes a powerful and effective orchestration tool with strong synergy with Jupyter notebooks. Adopting Snakemake might be a big step forward for analysts using plain Python or bash scripts as data processing pipelines. Snakemake is aiming for the data science audience rather than mainstream companies. This is reflected by the absence of a central server and UI, focusing instead on effective task scheduling and individual task execution performance on computation clusters and grids. 20 5 Argo Workflows Argo1 is a Kubernetes-oriented orchestration tool, standing out by defining workflows in a fully declarative way, extending Kubernetes Custom Resource Definitions2.
The tool is designed to run workflows at massive scale, roughly thousands of workflows per day. Argo is developed as a Cloud Native Computing Foundation3 project. For anybody to use Argo effectively, one needs to be familiar with Docker and have a really good understanding of Kubernetes concepts. 5.1 Tasks Argo tasks are represented as units called templates4 and are executed exclusively in Kubernetes pods. Templates are closely tied to container execution: running a container with arguments, or having a script wrapped up and executed inside of a container. In the latter case, the script source code is stored in the YAML workflow definition file. Other template types allow performing cluster resource operations. Container and script templates are defined as Kubernetes container specs with all related properties. Tasks can share data using a combination of artifacts and Persistent Volume Claims, which is also used for storing workflow results. Workflows and templates can be parametrized, also allowing the output of one task to be fed into the inputs of other tasks. 1 https://argoproj.github.io/ 2 https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ 3 https://www.cncf.io/ 4 https://argoproj.github.io/argo-workflows/workflow-concepts/#template-types 21 5. ARGO WORKFLOWS 5.2 Workflows Workflows are declared in YAML files as a Workflow.spec5, the Kubernetes specification of a workflow. The building blocks of a workflow are templates and an entrypoint; the template executed first. Submission of workflows to the server is done via the Argo CLI, UI, or REST API. Workflows are then stored as Kubernetes resources, in etcd6 by default. Alternatively, database storage can also be configured. Workflows are represented by two types of template invocators7: • steps - executing task templates sequentially • DAG - executing tasks respecting the dependencies between them. Dependencies must be set manually. Workflows also support caching to avoid redundant work and artifact producing/consuming mechanisms. Steps in a workflow may or may not be executed based on conditions, like some exact value of a workflow parameter.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello world"]

Listing 5.1: Argo workflow running the whalesay container 5 https://argoproj.github.io/argo-workflows/fields/#workflowspec 6 https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/ 7 https://argoproj.github.io/argo-workflows/workflow-concepts/#template-invocators 22 5. ARGO WORKFLOWS 5.3 Deployment The official and Argo-supported installation and upgrade process is applying the Kubernetes manifests located in the Argo GitHub repository. The user can see what is being deployed, and the manifests can be customized if needed. The main functionality components are the workflow controller, which takes care of scheduling, and the Argo server, which runs the API server and the UI.
Figure 5.1: Argo architecture schema Source: https://argoproj.github.io/argo-workflows/architecture/ 5.4 API The Argo server comes with an exposed REST API allowing the user to communicate with the server via authenticated HTTP requests, which may come in handy when integrating with other automation tools. There are also officially supported client libraries wrapping the API calls for the user, written in Java, Python, and Golang. 23 5. ARGO WORKFLOWS 5.5 Documentation and community The documentation is in some places brief, and some concepts could be described in more detail. It is complemented by a really nice interactive user tutorial covering the most important concepts in a prepared Argo playground. The Argo GitHub repository contains many workflow examples and also a very informative tutorial by example. 5.6 Summary Argo Workflows is a great orchestration tool for anyone who has already invested in the Kubernetes environment and whose workflow use cases consist of more complex, containerized units, so that Argo is utilized optimally. Workflow definitions might sometimes get tricky, especially when the user is not used to writing YAML files, but in return, Argo's capabilities are quite unique. 24 6 Honorable mentions This chapter is dedicated to workflow orchestration tools that earned their place in workflow orchestration circles but didn't make it into the final comparison. The reason for this is that these tools were either too far from the ANALÝZA use case or didn't provide any exceptional functionality compared to the selected, more popular tools. 6.1 Luigi Developed by Spotify, Luigi1 uses Python to orchestrate long-running batch jobs. Tasks are classes, where functions are used to define inputs, outputs, dependencies, and the action. Flows can be executed locally or using the central scheduler deployment. Flows are not defined explicitly; Luigi creates them based on task requirements. Figure 6.1: Luigi workflow example Source: https://luigi.readthedocs.io/en/stable/tasks.html 1 https://luigi.readthedocs.io/en/stable/index.html 25 6. HONORABLE MENTIONS 6.2 Flyte Flyte2, a workflow orchestrator developed by Lyft, writes workflows in Python, defining workflows and tasks using the @workflow and @task decorators. Workflows are then executed in a Flyte cluster, deployed locally or in Kubernetes.
Figure 6.2: Flyte workflow example Source: https://docs.flyte.org/en/latest/getting_started/index.html 2 https://flyte.org/ 26 6. HONORABLE MENTIONS 6.3 Dagster Dagster3 is a Python orchestration tool combining computation steps, or ops as they call them, into jobs and further into graphs, annotating Python functions with the @op, @graph, and @job decorators. Dagster's workflow model is close to writing natural Python code, where nested function calls create implicit dependencies. The core, long-running Dagster deployment is composed of a Daemon, a Web Server, and Repositories. Workflows are run by Executors, e.g. Docker, Kubernetes, or Celery executors. Figure 6.3: Dagster workflow example. A single output can be passed to multiple inputs on downstream ops; in this example, the output from the first op is passed to two different ops, and the outputs of those ops are combined and passed to the final op. Source: https://docs.dagster.io/concepts/ops-jobs-graphs/jobs-graphs 3 https://dagster.io/ 27 6. HONORABLE MENTIONS 6.4 Nextflow Nextflow4, similarly to Snakemake, defines workflows using a custom Domain Specific Language built as an extension of the Groovy5 programming language, which is a superset of Java. Workflows are composed of channels and processes, where each process has inputs, outputs, and an action. Workflows are executed locally or remotely via the Nextflow driver, living in a Kubernetes cluster, which emits workflow jobs. Figure 6.4: Nextflow workflow example Source: https://www.nextflow.io/docs/latest/basic.html 4 https://www.nextflow.io/ 5 https://groovy-lang.org/ 28 7 ANALÝZA platform and new orchestrator service requirement analysis Project ANALYZA[1] is a complex computational platform designed to handle the whole process of various large-scale data analyses, from storage, through computation, to visualization. ANALYZA places emphasis on scalability, flexibility, and reliability. The system architecture is composed of three core components: • Data warehouse - where all the data to be analyzed are centralized in one place. • Computation subsystem - a set of data analyzing and transforming steps wrapped into data workflows and managed by the orchestrator service. • Visualisation component - provides multiple views on the results of data analysis and initiates analytical operation executions.
Figure 7.1: ANALÝZA platform architecture (cloud infrastructure with local and remote data warehouses, metadata and distributed storage, interactive and automatic analytical modules, and the orchestrating service) Source: [1] This chapter is dedicated to the analysis of the Computation subsystem, especially the orchestrator service and its functionalities. The computational subsystem is a Kubernetes service running on-demand analytical modules (tasks), composed by the orchestrator service. 29 7. ANALÝZA PLATFORM AND NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS Individual analytical modules are composed into analytical operations (workflows), transforming data for further data analysis. The orchestrator service runs the module containers as Pods in the Kubernetes namespace, in the correct order and with all needed inputs. 7.1 Orchestrator service The orchestrator service was created as a part of the ANALYZA platform, since at that time the field of data orchestration tools was lacking tools fulfilling the project requirements. As time moved on, many new workflow orchestration tools emerged, and there is now an option to swap from the current managed solution to a community-driven open-source service, lifting the burden of developing and maintaining the current orchestrator from the ANALYZA team. 7.1.1 Analytical modules Tasks, represented by analytical modules, are Docker containers, further composed into analytical operations. Two types of analytical modules are available: • Job - a short-running unit of work performing some kind of data processing action with a strictly bounded lifecycle. Jobs are executed as Kubernetes Jobs. • Service - a long-running module whose execution does not block other modules and which runs until the end of the entire analytical operation. Services are executed as Deployments and, depending on the required internal/external ports, a Service of type ClusterIP/NodePort, respectively. The analytical module is defined by its type (Job or Service), Docker image, exposed container ports, parameters, and Kubernetes resource requirements. The property IsBlocking determines whether the module is running as a daemon1. 1 daemon - a service running in the background 30 7. ANALÝZA PLATFORM AND NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS 7.1.2 Analytical operations Operation definitions are written in JSON format and stored in the definition database (PostgreSQL, MySQL, or SQLite). The operation contains a list of nodes and edges, representing dependencies between individual modules. Nodes are wrappers for analytical modules. Edges define dependencies between Nodes (modules) in a similar manner as in directed acyclic graphs.
{ "modules": [ { "name": " m o d u l - t e s t - 1 " , "type": " s e r v i c e " , " d e s c r i p t i o n " : " t e s t i n g s e r v i c e " , "image" : " t e s t ^ " , "tag": "1.0", " i s B l o c k i n g " : true, " h e a l t h P o r t " : 6069, " p o r t s " : [ "name": " p a r t i " , "number": 8380, " i s P u b l i c " : t r u e } ), "parameters": [ -t "name" : "paratnl", " d e f a u l t V a l u e " : " v a l u e l " , " d e s c r i p t i o n " : "paraml d e s c r i p t i o n " , " i s S t r i c t " : t r u e " r e s o u r c e s " : { "cpus1 1 : 1, "ram": 250 ] Figure 7.2: Analytical operation definition example Source: [1] 31 7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS Orchestrator service comes with 2 CLI tools to operate on opera- tions: • Client CLI - provides communication with deployed orchestrator service with start/stop /get resource commands. • Designer CLI - serves CRUD operations on definition database. Additional to the client CLI, and the preferred way, to invoke operations is the REST API. The REST API exposes endpoints to manage all the orchestrator resources, like starting and stopping operations or listing running operations and modules. Orchestrator service tries to optimize workflow execution by analyzing the operation to be run and if possible, reusing already running modules started by other operations. 7.1.3 Deployment Orchestrator service deployment is simple containing: • ConfigMap2 - contains input parameters for the service like database credentials, data warehouse address or kubernetes config used further when executing operations (this is needed because orchestrator is managing Kubernetes resources). • orchestrator deployment - specifies the orchestrator image and tag to be deployed. Serves the A P I and takes care of operation scheduling. • orchestrator-svc service - orchestrator service is exposed as a single Kubernetes Service to expose REST API. https://kubernetes.io/docs/concepts/configuration/configmap/ 32 7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS 7.2 Requirement analysis Based on the orchestrator analysis, requirements for its successor can be formulated: • Kubernetes support - individual analytical modules are encapsulated in virtualized containers for easier reusability and ensuring scalability and resource management. • parametrized workflows - analysis steps may be experimental and is required to run workflows with different input parameters to achieve the most accurate outputs. • dependencies between tasks - analysis may be a simple chain of steps, but also a complex computation tree with some steps strongly dependent on outputs of previous steps. The orchestrator service must be able to respect such dependencies. • inter-task communication - dependencies between modules can be of many types, one such can be a step requiring normalized data stored somewhere in the data warehouse by the previous workflow step, another may be just using computed parameter from previous step. Orchestrator service should provide wide range of possibilities to ensure such communication. • publicly available services - it is a common use case in A N A LÝZA project to run various types of analysis on large datasets where, as a result of the analysis, is a service endpoint exposed to the internet; outside of the Kubernetes cluster the analysis was run on. This use case introduces a non-trivial configuration that chosen workflow engine must fulfill. 
• isolation to secure data - subject data to be analyzed may be of a sensitive character and it may be vulnerable to sending data to some cloud storage. It is required that the orchestrator service is able to run in isolated environments that may even be disconnected from the internet. • security - not only processed data should remain secure, but also the service itself and also communication with the service 33 7- ANALÝZA PLATFORM A N D NEW ORCHESTRATOR SERVICE REQUIREMENT ANALYSIS • open-source - the open-source character of the solution can bring many advantages, like the ability to view the source code, suggest improvements and bug fixes, or being free of charge. • actively developed - when introducing any new technology into the existing project stack, it is appropriate to have a vision, that the chosen technology has a strong and stable developer base (especially in open-source projects) and will be supported in the future to prevent the need to replace the tool with new solution in near future. • scalability - new orchestration solution should be able to scale well while maintaining high availability. • API - analytical operations are in most use-cases invoked via REST API. new orchestration solution should keep this contract. 7.3 New orchestration tool choice From the list of researched tools, most of them are to some extent meeting the requirements. One of them, Argo, has exceptional support for Docker and Kubernetes, while the rest of reviewed tools offer Docker/Kubernetes support just as one of the execution options; that means taking workflow source code and wrapping it in a container and running it in the Kubernetes cluster. On the other side, Argo is fully focusing on Kubernetes, and workflow definition is close to the current orchestrator. Taking these facts into account, Argo seems to be the best candidate for the new orchestrator role. 34 8 Argo Workflows fundamentals Argo was introduced on a high level in chapter 5. This chapter explains Argo Workflows fundamentals in detail in order to explain Argo's application in the ANALÝZA orchestrator service. From now on, Argo Workflows will be shortened to Argo, to avoid confusion with other projects from Argo portfolio1 . 8.1 Workflows Workflows, and other workflow derivatives like WorkflowTemplate (see section 8.1.4), CronWorkflows2 , etc., are implemented as Kubernetes Custom Resources3 , thus their definitions follow Kubernetes object manifest format. 1 apiVersion: argoproj. io / v1 alphal 2 kind: Workflow 3 metadata: generateName: hello-world - spec 5 spec: 6 entrypoint: whalesay template 7 templates: - name: whalesay 9 container: image: docker/whalesay command: [ cowsay ] 12 args: [ " h e l l o u w o r l d " ] # new type of k8s spec # name of the workflow # invoke the whalesay # name of the template Listing 8.1: Argo workflow example ^ttps://axgoproj.github.io/ 2 https://axgoproj.github.io/argo-workflows/cron-workflows/ #cron-workflows 3 https://kubernetes.io/docs/concepts/overview/ working-with-objects/kubernetes-objects/ 35 8. A R G O WORKFLOWS FUNDAMENTALS The spec part is where Argo-specific configuration takes place. The core structure of spec is defined as templates and an entrypoint. Entrypoint is the root of the workflow, and there can be multiple entrypoints of one workflow definition (in the same manner as DAGs can have multiple roots, i.e. multiple vertices with in-degree equal to zero). Default entrypoint can then be overridden on submission. 
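As a minimal sketch of this structure (the template names and images below are illustrative and are not taken from the ANALÝZA modules), a workflow with two templates might look as follows; either of them could act as the root, with the default set by entrypoint and overridable at submission:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: two-roots-
spec:
  entrypoint: say-hello         # default root template, can be overridden on submit
  templates:
  - name: say-hello
    container:
      image: alpine:latest
      command: [echo]
      args: ["hello"]
  - name: say-date              # alternative root, selectable at submission time
    container:
      image: alpine:latest
      command: [date]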
Every step of a workflow is a template, and there are several types of templates, further divided into two groups: 1. Template Definitions - steps from which the workflow is composed: (a) container4 (A.1) - the most common template type. It specifies the Docker container to be run as a workflow step. Container input arguments, the image, and other properties are configured here. (b) script5 (A.2) - extends the container template by injecting script source code into the container to be executed. This may come in handy for simple tasks, where it's not necessary to build a complete Docker image to run a few lines of code. The source code is stored directly in the definition YAML file, so for the sake of readability, the script should stay short and clear. (c) resource6 (A.3) - manages Kubernetes objects directly. This is basically reduced kubectl7 tool functionality built directly into the workflow. Supported operations are to get, create, apply, delete, replace, or patch cluster resources. The Kubernetes manifest to be applied is specified directly as one of the template configuration fields. (d) suspend8 (A.4) - pauses the workflow execution for a specified amount of time or indefinitely until user input is provided. The workflow can then be resumed via the CLI, an API call, or the UI. 4 https://argoproj.github.io/argo-workflows/fields/#container 5 https://argoproj.github.io/argo-workflows/fields/#scripttemplate 6 https://argoproj.github.io/argo-workflows/fields/#resourcetemplate 7 https://kubernetes.io/docs/reference/kubectl/ 8 https://argoproj.github.io/argo-workflows/fields/#suspendtemplate 36 8. ARGO WORKFLOWS FUNDAMENTALS 2. Template Invocators - compose Template Definitions into workflows based on the complexity of workflow dependencies: (a) Steps (A.5) - execute a series of tasks in the order they are defined. Steps can run sequentially or in parallel. (b) DAG (A.6) - the more complex invocator. Composes tasks into a Directed Acyclic Graph, respecting all the dependencies. Templates can be nested together; it is even possible to run a workflow as a part of another workflow. 8.1.1 Input and Outputs In most cases, templates will have specified inputs and outputs (these can be chained together; one task can take the output of the previous task as its own input). There are two types of arguments: • Parameters (A.7) - string/integer values. • Artifacts (A.8) - arbitrary resources like files or archives loaded from remote storage like S3, and then mounted on the specified location of a container. When talking about input arguments, it's important to distinguish between workflow-level and template-level arguments. • workflow arguments - usually the user input to the workflow invocation, defined at the root of the Argo definition spec, on the same level as the entrypoint • template arguments - defined for each template (step). These arguments can reference values of workflow-level arguments or other workflow variables. 37 8. ARGO WORKFLOWS FUNDAMENTALS Workflow-level parameters, as one of the workflow argument options, offer an option to alter workflow behavior without changing the workflow source. Parameters have their default values hardcoded in the workflow source and can be overridden when submitting the workflow. Parameters, like other workflow variables9, can be referenced anywhere in the YAML workflow source file, using a special double curly bracket syntax, e.g. {{workflow.parameters.size}} when size is defined as a workflow parameter.
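As a minimal illustrative sketch of the above (the image and the echoed message are chosen for illustration only), a workflow-level parameter named size with a hardcoded default can be referenced from a template via this syntax:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: param-sketch-
spec:
  entrypoint: print-size
  arguments:
    parameters:
    - name: size                 # workflow-level parameter with a default value,
      value: "100"               # overridable when the workflow is submitted
  templates:
  - name: print-size
    container:
      image: alpine:latest
      command: [echo]
      # workflow variables may be referenced anywhere in the spec
      args: ["requested size is {{workflow.parameters.size}}"]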
8.1.2 Workflow management Argo resources can be controlled in several ways: • Argo CLI - requires access to the Kubernetes cluster Argo is running on. • REST API - connects to Argo server exposed API. Most suitable for automation or integration with other tools. • UI - together with workflow management, it gives the user insight into the state of workflows or view workflow visualization. Best suitable for workflow development and debugging. • kubectl - since Argo resources are Kubernetes CRD, they can be implicitly managed by kubectl. However, this option is not recommended since is doesn't provide syntax validation the other options do. 8.1.3 Workflow execution Naming Each workflow step is executed in a separate Pod whose name follows specific format: -- https://axgoproj.github.io/argo-workflows/variables/ #workflow-variables 38 8. A R G O WORKFLOWS FUNDAMENTALS Namespace If not specified otherwise, workflows are executed in the default Kubernetes namespace. This setting can be overridden in the workflow definition file, specifying the namespace field under metadata, in the same manner as the rest of the Kubernetes resources. Another way to choose the target namespace is to set it as an invocation parameter in UI, or as a command line parameter when using Argo CLI. Service account By default, all workflow Pods run under the default service account in the namespace where workflow is executed. However, for security reasons, it's recommended to override this setting and create service accounts with respective roles for each type of workflow, depending on the scope and requirements. There is a defined minimum set of permissions required for all executing service accounts: 1 a pi Vers ion: rbac. a u t h o r i z a t i o n . k8s. io /v1 2 kind: Role 3 metadata: 4 name: executor 5 rules: 6 - apiGroups: 7 - argoproj.io 8 resources: 9 - workflowtaskresult 10 verbs: 11 - create 12 - patch Listing 8.2: Minimum Role persmissions required for Service Account to run workflows A good idea on how to divide service accounts is by provided secrets. For example, the user may want to create a service account 's3-executor' with secrets needed to authenticate to S31 0 . This account would then be used for all workflows that need to connect to S3. https://aws.amazon.com/s3/ 39 8. A R G O WORKFLOWS FUNDAMENTALS ARGO WORKFLOW CONTROLLER DESIGN Figure 8.1: Argo workflow controller architecture Source: https://argoproj.github.io/argo-workflows/architecture/ 8.1.4 Workflow templates Workflow templates are a vital part of the effective usage of Argo Workflows. By default, the workflow's lifecycle is consisting of submitting and running the workflow on demand. For better reusability, workflow templates were introduced and they allow the user to upload and store workflow definitions directly on the Argo server, removing the need to send the whole definition on each submit. It's even possible to reference workflow templates from other workflows. Workflow templates will play important role in ANALÝZA orchestrator requirement implementation. It's crucial to understand the difference between template and WorkflowTemplate. WorkflowTemplate is the definition of the whole workflow stored in the cluster, but a template is a step of a workflow of which the whole workflow is composed. This naming collision is quite unfortunate and the user needs to get used to it. 40 8. 
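To sketch this reuse mechanism briefly (the names shared-steps and say are illustrative, not part of the ANALÝZA templates), a WorkflowTemplate stored on the cluster can be referenced from another workflow through a templateRef:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: shared-steps             # stored on the Argo server / in the cluster
spec:
  templates:
  - name: say
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:latest
      command: [echo]
      args: ["{{inputs.parameters.message}}"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: use-shared-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: call-shared
        templateRef:             # reference to the stored WorkflowTemplate
          name: shared-steps     # WorkflowTemplate name
          template: say          # template (step) inside it
        arguments:
          parameters:
          - name: message
            value: "hello from a stored template"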
8.2 UI

The Argo UI covers most of the operations performed by the user, like submitting workflows, updating workflow definitions, or viewing past workflow runs. The workflow run visualization comes in handy when debugging new workflows or investigating failed runs. Logs from individual workflow steps, together with their inputs and outputs, are available to view as well. There are also tabs managing integration with other Argo projects (when applicable).

Figure 8.2: Argo UI (v3.2.4) showing the workflow run dag-diamond-2zp7l with the Resubmit, Delete, and Logs actions.

      ... /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello
        valueFrom:
          path: /tmp/hello_world.txt

Listing A.7: Workflow example running parameters passing demo

A.8 Artifacts

# this demo requires configured artifact storage
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-demo-
spec:
  entrypoint: artifact-example
  templates:
  - name: artifact-example
    steps:
    - - name: generate-artifact
        template: whalesay
    - - name: consume-artifact
        template: print-message
        arguments:
          artifacts:
          - name: message
            from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"

  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["sleep 1; cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp/hello_world.txt

  - name: print-message
    inputs:
      artifacts:
      - name: message
        path: /tmp/message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["cat /tmp/message"]

Listing A.8: Workflow example running artifact producer/consumer demo

A.9 Test scenario

A.9.1 Service creation

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: create-service
  namespace: argo
spec:
  entrypoint: main
  templates:
  - name: main
    inputs:
      parameters:
      - name: app
      - name: port
      - name: image
      - name: tag
      - name: env
      - name: internal
      - name: external
      - name: args
    dag:
      tasks:
      - name: create-deployment
        template: deployment
        arguments:
          parameters:
          - name: app
            value: "{{inputs.parameters.app}}"
          - name: image
            value: "{{inputs.parameters.image}}"
          - name: tag
            value: "{{inputs.parameters.tag}}"
          - name: env
            value: "{{inputs.parameters.env}}"
          - name: port
            value: "{{inputs.parameters.port}}"
          - name: args
            value: "{{inputs.parameters.args}}"
      - name: create-internal-svc
        when: '{{inputs.parameters.internal}} == true'
        template: internal-svc
        arguments:
          parameters:
          - name: app
            value: "{{inputs.parameters.app}}"
          - name: port
            value: "{{inputs.parameters.port}}"
      - name: create-external-svc
        when: '{{inputs.parameters.external}} == true'
        template: external-svc
        arguments:
          parameters:
          - name: app
            value: "{{inputs.parameters.app}}"
          - name: port
            value: "{{inputs.parameters.port}}"

  - name: deployment
    inputs:
      parameters:
      - name: app
      - name: image
      - name: tag
      - name: env
      - name: port
      - name: args
    resource:
      action: create
      setOwnerReference: true
      manifest: |
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: {{inputs.parameters.app}}
        spec:
          selector:
            matchLabels:
              name: {{inputs.parameters.app}}
          template:
            metadata:
              labels:
                name: {{inputs.parameters.app}}
            spec:
              containers:
              - name: {{inputs.parameters.app}}
                image: {{inputs.parameters.image}}:{{inputs.parameters.tag}}
                ports:
                - containerPort: {{inputs.parameters.port}}
                args: ["{{inputs.parameters.args}}"]
                envFrom:
                - configMapRef:
                    name: {{inputs.parameters.env}}

  - name: internal-svc
    inputs:
      parameters:
      - name: app
      - name: port
    resource:
      action: create
      setOwnerReference: true
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: service-internal-{{inputs.parameters.app}}
        spec:
          type: ClusterIP
          selector:
            name: {{inputs.parameters.app}}
          ports:
          - port: {{inputs.parameters.port}}

  - name: external-svc
    inputs:
      parameters:
      - name: app
      - name: port
    resource:
      action: create
      setOwnerReference: true
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: service-external-{{inputs.parameters.app}}
        spec:
          type: NodePort
          selector:
            name: {{inputs.parameters.app}}
          ports:
          - port: {{inputs.parameters.port}}

Listing A.9: WorkflowTemplate creating Kubernetes Services and Deployment
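For illustration, a minimal sketch of how the stored create-service WorkflowTemplate could be invoked without re-sending its definition, assuming it has been created in the argo namespace as above; the parameter values are purely illustrative and are not taken from the demonstration scenario:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: create-service-run-
  namespace: argo
spec:
  # run the stored create-service definition; its own entrypoint (main) is used
  workflowTemplateRef:
    name: create-service
  arguments:
    parameters:
    - name: app
      value: demo-app          # illustrative values only
    - name: port
      value: "8080"
    - name: image
      value: nginx
    - name: tag
      value: latest
    - name: env
      value: demo-app-config   # name of an existing ConfigMap
    - name: internal
      value: "true"
    - name: external
      value: "false"
    - name: args
      value: "serve"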
A.9.2 Download csv

- name: download-csv
  container:
    image: alpine/curl:latest
    volumeMounts:
    - name: workdir
      mountPath: /work
    command: [sh, -c]
    args: ["curl {{workflow.parameters.url}} > /work/{{workflow.parameters.file-name}}"]

Listing A.10: Workflow template downloading CSV and storing into persistent volume

A.9.3 Shorten csv

- name: shorten-csv
  inputs:
    artifacts:
    - name: code
      path: /go/src/github.com/445455rk/argo-workflows-demo
      git:
        repo: https://github.com/445455rk/argo-workflows-demo
  container:
    image: golang:1.18
    command: [sh, -c]
    args: ["cd /go/src/github.com/445455rk/argo-workflows-demo && go run . /work/{{workflow.parameters.file-name}}"]
    volumeMounts:
    - name: workdir
      mountPath: /work

Listing A.11: Workflow template trimming CSV content

A.9.4 MySQL write

- name: write-to-mysql
  retryStrategy:
    limit: "10"
  inputs:
    parameters:
    - name: dbip
    - name: user
    - name: password
    - name: database-name
    - name: filename
  script:
    volumeMounts:
    - name: workdir
      mountPath: /work
    image: 445455/mysql-helper:1.0
    command: [python]
    source: |
      import pymysql
      import csv

      db = pymysql.connect(host="{{inputs.parameters.dbip}}", port=3306, user="{{inputs.parameters.user}}", password="{{inputs.parameters.password}}")
      cursor = db.cursor()
      cursor.execute('create database if not exists {{inputs.parameters.database-name}}')
      db.select_db("{{inputs.parameters.database-name}}")
      with open('/work/{{inputs.parameters.filename}}', encoding="utf8") as csv_file:
          csv_data = csv.reader(csv_file)
          columns = next(csv_data)
          columns_with_types = [str(col) + " varchar(255)" for col in columns]
          create_table_query = 'create table if not exists {{inputs.parameters.database-name}} ({0});'.format(','.join(columns_with_types))
          cursor.execute(create_table_query)
          db.commit()
          query = 'insert into {{inputs.parameters.database-name}} ({0}) values ({1})'
          query = query.format(','.join(columns), ','.join(['"%s"'] * len(columns)))
          for data in csv_data:
              cursor.execute(query=query, args=data)
          db.commit()
          cursor.close()

Listing A.12: Workflow template reading CSV content and storing into MySQL DB
A.9.5 Print csv

- name: cat-csv
  container:
    image: alpine/curl:latest
    volumeMounts:
    - name: workdir
      mountPath: /work
    command: [sh, -c]
    args: ["cat /work/{{workflow.parameters.file-name}} && wc -l /work/{{workflow.parameters.file-name}} > /work/lineCount.txt"]
  outputs:
    parameters:
    - name: line-count
      valueFrom:
        path: /work/lineCount.txt

Listing A.13: Workflow template printing CSV content to stdout