Apache Tez - A New Chapter in Hadoop Data Processing

© Hortonworks Inc. 2013 Page 1
Apache Tez : Accelerating Hadoop
Query Processing
Bikas Saha @bikassaha
Hitesh Shah @hitesh1892

Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source, Apache-licensed Apache Incubator project.

Hadoop 1 -> Hadoop 2
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (SQL), Others (Cascading)
HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine)
• Data Flow: Pig; SQL: Hive; Others: Cascading
• Batch: MapReduce; Real Time Stream Processing: Storm; Online Data Processing: HBase, Accumulo
Monolithic
• Resource Management
• Execution Engine
• User API
Layered
• Resource Management – YARN
• Execution Engine – Tez
• User API – Hive, Pig, Cascading, Your App!

Tez – Empowering Applications
• Tez solves hard problems of running on a distributed Hadoop environment
• Apps can focus on solving their domain specific problems
• This separation is what makes Tez a platform for a variety of applications
App
• Custom application logic
• Custom data format
• Custom data transfer technology
Tez
• Distributed parallel execution
• Negotiating resources from the Hadoop framework
• Fault tolerance and recovery
• Horizontal scalability
• Resource elasticity
• Shared library of ready-to-use components
• Built-in performance optimizations
• Security

Tez – End User Benefits
• Better performance of applications
– Built-in performance plus application-defined optimizations
• Better predictability of results
– Minimization of overheads and queuing delays
• Better utilization of compute capacity
– Efficient use of allocated resources
• Reduced load on distributed filesystem (HDFS)
– Reduce unnecessary replicated writes
• Reduced network usage
– Better locality and data transfer using new data patterns
• Higher application developer productivity
– Focus on application business logic rather than Hadoop internals

Tez – Design considerations
Don’t solve problems that have already been solved. Or else
you will have to solve them again!
• Leverage discrete task based compute model for elasticity, scalability
and fault tolerance
• Leverage several man years of work in Hadoop Map-Reduce data
shuffling operations
• Leverage proven resource sharing and multi-tenancy model for Hadoop
and YARN
• Leverage built-in security mechanisms in Hadoop for privacy and
isolation
Look to the Future with an eye on the Past

Tez – Problems that it addresses
• Expressing the computation
– Direct and elegant representation of the data processing flow
– Interfacing with application code and new technologies
• Performance
– Late Binding : Make decisions as late as possible, using real data available at runtime
– Leverage the resources of the cluster efficiently
– Just work out of the box!
– Customizable engine to let applications tailor the job to meet their specific requirements
• Operational simplicity
– Painless to operate, experiment and upgrade

Tez – Simplifying Operations
• No deployments to do. No side effects. Easy and safe to try it out!
• Tez is a completely client side application.
• Simply upload to any accessible FileSystem and change local Tez
configuration to point to that.
• Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
• Leverages YARN local resources.
[Figure: two client machines, each running a TezClient; HDFS holding Tez Lib 1 and Tez Lib 2; Node Managers running TezTasks.]

Tez – Expressing the computation
[Figure: Distributed Sort drawn as a DAG. Preprocessor Stage, Partition Stage and Aggregate Stage each run Task-1 and Task-2; a Sampler receives Samples and produces Ranges.]
Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
• Vertices in the graph represent data transformations
• Edges represent data movement from producers to consumers
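The vertex and edge idea can be sketched with a toy graph. This is only an illustration: `ToyDag` and its methods are hypothetical names, not the Tez DAG API, and the edges in `main` are one assumed reading of the distributed sort figure. The sketch orders vertices so that every producer comes before its consumers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy graph for illustration; this is NOT the Tez DAG API.
final class ToyDag {
    // vertex name -> names of the vertices that consume its output
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    // vertex name -> number of producers feeding it
    private final Map<String, Integer> producerCount = new LinkedHashMap<>();

    void addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        producerCount.putIfAbsent(name, 0);
    }

    // An edge represents data movement from a producer vertex to a consumer vertex.
    void addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        producerCount.merge(consumer, 1, Integer::sum);
    }

    // Kahn's algorithm: a vertex becomes ready once all of its producers have been emitted.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(producerCount);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String consumer : consumers.get(vertex)) {
                if (remaining.merge(consumer, -1, Integer::sum) == 0) ready.add(consumer);
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag();
        // Assumed edges for the distributed sort example.
        dag.addEdge("Preprocessor", "Sampler");
        dag.addEdge("Preprocessor", "Partition");
        dag.addEdge("Sampler", "Partition");
        dag.addEdge("Partition", "Aggregate");
        System.out.println(dag.executionOrder());
    }
}
```

The cycle check is what makes the "acyclic" part concrete: if some vertex never becomes ready, no valid execution order exists.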

Tez – Expressing the computation
Tez provides the following APIs to define the processing
• DAG API
– Defines the structure of the data processing and the relationship between producers and consumers
– Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
– This is how all the tasks in the job get specified
• Runtime API
– Defines the interfaces through which the framework and app code interact with each other
– App code transforms data and moves it between tasks
– This is how we specify what actually executes in each task on the cluster nodes

Tez – DAG API
// Define DAG
DAG dag = new DAG();
// Define Vertices
Vertex Map1 = new Vertex(Processor.class);
Vertex Reduce1 = new Vertex(Processor.class);
// Define Edge
Edge edge = new Edge(Map1, Reduce1,
    SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    Output.class, Input.class);
// Connect them
dag.addVertex(Map1).addEdge(edge)…
Defines the global processing flow
[Figure: example DAG with vertices Map1, Map2, Reduce1, Reduce2 and Join, connected by scatter-gather edges.]

Tez – DAG API
• Data movement – Defines routing of data between tasks
– One-To-One : Data from the ith producer task routes to the ith consumer task.
– Broadcast : Data from a producer task routes to all consumer tasks.
– Scatter-Gather : Producer tasks scatter data into shards and consumer tasks
gather the data. The ith shard from all producer tasks routes to the ith consumer
task.
• Scheduling – Defines when a consumer task is scheduled
– Sequential : Consumer task may be scheduled after a producer task completes.
– Concurrent : Consumer task must be co-scheduled with a producer task.
• Data source – Defines the lifetime/reliability of a task output
– Persisted : Output will be available after the task exits. Output may be lost later
on.
– Persisted-Reliable : Output is reliably stored and will always be available
– Ephemeral : Output is available only while the producer task is running
Edge properties define the connection between producer and
consumer tasks in the DAG
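The three data movement types can be read as a routing function from a producer's output to consumer task indices. The following is a minimal sketch of that reading only; `EdgeRouting`, `DataMovement` and `destinations` are hypothetical names and do not reproduce how Tez implements routing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Tez API.
final class EdgeRouting {
    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Returns the consumer task indices that receive shard `shard` of producer task `producer`.
    static List<Integer> destinations(DataMovement type, int producer, int shard, int numConsumers) {
        List<Integer> out = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // Data from the ith producer task routes to the ith consumer task.
                if (producer >= numConsumers) {
                    throw new IllegalArgumentException("one-to-one needs a consumer per producer");
                }
                out.add(producer);
                break;
            case BROADCAST:
                // Data from a producer task routes to all consumer tasks.
                for (int c = 0; c < numConsumers; c++) out.add(c);
                break;
            case SCATTER_GATHER:
                // The ith shard from every producer task routes to the ith consumer task.
                if (shard >= numConsumers) {
                    throw new IllegalArgumentException("scatter-gather needs a consumer per shard");
                }
                out.add(shard);
                break;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(destinations(DataMovement.ONE_TO_ONE, 2, 0, 4));
        System.out.println(destinations(DataMovement.BROADCAST, 0, 0, 3));
        System.out.println(destinations(DataMovement.SCATTER_GATHER, 5, 1, 3));
    }
}
```

Note that in the scatter-gather case the producer index does not affect the destination; only the shard index does, which is what lets each consumer gather one shard from every producer.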

Tez – Logical DAG expansion at Runtime
[Figure: the logical DAG (Map1, Map2, Reduce1, Reduce2, Join) expanded into its runtime tasks.]

Tez – Runtime API
Flexible Inputs-Processor-Outputs Model
• Thin API layer to wrap around arbitrary application code
• Compose inputs, processor and outputs to execute arbitrary processing
• Event routing based control plane architecture
• Applications decide logical data format and data transfer technology
• Customize for performance
• Built-in implementations for Hadoop 2.0 data services – HDFS and YARN ShuffleService.
– Built on the same API. Your implementations are as first class as ours!
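To make the Inputs-Processor-Outputs composition concrete, here is a minimal sketch. The interfaces `Input`, `Output` and `Processor` and the method `runTask` are hypothetical stand-ins chosen for this example; they are deliberately simpler than, and not copies of, the Tez runtime interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces for illustration; not the Tez runtime API.
final class IpoSketch {
    interface Input<T> { Iterable<T> read(); }
    interface Output<T> { void write(T record); }
    interface Processor<I, O> { void run(Input<I> in, Output<O> out); }

    // The "framework" side: it only wires the three pieces together and starts the processor.
    static <I, O> void runTask(Input<I> in, Processor<I, O> processor, Output<O> out) {
        processor.run(in, out);
    }

    // The "application" side: it decides the data format (here, plain strings held in memory).
    static List<String> upperCaseAll(List<String> lines) {
        List<String> sink = new ArrayList<>();
        Input<String> input = () -> lines;      // could instead read from a file system
        Output<String> output = sink::add;      // could instead write sorted, partitioned data
        Processor<String, String> processor = (in, out) -> {
            for (String line : in.read()) out.write(line.toUpperCase());
        };
        runTask(input, processor, output);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(upperCaseAll(List.of("map", "reduce")));
    }
}
```

Because `runTask` knows nothing about strings or storage, swapping in a different input or output leaves the processor untouched, which is the point of keeping the API layer thin.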

Tez – Library of Inputs and Outputs
• Classical ‘Map’ : HDFS Input -> Map Processor -> Sorted Output
• Classical ‘Reduce’ : Shuffle Input -> Reduce Processor -> HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce : Shuffle Input -> Reduce Processor -> Sorted Output
• What is built in?
– Hadoop InputFormat/OutputFormat
– SortedGroupedPartitioned Key-Value Input/Output
– UnsortedGroupedPartitioned Key-Value Input/Output
– Key-Value Input/Output

Tez – Performance
• Benefits of expressing the data processing as a DAG
– Reducing overheads and queuing effects
– Gives system the global picture for better planning
• Efficient use of resources
– Re-use resources to maximize utilization
– Pre-launch, pre-warm and cache
– Locality & resource aware scheduling
• Support for application defined DAG modifications at runtime for optimized execution
– Change task concurrency
– Change task scheduling
– Change DAG edges
– Change DAG vertices (TBD)

Tez – Benefits of DAG execution
Faster Execution and Higher Predictability
• Eliminate replicated write barrier between successive computations.
• Eliminate job launch overhead of workflow jobs.
• Eliminate extra stage of map reads in every workflow job.
• Eliminate queue and resource contention suffered by workflow jobs that
are started after a predecessor job completes.
• Better locality because the engine has the global picture
[Figure: the same workflow as Pig/Hive - MR and as Pig/Hive - Tez.]

Tez – Container Re-Use
• Reuse YARN containers/JVMs to launch new tasks
• Reduce scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT friendly execution
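A rough sketch of the reuse idea, under stated assumptions: one `ReusedContainer` object stands in for a YARN container/JVM, tasks run in it one after another, and a shared object loaded by the first task is found in memory by later tasks. All names here (`ReusedContainer`, `Task`, `getOrLoad`) are hypothetical and are not Tez classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a container/JVM that hosts several tasks in sequence.
final class ReusedContainer {
    interface Task { String run(ReusedContainer host); }

    // In-memory objects that survive from one task to the next within this container.
    private final Map<String, Object> sharedObjects = new HashMap<>();
    private int loads = 0;

    // Returns the cached object, or runs the loader once and caches its result.
    Object getOrLoad(String key, Supplier<Object> loader) {
        if (!sharedObjects.containsKey(key)) {
            loads++;
            sharedObjects.put(key, loader.get());
        }
        return sharedObjects.get(key);
    }

    int loadCount() { return loads; }

    // Runs every task in this same container instead of launching one container per task.
    List<String> runAll(List<Task> tasks) {
        List<String> results = new ArrayList<>();
        for (Task task : tasks) results.add(task.run(this));
        return results;
    }

    public static void main(String[] args) {
        ReusedContainer container = new ReusedContainer();
        Task lookup = host -> "got " + host.getOrLoad("dimension-table", () -> "loaded-once");
        System.out.println(container.runAll(List.of(lookup, lookup)));
        System.out.println("loads: " + container.loadCount());
    }
}
```

The second task pays neither the launch cost nor the load cost, which is the saving the bullets above describe.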
[Figure: the Tez Application Master, in its own YARN container, exchanges Start Task and Task Done messages with a TezTask Host running in a YARN Container / JVM; TezTask1 and TezTask2 run in that JVM and use SharedObjects.]
Tez – Sessions
[Figure: a Client starts a session and submits DAGs to the Application Master, which contains a Task Scheduler, a ContainerPool, a Shared Object Registry and pre-warmed JVMs.]
Sessions
• Standard concepts of pre-launch
and pre-warm applied
• Key for interactive queries
• Represents a connection between
the user and the cluster
• Multiple DAGs executed in the
same session
• Containers re-used across queries
• Takes care of data locality and
releasing resources when idle

Tez – Customizable Core Engine
[Figure: Vertex-1 and Vertex-2 are started (Start vertex); the Vertex Manager starts their tasks (Start tasks); the DAG Scheduler is asked for priority (Get Priority) and the Task Scheduler for containers (Get container).]
• Vertex Manager
– Determines task parallelism
– Determines when tasks in a vertex can start.
• DAG Scheduler
– Determines priority of task
• Task Scheduler
– Allocates containers from YARN and assigns them to tasks

Tez – Event Based Control Plane
[Figure: Scatter-Gather Edge. Map Task 1 and Map Task 2 each produce Output1, Output2 and Output3; Reduce Task 2 has Input1 and Input2; events pass through a Router in the AM.]
• Events used to communicate
between the tasks and between task
and framework
• Data Movement Event used by
producer task to inform the
consumer task about data location,
size etc.
• Input Error event sent by task to the
engine to inform about errors in
reading input. The engine then takes
action by re-generating the input
• Other events to send task completion
notification, data statistics and other
control plane information
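The event flow above can be sketched as a small router. This is an illustration under assumptions: the class and field names (`EventRouterSketch`, `DataMovementEvent`, `InputErrorEvent`, `producerTask`, `shard`, `location`) are hypothetical, routing follows the scatter-gather rule only, and "re-generating the input" is modelled as simply queuing the producer task to run again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event shapes and router for illustration; not the Tez event classes.
final class EventRouterSketch {
    // Sent by a producer task to announce where one shard of its output can be fetched.
    static final class DataMovementEvent {
        final int producerTask;
        final int shard;
        final String location;
        DataMovementEvent(int producerTask, int shard, String location) {
            this.producerTask = producerTask;
            this.shard = shard;
            this.location = location;
        }
    }

    // Sent by a consumer task when it cannot read the output of a producer task.
    static final class InputErrorEvent {
        final int producerTask;
        InputErrorEvent(int producerTask) { this.producerTask = producerTask; }
    }

    private final Map<Integer, List<DataMovementEvent>> inboxByConsumer = new HashMap<>();
    private final List<Integer> producersToRerun = new ArrayList<>();

    // Scatter-gather rule: shard i of any producer is delivered to consumer task i.
    void route(DataMovementEvent event) {
        inboxByConsumer.computeIfAbsent(event.shard, k -> new ArrayList<>()).add(event);
    }

    // An unreadable input is handled by queuing its producer once for re-execution.
    void route(InputErrorEvent event) {
        if (!producersToRerun.contains(event.producerTask)) {
            producersToRerun.add(event.producerTask);
        }
    }

    List<DataMovementEvent> eventsFor(int consumerTask) {
        return inboxByConsumer.getOrDefault(consumerTask, new ArrayList<>());
    }

    List<Integer> producersToRerun() { return producersToRerun; }

    public static void main(String[] args) {
        EventRouterSketch router = new EventRouterSketch();
        router.route(new DataMovementEvent(0, 1, "host-a"));
        router.route(new DataMovementEvent(1, 1, "host-b"));
        router.route(new InputErrorEvent(0));
        System.out.println("consumer 1 inputs: " + router.eventsFor(1).size());
        System.out.println("producers to re-run: " + router.producersToRerun());
    }
}
```

Only metadata travels through the router here; the data itself is fetched by the consumer from the announced location, which keeps the control plane light.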
[Figure legend: Data Event, Error Event.]

Tez – Automatic Reduce Parallelism
[Figure: the Map Vertex sends Data Size Statistics to the Vertex Manager in the App Master; the Vertex State Machine of the Reduce Vertex is told to Set Parallelism, followed by Cancel Task and Re-Route.]
• Event Model
– Map tasks send data statistics events to the Reduce Vertex Manager.
• Vertex Manager
– Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism
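As an illustration of the kind of pluggable logic a vertex manager might hold, the sketch below picks a reduce parallelism from reported map output sizes. The policy itself is an assumption made for this example (a target number of bytes per reduce task, capped by the parallelism configured up front); `ReduceParallelism` and `choose` are hypothetical names, and this is not the policy Tez ships.

```java
// Hypothetical policy for illustration; not the Tez vertex manager implementation.
final class ReduceParallelism {
    // Picks a task count so that each reduce task handles about targetBytesPerTask of input.
    // The result is at least 1 and never exceeds configuredParallelism.
    static int choose(long[] mapOutputBytes, long targetBytesPerTask, int configuredParallelism) {
        if (targetBytesPerTask <= 0 || configuredParallelism <= 0) {
            throw new IllegalArgumentException("target size and parallelism must be positive");
        }
        long total = 0;
        for (long bytes : mapOutputBytes) total += bytes;
        // Ceiling division: 600 bytes at 256 bytes per task needs 3 tasks, not 2.
        long needed = (total + targetBytesPerTask - 1) / targetBytesPerTask;
        return (int) Math.max(1, Math.min(configuredParallelism, needed));
    }

    public static void main(String[] args) {
        long[] statistics = {100, 200, 300}; // data size statistics reported by three map tasks
        System.out.println(choose(statistics, 256, 10));
    }
}
```

This is late binding in miniature: the configured parallelism is only an upper bound, and the actual value is fixed once real data sizes are known.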

Tez – Theory to Practice
• Performance
• Scalability

Tez – Hive TPC-DS Scale 200GB latency

Tez – Pig performance gains
• Demonstrates performance gains from a basic translation to a
Tez DAG
[Figure: bar chart of run time in minutes, MR vs Tez]
• Prod script 1 (5 MR jobs) : 25m vs 10m
• Prod script 2 (5 MR jobs) : 34m vs 16m
• Prod script 3 (12 MR jobs) : 1h 46m vs 48m
• Prod script 4 (15 MR jobs) : 2h 22m vs 1h 21m

Tez – Observations on Performance
• Number of stages in the DAG
– The more stages in the DAG, the larger Tez's advantage over MR.
• Cluster/queue capacity
– The more congested a queue is, the larger Tez's advantage over MR, due to container reuse.
• Size of intermediate output
– The larger the intermediate output, the larger Tez's advantage over MR, due to reduced HDFS usage.
• Size of data in the job
– For smaller data and more stages, Tez's advantage over MR grows, because launch overhead is a larger share of the total time for smaller jobs.
• Offload work to the cluster
– Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster. E.g. MR split calculation.
• Vertex caching
– The more re-computation that can be avoided, the better the performance.

Tez – Data at scale
Hive TPC-DS
Scale 10TB

Tez – DAG definition at scale
Hive : TPC-DS Query 88 Logical DAG ( 39 MR jobs with Hive-10)

Tez – Container Reuse at Scale
• 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)

Tez – Real World Use Cases for the API

Tez – Broadcast Edge
SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk from store_sales
group by ss_item_sk) ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
[Figure: Hive : Broadcast Join, shown as Hive – MR and as Hive – Tez]
• Hive – MR
– Store Sales scan. Group by and aggregation. Result written to HDFS.
– Inventory and Store Sales (aggr.) output scan and shuffle join.
• Hive – Tez
– Store Sales scan. Group by and aggregation reduce size of this input.
– Inventory scan and Join, fed by a broadcast edge.

Tez – Multiple Outputs
Pig : Split & Group-by
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;
[Figure: the same script as Pig – MR and as Pig – Tez]
• Pig – MR : Load foo -> Split multiplex / de-multiplex -> Group by y, Group by z -> HDFS -> Load g1 and Load g2 -> Join
• Pig – Tez : Load foo -> Group by y, Group by z (multiple outputs) -> Join (reduce follows reduce)

Tez – One to One Edge
Pig : Skewed Join
l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x USING 'skewed';
[Figure: the skewed join as Pig – MR and as Pig – Tez]
• Pig – MR : Load & Sample -> Aggregate -> Stage sample map on distributed cache -> Partition L and Partition R -> Join, with HDFS between jobs
• Pig – Tez : Sample L -> Aggregate -> Broadcast sample map -> Partition L, Partition R -> Join; pass through input via 1-1 edge

Tez – Current status
• Apache Incubator Project
– Rapid development. Over 1100 JIRAs opened. Over 800 resolved
– Growing community of contributors and users
– Latest release is 0.4
• Support for a wide variety of DAG topologies
• Being used by multiple applications such as Apache Hive,
Apache Pig, Cascading.

Tez – Adoption Path
• Pre-requisite : Hadoop 2 with YARN
• Simple client-side install (no admin support needed)
– No side effects or traces left behind on your cluster. Low risk and low effort to try out.
• Apache Hive – Already available in 0.13
• Apache Pig – Available on trunk (slated for v0.14)
• Cascading – version 3.0 (coming soon)
• Run your MapReduce jobs using Tez runtime
– Change “mapreduce.framework.name” to “yarn-tez”
• ETL: Replace MR or custom pipelines with native Tez

Tez – Roadmap
• Richer DAG support
– Addition of vertices at runtime
– Shared edges for shared outputs
– Enhance Input/Output library
• Performance optimizations
– Improve support for high concurrency
– Improve locality aware scheduling.
– Add framework level data statistics
– HDFS memory storage integration
• Usability
– Tez UI (coming soon)
– Stability and testability
– API ease of use
– Tools for performance analysis and debugging

Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
• Tez meetup for developers and users
– http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/incubator-tez
– Developer list: dev@tez.incubator.apache.org
– User list: user@tez.incubator.apache.org
– Issues list: issues@tez.incubator.apache.org

Tez – Takeaways
• Distributed execution framework that works on computations
represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Customizable execution architecture designed to enable
dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out the hard
stuff
• Spans the spectrum from interactive latency to batch, and from small to large
• Open source Apache project – your use-cases and code are
welcome
• It works and is already being used by Hive and Pig

Tez
Thanks for your time and attention!
Video with Deep Dive on Tez
http://youtu.be/-7YhVwqky6M
http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha, @hitesh1892
