Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
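To make the "computation as a dataflow graph" idea concrete, here is a minimal sketch in plain Python. It does not use the Tez API; the vertex names (`tokenize`, `count`, `top`), the `dag` dictionary layout, and the `run_dag` helper are all hypothetical and only illustrate vertices connected by edges and executed in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical vertices: each takes a dict of upstream outputs and returns its own output.
def tokenize(_inputs):
    return ["tez", "yarn", "tez", "pig", "tez", "pig"]

def count(inputs):
    counts = {}
    for word in inputs["tokenize"]:
        counts[word] = counts.get(word, 0) + 1
    return counts

def top(inputs):
    counts = inputs["count"]
    return max(counts, key=counts.get)

# Vertex name -> (processing function, names of upstream vertices).
# This layout is an assumption made for the sketch, not a Tez structure.
dag = {
    "tokenize": (tokenize, []),
    "count": (count, ["tokenize"]),
    "top": (top, ["count"]),
}

def run_dag(dag):
    """Run every vertex once, after all of its upstream vertices have run."""
    predecessors = {name: deps for name, (_, deps) in dag.items()}
    results = {}
    for name in TopologicalSorter(predecessors).static_order():
        fn, deps = dag[name]
        results[name] = fn({dep: results[dep] for dep in deps})
    return results

results = run_dag(dag)
print(results["top"])  # prints "tez"
```

The point of the sketch is only that the application describes what each vertex does and how vertices connect, while a generic runner decides the execution order.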
Tez – Pig performance gains
• Demonstrates performance gains from a basic translation to a Tez DAG
[Bar chart: time in mins, MR vs Tez]
Prod script 1: 25m vs 10m (5 MR jobs)
Prod script 2: 34m vs 16m (5 MR jobs)
Prod script 3: 1h 46m vs 48m (12 MR jobs)
Prod script 4: 2h 22m vs 1h 21m (15 MR jobs)
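The speedups implied by the four production-script timings can be worked out with a few lines of Python. This is a sketch under one assumption: that in each "A vs B" pair the first figure is the MapReduce run time and the second is the Tez run time, which matches the MR/Tez legend order but is not stated explicitly on the slide.

```python
# Run times in minutes, taken from the slide.
# Assumption: each pair is (MapReduce, Tez), following the legend order.
timings = {
    "Prod script 1": (25, 10),
    "Prod script 2": (34, 16),
    "Prod script 3": (1 * 60 + 46, 48),          # 1h 46m vs 48m
    "Prod script 4": (2 * 60 + 22, 1 * 60 + 21),  # 2h 22m vs 1h 21m
}

# Speedup is the MapReduce time divided by the Tez time.
speedups = {name: mr / tez for name, (mr, tez) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Under that assumption the four scripts run roughly 1.75x to 2.5x faster after translation to a Tez DAG.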
#23 For anyone who has been working on MapReduce, there is this age-old problem around “how do I figure out the correct number of reducers?”. We guess some number at compile-time and usually that turns out to be incorrect at run-time. Let’s see how we can use the Tez model to fix that. So here is this Map Vertex and this Reduce Vertex, which have these tasks running and you have the Vertex Manager running inside the framework …
[CLICK] The Map Tasks can send Data Size Statistics to the Vertex Manager, which can then extrapolate those statistics to figure out “what would be the final size of the data when all of these Maps finish?”. Based on that, it can realize that the data size is actually smaller than expected, and that it can run two reduce tasks instead of three.
[CLICK] The Vertex Manager sends a Set Parallelism command to the framework, which changes the routing information between these two vertices and also cancels the last reduce task.
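The reasoning in these notes can be sketched in plain Python. This is not the Tez Vertex Manager code: the function names, the simple average-based extrapolation, the `desired_mb_per_reducer` threshold, and the contiguous regrouping of partitions are all assumptions chosen to illustrate the idea with made-up numbers.

```python
import math

def estimate_total_output_mb(completed_sizes_mb, total_map_tasks):
    """Extrapolate final map output size from the map tasks that have reported so far."""
    if not completed_sizes_mb:
        raise ValueError("need statistics from at least one completed map task")
    average = sum(completed_sizes_mb) / len(completed_sizes_mb)
    return average * total_map_tasks

def choose_parallelism(estimated_mb, desired_mb_per_reducer, configured_reducers):
    """Pick a reducer count for the estimated data, never exceeding the configured count."""
    needed = max(1, math.ceil(estimated_mb / desired_mb_per_reducer))
    return min(configured_reducers, needed)

def regroup_partitions(num_partitions, num_tasks):
    """Assign contiguous partition ranges to tasks (yields at most num_tasks groups)."""
    per_task = math.ceil(num_partitions / num_tasks)
    return [
        list(range(start, min(start + per_task, num_partitions)))
        for start in range(0, num_partitions, per_task)
    ]

# Hypothetical run: 4 of 10 map tasks have reported their output sizes (in MB),
# and 3 reduce tasks were configured at compile time.
estimated = estimate_total_output_mb([40, 60, 50, 50], total_map_tasks=10)
reducers = choose_parallelism(estimated, desired_mb_per_reducer=256, configured_reducers=3)
routing = regroup_partitions(num_partitions=3, num_tasks=reducers)

print(estimated)  # 500.0
print(reducers)   # 2
print(routing)    # [[0, 1], [2]]
```

With these illustrative numbers the estimate is 500 MB, which fits in two reducers at 256 MB each, so the third configured reduce task is no longer needed and its partition is routed to one of the remaining two.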
#25 query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
#26 1.5x to 3x speedup on some of the PigMix queries.