This document provides an overview of the ORC file format. It describes the key requirements and design decisions, including file structure, stripe structure, column encoding, run-length encoding, compression, indexing, and versioning. It also discusses optimizations, debugging, and using ORC from SQL, Java, C++, and the command line. The document is intended to help users and developers better understand how ORC works.
Introduction to ORC presentation by Owen O'Malley, with contact details and overview.
Key requirements for ORC files: self-describing files, schema, file version, tight compression, column projection, easy division, and compatibility.
ORC file structure includes metadata in the footer, stripe information, and organization into stripes that contain columnar data.
The read path for ORC files: reading the stripe footer and the required streams, how streams are serialized, and the compression techniques applied to them.
Data types and encoding methods including compound types and run-length encoding for efficient storage.
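To illustrate the general idea behind run-length encoding, here is a minimal sketch in Python. It is not ORC's actual encoding, which is defined in the ORC specification and is more elaborate; the function names `rle_encode` and `rle_decode` are hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Tuple


def rle_encode(values: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (run_length, value) pairs.

    Simplified illustration only; this is not ORC's on-disk format.
    """
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][1] == v:
            # Same value as the previous run: extend that run by one.
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            # New value: start a run of length one.
            runs.append((1, v))
    return runs


def rle_decode(runs: Iterable[Tuple[int, int]]) -> List[int]:
    """Expand (run_length, value) pairs back into the original sequence."""
    out: List[int] = []
    for count, v in runs:
        out.extend([v] * count)
    return out
```

Columns with long runs of identical values, which are common in sorted or low-cardinality data, shrink to a handful of pairs under this scheme.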
ORC's compression methods, indexing techniques for row pruning, and how to efficiently access row groups.
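Row pruning with an index can be sketched as follows, assuming the index keeps a minimum and maximum value per row group. The class `RowGroupStats`, the helpers `build_index` and `groups_to_read`, and the group size used in the usage note are hypothetical names and values for illustration, not ORC's API or defaults.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RowGroupStats:
    first_row: int   # index of the first row in the group
    row_count: int   # number of rows in the group
    min_value: int   # smallest column value in the group
    max_value: int   # largest column value in the group


def build_index(values: Sequence[int], rows_per_group: int) -> List[RowGroupStats]:
    """Record min/max statistics for each fixed-size group of rows."""
    index: List[RowGroupStats] = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        index.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return index


def groups_to_read(index: List[RowGroupStats], lo: int, hi: int) -> List[RowGroupStats]:
    """Keep only row groups whose [min, max] range overlaps [lo, hi].

    Groups that cannot contain a matching value are skipped without
    being read or decompressed.
    """
    return [g for g in index if g.max_value >= lo and g.min_value <= hi]
```

For example, with `build_index(list(range(100)), 10)` and a predicate for values between 25 and 31, only the two groups starting at rows 20 and 30 need to be read. The benefit depends on how well the data is clustered: on unsorted data most groups overlap most predicates.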
Bloom filters for value detection, version control, schema evolution, stripe concatenation, column encryption, and additional developer tools.
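As a rough illustration of how a Bloom filter answers "might this value be present?", here is a minimal Python sketch. The class `BloomFilter`, its default sizes, and the use of SHA-256 for hashing are assumptions made for this example; they do not reflect the hash functions or layout that ORC itself uses.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Tiny Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: object) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the value with the
        # hash index. (Illustrative choice, not ORC's hashing scheme.)
        data = str(value).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: object) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: object) -> bool:
        # False means the value is definitely absent, so the reader can
        # skip the data; True means it may be present and must be read.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )
```

A reader would consult such a filter before reading a block of rows for an equality predicate, skipping the block whenever `might_contain` returns `False`.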
Instructions on using ORC in Hive, Spark, Java, C++, and command line tools for debugging and optimization.
Key optimization strategies including stripe size, HDFS block padding, and calculations for file splitting for performance improvements.
Conclusion of the presentation and additional resources for more information on the ORC format and its specifications.