
arrow-avro

Transfer data between the Apache Arrow memory format and Apache Avro.

This crate provides:

  • a reader that decodes Avro Object Container Files (OCF), Avro Single‑Object Encoding (SOE), and the Confluent Schema Registry wire format into Arrow RecordBatches; and
  • a writer that encodes Arrow RecordBatches into Avro (OCF or SOE).

The latest API docs for main (unreleased) are published on the Arrow website: arrow_avro.


Install

[dependencies]
arrow-avro = "57.0.0"

Disable defaults and pick only what you need (see Feature Flags):

[dependencies]
arrow-avro = { version = "57.0.0", default-features = false, features = ["deflate", "snappy"] }

Quick start

Read an Avro OCF file into Arrow

use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;
use arrow_array::RecordBatch;

fn main() -> anyhow::Result<()> {
    // Open the OCF file; the reader consumes it through a buffered reader.
    let file = BufReader::new(File::open("data/example.avro")?);
    let mut reader = ReaderBuilder::new().build(file)?;
    // Each item yielded by the reader is a Result<RecordBatch, _>.
    while let Some(batch) = reader.next() {
        let batch: RecordBatch = batch?;
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}

Write Arrow to Avro OCF (in‑memory)

use std::sync::Arc;

use arrow_avro::writer::AvroWriter;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() -> anyhow::Result<()> {
    let schema = Schema::new(vec![Field::new("id", DataType::Int32, false)]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // Write the OCF to an in-memory Vec<u8> sink.
    let sink: Vec<u8> = Vec::new();
    let mut w = AvroWriter::new(sink, schema)?;
    w.write(&batch)?;
    // Finalize the container file before reclaiming the sink.
    w.finish()?;
    assert!(!w.into_inner().is_empty());
    Ok(())
}
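
Because the buffer returned by into_inner() is a complete OCF stream, it can be fed straight back through the reader from the first example. A minimal round‑trip sketch (assuming, as above, that build accepts any buffered reader; std::io::Cursor provides one over an in‑memory Vec<u8>):

use std::io::Cursor;

use arrow_avro::reader::ReaderBuilder;

// `bytes` is the Vec<u8> returned by `w.into_inner()` in the example above.
fn read_back(bytes: Vec<u8>) -> anyhow::Result<()> {
    let mut reader = ReaderBuilder::new().build(Cursor::new(bytes))?;
    while let Some(batch) = reader.next() {
        println!("round-tripped rows: {}", batch?.num_rows());
    }
    Ok(())
}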

See the crate docs for runnable SOE and Confluent round‑trip examples.


Feature Flags (what they do and when to use them)

Compression codecs (OCF block compression)

arrow-avro supports the Avro‑standard OCF codecs. The defaults include all five: deflate, snappy, zstd, bzip2, and xz.

| Feature | Default | What it enables | When to use |
| --- | --- | --- | --- |
| deflate | Yes | DEFLATE compression via flate2 (pure‑Rust backend) | Most compatible; widely supported; good compression, slower than Snappy. |
| snappy | Yes | Snappy block compression via snap, with the CRC‑32 Avro requires | Fastest encode/decode; common in streaming/data‑lake pipelines. (Avro requires a 4‑byte big‑endian CRC of the uncompressed block.) |
| zstd | Yes | Zstandard block compression via zstd | Great compression/speed trade‑off on modern systems. May pull in a native library. |
| bzip2 | Yes | BZip2 block compression | Compatibility with older datasets that used BZip2. Slower; larger deps. |
| xz | Yes | XZ/LZMA block compression | Highest compression for archival data; slowest; larger deps. |

Avro defines these codecs for OCF: null (no compression), deflate, snappy, bzip2, xz, and zstandard (recent spec versions).

Notes

  • Only OCF uses these codecs (they compress per‑block). They do not apply to the raw Avro frames used by the Confluent wire format or SOE. The crate’s compression module is specifically for OCF blocks.
  • deflate uses flate2 with the rust_backend (no system zlib required).
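
To make the snappy note above concrete, here is a sketch of the block framing Avro’s spec requires, written directly against the snap and crc crates these features wire in. It is illustrative only; the crate’s compression module produces this framing for you when the snappy feature is enabled.

use crc::{Crc, CRC_32_ISO_HDLC};
use snap::raw::Encoder;

// An Avro OCF snappy block is the snappy-compressed body followed by a
// 4-byte big-endian CRC-32 of the UNCOMPRESSED data (the standard IEEE
// polynomial, exposed as CRC_32_ISO_HDLC in the `crc` crate).
fn snappy_ocf_block(uncompressed: &[u8]) -> Vec<u8> {
    let mut block = Encoder::new()
        .compress_vec(uncompressed)
        .expect("snappy compression failed");
    let crc = Crc::<u32>::new(&CRC_32_ISO_HDLC).checksum(uncompressed);
    block.extend_from_slice(&crc.to_be_bytes());
    block
}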

Schema fingerprints & custom logical type helpers

| Feature | Default | What it enables | When to use |
| --- | --- | --- | --- |
| md5 | No | md5 dependency for optional MD5 schema fingerprints | Computing MD5 fingerprints of writer schemas (e.g. for custom prefixing/validation). |
| sha256 | No | sha2 dependency for optional SHA‑256 schema fingerprints | Longer fingerprints; affects the maximum prefix length (e.g. when framing). |
| small_decimals | No | Extra handling for small decimal logical types (Decimal32 and Decimal64) | More compact Arrow representations when your Avro decimals fit in 32 or 64 bits. |
| avro_custom_types | No | Annotates Avro values with Arrow‑specific custom logical types | Reinterpreting certain Avro fields as Arrow types that Avro does not natively model. |
| canonical_extension_types | No | Re‑exports Arrow’s canonical extension types support from arrow-schema | Workflows that use Arrow canonical extension types and want arrow-avro to respect them. |
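
As an illustration of what the fingerprint features compute: Avro fingerprints hash a schema’s Parsing Canonical Form. A hedged sketch using the md5 and sha2 crates these features pull in (the canonical‑form string is hand‑written here for brevity; arrow-avro derives it from the writer schema for you):

use sha2::{Digest, Sha256};

fn main() {
    // Parsing Canonical Form of a one-field record schema (hand-written).
    let canonical = r#"{"name":"Row","type":"record","fields":[{"name":"id","type":"int"}]}"#;

    // MD5 fingerprint: 16 bytes (`md5` feature).
    println!("md5:    {:x}", md5::compute(canonical.as_bytes()));

    // SHA-256 fingerprint: 32 bytes (`sha256` feature); the extra length is
    // why it changes the maximum prefix length when used for framing.
    let sha = Sha256::digest(canonical.as_bytes());
    let hex: String = sha.iter().map(|b| format!("{b:02x}")).collect();
    println!("sha256: {hex}");
}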

Lower‑level/internal toggles (rarely used directly)

  • flate2, snap, crc, zstd, bzip2, xz are optional dependencies wired to the user‑facing features above. You normally enable deflate/snappy/zstd/bzip2/xz, not these directly.

Feature snippets

  • Minimal, fast build (common pipelines):

    arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy"] }
    
  • Include Zstandard too (modern data lakes):

    arrow-avro = { version = "56", default-features = false, features = ["deflate", "snappy", "zstd"] }
    
  • Fingerprint helpers:

    arrow-avro = { version = "56", features = ["md5", "sha256"] }
    

What formats are supported?

  • OCF (Object Container Files): self‑describing Avro files with header, optional compression, sync markers; reader and writer supported.
  • Confluent Schema Registry wire format: 1‑byte magic 0x00 + 4‑byte BE schema ID + Avro body; supports decode + encode helpers.
  • Avro Single‑Object Encoding (SOE): 2‑byte magic 0xC3 0x01 + 8‑byte LE CRC‑64‑AVRO fingerprint + Avro body; supports decode + encode helpers. (Both prefixes are illustrated in the sketch below.)
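
Both non‑OCF prefixes are fixed‑size and easy to recognize. A minimal sketch of the framing described above (the crate’s decoders do this classification internally; the names here are ad hoc):

/// Framing found in front of an Avro-encoded body (names are ad hoc).
enum Framing {
    /// Confluent wire format: 0x00 magic + 4-byte big-endian schema ID.
    Confluent { schema_id: u32 },
    /// Single-Object Encoding: 0xC3 0x01 magic + 8-byte little-endian
    /// CRC-64-AVRO fingerprint of the writer schema.
    SingleObject { fingerprint: u64 },
}

/// Splits a message into its framing and the raw Avro body.
fn split_prefix(msg: &[u8]) -> Option<(Framing, &[u8])> {
    if msg.len() >= 5 && msg[0] == 0x00 {
        let schema_id = u32::from_be_bytes(msg[1..5].try_into().ok()?);
        Some((Framing::Confluent { schema_id }, &msg[5..]))
    } else if msg.len() >= 10 && msg[0] == 0xC3 && msg[1] == 0x01 {
        let fingerprint = u64::from_le_bytes(msg[2..10].try_into().ok()?);
        Some((Framing::SingleObject { fingerprint }, &msg[10..]))
    } else {
        None
    }
}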

Examples

  • Read/write OCF in memory and from files (see crate docs “OCF round‑trip”).
  • Confluent wire‑format and SOE quickstarts are provided as runnable snippets in the crate docs.

There are additional examples under arrow-avro/examples/ in the repository.