Hudi Streamer is your all-in-one tool for building a data lakehouse. Out of the box, it supports a wide range of data sources: you can connect it to Debezium to continuously read change logs from a Postgres table, or read incremental changes from another Apache Hudi table to form a chain of data processing pipelines. Beyond data sources, Hudi Streamer also handles data transformations, manages table services like compaction and clustering, and syncs with multiple data catalogs, such as Apache Hive Metastore, AWS Glue Catalog, Google BigQuery, DataHub, and more through the Apache XTable (Incubating) extension.

Read chapter 8 of "Apache Hudi™: The Definitive Guide" for real-world examples of using Hudi Streamer to build a data lakehouse. This is the first book ever written about Apache Hudi, by industry experts Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, and Rebecca Bilbro, PhD.

👉 Get a free copy of the e-book (8 early-release chapters now available!): https://lnkd.in/e8svK5pB

#ApacheHudi #DataLake #DataEngineering #DataLakehouse
How to build a data lakehouse with Hudi Streamer
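To make the pipeline idea concrete, here is a minimal PySpark sketch of the kind of upsert write that Hudi Streamer automates end to end (the tool itself runs as its own Spark job driven by configuration, so this is an illustration, not the Streamer API). The bucket paths, table name, and field names are hypothetical; only the Hudi write options are standard.

```python
# Minimal sketch (hypothetical paths, table, and fields) of a Hudi upsert in PySpark,
# i.e. the write path that Hudi Streamer wires up for you once sources are configured.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend this frame holds one micro-batch of change records from an upstream source.
changes = spark.read.json("s3://example-bucket/raw/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/orders/"))
```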
More Relevant Posts
A great article by our co-founder Jacob Leverich, who goes into technical detail on "Why Observability Needs Apache Iceberg". When we first started Observe, Inc., we loved the Snowflake architecture, in particular the separation of storage and compute. This enabled us to ingest data into S3 at a tiny fraction of the cost of incumbent vendors. Since then, Snowflake has embraced Apache Iceberg as its underlying table format, making the ingested data not only cheap but open. For the first time ever, customers can now own their telemetry data in their S3 bucket... and keep it in an open format. https://lnkd.in/gNBJSBsg
👇 🚀 Demystifying Apache Iceberg: The Next-Gen Table Format for Data Lakes

As organizations scale their data platforms, traditional file-based data lakes often hit limits with schema evolution, ACID transactions, and performance. That's where Apache Iceberg changes the game. ❄️

💡 What is Apache Iceberg?
Iceberg is an open table format designed for huge analytic datasets on data lakes, bringing data warehouse reliability to cloud storage like S3, ADLS, or GCS.

⚙️ Why It Matters
✅ Schema Evolution: add, rename, or drop columns without breaking queries.
✅ ACID Transactions: reliable reads and writes even in highly concurrent environments.
✅ Time Travel: query historical versions of data, perfect for audits and debugging.
✅ Partition Evolution: optimize partitions dynamically without rewriting old data.
✅ Multi-Engine Support: works seamlessly with Spark, Snowflake, Trino, Flink, and Dremio.

🔍 Real-World Impact
At scale, Iceberg enables:
• Unified analytics across streaming + batch data
• Faster query performance via metadata pruning
• Simplified data governance with versioned datasets

🧠 Example Use Case
We integrated Iceberg with Snowflake-managed catalogs to enable consistent schema evolution across Azure Data Lake layers, ensuring data lineage, versioning, and compliance across enterprise financial datasets.

📊 In Short
Apache Iceberg transforms data lakes into transactional, governed, and high-performance platforms, bridging the gap between data lake flexibility and warehouse reliability.

#DataEngineering #ApacheIceberg #DataLakehouse #Snowflake #BigData #ETL #Analytics
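To ground a couple of those bullets, here is a minimal sketch in PySpark, assuming the iceberg-spark runtime is on the classpath and that the catalog, table, snapshot ID, and timestamp shown are all made up for illustration.

```python
# Minimal sketch: hypothetical catalog ("demo"), table, and snapshot values.
# Assumes the Iceberg Spark runtime jar is available to the session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-features-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country string)")

# Time travel: query the table as of an earlier snapshot or point in time.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 123456789").show()
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'").show()
```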
🚀 𝐀𝐧𝐧𝐨𝐮𝐧𝐜𝐢𝐧𝐠 𝐚𝐫𝐫𝐨𝐰-𝐚𝐯𝐫𝐨 𝐟𝐨𝐫 𝐑𝐮𝐬𝐭! 🦀

The Arrow team has released arrow-avro, a newly rewritten Rust crate that bridges Apache Avro ↔ Apache Arrow, enabling direct, vectorized reads and writes between Avro data and Arrow RecordBatches.

𝐈𝐭 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬:
📦 Avro Object Container Files (OCF)
🔌 Single-Object Encoding (SOE)
🧩 Confluent & Apicurio Schema Registry formats
⚙️ Projection, schema evolution & tunable batch sizing

𝑾𝒉𝒚 𝒊𝒕 𝒎𝒂𝒕𝒕𝒆𝒓𝒔: Avro is row-oriented and Arrow is columnar, and converting between the two efficiently has long been a bottleneck. arrow-avro eliminates that by decoding Avro straight into Arrow arrays, keeping pipelines columnar end-to-end across both batch and streaming data. Early benchmarks show up to 30× speedups over row-centric decoding for common workloads.

📦 Included in arrow-rs v57.0.0
🦀 Written in Rust, designed for performance, built for modern data systems.

𝑪𝒉𝒆𝒄𝒌 𝒐𝒖𝒕 𝒕𝒉𝒆 𝒇𝒖𝒍𝒍 𝒂𝒓𝒕𝒊𝒄𝒍𝒆 𝒊𝒏 𝒕𝒉𝒆 𝒄𝒐𝒎𝒎𝒆𝒏𝒕𝒔 𝒃𝒆𝒍𝒐𝒘 👇
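The crate itself is Rust, so nothing here implies a Python binding; but as a rough contrast, this is what the row-centric path it replaces typically looks like in Python with fastavro and pyarrow: every record is decoded into a Python dict before columnar arrays are assembled, which is exactly the per-row overhead a direct vectorized decoder avoids. The file name is hypothetical.

```python
# Contrast sketch (NOT the arrow-avro API, which is Rust): the traditional
# row-by-row Avro -> Arrow path. "events.avro" is a hypothetical container file.
import fastavro
import pyarrow as pa

with open("events.avro", "rb") as fo:
    rows = list(fastavro.reader(fo))   # decode every Avro record into a Python dict

table = pa.Table.from_pylist(rows)     # rebuild columnar Arrow arrays from those rows
print(table.schema)
```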
The combination of Rust for data ingestion pipelines, Apache Iceberg as the table format, and Apache Doris as the analytics engine represents a paradigm shift in building production-grade analytics applications. Read my research and architectural explanation 👇 https://lnkd.in/eKQ2xnjt
The combination of Rust for data ingestion pipelines, Apache Iceberg as the table format, and Apache Doris as the analytics engine represents a paradigm shift in building production-grade analytics applications. Read my full architectural explanation 👇 https://lnkd.in/e5cH-Hfb
One thing today's AWS outage reminded me of: portability still matters. Many "open" data stacks are only open until their control plane goes down. If your Spark code references a Unity Catalog or a managed metastore, you can't just redeploy elsewhere, even if the code itself is open source.

When we started building Arc, one of the pillars of its design was:
- 100% open formats (Parquet + Arrow)
- DuckDB as the SQL engine
- And most important… no external catalog, no region dependency, no hidden control plane

The result? You can move your data, spin up a new node anywhere (EC2, GCP, local NVMe, even a laptop), and keep querying: same SQL, same performance.

Open source is great. Portable open source is even better. 👉 https://lnkd.in/d52WJqnR

#AWS #Analytics #TimeSeries #Arc
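Not Arc's internals, just a sketch of the portability point, assuming nothing beyond DuckDB and a directory of Parquet files (path and column names are made up): the engine runs in-process and reads the open files directly, so there is no catalog or control plane to lose.

```python
# Portability sketch (hypothetical path and columns): DuckDB queries open Parquet
# files directly, in-process, with no external catalog or control plane.
import duckdb

con = duckdb.connect()  # nothing to provision; works on a laptop, EC2, or anywhere else

con.sql("""
    SELECT date_trunc('hour', ts) AS hour, count(*) AS events
    FROM read_parquet('data/metrics/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").show()
```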
🧊 Apache Iceberg: the table format that's quietly changing how we handle data lakes.

If you've ever worked with raw Parquet files on S3 or GCS, you know the pain: someone overwrites a folder, queries show inconsistent results, and schema changes turn into a weekend project. Iceberg fixes that without locking you into any single platform.

Here's what clicked for me while exploring it:
• It treats your data lake like a database. You still store files, but Iceberg adds a metadata layer, so it knows exactly which files belong to which snapshot.
• You get transactions and time travel. No more partial updates or guesswork. You can literally query your table "as of" a past version and see what changed.
• Schema changes don't break everything. Add or rename a column safely; Iceberg tracks the evolution for you.
• It's open and engine-agnostic. Snowflake, BigQuery, AWS, and Databricks are all building around it. Finally, a format that plays nicely across clouds.

💡 In short: Iceberg gives structure and reliability to messy data lakes without sacrificing flexibility. Feels like the future of data storage will be built on open table formats like this.

Anyone already experimenting with Iceberg or Delta Lake? Curious what's working for you.

#DataEngineering #ApacheIceberg #DataArchitecture #DataLakes #BigData #CloudComputing
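A small, hedged sketch of the metadata layer described above, assuming an Iceberg-enabled Spark session like the one in the earlier example (table name and snapshot ID are made up): every commit appears as a snapshot you can inspect or roll back to.

```python
# Hypothetical table and snapshot ID; assumes `spark` is an Iceberg-enabled
# SparkSession with a catalog named "demo" (as configured in the earlier sketch).

# The metadata layer is itself queryable: every commit shows up as a snapshot.
spark.sql(
    "SELECT committed_at, snapshot_id, operation FROM demo.db.orders.snapshots"
).show()

# ...and you can roll back to one if a bad write lands
# (stored procedure provided by the Iceberg SQL extensions).
spark.sql("CALL demo.system.rollback_to_snapshot('db.orders', 123456789)")
```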
Everyone wants to be on Apache Iceberg. But the actual implementation is tedious.

Companies like Trust & Will needed to build a lakehouse fast while keeping their existing stack running. They got it done with bauplan + Orchestra in hours instead of months.

⚡ The challenge ⚡: build enterprise data infrastructure from scratch for financial services, but keep analysts happy with their existing dbt Labs + Snowflake workflows for BI. The goal was moving transformations to a more efficient, flexible runtime without a painful migration.

Here's what made it work:
→ Git-like branching for data without infrastructure overhead: test changes safely, roll back instantly
→ In-memory quality checks as data transforms: eliminate warehouse compute costs
→ One Iceberg table, queryable from any engine: Snowflake for BI, bauplan for transformations
→ Orchestra's orchestration layer handles scheduling, monitoring, and cataloging

🎯 The result 🎯: an augmented modern data stack instead of a replaced one. Analysts keep their workflows, engineers get flexibility and speed, all on the same data. No months-long migration. No architectural rewrite. Just faster development at a fraction of the cost.

Thanks to Tim Frazer 🚀 and Hugo Lu and teams for making this integration seamless. Full technical breakdown in the link in the comments.
Data Engineers: it's time to grab the Apache Airflow 3.0 certifications.

At this moment, 𝟳𝟳𝗸 𝗼𝗿𝗴𝗮𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻𝘀 𝘂𝘀𝗲 𝗔𝗽𝗮𝗰𝗵𝗲 𝗔𝗶𝗿𝗳𝗹𝗼𝘄 to manage their ETL and MLOps workflows. 𝟰𝟳.𝟵% 𝗼𝗳 𝘂𝘀𝗲𝗿𝘀 reported that it powers 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀-𝗰𝗿𝗶𝘁𝗶𝗰𝗮𝗹 𝘂𝘀𝗲 𝗰𝗮𝘀𝗲𝘀 within their organization, and this number is expected to grow.

Getting certified in Apache Airflow puts you on the radar of these companies, opening you up to better career opportunities or bigger clients. And passing the exam isn't hard, since Marc Lamberti has already prepared a ton of free modules that dive deep into Airflow 3.0 topics.

If you're ready to start, check out my guide to the certification program 👇

#DataEngineering #ApacheAirflow #Astronomer #ETL
🔹 One Platform, All Your Data: Real-Time + Historical

Ever needed yesterday's numbers and this second's update… but had to jump between tools or wait for engineering? That's the old way. With Apache Fluss (Incubating) 🦦 + Apache Flink, you get a single data view:
> Fresh data streams (sub-second latency) stay hot in Fluss.
> Historical data (weeks, months, years) lives in Iceberg's cost-efficient lakehouse.
> One query, one platform; no more stitching systems together.

✨ What this means for you as a data user:
> Instant answers: track KPIs and spot anomalies in real time.
> Unified view: stream + batch data in one place, no separate pipelines.
> Less waiting: no dependency on Kafka or juggling multiple warehouses.
> Smarter spend: cold storage for long-term analytics, hot storage only where speed matters.

This is the evolution of data architecture: real-time insights meet historical depth, seamlessly, simply, and affordably. 🚀 Coming soon in Apache Fluss 0.8, a new era for anyone who needs fast, reliable, and complete data access.

Curious how this could change your workflow? Drop your thoughts below ⬇️