Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA · 13,679 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. It provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open table format is not enough: Apache Hudi is also a comprehensive platform of the open services and tools necessary to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing


Updates


  • 📰 The September 2025 edition of the Apache Hudi newsletter is here! This month was packed with technical deep dives. We've distilled the top highlights for you below:
    ✦ Hudi PMC Chair Vinoth Chandar presented at CMU’s Database Group Seminar Series, hosted by Andy Pavlo.
    ✦ The latest "Iceberg vs Delta vs Hudi Feature Comparison" blog post is out!
    ✦ Learn how to combine Apache Hudi’s real-time data processing with PuppyGraph’s zero-ETL graph queries to run security analytics directly on your existing data lakehouse.
    Read more and subscribe to get the next digest delivered directly to your inbox! 📥 https://lnkd.in/gikDVyAA
    #ApacheHudi #DataLakehouse #OpenSource

  • [Blog] Ready to unlock blazing-fast queries on your data lakehouse? ⚡️ The integration between Apache Doris and Apache Hudi is a game-changer for real-time analytics! It lets you seamlessly query your Hudi tables with the high-performance, low-latency power of Doris. Check out this insightful blog post to see how you can level up your data architecture. 👉🏼 https://lnkd.in/gVkmxQcV
    Why is this a big deal?
    🚀 High-Speed Queries: Leverage Doris's MPP architecture to run lightning-fast queries directly on your Hudi-managed data lakehouse.
    🔄 Seamless Data Migration: Easily migrate data from existing systems into your Hudi + Doris lakehouse for unified analytics.
    📊 Real-Time Analytics: Build powerful dashboards and analytical applications on fresh, transactionally consistent data from Hudi.
    🔗 Unified Platform: Combine the power of Hudi's incremental processing and data management with Doris's exceptional query performance in a single, unified platform.
    This powerful combination is perfect for anyone looking to build a modern, high-performance data analytics platform. Dive into the blog to learn more! (A minimal client-side sketch follows below.) #ApacheDoris #ApacheHudi #DataLakehouse

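    As a taste of what this looks like from the client side, here is a minimal sketch, assuming a Doris catalog for Hudi has already been set up as the blog describes. Doris speaks the MySQL wire protocol, so any MySQL client can run the query; the host, catalog, database, table, and column names below are illustrative assumptions, not from the post.

    ```python
    # Hedged sketch: querying a Hudi table through Apache Doris over the
    # MySQL wire protocol. All names and hosts are illustrative assumptions.
    import pymysql

    conn = pymysql.connect(host="doris-fe-host", port=9030, user="root", password="")
    try:
        with conn.cursor() as cur:
            cur.execute("SWITCH hudi_catalog")  # switch to the lakehouse catalog
            cur.execute("SELECT trip_id, fare FROM trips_db.trips LIMIT 10")
            for row in cur.fetchall():
                print(row)
    finally:
        conn.close()
    ```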
  • [Blog] Struggling with concurrent writers and data consistency in your data lakehouse? 🤔 This is a common challenge, but Apache Hudi has robust solutions. Check out this fantastic blog post by Dipankar Mazumdar on Concurrency Control in an Open Data Lakehouse. It's a deep dive into how Hudi manages complex, simultaneous data operations. 👉 https://lnkd.in/gYa24Gvi
    Key Learnings:
    🚀 ACID Properties: Understand the fundamentals of concurrency control and why Isolation and Serializability are critical for the "I" in ACID.
    💡 Optimistic Concurrency Control (OCC): Learn how Hudi implements OCC, making it a great fit for common workloads where conflicts are rare.
    🛡️ Multi-Version Concurrency Control (MVCC): Discover how Hudi uses MVCC to let different processes operate on a consistent snapshot of the table.
    ⚡ NBCC over OCC: For high-throughput concurrent writes and streaming, Hudi's lock-free Non-Blocking Concurrency Control (NBCC) is the recommended approach over OCC, preventing writer starvation and maximizing throughput!
    A must-read for any data engineer or architect building reliable, high-performance data platforms! (A config sketch follows below.) #ApacheHudi #DataEngineering #DataLakehouse

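    To make the OCC piece concrete, here is a minimal PySpark sketch of a multi-writer Hudi write with optimistic concurrency control enabled. The table name, fields, paths, and ZooKeeper settings are illustrative assumptions; check the Hudi concurrency control docs for the options your version supports.

    ```python
    # Hedged sketch: enabling multi-writer OCC for a Hudi write. In Hudi 1.0
    # the same option can be set to NON_BLOCKING_CONCURRENCY_CONTROL for NBCC.
    hudi_occ_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        # multi-writer OCC requires lazy cleaning of failed writes
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        # an external lock provider arbitrates concurrent commits
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url": "zk-host",
        "hoodie.write.lock.zookeeper.port": "2181",
        "hoodie.write.lock.zookeeper.lock_key": "trips",
        "hoodie.write.lock.zookeeper.base_path": "/hudi-locks",
    }

    (updates_df.write.format("hudi")   # updates_df: a DataFrame of upserts (assumed)
        .options(**hudi_occ_options)
        .mode("append")
        .save("s3://bucket/lake/trips"))
    ```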
  • 𝐀𝐧𝐧𝐨𝐮𝐧𝐜𝐢𝐧𝐠 𝐀𝐈-𝐏𝐨𝐰𝐞𝐫𝐞𝐝 𝐒𝐮𝐩𝐩𝐨𝐫𝐭 𝐢𝐧 𝐭𝐡𝐞 𝐀𝐩𝐚𝐜𝐡𝐞 𝐇𝐮𝐝𝐢 𝐒𝐥𝐚𝐜𝐤! 🚀
    Got questions about Apache Hudi? Get answers, instantly. We're thrilled to launch our new Slack channel, 𝗮𝘀𝗸-𝗮𝗶, in the official Apache Hudi workspace! This channel is integrated with an AI assistant powered by kapa.ai, trained on the latest Hudi documentation and GitHub issues.
    Join the Hudi Slack workspace here 👉 𝗵𝘂𝗱𝗶.𝗮𝗽𝗮𝗰𝗵𝗲.𝗼𝗿𝗴/𝘀𝗹𝗮𝗰𝗸
    We're making our community support more accessible and efficient than ever. Come join the channel for instant help! #ApacheHudi #AI #RAG

  • Attackers 👾 can breach a system in under a minute. ⏱️ Can your security analytics keep up? Traditional tools struggle to connect the dots across massive cloud environments in real time. 🤔
    Check out this blog by Jaz Samantha Ku, in collaboration with Shiyan Xu, which demonstrates a powerful, modern approach. Learn how to combine Apache Hudi's real-time data processing with PuppyGraph's zero-ETL graph queries to run security analytics directly on your existing data lakehouse. No data duplication, no complex pipelines—just faster, deeper insights to stop threats in their tracks. 🛡️
    Read the full guide here: https://lnkd.in/gTKz3VWU #ApacheHudi #PuppyGraph #Security #DataLakehouse

  • Apache Hudi reposted this

    Shiyan Xu

    Data Architect | O'Reilly Author | Creator of Hudi-rs | PMC member of Apache Hudi

    Had a fantastic time speaking at the Data Streaming Summit SF 2025 this Tuesday! 🎤 It was a great pleasure to share my perspective on the architectural challenges of streaming writes in the data lakehouse, and to dive into how Apache Hudi's streaming-first design supports building streaming lakehouses.
    For those who couldn't make it, here’s a quick recap of the key challenges and how Hudi handles them (see the config sketch after this list): 💡
    ⚡️ High-frequency updates: use 𝐌𝐞𝐫𝐠𝐞-𝐨𝐧-𝐑𝐞𝐚𝐝 (𝐌𝐎𝐑) tables to absorb changes.
    📈 Large volumes of mutable workloads: use 𝐫𝐞𝐜𝐨𝐫𝐝-𝐥𝐞𝐯𝐞𝐥 𝐢𝐧𝐝𝐞𝐱𝐢𝐧𝐠 to find records fast and keep write latency low.
    🗂️ Small file issues: use 𝐚𝐮𝐭𝐨-𝐟𝐢𝐥𝐞 𝐬𝐢𝐳𝐢𝐧𝐠 and 𝐚𝐬𝐲𝐧𝐜 𝐜𝐨𝐦𝐩𝐚𝐜𝐭𝐢𝐨𝐧/𝐜𝐥𝐮𝐬𝐭𝐞𝐫𝐢𝐧𝐠.
    ⚔️ Conflicts and retries with OCC: Hudi 1.0 introduced 𝐍𝐨𝐧-𝐁𝐥𝐨𝐜𝐤𝐢𝐧𝐠 𝐂𝐨𝐧𝐜𝐮𝐫𝐫𝐞𝐧𝐜𝐲 𝐂𝐨𝐧𝐭𝐫𝐨𝐥 (𝐍𝐁𝐂𝐂) to avoid costly retries.
    📜 Long commit history: Hudi 1.0's 𝐋𝐒𝐌 𝐓𝐢𝐦𝐞𝐥𝐢𝐧𝐞 optimizes timeline access and storage.
    Thanks again to the Data Streaming Summit by StreamNative for having me! #DataLakeHouse #DataStreaming #ApacheHudi #DSSSF25

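    As a rough illustration of the knobs behind that recap, here is a minimal PySpark sketch of a streaming-friendly Hudi write: a MOR table with a record-level index, auto file sizing, and compaction kept off the write path. Option names reflect recent Hudi releases; the table, fields, and values are illustrative assumptions, not from the talk.

    ```python
    # Hedged sketch: streaming-friendly write options mapping to the recap above.
    streaming_opts = {
        "hoodie.table.name": "events",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # absorb frequent updates
        "hoodie.datasource.write.recordkey.field": "event_id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.index.type": "RECORD_INDEX",                    # record-level index (Hudi >= 0.14)
        "hoodie.metadata.record.index.enable": "true",
        "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # auto file sizing threshold
        "hoodie.compact.inline": "false",                       # compact asynchronously, off the write path
    }

    (events_df.write.format("hudi")    # events_df: a micro-batch DataFrame (assumed)
        .options(**streaming_opts)
        .mode("append")
        .save("s3://bucket/lake/events"))
    ```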
  • Apache Hudi reposted this

    Yesterday our Founder/CEO, Vinoth Chandar, presented at Carnegie Mellon University's Database Group Seminar Series, hosted by Andy Pavlo. It's an honor to contribute to this longstanding forum for database systems research.
    In his talk, "Apache Hudi: A Database Layer over Cloud Storage for Fast Mutations and Efficient Queries," Vinoth breaks down Apache Hudi as both a storage engine and a table format for the lakehouse.
    An interesting part of the talk discussed how Hudi navigates the "RUM conjecture" (the trade-offs between Read, Update, and Memory overheads). By using fixed-size file groups, Hudi optimizes for reads (via query-friendly layouts) and updates/deletes (via key-based indexing), at the expense of additional storage for metadata and versioning.
    Another deep dive covered Hudi's new partial update encoding for columnar writes, which reduces write amplification by modifying only the changed fields (a hedged write sketch follows below). Vinoth detailed benchmarks measuring up to 70x fewer bytes written and 5.7x faster queries on a 1TB MOR table with sparse updates.
    If you are working on distributed systems, databases, or data lakes, you won't want to miss this technical session. Catch the recording on YouTube 👉 : https://lnkd.in/gQguNVEn
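    For readers who want to try partial updates at the API level today, here is a minimal PySpark sketch using Hudi's PartialUpdateAvroPayload. Note this is an illustrative stand-in: the columnar partial update encoding discussed in the talk is internal to the Hudi 1.0 storage engine, and the table, key, and field names here are assumptions.

    ```python
    # Hedged sketch: partial updates via Hudi's PartialUpdateAvroPayload.
    # Incoming rows keep the full table schema but leave unchanged fields
    # null; the payload fills nulls from the stored record on merge.
    partial_update_opts = {
        "hoodie.table.name": "user_profiles",
        "hoodie.datasource.write.recordkey.field": "user_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.payload.class":
            "org.apache.hudi.common.model.PartialUpdateAvroPayload",
    }

    (changed_df.write.format("hudi")   # changed_df: rows with nulls for untouched fields (assumed)
        .options(**partial_update_opts)
        .mode("append")
        .save("s3://bucket/lake/user_profiles"))
    ```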

  • Apache Hudi reposted this

    Join us tomorrow for a live technical session on how to cut AWS EMR costs for Apache Hudi. If you are using EMR and your bills or pipeline performance are hard to manage, you won't want to miss this session. Special guests Y Ethan Guo, Apache Hudi PMC member and Onehouse engineer, and Sagar Lakshmipathy, Onehouse Solutions Architect, will deliver deep content and demos. Register with the link in the comments, and if you can't make it we will send you the recording 👇

  • Ever wondered how Apache Hudi ensures every transaction on a table gets a unique, ever-increasing timestamp, even with multiple writers working at once? 🤔 It's a fascinating distributed systems challenge! The secret lies in an implementation inspired by Google Spanner's famous TrueTime API. 💡
    The Challenge: Time in Distributed Systems ⏰
    In any distributed system, different machines (or "nodes") have their own clocks. These clocks can drift and fall out of sync, creating chaos when you need to know the exact order of operations. For data platforms, this can be a huge problem for data consistency and reliability. Google Spanner tackled this with its TrueTime API, which provides a globally synchronized clock with a guaranteed, bounded level of uncertainty. This allows Spanner to assign timestamps with confidence, ensuring external consistency for distributed transactions. Many powerful OLTP databases like CockroachDB rely on similar principles.
    Hudi's Smart Adaptation 🚀
    Apache Hudi cleverly adapts these concepts for its own timeline, guaranteeing monotonically increasing instant times. This ensures a reliable and ordered history of all transactions. Here's how it works in a multi-writer scenario:
    1️⃣ Locking Mechanism: Hudi uses a distributed lock, ensuring that only one writer can generate a timestamp at any given moment.
    2️⃣ Accommodating Clock Drift: The writer that acquires the lock generates a timestamp based on its local machine time. Then it intentionally waits for a configured period (e.g., X milliseconds). This "sleep" period is crucial—it accounts for the maximum expected clock drift between any two writers in the system.
    3️⃣ Guaranteed Order: By the time the writer releases the lock, enough real-world time has passed that any subsequent writer acquiring the lock is guaranteed to generate a timestamp greater than the previous one, even if its local clock is slightly behind.
    Let's walk through a typical scenario (sketched in code below): Writer 1 (W1) and Writer 2 (W2) are writing to the same Hudi table. Say W2's clock is slightly behind W1's, but by less than the configured clock drift (X ms). W1 acquires the lock, generates its timestamp, and then sleeps for X ms before releasing the lock. ✍️ W2, which was waiting, now acquires the lock and repeats the process. Because of the built-in wait from W1's operation, W2's new timestamp will definitively be later than W1's.
    This elegant solution ensures that no matter which writer gets the lock first, transaction timestamps always move forward, never backward. 📈 Because Hudi transactions typically last more than a second, the scheme can tolerate a generous uncertainty bound (e.g., >100 ms) while still generating high-fidelity timestamps without performance issues.
    It's a powerful example of applying proven distributed systems theory to solve real-world data engineering problems! #ApacheHudi #DistributedSystems #Databases

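    To make the walkthrough concrete, here is a toy single-process model of the lock-then-wait scheme, with a threading.Lock standing in for Hudi's distributed lock. This is a minimal sketch of the idea described above, not Hudi's actual implementation.

    ```python
    import threading
    import time

    class MonotonicInstantGenerator:
        """Toy model: generate strictly increasing instant timestamps."""

        def __init__(self, max_drift_ms: int = 200):
            self._lock = threading.Lock()      # stand-in for the distributed lock
            self._max_drift_s = max_drift_ms / 1000.0

        def next_instant_millis(self) -> int:
            with self._lock:
                instant = int(time.time() * 1000)  # timestamp from the local clock
                # Wait out the worst-case clock drift before releasing the lock,
                # so the next writer's clock has definitively moved past this
                # instant even if it started out behind by up to max_drift_ms.
                time.sleep(self._max_drift_s)
                return instant
    ```

    In the W1/W2 scenario above, W2 can only take the lock after W1's drift wait has elapsed, so its timestamp is guaranteed to land later than W1's.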
  • Let's be real: in AI, 𝐢𝐭'𝐬 𝐠𝐚𝐫𝐛𝐚𝐠𝐞 𝐢𝐧, 𝐠𝐚𝐫𝐛𝐚𝐠𝐞 𝐨𝐮𝐭. 🗑️➡️🗑️ A fancy Retrieval-Augmented Generation (RAG) system is useless if its data is stale. Data engineering lays the crucial foundation that makes AI applications like recommenders actually work and feel current. But how do you keep your RAG app's data fresh without rebuilding your entire dataset from scratch every time?
    Hudi PMC member Shiyan Xu wrote an awesome two-part blog series that shows you exactly how, complete with a cool demo app. He walks you through building a RAG-based recommender that's always up to date, thanks to incremental processing.
    𝐏𝐚𝐫𝐭 𝟏: 𝐋𝐚𝐲𝐢𝐧𝐠 𝐭𝐡𝐞 𝐆𝐫𝐨𝐮𝐧𝐝𝐰𝐨𝐫𝐤 🏗️ Get the blueprint for a RAG-powered AI recommender. Part 1 breaks down the key concepts and architecture, showing how to design a system that connects user history and product info with an LLM for smart recommendations. 🔗 Read Part 1 here: https://lnkd.in/gCv8PCYe
    𝐏𝐚𝐫𝐭 𝟐: 𝐅𝐫𝐨𝐦 𝐂𝐨𝐧𝐜𝐞𝐩𝐭 𝐭𝐨 𝐂𝐨𝐝𝐞 ✨ Part 2 is where the rubber meets the road. Nobody likes stale recommendations, and this post dives into the data engineering solution. You'll see how to leverage Apache Hudi's incremental queries to efficiently process only the newest data, keeping your vector DB fresh without costly full rebuilds (a minimal sketch of an incremental read follows below). Plus, there's a hands-on example building it all out with FastAPI, Qdrant, and OpenAI APIs. 🔗 Read Part 2 here: https://lnkd.in/gCT9C_Bv
    Overall, this series breaks down:
    💡 How to build a RAG-based AI recommender from the ground up.
    💡 A super practical way to solve the stale-data problem with Hudi's incremental processing.
    💡 What a real, end-to-end RAG system looks like under the hood.
    What are the biggest hurdles you face when operationalizing your AI and RAG projects? Drop a comment below! 👇 #AI #DataEngineering #ApacheHudi #RAG #LLM

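    To ground the incremental-processing idea, here is a minimal PySpark sketch of the kind of incremental read that keeps a vector DB fresh. The path, checkpoint instant, column names, and the upsert_embeddings() helper are illustrative assumptions, not code from the blog series.

    ```python
    # Hedged sketch: read only the commits after a checkpointed instant, then
    # re-embed and upsert just the changed rows into a vector DB.
    base_path = "s3://bucket/lake/products"
    last_instant = "20250901000000000"  # checkpoint saved by the previous run (assumed)

    fresh_df = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", last_instant)
        .load(base_path))

    for row in fresh_df.select("product_id", "description").collect():
        upsert_embeddings(row["product_id"], row["description"])  # hypothetical helper
    ```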
