Apache Kafka

Software Development

Open-source distributed event streaming platform

About us

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Website
https://kafka.apache.org/
Industry
Software Development
Company size
1 employee
Type
Nonprofit

Updates

  • Apache Kafka reposted this

    View profile for Alex Campos

    Digital Data Strategist

    Customers are becoming digital, and digital needs data NOW. Data analytics quickly evolved from legacy Data Warehouses to Big Data, but always looking at "what happened", a retrospective view. Businesses need fresh data to move faster and make smarter decisions, and that means real-time data. I am happy to introduce the "Stream Processing Landscape", a general guideline for understanding how Apache Flink, a leading open source engine for real-time data, is setting the pace for stream processing and fits into the enterprise ecosystem.

    🟢 Structured: well-known and well-governed, structured data is estimated to be around 20% of all corporate data in the world. It is mainly stored in databases, with defined schemas. Apache Flink commonly leverages a CDC strategy to consume data from databases in real time.

    🟢 Unstructured: the remaining 80% of enterprise data is a combination of unstructured data formats. Cost-effective and scalable Big Data solutions, such as cloud-native storage and Apache Kafka, help companies safeguard and store years of logs, machine data, and images.

    🟢 Enterprise Apps: streaming data should integrate with the application ecosystem, triggering actions for next best offers, ad-hoc advertisement, and up-selling opportunities. Specialized solutions for marketing, point of sale, and enterprise management can be augmented with real-time data and AI.

    🟢 Data Ecosystem: Apache Flink leverages the most robust and mature frameworks and engines currently available in the data ecosystem, including open data formats like Apache Parquet and new Big Data management approaches such as Apache Iceberg and Fluss for Lakehouses. These open standards ensure interoperability and the freedom to adopt any tool from any vendor.

    #StreamProcessing #RealtimeData #ApacheFlink #ApacheIceberg #Lakehouse
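The CDC pattern mentioned above, consuming a database changelog and materializing it downstream, can be sketched in a few lines of Python. This is a toy illustration only, not Flink's CDC connector API; the event schema (`op`/`key`/`row`) is an assumption for the example.

```python
# Toy CDC consumer: applies insert/update/delete change events
# (as a CDC source would emit them) to an in-memory table keyed
# by primary key. The event schema here is assumed for illustration.

def apply_cdc_events(events):
    table = {}  # primary key -> current row
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            table[key] = ev["row"]      # upsert the latest image
        elif op == "delete":
            table.pop(key, None)        # drop the row if present
        else:
            raise ValueError(f"unknown op: {op}")
    return table

if __name__ == "__main__":
    changelog = [
        {"op": "insert", "key": 1, "row": {"name": "alice", "city": "NYC"}},
        {"op": "insert", "key": 2, "row": {"name": "bob", "city": "SF"}},
        {"op": "update", "key": 1, "row": {"name": "alice", "city": "LA"}},
        {"op": "delete", "key": 2, "row": None},
    ]
    print(apply_cdc_events(changelog))  # {1: {'name': 'alice', 'city': 'LA'}}
```

The point of the sketch: a CDC stream is a total order of row-level changes, so replaying it always reconstructs the current database state.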

  • Apache Kafka reposted this

    View profile for Kai Waehner

    Global Field CTO | Author | International Speaker | Follow me with Data in Motion

    Apache Kafka is getting native support for queues, and it's a game changer for the #datastreaming community.

    With the #opensource Kafka 4.0 release, #ApacheKafka went beyond logs and stream processing by introducing queue-style semantics through KIP-932. This new capability unlocks parallelism that isn't limited by partition count, perfect for real-world #jobqueue and #taskdistribution scenarios.

    The key innovation is the introduction of share groups, allowing multiple consumers to process messages from the same partition. This enhances Kafka's scalability for event-driven applications and simplifies patterns previously handled by external #messagequeues like RabbitMQ or ActiveMQ.

    Kafka's new queue support adds:
    - Fine-grained control with message-level acknowledgements (ack, release, reject)
    - High-throughput flexibility beyond the limits of traditional consumer groups
    - A powerful hybrid model combining queue semantics with Kafka's durable event log

    For use cases like inventory systems, sales event processing, or real-time analytics, this makes Kafka a unified solution, blending the benefits of #streamprocessing with the needs of flexible, scalable queuing.

    This isn't production-ready yet: no Dead Letter Queue (DLQ) support, no delayed retries, and no strict ordering per key. But for teams already invested in Kafka, it avoids the operational overhead of maintaining separate queue infrastructure for simpler tasks. Kafka continues evolving into the central nervous system for modern #cloudnative data architectures.

    Deep dive slide deck and video recording: https://lnkd.in/eHihr_Yr

    What's the first use case you'd move to Kafka queues?
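The share-group semantics described in the post (several consumers pulling from one partition, with per-message ack, release, and reject) can be sketched with a toy in-memory model. This illustrates the concept only; it is not the actual KafkaShareConsumer client API, and the class and state names are invented for the example.

```python
# Toy model of KIP-932 share-group semantics: multiple consumers
# pull from the SAME partition, and each message is individually
# acknowledged, released (for redelivery), or rejected.
# Conceptual sketch only, not the real Kafka client API.

AVAILABLE, ACQUIRED, ACKED, REJECTED = "available", "acquired", "acked", "rejected"

class SharePartition:
    def __init__(self, messages):
        self.records = [{"offset": i, "value": v, "state": AVAILABLE}
                        for i, v in enumerate(messages)]

    def poll(self):
        """Hand the next available message to whichever consumer asks."""
        for rec in self.records:
            if rec["state"] == AVAILABLE:
                rec["state"] = ACQUIRED
                return rec
        return None

    def ack(self, rec):      # processed successfully
        rec["state"] = ACKED

    def release(self, rec):  # transient failure: give it back for redelivery
        rec["state"] = AVAILABLE

    def reject(self, rec):   # permanent failure
        rec["state"] = REJECTED

if __name__ == "__main__":
    part = SharePartition(["job-a", "job-b", "job-c"])
    m1 = part.poll()    # consumer 1 acquires job-a
    m2 = part.poll()    # consumer 2 acquires job-b from the SAME partition
    part.ack(m1)
    part.release(m2)    # consumer 2 failed transiently
    m3 = part.poll()    # job-b becomes available again for any consumer
    print(m3["value"])  # job-b
```

Contrast this with a classic consumer group, where a partition is owned by exactly one consumer and acknowledgement is a single committed offset rather than a per-message state.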

  • Apache Kafka reposted this

    View profile for Dipankar Mazumdar

    Director-Data+GenAI @Cloudera | Apache Iceberg, Hudi Contributor | Author of “Engineering Lakehouses”

    Apache Flink + Apache Hudi: Changelog Mode for Streaming ETL.

    When building real-time data architectures, streaming engines like Apache Flink consume change events from sources like Apache Kafka or Apache Pulsar. But how do we persist those changes efficiently, support downstream incremental queries, and build layered analytics in a lakehouse? Enter Apache Hudi's changelog mode!

    At the heart of this setup is Flink's Dynamic Table abstraction, a powerful construct that allows you to treat "unbounded changelog streams" as continuously updating tables.
    - Flink reads from a source stream and materializes it as a Dynamic Table.
    - Continuous queries transform this table, outputting another Dynamic Table.
    - Hudi acts as the sink, consuming the changelog stream and storing it as a Merge-on-Read (MOR) table.

    The innovation here is that Hudi doesn't just store the data. It also serves as a materialization layer for the changelog stream, supporting native INSERT, UPDATE, and DELETE operations.

    How does it work? Flink's Dynamic Table introduces a metadata column called "RowKind", which can take values like:
    +I - INSERT
    -U / +U - UPDATE_BEFORE / UPDATE_AFTER
    -D - DELETE

    Hudi maps this directly via the "_hoodie_operation" column in changelog mode. This allows:
    ✅ Row-level changes to be encoded into log files
    ✅ Streaming readers to consume the data before compaction
    ✅ Preservation of order to support accurate incremental views

    This capability has been available since Hudi 0.9 and is actively used in production across streaming ETL pipelines.

    Traditional batch ETL pipelines suffer from high latency and heavy resource costs. With this Flink + Hudi pattern, you unlock:
    🌟 Low-latency data ingestion
    🌟 Fine-grained updates via changelogs
    🌟 SQL-based lakehouse transformations (bronze → silver → gold)

    Flink handles the stateful compute, and Hudi handles durable, versioned storage.
    I am presenting all about this architecture at the Data Streaming Summit 2025 (organized by StreamNative) next week. Join me (link in comments). #dataengineering #softwareengineering
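The RowKind semantics described in the post can be illustrated with a small Python sketch that replays a changelog stream into a keyed table, the way a changelog-mode sink materializes it. This is a conceptual toy, not Flink or Hudi internals; the record format is assumed for the example.

```python
# Toy replay of a Flink-style changelog stream using RowKind markers:
# +I insert, -U retract the old value (UPDATE_BEFORE),
# +U insert the new value (UPDATE_AFTER), -D delete.
# Conceptual illustration only, not Flink or Hudi internals.

def materialize(changelog):
    table = {}  # key -> current row
    for kind, key, row in changelog:
        if kind in ("+I", "+U"):
            table[key] = row
        elif kind in ("-U", "-D"):
            # -U retracts the pre-update image (the matching +U follows);
            # -D removes the row for good.
            table.pop(key, None)
        else:
            raise ValueError(f"unknown RowKind: {kind}")
    return table

if __name__ == "__main__":
    stream = [
        ("+I", "order-1", {"qty": 2}),
        ("+I", "order-2", {"qty": 5}),
        ("-U", "order-1", {"qty": 2}),   # UPDATE_BEFORE
        ("+U", "order-1", {"qty": 3}),   # UPDATE_AFTER
        ("-D", "order-2", {"qty": 5}),
    ]
    print(materialize(stream))  # {'order-1': {'qty': 3}}
```

Because order is preserved (the property the ✅ list above calls out), any prefix of the stream yields a consistent snapshot, which is what makes incremental views accurate.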

  • Game-changer feature. Diskless topics 🚀🚀🚀

    View profile for Stanislav Kozlovski

    🧠 'The Kafka Guy'

    🤩 NEW: Direct-to-S3 topics are coming to Apache Kafka!

    KIP-1150: Diskless Topics was just published on the mailing list. After WarpStream pioneered this game-changer in 2023, everyone else (Confluent, Bufstream, AutoMQ, Redpanda) has been busy implementing their own version. Now it's Kafka's turn to enjoy 𝟴𝟬%+ 𝗿𝗲𝗱𝘂𝗰𝗲𝗱 𝗶𝗻𝗳𝗿𝗮 𝗰𝗼𝘀𝘁𝘀.

    It proposes extending Kafka with so-called "Diskless" topics, which are:
    💡 leaderless - any broker can accept writes for the same partition
    💡 directly persisted in object storage (pluggable), without replicating between brokers

    If accepted, Kafka will become one of the only systems that supports BOTH low-latency classic topics and high-latency, cost-efficient diskless topics.

    I think this feature is a no-brainer. The benefits are simply too many to enumerate:
    💸 no replication networking cost
    💸 no producer networking costs - producers can send writes to the same zone because any broker can be a leader
    👌 easy to balance - since there is no state and no single leader, clients can be instantly moved to other brokers when hot spots form
    🤩 fast scalability - add or remove brokers in seconds, as if Kafka is nginx

    With it, Kafka becomes a very flexible system that can work efficiently both on-premise and in the cloud. Diskless topics also pave the way for a lot of further innovation:
    🧊 Iceberg topics
    🔥 parallel produce
    🌎 multi-region active-active setups
    ☄️ fast leaderless topics via quorum S3 Express writes

    The team at Aiven is driving this. 🦀 The KIP seems well thought out! Apparently they've been building this and are deciding to open source it. I think it's a brilliant strategic move for them. Keeping a fork would make it "yet another Kafka", but when you merge it into the open source project, it becomes THE Kafka. Distribution is everything in business: once this (killer) feature is merged, it'll go directly into the hands of the estimated 100,000+ companies using Apache Kafka.

    I will be diving a lot more into the details of how it's proposed to work, and the expected cost/ops savings. I think it can truly reduce costs by 𝟭𝟬𝘅 (yes, ten times less). 💰 The architecture simply allows for it. 🤷‍♂️ Some vendors claim to do the same, but after the applied margins it doesn't work out cheaper than classic Kafka at all. With this being open source, you can be sure it will!

    The KIP was freshly published, so it's still very early. Who knows, it might get rejected or abandoned 😱 The Aiven folks seem serious to me, and it would probably be a slow death blow to Kafka were it not to move in this direction, so I'm pretty optimistic this will be merged sooner rather than later! Suffice to say I am super excited to see this!!!

    -----------------------------------
    If you want to stay updated with this story, make sure to follow me here and like/repost to tell the algorithm. ✅ Stanislav Kozlovski

    KIP: https://lnkd.in/dbVHju27
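The leaderless write path described above can be sketched with a toy model: any broker uploads a batch straight to shared object storage, and a central coordinator orders the batches by assigning offsets, so no inter-broker replication is needed. All class names and structure here are assumptions for illustration, not the actual KIP-1150 design.

```python
# Toy model of a "diskless" topic: any broker writes batches directly
# to shared object storage; a coordinator assigns offsets to order them.
# No inter-broker replication. Conceptual sketch, not KIP-1150 itself.

class ObjectStore:
    """Stand-in for S3: durable storage shared by all brokers."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

class Coordinator:
    """Assigns a contiguous offset range to each committed batch."""
    def __init__(self):
        self.next_offset = 0
        self.log = []  # ordered (base_offset, object_key) pairs

    def commit(self, object_key, record_count):
        base = self.next_offset
        self.next_offset += record_count
        self.log.append((base, object_key))
        return base

class Broker:
    """Leaderless: every broker can accept writes for the partition."""
    def __init__(self, name, store, coordinator):
        self.name, self.store, self.coord = name, store, coordinator

    def produce(self, records):
        key = f"{self.name}-batch-{len(self.store.objects)}"
        self.store.put(key, records)                  # durable first...
        return self.coord.commit(key, len(records))   # ...then ordered

if __name__ == "__main__":
    store, coord = ObjectStore(), Coordinator()
    b1 = Broker("broker-1", store, coord)
    b2 = Broker("broker-2", store, coord)
    print(b1.produce(["a", "b"]))  # base offset 0
    print(b2.produce(["c"]))       # base offset 2, same partition, other broker
```

The sketch shows why zone-local produce works: durability comes from the object store, and ordering from the coordinator, so it no longer matters which broker a producer talks to.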

