Apache DataFusion

Software Development

Apache DataFusion is a fast, feature-rich, and extensible query engine built on the Apache Arrow memory model.

About us

Apache DataFusion is a fast, feature-rich, and extensible query engine built on the Apache Arrow memory model. "Out of the box," DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python bindings are also available. DataFusion features a full query planner; a columnar, streaming, multi-threaded, vectorized execution engine; and partitioned data sources. You can customize DataFusion at almost all points, including additional data sources, query languages, functions, custom operators, and more. See the Architecture section for more details.

Website
https://datafusion.apache.org
Industry
Software Development
Company size
51-200 employees
Type
Nonprofit
Founded
2020


Updates

  • Apache DataFusion reposted this

    This is interesting: the smallest #Sail instance on ClickBench using c6a.2xlarge (8 vCPUs and 16 GiB of memory) outperforms every #Spark and Spark accelerator configuration across all instance sizes. The best-performing Spark setup is the Comet Accelerator on c7a.metal-48xl (192 vCPUs and 384 GiB of memory). It’s wild that Sail, running on just 8 vCPUs and 16 GiB, still significantly outperforms an accelerated Spark workload running on 192 vCPUs and 384 GiB. Notably, both Sail and Comet are powered by #DataFusion. Benchmark: https://lnkd.in/gSMjxrtR Sail: https://lnkd.in/gQiakuU6

  • Apache DataFusion reposted this

    I'm pleased to announce the release of Apache DataFusion Comet 0.12.0, a high-performance accelerator for Apache Spark built on the DataFusion query engine. This release represents four weeks of development with 105 merged PRs from 13 contributors, delivering significant improvements across performance, functionality, and developer experience.

    Key highlights include experimental native Apache Iceberg scan support through iceberg-rust integration, enabling unmodified Iceberg Java to handle query planning while Comet provides native execution acceleration. We've also introduced new SQL functions including concat, abs, sha1, and hyperbolic trigonometric functions, along with the CometLocalTableScanExec operator for native local table scan support.

    The release features substantial code architecture improvements with unified operator serialization and refactored expression handling, making it easier for contributors to extend functionality. Configuration has been simplified with better environment variable support and streamlined memory management settings. We've addressed numerous bug fixes for string operations, join handling, and null value processing, while updating to Spark 3.5.7 and DataFusion 50.3.0.

    Documentation has been enhanced with improved contributor guides and consistent formatting. Comet continues to support Spark 3.4.3, 3.5.x, and 4.0.1 across multiple JDK and Scala versions. Read the blog post for more details. https://lnkd.in/gn3jWyen
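For context on what "accelerator" means in practice: Comet hooks into Spark through its plugin mechanism rather than requiring code changes. A configuration sketch only, assuming PySpark; the jar path and version are placeholders, and the exact settings should be taken from the Comet installation guide for your Spark/Scala versions:

```python
from pyspark.sql import SparkSession

# Configuration sketch: the jar path below is a placeholder.
spark = (
    SparkSession.builder
    .appName("comet-sketch")
    .config("spark.jars", "/path/to/comet-spark-spark3.5_2.12-0.12.0.jar")
    .config("spark.plugins", "org.apache.spark.CometPlugin")  # load Comet
    .config("spark.comet.enabled", "true")
    .config("spark.shuffle.manager",
            "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager")
    .getOrCreate()
)
# Queries run through this session use Comet's native operators where
# supported and fall back to regular Spark execution elsewhere.
```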

  • The Apache #datafusion community is excited to congratulate the Kubeflow Trainer project on the launch of its new Distributed Data Cache, which accelerates data loading and maximizes GPU utilization in ML training and optimization jobs on Kubernetes, now powered by #datafusion. 🎉 🚀 It's inspiring to see our query engine helping accelerate scalable, Cloud Native AI workloads in such a forward-looking project. We're grateful to the Kubeflow community for the collaboration and can't wait to see the innovations this unlocks for AI practitioners everywhere. Here's to faster pipelines, smoother iteration, and a growing ecosystem built on open data! Find the Kubeflow Trainer project at https://lnkd.in/gjQgUJhQ and https://lnkd.in/gFvhm56K Thanks to Akshay Chitneni, Rus P., and Andrey Velichkevich for the collaboration. #kubeflowtrainer #datafusion #kubeflow #cloud #aiml #distributed

  • Apache DataFusion reposted this

    Accelerating Apache Spark's Execution Engine

    There's been some serious work in the Spark performance space, particularly around optimizing its "execution engine". Let's take a step back. A typical query engine consists of 5 key components:
    - Language frontend (e.g., SQL parser)
    - Intermediate Representation (IR)
    - Optimizer
    - Execution Engine (where the real computation happens)
    - Execution Runtime (task coordination, fault tolerance)

    For years, Spark has relied on JVM-based execution: flexible and scalable, but not always the most efficient for CPU-bound tasks. To mitigate some of this overhead, Spark 2.0 introduced Whole-Stage Code Generation, replacing the Volcano iterator model with generated Java bytecode. This key improvement brought up to 2x speedups for many query workloads. But recent work is pushing this even further, introducing native vectorized execution using projects like Apache Arrow for fast in-memory columnar processing.

    Here are 3 cutting-edge efforts focused on accelerating Spark's execution layer with high-performance native engines:

    ✅ Velox:
    - A C++ native execution engine optimized for performance & modularity.
    - Instead of every system reinventing the execution layer, Velox offers reusable components for scan, filter, join, aggregation, etc.
    - Advanced optimizations like SIMD, lazy evaluation, and adaptive query execution.

    ✅ Apache Gluten:
    - Gluten acts as the JNI bridge between Spark SQL and native engines like Velox.
    - Spark's logical plan stays intact, but execution is offloaded to Velox.
    - Removes the bottlenecks of JVM execution.

    ✅ Apache DataFusion Comet:
    - An integration of Apache DataFusion (a Rust-based query engine) into Spark.
    - Spark's execution engine is replaced with DataFusion's native path.
    - No code changes needed: zero-friction drop-in acceleration.

    The shift from JVM-bound execution to native vectorized engines is already showing measurable impact in production. We're entering a new phase where modular, embeddable native engines are going to accelerate the next generation of open lakehouse compute. Some helpful videos in comments! #dataengineering #softwareengineering
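The Volcano iterator model that whole-stage codegen replaced is easy to sketch: each operator pulls one row at a time from its child, which is flexible but pays a per-row function-call cost, exactly the overhead that codegen and vectorized engines eliminate. A toy illustration in generic Python, not Spark's actual internals:

```python
# Volcano (pull-based iterator) model: each operator is an iterator that
# pulls one row at a time from its child.

def scan(rows):
    yield from rows                  # leaf operator: emit source rows

def filter_op(child, predicate):
    for row in child:                # pull a row, test it, pass it up
        if predicate(row):
            yield row

def project(child, fn):
    for row in child:                # pull a row, transform it, pass it up
        yield fn(row)

# Compose a tiny plan: scan -> filter (keep odd) -> project (x10).
plan = project(filter_op(scan([1, 2, 3, 4, 5]), lambda r: r % 2 == 1),
               lambda r: r * 10)
print(list(plan))  # → [10, 30, 50], produced one row at a time
```

Vectorized engines like DataFusion and Velox keep this operator composition but pass Arrow-style column batches instead of single rows, amortizing the call overhead and enabling SIMD.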

  • Apache DataFusion reposted this

    I have been following the amazing work that has been going on in the Apache DataFusion Comet project 🚀 Yesterday, the 0.11.0 version was released. Highlighting some of the new and improved features. It's worth watching the types of PRs landing in the project; they tackle a diverse set of concerns in the data systems space.

    ✅ Native Parquet Modular Encryption support:
    - Meaning even when Apache Spark offloads query execution to Comet's native (Rust) runtime, it can still decrypt encrypted Parquet files securely.
    - Uses Apache Arrow-rs & DataFusion under the hood.

    ✅ Improved Spark support:
    - Version 4.0.1.
    - Enhances ANSI mode support, adding full compliance for arithmetic, integral divide, rounding & remainder operations.

    ✅ Apache Iceberg:
    - 1.9.1 version support.
    - Adds a new API that removes the dependency on Parquet's ColumnDescriptor, resolving method-not-found errors and enabling smoother Iceberg-Comet integration.
    - Fixes connection leaks observed in production Iceberg workloads.

    ✅ Better memory management:
    - Brings huge improvements to memory management, making it easier to deploy and more resilient to out-of-memory conditions.

    Release notes in comments. Highly recommend following the project. #dataengineering #softwareengineering

  • Apache DataFusion reposted this

    I'm super proud of the 0.11.0 release for Comet! I think the memory management improvements will help users accelerate their Spark jobs with less time spent tuning, and features like Parquet Modular Encryption are great to see come together through collaboration with Arrow-rs and DataFusion.

    Andy Grove

    Original creator of Apache DataFusion. Apache Arrow & Apache DataFusion PMC Member.

    On behalf of the DataFusion PMC, I'm excited to announce the release of version 0.11.0 of the Comet accelerator for Apache Spark! Some of the highlights of this release are:
    - Parquet Modular Encryption support
    - Improved memory management
    - Improved Apache Spark 4.0 support
    - Expanded ANSI support
    - Improved support for complex types in shuffle
    - Native support for RangePartitioning
    - More expressions supported

    Read full details in the blog post: https://lnkd.in/gbhD24aH Get started by following the installation guide: https://lnkd.in/gDeUMJDs

  • Apache DataFusion reposted this

    This is an excellent example of the value of Apache Arrow & Apache DataFusion as the foundation for building new high-performance specialized databases.

    Mo Sarwat

    CEO @ Wherobots | We are hiring!

    Today, we launched SedonaDB, a new open-source, single-node analytical database engine built in Rust that's designed to treat spatial data as a first-class citizen. Unlike its distributed counterparts, such as SedonaSpark, SedonaDB is optimized for small-to-medium data analytics, offering simplicity and speed for single-machine environments. Wherobots donates SedonaDB to the open-source Apache Sedona community, to be released under the ASF 2.0 license.

    SedonaDB offers several features that make it a powerful tool for spatial analysis:
    - Spatial-Native Processing: SedonaDB is built from the ground up to handle spatial data side by side with non-spatial data. It supports spatial types, joins, coordinate reference systems (CRS), and functions without needing extensions or plugins.
    - Performance: It uses query optimizations, indexing, and data pruning to ensure high-performance spatial operations.
    - Ease of Use: It is easy to download, install, and embed into applications. It also provides familiar Python and SQL interfaces, with additional APIs for R and Rust.
    - Modern Engine: SedonaDB is built on top of Apache Arrow and Apache DataFusion, providing a modern, vectorized query engine.
    - Integration: It seamlessly integrates with GeoArrow, GeoParquet, and GeoPandas, making it easy to use with other popular geospatial libraries. It can query data stored locally or remotely in cloud storage such as AWS S3.

    SedonaDB and SedonaSpark are both necessary because they cater to different spatial data processing and AI needs based on scale and environment. SedonaSpark is ideal for large-scale workloads and production environments that already use Spark, such as joining 100 GBs to PBs of vector datasets with large raster datasets. Its distributed nature, however, introduces unnecessary overhead for smaller datasets, making local computations slower and more complex. In contrast, SedonaDB is optimized for smaller datasets and local computations, providing a faster and simpler solution. The two projects are being developed for full interoperability, ensuring that functions and SQL code can be easily transferred between them.

    SedonaDB github repo: https://lnkd.in/eB8suErW Apache Sedona blog: https://lnkd.in/eaRWJ2ug Wherobots announcement blog: https://lnkd.in/e7dhKSsi

  • Apache DataFusion reposted this

    New Lonboard release and new demo! Integrating marimo and Apache DataFusion to visualize the NYC taxi dataset. https://lnkd.in/egQh7pa7 I've been working on geospatial extensions for the Apache DataFusion SQL query engine, using GeoArrow as the underlying compute layout. It's early, but I'm working on fleshing out the PostGIS API. And there are Python bindings too! https://lnkd.in/eX8eApKZ. This was my first time using marimo and it was a joy to use! And its interactivity plays really nicely with Lonboard. Lonboard's 0.12 release improved the support for GeoArrow data types, and is moving towards being fully GeoArrow-native. Shapely is no longer a required dependency! https://lnkd.in/ergExzmu
