Apache DataFusion

Software Development

Apache DataFusion is a fast, feature-rich, and extensible query engine built on the Apache Arrow memory model.

About us

Apache DataFusion is a fast, feature-rich, and extensible query engine built on the Apache Arrow memory model. "Out of the box," DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python bindings are also available. DataFusion features a full query planner; a columnar, streaming, multi-threaded, vectorized execution engine; and partitioned data sources. You can customize DataFusion at almost all points, including additional data sources, query languages, functions, custom operators, and more. See the Architecture section for more details.

Website
https://datafusion.apache.org
Industry
Software Development
Company size
51-200 employees
Type
Nonprofit
Founded
2020


Updates

  • Apache DataFusion reposted this

    This is interesting: the smallest #Sail instance on ClickBench using c6a.2xlarge (8 vCPUs and 16 GiB of memory) outperforms every #Spark and Spark accelerator configuration across all instance sizes. The best-performing Spark setup is the Comet Accelerator on c7a.metal-48xl (192 vCPUs and 384 GiB of memory). It’s wild that Sail, running on just 8 vCPUs and 16 GiB, still significantly outperforms an accelerated Spark workload running on 192 vCPUs and 384 GiB. Notably, both Sail and Comet are powered by #DataFusion. Benchmark: https://lnkd.in/gSMjxrtR Sail: https://lnkd.in/gQiakuU6

  • Apache DataFusion reposted this

    I'm pleased to announce the release of Apache DataFusion Comet 0.12.0, a high-performance accelerator for Apache Spark built on the DataFusion query engine. This release represents four weeks of development with 105 merged PRs from 13 contributors, delivering significant improvements across performance, functionality, and developer experience.

    Key highlights include experimental native Apache Iceberg scan support through iceberg-rust integration, enabling unmodified Iceberg Java to handle query planning while Comet provides native execution acceleration. We've also introduced new SQL functions including concat, abs, sha1, and hyperbolic trigonometric functions, along with the CometLocalTableScanExec operator for native local table scan support.

    The release features substantial code architecture improvements with unified operator serialization and refactored expression handling, making it easier for contributors to extend functionality. Configuration has been simplified with better environment variable support and streamlined memory management settings. We've addressed numerous bug fixes for string operations, join handling, and null value processing, while updating to Spark 3.5.7 and DataFusion 50.3.0.

    Documentation has been enhanced with improved contributor guides and consistent formatting. Comet continues to support Spark 3.4.3, 3.5.x, and 4.0.1 across multiple JDK and Scala versions. Read the blog post for more details. https://lnkd.in/gn3jWyen
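For context on what "accelerator" means in practice: Comet hooks into Spark through its plugin mechanism rather than requiring code changes. A configuration sketch only, assuming PySpark; the jar path and version are placeholders, and the exact settings should be taken from the Comet installation guide for your Spark/Scala versions:

```python
from pyspark.sql import SparkSession

# Configuration sketch: the jar path below is a placeholder.
spark = (
    SparkSession.builder
    .appName("comet-sketch")
    .config("spark.jars", "/path/to/comet-spark-spark3.5_2.12-0.12.0.jar")
    .config("spark.plugins", "org.apache.spark.CometPlugin")  # load Comet
    .config("spark.comet.enabled", "true")
    .config("spark.shuffle.manager",
            "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager")
    .getOrCreate()
)
# Queries run through this session use Comet's native operators where
# supported and fall back to regular Spark execution elsewhere.
```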

  • The Apache #datafusion community is excited to congratulate the Kubeflow Trainer project on the launch of its new Distributed Data Cache, which accelerates data loading and maximizes GPU utilization in ML training and optimization jobs on Kubernetes, now powered by #datafusion. 🎉 🚀 It's inspiring to see our query engine helping accelerate scalable, Cloud Native AI workloads in such a forward-looking project. We're grateful to the Kubeflow community for the collaboration and can't wait to see the innovations this unlocks for AI practitioners everywhere. Here's to faster pipelines, smoother iteration, and a growing ecosystem built on open data! Find the Kubeflow Trainer project at https://lnkd.in/gjQgUJhQ and https://lnkd.in/gFvhm56K Thanks to Akshay Chitneni, Rus P., and Andrey Velichkevich for the collaboration. #kubeflowtrainer #datafusion #kubeflow #cloud #aiml #distributed

  • Apache DataFusion reposted this

    Accelerating Apache Spark's Execution Engine

    There's been some serious work in the Spark performance space, particularly around optimizing its "execution engine". Let's take a step back. A typical query engine consists of 5 key components:
    - Language frontend (e.g., SQL parser)
    - Intermediate Representation (IR)
    - Optimizer
    - Execution Engine (where the real computation happens)
    - Execution Runtime (task coordination, fault tolerance)

    For years, Spark has relied on JVM-based execution: flexible and scalable, but not always the most efficient for CPU-bound tasks. To mitigate some of this overhead, Spark 2.0 introduced Whole-Stage Code Generation, replacing the Volcano iterator model with generated Java bytecode. This key improvement brought up to 2x speedups for many query workloads. But recent work is pushing this even further, introducing native vectorized execution using projects like Apache Arrow for fast in-memory columnar processing.

    Here are 3 cutting-edge efforts focused on accelerating Spark's execution layer with high-performance native engines:

    ✅ Velox:
    - A C++ native execution engine optimized for performance & modularity.
    - Instead of every system reinventing the execution layer, Velox offers reusable components for scan, filter, join, aggregation, etc.
    - Advanced optimizations like SIMD, lazy evaluation, and adaptive query execution.

    ✅ Apache Gluten:
    - Gluten acts as the JNI bridge between Spark SQL and native engines like Velox.
    - Spark's logical plan stays intact, but execution is offloaded to Velox.
    - Removes the bottlenecks of JVM execution.

    ✅ Apache DataFusion Comet:
    - An integration of Apache DataFusion (a Rust-based query engine) into Spark.
    - Spark's execution engine is replaced with DataFusion's native path.
    - No code changes needed: zero-friction drop-in acceleration.

    The shift from JVM-bound execution to native vectorized engines is already showing measurable impact in production. We're entering a new phase where modular, embeddable native engines are going to accelerate the next generation of open lakehouse compute. Some helpful videos in comments! #dataengineering #softwareengineering
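The Volcano iterator model that whole-stage codegen replaced is easy to sketch: each operator pulls one row at a time from its child, which is flexible but pays a per-row function-call cost, exactly the overhead that codegen and vectorized engines eliminate. A toy illustration in generic Python, not Spark's actual internals:

```python
# Volcano (pull-based iterator) model: each operator is an iterator that
# pulls one row at a time from its child.

def scan(rows):
    yield from rows                  # leaf operator: emit source rows

def filter_op(child, predicate):
    for row in child:                # pull a row, test it, pass it up
        if predicate(row):
            yield row

def project(child, fn):
    for row in child:                # pull a row, transform it, pass it up
        yield fn(row)

# Compose a tiny plan: scan -> filter (keep odd) -> project (x10).
plan = project(filter_op(scan([1, 2, 3, 4, 5]), lambda r: r % 2 == 1),
               lambda r: r * 10)
print(list(plan))  # → [10, 30, 50], produced one row at a time
```

Vectorized engines like DataFusion and Velox keep this operator composition but pass Arrow-style column batches instead of single rows, amortizing the call overhead and enabling SIMD.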

  • Apache DataFusion reposted this

    I have been following the amazing work that has been going on in the Apache DataFusion Comet project 🚀 Yesterday, the 0.11.0 version was released. Highlighting some of the new and improved features. It's worth watching the types of PRs landing in the project; they tackle a diverse set of concerns in the data systems space.

    ✅ Native Parquet Modular Encryption support:
    - Meaning even when Apache Spark offloads query execution to Comet's native (Rust) runtime, it can still decrypt encrypted Parquet files securely.
    - Uses Apache Arrow-rs & DataFusion under the hood.

    ✅ Improved Spark support:
    - Version 4.0.1.
    - Enhances ANSI mode support, adding full compliance for arithmetic, integral divide, rounding & remainder operations.

    ✅ Apache Iceberg:
    - 1.9.1 version support.
    - Adds a new API that removes the dependency on Parquet's ColumnDescriptor, resolving method-not-found errors and enabling smoother Iceberg-Comet integration.
    - Fixes connection leaks observed in production Iceberg workloads.

    ✅ Better memory management:
    - Brings huge improvements to memory management, making it easier to deploy and more resilient to out-of-memory conditions.

    Release notes in comments. Highly recommend following the project. #dataengineering #softwareengineering

  • Apache DataFusion reposted this

    I'm super proud of the 0.11.0 release for Comet! I think the memory management improvements will help users accelerate their Spark jobs with less time spent tuning, and features like Parquet Modular Encryption are great to see come together through collaboration with Arrow-rs and DataFusion.

    Andy Grove

    Original creator of Apache DataFusion. Apache Arrow & Apache DataFusion PMC Member.

    On behalf of the DataFusion PMC, I'm excited to announce the release of version 0.11.0 of the Comet accelerator for Apache Spark! Some of the highlights of this release are:
    - Parquet Modular Encryption support
    - Improved memory management
    - Improved Apache Spark 4.0 support
    - Expanded ANSI support
    - Improved support for complex types in shuffle
    - Native support for RangePartitioning
    - More expressions supported

    Read full details in the blog post: https://lnkd.in/gbhD24aH Get started by following the installation guide: https://lnkd.in/gDeUMJDs

  • Apache DataFusion reposted this

    This is an excellent example of the value of Apache Arrow & Apache DataFusion as the foundation for building new high-performance specialized databases.

    Mo Sarwat

    CEO @ Wherobots | We are hiring!

    Today, we launched SedonaDB, a new open-source, single-node analytical database engine built in Rust that's designed to treat spatial data as a first-class citizen. Unlike its distributed counterparts, such as SedonaSpark, SedonaDB is optimized for small-to-medium data analytics, offering simplicity and speed for single-machine environments. Wherobots donates SedonaDB to the open-source Apache Sedona community, to be released under the ASF 2.0 license.

    SedonaDB offers several features that make it a powerful tool for spatial analysis:
    - Spatial-Native Processing: SedonaDB is built from the ground up to handle spatial data side by side with non-spatial data. It supports spatial types, joins, coordinate reference systems (CRS), and functions without needing extensions or plugins.
    - Performance: It uses query optimizations, indexing, and data pruning to ensure high-performance spatial operations.
    - Ease of Use: It is easy to download, install, and embed into applications. It also provides familiar Python and SQL interfaces, with additional APIs for R and Rust.
    - Modern Engine: SedonaDB is built on top of Apache Arrow and Apache DataFusion, providing a modern, vectorized query engine.
    - Integration: It seamlessly integrates with GeoArrow, GeoParquet, and GeoPandas, making it easy to use with other popular geospatial libraries. It can query data stored locally or remotely in cloud storage such as AWS S3.

    SedonaDB and SedonaSpark are both necessary because they cater to different spatial data processing and AI needs based on scale and environment. SedonaSpark is ideal for large-scale workloads and production environments that already use Spark, such as joining 100 GBs to PBs of vector datasets with large raster datasets. Its distributed nature, however, introduces unnecessary overhead for smaller datasets, making local computations slower and more complex. In contrast, SedonaDB is optimized for smaller datasets and local computations, providing a faster and simpler solution. The two projects are being developed for full interoperability, ensuring that functions and SQL code can be easily transferred between them.

    SedonaDB github repo: https://lnkd.in/eB8suErW Apache Sedona blog: https://lnkd.in/eaRWJ2ug Wherobots announcement blog: https://lnkd.in/e7dhKSsi

  • Apache DataFusion reposted this

    New Lonboard release and new demo! Integrating marimo and Apache DataFusion to visualize the NYC taxi dataset. https://lnkd.in/egQh7pa7 I've been working on geospatial extensions for the Apache DataFusion SQL query engine, using GeoArrow as the underlying compute layout. It's early, but I'm working on fleshing out the PostGIS API. And there are Python bindings too! https://lnkd.in/eX8eApKZ. This was my first time using marimo and it was a joy to use! And its interactivity plays really nicely with Lonboard. Lonboard's 0.12 release improved the support for GeoArrow data types, and is moving towards being fully GeoArrow-native. Shapely is no longer a required dependency! https://lnkd.in/ergExzmu
