Best Open Source Data Integration Tools 2026

Data Integration Tools

Browse free open source Data Integration tools and projects for Windows and Mac below.

1

Pentaho

Pentaho offers a comprehensive data integration and analytics platform.

Pentaho couples data integration with business analytics in a modern platform to easily access, visualize, and explore data that impacts business results. Use it as a full suite or as individual components that are accessible on-premises, in the cloud, or on-the-go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications, all from within an intuitive and easy-to-use graphical tool. The Pentaho Enterprise Edition Free Trial can be obtained from https://pentaho.com/download/

69 Reviews

Downloads: 1,261 This Week

Last Update: 2025-02-06
See Project
2

Pentaho Data Integration

Pentaho Data Integration (ETL), a.k.a. Kettle

Pentaho Data Integration is built with Maven. The project distribution archive is produced under the assemblies module. The other modules cover the core implementation, database dialog, user interface, PDI engine, PDI engine extensions, PDI core plugins, and integration tests. Maven 3+ and Java JDK 1.8 are prerequisites. Using the Pentaho checkstyle format (via mvn checkstyle:check and reviewing the report) and developing working unit tests help ensure that pull requests for bugs and improvements are processed quickly. In addition to the unit tests, there are integration tests that test cross-module operation.

Downloads: 51 This Week

Last Update: 2021-11-08
See Project
3

Airbyte

Data integration platform for ELT pipelines from APIs, databases

We believe that only an open-source solution to data movement can cover the long tail of data sources while empowering data engineers to customize existing connectors. Our ultimate vision is to help you move data from any source to any destination. Airbyte already provides the largest catalog of 300+ connectors for APIs, databases, data warehouses, and data lakes. Moving critical data with Airbyte is as easy and reliable as flipping on a switch. Our teams process more than 300 billion rows each month for ambitious businesses of all sizes. Enable your data engineering teams to focus on projects that are more valuable to your business. Building and maintaining custom connectors has become 5x easier with Airbyte. With an average response rate of 10 minutes or less and a Customer Satisfaction score of 96/100, our team is ready to support your data integration journey all over the world.

Downloads: 7 This Week

Last Update: 2025-10-15
See Project
4

Common Core Ontologies

The Common Core Ontology Repository

The Common Core Ontologies (CCO) comprise twelve ontologies that are designed to represent and integrate taxonomies of generic classes and relations across all domains of interest. CCO is a mid-level extension of Basic Formal Ontology (BFO), an upper-level ontology framework widely used to structure and integrate ontologies in the biomedical domain (Arp et al., 2015). BFO aims to represent the most generic categories of entity and the most generic types of relations that hold between them by defining a small number of classes and relations. CCO extends BFO in the sense that every class in CCO is asserted to be a subclass of some class in BFO, and that CCO adopts the generic relations defined in BFO (e.g., has_part) (Smith and Grenon, 2004). Accordingly, CCO classes and relations are heavily constrained by the BFO framework, from which CCO inherits many of its basic semantic relationships.

Downloads: 6 This Week

Last Update: 2024-11-06
See Project
5

PHPCI

PHPCI is a free and open source continuous integration tool

PHPCI is a continuous integration (CI) server designed specifically for PHP applications. It automates tasks such as testing, code quality checks, and deployment, helping developers maintain code consistency and detect issues early. PHPCI supports various plugins and tools, including PHPUnit, PHPMD, and Codeception, making it highly customizable for different project needs.

Downloads: 6 This Week

Last Update: 2025-05-19
See Project
6

Recap

Recap tracks and transforms schemas across your whole application

Recap is a schema language and multi-language toolkit to track and transform schemas across your whole application. Your data passes through web services, databases, message brokers, and object stores. Recap describes these schemas in a single language, regardless of which system your data passes through. Recap schemas can be defined in YAML, TOML, JSON, XML, or any other compatible language.

Downloads: 6 This Week

Last Update: 2025-12-30
See Project
7

Gradle Docker Compose Plugin

Simplifies usage of Docker Compose for integration testing

The Gradle Docker Compose Plugin by Avast integrates Docker Compose lifecycle management into Gradle builds. It allows developers to define and manage Docker containers required for integration testing or local development directly from their Gradle build scripts. This plugin automates the startup and shutdown of services, supports container health checks, and enables tight integration between application code and containerized services, enhancing reproducibility and automation in development pipelines.

Downloads: 5 This Week

Last Update: 2026-02-03
See Project
8

Apache DevLake

Apache DevLake is an open-source dev data platform

Apache DevLake is an open-source dev data platform that ingests, analyzes, and visualizes the fragmented data from DevOps tools to extract insights for engineering excellence, developer experience, and community growth. Apache DevLake is designed for developer teams looking to make better sense of their development process and to bring a more data-driven approach to their own practices. You can ask Apache DevLake many questions regarding your development process. Just connect and query. Your Dev Data lives in many silos and tools. DevLake brings them all together to give you a complete view of your Software Development Life Cycle (SDLC). From DORA to scrum retros, DevLake implements metrics effortlessly with prebuilt dashboards supporting common frameworks and goals. DevLake fits teams of all shapes and sizes, and can be readily extended to support new data sources, metrics, and dashboards, with a flexible framework for data collection and transformation.

Downloads: 4 This Week

Last Update: 2026-02-03
See Project
9

Dagster

An orchestration platform for the development, production

Dagster is an orchestration platform for the development, production, and observation of data assets. Dagster as a productivity platform: With Dagster, you can focus on running tasks, or you can identify the key assets you need to create using a declarative approach. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early. Dagster as a robust orchestration engine: Put your pipelines into production with a robust multi-tenant, multi-tool engine that scales technically and organizationally. Dagster as a unified control plane: The ‘single plane of glass’ data teams love to use. Rein in the chaos and maintain control over your data as the complexity scales. Centralize your metadata in one tool with built-in observability, diagnostics, cataloging, and lineage. Spot any issues and identify performance improvement opportunities.

Downloads: 3 This Week

Last Update: 2026-02-05
See Project
10

PhantomJS-Node

PhantomJS integration module for NodeJS

PhantomJS-Node is a Node.js bridge to PhantomJS, enabling programmatic control of the headless browser for tasks like web scraping, automated testing, and page rendering.

Downloads: 3 This Week

Last Update: 2025-02-05
See Project
11

nichenetr

NicheNet: predict active ligand-target links between interacting cells

nichenetr is the R implementation of the NicheNet method. The goal of NicheNet is to study intercellular communication from a computational perspective. NicheNet uses human or mouse gene expression data of interacting cells as input and combines this with a prior model that integrates existing knowledge on ligand-to-target signaling paths. This makes it possible to predict ligand-receptor interactions that might drive gene expression changes in cells of interest. This model of prior information on potential ligand-target links can then be used to infer active ligand-target links between interacting cells. NicheNet prioritizes ligands according to their activity (i.e., how well they predict observed changes in gene expression in the receiver cell) and looks for affected targets with high potential to be regulated by these prioritized ligands.

Downloads: 3 This Week

Last Update: 2024-09-05
See Project
12

Hetionet

Hetionet: an integrative network of disease

Hetionet is a hetnet — network with multiple node and edge (relationship) types — which encodes biology. The hetnet was designed for Project Rephetio, which aims to systematically identify why drugs work and predict new therapies for drugs. The JSON and Neo4j formats contain node and edge properties, which are absent in the TSV and matrix formats, including licensing information. Therefore the recommended formats are JSON and Neo4j. Our hetio package in Python reads the JSON format, but it is otherwise a simple yet new format. The Neo4j graph database has an established and thriving ecosystem. However, if you would like to access Hetionet without Neo4j, then we suggest the JSON format. The matrix format refers to HetMat archives, which store edge adjacency matrices on disk. Additional usage information is available at the corresponding download locations.
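
To make the idea of a hetnet (a network with multiple node and edge types) more concrete, here is a minimal Python sketch. It is not the hetio package's API and it does not mirror Hetionet's JSON layout; the class name HetNet, its methods, and the placeholder identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class HetNet:
    """Toy heterogeneous network: typed nodes and typed edges (illustrative only)."""

    def __init__(self):
        # (node kind, identifier) -> dict of node properties
        self.nodes = {}
        # edge kind -> list of (source node, target node, edge properties)
        self.edges = defaultdict(list)

    def add_node(self, kind, identifier, **props):
        self.nodes[(kind, identifier)] = props

    def add_edge(self, kind, source, target, **props):
        # Both endpoints must already exist so edges never dangle.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[kind].append((source, target, props))

    def neighbors(self, node, edge_kind):
        """Return targets reachable from `node` over edges of one kind."""
        return [t for s, t, _ in self.edges[edge_kind] if s == node]


# Placeholder identifiers, not real Hetionet records.
net = HetNet()
net.add_node("Compound", "compound-A", name="example compound")
net.add_node("Disease", "disease-X", name="example disease")
net.add_node("Gene", "gene-1", name="example gene")
net.add_edge("treats", ("Compound", "compound-A"), ("Disease", "disease-X"))
net.add_edge("binds", ("Compound", "compound-A"), ("Gene", "gene-1"))

print(net.neighbors(("Compound", "compound-A"), "treats"))
```

Keeping the node kind inside the node key and the edge kind as the outer dictionary key means a query can follow one relationship type at a time, which is the property that distinguishes a hetnet from a plain graph.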

Downloads: 2 This Week

Last Update: 2023-06-12
See Project
13

Open Source Data Quality and Profiling

World's first open source data quality & data preparation project

This project is dedicated to open source data quality and data preparation solutions. Data quality includes profiling, filtering, governance, similarity checks, data enrichment and alteration, real-time alerting, basket analysis, bubble charts, warehouse validation, single customer view, etc., as defined by strategy. The project is developing a high-performance integrated data management platform that seamlessly handles Data Integration, Data Profiling, Data Quality, Data Preparation, Dummy Data Creation, Metadata Discovery, Anomaly Discovery, Data Cleansing, Reporting, and Analytics. It also has Hadoop (big data) support to move files to and from a Hadoop grid and to create, load, and profile Hive tables. This project is also known as "Aggregate Profiler". A RESTful API for this project is being built (beta version) at https://sourceforge.net/projects/restful-api-for-osdq/ and Apache Spark based data quality is being built at https://sourceforge.net/projects/apache-spark-osdq/

8 Reviews

Downloads: 7 This Week

Last Update: 2021-01-20
See Project
14

Apache Hudi

Upserts, Deletes And Incremental Processing on Big Data

Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage). Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low-latency, minute-level analytics. Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
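
The key-to-file mapping described above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Hudi's implementation or API: the class ToyKeyIndex, the max_keys_per_file limit, and the file id naming are all hypothetical. It only shows the invariant that a hoodie key (record key + partition path) keeps the file id it was first assigned, so every later version of that record lands in the same file group.

```python
class ToyKeyIndex:
    """Toy index: (record key, partition path) -> file id, fixed after first write."""

    def __init__(self, max_keys_per_file=2):
        self.max_keys_per_file = max_keys_per_file  # hypothetical sizing rule
        self.key_to_file = {}   # hoodie key -> file id
        self.file_groups = {}   # file id -> {hoodie key: [versions]}
        self.open_file = {}     # partition path -> file id accepting new keys

    def _assign_file(self, partition_path):
        # Reuse the partition's open file until it holds enough distinct keys.
        file_id = self.open_file.get(partition_path)
        if file_id is None or len(self.file_groups[file_id]) >= self.max_keys_per_file:
            file_id = f"{partition_path}/file-{len(self.file_groups)}"
            self.file_groups[file_id] = {}
            self.open_file[partition_path] = file_id
        return file_id

    def upsert(self, record_key, partition_path, value):
        key = (record_key, partition_path)
        file_id = self.key_to_file.get(key)
        if file_id is None:
            # First write of this key: pick a file once and remember it.
            file_id = self._assign_file(partition_path)
            self.key_to_file[key] = file_id
        # Inserts and updates both append a version to the mapped file group.
        self.file_groups[file_id].setdefault(key, []).append(value)
        return file_id


index = ToyKeyIndex()
first = index.upsert("user-1", "2025/01", {"plan": "free"})
second = index.upsert("user-1", "2025/01", {"plan": "pro"})
print(first == second)
```

Because an update resolves to a known file id through the index, it does not require scanning the whole dataset to find the record, which is the point the description makes about efficient upserts.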

Downloads: 1 This Week

Last Update: 2025-12-18
See Project
15

CI Tools Demo

Docker Infrastructure via docker-compose

This repository provides a Docker-powered CI tools demo environment that starts with a single docker-compose command. It assembles popular CI/CD components (Jenkins, SonarQube, Nexus, GitLab, and Selenium Grid), each running in its own isolated container for modular experimentation, which suits self-contained integration testing or workshops. It is maintained primarily for workshops and proofs of concept and is not intended for production use, though it can serve as a launchpad for containerized CI stacks. It includes legacy documentation and scripting for Mac users and older setups.

Downloads: 1 This Week

Last Update: 2025-09-02
See Project
16

KubeRay

A toolkit to run Ray applications on Kubernetes

KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. It offers several key components. KubeRay core: This is the official, fully-maintained component of KubeRay that provides three custom resource definitions, RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease.

Downloads: 1 This Week

Last Update: 2025-11-21
See Project
17

Nest Manager

NST Manager (SmartThings)

Nest Manager is a community SmartThings solution that integrates Nest devices—thermostats, Protects, and cameras—into the SmartThings ecosystem via a comprehensive SmartApp and device handlers. It offers a unified dashboard, rich device tiles, and automation hooks so users can monitor and control temperature, modes, and alerts alongside other smart home devices. The project emphasizes usability with guided setup flows, status summaries, and in-app diagnostics to help troubleshoot connectivity or permission issues. It exposes detailed attributes and commands, enabling powerful rules and scenes that coordinate Nest with sensors, presence, and schedules in SmartThings. Historical and environmental data can be surfaced to support energy-aware automations and notifications. For advanced users, it provides granular preferences to tune polling, event verbosity, and safety behaviors, turning SmartThings into a capable hub for Nest-centric homes.

Downloads: 1 This Week

Last Update: 2025-09-03
See Project
18

reticulate

R Interface to Python

reticulate is an R package from Posit that creates seamless interoperability between R and Python. It lets you call Python modules, classes, and functions from within R, automatically translating between R and Python data structures. Useful for combining Python tooling with R projects, data analysis, and RMarkdown reports.

Downloads: 1 This Week

Last Update: 2025-11-14
See Project
19

EasyDataQuality for Pentaho Kettle

EasyDataQuality for Pentaho Data Integration in Kettle

EasyDQ plugins for Contact cleansing in Pentaho Data Integration in Kettle.

1 Review

Downloads: 3 This Week

Last Update: 2016-04-26
See Project
20

Jaspersoft ETL

Jaspersoft ETL is a data integration platform providing high performance data extract-transform-load (ETL) capabilities. Jaspersoft ETL is appropriate for all analytic and operational data integration needs. Activity on this project is located at jas

Downloads: 4 This Week

Last Update: 2013-04-16
See Project
21

Daffodil Replicator

Daffodil Replicator is a powerful Open Source Java tool for data integration, data migration and data protection in real time. It allows bi-directional data replication and synchronization between homogeneous / heterogeneous databases including Oracle, M

1 Review

Downloads: 2 This Week

Last Update: 2019-06-12
See Project
22

ADempiere Compiere Kettle or PDI

Templates for integrating the data structures of Compiere, Openbravo, or ADempiere for all kinds of Pentaho Data Integration processes. Later on we plan to migrate these to Talend too.

Downloads: 3 This Week

Last Update: 2015-08-01
See Project
23

ARSystem plugins for Pentaho Kettle

AR-System step and db plugins for Pentaho Data Integration Kettle V5

Lets you write via the API to an AR-System Server (BMC Remedy Action Request System). Includes two output step plugins, one input step plugin, and one database plugin. The step plugins require the database plugin.

Downloads: 3 This Week

Last Update: 2019-03-08
See Project
24

Metl ETL Data Integration

Simple message-based, web-based ETL integration

Metl is a simple, web-based ETL tool that allows for data integrations including database, files, messaging, and web services. Supports RDBMS, SOAP, HTTP, FTP, SFTP, XML, FIXLEN, CSV, JSON, ZIP, and more. Metl implements scheduled integration tasks without the need for custom coding or heavy infrastructure. It can be deployed in the cloud or in an internal data center, and it was built to allow developers to extend it with custom components.

Downloads: 3 This Week

Last Update: 2022-01-21
See Project
25

COMA Community Edition

Schema Matching Solution for Data Integration

COMA CE is the community edition of the well-established COMA project developed at the University of Leipzig. It comprises the parsers, matcher library, matching framework, and a sample GUI for tests and evaluations. COMA was initiated at the database chair of the University of Leipzig in 2002 and has received much positive feedback since. It stands out for its numerous matching strategies, which can be combined into large matching workflows and which enable reliable match results between different kinds of schemas.
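
As a rough illustration of combining several matching strategies into one workflow, the Python sketch below scores attribute pairs with two simple matchers and merges the scores with a weighted average. It is not COMA's matcher library or API: the functions name_matcher, type_matcher, combined_score, and match_schemas, the weights, and the threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def name_matcher(a, b):
    """String similarity of attribute names, in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a, b):
    """1.0 when the declared data types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def combined_score(a, b, matchers):
    """Weighted average of individual matcher scores."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(fn(a, b) * weight for fn, weight in matchers) / total_weight


def match_schemas(source, target, matchers, threshold=0.6):
    """Map each source attribute to its best target if the score clears the threshold."""
    result = {}
    for s in source:
        best = max(target, key=lambda t: combined_score(s, t, matchers))
        score = combined_score(s, best, matchers)
        if score >= threshold:
            result[s["name"]] = (best["name"], round(score, 2))
    return result


# Hypothetical schemas used only to exercise the sketch.
source = [{"name": "cust_name", "type": "string"}, {"name": "zip", "type": "string"}]
target = [
    {"name": "customer_name", "type": "string"},
    {"name": "postal_code", "type": "string"},
    {"name": "age", "type": "int"},
]
matchers = [(name_matcher, 0.7), (type_matcher, 0.3)]
print(match_schemas(source, target, matchers))
```

In this sketch "zip" stays unmatched because its name shares almost nothing with "postal_code"; a real workflow would add further matchers (for example synonym or instance-based ones) to recover such pairs.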

Downloads: 2 This Week

Last Update: 2016-03-18
See Project