Open Source Data Profiling Tools Guide
Open source data profiling tools are an increasingly popular way for businesses to gain insights about their customer bases. These tools allow organizations to extract, analyze, and visualize data from various sources. They enable users to profile data quickly and make informed decisions based on the results.
Open source data profiling tools help companies access information from many kinds of sources, such as web APIs, database tables, and text files. This lets organizations make sense of large amounts of data without having to invest in expensive proprietary software.
Data profiling is typically used to discover trends and patterns within a dataset. It can also be used to verify the integrity of incoming datasets or to create visualizations that highlight key aspects of the data. Data profiles usually include descriptive statistics such as means, percentages, and frequencies, as well as any outliers or anomalies in the dataset.
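As a rough illustration of what such a profile looks like in practice, the short sketch below uses pandas (one common choice; the sample data, column names, and outlier threshold are assumptions made purely for this example):

```python
import pandas as pd

# Hypothetical sample data; in practice this would come from a database
# table, CSV export, or API response.
df = pd.DataFrame({
    "region": ["north", "south", "south", "east", "north", "south"],
    "order_value": [120.0, 95.5, 101.2, 2400.0, 88.9, 110.4],
})

# Descriptive statistics: count, mean, standard deviation, min/max, quartiles.
print(df["order_value"].describe())

# Frequencies expressed as percentages for a categorical column.
print(df["region"].value_counts(normalize=True) * 100)

# A simple anomaly check: flag values more than two standard deviations
# from the mean (the threshold here is an assumption, not a fixed rule).
z = (df["order_value"] - df["order_value"].mean()) / df["order_value"].std()
print(df[z.abs() > 2])
```

Dedicated profiling tools automate these steps across entire datasets, but the summaries they produce are of this general kind.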
These open source programs also come with powerful visualization capabilities that let users explore data through charts and graphs, making complex datasets easier to interpret quickly and accurately. They also offer flexible query-building options, so users can retrieve the specific information they need without writing complicated queries or scripts in languages such as Python or SQL.
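As a minimal sketch of the visualization side (matplotlib is assumed here purely for illustration, and the null counts are made up), a profile summary can be turned into a chart in a few lines:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical profile output: missing-value counts per column.
null_counts = pd.Series({"customer_id": 0, "email": 42, "signup_date": 7})

# A simple bar chart makes data-quality gaps easy to spot at a glance.
null_counts.plot(kind="bar", title="Missing values per column")
plt.ylabel("Null count")
plt.tight_layout()
plt.show()
```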
Overall, open source data profiling tools are being adopted more widely because they are more cost-effective and flexible than proprietary solutions, which makes them appealing to businesses looking for deeper insight into their customers’ behavior.
What Features Do Open Source Data Profiling Tools Provide?
- Data Cleanup: Open source data profiling tools provide a range of features to help clean up and format your raw data. This includes options such as splitting long fields into multiple columns, converting text strings into numerical values, removing unnecessary characters or words, and more (several of these operations are sketched in code after this list).
- Data Validation: These tools can also validate data before it is used for analysis or reporting. This includes scanning datasets for incorrect values, ensuring that entries follow certain conventions (e.g., dates must be in valid date formats), and running other checks necessary for accurate results.
- Outlier Detection: Open source data profiling tools allow users to identify any outliers in the dataset quickly by running automated tests on specific values. By recognizing unusual patterns or occurrences in a dataset, users can make sure they get accurate results from their analyses.
- Duplicate Records Removal: To save time and resources when working with large datasets, open source data profiling tools can remove duplicate records from a dataset automatically. This helps ensure that analyses performed on the data yield reliable results without interference from repeated observations.
- Cross-Database Analysis & Comparison: Many open source data profiling tools come with features that enable you to perform cross-database analysis as well as compare two different databases side by side while looking for patterns or discrepancies between them.
- Report Generation: Most open source profiling tools provide an easy way to generate reports directly from your database, so you can view summaries of your findings at a glance rather than analyzing each line item manually.
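To make these features more concrete, here is a minimal pandas sketch of the kinds of operations they automate: cleanup, validation, outlier detection, and duplicate removal. The column names, sample values, and the IQR-based outlier rule are assumptions chosen for illustration, not the behavior of any particular tool.

```python
import pandas as pd

# Hypothetical raw dataset with typical problems: stray characters,
# an invalid date, and a duplicated record.
raw = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Ada Lovelace", "Alan Turing", None],
    "signup_date": ["2023-01-15", "2023-01-15", "2023-02-30", "not a date"],
    "order_total": ["$120.50", "$120.50", "95", "1e9"],
})

# Data cleanup: split a long field into columns and strip stray characters
# before converting text strings into numeric values.
raw[["first_name", "last_name"]] = raw["full_name"].str.split(n=1, expand=True)
raw["order_total"] = pd.to_numeric(
    raw["order_total"].str.replace("$", "", regex=False), errors="coerce"
)

# Data validation: entries must be valid dates; failures become NaT.
raw["signup_date"] = pd.to_datetime(raw["signup_date"], errors="coerce")

# Outlier detection: flag totals outside 1.5 * IQR of the quartiles.
q1, q3 = raw["order_total"].quantile([0.25, 0.75])
fence_low, fence_high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = raw[(raw["order_total"] < fence_low) | (raw["order_total"] > fence_high)]

# Duplicate removal: drop repeated observations so they don't skew results.
clean = raw.drop_duplicates()

print(clean)
print(outliers)
```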
Types of Open Source Data Profiling Tools
- OpenRefine: This tool is designed to help analyze and clean data, letting users easily identify patterns and fix errors that would otherwise be difficult or time-consuming to correct.
- Talend Open Studio: This open source platform allows users to access, transform, integrate and govern their big data. It also enables users to monitor the data’s performance in the cloud and build machine learning models faster using preconfigured workflows.
- Apache Flink: Apache Flink is written in Java and Scala and offers APIs for both streaming analytics and batch processing of large datasets. It supports real-time analysis of incoming data streams as well as offline batch jobs on historical data stored in distributed storage such as HDFS or S3.
- Dataiku: The open source version of this data platform lets you create rich interactive visualizations, perform predictive analytics with advanced machine learning algorithms, share results with other teams, and create custom applications around your datasets without having to write code.
- Trifacta: This tool helps you quickly explore different types of datasets from Excel spreadsheets to text files, web APIs, relational databases, JSON documents, etc., so you can identify meaningful insights faster and make more informed decisions about your business strategies.
- Keboola: This data profiling tool allows for the integration of information from multiple databases, spreadsheets and APIs. It also provides capabilities such as event-driven workflows, anomaly detection and forecasting.
What Are the Advantages Provided by Open Source Data Profiling Tools?
- Cost-Effectiveness: One of the most attractive benefits of open source data profiling tools is their cost-effectiveness. Many open source tools are available at no cost or for an affordable one-time fee, which makes them ideal for organizations on a budget that cannot afford expensive proprietary software.
- High Level Of Customization: Open source data profiling tools typically allow users to customize the tool to meet their exact needs. For instance, they may be able to adjust settings, add custom functions and features as needed, and extend functionality without having to purchase additional licenses or software modules.
- Flexibility: Another attractive benefit of open source software is its flexibility; it can often be used across different operating systems (e.g., Windows and Linux) without having to purchase separate versions for each OS or perform complex installation processes. Furthermore, if users want more control over how the data is collected and analyzed, they can make changes directly to the code or use external libraries and APIs for added flexibility.
- Transparency And Collaboration: Open source software fosters collaboration between developers because anyone can view and modify the codebase when needed; this increases transparency while helping improve product quality through user contributions. Additionally, open source projects often have active communities where users can ask questions about specific issues or get advice from experienced developers, a feature that isn’t usually offered with proprietary options.
- Security: Many security experts consider open source solutions more secure than closed-source ones, since anyone can review the codebase for potential vulnerabilities. This makes them less susceptible to hidden threats such as backdoors, which are not easily identified by traditional security techniques, and the added scrutiny improves overall system reliability while keeping data protected.
Who Uses Open Source Data Profiling Tools?
- Data Analysts: Individuals who collect, analyze, and interpret data from a variety of sources to better understand the information.
- Business Intelligence Professionals: Individuals with knowledge of data science and business analysis who use open source data profiling tools to optimize company performance.
- System Administrators: Those who maintain networks and systems within an organization, using open source data profiling tools to ensure smooth operations.
- Data Scientists: Professionals who work on the development of new algorithms or techniques for finding patterns in large datasets and dealing with difficult analytics problems such as machine learning.
- Developers: Software engineers who build applications based on open source data profiling tools, adding features and functionality that fulfill customer requirements.
- Researchers: Scientists who use open source data profiling tools for research, analyzing large amounts of available information in order to draw conclusions related to their field of study.
- Students: Learners studying technology or mathematics who use open source data profiling tools to understand real-world problems while working on assignments or projects.
- Data Architects: Professionals who design and develop database architectures that store, organize, and retrieve data efficiently. They use open source data profiling tools to better understand the available data and decide how best to structure it.
- Database Administrators: Those responsible for data maintenance, backups, and security within an organization. They use open source data profiling tools to audit existing databases and improve their performance.
- Data Visualizers: Professionals who specialize in turning data into charts and graphics that are easier to comprehend, using open source data profiling tools to explore the underlying data.
- Business Executives: Individuals in higher-level positions within an organization who use open source data profiling tools to better understand data related to their areas of expertise and make informed decisions that optimize company performance.
How Much Do Open Source Data Profiling Tools Cost?
Open source data profiling tools are generally available for free, making them among the most cost-effective solutions on the market. There are no subscription costs or license fees, and you won't be hit with the unexpected costs that many commercial data profiling products can carry down the line. Open source tools also give users access to the source code, which they can modify to suit their needs. This affords considerable flexibility in customizing a profiling solution around an organization's specific goals, objectives, and processes. Additionally, these tools are regularly updated with new features and capabilities, so organizations can stay current with developments in data analysis. All of this makes open source data profiling tools a strong option for businesses looking to save money while still using high-quality technology.
What Software Can Integrate With Open Source Data Profiling Tools?
Open source data profiling tools, such as Talend or Pentaho, can integrate with many different types of software, including databases and spreadsheets. They can also be integrated with big data platforms such as Hadoop or Apache Spark, as well as with programs focused on analytics and business intelligence (BI). With integrations available across these different types of software, businesses have more flexibility in how they manage their data.
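As one illustration of this kind of integration, the sketch below profiles a dataset stored in a big data platform using PySpark. The file path, column layout, and the assumption that another pipeline produced the data are all hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

# Read a dataset produced by another system (hypothetical HDFS path).
df = spark.read.parquet("hdfs:///data/orders/")

# Summary statistics (count, mean, stddev, min, max) for numeric columns.
df.describe().show()

# Null counts per column: a common data-quality check before handing
# the results off to a BI or reporting tool.
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).show()

spark.stop()
```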
Trends Related to Open Source Data Profiling Tools
- Open source data profiling tools are becoming increasingly popular due to their low cost and wide availability.
- They provide organizations with the ability to develop custom solutions tailored to their specific requirements, allowing for faster development times and more flexibility.
- These tools are typically easy to use, requiring minimal technical knowledge and allowing users to quickly produce reports on the data they have collected.
- Open source data profiling tools provide organizations with a cost-effective way to collect, analyze, and report on large datasets.
- They are also beneficial for organizations that need to share data across multiple divisions, as they can easily be integrated into existing software applications.
- Additionally, open source data profiling tools can support predictive analytics, enabling businesses to make informed decisions based on past performance.
- Lastly, open source data profiling tools are becoming more powerful and sophisticated, providing users with more advanced features that can help them better understand their data.
How To Get Started With Open Source Data Profiling Tools
Using open source data profiling tools is a great way to gain valuable insights into your organization’s data. To get started with using an open source data profiling tool, the first step is to download and install it onto your computer. Generally, this can be done by visiting the official site for the specific open source tool you are interested in and following the instructions provided.
After downloading and installing the software, it is important to familiarize yourself with how to use it. Usually, there will be helpful guides available from either the official website or from online resources. Additionally, most open source data profiling tools will have an active user community that provides tips and tricks for getting started as well as support options if needed. It may even be helpful to take some tutorials on how to use a particular tool before diving right in.
Once you are comfortable with its features and functionality, you can begin collecting data for analysis through queries. Depending on the results you want, different types of queries (e.g., SELECT statements) should be written accordingly. After running these queries against the specified databases, the results can be viewed in the tool’s user interface or exported to formats such as CSV for use in other applications and projects, such as reporting or training machine learning models. A minimal sketch of this workflow is shown below.
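The example uses Python's built-in sqlite3 module and pandas as stand-ins for whatever database and profiling environment you actually work with; the database file, table, and column names are made up for illustration:

```python
import sqlite3
import pandas as pd

# Connect to a database (a local SQLite file stands in for a real source).
conn = sqlite3.connect("example.db")  # hypothetical database file

# Run a SELECT query to pull the slice of data you want to profile.
df = pd.read_sql_query(
    "SELECT customer_id, country, order_total FROM orders", conn
)

# View a quick summary in your own environment...
print(df.describe(include="all"))

# ...then export the results to CSV for reporting or model-training projects.
df.to_csv("orders_profile.csv", index=False)
conn.close()
```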
Overall, getting started with an open source data profiling tool requires some initial research, but once you have familiarized yourself with its environment and capabilities, it is fairly straightforward from there on. With the right setup, you can start making sense of your data and uncover valuable insights that better inform decisions.