This is an offshoot of the open source data quality (osDQ) project: https://sourceforge.net/projects/dataquality/

This subproject provides an Apache Spark based data pipeline in which a JSON metadata file drives the data processing, data pipeline, data quality, data preparation and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode.

JSON examples are available at https://github.com/arrahtech/osdq-spark

How to run

Unzip the zip file, then run one of the following commands from the extracted directory.

Windows:
java -cp ".\lib\*;osdq-spark-0.0.1.jar" org.arrah.framework.spark.run.TransformRunner -c .\example\samplerun.json

Mac / UNIX:
java -cp "./lib/*:./osdq-spark-0.0.1.jar" org.arrah.framework.spark.run.TransformRunner -c ./example/samplerun.json
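
The two commands differ only in the classpath separator (`;` on Windows, `:` on Mac and UNIX) and in the slash direction. A minimal sketch of assembling the same command in Java is shown below; the class name `LaunchCommand` and its `build` helper are hypothetical and not part of osdq-spark. The jar name, main class and `-c` option are taken from the commands above.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of osdq-spark: assembles the launch command
// so that the platform-specific classpath separator is chosen automatically.
final class LaunchCommand {

    // pathSeparator is ";" on Windows and ":" on Mac/UNIX.
    static List<String> build(String pathSeparator, String configPath) {
        // The "lib/*" wildcard is expanded by the java launcher itself.
        String classpath = "lib/*" + pathSeparator + "osdq-spark-0.0.1.jar";
        return Arrays.asList(
                "java", "-cp", classpath,
                "org.arrah.framework.spark.run.TransformRunner",
                "-c", configPath);
    }

    public static void main(String[] args) {
        // File.pathSeparator resolves to the separator of the current platform.
        List<String> command = build(File.pathSeparator, "example/samplerun.json");
        System.out.println(String.join(" ", command));
        // To actually start the runner from the unzipped directory, one could use:
        // new ProcessBuilder(command).inheritIO().start();
    }
}
```

Passing the arguments as a list to `ProcessBuilder` bypasses the shell, so the `*` in the classpath reaches the java launcher unchanged.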

On Windows, you need a Hadoop distribution unzipped on a local drive and HADOOP_HOME set to that directory. Also copy winutils.exe into HADOOP_HOME\bin.
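
The Windows prerequisites can be verified before launching. The following is a minimal sketch, assuming only the layout described above (HADOOP_HOME pointing at a directory that contains bin\winutils.exe); the class name `HadoopHomeCheck` is hypothetical and not part of osdq-spark.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical preflight check, not part of osdq-spark.
final class HadoopHomeCheck {

    // Returns a list of problems; an empty list means the layout looks usable.
    static List<String> check(String hadoopHome) {
        List<String> problems = new ArrayList<>();
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            problems.add("HADOOP_HOME is not set");
            return problems;
        }
        Path home = Paths.get(hadoopHome);
        if (!Files.isDirectory(home)) {
            problems.add("HADOOP_HOME is not a directory: " + home);
            return problems;
        }
        // winutils.exe is expected directly under HADOOP_HOME\bin.
        Path winutils = home.resolve("bin").resolve("winutils.exe");
        if (!Files.isRegularFile(winutils)) {
            problems.add("winutils.exe not found at " + winutils);
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = check(System.getenv("HADOOP_HOME"));
        if (problems.isEmpty()) {
            System.out.println("HADOOP_HOME looks usable");
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```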

Features

  • Build data pipelines using Join, Filter, Aggregate and Case statements
  • Data quality operations - replace, drop, join
  • Data profiling, including column-based profiling
  • Fuzzy join - cosine distance and others
  • Classification and sampling - random forest, multi-class neural network
  • Data normalization - z-score, standard deviation, ratio score
  • Sampling - random, stratified, key-based
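
Two of the features listed above, z-score normalization and cosine distance for fuzzy matching, can be illustrated in plain Java. This is a sketch of the underlying formulas only, not the osdq-spark implementation: the class `FeatureSketch` is hypothetical, the z-score uses the population standard deviation, and the cosine distance is computed over character-bigram counts. Both choices are assumptions, since the project page does not say how the tool defines them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of two listed features; not osdq-spark code.
final class FeatureSketch {

    // Z-score: (x - mean) / std, with the population standard deviation (assumption).
    static double[] zScore(double[] values) {
        int n = values.length;
        if (n == 0) {
            return new double[0];
        }
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .sum() / n;
        double std = Math.sqrt(variance);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            // A constant column has std 0; map it to 0 instead of dividing by zero.
            out[i] = std == 0.0 ? 0.0 : (values[i] - mean) / std;
        }
        return out;
    }

    // Counts overlapping two-character substrings, case-insensitively (assumption).
    static Map<String, Integer> bigrams(String s) {
        String t = s.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine distance = 1 - cosine similarity of the two bigram count vectors.
    static double cosineDistance(String a, String b) {
        Map<String, Integer> va = bigrams(a);
        Map<String, Integer> vb = bigrams(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int c : vb.values()) {
            normB += c * c;
        }
        // Strings shorter than two characters have no bigrams; treat as maximally distant.
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zScore(new double[] {1.0, 2.0, 3.0})));
        System.out.println(cosineDistance("night", "nacht")); // prints 0.75
    }
}
```

In a fuzzy join, two keys would be treated as a match when their distance falls below a chosen threshold; the threshold itself is a tuning decision that depends on the data.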

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Intended Audience

Architects, Information Technology, Other Audience

User Interface

Console/Terminal

Programming Language

Java, Scala

Related Categories

Java Data Warehousing Software, Java Business Intelligence Software, Java ETL Tool, Java Data Pipeline Tool, Java Data Quality Tool, Scala Data Warehousing Software, Scala Business Intelligence Software, Scala ETL Tool, Scala Data Pipeline Tool, Scala Data Quality Tool

Registered

2016-06-17