Bleach

Bleach is a distributed stream data cleaning system built on Apache Storm. Unlike other data cleaning systems, which mainly focus on batch data cleaning, Bleach cleans data directly on data streams without waiting for all the data to be acquired. It aims to achieve efficient and accurate qualitative data cleaning under real-time constraints. It currently supports functional dependency (FD) rules and conditional functional dependency (CFD) rules. More details can be found in our paper.
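To make the notion of an FD rule concrete, here is a minimal single-process sketch of detecting FD violations on a stream of tuples. It is not Bleach's implementation or API: the class name FdViolationSketch, the method observe, and the example rule "zip -> city" are all hypothetical, and the sketch only flags violations, it does not repair them.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy check for one hypothetical FD rule "zip -> city".
// This is NOT how Bleach is implemented; it only shows what a violation is.
public class FdViolationSketch {

    // For each left-hand-side value, the first right-hand-side value seen so far.
    private final Map<String, String> firstSeen = new HashMap<>();

    // Returns true if the tuple agrees with earlier tuples that share the same
    // left-hand-side value, and false if it violates the FD.
    public boolean observe(String lhs, String rhs) {
        String previous = firstSeen.putIfAbsent(lhs, rhs);
        return previous == null || previous.equals(rhs);
    }

    public static void main(String[] args) {
        FdViolationSketch check = new FdViolationSketch();
        // Made-up input tuples of the form (zip, city).
        String[][] stream = {
            {"06410", "Biot"},
            {"75001", "Paris"},
            {"06410", "Nice"},
        };
        for (String[] tuple : stream) {
            boolean consistent = check.observe(tuple[0], tuple[1]);
            System.out.println(tuple[0] + " -> " + tuple[1]
                    + (consistent ? " consistent" : " violation"));
        }
    }
}
```

Each tuple is checked as it arrives, so no tuple has to wait for the rest of the data. A CFD rule would additionally restrict the check to tuples matching a condition, which this sketch does not model.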

How to run it

First, you need a cluster of machines on which Storm, Kafka, and ZooKeeper are installed. Next, download the Bleach code and compile it with Maven:

$ git clone https://github.com/ychtian/Bleach.git
$ cd Bleach && mvn assembly:assembly

Then, submit the jar to the Storm cluster to start Bleach:

$ storm jar target/bleach-1.0.0-jar-with-dependencies.jar storm.dataclean.TestTopology.TestRepair -config job.conf

All configuration settings are in the file job.conf.
