djgarcia/NoiseFramework

NoiseFramework

This framework implements two Big Data preprocessing approaches for removing noisy examples: a homogeneous ensemble filter (HME-BD) and a heterogeneous ensemble filter (HTE-BD), with special emphasis on their scalability and performance. A simple filtering approach based on similarities between instances (ENN-BD) is also implemented.

This software has been tested on four large real-world datasets.

Brief benchmark results:

  • HME-BD proved to be the best noise filter of the three, achieving the highest accuracy.
  • HME-BD is also the most efficient in terms of computing time.
  • HTE-BD can outperform HME-BD on some datasets at low noise levels.

Example (HME-BD)

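import org.apache.spark.mllib.feature._

val nTrees = 100
val maxDepth = 10
val partitions = 4
val seed = 48151623 // example value (not part of the original snippet); any fixed seed works

// trainingData is assumed to be an existing RDD[LabeledPoint].
// Cache it beforehand (e.g. trainingData.cache()) to improve performance.

val hme_bd_model = new HME_BD(trainingData, // RDD[LabeledPoint]
                              nTrees, // size of the Random Forests
                              partitions, // number of partitions
                              maxDepth, // depth of the Random Forests
                              seed) // seed for the Random Forests

// Run the filter to obtain the noise-filtered data
val hme_bd = hme_bd_model.runFilter()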

Example (HTE-BD)

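import org.apache.spark.mllib.feature._

val nTrees = 100
val maxDepth = 10
val partitions = 4
val vote = 0 // 0 = majority, 1 = consensus
val k = 3 // example value (not part of the original snippet)
val seed = 48151623 // example value (not part of the original snippet); any fixed seed works

// trainingData is assumed to be an existing RDD[LabeledPoint].
// Cache it beforehand (e.g. trainingData.cache()) to improve performance.

val hte_bd_model = new HTE_BD(trainingData, // RDD[LabeledPoint]
                              nTrees, // size of the Random Forests
                              partitions, // number of partitions
                              vote, // voting strategy
                              k, // number of neighbors
                              maxDepth, // depth of the Random Forests
                              seed) // seed for the Random Forests

// Run the filter to obtain the noise-filtered data
val hte_bd = hte_bd_model.runFilter()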
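
The voting parameter above selects between majority and consensus voting. As a rough illustration of what those two terms usually mean in ensemble noise filtering, here is a minimal plain-Scala sketch. It is not taken from the framework's code: the `VotingSketch` object, the `isNoisy` helper and the sample verdicts are hypothetical, and only the 0/1 encoding follows the example above.

```scala
object VotingSketch {
  // flags holds one entry per classifier in the ensemble:
  // true means that classifier misclassified the instance (flags it as noisy).
  // voting follows the README's encoding: 0 = majority, 1 = consensus.
  def isNoisy(flags: Seq[Boolean], voting: Int): Boolean = {
    val flagged = flags.count(identity)
    voting match {
      case 0 => flagged * 2 > flags.size                 // more than half of the classifiers flag it
      case 1 => flags.nonEmpty && flagged == flags.size  // every classifier flags it
      case other =>
        throw new IllegalArgumentException(s"Unknown voting strategy: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical verdicts of three classifiers for three instances
    val verdicts = Seq(
      Seq(true, true, false),
      Seq(true, true, true),
      Seq(false, false, true)
    )
    println(verdicts.map(isNoisy(_, 0))) // instances removed under majority voting
    println(verdicts.map(isNoisy(_, 1))) // instances removed under consensus voting
  }
}
```

Consensus voting is the more conservative choice: it removes fewer instances, so it is less likely to discard clean data but more likely to leave noise behind.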

Example (ENN-BD)

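import org.apache.spark.mllib.feature._

val k = 3 // example value (not part of the original snippet)

// trainingData is assumed to be an existing RDD[LabeledPoint].
// Cache it beforehand (e.g. trainingData.cache()) to improve performance.

val enn_bd_model = new ENN_BD(trainingData, // RDD[LabeledPoint]
                              k) // number of neighbors

// Run the filter to obtain the noise-filtered data
val enn_bd = enn_bd_model.runFilter()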
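
To make the idea of similarity-based filtering more concrete, the sketch below shows the classic Edited Nearest Neighbor rule on a small in-memory collection, without Spark: an instance is kept only if its label agrees with the most common label among its k nearest neighbors. This assumes ENN-BD follows the standard ENN rule; the `EnnSketch` object, the `Instance` class and the toy data are hypothetical, and the distributed implementation in this framework may differ.

```scala
object EnnSketch {
  // Hypothetical stand-in for a labelled instance (label plus feature values)
  final case class Instance(label: Double, features: Vector[Double])

  // Squared Euclidean distance between two feature vectors of equal length
  def squaredDistance(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Keep an instance only if its label matches the most common label
  // among its k nearest neighbours (the instance itself is excluded).
  def ennFilter(data: Seq[Instance], k: Int): Seq[Instance] = {
    val indexed = data.zipWithIndex
    indexed.filter { case (inst, i) =>
      val neighbourLabels = indexed
        .filter { case (_, j) => j != i }
        .sortBy { case (other, _) => squaredDistance(inst.features, other.features) }
        .take(k)
        .map { case (other, _) => other.label }
      // With no neighbours there is nothing to disagree with, so keep the instance
      neighbourLabels.isEmpty ||
        neighbourLabels.groupBy(identity).maxBy { case (_, ls) => ls.size }._1 == inst.label
    }.map { case (inst, _) => inst }
  }

  // Toy data: two well-separated clusters plus one mislabelled point
  val sample: Seq[Instance] = Seq(
    Instance(0.0, Vector(0.0, 0.0)),
    Instance(0.0, Vector(0.1, 0.0)),
    Instance(0.0, Vector(0.0, 0.1)),
    Instance(1.0, Vector(0.1, 0.1)), // label 1 inside the class-0 cluster
    Instance(1.0, Vector(5.0, 5.0)),
    Instance(1.0, Vector(5.1, 5.0)),
    Instance(1.0, Vector(5.0, 5.1)),
    Instance(1.0, Vector(5.1, 5.1))
  )

  def main(args: Array[String]): Unit = {
    val filtered = ennFilter(sample, k = 3)
    println(s"kept ${filtered.size} of ${sample.size} instances")
  }
}
```

This brute-force version compares every pair of instances, so it only suits small data. An odd k avoids ties when there are two classes.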
