Executing SAMOA with Apache S4

Albert Bifet edited this page Oct 17, 2013 · 2 revisions

In this tutorial we will describe how to execute SAMOA on top of Apache S4.

Prerequisites

The following dependencies are needed to run SAMOA on Apache S4.

Gradle

Gradle is a build automation tool used to build Apache S4. The full installation guide is available on the Gradle website; the following is a simplified version.

  1. Download the Gradle binaries from the Gradle downloads page, or from the console type: wget http://services.gradle.org/distributions/gradle-1.6-bin.zip
  2. Unzip the file: unzip gradle-1.6-bin.zip
  3. Set the Gradle environment variable: export GRADLE_HOME=/foo/bar/gradle-1.6
  4. Add Gradle to the system path: export PATH=$PATH:$GRADLE_HOME/bin
  5. Install Gradle by running: gradle
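
The steps above can be collected into a small shell sketch. The install prefix `$HOME/opt` and the `install_gradle` function name are assumptions made for illustration; the download URL and version are the ones used in the steps. The download is wrapped in a function, so sourcing the snippet only sets the environment variables.

```shell
# Assumed install prefix (hypothetical); override by setting INSTALL_DIR first.
INSTALL_DIR="${INSTALL_DIR:-$HOME/opt}"
GRADLE_VERSION="1.6"

# Steps 1 and 2: download and unpack the Gradle binaries into INSTALL_DIR.
install_gradle() {
  mkdir -p "$INSTALL_DIR" || return 1
  (
    cd "$INSTALL_DIR" || exit 1
    wget "http://services.gradle.org/distributions/gradle-${GRADLE_VERSION}-bin.zip" &&
      unzip -q "gradle-${GRADLE_VERSION}-bin.zip"
  )
}

# Steps 3 and 4: point GRADLE_HOME at the unpacked directory and extend PATH.
export GRADLE_HOME="$INSTALL_DIR/gradle-${GRADLE_VERSION}"
export PATH="$PATH:$GRADLE_HOME/bin"
```

Call `install_gradle` once, then run `gradle` to confirm that the binary is found on the PATH.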

Now you are all set to install Apache S4.

Apache S4

S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. The installation process is as follows:

  1. Download the latest Apache S4 release (0.6.0) from the Apache S4 website. Alternatively, fetch it from the command line: wget http://www.apache.org/dist/incubator/s4/s4-0.6.0-incubating/apache-s4-0.6.0-incubating-src.zip or clone the git repository: git clone https://git-wip-us.apache.org/repos/asf/incubator-s4.git
  2. Unzip the file: unzip apache-s4-0.6.0-incubating-src.zip (or change into the cloned directory).
  3. Set the Apache S4 environment variable: export S4_HOME=/foo/bar/apache-s4-0.6.0-incubating-src
  4. Add S4_HOME to the system PATH: export PATH=$PATH:$S4_HOME
  5. Once the previous steps are done, you can proceed to build and install Apache S4.
  6. To list the available build tasks, type: gradle tasks
  7. There are some dependency issues, so run the wrapper task first: gradle wrapper
  8. Build Apache S4 by running gradle in the S4_HOME directory.
  9. Install the S4 tools: gradle s4-tools::installApp
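
Steps 7 to 9 can be sketched as a single shell function. The `run` helper, the `DRY_RUN` variable, and the `build_s4` name are hypothetical and exist only for this example; the Gradle invocations are the ones listed in the steps. The sketch assumes `S4_HOME` is already set.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Generate the Gradle wrapper, build S4, then install the S4 tools.
# Stops at the first step that fails.
build_s4() {
  run cd "$S4_HOME" || return 1
  run gradle wrapper || return 1
  run gradle || return 1
  run gradle s4-tools::installApp
}
```

Running `DRY_RUN=1 build_s4` first shows the commands that would be executed without touching the source tree.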

Done. Now you can configure and run your Apache S4 cluster.


SAMOA

  1. Download the SAMOA package from http://samoa-project.net/ or clone it with git: git clone https://github.com/yahoo/samoa.git (if you clone over SSH, remember to register your public key first).
  2. Unzip the SAMOA distribution package: unzip SAMOA-0.0.1-SNAPSHOT-dist.zip

Inside the SAMOA directory you will find the following files:

samoa-api-0.1.jar  
samoa-s4.properties  
samoa-storm.properties
samoa
SAMOA-S4-0.0.1.jar   
SAMOA-Storm-0.0.1.jar
  • samoa: the execution script for the SAMOA framework.
  • samoa-api-<version>.jar: the library with the developer API for implementing new algorithms and topologies.
  • SAMOA-S4-<version>.jar: the Apache S4 platform-specific adapter, which enables SAMOA to run on top of Apache S4.
    • samoa-s4.properties: the configuration file for S4-specific properties.
  • SAMOA-Storm-<version>.jar: the Storm platform-specific adapter, which enables SAMOA to run on top of Storm.
    • samoa-storm.properties: the configuration file for Storm-specific properties.

When using a cloned repository, build the packages with the s4 profile: mvn package -Ps4. The SAMOA-S4-0.0.1.jar file will be generated in the target directory.
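
As a rough sketch, the clone-and-build route might look like the following. The function names `build_samoa_s4` and `find_s4_jar` are hypothetical. Because the text does not say where the target directory sits relative to the repository root, the second helper searches for the jar rather than assuming a fixed path.

```shell
# Clone the repository and build it with the s4 Maven profile.
build_samoa_s4() {
  git clone https://github.com/yahoo/samoa.git &&
    cd samoa &&
    mvn package -Ps4
}

# List S4 adapter jars found in any target directory below $1 (default: .).
find_s4_jar() {
  find "${1:-.}" -type f -path '*/target/*' -name 'SAMOA-S4-*.jar'
}
```

After a successful build, `find_s4_jar .` run from the repository root prints the path of each generated adapter jar.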


SAMOA-S4 Configuration

This section goes through the samoa-s4.properties file and how to configure it. For SAMOA to run correctly in a distributed environment, some variables need to be defined. Since Apache S4 uses ZooKeeper for cluster management, we need to define where it is running.

# Zookeeper Server
zookeeper.server=localhost
zookeeper.port=2181

Apache S4 also distributes the application via HTTP, so the address and port of the server hosting the S4 application must be provided.

# Simple HTTP Server providing the packaged S4 jar
http.server.ip=localhost
http.server.port=8000

Apache S4 uses the concept of logical clusters to define a group of machines, which are identified by an ID and start serving on a specific port.

# Name of the S4 cluster
cluster.name=cluster
cluster.port=12000

SAMOA can be deployed on a single machine using only one resource, or in a cluster environment. The following property selects between deploying as a local application and deploying on a cluster.

# Deployment strategy
samoa.deploy.mode=local
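
Before deploying, it can help to confirm that the endpoints named in samoa-s4.properties are reachable. The sketch below reads values with a small helper and probes them with `nc -z`. The names `get_prop` and `check_endpoints` are hypothetical, the parser only handles plain key=value lines, and the sketch assumes a netcat build that supports the `-z` option.

```shell
# Print the value of key $2 from properties file $1 (last occurrence wins).
# Only plain "key=value" lines are handled; comment lines never match
# because their first field is not equal to the key.
get_prop() {
  awk -F= -v key="$2" \
    '$1 == key { sub(/^[^=]*=/, ""); value = $0 } END { print value }' "$1"
}

# Probe the ZooKeeper and HTTP endpoints configured in properties file $1.
check_endpoints() {
  nc -z "$(get_prop "$1" zookeeper.server)" "$(get_prop "$1" zookeeper.port)" || {
    echo "ZooKeeper is not reachable" >&2
    return 1
  }
  nc -z "$(get_prop "$1" http.server.ip)" "$(get_prop "$1" http.server.port)" || {
    echo "HTTP server is not reachable" >&2
    return 1
  }
}
```

For example, `check_endpoints samoa-s4.properties` returns a non-zero status and prints a message if either service is not listening.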

SAMOA-S4 Deployment

To deploy SAMOA in a distributed environment you MUST configure the samoa-s4.properties file correctly. If you are running locally, modifying the properties file is optional.

The deployment is done by running the SAMOA execution script samoa with some additional parameters. The execution syntax is as follows: ./samoa <platform> <jar-location> <task & options>

Example:

./samoa S4 ../../../target/SAMOA-S4-0.0.1.jar "ClusteringTask -q 1 -P 5 -L 100 -G 5 -i 500000 -s (RandomRBFGeneratorEvents -K 5 -N 0.0 -V 12000 -a 2)"

The <platform> can be s4 or storm.

The <jar-location> must be the absolute path to the platform specific jar file.

The <task & options> should be the name of a known task and the options belonging to that task.
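
Because <jar-location> must be an absolute path, a small wrapper can resolve a relative path before calling the script. This is only a sketch: `abs_path` and `run_samoa` are hypothetical helper names, and it assumes the current directory contains the samoa script.

```shell
# Print the absolute, symlink-resolved form of path $1.
# The parent directory must exist; the file itself need not.
abs_path() {
  (
    cd "$(dirname "$1")" || exit 1
    printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")"
  )
}

# $1 = platform, $2 = jar path (relative or absolute), $3 = task and options.
run_samoa() {
  jar="$(abs_path "$2")" || return 1
  ./samoa "$1" "$jar" "$3"
}

# Usage, mirroring the example above:
# run_samoa S4 ../../../target/SAMOA-S4-0.0.1.jar \
#   "ClusteringTask -q 1 -P 5 -L 100 -G 5 -i 500000 -s (RandomRBFGeneratorEvents -K 5 -N 0.0 -V 12000 -a 2)"
```

Quoting the task string as a single argument keeps the parentheses and options from being interpreted by the shell.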