This repository has been archived by the owner on Dec 27, 2022. It is now read-only.

EXTEND.md

One Ring is designed with extensibility in mind.

Extend Operations

To extend One Ring's built-in set of Operations, you have to implement an Operation class according to the conventions described in this document.

First, create a Maven module. Place it at the same level as the root pom.xml, and include your module in the root project's <modules> section. You are free to choose any group and artifact IDs you like.

To make One Ring aware of your module, add a reference to its artifact to the <dependencies> of CLI's pom.xml. To make your module aware of One Ring, add a reference to the ash.nazg:Commons artifact to your module's <dependencies> (and to its test-jar as well). For an example, look into Math's pom.xml.

Now you can proceed to create an Operation package and describe it.

By convention, the Operation package must be named your.package.name.operations and must contain a package-info.java annotated with @ash.nazg.config.tdl.RegisteredPackage. One Ring requires that annotation to recognize the contents of your.package.name. There's an example.

Place all your Operations inside that Package.

If your module contains several Operations that share the same Parameter Definitions, the names of these parameters must be placed in a public final class your.package.name.config.ConfigurationParameters as public static final String constants, each annotated with @Description. An Operation can define its own parameters inside its own class, following the same convention.

Parameter Definitions and their default value constants have semantic names, depending on their purpose:

  • DS_INPUT_ for input DataStream references,
  • DS_OUTPUT_ for output DataStream references,
  • OP_ for the operation's Parameter,
  • DEF_ for any default value, and the rest of the name should match a corresponding OP_ or DS_,
  • GEN_ for any column generated by this Operation.

Names of references to columns of input DataStreams must end with the _COLUMN suffix, and names of references to column lists with _COLUMNS.
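The naming conventions above can be sketched as follows. This is only an illustration under stated assumptions: the @Description annotation declared here is a local stand-in so that the file compiles on its own, every constant name and value is hypothetical, and the NamingConvention helper is not part of One Ring (it does not reproduce the checks made by One Ring Guardian).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Local stand-in for One Ring's @Description, declared only so that this
// sketch compiles on its own. A real module uses the annotation from One Ring.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Description {
    String value();
}

// In a real module this is the public final class
// your.package.name.config.ConfigurationParameters; it is package-private
// here only to keep the sketch in a single source file.
// All names and values below are hypothetical.
final class ConfigurationParameters {
    @Description("Input DataStream with dated records")
    public static final String DS_INPUT_SIGNALS = "signals";
    @Description("Output DataStream with records that passed the filter")
    public static final String DS_OUTPUT_FILTERED = "filtered";
    @Description("Column of the input DataStream that holds the date")
    public static final String OP_DATE_COLUMN = "date.column";
    @Description("Columns of the input DataStream to copy to the output")
    public static final String OP_VALUE_COLUMNS = "value.columns";
    @Description("Start of the date range")
    public static final String OP_START_DATE = "start.date";
    @Description("By default, the range starts at the Unix epoch")
    public static final String DEF_START_DATE = "1970-01-01";
    @Description("Generated column with the day of week")
    public static final String GEN_DAY_OF_WEEK = "_day_of_week";

    private ConfigurationParameters() {
    }
}

// Hypothetical helper that checks a constants holder against the prefixes
// listed above.
final class NamingConvention {
    private static final String[] PREFIXES = {"DS_INPUT_", "DS_OUTPUT_", "OP_", "DEF_", "GEN_"};

    static boolean hasKnownPrefix(String constantName) {
        for (String prefix : PREFIXES) {
            if (constantName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Returns the names of static String constants that lack a known prefix
    // or a @Description annotation.
    static List<String> violations(Class<?> holder) {
        List<String> bad = new ArrayList<>();
        for (Field field : holder.getDeclaredFields()) {
            if (field.isSynthetic() || field.getType() != String.class
                    || !Modifier.isStatic(field.getModifiers())) {
                continue;
            }
            if (!hasKnownPrefix(field.getName()) || !field.isAnnotationPresent(Description.class)) {
                bad.add(field.getName());
            }
        }
        return bad;
    }
}
```

Calling NamingConvention.violations(ConfigurationParameters.class) on the class above returns an empty list; a constant named START_DATE, or one without @Description, would be reported by name.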

An Operation in essence is an implementation of

public interface OpInfo extends Serializable {
    String verb();

    TaskDescriptionLanguage.Operation description();

    Map<String, JavaRDDLike> getResult(Map<String, JavaRDDLike> input) throws Exception;
}

that provides the Operation's verb (its short name), its metadata, and its entry point.

It is up to you, the author, to provide all the required methods with the following declarative pattern:

  • annotate verb with a @Description,
  • use description to return Operation's entire configuration space in a Task Description Language 2 object,
  • use getResult as the entry point of your business code. It is fed all DataStreams accumulated by the current Process at the moment your Operation is invoked, and should return any DataStreams your Operation emits.
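The getResult contract described above (receive every DataStream accumulated so far, return only what the Operation emits) can be sketched without Spark or One Ring on the classpath. The sketch below rests on loud assumptions: SketchOpInfo is a simplified stand-in for OpInfo, DataStreams are plain List<String> instead of JavaRDDLike, description() and the @Description annotation on verb are omitted, and AliasSketchOperation with its verb is hypothetical.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for OpInfo: DataStreams are List<String> instead of
// Spark's JavaRDDLike, and description() is left out because it needs
// One Ring's TaskDescriptionLanguage classes.
interface SketchOpInfo extends Serializable {
    String verb();

    Map<String, List<String>> getResult(Map<String, List<String>> input) throws Exception;
}

// Hypothetical Operation that re-emits one input DataStream under a new name.
class AliasSketchOperation implements SketchOpInfo {
    private final String inputName;
    private final String outputName;

    AliasSketchOperation(String inputName, String outputName) {
        this.inputName = inputName;
        this.outputName = outputName;
    }

    // In a real Operation this method also carries a @Description annotation.
    @Override
    public String verb() {
        return "aliasSketch";
    }

    @Override
    public Map<String, List<String>> getResult(Map<String, List<String>> input) {
        // The input map holds all DataStreams of the Process so far;
        // pick only the one this Operation was configured to read.
        List<String> source = input.get(inputName);
        if (source == null) {
            throw new IllegalArgumentException("No input DataStream named '" + inputName + "'");
        }
        // Return only the DataStreams this Operation emits, not the whole input.
        Map<String, List<String>> output = new HashMap<>();
        output.put(outputName, source);
        return output;
    }
}
```

Given an input map with the DataStreams "signals" and "other", new AliasSketchOperation("signals", "copy").getResult(input) returns a map with the single key "copy" that points at the same list object as "signals".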

If any required metadata is missing, One Ring Guardian fails the build early, so an incomplete class won't break your extended copy of One Ring CLI at Process execution time.

The Operation abstract class implements that interface, supplies conventional utilities to your implementation, and serves as the entry point to the configuration interface for the CLI. You absolutely should override the configure method and call super from it. By convention, if any parameter has an invalid value, you're obliged to throw an InvalidConfigValueException with a descriptive message about the configuration mistake.
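The override-and-call-super pattern for configure can be sketched as below. Everything here is an assumption made for illustration: SketchOperation is a stand-in for One Ring's Operation base class, the configure(Properties) signature is hypothetical and may differ from the real one, InvalidConfigValueException is a local unchecked stand-in, and the threshold parameter is invented.

```java
import java.util.Properties;

// Local stand-in for One Ring's InvalidConfigValueException (assumed unchecked here).
class InvalidConfigValueException extends RuntimeException {
    InvalidConfigValueException(String message) {
        super(message);
    }
}

// Stand-in for the Operation base class. The configure signature is
// hypothetical; check the real Operation class before copying it.
abstract class SketchOperation {
    protected String name;

    public void configure(Properties config) {
        // The base class reads the settings common to all Operations.
        name = config.getProperty("name", "unnamed");
    }
}

// Hypothetical Operation with a single numeric parameter.
class ThresholdSketchOperation extends SketchOperation {
    static final String OP_THRESHOLD = "threshold";
    static final String DEF_THRESHOLD = "0.5";

    private double threshold;

    @Override
    public void configure(Properties config) {
        // Call super first so that the common settings are in place.
        super.configure(config);

        String raw = config.getProperty(OP_THRESHOLD, DEF_THRESHOLD);
        try {
            threshold = Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            throw new InvalidConfigValueException("Operation '" + name + "': parameter '"
                    + OP_THRESHOLD + "' must be a number, but was '" + raw + "'");
        }
        // Written as a negated range check so that NaN is rejected too.
        if (!(threshold >= 0.0 && threshold <= 1.0)) {
            throw new InvalidConfigValueException("Operation '" + name + "': parameter '"
                    + OP_THRESHOLD + "' must be within [0, 1], but was '" + raw + "'");
        }
    }

    double threshold() {
        return threshold;
    }
}
```

With threshold set to 0.25 the Operation configures normally; with threshold set to 2 or to a non-numeric string, configure throws with a message that names the Operation, the parameter, and the offending value.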

You must also create a test case for your Operation. See the existing tests for reference.

There are plenty of examples to learn from; just look into the source code for Operation's descendants. For your convenience, here is a list of the most notable ones:

  • FilterByDateOperation has lots of parameters of different types that have defaults,
  • SplitByDateOperation, its sister Operation, generates a lot of output DataStreams with wildcard names,
  • DummyOperation properly does nothing; it just creates aliases for its input DataStreams,
  • SubtractOperation can consume and emit both RDDs and PairRDDs as DataStreams,
  • WeightedSumOperation generates a lot of columns that either come from input DataStreams or are created anew,
  • and the package Proximity contains Operations that deal with Point and Polygon RDDs in their DataStreams.

Extend Dist Storage Adapters

To extend One Ring Dist with a custom Storage Adapter, you have to implement a pair of InputAdapter and OutputAdapter interfaces. They're fairly straightforward mini-Operations; see the existing Adapter sources for reference.

A single restriction exists: you can't set your Adapter as the fallback one (by handling 'any' protocol), as that role is reserved for the One Ring Hadoop Adapter.

Hopefully this information is enough to extend One Ring.