Introduction

Nexus is a lightweight, Kubernetes-native, high-throughput client-server proxy for machine learning, optimization and AI algorithms. Nexus allows data science teams to author and deploy their products to live environments with a few YAML manifests, and provides a simple flexible HTTP API for interacting with machine learning applications.

Core Nexus features include:

Elastic virtual queues for incoming requests
- Incoming request rate normalization via throttling to limits acceptable by Kubernetes API Server
- Automatic scheduling of a delayed request ahead of a queue
Allocation to autoscaling machine groups using Kubernetes affinity/toleration settings
Plug-n-play algorithm execution and deployment via Kubernetes Custom Resources
Multi-cluster support without extra configuration with the help of Nexus Configuration Controller
Input buffering and push down to algorithm container using secure URIs
Processing rate from receiving to result on a scale of thousands of completions per second, depending algorithm execution time.
HTTP API for launching runs and receiving results via presigned URLs.

Design

Nexus can be deployed in Single Cluster or Multi Cluster modes. Single Cluster mode consists of three components: at least one scheduler, a maintainer and at least one receiver, all deployed to a single Kubernetes cluster. Multi Cluster mode consists of:

Controller Cluster, which has at least one scheduler and a maintainer
One or more Shard Clusters, with at least one receiver. scheduler can also be deployed to these clusters in case an algorithm uses one of Nexus SDK's to create execution trees

Maintainer

Maintainer is responsible for handling requests that are sitting in a queue more than expected, as well as requests that could not be converted to a Kubernetes Job for any reason, and for garbage collecting failed submissions. Nexus stores request metadata and algorithm lifecycle-related information in a so-called checkpoint store. Maintainer scans that table on regular intervals, selects requests that are experiencing problems, i.e. they are stuck in buffered state and tries to resolve the situation by either terminating the request, or sending it ahead of the main queue. In addition, Maintainer is responsible for watching for events emitted by algorithm pods and taking action in certain situations (OOMKill, ImagePullBackoff etc.)

Scheduler

Scheduler is what makes it possible to run algorithms through Crystal. Each scheduler has a public API that can be used to submit runs and retrieve results. Moreover, each scheduler holds a separate virtual queue that it uses to process incoming requests. Nexus relies on load balancer using round-robin algorithm when distributing requests between scheduler pods, so a horizontal autoscaler can be used to the maximum efficiency.

Usage

-- TBD --

Versioning

Nexus's API is versioned. Requests against a specific version with have URI suffix like this: /algorithm/v1.0/run. If a version is not specified, latest will be targeted - which includes experimental and unstable features. For production use cases, always use one of the current production versions:

/algorithm/v1.2/run

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.container		.container
.github		.github
.gitattributes		.gitattributes
.gitignore		.gitignore
.testcoverage.yml		.testcoverage.yml
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Design

Maintainer

Scheduler

Usage

Versioning

About

Releases 1

Packages

Languages

License

SneaksAndData/nexus

Folders and files

Latest commit

History

Repository files navigation

Introduction

Design

Maintainer

Scheduler

Usage

Versioning

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages