
Discussion: Architecture for Jan to support multiple Inference Engines #771

Closed
Tracked by #751
dan-menlo opened this issue Nov 29, 2023 · 6 comments

dan-menlo (Contributor) commented Nov 29, 2023

Objective

  • Jan's architecture will default to Nitro, but should be flexible enough to incorporate other model backends / inference engines
  • We are seeing very fast movement across several inference ecosystems
  • Jan's architecture needs to be able to incorporate modular Inference Engines, which would:
    • Let us parallelize efforts to support different inference engines
    • Allow us to hedge technical and architectural risks

Solution

I envision an architecture in Jan that has the following:

  • Models Extension
  • Inference Extension
  • [Many] Inference Engine Extensions, one per engine

Models Extension

  • Handles the models filesystem and CRUD operations
  • Exposes the /models API endpoint (a minimal interface sketch follows)
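As a rough sketch (in TypeScript, with illustrative names only, not the actual Extensions API), the Models Extension could expose something like:

// Hypothetical sketch of a Models Extension: owns the models filesystem and CRUD.
// None of these identifiers come from the real Jan Extensions API.
interface ModelMetadata {
  id: string;            // e.g. "llama2-70b"
  engine?: string;       // from model.json; undefined => default to "nitro"
  path: string;          // folder under /jan/models
}

interface ModelsExtension {
  list(): Promise<ModelMetadata[]>;              // GET /models
  get(id: string): Promise<ModelMetadata>;       // GET /models/:id
  create(meta: ModelMetadata): Promise<void>;    // register / download a model folder
  remove(id: string): Promise<void>;             // delete the model folder
}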

Inference Extension

  • Handles generic endpoints (for now /chat/completions; later /audio/speech)
  • Routes each inference request to the correct Inference Engine Extension, as defined in the model's model.json (see the routing sketch below)
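A rough sketch of that routing step, using stand-ins for the Models Extension and the registered engine extensions (all identifiers are illustrative, not the real Extensions API):

// Hypothetical routing step inside the Inference Extension.
declare const modelsExtension: {
  get(id: string): Promise<{ id: string; engine?: string }>;   // reads the model's model.json
};
declare const engines: Map<
  string,
  { chatCompletions(req: { model: string; messages: unknown[] }): AsyncIterable<string> }
>;

async function routeChatCompletion(req: { model: string; messages: unknown[] }) {
  const meta = await modelsExtension.get(req.model);   // model.json for the requested model
  const engineName = meta.engine ?? "nitro";           // default engine when none is specified
  const engine = engines.get(engineName);
  if (!engine) throw new Error(`No inference engine extension registered for "${engineName}"`);
  return engine.chatCompletions(req);                  // the engine streams the result back (SSE)
}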

Extension for each Inference Engine

  • Each Inference Engine exposes a /chat/completions endpoint
  • Each Inference Engine is expected to have its own OpenAI-compatible API
  • Each Inference Engine is expected to do its own automatic resource management (e.g. model loading and unloading)
  • Each Inference Engine is expected to stream events to the frontend UI
    • e.g. model loading/unloading events (i.e. "please wait")
    • e.g. SSE for streaming model responses
  • Extensions for Inference Engines are available via the Hub
    • These are NPM packages built on our Extensions API (see the contract sketch below)
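A sketch of the contract each Inference Engine Extension could fulfill, assuming SSE-style streaming and simple load/unload events (names are illustrative, not the real Extensions API):

// Hypothetical per-engine extension contract. Each engine (Nitro, OpenAI, intel-bigdl, ...)
// would implement this and be published to the Hub as an NPM package.
type EngineEvent =
  | { type: "model-loading"; model: string }    // surface "please wait" in the UI
  | { type: "model-loaded"; model: string }
  | { type: "model-unloaded"; model: string };

interface InferenceEngineExtension {
  name: string;                                  // e.g. "nitro", "openai", "intel-bigdl"
  // OpenAI-compatible /chat/completions; yields SSE chunks as they arrive.
  chatCompletions(req: { model: string; messages: unknown[] }): AsyncIterable<string>;
  // Each engine manages its own resources; events let the frontend show progress.
  onEvent(handler: (event: EngineEvent) => void): void;
}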

[architecture diagram]

Example

File Tree

/jan
    /models
        /llama2-70b
             llama2-gguf-q4_k_m.bin     # uses Nitro
             model.json
        /llama2-70b-intel-bigdl
              #pytorch files
              model.json
    /engines
        /nitro
            engine.json
        /openai

model.json example for gpt4-32k-1603

{
    "engine": "openai"    // If not specified, defaults to "nitro"
}

engine.json example for Nitro

{
    "settings": { ...default_gguf_settings },
    "parameters": { ...default_gguf_params }
}

Execution Path

  1. The user makes an inference request to llama2-70b-intel-bigdl
  2. The Inference Extension loads model.json for llama2-70b-intel-bigdl and sees that its engine is intel-bigdl
  3. The Inference Extension routes the request to the intel-bigdl Inference Engine Extension
  4. The intel-bigdl Inference Engine Extension takes the /chat/completions request, runs inference, and streams the result back via SSE (see the end-to-end sketch below)
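Putting the sketches together, the execution path might look roughly like this from the caller's side (illustrative only, not the real Jan API):

// Hypothetical end-to-end flow for the llama2-70b-intel-bigdl example,
// built on the routeChatCompletion sketch above.
declare function routeChatCompletion(req: {
  model: string;
  messages: { role: string; content: string }[];
}): Promise<AsyncIterable<string>>;

async function runExample() {
  // Step 1: the user's inference request names the model.
  const stream = await routeChatCompletion({
    model: "llama2-70b-intel-bigdl",
    messages: [{ role: "user", content: "Hello!" }],
  });
  // Steps 2-3 happen inside routeChatCompletion: model.json says engine = "intel-bigdl",
  // so the request is handed to the intel-bigdl Inference Engine Extension.
  // Step 4: consume the streamed (SSE) response as it arrives.
  for await (const chunk of stream) {
    process.stdout.write(chunk);
  }
}
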
hiro-v (Contributor) commented Nov 29, 2023

  1. I'm not sure what the Models Extension would do once we have flat, separate model extensions
  2. Nitro, plus remote models (OpenAI / Azure OpenAI / Claude / TGI-compatible), will be supported by Dec 15
  3. I will check whether the NVIDIA local LLM is supported, but this is optional; per their announcement, 0.6.0 will be out in Dec 2023

freelerobot (Contributor) commented

Nitro is still an intermediary server in front of llama.cpp. It is opinionated, uses Drogon (a C++ framework), and is the default engine when users don't specify an engine for their GGUF models.

dan-menlo (Contributor, Author) commented

@vuonghoainam Re: supporting Claude/TGI, aren't there differences between their interfaces and OpenAI's? Does this mean we will need a separate inference engine extension for each of them?

I can also see how it would not be a huge amount of work, just some JavaScript glue (see the adapter sketch below).
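For engines whose native API differs from OpenAI's, the glue inside that engine's extension could be a thin adapter that maps an OpenAI-style chat request onto the engine's own endpoint. For example, a sketch against TGI's /generate endpoint (the inputs/parameters/generated_text field names are from memory of TGI's API and worth double-checking; the chat template is deliberately naive):

// Hypothetical glue: adapt an OpenAI-style chat request to a TGI-style /generate call.
async function tgiChatCompletions(
  baseUrl: string,
  req: { messages: { role: string; content: string }[]; max_tokens?: number },
): Promise<string> {
  // Flatten chat messages into a single prompt string; a real adapter would follow
  // the model's own chat template.
  const prompt =
    req.messages.map((m) => `${m.role}: ${m.content}`).join("\n") + "\nassistant:";
  const res = await fetch(`${baseUrl}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      inputs: prompt,
      parameters: { max_new_tokens: req.max_tokens ?? 512 },
    }),
  });
  const json = await res.json();
  return json.generated_text;   // TGI's non-streaming response shape
}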

hiro-v (Contributor) commented Nov 29, 2023

Yes, @dan-jan, it's correct that we would add support for TGI and Claude that way.

Some decisions:

inference-engine/
    nitro/ (default)
    openai/
    azure-openai/
    claude/
    hf-endpoints/
    nvidia-triton-trt-llm/
  • In the right-hand settings panel, each engine's parameters will be exposed as thread settings the user can update (see the sketch below)
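One way the layout above could tie into that panel, sketched with illustrative types only (not a committed schema):

// Hypothetical shape for an engine.json under inference-engine/<name>/ and how its
// parameters could be surfaced in the right-hand thread-settings panel.
interface EngineManifest {
  name: string;                                             // "nitro", "openai", "claude", ...
  settings: Record<string, unknown>;                        // engine-level defaults
  parameters: Record<string, { default: unknown; description?: string }>;
}

function toThreadSettings(manifest: EngineManifest) {
  // One editable UI field per parameter, seeded with the engine's default value.
  return Object.entries(manifest.parameters).map(([key, param]) => ({
    engine: manifest.name,
    key,
    value: param.default,
    description: param.description ?? "",
  }));
}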

dan-menlo (Contributor, Author) commented

@vuonghoainam This looks quite clear. I'll create a task in the epic to track documenting this in Specs.

freelerobot (Contributor) commented

Moving to #1271
