
Discussion: Architecture for Jan to support multiple Inference Engines #771

Closed
Tracked by #751
dan-menlo opened this issue Nov 29, 2023 · 6 comments

dan-menlo (Contributor) commented Nov 29, 2023

Objective

  • Jan's architecture will default to Nitro, but should be flexible enough to incorporate other model backends / inference engines
  • We are seeing very fast movement across several inference ecosystems
  • Jan's architecture needs to be able to incorporate modular Inference Engines, which would:
    • Let us parallelize efforts to support different inference engines
    • Allow us to hedge technical and architectural risks

Solution

I envision an architecture in Jan that has the following:

  • Models Extension
  • Inference Extension
  • [Many] Inference Engine Extensions, one per engine

Models Extension

  • Handles the models filesystem and CRUD operations
  • Exposes the /models API endpoint (a minimal interface sketch follows)
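As a rough sketch (in TypeScript, with illustrative names only, not the actual Extensions API), the Models Extension could expose something like:

// Hypothetical sketch of a Models Extension: owns the models filesystem and CRUD.
// None of these identifiers come from the real Jan Extensions API.
interface ModelMetadata {
  id: string;            // e.g. "llama2-70b"
  engine?: string;       // from model.json; undefined => default to "nitro"
  path: string;          // folder under /jan/models
}

interface ModelsExtension {
  list(): Promise<ModelMetadata[]>;              // GET /models
  get(id: string): Promise<ModelMetadata>;       // GET /models/:id
  create(meta: ModelMetadata): Promise<void>;    // register / download a model folder
  remove(id: string): Promise<void>;             // delete the model folder
}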

Inference Extension

  • Handles generic endpoints (for now /chat/completions; later /audio/speech)
  • Routes each inference request to the correct Inference Engine Extension, as defined in the model's model.json (see the routing sketch below)
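A rough sketch of that routing step, using stand-ins for the Models Extension and the registered engine extensions (all identifiers are illustrative, not the real Extensions API):

// Hypothetical routing step inside the Inference Extension.
declare const modelsExtension: {
  get(id: string): Promise<{ id: string; engine?: string }>;   // reads the model's model.json
};
declare const engines: Map<
  string,
  { chatCompletions(req: { model: string; messages: unknown[] }): AsyncIterable<string> }
>;

async function routeChatCompletion(req: { model: string; messages: unknown[] }) {
  const meta = await modelsExtension.get(req.model);   // model.json for the requested model
  const engineName = meta.engine ?? "nitro";           // default engine when none is specified
  const engine = engines.get(engineName);
  if (!engine) throw new Error(`No inference engine extension registered for "${engineName}"`);
  return engine.chatCompletions(req);                  // the engine streams the result back (SSE)
}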

Extension for each Inference Engine

  • Each Inference Engine exposes a /chat/completions endpoint
  • Each Inference Engine is expected to have its own OpenAI-compatible API
  • Each Inference Engine is expected to do its own automatic resource management (e.g. model loading and unloading)
  • Each Inference Engine is expected to stream events to the frontend UI
    • e.g. model loading/unloading events (i.e. "please wait")
    • e.g. SSE for streaming model responses
  • Extensions for Inference Engines are available via the Hub
    • These are NPM packages built on our Extensions API (see the contract sketch below)
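A sketch of the contract each Inference Engine Extension could fulfill, assuming SSE-style streaming and simple load/unload events (names are illustrative, not the real Extensions API):

// Hypothetical per-engine extension contract. Each engine (Nitro, OpenAI, intel-bigdl, ...)
// would implement this and be published to the Hub as an NPM package.
type EngineEvent =
  | { type: "model-loading"; model: string }    // surface "please wait" in the UI
  | { type: "model-loaded"; model: string }
  | { type: "model-unloaded"; model: string };

interface InferenceEngineExtension {
  name: string;                                  // e.g. "nitro", "openai", "intel-bigdl"
  // OpenAI-compatible /chat/completions; yields SSE chunks as they arrive.
  chatCompletions(req: { model: string; messages: unknown[] }): AsyncIterable<string>;
  // Each engine manages its own resources; events let the frontend show progress.
  onEvent(handler: (event: EngineEvent) => void): void;
}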

[architecture diagram]

Example

File Tree

/jan
    /models
        /llama2-70b
             llama2-gguf-q4_k_m.bin     # uses Nitro
             model.json
        /llama2-70b-intel-bigdl
              #pytorch files
              model.json
    /engines
        /nitro
            engine.json
        /openai

model.json example for gpt4-32k-1603

{
    "engine": "openai"    // If not specified, defaults to "nitro"
}

engine.json example for Nitro

{
    "settings": { ...default_gguf_settings },
    "parameters": { ...default_gguf_params }
}

Execution Path

  1. The user makes an inference request to llama2-70b-intel-bigdl
  2. The Inference Extension loads model.json for llama2-70b-intel-bigdl and sees that its engine is intel-bigdl
  3. The Inference Extension routes the request to the intel-bigdl Inference Engine Extension
  4. The intel-bigdl Inference Engine Extension takes the /chat/completions request, runs inference, and streams the result back via SSE (see the end-to-end sketch below)
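Putting the sketches together, the execution path might look roughly like this from the caller's side (illustrative only, not the real Jan API):

// Hypothetical end-to-end flow for the llama2-70b-intel-bigdl example,
// built on the routeChatCompletion sketch above.
declare function routeChatCompletion(req: {
  model: string;
  messages: { role: string; content: string }[];
}): Promise<AsyncIterable<string>>;

async function runExample() {
  // Step 1: the user's inference request names the model.
  const stream = await routeChatCompletion({
    model: "llama2-70b-intel-bigdl",
    messages: [{ role: "user", content: "Hello!" }],
  });
  // Steps 2-3 happen inside routeChatCompletion: model.json says engine = "intel-bigdl",
  // so the request is handed to the intel-bigdl Inference Engine Extension.
  // Step 4: consume the streamed (SSE) response as it arrives.
  for await (const chunk of stream) {
    process.stdout.write(chunk);
  }
}
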
hiro-v (Contributor) commented Nov 29, 2023

  1. I'm not sure what the Models Extension would do once we have flat, separate model extensions
  2. Nitro, plus remote models (OpenAI / Azure OpenAI / Claude / TGI-compatible), will be supported by Dec 15
  3. I will check whether the NVIDIA local LLM is supported, but this is optional; per their announcement, 0.6.0 will be out in Dec 2023

freelerobot (Contributor) commented

Nitro is still an intermediary server in front of llama.cpp. It is opinionated, uses Drogon (a C++ framework), and is the default engine when users don't specify an engine for their GGUF models.

dan-menlo (Contributor, Author) commented

@vuonghoainam Re: supporting Claude/TGI, aren't there differences between their interfaces and OpenAI's? Does this mean we will need a separate inference engine extension for each of them?

I can also see how it would not be a huge amount of work, just some JavaScript glue (see the adapter sketch below).
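For engines whose native API differs from OpenAI's, the glue inside that engine's extension could be a thin adapter that maps an OpenAI-style chat request onto the engine's own endpoint. For example, a sketch against TGI's /generate endpoint (the inputs/parameters/generated_text field names are from memory of TGI's API and worth double-checking; the chat template is deliberately naive):

// Hypothetical glue: adapt an OpenAI-style chat request to a TGI-style /generate call.
async function tgiChatCompletions(
  baseUrl: string,
  req: { messages: { role: string; content: string }[]; max_tokens?: number },
): Promise<string> {
  // Flatten chat messages into a single prompt string; a real adapter would follow
  // the model's own chat template.
  const prompt =
    req.messages.map((m) => `${m.role}: ${m.content}`).join("\n") + "\nassistant:";
  const res = await fetch(`${baseUrl}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      inputs: prompt,
      parameters: { max_new_tokens: req.max_tokens ?? 512 },
    }),
  });
  const json = await res.json();
  return json.generated_text;   // TGI's non-streaming response shape
}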

hiro-v (Contributor) commented Nov 29, 2023

Yes, @dan-jan, it's correct that we would add support for TGI and Claude that way.

Some decisions:

inference-engine/
    nitro/ (default)
    openai/
    azure-openai/
    claude/
    hf-endpoints/
    nvidia-triton-trt-llm/
  • In the right-hand settings panel, each engine's parameters will be exposed as thread settings the user can update (see the sketch below)
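One way the layout above could tie into that panel, sketched with illustrative types only (not a committed schema):

// Hypothetical shape for an engine.json under inference-engine/<name>/ and how its
// parameters could be surfaced in the right-hand thread-settings panel.
interface EngineManifest {
  name: string;                                             // "nitro", "openai", "claude", ...
  settings: Record<string, unknown>;                        // engine-level defaults
  parameters: Record<string, { default: unknown; description?: string }>;
}

function toThreadSettings(manifest: EngineManifest) {
  // One editable UI field per parameter, seeded with the engine's default value.
  return Object.entries(manifest.parameters).map(([key, param]) => ({
    engine: manifest.name,
    key,
    value: param.default,
    description: param.description ?? "",
  }));
}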

dan-menlo (Contributor, Author) commented

@vuonghoainam This looks quite clear. I'll create a task in the epic to track documenting this in Specs.

freelerobot (Contributor) commented

Moving to #1271
