server : refactor middleware and /health endpoint (#9056)
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ngxson and ggerganov authored Aug 16, 2024
1 parent d565bb2 commit 8b3befc
Showing 3 changed files with 178 additions and 218 deletions.
35 changes: 26 additions & 9 deletions examples/server/README.md
@@ -368,15 +368,16 @@ node index.js

## API Endpoints

### GET `/health`: Returns the current state of the server
### GET `/health`: Returns health check result

- 503 -> `{"status": "loading model"}` if the model is still being loaded.
- 500 -> `{"status": "error"}` if the model failed to load.
- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
- 200 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if no slots are currently available.
- 503 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if the query parameter `fail_on_no_slot` is provided and no slots are currently available.
**Response format**

If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
- HTTP status code 503
- Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
- Explanation: the model is still being loaded.
- HTTP status code 200
- Body: `{"status": "ok" }`
- Explanation: the model is successfully loaded and the server is ready.
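
A client polling this endpoint only needs to tell the two documented responses apart. The following is a minimal sketch in Python, not part of the server; the function name `classify_health` and the `"unknown"` fallback are assumptions made for illustration, while the status codes and body fields come from the response format listed above.

```python
import json


def classify_health(status_code: int, body: str) -> str:
    """Map a /health response to 'ready', 'loading', or 'unknown'.

    Hypothetical helper: only the two documented responses are recognized;
    anything else is reported as 'unknown' rather than guessed at.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"

    # Documented success case: HTTP 200 with {"status": "ok"}
    if status_code == 200 and payload.get("status") == "ok":
        return "ready"

    # Documented loading case: HTTP 503 with an "unavailable_error" error object
    error = payload.get("error")
    if (
        status_code == 503
        and isinstance(error, dict)
        and error.get("type") == "unavailable_error"
    ):
        return "loading"

    return "unknown"
```

A caller would typically retry while the result is `"loading"` and start sending requests once it is `"ready"`.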

### POST `/completion`: Given a `prompt`, it returns the predicted completion.

@@ -639,10 +640,16 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
}'
```

### GET `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
### GET `/slots`: Returns the current slots processing state

This endpoint can be disabled with `--no-slots`

If the query parameter `?fail_on_no_slot=1` is set, this endpoint responds with status code 503 when no slots are available.
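
As a rough illustration of how a load balancer or script might use this parameter, the sketch below builds the request URL and interprets the status code. The helper names, and the host and port in the usage lines, are assumptions for the example; only the `/slots` path, the `fail_on_no_slot=1` parameter, and the 503 behavior are taken from the text above.

```python
from typing import Optional
from urllib.parse import urlencode, urlunsplit


def slots_url(host: str, port: int, fail_on_no_slot: bool = False) -> str:
    """Build the /slots URL, optionally adding ?fail_on_no_slot=1."""
    query = urlencode({"fail_on_no_slot": 1}) if fail_on_no_slot else ""
    return urlunsplit(("http", f"{host}:{port}", "/slots", query, ""))


def has_free_slot(status_code: int) -> Optional[bool]:
    """Interpret the status of a request made with fail_on_no_slot=1.

    Returns True for 200, False for 503, and None for any other code.
    This sketch does not try to distinguish other possible causes of a 503.
    """
    if status_code == 200:
        return True
    if status_code == 503:
        return False
    return None


# Example host and port are placeholders, not values from the README.
print(slots_url("localhost", 8080, fail_on_no_slot=True))
print(has_free_slot(503))
```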

**Response format**

Example:

```json
[
{
@@ -702,7 +709,13 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
]
```

### GET `/metrics`: Prometheus compatible metrics exporter endpoint if `--metrics` is enabled:
Possible values for `slot[i].state` are:
- `0`: SLOT_STATE_IDLE
- `1`: SLOT_STATE_PROCESSING
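
Since the `/health` response no longer reports `slots_idle` and `slots_processing`, a client that wants those counts can derive them from the `/slots` response. The sketch below assumes the response has already been parsed into a list of dictionaries that each carry a `state` field; the `SlotState` enum and `count_by_state` function are hypothetical client-side names that mirror the two values listed above.

```python
from enum import IntEnum


class SlotState(IntEnum):
    # Values mirror the documented slot states.
    IDLE = 0        # SLOT_STATE_IDLE
    PROCESSING = 1  # SLOT_STATE_PROCESSING


def count_by_state(slots):
    """Count slots per state from a parsed /slots response."""
    counts = {state: 0 for state in SlotState}
    for slot in slots:
        # SlotState(...) raises ValueError on an undocumented state value.
        counts[SlotState(slot["state"])] += 1
    return counts


# Hand-written sample data, not real server output.
sample = [{"state": 0}, {"state": 1}, {"state": 1}]
counts = count_by_state(sample)
print(counts[SlotState.IDLE], counts[SlotState.PROCESSING])
```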

### GET `/metrics`: Prometheus compatible metrics exporter

This endpoint is only accessible if `--metrics` is set.

Available metrics:
- `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
@@ -767,6 +780,10 @@ Available metrics:

### GET `/lora-adapters`: Get list of all LoRA adapters

This endpoint returns the loaded LoRA adapters. You can add adapters using `--lora` when starting the server, for example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...`

By default, all adapters are loaded with scale set to 1. To initialize all adapters with scale 0 instead, add `--lora-init-without-apply`.

If an adapter is disabled, the scale will be set to 0.
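
For scripts that launch the server, the flags described above can be assembled programmatically. This is a small sketch with a hypothetical helper, `lora_args`; it only produces the `--lora` and `--lora-init-without-apply` arguments named in this section and leaves the server executable and all other options to the caller.

```python
def lora_args(adapter_paths, init_without_apply=False):
    """Build the LoRA-related command-line arguments as a list of strings."""
    args = []
    for path in adapter_paths:
        # One --lora flag per adapter file.
        args += ["--lora", path]
    if init_without_apply:
        # Load the adapters but start with every scale at 0.
        args.append("--lora-init-without-apply")
    return args


print(lora_args(["my_adapter_1.gguf", "my_adapter_2.gguf"], init_without_apply=True))
```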

**Response format**