This is a BentoML example project that shows how to serve and deploy open-source embedding and reranking models using michaelfeil/Infinity, which enables high-throughput deployments for CLIP, sentence-transformer, reranking, and classification models.
See here for a full list of BentoML example projects.
- You have installed Python 3.9+ and pip. See the Python downloads page to learn more.
- You have a basic understanding of key concepts in BentoML, such as Services. We recommend you read Quickstart first.
- You have installed Docker, as this example depends on a base Docker image, michaelf34/infinity, to set up Infinity.
- (Optional) We recommend you create a virtual environment for dependency isolation for this project. See the Conda documentation or the Python documentation for details.
Clone the repo.
git clone https://github.com/bentoml/BentoInfinity.git
cd BentoInfinity
Make sure you are in the BentoInfinity directory and mount it from your host machine (${PWD}) into a Docker container at /BentoInfinity. This makes the files and folders in the current directory available inside the container at /BentoInfinity.
docker run --runtime=nvidia --gpus all -v ${PWD}:/BentoInfinity -v ~/bentoml:/root/bentoml -p 3000:3000 --entrypoint /bin/bash -it --workdir /BentoInfinity michaelf34/infinity v2
Install dependencies.
pip install -r requirements.txt
We have defined a BentoML Service in service.py. Run bentoml serve in your project directory to start the Service.
$ bentoml serve .
2024-06-06T10:31:45+0000 [INFO] [cli] Starting production HTTP BentoServer from "service:INFINITY" listening on http://localhost:3000 (Press CTRL+C to quit)
The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways.
CURL
curl -X 'POST' \
'http://localhost:3000/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"input": ["Explain superconductors like I am five years old"],
"model": "BAAI/bge-small-en-v1.5"
}'
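The same request can also be assembled with Python's standard library alone, which may be handy when the BentoML client is not installed. The sketch below mirrors the curl call above (same route, header, and JSON body). The helper names build_embeddings_request and send are illustrative and not part of BentoML or Infinity, and nothing is assumed about the shape of the response beyond it being JSON.

```python
import json
import urllib.request


def build_embeddings_request(base_url, texts, model):
    """Build (but do not send) a POST request for the /embeddings route.

    Hypothetical helper: it only reproduces the curl call shown above.
    """
    payload = {"input": list(texts), "model": model}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, assuming the Service is running locally on port 3000:
# req = build_embeddings_request(
#     "http://localhost:3000",
#     ["Explain superconductors like I am five years old"],
#     "BAAI/bge-small-en-v1.5",
# )
# print(send(req))
```

Building the request separately from sending it makes it possible to inspect the payload before any network call is made.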
Python client
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # Call the Service's embeddings endpoint with one input sentence.
    response = client.embeddings(
        input=["Explain superconductors like I am five years old"],
        model="BAAI/bge-small-en-v1.5",
    )
    # Report (number of embeddings, embedding dimension) and token usage.
    print(
        f"Embeddings dim: "
        f"{len(response['embeddings']), len(response['embeddings'][0])} "
        f"usage: {response['usage']}"
    )
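A common next step with embeddings is to compare them. The following is a minimal sketch of cosine similarity and a simple ranking over plain Python lists. It assumes each embedding is a list of floats, as the response['embeddings'] access in the client example suggests. The function names cosine_similarity and rank_by_similarity are illustrative and not part of BentoML or Infinity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Usage, assuming `response` comes from the client call shown earlier and
# was produced with several input sentences (first one is the query):
# vectors = response["embeddings"]
# print(rank_by_similarity(vectors[0], vectors[1:]))
```

For larger batches, a vectorized library such as NumPy would likely be faster. The pure-Python version is used here only to keep the sketch dependency-free.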
After the Service is ready, you can deploy the application to BentoCloud. Make sure you have logged in to BentoCloud, then run the following command to deploy it.
bentoml deploy .
Once the application is up and running on BentoCloud, you can access it via the exposed URL.
Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.