Refactor/update hf support (#34)
* refactoring HF support

* updating README

* refactoring __main__

* refactoring device parameter for HF support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update README.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Aniket Maurya <theaniketmaurya@gmail.com>
3 people authored Mar 19, 2024
1 parent c200dbc commit b58ec8b
Showing 3 changed files with 31 additions and 12 deletions.
13 changes: 7 additions & 6 deletions README.md
@@ -104,25 +104,26 @@ terminal.

### Serve HuggingFace Models

You can easily serve any HuggingFace Transformer model using FastServe.
With FastServe, you can serve any HuggingFace Transformer model and deploy it flexibly across computing environments, from CPU-only machines to single- and multi-GPU setups.

Some models require a HuggingFace API token to be set up in your environment before they can be accessed from the HuggingFace Hub.
This is not necessary for all models, but some gated models require extra steps, such as accepting terms of use; check your model's page on the Hub for its specific requirements.
```bash
export HUGGINGFACE_TOKEN=<your hf token>
```

Example of running the server:
The server can be started with a specific model. The example below uses `gpt2`; replace it with your model of choice. The `model_name` parameter is optional; if it is not provided, the class attempts to read the model name from the `HUGGINGFACE_MODEL_NAME` environment variable. You can also choose whether to use GPU acceleration via the `device` parameter, which defaults to `cpu`.

```python
from fastserve.models import ServeHuggingFace

# Here, we use "gpt2" as an example. Replace "gpt2" with the name of your desired model.
# The `model_name` parameter is optional; the class can retrieve it from an environment variable called `HUGGINGFACE_MODEL_NAME`.
app = ServeHuggingFace(model_name="gpt2")
# Initialize with GPU support if desired by setting `device="cuda"`.
# For CPU usage, you can omit `device` or set it to `cpu`.
app = ServeHuggingFace(model_name="gpt2", device="cuda")
app.run_server()
```
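
If you prefer to configure the model through the environment instead, the sketch below exercises the `HUGGINGFACE_MODEL_NAME` fallback mentioned above (a minimal sketch; the exact lookup inside `ServeHuggingFace` is assumed from the description rather than shown in this diff):

```python
import os

# Assumed usage of the HUGGINGFACE_MODEL_NAME fallback described above;
# in practice you would export this variable in your shell before starting the server.
os.environ["HUGGINGFACE_MODEL_NAME"] = "gpt2"

from fastserve.models import ServeHuggingFace

app = ServeHuggingFace()  # no model_name passed; the class reads it from the environment
app.run_server()
```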

or, run `python -m fastserve.models --model huggingface --model_name bigcode/starcoder --batch_size 4 --timeout 1` from
or, run `python -m fastserve.models --model huggingface --model_name bigcode/starcoder --batch_size 4 --timeout 1 --device cuda` from
terminal.

To make a request to the server, send a JSON payload with the prompt you want the model to generate text for. Here's an example using requests in Python:
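
The README's own snippet is collapsed in this diff view; below is a minimal sketch of such a request. The host, port, and `/endpoint` route are assumptions (check the server's startup output for the exact URL), and the `prompt`/`max_tokens` fields mirror the `PromptRequest` model in `src/fastserve/models/huggingface.py`.

```python
import requests

# Assumed URL: adjust the host, port, and route to match your running FastServe server.
url = "http://localhost:8000/endpoint"

payload = {
    "prompt": "Once upon a time",
    "max_tokens": 50,  # fields mirror the PromptRequest model (prompt, max_tokens)
}

response = requests.post(url, json=payload)
print(response.json())
```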
3 changes: 2 additions & 1 deletion src/fastserve/models/__main__.py
@@ -76,7 +76,8 @@
elif args.model == "huggingface":
app = ServeHuggingFace(
model_name=args.model_name,
device=device,
use_gpu=True if args.use_gpu else False,
device="cuda" if args.use_gpu else device,
timeout=args.timeout,
batch_size=args.batch_size,
)
27 changes: 22 additions & 5 deletions src/fastserve/models/huggingface.py
@@ -3,7 +3,7 @@
from typing import Any, List

from pydantic import BaseModel
from transformers import AutoModel, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

from fastserve.core import FastServe

@@ -20,7 +20,13 @@ class PromptRequest(BaseModel):


class ServeHuggingFace(FastServe):
def __init__(self, model_name: str = None, **kwargs):
def __init__(
self, model_name: str = None, use_gpu: bool = False, device="cpu", **kwargs
):
# Determine execution mode from environment or explicit parameter
self.use_gpu = use_gpu or os.getenv("USE_GPU", "false").lower() in ["true", "1"]
self.device = device

# HF authentication
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if hf_token:
@@ -39,8 +45,7 @@ def __init__(self, model_name: str = None, **kwargs):
)
super().__init__(**kwargs)

@staticmethod
def _load_model_and_tokenizer(model_name: str):
def _load_model_and_tokenizer(self, model_name: str):
if not model_name:
logger.error(
"The Hugging Face model name has not been provided. \
@@ -49,7 +54,15 @@ def _load_model_and_tokenizer(model_name: str):
)
return None, None
try:
model = AutoModel.from_pretrained(model_name)
if self.use_gpu:
# Load model with GPU support, device_map="auto" enables multi-GPU if available
model = AutoModelForCausalLM.from_pretrained(
model_name, device_map="auto"
)
else:
# Load model for CPU execution
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
logger.info(f"Model and tokenizer for '{model_name}' loaded successfully.")
return model, tokenizer
@@ -62,6 +75,10 @@ def _load_model_and_tokenizer(model_name: str):
def __call__(self, request: PromptRequest) -> Any:
try:
inputs = self.tokenizer.encode(request.prompt, return_tensors="pt")

if self.use_gpu:
inputs = inputs.to(self.device)

output = self.model.generate(
inputs,
max_length=request.max_tokens,
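
One more note on the hunk above: besides the `use_gpu` argument, GPU mode can also be switched on through a `USE_GPU` environment variable, per the `os.getenv("USE_GPU", "false")` check in `__init__`. A minimal sketch of that path, with `gpt2` as a placeholder model:

```python
import os

from fastserve.models import ServeHuggingFace

# Per __init__ above, either the argument or the variable enables GPU mode:
#   self.use_gpu = use_gpu or os.getenv("USE_GPU", "false").lower() in ["true", "1"]
os.environ["USE_GPU"] = "true"  # normally exported in your shell

# With use_gpu enabled, the model is loaded with device_map="auto" and the inputs
# are moved to `device`, so pass device="cuda" on a GPU machine.
app = ServeHuggingFace(model_name="gpt2", device="cuda")
app.run_server()
```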
