Refactor/update hf support (#34)
* refactoring HF support

* updating README

* refactoring __main__

* refactoring device parameter for HF support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update README.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Aniket Maurya <theaniketmaurya@gmail.com>
3 people authored Mar 19, 2024
1 parent c200dbc commit b58ec8b
Showing 3 changed files with 31 additions and 12 deletions.
13 changes: 7 additions & 6 deletions README.md
@@ -104,25 +104,26 @@ terminal.

### Serve HuggingFace Models

You can easily serve any HuggingFace Transformer model using FastServe.
With FastServe, you can serve any HuggingFace Transformer model and deploy it flexibly across computing environments, from CPU-only machines to single- and multi-GPU setups.

Some models require a HuggingFace API token to be set up in your environment before they can be accessed from the HuggingFace Hub.
This is not necessary for all models, but some gated models require extra steps, such as accepting terms of use; check your model's page on the Hub for its specific requirements.
```bash
export HUGGINGFACE_TOKEN=<your hf token>
```

Example of running the server:
The server can be started with a specific model. The example below uses `gpt2`; replace it with your model of choice. The `model_name` parameter is optional; if it is not provided, the class attempts to read the model name from the `HUGGINGFACE_MODEL_NAME` environment variable. You can also choose whether to use GPU acceleration via the `device` parameter, which defaults to `cpu`.

```python
from fastserve.models import ServeHuggingFace

# Here, we use "gpt2" as an example. Replace "gpt2" with the name of your desired model.
# The `model_name` parameter is optional; the class can retrieve it from an environment variable called `HUGGINGFACE_MODEL_NAME`.
app = ServeHuggingFace(model_name="gpt2")
# Initialize with GPU support if desired by setting `device="cuda"`.
# For CPU usage, you can omit `device` or set it to `cpu`.
app = ServeHuggingFace(model_name="gpt2", device="cuda")
app.run_server()
```
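
If you prefer to configure the model through the environment instead, the sketch below exercises the `HUGGINGFACE_MODEL_NAME` fallback mentioned above (a minimal sketch; the exact lookup inside `ServeHuggingFace` is assumed from the description rather than shown in this diff):

```python
import os

# Assumed usage of the HUGGINGFACE_MODEL_NAME fallback described above;
# in practice you would export this variable in your shell before starting the server.
os.environ["HUGGINGFACE_MODEL_NAME"] = "gpt2"

from fastserve.models import ServeHuggingFace

app = ServeHuggingFace()  # no model_name passed; the class reads it from the environment
app.run_server()
```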

or, run `python -m fastserve.models --model huggingface --model_name bigcode/starcoder --batch_size 4 --timeout 1` from
or, run `python -m fastserve.models --model huggingface --model_name bigcode/starcoder --batch_size 4 --timeout 1 --device cuda` from
terminal.

To make a request to the server, send a JSON payload with the prompt you want the model to generate text for. Here's an example using requests in Python:
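
The README's own snippet is collapsed in this diff view; below is a minimal sketch of such a request. The host, port, and `/endpoint` route are assumptions (check the server's startup output for the exact URL), and the `prompt`/`max_tokens` fields mirror the `PromptRequest` model in `src/fastserve/models/huggingface.py`.

```python
import requests

# Assumed URL: adjust the host, port, and route to match your running FastServe server.
url = "http://localhost:8000/endpoint"

payload = {
    "prompt": "Once upon a time",
    "max_tokens": 50,  # fields mirror the PromptRequest model (prompt, max_tokens)
}

response = requests.post(url, json=payload)
print(response.json())
```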
3 changes: 2 additions & 1 deletion src/fastserve/models/__main__.py
@@ -76,7 +76,8 @@
elif args.model == "huggingface":
app = ServeHuggingFace(
model_name=args.model_name,
device=device,
use_gpu=True if args.use_gpu else False,
device="cuda" if args.use_gpu else device,
timeout=args.timeout,
batch_size=args.batch_size,
)
27 changes: 22 additions & 5 deletions src/fastserve/models/huggingface.py
@@ -3,7 +3,7 @@
from typing import Any, List

from pydantic import BaseModel
from transformers import AutoModel, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

from fastserve.core import FastServe

@@ -20,7 +20,13 @@ class PromptRequest(BaseModel):


class ServeHuggingFace(FastServe):
def __init__(self, model_name: str = None, **kwargs):
def __init__(
self, model_name: str = None, use_gpu: bool = False, device="cpu", **kwargs
):
# Determine execution mode from environment or explicit parameter
self.use_gpu = use_gpu or os.getenv("USE_GPU", "false").lower() in ["true", "1"]
self.device = device

# HF authentication
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if hf_token:
@@ -39,8 +45,7 @@ def __init__(self, model_name: str = None, **kwargs):
)
super().__init__(**kwargs)

@staticmethod
def _load_model_and_tokenizer(model_name: str):
def _load_model_and_tokenizer(self, model_name: str):
if not model_name:
logger.error(
"The Hugging Face model name has not been provided. \
@@ -49,7 +54,15 @@ def _load_model_and_tokenizer(model_name: str):
)
return None, None
try:
model = AutoModel.from_pretrained(model_name)
if self.use_gpu:
# Load model with GPU support, device_map="auto" enables multi-GPU if available
model = AutoModelForCausalLM.from_pretrained(
model_name, device_map="auto"
)
else:
# Load model for CPU execution
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
logger.info(f"Model and tokenizer for '{model_name}' loaded successfully.")
return model, tokenizer
@@ -62,6 +75,10 @@ def _load_model_and_tokenizer(model_name: str):
def __call__(self, request: PromptRequest) -> Any:
try:
inputs = self.tokenizer.encode(request.prompt, return_tensors="pt")

if self.use_gpu:
inputs = inputs.to(self.device)

output = self.model.generate(
inputs,
max_length=request.max_tokens,
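
One more note on the hunk above: besides the `use_gpu` argument, GPU mode can also be switched on through a `USE_GPU` environment variable, per the `os.getenv("USE_GPU", "false")` check in `__init__`. A minimal sketch of that path, with `gpt2` as a placeholder model:

```python
import os

from fastserve.models import ServeHuggingFace

# Per __init__ above, either the argument or the variable enables GPU mode:
#   self.use_gpu = use_gpu or os.getenv("USE_GPU", "false").lower() in ["true", "1"]
os.environ["USE_GPU"] = "true"  # normally exported in your shell

# With use_gpu enabled, the model is loaded with device_map="auto" and the inputs
# are moved to `device`, so pass device="cuda" on a GPU machine.
app = ServeHuggingFace(model_name="gpt2", device="cuda")
app.run_server()
```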
