This project aims to implement a Retrieval-Augmented Generation (RAG) pipeline over a business-rule corpus/knowledge base, with:
- Modal GPU API with vLLM – fine-tuned foundation model and a completed remote inference endpoint
- LangGraph RAG integration – Work in progress
+----------------+ +----------------+ +------------------+ +------------------+
| | | | | | | |
| User Input +------>+ Streamlit UI +------>+ Modal +------>+ vLLM |
| | | (LangGraph) | | (Remote GPU API) | | (Accelerated LLM)|
+----------------+ +----------------+ +------------------+ +------------------+
|
+------------------------+-------------------------+
|
| Retrieval
v
[FAISS Vector DB on business rule corpus]
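For reference, a minimal sketch of the retrieval step shown above, assuming a FAISS index built over embedded business-rule snippets with sentence-transformers (the embedding model and example rules are illustrative, not the project's actual data):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative only: the embedding model and the rule texts are placeholders.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rules = [
    "Refunds above $500 require manager approval.",
    "Purchase orders over $10,000 need two signatures.",
]

# Embed the corpus and index it with inner product (cosine on normalized vectors).
vectors = np.asarray(embedder.encode(rules, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Retrieve the top-matching rules for a user question.
query = np.asarray(
    embedder.encode(["Who must approve a large refund?"], normalize_embeddings=True),
    dtype="float32",
)
scores, ids = index.search(query, 2)
print([rules[i] for i in ids[0]])
```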
- Modal GPU API with vLLM
- Fine-tune Meta LLaMA 3 8B with QLoRA (quantized LoRA) on a custom QA dataset combining OpenAssistant and Alpaca.
- Use AWS S3 for dataset and adapter checkpoint storage.
- Merge LoRA adapter into the base LLaMA 3 8B model.
- Upload merged model to Hugging Face Hub.
- Deploy the merged model on modal.ai GPU containers leveraging vLLM for fast and memory-efficient inference.
- Provide a lightweight client to interact remotely with the deployed model.
- Prepare for integration as a LangGraph RAG LLM backend.
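As a preview of that integration, here is a hypothetical sketch of wrapping the deployed Modal endpoint as a custom LangChain LLM that a LangGraph node could call. The wrapper class and the app name "business-llm" are assumptions; LLMRunner is the class deployed by vllm/modal_inference.py.

```python
from typing import Any, List, Optional

import modal
from langchain_core.language_models.llms import LLM


class ModalBusinessLLM(LLM):
    """Hypothetical LangChain wrapper around the Modal-deployed vLLM endpoint."""

    app_name: str = "business-llm"   # assumed deployed app name
    class_name: str = "LLMRunner"    # class defined in vllm/modal_inference.py

    @property
    def _llm_type(self) -> str:
        return "modal-vllm"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # Look up the deployed class and call its remote generate() method.
        Runner = modal.Cls.from_name(self.app_name, self.class_name)
        return Runner().generate.remote(prompt)


# Usage (e.g. inside a LangGraph node):
# llm = ModalBusinessLLM()
# answer = llm.invoke("Summarize the refund-approval rule.")
```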
data
├── convert_oasst_alpaca.py # (unused) data conversion script
colab_experiment
├── qlora adaptor training.ipynb # (unused) notebook for QLoRA fine‐tuning
vllm
├── credential.py # HF & AWS S3 credentials loader
├── merge_lora.py # merge LoRA adapter into base model
├── upload_merged_model_to_hf.py # push merged model to Hugging Face
├── modal_inference.py # Modal App + LLMRunner (✅)
├── client.py # local client to call Modal API
├── adapter
│ ├── adapter_config.json # LoRA adapter config
│ └── adapter_model.safetensors # LoRA adapter weights
src/businessllm_rag_chat               # (WIP) LangGraph RAG integration
- data/convert_oasst_alpaca.py: Handles dataset downloading, parsing, and merging of OpenAssistant and Alpaca into training-ready JSONL files, and uploads them to an AWS S3 bucket.
- colab_experiment/qlora adaptor training.ipynb: Colab notebook running QLoRA fine-tuning on LLaMA 3 8B using an A100 GPU.
- vllm/: Folder containing all scripts related to model merging, uploading, and inference deployment.
- vllm/credential.py: Manages AWS and Hugging Face credentials (recommended: use environment variables or a secrets manager to keep secrets safe).
- vllm/merge_lora.py: Downloads the LoRA adapter checkpoint and merges it into the base LLaMA 3 8B model.
- vllm/upload_merged_model_to_hf.py: Uploads the merged full model to the Hugging Face Hub for serving.
- vllm/modal_inference.py: Defines the Modal application that deploys the merged model with vLLM on an A100 GPU container.
- vllm/client.py: A local client example to invoke the remote inference endpoint on Modal.
- vllm/adapter/: Stores your local LoRA adapter files (adapter_config.json and adapter_model.safetensors).
- Download and parse Alpaca and OpenAssistant datasets.
- Combine and split into train/eval sets.
- Upload processed JSONL files to AWS S3 for training.
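A minimal sketch of the conversion and upload steps above, assuming Alpaca-style records and placeholder S3 bucket/key names (the real field layout lives in data/convert_oasst_alpaca.py):

```python
import json

import boto3

# Illustrative sketch: field names, file paths, and the bucket name are assumptions,
# not the exact layout produced by data/convert_oasst_alpaca.py.
def to_alpaca_record(instruction: str, output: str) -> dict:
    return {"instruction": instruction, "input": "", "output": output}

records = [
    to_alpaca_record("What is a purchase order?", "A purchase order is a document that ..."),
]

# Write the combined records as training-ready JSONL.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Upload the JSONL file to S3 so the training job can pull it.
s3 = boto3.client("s3")
s3.upload_file("train.jsonl", "my-dataset-bucket", "qa/train.jsonl")
```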
- Load the LLaMA 3 8B base model with 4-bit quantization using bitsandbytes.
- Prepare the model with PEFT LoRA adapters.
- Train on a sampled dataset (e.g. 2000 samples) with the Hugging Face Trainer.
- Save LoRA adapter checkpoints locally.
- Upload LoRA adapter to AWS S3 for backup and later retrieval.
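A compressed sketch of the fine-tuning steps above, with assumed hyperparameters, file names, and the gated meta-llama/Meta-Llama-3-8B checkpoint id (the Colab notebook is the source of truth):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Meta-Llama-3-8B"  # assumed (gated) base checkpoint id

# Load the base model in 4-bit and attach LoRA adapters.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# Tokenize a 2000-sample slice of the combined QA dataset.
ds = load_dataset("json", data_files="train.jsonl")["train"].select(range(2000))
ds = ds.map(lambda ex: tok(ex["instruction"] + "\n" + ex["output"],
                           truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    train_dataset=ds,
    args=TrainingArguments(output_dir="qlora-out", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4),
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("adapter")  # writes adapter_config.json + adapter_model.safetensors
```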
- Download LoRA adapter from S3.
- Load base LLaMA 3 8B.
- Merge LoRA adapter weights into base model.
- Save merged model locally.
- Upload merged model folder to Hugging Face Hub for downstream use.
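A minimal sketch of the merge-and-upload flow above; the adapter path, output folder, and Hub repo id are placeholders:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B"   # assumed base checkpoint id
ADAPTER = "vllm/adapter"              # local LoRA adapter directory
MERGED = "merged-llama3-8b"           # output folder name (illustrative)

# Load the base model, apply the LoRA adapter, and fold the weights together.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()

# Save the full merged checkpoint plus tokenizer locally.
merged.save_pretrained(MERGED)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)

# Push to the Hugging Face Hub (repo id is a placeholder; requires an HF token).
merged.push_to_hub("your-username/business-llama3-8b-merged")
```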
- Define a Modal app that builds a container image with the required dependencies (vllm, transformers, torch, huggingface_hub).
- Deploy the merged model from the HF Hub onto an A100 GPU-enabled Modal container.
- Use vllm.LLM with SamplingParams for fast and cost-efficient text generation.
- Expose a remote generate method to run inference with a given prompt.
- Provide a local client script to call the remote model seamlessly.
- Python 3.8+
- transformers, datasets, bitsandbytes, peft, torch, boto3
- modal CLI and account setup with A100 GPU support
- AWS S3 bucket for dataset and adapter storage
- Hugging Face account and access token for model hosting
a. Sign up for Modal
- Visit https://modal.com and create a free account.
- Install the Modal CLI:
brew install modal # macOS (Homebrew) # or: pip install modal
- Authenticate your CLI:
modal setup
b. Deploy the GPU inference service
From the project root, run:
modal deploy vllm/modal_inference.py
c. Test locally
Invoke your remote model from the command line:
python vllm/client.py --prompt "Hello!"
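For reference, a minimal sketch of what a client like vllm/client.py could look like; the app name "business-llm" is assumed here and must match what modal_inference.py actually deploys:

```python
import argparse

import modal


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", required=True)
    args = parser.parse_args()

    # Look up the deployed LLMRunner class and call its remote generate() method.
    LLMRunner = modal.Cls.from_name("business-llm", "LLMRunner")
    print(LLMRunner().generate.remote(args.prompt))


if __name__ == "__main__":
    main()
```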