Merge pull request #39 from OliverKillane/enh/ai-assistance
RAG experiments
OliverKillane authored Feb 9, 2025
2 parents 114f2dc + 27e7ff1 commit 1f18464
Showing 184 changed files with 305 additions and 4 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
/target
/.vscode
mutants.out
2 changes: 1 addition & 1 deletion README.md
@@ -9,7 +9,7 @@ This project is an experiment and while functional, it is not fully tested & the
### [`./crates` → Contains the libraries developed for this project](./crates)
### [`./bench` → Benchmarks against other systems](./bench)
### [`./book` → The emDB book](./book) [→ hosted here](https://oliverkillane.github.io/emDB/)
### [`./papers` → Academic works developed alongside this project](./papers/)
### [`./projects` → Projects within emDB (including the thesis emDB was developed for)](./projects/)
### [`./scripts` → Helper scripts](./scripts/)

## Documentation
3 changes: 0 additions & 3 deletions papers/README.md

This file was deleted.

4 changes: 4 additions & 0 deletions projects/README.md
@@ -0,0 +1,4 @@
# <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="100"/> Projects

- All academic work, related documents and experiments using <img src="./../crates/emdb/docs/logo.drawio.svg" alt="emDB" style="vertical-align: middle;" title="emdb logo" width="50"/>.
- Additional side projects in emDB.
33 changes: 33 additions & 0 deletions projects/ai-assistance/README.md
@@ -0,0 +1,33 @@
## Tuning LLMs for a codebase
1. Scrape a codebase for relevant information:
   - Code, file locations, git history, etc.
2. Tune an existing open-source LLM on the code.
3. Use retrieval-augmented generation over a periodically re-scraped database.

### Develop
Create a `.env` file at [.env](./.env) containing a GitHub personal access token ([instructions](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)).
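
For example, using the `GITHUB` variable name that [utils/github.py](./utils/github.py) reads (the value below is a placeholder):
```bash
# .env (do not commit this file: it holds a live token)
GITHUB=<your-personal-access-token>
```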

To create just the venv (Python 3.10, matching [pyproject.toml](./pyproject.toml)):
```bash
uv venv --python=3.10
source .venv/bin/activate
```

To install the dev dependencies (needed for the notebooks):
```bash
uv pip install -r pyproject.toml --extra dev
```
In VS Code you can then open the [notebooks](./notebooks/) and select the venv's Python interpreter as the kernel.

### Notebooks
```bash
uv run --with jupyter jupyter lab
```

> NOTE: Check GPU availability before running the notebooks.
> See the [uv guide for PyTorch](https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch); the `nvidia-smi` command checks the driver from within WSL.
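
A quick sanity check that PyTorch can see the GPU (standard `torch` API):
```python
import torch

# True plus a nonzero device count means the driver and GPU are visible
print(torch.cuda.is_available(), torch.cuda.device_count())
```
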
### Resources
- [Hugging Face 'Fine Tuning on a Single GPU'](https://huggingface.co/learn/cookbook/fine_tuning_code_llm_on_single_gpu)
- [Hugging Face RAG](https://huggingface.co/blog/ray-rag)
- [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
229 changes: 229 additions & 0 deletions projects/ai-assistance/notebooks/rag.ipynb
@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retreival Augmented Generation\n",
"Setup github personal access token ([instructions](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)).\n",
"\n",
"Based on the RAG tutorials for huggingface at:\n",
" - [zephyr + langchain](https://huggingface.co/learn/cookbook/rag_zephyr_langchain)\n",
"\n",
"Additional resources\n",
" - [rag with milvus](https://huggingface.co/learn/cookbook/rag_with_hf_and_milvus)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"from utils.github import get_github_token\n",
"GITHUB_TOKEN = get_github_token()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup FAISS with documents"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"from typing import Callable\n",
"from langchain.document_loaders import GithubFileLoader, GitHubIssuesLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.vectorstores import FAISS\n",
"from langchain.embeddings import HuggingFaceEmbeddings\n",
"\n",
"def has_extension(ends: list[str]) -> Callable[[str],bool]:\n",
" def check(path: str) -> bool:\n",
" return path.split(\".\")[-1] in ends\n",
" return check\n",
"\n",
"chunked_issues = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30).split_documents(GitHubIssuesLoader(repo=\"oliverkillane/emDB\", access_token=GITHUB_TOKEN, include_prs=True, state=\"all\").load())\n",
"docs = GithubFileLoader(repo=\"oliverkillane/emDB\", access_token=GITHUB_TOKEN, file_filter=has_extension([\"rs\", \"md\", \"toml\"])).load()\n",
"chunked_code = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=30).split_documents(docs)\n",
"\n",
"db = FAISS.from_documents(chunked_code, HuggingFaceEmbeddings(model_name=\"BAAI/bge-base-en-v1.5\"))\n",
"retriever = db.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 4})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup the LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
"\n",
"model_name = \"HuggingFaceH4/zephyr-7b-beta\"\n",
"\n",
"bnb_config = BitsAndBytesConfig(\n",
" load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type=\"nf4\", bnb_4bit_compute_dtype=torch.bfloat16\n",
")\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup the chains"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import HuggingFacePipeline\n",
"from langchain.prompts import PromptTemplate\n",
"from transformers import pipeline\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"text_generation_pipeline = pipeline(\n",
" model=model,\n",
" tokenizer=tokenizer,\n",
" task=\"text-generation\",\n",
" temperature=0.2,\n",
" do_sample=True,\n",
" repetition_penalty=1.1,\n",
" return_full_text=True,\n",
" max_new_tokens=400,\n",
")\n",
"\n",
"llm = HuggingFacePipeline(pipeline=text_generation_pipeline)\n",
"ASSISTANT_SPLIT = \"<|assistant|>\"\n",
"CONTEXT_SPLIT = \"<|context|>\"\n",
"USER_SPLIT = \"<|user|>\"\n",
"ANSWER_SPLIT = \"<|answer|>\"\n",
"prompt_template = f\"\"\"\n",
"<|system|>\n",
"Answer the question based on your knowledge. Use the following context to help:\n",
"{CONTEXT_SPLIT}\n",
"{{context}}\n",
"{USER_SPLIT}\n",
"{{question}}\n",
"{ASSISTANT_SPLIT}\n",
"\"\"\"\n",
"\n",
"prompt = PromptTemplate(\n",
" input_variables=[\"context\", \"question\"],\n",
" template=prompt_template,\n",
")\n",
"\n",
"llm_chain = prompt | llm | StrOutputParser()\n",
"retriever = db.as_retriever()\n",
"rag_chain = {\"context\": retriever, \"question\": RunnablePassthrough()} | llm_chain\n",
"\n",
"def ask(question: str) -> None:\n",
" rag_full_answer = rag_chain.invoke(question)\n",
" rag_answer = rag_full_answer.split(ASSISTANT_SPLIT)[1]\n",
" rag_context = rag_full_answer.split(CONTEXT_SPLIT)[1].split(USER_SPLIT)[0]\n",
" \n",
" llm_answer = llm_chain.invoke({\"context\": \"\", \"question\": question}).split(ASSISTANT_SPLIT)[1]\n",
" \n",
" print(f\"\"\"\n",
" LLM: {llm_answer}\n",
" RAG CONTEXT: {rag_context}\n",
" LLM + RAG: {rag_answer}\n",
" \"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ask(\"Who works on emDB?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ask(\"What data structures does emdb support for implementing tables?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ask(\"Could you give me some basic code to create an emql table with one column (i32) called 'cool', and to then query for all elements in order.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ask(\"How can I build emdb, how do I run tests? How about benchmarks?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ask(\"What is combi? And what is pulpit?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ask(\"What is the window pattern in emDB, why is it necessary?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
22 changes: 22 additions & 0 deletions projects/ai-assistance/pyproject.toml
@@ -0,0 +1,22 @@
[project]
name = "tuning-experiments"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = "==3.10.*"
dependencies = [
"accelerate>=1.3.0",
"bitsandbytes>=0.45.1",
"faiss-gpu>=1.7.2",
"langchain>=0.3.17",
"langchain-community>=0.3.16",
"sentence-transformers>=3.4.1",
"torch>=2.6.0",
"transformers>=4.48.2",
]

[project.optional-dependencies]
dev = [
"ipykernel>=6.29.5",
"notebook>=7.3.2",
]
File renamed without changes.
15 changes: 15 additions & 0 deletions projects/ai-assistance/utils/github.py
@@ -0,0 +1,15 @@
from pathlib import Path
from dotenv import load_dotenv
from os import environ

DOTENV_PATH: Path = Path(__file__).parent.parent / '.env'
load_dotenv(DOTENV_PATH, verbose=True)
GITHUB_TOKEN_NAME: str = 'GITHUB'

def get_github_token() -> str | None:
match environ.get(GITHUB_TOKEN_NAME):
case None:
print(f'❌ {GITHUB_TOKEN_NAME} not present, create a .env file at {DOTENV_PATH}')
return None
case token:
return token
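
A minimal usage sketch (mirroring the first cell of [rag.ipynb](./notebooks/rag.ipynb); assumes the `.env` file described in the README):
```python
from utils.github import get_github_token

# Returns None (with a warning printed) if the .env file or GITHUB variable is missing
GITHUB_TOKEN = get_github_token()
assert GITHUB_TOKEN is not None
```
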
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Empty file.
File renamed without changes.
