Unleash the Blazing Power of Cutting-Edge LLMs on Your Own Hardware
Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other state-of-the-art language models locally with scorching-fast performance. Inferno provides an intuitive CLI and an OpenAI/Ollama-compatible API, putting the inferno of AI innovation directly in your hands.
Navigation
- Overview
- Key Features
- Installation
- Command Line Interface (CLI)
- Getting Started
- Usage Guide
- API Usage
- Native Python Client
- Integrations
- Requirements
- Advanced Configuration
- Contributing
- License
- Full Documentation
Inferno is your personal gateway to the blazing frontier of Artificial Intelligence. Designed for both newcomers and seasoned developers, it provides a powerful yet user-friendly platform to run the latest Large Language Models (LLMs) directly on your local machine. Experience the raw power of models like Llama 3.3 and Phi-4 without relying on cloud services, ensuring full control over your data and costs.
Inferno offers an experience similar to Ollama but turbo-charged with enhanced features, including seamless Hugging Face integration, advanced quantization tools, and flexible model management. Its OpenAI & Ollama-compatible APIs ensure drop-in compatibility with your favorite AI frameworks and tools.
Tip
New to local LLMs? Inferno makes it incredibly easy to get started. Pull a model and ignite your first conversation within minutes!
- Bleeding-Edge Model Support: Run the latest models such as Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more as soon as GGUF versions are available.
- Hugging Face Integration: Download models with interactive file selection, repository browsing, and direct `repo_id:filename` targeting.
- Dual API Compatibility: Serve models through both OpenAI- and Ollama-compatible API endpoints. Use Inferno with almost any AI client or framework.
- Native Python Client: Includes a built-in, OpenAI-compatible Python client for seamless integration into your Python projects. Supports streaming, embeddings, multimodal inputs, and tool calling.
- Interactive CLI: Command-line interface for downloading, managing, quantizing, and chatting with models.
- Blazing-Fast Inference: GPU acceleration (CUDA, Metal, ROCm, Vulkan, SYCL) for faster response times. CPU acceleration via OpenBLAS is also supported.
- Real-time Streaming: Get instant feedback with streaming support for both chat and completions APIs.
- Flexible Context Control: Adjust the context window size (`n_ctx`) per model or session. The maximum context length is automatically detected from GGUF metadata.
- Smart Model Management: List, show details, copy, remove, and see running models (`ps`). Includes RAM requirement estimates.
- Embeddings Generation: Create embeddings using your local models via the API.
- Advanced Quantization: Convert models between various GGUF quantization levels (including importance-matrix methods like `iq4_nl`) with interactive comparison and RAM estimates.
- Keep-Alive Management: Control how long models stay loaded in memory when idle.
- Fine-Grained Configuration: Customize inference parameters such as GPU layers, threads, batch size, RoPE settings, and mlock.
Important
Critical Prerequisite: Install `llama-cpp-python` First!
Inferno relies heavily on `llama-cpp-python`. For optimal performance, especially GPU acceleration, you MUST install `llama-cpp-python` with the correct hardware backend flags before installing Inferno. Failure to do this may result in suboptimal performance or CPU-only operation.
Choose one of the following commands based on your hardware. See the detailed Hardware Acceleration section below for more options and explanations.
- NVIDIA GPU (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir # Or use pre-built wheels if available (see details below)
- Apple Silicon GPU (Metal):
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir # Or use pre-built wheels if available (see details below)
- AMD GPU (ROCm):
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
- CPU Only (OpenBLAS):
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
- Other Backends (Vulkan, SYCL, etc.): See the detailed section below.
Tip
Using a virtual environment (like `venv` or `conda`) is highly recommended. Ensure you have Python 3.9+ and the necessary build tools (CMake, a C++ compiler) installed. Adding `--force-reinstall --upgrade --no-cache-dir` helps ensure a clean build against your system's libraries.
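For example, a typical environment setup on Linux/macOS might look like this (a minimal sketch; adjust the Python version and activation command for your platform):

```bash
# Create and activate an isolated environment for Inferno
python3 -m venv .venv
source .venv/bin/activate        # On Windows: .venv\Scripts\activate

# Confirm the interpreter and build tools are available
python --version                 # Should report 3.9 or newer
cmake --version
```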
Once `llama-cpp-python` is installed with your desired backend, you can install Inferno directly from PyPI:
# Install the latest stable release from PyPI
pip install inferno-llm
Or, for development or the latest features, install from source:
# Clone the Inferno repository
git clone https://github.com/HelpingAI/inferno.git
cd inferno
# Install Inferno in editable mode (recommended for development)
pip install -e .
# Or install with all optional dependencies (like quantization tools)
# pip install -e ".[dev]"
`llama.cpp` (the engine behind `llama-cpp-python`) supports multiple hardware acceleration backends. You need to tell `pip` how to build `llama-cpp-python` using `CMAKE_ARGS`.
How to Set Build Options (Environment Variables vs. CLI)
You can set `CMAKE_ARGS` either as an environment variable before running `pip install`, or pass the CMake arguments directly via the `-C` / `--config-settings` flag.
Environment Variable Method (Linux/macOS):
CMAKE_ARGS="-DOPTION=on" pip install llama-cpp-python ...
Environment Variable Method (Windows PowerShell):
$env:CMAKE_ARGS = "-DOPTION=on"
pip install llama-cpp-python ...
CLI Method (Works Everywhere, Good for requirements.txt):
# Use semicolons to separate multiple CMake args with -C
pip install llama-cpp-python -C cmake.args="-DOPTION1=on;-DOPTION2=off" ...
Supported Backends (Install ONE)
-
CUDA (NVIDIA): Requires NVIDIA drivers & CUDA Toolkit.
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
- Pre-built Wheels (Alternative): If you have CUDA 12.1-12.5 and Python 3.10-3.12, try:
  # Replace <cuda-version> with cu121, cu122, cu123, cu124, or cu125
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
  # Example: pip install ... --extra-index-url .../whl/cu121
-
Metal (Apple Silicon): Requires macOS 11.0+ & Xcode Command Line Tools.
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
- Pre-built Wheels (Alternative): If you have macOS 11.0+ and Python 3.10-3.12, try:
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
-
hipBLAS / ROCm (AMD): Requires ROCm toolkit.
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
-
OpenBLAS (CPU Acceleration): Recommended for CPU-only systems. Requires OpenBLAS library installed.
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
-
Vulkan: Requires Vulkan SDK. May accelerate various GPUs (Intel, AMD).
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
-
SYCL (Intel GPU): Requires Intel oneAPI Base Toolkit.
# Set up the oneAPI environment first (adjust path as needed)
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
-
RPC (Distributed): For multi-machine inference setups.
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Access Inferno's features through its intuitive CLI:
# Show available commands and options
inferno --help
# Alternatively, run as a Python module
python -m inferno --help
Core Commands:
| Command | Description | Example |
|---|---|---|
| `pull <model_id_or_path>` | Download models (GGUF) from Hugging Face | `inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF` |
| `list` or `ls` | List locally downloaded models & RAM estimates | `inferno list` |
| `serve <model_name_or_id>` | Start API server (OpenAI & Ollama compatible) | `inferno serve MyLlama3 --port 8080` |
| `run <model_name_or_id>` | Start interactive chat session in the terminal | `inferno run MyLlama3` |
| `remove <model_name>` | Delete a downloaded model | `inferno remove MyLlama3` |
| `copy <source> <dest>` | Duplicate a model locally | `inferno copy MyLlama3 MyLlama3-Experiment` |
| `show <model_name>` | Display detailed model info (metadata, path, etc.) | `inferno show MyLlama3` |
| `ps` | Show running Inferno server processes/models | `inferno ps` |
| `quantize <input> [out]` | Convert models (HF or GGUF) to different quant levels | `inferno quantize hf:Qwen/Qwen3-0.6B Qwen3-0.6B-Q4_K_M` |
| `compare <models...>` | Compare specs of multiple local models | `inferno compare ModelA ModelB` |
| `estimate <model_name>` | Show RAM usage estimates for different quants | `inferno estimate MyLlama3-f16` |
| `version` | Display Inferno version information | `inferno version` |
Let's ignite your first model!
- Download a Model: Choose a model from Hugging Face (GGUF format). Inferno helps you select the specific file.

  # Example: Download Llama 3.3 8B Instruct (will prompt for file selection)
  inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF
  # Example: Download Mistral Small 3.1
  inferno pull mistralai/Mistral-Small-3.1-GGUF
  # Example: Download Phi-4 Mini
  inferno pull microsoft/Phi-4-mini-GGUF
  # Example: Specify a direct file if you know the exact name
  # inferno pull user/repo-GGUF:model-q4_k_m.gguf

  > [!WARNING]
  > Some models require Hugging Face authentication. Run `huggingface-cli login` in your terminal beforehand if needed. Inferno will warn you about estimated RAM requirements.

- List Your Models: Verify the download.

  inferno list

  (You'll see your downloaded model listed, e.g., `Llama-3.3-8B-Instruct-GGUF`.)

- Chat with the Model: Start an interactive session.

  inferno run Llama-3.3-8B-Instruct-GGUF

  Type your questions and press Enter. Use `/help` inside the chat for commands like changing the system prompt (`/set system ...`) or context size (`/set context ...`). Use `/bye` to exit.

- (Alternative) Start the API Server: Serve the model for use with other applications.

  inferno serve Llama-3.3-8B-Instruct-GGUF --port 8000

  The server will be available at `http://localhost:8000`. You can now point clients to `http://localhost:8000/v1` (OpenAI API) or `http://localhost:8000/api` (Ollama API). A quick way to verify the running server is shown below.
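For example, you can send a test request to the OpenAI-compatible endpoint with `curl` (a minimal sketch; the `model` value must match the name you passed to `inferno serve`):

```bash
# Quick smoke test against the local Inferno server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Llama-3.3-8B-Instruct-GGUF",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```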
# Interactive download (recommended)
inferno pull <repo_id>
# Example: inferno pull google/gemma-1.1-7b-it-gguf
# Direct file download
inferno pull <repo_id>:<filename.gguf>
# Example: inferno pull google/gemma-1.1-7b-it-gguf:gemma-1.1-7b-it-Q4_K_M.gguf
# HuggingFace prefix with repository ID
inferno pull hf:<repo_id>
# Example: inferno pull hf:mradermacher/DAN-Qwen3-1.7B-GGUF
# HuggingFace prefix with repository ID and quantization
inferno pull hf:<repo_id>:<quantization>
# Example: inferno pull hf:mradermacher/DAN-Qwen3-1.7B-GGUF:Q2_K
Inferno shows available GGUF files, sizes, and estimated RAM needed, warning if it exceeds your system's available RAM.
Convert models to smaller, faster GGUF formats. This is useful if you download a large F16 model or want to experiment with different precision levels.
# Quantize a downloaded F16 GGUF model (interactive method selection)
inferno quantize MyModel-f16 MyModel-Q4_K_M
# Quantize directly from a Hugging Face repo (interactive)
# This downloads the original (often PyTorch/Safetensors) model and converts it.
inferno quantize hf:NousResearch/Hermes-2-Pro-Llama-3-8B Hermes-2-Pro-Llama-3-8B-Q5_K_M
# Specify quantization method directly (e.g., q4_k_m)
inferno quantize MyModel-f16 MyModel-Q4_K_M --method q4_k_m
Common Quantization Methods:
Method | Approx Bits | Size Multiplier (vs F16) | Use Case |
---|---|---|---|
q2_k | ~2.5 bits | ~0.16x | Minimum RAM, experimental quality |
iq3_m | ~3.0 bits | ~0.21x | Good quality for 3-bit (Importance Matrix) |
q3_k_m | ~3.5 bits | ~0.24x | Balanced low RAM / decent quality |
iq4_nl | ~4.0 bits | ~0.29x | Best 4-bit quality (Non-Linear Importance) |
iq4_xs | ~4.0 bits | ~0.29x | Extra small 4-bit (Importance Matrix) |
q4_k_m | ~4.5 bits | ~0.31x | Excellent general-purpose balance |
q5_k_m | ~5.5 bits | ~0.38x | Higher quality, moderate RAM increase |
q6_k | ~6.5 bits | ~0.44x | Very high quality, significant RAM |
q8_0 | ~8.5 bits | ~0.53x | Near-lossless quality, highest non-F16 RAM |
f16 | 16.0 bits | 1.00x | Full precision, highest RAM, source quality |
Note
RAM estimates provided during quantization are approximate. Actual usage depends on context size and backend. Use `inferno estimate <model_name>` for more detailed projections.
inferno list # or inferno ls
Shows local model names, original repo, file size, quantization type, estimated base RAM, and download date.
# Serve a locally downloaded model
inferno serve MyModel-Q4_K_M
# Serve directly from Hugging Face (downloads if not present)
inferno serve teknium/OpenHermes-2.5-Mistral-7B-GGUF:openhermes-2.5-mistral-7b.Q4_K_M.gguf
# Specify host and port (0.0.0.0 makes it accessible on your network)
inferno serve MyModel-Q4_K_M --host 0.0.0.0 --port 8080
# Advanced: Offload layers to GPU, set context size
inferno serve MyModel-Q4_K_M --n_gpu_layers 35 --n_ctx 8192
Warning
Using `--host 0.0.0.0` exposes the server to your local network. Ensure your firewall settings are appropriate.
inferno run MyModel-Q4_K_M
# Set context size on launch
inferno run MyModel-Q4_K_M --n_ctx 4096
In-Chat Commands:
| Command | Description |
|---|---|
| `/help` or `/?` | Show this help message |
| `/bye` | Exit the chat |
| `/set system <prompt>` | Set the system prompt |
| `/set context <size>` | Set context window size (reloads model) |
| `/show context` | Show the current and maximum context window size |
| `/clear` or `/cls` | Clear the terminal screen |
| `/reset` | Reset chat history and system prompt |
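An illustrative session combining a few of these commands (the prompt shown here is illustrative; the exact interface depends on your Inferno version):

```
$ inferno run MyModel-Q4_K_M
>>> /set system You are a concise assistant.
>>> /set context 8192
>>> What is quantization in one sentence?
Quantization stores a model's weights at lower precision to cut memory use and speed up inference.
>>> /bye
```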
Inferno exposes OpenAI- and Ollama-compatible API endpoints when you run `inferno serve`.
- OpenAI Base URL: `http://localhost:8000/v1` (default port)
- Ollama Base URL: `http://localhost:8000/api` (default port)
OpenAI-compatible endpoints (`/v1`):
- `/v1/models` (GET): List available models (returns the currently served model).
- `/v1/chat/completions` (POST): Generate chat responses (supports streaming).
- `/v1/completions` (POST): Generate text completions (legacy, use chat).
- `/v1/embeddings` (POST): Generate text embeddings.

Ollama-compatible endpoints (`/api`):
- `/api/tags` (GET): List available models (returns the currently served model).
- `/api/chat` (POST): Generate chat responses (supports streaming).
- `/api/generate` (POST): Generate text completions.
- `/api/embed` (POST): Generate text embeddings.
- `/api/show` (POST): Show details for a loaded model.
- Model management endpoints like `/api/pull`, `/api/copy`, and `/api/delete` are generally handled by the CLI.
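As a quick sanity check, both surfaces can be queried for the served model (a minimal sketch using the default port):

```bash
# OpenAI-compatible model listing
curl http://localhost:8000/v1/models

# Ollama-compatible model listing
curl http://localhost:8000/api/tags
```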
import openai
# Point the official OpenAI client to your Inferno server
client = openai.OpenAI(
api_key="dummy-key", # Required by the library, but not used by Inferno
base_url="http://localhost:8000/v1" # Your Inferno server URL
)
# --- Chat Completion ---
try:
response = client.chat.completions.create(
model="MyModel-Q4_K_M", # Must match the model name used in `inferno serve`
messages=[
{"role": "system", "content": "You are Inferno, a helpful AI assistant."},
{"role": "user", "content": "Explain the concept of quantization."}
],
max_tokens=150,
temperature=0.7,
stream=False # Set to True for streaming
)
print("Full Response:")
print(response.choices[0].message.content)
except openai.APIConnectionError as e:
print(f"Connection Error: Is the Inferno server running? {e}")
except Exception as e:
print(f"An error occurred: {e}")
# --- Streaming Chat Completion ---
try:
print("\nStreaming Response:")
stream = client.chat.completions.create(
model="MyModel-Q4_K_M",
messages=[{"role": "user", "content": "Write a short poem about fire."}],
stream=True
)
for chunk in stream:
if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print() # Newline after stream
except openai.APIConnectionError as e:
print(f"Connection Error: Is the Inferno server running? {e}")
except Exception as e:
print(f"An error occurred: {e}")
# --- Embeddings ---
try:
response = client.embeddings.create(
model="MyModel-Q4_K_M", # Ensure the model supports embeddings
input="Inferno is heating up the local AI scene!"
)
print(f"\nEmbedding Vector (first 5 dims): {response.data[0].embedding[:5]}...")
print(f"Total dimensions: {len(response.data[0].embedding)}")
except openai.APIConnectionError as e:
print(f"Connection Error: Is the Inferno server running? {e}")
except Exception as e:
print(f"An error occurred: {e}")
import requests
import json
base_url = "http://localhost:8000/api" # Ollama API base
model_name = "MyModel-Q4_K_M" # Replace with your served model
# --- Chat Completion ---
try:
response = requests.post(
f"{base_url}/chat",
json={
"model": model_name,
"messages": [{"role": "user", "content": "Hello there!"}],
"stream": False
}
)
response.raise_for_status() # Raise an exception for bad status codes
print("Ollama API Chat Response:")
print(response.json()["message"]["content"])
except requests.exceptions.RequestException as e:
print(f"Ollama API Error: {e}")
# --- Embeddings ---
try:
response = requests.post(
f"{base_url}/embed",
json={
"model": model_name,
"input": "This is text to embed."
}
)
response.raise_for_status()
print("\nOllama API Embedding (first 5 dims):")
print(response.json()["embeddings"][:5],"...")
print(f"Total dimensions: {len(response.json()['embeddings'])}")
except requests.exceptions.RequestException as e:
print(f"Ollama API Error: {e}")
Inferno includes its own `InfernoClient`, a drop-in replacement for the official `openai` client, offering the same interface.
# Ensure inferno is installed: pip install -e .
from inferno.client import InfernoClient
import json # For parsing tool arguments if needed
# Initialize pointing to your server
client = InfernoClient(
api_key="dummy",
base_url="http://localhost:8000/v1", # Use the OpenAI-compatible endpoint
)
model_name = "MyModel-Q4_K_M" # Replace with your served model
# --- Basic Chat ---
print("--- Native Client Chat ---")
response = client.chat.create(
model=model_name,
messages=[{"role": "user", "content": "What is Inferno?"}],
)
print(response["choices"][0]["message"]["content"])
# --- Multimodal Chat (Example - Requires a multimodal model like LLaVA) ---
# Make sure you are serving a model that supports image input (e.g., a LLaVA GGUF)
# model_name_multimodal = "llava-v1.6-mistral-7b-GGUF" # Example name
# print("\n--- Native Client Multimodal Chat (Requires Multimodal Model) ---")
# try:
# response = client.chat.create(
# model=model_name_multimodal, # Use the multimodal model name
# messages=[
# {
# "role": "user",
# "content": [
# {"type": "text", "text": "Describe this image."},
# {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
# ]
# }
# ]
# )
# print(response.get("choices", [{}])[0].get("message", {}).get("content"))
# except Exception as e:
# print(f"Could not run multimodal example: {e}")
# --- Tool Calling (Example - Requires a model supporting tools/functions) ---
# Make sure you are serving a model fine-tuned for tool/function calling
# model_name_tools = "hermes-2-pro-llama-3-8b-GGUF" # Example name
# print("\n--- Native Client Tool Calling (Requires Tool-Supporting Model) ---")
# tools = [{
# "type": "function",
# "function": {
# "name": "get_current_weather",
# "description": "Get the current weather in a given location",
# "parameters": {
# "type": "object",
# "properties": {
# "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
# "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
# },
# "required": ["location"]
# }
# }
# }]
# try:
# response = client.chat.create(
# model=model_name_tools, # Use the tool-supporting model name
# messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
# tools=tools,
# tool_choice="auto" # or force: {"type": "function", "function": {"name": "get_current_weather"}}
# )
# message = response.get("choices", [{}])[0].get("message", {})
# if message.get("tool_calls"):
# tool_call = message["tool_calls"][0]
# function_name = tool_call["function"]["name"]
# function_args = json.loads(tool_call["function"]["arguments"])
# print(f"Function Call Requested: {function_name}")
# print(f"Arguments: {function_args}")
# # --- Here you would execute the function and send back the result ---
# else:
# print(f"Response Content (No Tool Call): {message.get('content')}")
# except Exception as e:
# print(f"Could not run tool calling example: {e}")
Tip
The `InfernoClient` provides a familiar interface for developers already using the `openai` package, simplifying integration.
Inferno's OpenAI API compatibility makes it easy to integrate with popular AI frameworks.
# Example using LangChain
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
# Configure LangChain to use your local Inferno server
llm = ChatOpenAI(
model="MyModel-Q4_K_M", # Your served model name
openai_api_key="dummy",
openai_api_base="http://localhost:8000/v1",
temperature=0.7,
streaming=True # Enable streaming if desired
)
print("--- LangChain Integration Example ---")
# Simple invocation
# response = llm.invoke([HumanMessage(content="Explain the difference between heat and temperature.")])
# print(response.content)
# Streaming invocation
print("Streaming with LangChain:")
for chunk in llm.stream("Write a haiku about a campfire."):
print(chunk.content, end="", flush=True)
print()
Works similarly with LlamaIndex, Semantic Kernel, Haystack, and any tool supporting the OpenAI API standard. Just point them to your Inferno server's URL (`http://localhost:8000/v1`).
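For instance, with LlamaIndex you can point an OpenAI-like LLM wrapper at the same endpoint (a hedged sketch assuming the `llama-index-llms-openai-like` package; adjust the import and parameters to your LlamaIndex version):

```python
# Requires: pip install llama-index-llms-openai-like
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="MyModel-Q4_K_M",               # Your served model name
    api_base="http://localhost:8000/v1",  # Inferno's OpenAI-compatible endpoint
    api_key="dummy",                      # Required by the client, unused by Inferno
    is_chat_model=True,
)

print(llm.complete("Summarize what Inferno does in one sentence."))
```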
- RAM: This is the most critical factor (a rough back-of-the-envelope estimate is sketched below).
  - ~2-4 GB RAM for 1-3B parameter models (e.g., Phi-3 Mini, Gemma 2B)
  - 8 GB+ RAM recommended for 7-8B models (e.g., Llama 3.3 8B, Mistral 7B)
  - 16 GB+ RAM recommended for 13B models
  - 32 GB+ RAM needed for ~30B models
  - 64 GB+ RAM needed for ~70B models
- CPU: A modern multi-core CPU. Performance scales with core count and speed.
- GPU (Highly Recommended): An NVIDIA, AMD, or Apple Silicon GPU significantly accelerates inference. VRAM requirements depend on the model size and the number of layers offloaded (`--n_gpu_layers`). Even partial offloading helps.
- Disk Space: Enough space for downloaded models (GGUF files can range from ~1 GB to 100 GB+).
Warning
Running models requiring more RAM than physically available will lead to extreme slowdowns due to disk swapping. Check model RAM estimates (`inferno list`, `inferno estimate`) before running.
- Python: 3.9 or newer.
- Build Tools: A C++ compiler (such as GCC, Clang, or MSVC) and CMake are required for building `llama-cpp-python`.
- Core Dependencies: `llama-cpp-python`, `fastapi`, `uvicorn`, `rich`, `typer`, `huggingface-hub`, `pydantic`, `requests`. (Installed automatically with `pip install -e .`)
- (Optional) Git: For cloning the repository.
Pass these options to `inferno serve` and/or `inferno run` as indicated:
| Option | Description | Example | Default |
|---|---|---|---|
| `--host <ip>` | IP address to bind the server to | `--host 0.0.0.0` | `127.0.0.1` |
| `--port <num>` | Port for the API server | `--port 8080` | `8000` |
| `--n_gpu_layers <n>` | Number of model layers to offload to GPU (-1 for max) | `--n_gpu_layers 35` | `0` (CPU only) |
| `--n_ctx <n>` | Context window size (tokens); overrides auto-detection | `--n_ctx 8192` | Auto/4096 |
| `--n_threads <n>` | Number of CPU threads for computation | `--n_threads 4` | (Auto-detected) |
| `--use_mlock` | Force model to stay in RAM (prevents swapping if possible) | `--use_mlock` | (Disabled) |
Tip
For optimal CPU performance, set `--n_threads` to the number of physical cores on your CPU. Check your CPU specs (e.g., via Task Manager on Windows or `lscpu` on Linux). Start with `--n_gpu_layers -1` to offload as much as possible to VRAM, then reduce the value if you encounter memory errors.
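Putting it together, a GPU-accelerated server exposed on the local network might be launched like this (an illustrative sketch; tune the values for your hardware):

```bash
# Offload all layers to the GPU, use 8 CPU threads, pin the model in RAM,
# and expose the API on port 8080 to the local network
inferno serve MyModel-Q4_K_M \
  --host 0.0.0.0 --port 8080 \
  --n_gpu_layers -1 --n_ctx 8192 \
  --n_threads 8 --use_mlock
```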
Help fuel the fire! Contributions are highly welcome.
- Fork the repository on GitHub.
- Clone your fork locally: `git clone https://github.com/HelpingAI/inferno.git` (replace with your fork's URL).
- Create a new branch for your changes: `git checkout -b feature/my-cool-feature` or `bugfix/fix-that-issue`.
- Make your changes and commit them with clear messages: `git commit -m "Add feature X"`
- Push your branch to your fork: `git push origin feature/my-cool-feature`
- Open a Pull Request (PR) from your branch to the `main` branch of the `HelpingAI/inferno` repository.
Please ensure your code follows basic Python best practices and includes relevant tests or documentation updates if applicable.
Inferno is licensed under the HelpingAI Open Source License. This license promotes open innovation and collaboration while ensuring responsible and ethical use of AI technology.
Find comprehensive guides, API references, advanced configuration details, and tutorials at deepwiki.com/HelpingAI/inferno
Made with ❤️ by HelpingAI