An intelligent agent that combines local LLM inference with web search capabilities to provide well-researched answers with proper citations. Optimized for Python 3.12 with TinyLlama and GPU acceleration on Apple Silicon M-series chips.
- Clone the repository and create a virtual environment:
# Clone the repository
git clone https://github.com/yourusername/agentic-llm-search.git
cd agentic-llm-search
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip3 install -r requirements.txt
- Install ctransformers with Metal support (for Apple Silicon Macs):
# Uninstall any existing ctransformers installation
pip3 uninstall ctransformers --yes
# Reinstall with Metal support
CT_METAL=1 pip3 install ctransformers --no-binary ctransformers
- Download the model and run the application:
# Download the TinyLlama model
python3 download_model.py
# Run the application
python3 main.py
- Internet Search: Fetch and process information from the web using DuckDuckGo search
- Local LLM Inference: Use TinyLlama for efficient inference on your local machine
- GPU Acceleration: Support for Apple Silicon M-series GPU acceleration using Metal
- OpenAI Integration: Optionally use OpenAI models for more powerful responses
- Citation Support: Responses include proper citations to search results
- Python 3.12 Optimized: Built to leverage the latest Python features
- Content Analysis: Extract and process content from multiple web sources
- Multiple Interfaces: CLI, Web UI, and API options
- Clone the repository
git clone https://github.com/yourusername/agentic-llm-search.git
cd agentic-llm-search
- Set up a Python virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip3 install -r requirements.txt
- Download the TinyLlama model:
python3 download_model.py
- Run the setup script to verify environment and install missing dependencies:
python3 setup.py
- Check compatibility with Python 3.12:
python3 check_compatibility.py
- Verify GPU acceleration support:
python3 check_gpu.py
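If you want a quick manual check before running the full script, the following minimal sketch detects Apple Silicon the same way the setup flow is described as doing. This is an illustration only, not the contents of check_gpu.py:
# Sketch: detect Apple Silicon (illustrative, not the real check_gpu.py)
import platform

def is_apple_silicon() -> bool:
    # M-series Macs report Darwin on an arm64 machine
    return platform.system() == "Darwin" and platform.machine() == "arm64"

if __name__ == "__main__":
    if is_apple_silicon():
        print("Apple Silicon detected - Metal acceleration should be available")
    else:
        print("No Apple Silicon detected - CPU-only inference will be used")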
This project supports hardware acceleration for faster model inference:
On Apple Silicon Macs, the system uses Metal Performance Shaders (MPS) to accelerate model inference:
- Automatically detects M-series chips and configures GPU acceleration
- Uses up to 32 GPU-accelerated layers with ctransformers
- Provides approximately 2-5x speedup compared to CPU-only inference
- Extended context length (4096 tokens) to avoid token limit warnings
For optimal performance on macOS with Apple Silicon, install ctransformers with Metal support:
# Uninstall any existing ctransformers installation
pip3 uninstall ctransformers --yes
# Reinstall with Metal support
CT_METAL=1 pip3 install ctransformers --no-binary ctransformers
This enables GPU acceleration for inference using the Metal framework on Apple Silicon Macs.
You can adjust GPU settings in your .env file:
USE_GPU=True # Set to False to force CPU only
USE_METAL=True # For Apple Silicon GPUs
CONTEXT_LENGTH=4096 # Increased token context length
GPU_LAYERS=32 # Number of layers to offload to GPU
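These settings correspond to loader options exposed by ctransformers. As a rough sketch (the path and values below are examples mirroring the defaults above, not the project's exact loading code), the GGUF model can be loaded with GPU offloading like this:
# Sketch: loading the GGUF model with GPU offloading via ctransformers
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # local GGUF file
    model_type="llama",     # TinyLlama uses the llama architecture
    gpu_layers=32,          # matches GPU_LAYERS in .env; 0 forces CPU-only
    context_length=4096,    # matches CONTEXT_LENGTH in .env
)

print(llm("Q: What is Metal acceleration?\nA:", max_new_tokens=64))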
To test the performance of the LLM model on your system and compare CPU vs GPU speeds:
python3 benchmark.py
# Customize the benchmark
python3 benchmark.py --model ./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --runs 5 --context-length 4096
The benchmark tool measures:
- Token generation speed
- Inference time
- Speedup factor with GPU acceleration
- System and hardware configuration
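At its core, this kind of measurement is just timed generation. A minimal sketch of the idea (illustrative only, not the actual benchmark.py) is shown below:
# Sketch: timing token generation speed (not the real benchmark.py)
import time
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    model_type="llama",
    gpu_layers=32,  # set to 0 to compare against CPU-only speed
)

prompt = "Explain what a large language model is."
start = time.perf_counter()
output = llm(prompt, max_new_tokens=128)
elapsed = time.perf_counter() - start

tokens = len(llm.tokenize(output))
print(f"Generated {tokens} tokens in {elapsed:.2f}s ({tokens / elapsed:.1f} tokens/s)")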
Run the interactive test script:
python3 test_agentic_search.py
Enter your questions when prompted, and the agent will:
- Search the web for relevant information
- Process the search results
- Generate a comprehensive answer with citations
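Programmatically, the same flow looks roughly like the sketch below. It is hedged: the process_query call and the answer/sources fields come from the diagrams later in this README, and the import path is an assumption that may differ from the actual module layout:
# Sketch of driving the agent from Python (import path is an assumption)
from src.agents import AgenticLLMAgent  # adjust to the actual package layout

agent = AgenticLLMAgent(
    model_name="./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    model_provider="huggingface",
)

response = agent.process_query("What are the latest developments in AI?")
print(response.answer)           # generated answer with inline citations
for source in response.sources:  # search results used as evidence
    print(source)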
The agent supports two model providers:
To use HuggingFace models (recommended for privacy and no API costs):
# In your code
agent = AgenticLLMAgent(
    model_name="./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    model_provider="huggingface"
)
# Or in .env file
DEFAULT_MODEL=./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
MODEL_PROVIDER=huggingface
Available HuggingFace models:
- ./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf (recommended for low resource usage)
- TheBloke/Llama-2-7B-Chat-GGUF (better quality but requires more RAM)
- microsoft/phi-2 (good balance of size and quality)
To use OpenAI's models (requires API key):
# .env file
DEFAULT_MODEL=gpt-3.5-turbo
MODEL_PROVIDER=openai
OPENAI_API_KEY=your_api_key_here
Available OpenAI models:
- gpt-3.5-turbo (fast and cost-effective)
- gpt-4 (higher quality but more expensive)
- gpt-4-turbo (latest version)
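The provider can also be switched in code rather than in .env. A minimal sketch mirroring the HuggingFace example above:
# Sketch: constructing the agent against OpenAI instead of the local model
agent = AgenticLLMAgent(
    model_name="gpt-3.5-turbo",
    model_provider="openai",  # requires OPENAI_API_KEY in the environment
)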
To use Azure OpenAI Services (requires Azure OpenAI resource):
# .env file
DEFAULT_MODEL=gpt-35-turbo # Should match your Azure OpenAI deployment name
MODEL_PROVIDER=azure-openai
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_VERSION=2023-05-15
Available Azure OpenAI models (deployment names may vary):
- gpt-35-turbo (Azure's GPT-3.5)
- gpt-4 (Azure's GPT-4)
- gpt-4-turbo (Azure's GPT-4 Turbo)
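These variables correspond to the standard Azure client configuration in the openai Python package. A minimal sketch of how they map (illustrative only, not the project's internal wiring):
# Sketch: how the Azure settings map onto the openai package's AzureOpenAI client
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2023-05-15"),
)

completion = client.chat.completions.create(
    model="gpt-35-turbo",  # the Azure deployment name, not the model family name
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)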
The easiest way to use the application is with the included run.sh script, which handles environment setup and model checking, and provides a simple interface:
# Make the script executable if needed
chmod +x run.sh
# Run the application
./run.sh
The script will:
- Check for Python 3.12+ and set up a virtual environment
- Verify the model is downloaded or download it if missing
- Let you choose a model provider (Local TinyLlama or Azure OpenAI)
- Let you select between CLI and Web Interface
Run the agent in interactive mode:
python3 main.py
Or ask a single question:
python3 main.py "What are the latest developments in AI?"
Additional CLI options:
--model MODEL LLM model to use (default: ./src/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf)
--provider PROVIDER Model provider to use (choices: huggingface, openai, azure-openai)
--no-search Disable internet search
--max-results MAX Maximum search results to use (default: 5)
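For example, to ask a single question using the local model with a capped number of search results (combining the flags listed above):
python3 main.py --provider huggingface --max-results 3 "What is Metal acceleration?"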
Run the Streamlit web app:
streamlit run app.py
Then open your browser at http://localhost:8501.
Start the FastAPI server:
python3 -m uvicorn api:app --reload
Then access the API at http://localhost:8000 or view the API documentation at http://localhost:8000/docs.
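The actual routes and request schema are listed in the generated docs at /docs. As a purely hypothetical sketch of calling such an endpoint from Python (the /query path and payload shape below are assumptions, not confirmed routes; check /docs for the real schema):
# Hypothetical sketch: querying the API (endpoint name and payload are assumptions)
import requests

resp = requests.post(
    "http://localhost:8000/query",  # assumed route; see /docs for the real one
    json={"query": "What are the latest developments in AI?"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())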
├── app.py              # Streamlit web interface
├── api.py              # FastAPI web API
├── main.py             # Command-line interface
├── requirements.txt    # Dependencies
├── src/
│   ├── __init__.py     # Core data models
│   ├── agents/         # Agent implementation
│   ├── models/         # LLM model wrappers
│   ├── tools/          # Search and utility tools
│   └── utils/          # Utility functions
└── tests/              # Test cases
Run the test suite with pytest:
pytest tests/
The following sequence diagram illustrates how the different components of the Agentic LLM Search system interact during a typical query:
sequenceDiagram
participant User
participant AgenticLLMAgent
participant SearchQueryOptimizer
participant InternetSearchTool
participant LLMModel
participant AgentModelOrchestrator
User->>AgenticLLMAgent: process_query(query)
activate AgenticLLMAgent
AgenticLLMAgent->>SearchQueryOptimizer: optimize_query(query)
activate SearchQueryOptimizer
SearchQueryOptimizer-->>AgenticLLMAgent: optimized_query
deactivate SearchQueryOptimizer
AgenticLLMAgent->>InternetSearchTool: async_search(optimized_query)
activate InternetSearchTool
InternetSearchTool->>InternetSearchTool: Search web with DuckDuckGo
InternetSearchTool->>InternetSearchTool: Extract content from webpages
InternetSearchTool-->>AgenticLLMAgent: search_results
deactivate InternetSearchTool
AgenticLLMAgent->>AgentModelOrchestrator: generate_research_response(query, search_results)
activate AgentModelOrchestrator
AgentModelOrchestrator->>AgentModelOrchestrator: Format search results for context
AgentModelOrchestrator->>LLMModel: generate_response(query, context)
activate LLMModel
alt HuggingFaceModel
LLMModel->>LLMModel: Process with TinyLlama (GPU accelerated)
else OpenAIModel
LLMModel->>LLMModel: Call OpenAI API
end
LLMModel-->>AgentModelOrchestrator: answer
deactivate LLMModel
AgentModelOrchestrator-->>AgenticLLMAgent: AgentResponse(answer, sources, etc.)
deactivate AgentModelOrchestrator
AgenticLLMAgent-->>User: AgentResponse
deactivate AgenticLLMAgent
graph TD
User([User]) --> |Query| CLI(Command Line Interface)
User --> |Query| WebUI(Web UI - Streamlit)
User --> |API Call| API(API Server - FastAPI)
CLI --> Agent(AgenticLLMAgent)
WebUI --> Agent
API --> Agent
Agent --> |Optimize| QueryOpt(SearchQueryOptimizer)
Agent --> |Search| SearchTool(InternetSearchTool)
Agent --> |Generate Response| Orchestrator(AgentModelOrchestrator)
SearchTool --> |Web Search| DuckDuckGo[(DuckDuckGo)]
SearchTool --> |Content Extraction| WebPages[(Web Pages)]
Orchestrator --> LLM{LLM Model}
LLM --> |Local| HFModel(HuggingFaceModel)
LLM --> |Cloud| OpenAIModel(OpenAIModel)
HFModel --> TinyLlama[(TinyLlama GGUF)]
OpenAIModel --> |API| OpenAI[(OpenAI API)]
style Agent fill:#f9f,stroke:#333,stroke-width:2px
style LLM fill:#bbf,stroke:#333,stroke-width:2px
style SearchTool fill:#bfb,stroke:#333,stroke-width:2px
This diagram shows how the system handles model initialization and setup, especially with GPU acceleration:
sequenceDiagram
participant User
participant Runner as RunnerScript
participant EnvSetup as EnvironmentSetup
participant HFModel as HuggingFaceModel
participant GPU as GPU Detection
participant ModelLoader as ModelLoader
User->>Runner: Start application
activate Runner
Runner->>EnvSetup: setup_huggingface_env()
activate EnvSetup
EnvSetup->>EnvSetup: Check for hf_transfer
EnvSetup->>GPU: detect_apple_silicon()
activate GPU
alt Apple Silicon Detected
GPU-->>EnvSetup: M-series chip detected
EnvSetup->>EnvSetup: setup_gpu_environment()
EnvSetup->>EnvSetup: Configure Metal backend
else Other Hardware
GPU-->>EnvSetup: Standard hardware detected
EnvSetup->>EnvSetup: Use default configuration
end
deactivate GPU
EnvSetup-->>Runner: Environment configured
deactivate EnvSetup
Runner->>HFModel: Initialize model
activate HFModel
HFModel->>HFModel: Set acceleration options
alt Metal GPU Available
HFModel->>ModelLoader: Load with MPS backend
HFModel->>HFModel: Configure 32 GPU layers
HFModel->>HFModel: Set 4096 context length
else CPU Only
HFModel->>ModelLoader: Load with CPU backend
HFModel->>HFModel: Use default settings
end
HFModel-->>Runner: Model ready for inference
deactivate HFModel
Runner-->>User: System ready for queries
deactivate Runner
MIT
The project includes a comprehensive suite of diagnostic tools to help you identify and fix issues:
Run the main diagnostics tool to check your entire system configuration:
python3 diagnostics.py
This tool checks:
- System information and hardware compatibility
- GPU configuration and Metal support for Apple Silicon
- Environment variables and configuration
- Model files and their status
- Module structure and implementation
- Quick functionality test
It provides personalized recommendations based on your specific setup.
Use the log analyzer to diagnose issues from log files:
# Analyze the most recent log file
python3 analyze_logs.py
# Analyze a specific log file
python3 analyze_logs.py --log path/to/logfile.log
# Analyze all log files in a directory
python3 analyze_logs.py --dir logs --all
The log analyzer automatically identifies common error patterns and provides targeted solutions.
To specifically check GPU acceleration support:
python3 check_gpu.py
If you encounter issues with model downloads:
# Run the hf_transfer diagnostics tool
python3 install_hf_transfer.py --diagnose
# Force reinstall hf_transfer
pip3 uninstall -y hf_transfer
pip3 install hf_transfer==0.1.4
If GPU acceleration is not working as expected:
# Force specific configuration in .env
USE_GPU=True
USE_METAL=True # For Apple Silicon
GPU_LAYERS=32 # Adjust based on your GPU capability
If you see context length warnings in model output:
# Add to your .env file
CONTEXT_LENGTH=4096
If the model is crashing due to memory constraints:
- Try a smaller model variant
- Reduce the GPU_LAYERS setting in .env
- Set USE_GPU=False to use CPU-only mode
- Adjust the batch size with BATCH_SIZE=1 in .env
Built with Python, OpenAI, HuggingFace, DuckDuckGo Search, FastAPI, and Streamlit.