https://github.com/angrysky56/llada_gui_new/tree/main
A graphical user interface for interacting with the LLaDA (Large Language Diffusion with mAsking) model.
Currently maxes out my 12GB of VRAM when using 4-bit quantization (reported usage reads around 20GB), but the new optimizations are working well and generation is much faster, maybe 10x.
```bash
./start_memory_optimized.sh
```
A prototype memory system is now available; it is slower and VRAM-intensive. Derived from:
https://github.com/synthience/mcp-titan-cognitive-memory
It generally uses around 40GB of RAM in CPU mode.
This is a GUI wrapper for the LLaDA model, an 8B scale diffusion model trained entirely from scratch that rivals LLaMA3 8B in performance. Unlike conventional autoregressive language models, LLaDA uses a diffusion approach with masking to generate text.
Important: This GUI is a third-party tool and not officially affiliated with the original LLaDA project. All credit for the underlying LLaDA model goes to the original authors at the Gaoling School of Artificial Intelligence, Renmin University of China. Please visit their official repository for more information about the model.
This GUI includes several optimizations to make the model run efficiently on consumer hardware:
- Smart CPU-GPU Offloading: Intelligently moves tensors between CPU and GPU to minimize memory usage
- Token Buffer Management: Manages token data efficiently to reduce peak memory requirements
- Adaptive Step Scheduling: Uses fewer steps for easier tokens, more for difficult ones
- Block-Level Processing: Processes tokens in blocks for better GPU utilization
- Progressive Generation: High-confidence tokens are revealed early in the process
- Chunked Operations: Large operations are broken into manageable chunks
These optimizations allow the model to run on GPUs with 8-12GB VRAM while providing faster generation than the original implementation.
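To make the chunked-operations idea concrete, here is a minimal sketch assuming a PyTorch tensor pipeline; the function name, chunk size, and demo are illustrative, not the GUI's actual code. Splitting work along the sequence dimension means only one chunk's intermediates are alive at a time:

```python
import torch

def chunked_apply(fn, hidden, chunk_size=1024):
    """Apply fn over the sequence dimension in fixed-size chunks.

    Only one chunk's intermediate activations are alive at a time,
    which bounds peak VRAM. Illustrative sketch only.
    """
    outputs = []
    for start in range(0, hidden.shape[1], chunk_size):
        outputs.append(fn(hidden[:, start:start + chunk_size]))
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release the finished chunk's scratch buffers
    return torch.cat(outputs, dim=1)

# Demo: a (batch, seq, dim) tensor processed 1024 positions at a time.
x = torch.randn(1, 4096, 64)
print(chunked_apply(lambda t: t * 2, x).shape)  # torch.Size([1, 4096, 64])
```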
- Text Generation: Generate text responses to your prompts
- Intuitive Interface: Easy-to-use controls for interacting with the model
- Configurable Parameters: Adjust generation length, sampling steps, and more
- Diffusion Visualization: Watch the diffusion process unfold in real-time
- Token Evolution: See how masked tokens evolve into predicted text
- Memory Management: Options to optimize memory usage, including:
  - Real-time memory monitoring (see the sketch after this list)
  - 4-bit and 8-bit quantization options
  - CPU fallback for low-memory situations
  - Automatic parameter adjustment based on available memory
- Performance Optimizations: Built-in tools to improve performance:
  - Memory-efficient settings for lower GPU usage
  - Attention slicing for handling larger prompts
  - Precision control for speed/memory tradeoffs
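For a sense of what the real-time memory monitor tracks, here is a minimal sketch using standard psutil and PyTorch calls; the project's own monitor (memory_monitor.py) is more elaborate:

```python
import psutil
import torch

def memory_snapshot():
    """Current CPU and GPU memory usage in GB (illustrative sketch)."""
    usage = {"cpu_used_gb": psutil.virtual_memory().used / 1e9}
    if torch.cuda.is_available():
        usage["gpu_allocated_gb"] = torch.cuda.memory_allocated() / 1e9
        usage["gpu_reserved_gb"] = torch.cuda.memory_reserved() / 1e9
    return usage

print(memory_snapshot())
```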
- Python 3.10 or later
- PyQt6
- PyTorch 2.0 or later
- Transformers 4.38.2
- CUDA-capable GPU with at least 10GB memory (for optimal performance)
- CPU-only mode is also supported (slower but works on any machine)
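A quick way to check your environment against these requirements (a sketch using only standard version attributes):

```python
import sys
import torch
import transformers

print("Python      :", sys.version.split()[0])    # want 3.10+
print("PyTorch     :", torch.__version__)         # want 2.0+
print("Transformers:", transformers.__version__)  # want 4.38.2
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"GPU         : {gpu.name}, {gpu.total_memory / 1e9:.1f} GB VRAM")
else:
    print("GPU         : none found; CPU-only mode will be used")
```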
To install:

1. Clone this repository:

   ```bash
   git clone https://github.com/angrysky56/llada-gui.git
   cd llada-gui
   ```

2. Use the provided installation script:

   ```bash
   chmod +x install.sh
   ./install.sh
   ```
The script will:
- Create a virtual environment
- Install all required packages
- Set up desktop integration if applicable
3. Alternatively, you can manually set up the environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```
There are several ways to start the application:
1. Using the start script:

   ```bash
   ./start_gui.sh
   ```

2. Direct Python execution:

   ```bash
   ./venv/bin/python run.py
   ```

3. Using the desktop file (if installed): double-click the `LLaDA_GUI.desktop` file in your applications menu or desktop.
Once the application is running:

1. Enter your prompt in the text input area.
2. Adjust generation parameters as needed (the sketch after these steps shows how they map onto the underlying sampler):
   - Generation Length: Number of tokens to generate
   - Sampling Steps: Number of diffusion steps (higher = better quality but slower)
   - Block Length: Size of blocks for semi-autoregressive generation
   - Temperature: Controls randomness (0 = deterministic, higher = more random)
   - CFG Scale: Classifier-free guidance strength
   - Remasking Strategy: Method to select which tokens remain masked
3. Select hardware options:
   - Choose between CPU or GPU
   - Select memory optimization (normal precision, 8-bit, or 4-bit quantization)
4. Click "Generate" to start the process.
5. Watch the diffusion process in the visualization tab.
6. View the final output in the text output tab.
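The parameter names above correspond closely to the arguments of the generate() sampler in the official LLaDA reference code. Here is a hedged sketch of the equivalent programmatic call; the argument names follow that reference implementation as of this writing and may change upstream, and generate must be imported from the reference repo's generate.py:

```python
import torch
from transformers import AutoModel, AutoTokenizer

from generate import generate  # sampler from the official LLaDA reference repo

tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)
model = AutoModel.from_pretrained("GSAI-ML/LLaDA-8B-Instruct",
                                  trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).eval()

prompt_ids = tokenizer("Explain masked diffusion in one paragraph.", return_tensors="pt")["input_ids"]
out = generate(model, prompt_ids,
               gen_length=128,              # Generation Length
               steps=128,                   # Sampling Steps
               block_length=32,             # Block Length
               temperature=0.0,             # Temperature (0 = deterministic)
               cfg_scale=0.0,               # CFG Scale
               remasking="low_confidence")  # Remasking Strategy
print(tokenizer.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True))
```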
If you encounter out-of-memory errors:
- Reduce Generation Length and Sampling Steps
- Try 8-bit or 4-bit quantization options
- Switch to CPU mode if necessary (will be slower but more reliable)
- Use the built-in performance optimizer (described below)
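If you want to reproduce the 4-bit option outside the GUI, here is a minimal sketch using the standard transformers/bitsandbytes path; the GUI's own loader may differ, and the model ID is the official Hugging Face release:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch of 4-bit loading via bitsandbytes; the GUI's loader may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # 4-bit weights, bf16 compute
)
model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",  # spill layers to CPU if VRAM runs short
)
```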
This application includes built-in performance optimization tools that can significantly reduce memory usage and improve generation speed.
1. Launch the optimizer:

   ```bash
   python optimize_launcher.py
   ```

   Or use the desktop shortcut: double-click the `LLaDA_Optimizer.desktop` file.

2. Select optimizations in the GUI:
   - GPU Memory Optimizations
   - Config File Patches
   - Worker Code Optimizations

3. Apply the optimizations by clicking "Apply Optimizations".

4. Restart the application to use the optimized version.
Unlike autoregressive models that generate one token at a time, LLaDA works by:
1. Starting with a completely masked sequence of the desired length
2. At each step, predicting values for all masked tokens simultaneously
3. Based on prediction confidence, keeping some tokens and remasking others
4. Repeating until all tokens are predicted
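A toy sketch of that loop; predict_fn stands in for the model, MASK_ID is a hypothetical mask marker, and the keep/remask rule mirrors the confidence-based remasking described above in spirit only:

```python
import torch

MASK_ID = -1  # hypothetical mask marker for this toy example

def toy_masked_diffusion(predict_fn, length, steps):
    """Predict every masked position, keep the highest-confidence
    predictions, leave the rest masked, repeat. Illustrative only."""
    tokens = torch.full((length,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        probs = predict_fn(tokens)      # (length, vocab) probabilities
        conf, pred = probs.max(dim=-1)  # confidence and argmax token per slot
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        k = max(1, int(masked.sum()) // (steps - step))  # equal share per step
        keep = conf.topk(k).indices     # highest-confidence masked positions
        tokens[keep] = pred[keep]
    return tokens

# Tiny demo with a random "model" over a 10-token vocabulary.
demo_fn = lambda t: torch.softmax(torch.randn(t.numel(), 10), dim=-1)
print(toy_masked_diffusion(demo_fn, length=16, steps=4))
```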
The visualization tab shows this process in action, with:
- Gray boxes for masked tokens
- Colored boxes for predicted tokens (color intensity indicates confidence)
The application is organized into the following components:
- `llada_gui.py`: Main GUI application code
- `llada_worker.py`: Worker thread for asynchronous model execution
- `diffusion_visualization.py`: Visualization of the diffusion process
- `memory_monitor.py`: Real-time memory usage tracking
- `config.py`: Application configuration and constants
- `utils.py`: Utility functions
- `run.py`: Entry point script
- `optimizations/`: Performance optimization tools
- `onnx/`: Experimental ONNX conversion utilities
This GUI is built on top of the LLaDA model developed by researchers at the Gaoling School of Artificial Intelligence, Renmin University of China. Please cite their work when using this application:
```bibtex
@article{nie2025large,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```
This application is provided as-is under the MIT License. See the LICENSE file for details.
The LLaDA model has its own license from the original developers. Please refer to the original repository for more information.
Contributions are welcome! Please feel free to submit a Pull Request.