Large Language Models (LLMs) are typically distributed in formats optimized for training (such as PyTorch checkpoints) and can be extremely large (hundreds of gigabytes), making them impractical for most real-world applications. This tool addresses two critical challenges in LLM deployment:
- Size: Original models are too large to run on consumer hardware
- Format: Training formats are not optimized for inference
Explore and experiment with the LLM Quantization tool on Hugging Face Spaces: LLM Quantization Demo
I built this tool to help AI researchers:
- Convert models from Hugging Face to the GGUF format (optimized for inference)
- Quantize models to reduce their size while maintaining acceptable performance
- Deploy models on consumer hardware (laptops, desktops) with limited resources
- LLMs in their original format require significant computational resources
- Running these models typically needs:
  - High-end GPUs
  - Large amounts of RAM (32GB+)
  - Substantial storage space
  - Complex software dependencies
This tool provides:
- Format Conversion (see the conversion sketch after this list)
  - Converts from PyTorch/Hugging Face format to GGUF
  - GGUF is specifically designed for efficient inference
  - Enables memory mapping for faster loading
  - Reduces dependency requirements
- Quantization
  - Reduces model size by roughly 4-8x
  - Converts weights from FP16/FP32 to lower-precision formats (e.g., 8-bit or 4-bit)
  - Maintains reasonable model performance
  - Makes models runnable on consumer-grade hardware
- Accessibility
  - Enables running LLMs on standard laptops
  - Reduces RAM requirements
  - Speeds up model loading and inference
  - Simplifies the deployment process
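As a point of reference, conversion from a Hugging Face checkpoint to GGUF is typically done with llama.cpp's conversion script. A minimal sketch with example paths (the script name and flags vary between llama.cpp versions):

```bash
# Convert a locally downloaded Hugging Face model directory into a single
# F16 GGUF file using llama.cpp's conversion script (example paths)
python convert_hf_to_gguf.py ./models/my-model \
    --outfile ./models/my-model-f16.gguf \
    --outtype f16
```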
This tool helps developers and researchers to:
- Download LLMs from Hugging Face Hub
- Convert models to GGUF (GPT-Generated Unified Format)
- Quantize models for efficient deployment
- Upload processed models back to Hugging Face
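The Hub-facing ends of this workflow can be sketched with the Hugging Face CLI. A rough outline with placeholder repository IDs, assuming `huggingface_hub` is installed and you are logged in:

```bash
# 1. Download the original model from the Hugging Face Hub (placeholder repo id)
huggingface-cli download <source-org>/<model-name> --local-dir ./models/my-model

# 2./3. Convert to GGUF and quantize (see the conversion and quantization sketches in this README)

# 4. Upload the quantized GGUF file to your own Hub repository (placeholder repo id)
huggingface-cli upload <your-username>/<model-name>-GGUF \
    ./models/my-model-Q4_K_M.gguf my-model-Q4_K_M.gguf
```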
- Model Download: Direct integration with Hugging Face Hub
- GGUF Conversion: Convert PyTorch models to GGUF format
- Quantization Options: Support for various quantization levels
- Batch Processing: Automate the entire conversion pipeline
- HF Upload: Option to upload processed models back to Hugging Face
| Quantization Type | Purpose | Benefits | When to Use |
|---|---|---|---|
| Q2_K | 2-bit K-quant | Smallest files and lowest memory use, with a noticeable quality drop | Highly memory-constrained environments |
| Q3_K_L | 3-bit K-quant, large variant | Best quality of the 3-bit options at a slightly larger size | When a small model with the best 3-bit quality is needed |
| Q3_K_M | 3-bit K-quant, medium variant | Good balance between size reduction and quality | When moderate precision and size reduction are desired |
| Q3_K_S | 3-bit K-quant, small variant | Smallest of the 3-bit options, at some cost in quality | When size matters more than quality at 3 bits |
| Q4_0 | Legacy 4-bit quantization | Reduced model size with modest impact on quality | When memory is limited and broad compatibility is needed |
| Q4_1 | Legacy 4-bit quantization with an extra offset per block | Slightly better quality than Q4_0 at a slightly larger size | When a balance of size and quality is required |
| Q4_K_M | 4-bit K-quant, medium variant | Good quality at a small size; a common default choice | General-purpose deployment on consumer hardware |
| Q4_K_S | 4-bit K-quant, small variant | Smaller than Q4_K_M with slightly lower quality | When size is more important than the last bit of quality |
| Q5_0 | Legacy 5-bit quantization | Higher precision than the 4-bit types at a larger size | When memory is not a major constraint and higher precision is desired |
| Q5_1 | Legacy 5-bit quantization with an extra offset per block | Slightly better quality than Q5_0 at a slightly larger size | For improved quality at the cost of some additional memory |
| Q5_K_M | 5-bit K-quant, medium variant | Near-original quality with good compression | When quality is crucial but space is still a concern |
| Q5_K_S | 5-bit K-quant, small variant | Slightly smaller than Q5_K_M with comparable quality | High-quality applications with moderate memory limits |
| Q6_K | 6-bit K-quant | Larger files, but quality very close to the original | When precision is critical and more space is available |
| Q8_0 | 8-bit quantization | Nearly lossless quality; the largest of the quantized formats | When quality matters most and an 8-bit model still fits in memory |
| BF16 | 16-bit brain floating point (no quantization) | Keeps the training-time dynamic range at half the size of FP32 | When high fidelity is needed with moderate memory usage |
| F16 | 16-bit floating point (no quantization) | Good precision and performance with moderate memory usage | When maintaining near-full precision is essential |
| F32 | 32-bit floating point (no quantization) | Highest precision and the largest files | When maximum precision is required |
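Whichever level is chosen, the quantization itself typically comes down to a single llama.cpp command. A sketch with example paths (the binary is named `llama-quantize` in recent llama.cpp builds, `quantize` in older ones):

```bash
# Quantize an F16 GGUF file to 4-bit; Q4_K_M is a common size/quality trade-off
./llama-quantize ./models/my-model-f16.gguf ./models/my-model-Q4_K_M.gguf Q4_K_M

# The same command with a different type name produces the other levels, e.g. Q8_0
./llama-quantize ./models/my-model-f16.gguf ./models/my-model-Q8_0.gguf Q8_0
```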
GGUF (GPT-Generated Unified Format) is a file format specifically designed for efficient deployment and inference of large language models. It offers several advantages:
- GGUF is specifically designed for model inference (running predictions) rather than training.
- It's the native format used by llama.cpp, a popular framework for running LLMs on consumer hardware.
- Reduces memory usage compared to the original PyTorch/Hugging Face formats.
- Allows running larger models on devices with limited RAM.
- Supports various quantization levels (reducing model precision from FP16/FP32 to INT8/INT4).
- Models in GGUF format can be memory-mapped (mmap), meaning they can be loaded partially as needed.
- Reduces initial loading time and memory overhead.
- Works well across different operating systems and hardware.
- Doesn't require Python or PyTorch installation.
- Can run on CPU-only systems effectively.
- Contains model configuration, tokenizer, and other necessary information in a single file.
- Makes deployment simpler as all required information is bundled together.
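To illustrate the last few points, a GGUF file can be run directly with the llama.cpp command-line tools on a CPU-only machine, with no Python or PyTorch installed. A sketch, assuming a local llama.cpp build and example paths (the binary is named `llama-cli` in recent builds, `main` in older ones):

```bash
# Run the quantized model on CPU; the weights are memory-mapped, so start-up is fast
./llama-cli -m ./models/my-model-Q4_K_M.gguf \
    -p "Explain quantization in one sentence." \
    -n 64
```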
```bash
# Clone the repository
git clone https://github.com/bhaskatripathi/LLM_Quantization.git
cd LLM_Quantization

# Install dependencies
pip install -r requirements.txt

# Run the Streamlit application
streamlit run app.py
```
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License.
- Python 3.8+
- Streamlit
- Hugging Face Hub account (for model download/upload)
- Sufficient storage space for model processing
The tool currently supports various model architectures including:
- DeepSeek models
- Mistral models
- Llama models
- Qwen models
- And more...
If you encounter any issues or have questions:
- Check the existing issues
- Create a new issue with a detailed description
- Include relevant error messages and environment details
- Hugging Face for the model hub
- llama.cpp for GGUF format implementation
- All contributors and maintainers
Made with ❤️ for the AI community