A powerful and user-friendly Python tool to scrape and download images from websites with optional optimization features. Perfect for bulk image downloading, research, and content collection.
- Bulk Image Downloading: Download all images from any webpage
- Smart URL Detection: Automatically handles relative and absolute URLs
- Lazy Loading Support: Detects and downloads lazy-loaded images (`data-src`)
- Image Optimization: Optional compression to reduce file sizes
- Concurrent Downloads: Fast multi-threaded downloading
- GUI Directory Selection: Easy-to-use file dialog for choosing save location
- Duplicate Prevention: Automatic filename collision handling
- Comprehensive Logging: Detailed logs with timestamps
- Progress Tracking: Real-time download progress and statistics
- Error Handling: Robust error recovery and reporting
- Python 3.6+
- Required packages (a plausible `requirements.txt` is shown below):
  - `requests` - HTTP library for downloading
  - `beautifulsoup4` - HTML parsing
  - `pillow` - Image processing (optional, for optimization)
  - `tkinter` - GUI components (usually included with Python)
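For reference, a `requirements.txt` matching the list above might look like this (the version pins are illustrative assumptions, not taken from the project; `tkinter` ships with most Python installs and is not pip-installable, so it is omitted):

```text
requests>=2.25
beautifulsoup4>=4.9
pillow>=8.0
```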
```bash
# Clone the repository
git clone https://github.com/mahdidevlp/image-scraper.git
cd image-scraper

# Install dependencies
pip install -r requirements.txt

# Run the script
python image_scraper.py
```
```bash
# Clone and set up
git clone https://github.com/mahdidevlp/image-scraper.git
cd image-scraper

# Create a virtual environment
python -m venv image_scraper_env

# Activate the virtual environment
source image_scraper_env/bin/activate  # Linux/Mac
# or
image_scraper_env\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Run the script
python image_scraper.py
```
```bash
# Make the script executable
chmod +x image_scraper.py

# Create a symbolic link (optional)
sudo ln -s $(pwd)/image_scraper.py /usr/local/bin/image-scraper

# Run from anywhere
image-scraper
```
Simply run the script and follow the prompts:
```bash
python image_scraper.py
```
The script will ask you for:
- Website URL - The webpage to scrape images from
- Optimization preference - Whether to compress images
- Save directory - Where to save the downloaded images
```text
Enter the webpage URL to scrape images from: https://example.com/gallery
Optimize images to reduce file size? (y/n): y
[GUI dialog opens for directory selection]
```
```text
2025-01-07 15:30:45 [INFO] Fetching images from https://example.com/gallery
2025-01-07 15:30:46 [INFO] Found 25 images.
2025-01-07 15:30:46 [INFO] Downloading 25 images to /home/user/Downloads/images
2025-01-07 15:30:47 [INFO] (1/25) Downloaded: gallery_image_001.jpg
2025-01-07 15:30:47 [INFO] (2/25) Downloaded: photo_002.png
...
2025-01-07 15:31:15 [INFO] Download Summary:
2025-01-07 15:31:15 [INFO] Total images found: 25
2025-01-07 15:31:15 [INFO] Successfully downloaded: 23
2025-01-07 15:31:15 [INFO] Failed to download: 2
2025-01-07 15:31:15 [INFO] Images saved to: /home/user/Downloads/images
```
- Finds images in `<img>` tags with `src` attributes
- Detects lazy-loaded images using `data-src` attributes
- Handles both relative and absolute URLs
- Removes duplicate URLs automatically (see the sketch below)
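As a rough illustration of that extraction step, a minimal sketch with BeautifulSoup might look like this (not the project's exact code; `get_image_urls` is the function name mentioned in the Configuration section):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def get_image_urls(page_url):
    """Collect absolute, de-duplicated image URLs from a page (sketch)."""
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    urls = []
    for img in soup.find_all("img"):
        # Prefer the lazy-loading attribute when present, fall back to src
        src = img.get("data-src") or img.get("src")
        if src:
            # urljoin resolves relative paths against the page URL
            urls.append(urljoin(page_url, src))

    # Preserve order while removing duplicates
    return list(dict.fromkeys(urls))
```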
- Downloads up to 5 images simultaneously
- Significantly faster than sequential downloading
- Configurable thread pool size (the pattern is sketched below)
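The concurrency pattern is roughly the following (a minimal sketch, assuming a `download_image(url, directory)` helper like the one named in the Configuration section):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, directory, max_workers=5):
    """Download URLs concurrently and tally per-file results (sketch)."""
    succeeded, failed = 0, 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL for error reporting
        futures = {pool.submit(download_image, url, directory): url for url in urls}
        for future in as_completed(futures):
            try:
                future.result()  # re-raises any exception from the worker
                succeeded += 1
            except Exception as exc:
                failed += 1
                print(f"Failed: {futures[future]} ({exc})")
    return succeeded, failed
```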
- Optional JPEG compression (85% quality)
- Converts RGBA/LA images to RGB
- Typically reduces file sizes by 20-50%
- Preserves image quality while saving space (see the Pillow sketch below)
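The optimization step boils down to a Pillow re-save along these lines (a sketch based on the behavior described above; `optimize_image` is the function name mentioned in the Configuration section):

```python
from PIL import Image

def optimize_image(path, quality=85):
    """Re-save an image as an optimized JPEG (sketch)."""
    with Image.open(path) as img:
        # JPEG has no alpha channel, so flatten RGBA/LA images first
        if img.mode in ("RGBA", "LA"):
            img = img.convert("RGB")
        img.save(path, "JPEG", quality=quality, optimize=True)
```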
- Automatic filename generation for images without extensions
- Collision detection and automatic renaming
- Preserves original filenames when possible
- Creates directories automatically (collision handling is sketched below)
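Collision handling can be as simple as suffixing a counter until the name is free; a minimal sketch (the fallback extension is an illustrative assumption):

```python
import os

def unique_path(directory, filename):
    """Return a save path that does not clash with existing files (sketch)."""
    os.makedirs(directory, exist_ok=True)  # create the directory if needed
    base, ext = os.path.splitext(filename)
    ext = ext or ".jpg"  # fall back for URLs without an extension
    candidate = os.path.join(directory, base + ext)
    counter = 1
    while os.path.exists(candidate):
        # Append _1, _2, ... until an unused name is found
        candidate = os.path.join(directory, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```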
- Real-time progress tracking
- Detailed error reporting
- Timestamped log files
- Console and file logging (setup sketched below)
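A timestamped console-plus-file setup matching the sample output above typically looks like this (a sketch; the log filename is an illustrative assumption):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[
        logging.StreamHandler(),                   # console output
        logging.FileHandler("image_scraper.log"),  # persistent log file
    ],
)
```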
You can customize the script behavior by editing these variables:
```python
# In the download_image function
max_workers = 5    # Concurrent download threads
timeout = 10       # Request timeout in seconds
chunk_size = 8192  # Download chunk size in bytes

# In the optimize_image function
quality = 85       # JPEG compression quality (1-100)
optimize = True    # Enable PIL optimization
```
Modify the headers in the `get_image_urls()` function:

```python
headers = {
    'User-Agent': 'Your Custom User Agent',
    'Referer': 'https://example.com',
    # Add more headers as needed
}
```
- JPEG/JPG - Full support with optimization
- PNG - Full support with RGBA conversion
- GIF - Download only (no optimization)
- BMP - Full support
- WebP - Full support
- TIFF - Download support
- "Module not found" errors: install the dependencies with `pip install requests beautifulsoup4 pillow`
- Permission denied errors: make the script executable with `chmod +x image_scraper.py`
- No images found:
  - Check if the website uses JavaScript to load images
  - Try a different webpage
  - Verify the URL is accessible
- Download failures (see the request sketch below):
  - Some websites block automated requests
  - Images might be behind authentication
  - Check your internet connection
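When a site rejects automated requests, sending browser-like headers (as in the Configuration section) together with an explicit timeout is often enough. A minimal sketch (the User-Agent string and URL are illustrative):

```python
import requests

headers = {
    # A browser-like User-Agent; some servers reject the library default
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Referer": "https://example.com",
}
response = requests.get("https://example.com/gallery", headers=headers, timeout=10)
response.raise_for_status()  # raises on 4xx/5xx instead of failing silently
```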
Enable debug logging by modifying the logging level:
```python
logging.basicConfig(level=logging.DEBUG)
```
```bash
# Planned command-line interface (not yet implemented)
python image_scraper.py --url https://example.com --output ./images --optimize --max-workers 10
```
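If you want to prototype those flags yourself before they land, an argparse skeleton might look like this (purely illustrative; the CLI is still on the roadmap):

```python
import argparse

parser = argparse.ArgumentParser(description="Scrape and download images from a webpage.")
parser.add_argument("--url", required=True, help="Webpage to scrape images from")
parser.add_argument("--output", default="./images", help="Directory to save images")
parser.add_argument("--optimize", action="store_true", help="Compress downloaded images")
parser.add_argument("--max-workers", type=int, default=5, help="Concurrent download threads")
args = parser.parse_args()
```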
Create a list of URLs in a text file and process them:
```bash
# Create urls.txt with one URL per line
# Run the batch processing script (see examples/)
```
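A batch driver can be a few lines on top of the existing functions (a sketch, assuming the `get_image_urls` and `download_image` helpers named in the Configuration section):

```python
def process_batch(list_file, output_dir):
    """Scrape every URL listed in a text file, one per line (sketch)."""
    with open(list_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for page_url in urls:
        for image_url in get_image_urls(page_url):
            download_image(image_url, output_dir)

process_batch("urls.txt", "./images")
```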
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Install development dependencies: `pip install -r requirements-dev.txt`
- Make your changes and add tests
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
- Follow PEP 8 guidelines
- Use meaningful variable names
- Add docstrings to functions
- Include error handling
- Write unit tests for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- Beautiful Soup - HTML parsing library
- Requests - HTTP library for Python
- Pillow - Python Imaging Library
- Tkinter - GUI toolkit
If you encounter any issues or have questions:
- Check the Issues page
- Create a new issue with:
- Python version and OS
- Error messages or logs
- URL you're trying to scrape (if not sensitive)
- Steps to reproduce the issue
- Command-line interface with arguments
- Batch processing from URL lists
- GUI application using tkinter
- Website-specific scrapers (Instagram, Pinterest, etc.)
- Image filtering by size/format
- Resume interrupted downloads
- Configuration file support
- Docker container
- Progress bar for GUI mode
Typical Performance:
- Small images (< 100KB): 10-20 images/second
- Medium images (100KB-1MB): 5-10 images/second
- Large images (> 1MB): 2-5 images/second
Factors affecting speed:
- Internet connection speed
- Server response time
- Image file sizes
- Number of concurrent workers
Made with ❤️ for content creators and researchers