WebGroper is a Python class designed to recursively scrape and download media files (images, PDFs, etc.) from a specified website directory, such as the /wp-content/uploads
directory of a WordPress site.
- Recursively traverses URLs to find and download media files.
- Ignores resized images generated by WordPress.
- Saves downloaded files in a structured directory.
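To illustrate the resized-image rule, here is a minimal sketch of how a pattern such as the `ignore_sizes_regex` value from the usage example could be used to filter file names. The helper `is_resized_variant` and the use of `re.search` are assumptions for illustration; how WebGroper applies the pattern internally is not shown in this README.

```python
import re

# Same pattern as the ignore_sizes_regex value in the usage example.
IGNORE_SIZES = re.compile(r"-\d+x\d+\.[a-z]+")


def is_resized_variant(filename: str) -> bool:
    """Hypothetical helper: True if the name looks like a WordPress resized copy."""
    return IGNORE_SIZES.search(filename) is not None


names = ["photo.jpg", "photo-150x150.jpg", "photo-1024x768.png", "report-2020.pdf"]

# Keep only names that do not carry a "-<width>x<height>" suffix.
originals = [n for n in names if not is_resized_variant(n)]
print(originals)  # ['photo.jpg', 'report-2020.pdf']
```

Note that the pattern only matches lowercase extensions, so a name like `photo-150x150.JPG` would not be treated as a resized copy unless the pattern is adjusted.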
- Python 3.x
- requests library
- beautifulsoup4 library
- Clone the repository or download the script.
- Install the required libraries using pip:
pip install requests beautifulsoup4
- Create an instance of the WebGroper class with the desired parameters.
- Call the traverse_url_recursive method with the starting URL.
Example:
from webgroper import WebGroper
# Initialize the WebGroper class
web_groper = WebGroper(
output_directory="groped_data",
time_between_download_requests=1,
ignore_sizes_regex=r"-\d+x\d+\.[a-z]+"
)
# Start scraping from the specified URL
web_groper.traverse_url_recursive("https://example-wordpress-site.com/wp-content/uploads/")
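
For readers curious about what one step of a recursive traversal might look like, the sketch below splits the links on a directory-listing page into subdirectories (to visit next) and files (to download). It is not WebGroper's actual implementation: the names `LinkCollector` and `split_listing` are hypothetical, the sample HTML assumes an Apache-style index page, and the standard library's `html.parser` is used in place of beautifulsoup4 so the snippet runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_listing(base_url, html):
    """Return (subdirectories, files) linked from a directory-listing page."""
    parser = LinkCollector()
    parser.feed(html)
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links, site-absolute paths, and parent-directory links.
        if href.startswith(("?", "/", "..")):
            continue
        url = urljoin(base_url, href)
        # Stay inside the starting directory.
        if not url.startswith(base_url):
            continue
        (dirs if href.endswith("/") else files).append(url)
    return dirs, files


# Assumed shape of an Apache-style index page.
sample = """
<a href="?C=N;O=D">Name</a>
<a href="/wp-content/">Parent Directory</a>
<a href="2023/">2023/</a>
<a href="logo.png">logo.png</a>
"""
base = "https://example-wordpress-site.com/wp-content/uploads/"
dirs, files = split_listing(base, sample)
print(dirs)   # ['https://example-wordpress-site.com/wp-content/uploads/2023/']
print(files)  # ['https://example-wordpress-site.com/wp-content/uploads/logo.png']
```

A recursive scraper would fetch each URL in `dirs`, repeat the same split on the returned page, and pause between requests, which is presumably the role of the time_between_download_requests parameter.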