WebGroper

WebGroper is a Python class designed to recursively scrape and download media files (images, PDFs, etc.) from a specified website directory, such as the /wp-content/uploads directory of a WordPress site.

Features

Recursively traverses URLs to find and download media files.
Ignores resized images generated by WordPress.
Saves downloaded files in a structured directory.
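
As a rough illustration of the second feature: WordPress typically stores resized copies with a `-<width>x<height>` suffix before the extension (for example `photo-150x150.jpg`). The sketch below filters such names using the same pattern that the usage example passes as `ignore_sizes_regex`. The helper name `is_resized_copy` and the sample file names are hypothetical, and the class's internal logic may differ.

```python
import re

# Same pattern as the ignore_sizes_regex value in the usage example.
RESIZED_PATTERN = re.compile(r"-\d+x\d+\.[a-z]+")


def is_resized_copy(filename: str) -> bool:
    """Return True if the name looks like a WordPress-generated resized image."""
    # search() rather than match(): the size suffix sits near the end of the name.
    return RESIZED_PATTERN.search(filename) is not None


# Hypothetical file names, for illustration only.
names = ["photo.jpg", "photo-150x150.jpg", "report.pdf", "banner-1024x768.png"]
originals = [n for n in names if not is_resized_copy(n)]
print(originals)  # ['photo.jpg', 'report.pdf']
```

Note that the pattern only matches lowercase extensions, so a file such as `photo-150x150.JPG` would not be filtered unless the regex is adjusted.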

Requirements

Python 3.x
requests library
beautifulsoup4 library

Installation

Clone the repository or download the script.
Install the required libraries using pip:
```
pip install requests beautifulsoup4
```

Usage

Create an instance of the WebGroper class with the desired parameters.
Call the traverse_url_recursive method with the starting URL.

Example:

```python
from webgroper import WebGroper

# Initialize the WebGroper class
web_groper = WebGroper(
    output_directory="groped_data",
    time_between_download_requests=1,
    ignore_sizes_regex=r"-\d+x\d+\.[a-z]+"
)

# Start scraping from the specified URL
web_groper.traverse_url_recursive("https://example-wordpress-site.com/wp-content/uploads/")
```
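
To give a sense of what a recursive traversal involves, the sketch below parses one directory-listing page and separates its links into subdirectories (to visit next) and files (to download). It is not WebGroper's actual implementation: it uses the standard library's `html.parser` in place of beautifulsoup4 so that it runs without a network connection or extra packages, and the names `LinkCollector` and `split_links`, along with the sample HTML, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(base_url, html):
    """Return (subdirectories, files) found below base_url."""
    parser = LinkCollector()
    parser.feed(html)
    directories, files = [], []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        # Skip parent-directory links, sort-order links, and other locations.
        if not url.startswith(base_url) or url == base_url or "?" in href:
            continue
        (directories if url.endswith("/") else files).append(url)
    return directories, files


# Hypothetical directory listing, similar to a server-generated index page.
sample = """
<a href="?C=N;O=D">Name</a>
<a href="/wp-content/">Parent Directory</a>
<a href="2024/">2024/</a>
<a href="logo.png">logo.png</a>
"""
base = "https://example-wordpress-site.com/wp-content/uploads/"
dirs, files = split_links(base, sample)
print(dirs)   # ['https://example-wordpress-site.com/wp-content/uploads/2024/']
print(files)  # ['https://example-wordpress-site.com/wp-content/uploads/logo.png']
```

A recursive scraper would presumably repeat this step for each entry in `dirs`, pausing between requests (as the `time_between_download_requests` parameter suggests) to avoid overloading the server.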

developer-sumit/web-groper-python

Folders and files

Latest commit

History

Repository files navigation

WebGroper

Features

Requirements

Installation

Usage

About

Topics

Resources

Stars

Watchers

Forks

Languages