This simple project automatically generates backups of specific webpages.
The Python script alone does not run repeatedly, so we need to set up a cron job. Of course, the cron job only runs while the computer is on, so setting it up on a server is recommended.
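To give an idea of what the script itself does, here is a minimal sketch of the core idea: download each configured webpage and store a timestamped copy in the project's archive folder. This is only an illustration, not the actual code in `main.py`, and the helper name `fetch_and_archive` is made up:

```python
# Minimal sketch of the core idea (illustration only, not the actual main.py).
from datetime import datetime
from pathlib import Path
from urllib.request import urlopen

def fetch_and_archive(url: str, archive_dir: Path = Path("archive")) -> Path:
    """Download one webpage and save it under a timestamped file name."""
    html = urlopen(url, timeout=30).read()
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    # Derive a rough file name from the URL, e.g. "https://example.com/a" -> "example.com_a".
    name = url.split("//")[-1].rstrip("/").replace("/", "_")
    target = archive_dir / f"{name}_{stamp}.html"
    target.write_bytes(html)
    return target
```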
Run `setup.sh`. You may have to make the file executable first with `chmod +x setup.sh`.

Since Git can't index empty folders and I don't want to work with a `.gitkeep` file, you need to manually create a folder called "archive" in this project with `mkdir archive`.
Open the crontab file with `crontab -e` and create a new cron job by adding a line following the scheme `minute hour day-of-month month day-of-week command-to-execute`.

The time parameters can be chosen freely, but the command must call the web scraper script. Make sure you specify the correct path to the Python file; if `main.py` is not executable (shebang line plus `chmod +x`), prefix the path with your Python interpreter.
The new line could look like one of the following examples:
- Run the web scraper every full hour:
  `0 * * * * /usr/stupid-web-scraper/main.py`
- Run the web scraper every 15 minutes from 8 am to 6 pm, Monday to Friday:
  `*/15 8-18 * * 1-5 /usr/stupid-web-scraper/main.py`
See wiki.ubuntuusers.de/Cron for more information.
Save the file and make sure cron is actually running with `service cron status`. If it is not, start the service with `sudo service cron start`.
Store the links to all webpages to be backed up in the `url_list.csv` file.
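The exact CSV layout that `main.py` expects is not documented here; a plausible minimal format is a single column with one URL per line, for example (the URLs are placeholders):

```
https://example.com
https://example.org/some/page
```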
All other settings can be adjusted in the `config.py` file.
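The concrete option names in `config.py` are not listed here. As a rough illustration, based on the adjustable features mentioned in the list further down (archive path, save limit, random delay), it might look something like this; all names and values are hypothetical:

```python
# Hypothetical sketch of config.py -- the real option names may differ.
ARCHIVE_DIR = "archive"          # where the downloaded pages are stored
URL_LIST_FILE = "url_list.csv"   # file containing the webpages to back up
MAX_SAVED_PAGES = 100            # limit for saved copies per webpage
RANDOM_DELAY_SECONDS = (0, 60)   # random wait range before each request
```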
Start the cron jobs: `sudo service cron start`

Stop the cron jobs: `sudo service cron stop`
- Core functionality to scrape multiple webpages
- Add logging
- Set configs in a separate file
- Set a limit for the number of saved web pages
- Add a random delay when retrieving the web pages
- Make the path to the backup (archive) folder adjustable
- Implementation for different operating systems
  - Linux (using a cron job)
  - Windows (using the Task Scheduler)
  - Docker
- Parallelize the web requests (see the sketch below)
- Send regular summaries as emails
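For the planned parallelization of the web requests, one possible approach is a thread pool from Python's standard `concurrent.futures` module. The sketch below is only an illustration of that idea, not existing project code; `fetch` and `fetch_all` are hypothetical names:

```python
# Sketch of parallel page downloads with a thread pool (illustration only).
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    """Download a single page; the real script would also save it to the archive."""
    return urlopen(url, timeout=30).read()

def fetch_all(urls: list[str], max_workers: int = 8) -> dict[str, bytes]:
    """Fetch several pages concurrently and return their contents keyed by URL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order, so zipping with urls is safe.
        return dict(zip(urls, pool.map(fetch, urls)))
```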