Katsutami7moto/online-library-parser

online-library-parser

A tool to parse and download books from tululu.org

How to install

Python 3 should already be installed. Download the ZIP archive of the code and unzip it. Then open a terminal in the unzipped directory and use pip (or pip3, if there is a conflict with Python 2) to install the dependencies:

pip install -r requirements.txt

How to use

This project contains two scripts: parse_tululu.py and parse_tululu_category.py.

Both scripts download book texts in *.txt format and, if a book has a cover, the cover image (usually in *.jpg format). Texts are stored in the books directory and covers in the images directory; both directories are created automatically.

If the text file for a book is unavailable, the tool skips that book and continues with the next one; the cover for the skipped book is not downloaded either, even if one exists.
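
The skip behaviour described above can be sketched roughly as follows. This is only an illustration, not the code of either script: the `TextUnavailable` exception, the `fetch_text` and `fetch_cover` callables, and the `<id>.txt` / `<id>.jpg` file names are all hypothetical and stand in for whatever the real scripts do when they request a book from the site.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


class TextUnavailable(Exception):
    """Hypothetical error: raised by a fetcher when a book has no text file."""


def download_books(
    book_ids: Iterable[int],
    fetch_text: Callable[[int], str],
    fetch_cover: Callable[[int], Optional[bytes]],
    dest: Path = Path("."),
) -> List[int]:
    """Save texts to dest/books and covers to dest/images; return saved IDs."""
    books_dir = dest / "books"
    images_dir = dest / "images"
    # Both directories are created automatically if they are missing.
    books_dir.mkdir(parents=True, exist_ok=True)
    images_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for book_id in book_ids:
        try:
            text = fetch_text(book_id)
        except TextUnavailable:
            # No text: the cover is skipped too, and the loop moves on.
            continue
        # File names based on the ID are an assumption made for this sketch.
        (books_dir / f"{book_id}.txt").write_text(text, encoding="utf-8")
        cover = fetch_cover(book_id)
        if cover is not None:  # not every book has a cover
            (images_dir / f"{book_id}.jpg").write_bytes(cover)
        saved.append(book_id)
    return saved
```

Passing the fetchers in as arguments keeps the sketch independent of any HTTP library; in practice they would wrap the actual network requests.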

The differences between the two scripts are described below.

Parse and download books by ID

You can run the first script in two ways:

  • without command line arguments, to download the books with the first 10 IDs:
python3 parse_tululu.py
  • with two command line arguments, to download books from start_id to end_id, inclusive:
python3 parse_tululu.py --start_id 20 --end_id 30

or in short notation:

python3 parse_tululu.py -s 20 -e 30

If you pass arguments, both must be provided, and start_id must be less than end_id.
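
For illustration, the argument rules above could be expressed with argparse along these lines. This is a sketch under stated assumptions, not the parser from parse_tululu.py: the function name `parse_id_range` is hypothetical, and "the first 10 IDs" is assumed to mean IDs 1 through 10.

```python
import argparse


def parse_id_range(argv=None):
    """Return (start_id, end_id) according to the rules described above."""
    parser = argparse.ArgumentParser(description="Download books by ID range")
    parser.add_argument("-s", "--start_id", type=int, default=None)
    parser.add_argument("-e", "--end_id", type=int, default=None)
    args = parser.parse_args(argv)

    if args.start_id is None and args.end_id is None:
        # Assumption: "the first 10 IDs" means IDs 1 through 10.
        return 1, 10
    if args.start_id is None or args.end_id is None:
        # parser.error() prints a usage message and exits with status 2.
        parser.error("--start_id and --end_id must be provided together")
    if args.start_id >= args.end_id:
        parser.error("--start_id must be less than --end_id")
    return args.start_id, args.end_id
```

Because the range is inclusive, a caller would iterate over `range(start_id, end_id + 1)`.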

Parse and download sci-fi books by pages

The second script, parse_tululu_category.py, has several optional arguments:

  • --start_page (default value is 1) and --end_page (default value is the automatically detected number of the last page) define the range of sci-fi category pages, inclusive, from which books will be downloaded:
python3 parse_tululu_category.py --start_page 238 --end_page 347
  • --dest_folder (default value is the folder where the script is stored) defines where directories for book texts and covers will be created:
python3 parse_tululu_category.py --start_page 238 --end_page 347 --dest_folder /home/username/grandpa_scifi_books/files
  • --json_path (default value is the folder where the script is stored) defines where the books_catalog.json file with the information about those books will be created:
python3 parse_tululu_category.py --start_page 238 --end_page 347 --dest_folder /home/username/grandpa_scifi_books/files --json_path /home/username/grandpa_scifi_books/metadata
  • --skip_txt and --skip_img, if added, disable download of text files or pictures, respectively:
python3 parse_tululu_category.py --start_page 238 --end_page 347 --json_path /home/username/grandpa_scifi_books/metadata --skip_txt --skip_img
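
The options listed above map naturally onto an argparse parser. The sketch below is an assumption about how such a parser might be laid out, not the actual code of parse_tululu_category.py; `build_parser` and `resolve_paths` are hypothetical helper names, and `end_page=None` is used here to mean "detect the last page at run time".

```python
import argparse
from pathlib import Path


def build_parser(script_dir: Path) -> argparse.ArgumentParser:
    """Build a parser whose path defaults point at the script's own folder."""
    parser = argparse.ArgumentParser(description="Download sci-fi books by page")
    parser.add_argument("--start_page", type=int, default=1)
    # None stands for "use the automatically detected last page".
    parser.add_argument("--end_page", type=int, default=None)
    parser.add_argument("--dest_folder", type=Path, default=script_dir)
    parser.add_argument("--json_path", type=Path, default=script_dir)
    # Flags: absent means download, present means skip.
    parser.add_argument("--skip_txt", action="store_true")
    parser.add_argument("--skip_img", action="store_true")
    return parser


def resolve_paths(args: argparse.Namespace) -> dict:
    """Derive the output locations described above from the parsed arguments."""
    return {
        "books": args.dest_folder / "books",
        "images": args.dest_folder / "images",
        "catalog": args.json_path / "books_catalog.json",
    }
```

In a script, the parser would typically be created with `build_parser(Path(__file__).resolve().parent)` so that the defaults match the folder where the script is stored.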

Create website

  1. Delete the media and pages directories from the downloaded repository.
  2. Download books from, e.g., the first 5 pages with this command:
python3 parse_tululu_category.py --end_page 5
  3. Run this command to create the pages of the website using the information from the books_catalog.json file:
python3 render_website.py

You can open the website here while the script is running locally, or stop the script and open the pages from the pages directory. Press the Читать ("Read") button to open the text file of a book.
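
To give an idea of what the rendering step involves, here is a minimal standard-library sketch that reads a catalog file and writes paginated HTML files. It is not the code of render_website.py, which may well use a template engine and a local server. The assumptions are: the catalog is a JSON list of objects, each object has hypothetical "title" and "author" fields, and pages are named index1.html, index2.html, and so on.

```python
import html
import json
from pathlib import Path


def render_pages(catalog_path: Path, pages_dir: Path, per_page: int = 10) -> int:
    """Write one HTML file per chunk of books; return the number of pages."""
    # Assumption: the catalog is a JSON list of book objects.
    books = json.loads(catalog_path.read_text(encoding="utf-8"))
    pages_dir.mkdir(parents=True, exist_ok=True)

    # Split the catalog into consecutive chunks of at most per_page books.
    chunks = [books[i:i + per_page] for i in range(0, len(books), per_page)]
    for number, chunk in enumerate(chunks, start=1):
        items = "\n".join(
            "<li>{} ({})</li>".format(
                # "title" and "author" are hypothetical field names.
                html.escape(str(book.get("title", ""))),
                html.escape(str(book.get("author", ""))),
            )
            for book in chunk
        )
        page = f"<!doctype html>\n<html><body><ul>\n{items}\n</ul></body></html>\n"
        (pages_dir / f"index{number}.html").write_text(page, encoding="utf-8")
    return len(chunks)
```

A call such as `render_pages(Path("books_catalog.json"), Path("pages"))` would then produce files that can be opened directly in a browser.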

GitHub Pages website

An example of the website is available here.

It should look like this:

[screenshot of the website]

Project Goals

The code was written for educational purposes as part of dvmn.org, an online course for web developers.
