A tool to parse and download books from tululu.org
Python 3 should already be installed.
Download the ZIP archive of the code and unzip it.
Then open a terminal in the unzipped directory and use pip (or pip3, if there is a conflict with Python 2) to install dependencies:
pip install -r requirements.txt
This project contains two scripts: parse_tululu.py and parse_tululu_category.py.
Both scripts download book texts in *.txt format and book covers as pictures (usually in *.jpg format), if a book has a cover. Texts are stored in the books directory, covers in the images directory; both directories are created automatically.
If the text file for a book is unavailable, the tool skips that book and continues with the next ones; the cover for the skipped book is not downloaded either, even if there is one.
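The storage and skip behaviour described above can be sketched roughly as follows. This is only an illustration, not the scripts' actual code: `download_books`, `TextUnavailable`, the injected `fetch_text` and `fetch_cover` callables, and the id-based file names are all hypothetical, chosen so the example runs without network access.

```python
from pathlib import Path


class TextUnavailable(Exception):
    """Hypothetical error raised by fetch_text when a book has no text file."""


def download_books(book_ids, fetch_text, fetch_cover, dest=Path(".")):
    """Save texts to dest/books and covers to dest/images; return the saved ids."""
    books_dir = dest / "books"
    images_dir = dest / "images"
    # Create both directories up front; exist_ok avoids errors on reruns.
    books_dir.mkdir(parents=True, exist_ok=True)
    images_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for book_id in book_ids:
        try:
            text = fetch_text(book_id)
        except TextUnavailable:
            # No text: do not fetch the cover either, move on to the next book.
            continue
        # Assumption: files are named by id; the real scripts may name them differently.
        (books_dir / f"{book_id}.txt").write_text(text, encoding="utf-8")
        cover = fetch_cover(book_id)  # bytes, or None when the book has no cover
        if cover is not None:
            (images_dir / f"{book_id}.jpg").write_bytes(cover)
        saved.append(book_id)
    return saved
```

Passing the fetch functions in as parameters keeps the skip logic separate from the HTTP code, which makes it easy to try out with fake data.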
The differences between the two scripts are described below.
You can run the first script in two ways:
- without command line arguments, to download the books with the first 10 ids:
python3 parse_tululu.py
- with two command line arguments, to download books from start_id to end_id, inclusive:
python3 parse_tululu.py --start_id 20 --end_id 30
or in short notation:
python3 parse_tululu.py -s 20 -e 30
start_id must be less than end_id, and both arguments must be provided.
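The argument rules above could be implemented with argparse along these lines. This is a minimal sketch, not the code of parse_tululu.py: the function name `parse_id_range` is hypothetical, and it assumes that "the first 10 ids" means ids 1 through 10.

```python
import argparse


def parse_id_range(argv=None):
    """Return the range of book ids to download, based on command line arguments."""
    parser = argparse.ArgumentParser(
        description="Download books from tululu.org by id range"
    )
    parser.add_argument("-s", "--start_id", type=int, help="first book id (inclusive)")
    parser.add_argument("-e", "--end_id", type=int, help="last book id (inclusive)")
    args = parser.parse_args(argv)

    if args.start_id is None and args.end_id is None:
        # Assumption: with no arguments, the first 10 ids are 1 through 10.
        return range(1, 11)
    if args.start_id is None or args.end_id is None:
        parser.error("--start_id and --end_id must be provided together")
    if args.start_id >= args.end_id:
        parser.error("--start_id must be less than --end_id")
    # range() excludes its stop value, so add 1 to include end_id.
    return range(args.start_id, args.end_id + 1)
```

Calling `parse_id_range(["-s", "20", "-e", "30"])` yields the ids 20 to 30; invalid combinations make argparse print a usage message and exit.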
The second script, parse_tululu_category.py, has several optional arguments:
--start_page (default value is 1) and --end_page (default value is the automatically detected number of the last page) define which pages of the sci-fi category books are downloaded from, inclusive:
python3 parse_tululu_category.py --start_page 238 --end_page 347
--dest_folder (default value is the folder where the script is stored) defines where the directories for book texts and covers will be created:
python3 parse_tululu_category.py --start_page 238 --end_page 347 --dest_folder /home/username/grandpa_scifi_books/files
--json_path (default value is the folder where the script is stored) defines where the books_catalog.json file with the information about those books will be created:
python3 parse_tululu_category.py --start_page 238 --end_page 347 --dest_folder /home/username/grandpa_scifi_books/files --json_path /home/username/grandpa_scifi_books/metadata
--skip_txt and --skip_img, if added, disable the download of text files or pictures, respectively:
python3 parse_tululu_category.py --start_page 238 --end_page 347 --json_path /home/username/grandpa_scifi_books/metadata --skip_txt --skip_img
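For reference, the options listed above map naturally onto an argparse configuration like the one below. It is a sketch under stated assumptions rather than the script's real code: `parse_category_args` is a hypothetical name, `script_dir` stands for the folder where the script is stored, and `detect_last_page` is a hypothetical callable standing in for whatever the script does to find the last page of the category.

```python
import argparse
from pathlib import Path


def parse_category_args(argv, script_dir, detect_last_page):
    """Parse the category script's options and fill in the documented defaults."""
    parser = argparse.ArgumentParser(
        description="Download sci-fi books from tululu.org by category page"
    )
    parser.add_argument("--start_page", type=int, default=1)
    # No static default: the last page is only known after asking the site.
    parser.add_argument("--end_page", type=int)
    parser.add_argument("--dest_folder", type=Path, default=script_dir)
    parser.add_argument("--json_path", type=Path, default=script_dir)
    # Flags: absent means False, present means True.
    parser.add_argument("--skip_txt", action="store_true")
    parser.add_argument("--skip_img", action="store_true")
    args = parser.parse_args(argv)

    if args.end_page is None:
        # Hypothetical helper supplied by the caller.
        args.end_page = detect_last_page()
    return args
```

In a real script, `script_dir` would typically be `Path(__file__).resolve().parent`, and the pages to visit would be `range(args.start_page, args.end_page + 1)` so that the end page is included.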
- Delete the media and pages directories from the downloaded repository.
- Download books from, e.g., the first 5 pages with this command:
python3 parse_tululu_category.py --end_page 5
- Run this command to create pages of the website using information from the books_catalog.json file:
python3 render_website.py
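To give an idea of what the rendering step in the list above involves, here is a minimal standard-library sketch. It is not the code of render_website.py, which may well use a template engine. It assumes, hypothetically, that books_catalog.json is a JSON list of objects with "title" and "author" keys, that each page holds 10 books, and that pages are named index1.html, index2.html, and so on.

```python
import html
import json
from pathlib import Path


def render_pages(catalog_path, pages_dir, per_page=10):
    """Split the catalog into chunks and write one HTML page per chunk."""
    books = json.loads(Path(catalog_path).read_text(encoding="utf-8"))
    pages_dir = Path(pages_dir)
    pages_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, start in enumerate(range(0, len(books), per_page), start=1):
        chunk = books[start:start + per_page]
        # Escape catalog values so titles containing < or & do not break the markup.
        items = "\n".join(
            f"<li>{html.escape(book['title'])} ({html.escape(book['author'])})</li>"
            for book in chunk
        )
        page = pages_dir / f"index{number}.html"
        page.write_text(
            f"<!doctype html>\n<meta charset='utf-8'>\n<ul>\n{items}\n</ul>\n",
            encoding="utf-8",
        )
        written.append(page)
    return written
```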
You can open the website here while it is served locally, or stop the script and open the pages from the pages directory. Press the Читать ("Read") button to open the text file of a book.
An example of the website is available here.
It should look like this:
The code was written for educational purposes as part of dvmn.org, an online course for web developers.