GH 19 - EPUB Import Support #82
Thank you very much, this looks super! Nice that you covered the different Python versions. I wasn't even aware of the spooled temp file class … Thanks for handling the error case and activating the test too. Lute does let you edit individual pages in the book; there's a menu item in the read pane sidebar. But currently you can't delete full pages, which would be needed if the preamble is longer than one page. Is there anything users should be aware of for their EPUB imports? E.g. no DRM or whatever, etc.? I can update the manual with any notes. Cheers!
I stumbled on the spooled file thing since my default Python version is 3.10 and I was receiving an error, but when I looked at the code for the spooled file I found it seemed to have the seekable etc. methods. Then I found that it was a shortcoming the Python team shored up recently. Ah, I didn't realize you could edit the pages! Just tried it and was able to remove the preamble content that I didn't want. Should be fine for users too. Yes, for EPUB imports they need to be DRM-free. Also, I guess it's worth mentioning they can edit the first few pages to remove the preamble content if they don't want it. The way the DRM works is that the file structure remains intact, but inside an "xhtml" file the content is encrypted. I tested a DRMed EPUB just now and there's an unhandled error from openepub (UnicodeDecodeError). This should be recast to EpubError so that downstream users of the library can handle it. I'll push a fix to openepub for this. The one change I'll also push to Lute for it is to bump up to
Other than those two notes, I guess one more mention is it doesn't support
Thank you very much, looks great. Trying it out now.
Hi @sakolkar - I tried the import with an epub3 of A Tale of Two Cities from Gutenberg -- https://www.gutenberg.org/ebooks/61887. The import is breaking the epub into many short lines: which results in several separate sentences: The text in the underlying epub is not broken into these lines, it's continuous: I'll try it with another epub.
Putting this on hold until the epub parsing stuff is worked out in the underlying library. Currently, books are broken into many short lines. Happy to look at that with you if I can add any value.
Interesting find! Looks like the inside of that book has
I've put together a fix in
Implementation Details:
Looked at the options for Python libraries that can be used for extracting text from EPUB files. Because of the licenses on these libraries, decided to start a new library with a more permissive license. This library is available on PyPI (openepub). So far it allows extracting text, but it can support more features in the future as they come.
openepub's library code can be reviewed here: https://github.com/sakolkar/openepub
Plugged openepub into Lute as per the TODO items had outlined. Crafted a test EPUB file manually with only the bare minimum to meet the EPUB spec (https://www.w3.org/TR/epub-33).
- `META-INF/container.xml` specifies a single package in `EPUB/package.opf` and other necessary boilerplate.
- `EPUB/package.opf` has a single reference to an item `EPUB/tengo.xhtml` and other necessary boilerplate.
- `EPUB/tengo.xhtml` has the core body contents `<p>Tengo un amigo.</p>` and necessary boilerplate.

Special Notes:
The Flask WTF framework uses a `SpooledTemporaryFile` for the uploaded files; however, these are not directly compatible with the `zipfile` standard library module that is used inside `openepub` until Python 3.11. So there is a safeguard that checks whether the Python version is at least `3.11`, and if it is not, rewrites the content into a `TemporaryFile` and then uses `openepub`. The check looks to see if the `seekable` method exists, which was added to `SpooledTemporaryFile` in the Python 3.11 release.

Since the upload is a single action, users don't get a chance to edit the text before the book is created. Also, currently Lute doesn't support editing the book post-creation. This leads to some preamble text at the start of books being included. See this image:
![image](https://private-user-images.githubusercontent.com/9434751/292942657-9a18ccda-a6e5-4099-99d3-e595967ae6e6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MTQ3NTUsIm5iZiI6MTczODkxNDQ1NSwicGF0aCI6Ii85NDM0NzUxLzI5Mjk0MjY1Ny05YTE4Y2NkYS1hNmU1LTQwOTktOTlkMy1lNTk1OTY3YWU2ZTYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIwNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMDdUMDc0NzM1WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzBlYmQ1NDY3NmI3MjNkOTRhMTVkYzVkNmMyNjNkZWZhZDVmODY0MzY3NjcyODZkY2Y3YjE0NTk3OGYzY2VlYiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.sPJcxrsmAnhEGc4RgBm9eGfcFOXB5JxKk5tN07jPghs)
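As an illustration of the `SpooledTemporaryFile` safeguard described in the special notes, here is a minimal sketch using only the standard library. The helper name `as_seekable_file` is hypothetical and is not taken from the Lute or openepub code; it only shows the general idea of checking for `seekable` and falling back to a `TemporaryFile`.

```python
import shutil
import tempfile


def as_seekable_file(upload):
    """Return a file object, rewound to the start, that zipfile can read.

    Hypothetical helper for illustration. If `upload` has no `seekable`
    method (as with SpooledTemporaryFile before Python 3.11), its bytes
    are copied into a regular TemporaryFile, which does have one.
    """
    if hasattr(upload, "seekable"):
        upload.seek(0)
        return upload

    # Fallback for older Pythons: copy the content into a real temp file.
    upload.seek(0)
    copy = tempfile.TemporaryFile()
    shutil.copyfileobj(upload, copy)
    copy.seek(0)
    return copy
```

The returned object can then be handed to `zipfile.ZipFile` (or to a library built on it). Checking for the method rather than comparing version numbers keeps the fallback tied to the actual missing capability.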
If users upload an incorrectly structured EPUB, then an error message is flashed and they are sent back to the root page.
![image](https://private-user-images.githubusercontent.com/9434751/292942824-831ea971-f803-41d9-aab1-61a2baca2221.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MTQ3NTUsIm5iZiI6MTczODkxNDQ1NSwicGF0aCI6Ii85NDM0NzUxLzI5Mjk0MjgyNC04MzFlYTk3MS1mODAzLTQxZDktYWFiMS02MWEyYmFjYTIyMjEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIwNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMDdUMDc0NzM1WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NGQ2ZDg0ZmNlNzM4ZWMwZmY1NjA1NGJmMjQ2ZjE2NjVmNWMyMmMzNDJkMmU0Y2ZlMmM1MDdkNDk1ODdlODZlNyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.LVFkWIhy99OM4FbjsAxj6K8Cg8KimE06_lQFbVIox5A)
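For reference, the hand-crafted test EPUB described under Implementation Details could be reproduced roughly as follows. This is a sketch under assumptions, not the actual test fixture: the function name `build_minimal_epub`, the title, the identifier, and the modification date are placeholders, and the package omits the navigation document that strict EPUB 3 validators expect, mirroring the single-item layout described above.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Placeholder metadata values; only the manifest/spine layout matters here.
PACKAGE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Tengo</dc:title>
    <dc:language>es</dc:language>
    <meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="tengo" href="tengo.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="tengo"/>
  </spine>
</package>
"""

TENGO_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Tengo</title></head>
  <body><p>Tengo un amigo.</p></body>
</html>
"""


def build_minimal_epub() -> bytes:
    """Return the bytes of a tiny EPUB-style zip archive (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry goes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("EPUB/package.opf", PACKAGE_OPF)
        zf.writestr("EPUB/tengo.xhtml", TENGO_XHTML)
    return buf.getvalue()
```

Writing the result of `build_minimal_epub()` to a file such as `tengo.epub` (a placeholder name) gives a small archive with the three files listed above plus the `mimetype` entry.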