GH 19 - EPUB Import Support #82

Merged
merged 7 commits into from
Dec 28, 2023

Conversation

sakolkar
Contributor

Implementation Details:

Looked at the options for Python libraries that can extract text from EPUB files. Because of the licenses on these libraries, I decided to start a new library with a more permissive license. This library is available on PyPI (openepub). So far it only extracts text, but it can support more features in the future as the need arises.

openepub's library code can be reviewed here: https://github.com/sakolkar/openepub

Plugged openepub into Lute as outlined in the TODO items. Manually crafted a test EPUB file with only the bare minimum needed to meet the EPUB spec (https://www.w3.org/TR/epub-33).

  • META-INF/container.xml specifies a single package in EPUB/package.opf and other necessary boilerplate.
  • EPUB/package.opf has a single reference to an item EPUB/tengo.xhtml and other necessary boilerplate.
  • EPUB/tengo.xhtml has the core body contents <p>Tengo un amigo.</p> and necessary boilerplate.
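
The layout above can be sketched with the standard library's zipfile module. This is an illustrative sketch only, not the test fixture from this PR: the identifier, title, and function name are placeholders, and a strict EPUB validator may require additional items (for example a navigation document) beyond what is shown here.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Placeholder metadata; the href is relative to the package document's folder.
PACKAGE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Tengo</dc:title>
    <dc:language>es</dc:language>
  </metadata>
  <manifest>
    <item id="tengo" href="tengo.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="tengo"/>
  </spine>
</package>
"""

TENGO_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Tengo</title></head>
  <body><p>Tengo un amigo.</p></body>
</html>
"""


def build_minimal_epub() -> bytes:
    """Return the bytes of a small EPUB-style zip archive (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry goes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("EPUB/package.opf", PACKAGE_OPF, compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("EPUB/tengo.xhtml", TENGO_XHTML, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()
```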

Special Notes:

  • The Flask WTF framework uses a SpooledTemporaryFile for uploaded files. Before Python 3.11, these are not directly compatible with the zipfile standard library module that openepub uses. So there is a safeguard: if the Python version is older than 3.11, the content is rewritten into a TemporaryFile, which is then passed to openepub. The check looks for the seekable method, which was added to SpooledTemporaryFile in the Python 3.11 release.

  • Since the upload is a single action, users don't get a chance to edit the text before the book is created. Also, Lute currently doesn't support editing the book after creation. As a result, some preamble text at the start of books is included. See this image:
    image

  • If users upload an incorrectly structured EPUB, then an error message is flashed and they are sent back to the root page.
    image
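
The SpooledTemporaryFile safeguard from the first note could be sketched roughly as follows. This is a minimal sketch under assumptions, not the code in this PR: the function name as_zip_compatible is hypothetical, and it takes any readable, seekable-by-position binary stream.

```python
import shutil
import tempfile
import zipfile


def as_zip_compatible(stream):
    """Return a file object that zipfile can read (hypothetical helper).

    If the stream already exposes a seekable() method (as SpooledTemporaryFile
    does from Python 3.11 on), it is returned unchanged. Otherwise its content
    is copied into a TemporaryFile, which has the full file interface.
    """
    if hasattr(stream, "seekable"):
        return stream

    copy = tempfile.TemporaryFile()
    stream.seek(0)                      # copy from the start of the upload
    shutil.copyfileobj(stream, copy)
    copy.seek(0)                        # rewind so the reader starts at byte 0
    return copy
```

The returned object can then be handed to zipfile.ZipFile (or to a library built on it) on either side of the 3.11 boundary.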

@sakolkar sakolkar changed the base branch from master to develop December 27, 2023 02:15
@jzohrab
Collaborator

jzohrab commented Dec 27, 2023

Thank you very much, this looks super!

Nice that you covered the different python versions. I wasn’t even aware of the spooled temp file class …

Thanks for handling the error case and activating the test too.

Lute does let you edit individual pages in the book, there’s a menu item in the read pane sidebar. But currently you can’t delete full pages, that would be needed if the preamble is longer than one page.

Is there anything users should be aware of for their EPUB imports? Eg no DRM or whatever, etc? I can update the manual with any notes.

cheers!

@sakolkar
Contributor Author

I stumbled on the spooled file thing since my default Python version is 3.10 and I was receiving an error, but when I looked at the code for the spooled file it seemed to have the seekable etc. methods. Then I found that it was a shortcoming the Python team shored up recently.

Ah I didn't realize you could edit the pages! Just tried it and was able to remove the preamble content that I didn't want. Should be fine for users too.

Yes, for EPUB imports they need to be DRM free. Also, I guess it's worth mentioning they can edit the first few pages to remove the preamble content if they don't want it.

The way the DRM works is that the file structure remains intact, but the content inside an "xhtml" file is encrypted. I tested a DRMed EPUB just now and there's an unhandled error from openepub (UnicodeDecodeError). This should be recast to EpubError so that downstream users of the library can handle it.

I'll push a fix to openepub for this. The one change I'll also push to Lute for it is to bump up to openepub>=0.0.5
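
A minimal sketch of that kind of recasting, for illustration only. The EpubError class here is a local stand-in defined for the example, and decode_xhtml is a hypothetical helper; neither is taken from openepub's actual code.

```python
class EpubError(Exception):
    """Local stand-in for a library-level error type (assumption for this sketch)."""


def decode_xhtml(raw: bytes) -> str:
    """Decode the bytes of an XHTML item, recasting decode failures.

    Encrypted (DRM-protected) content is generally not valid UTF-8, so the
    low-level UnicodeDecodeError is converted into a single error type that
    callers can catch.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise EpubError("Could not decode content; the file may be DRM-protected.") from exc
```

A caller then only needs one except EpubError clause to cover both malformed and encrypted files.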

@sakolkar
Contributor Author

Other than those two notes, I guess one more thing to mention is that it doesn't support the .mobi or .azw types that Kindles favour. But those should in theory be caught by the form, since it checks the file extension. It should handle EPUB files fine, since I had openepub follow the minimum required spec for EPUB files and we're not worried about image content or hyperlinked remote content, which in theory is possible in an EPUB.
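
As a rough illustration of an extension check of that kind (not Lute's actual form validation; the allowed set and the function name below are assumptions made for the example):

```python
from pathlib import Path

# Assumed set of accepted extensions, for illustration only.
ALLOWED_EXTENSIONS = {".txt", ".epub"}


def has_allowed_extension(filename: str) -> bool:
    """Return True if the filename ends in one of the accepted extensions."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

A check like this only looks at the name, so a mislabelled or DRM-protected file still has to be rejected by the parsing step.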

@sakolkar
Contributor Author

Now, this is the error message when a user tries to upload an EPUB with DRM enabled.
image

@jzohrab
Collaborator

jzohrab commented Dec 27, 2023

Thank you very much, looks great. Trying it out now.

@jzohrab
Collaborator

jzohrab commented Dec 27, 2023

Hi @sakolkar - I tried the import with an epub3 of A Tale of Two Cities from Gutenberg -- https://www.gutenberg.org/ebooks/61887.

the import is breaking the epub into many short lines:

image

which results in several separate sentences:

image

The text in the underlying epub is not broken into these lines, it's continuous:

image

I'll try it with another epub.

@jzohrab
Collaborator

jzohrab commented Dec 27, 2023

Same result with the older epub format:

image

Did you have the same issues with your epubs you tried?

@jzohrab jzohrab added the question Further information is requested label Dec 27, 2023
@jzohrab jzohrab self-assigned this Dec 27, 2023
@jzohrab jzohrab self-requested a review December 27, 2023 11:41
Collaborator

@jzohrab jzohrab left a comment

Putting this on hold until the epub parsing stuff is worked out in the underlying library. Currently, books are broken into many short lines. Happy to look at that with you if I can add any value.

@jzohrab jzohrab mentioned this pull request Dec 27, 2023
@sakolkar
Contributor Author

Interesting find! Looks like the inside of that book has <p> tags with additional newlines inside:

<p>Erase el mejor de los tiempos
y el más detestable de los tiempos;
la época de la sabiduría y la
época de la bobería, el período
....
entre uno y otro, tanto en lo que
al bien se refiere como en lo que
toca al mal, sólo en grado superlativo
es aceptable la comparación.</p>

I've put together a fix in openepub==0.0.6. It no longer uses the Beautiful Soup get_text() method but instead iterates over the tags and for string content does a bit of whitespace cleanup.
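
The idea of walking the tags and normalising whitespace in their text can be sketched as below. This is not openepub's code: it swaps in the standard library's html.parser in place of Beautiful Soup so the example has no dependencies, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element with internal whitespace collapsed."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._parts = None  # text pieces of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # split() with no arguments drops newlines and runs of spaces,
            # so hard-wrapped source lines become one continuous line.
            text = " ".join("".join(self._parts).split())
            if text:
                self.paragraphs.append(text)
            self._parts = None


def extract_paragraphs(xhtml: str) -> list[str]:
    parser = ParagraphText()
    parser.feed(xhtml)
    parser.close()
    return parser.paragraphs
```

With this approach a hard-wrapped paragraph like the one quoted above comes out as a single line, so sentences are no longer split at the source file's line breaks.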

@jzohrab
Collaborator

jzohrab commented Dec 28, 2023

Tried with a few epubs and this looks great. Loads them successfully, and with the test coverage in the underlying library I think this is a great addition.

Tale of Two Cities test with updated lib:

image

Great work, thanks! I'll add a small note to the "Text file" control on the book import page too so ppl know it's available.

@jzohrab jzohrab merged commit ba12ea4 into LuteOrg:develop Dec 28, 2023
@sakolkar sakolkar deleted the GH-19 branch January 3, 2024 05:48