GH 19 - EPUB Import Support #82

sakolkar · 2023-12-27T02:14:53Z

Implementation Details:

I looked at the Python libraries available for extracting text from EPUB files. Because of the licenses on those libraries, I decided to start a new library with a more permissive license. It is available on PyPI (openepub). So far it only extracts text, but it can support more features as the need arises.

openepub's library code can be reviewed here: https://github.com/sakolkar/openepub

Plugged openepub into Lute as per the TODO items that had been outlined. Crafted a test EPUB file manually with only the bare minimum needed to meet the EPUB spec (https://www.w3.org/TR/epub-33).

META-INF/container.xml specifies a single package in EPUB/package.opf and other necessary boilerplate.
EPUB/package.opf has a single reference to an item EPUB/tengo.xhtml and other necessary boilerplate.
EPUB/tengo.xhtml has the core body contents <p>Tengo un amigo.</p> and necessary boilerplate.
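
For anyone who wants to reproduce a similar fixture, the layout above can be sketched with the standard library alone. This is not the file committed in this PR: the metadata values, the `build_minimal_epub` name, and the exact boilerplate are assumptions made for illustration. It also leaves out the navigation document, so a strict validator would likely reject it.

```python
import io
import zipfile

# Points readers at the single package document.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# One manifest item, one spine entry. Identifier and title are placeholders.
PACKAGE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Tengo</dc:title>
    <dc:language>es</dc:language>
  </metadata>
  <manifest>
    <item id="tengo" href="tengo.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="tengo"/>
  </spine>
</package>
"""

TENGO_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Tengo</title></head>
  <body><p>Tengo un amigo.</p></body>
</html>
"""


def build_minimal_epub() -> bytes:
    """Return the bytes of a small EPUB-shaped zip archive (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry is written first and stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("EPUB/package.opf", PACKAGE_OPF,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("EPUB/tengo.xhtml", TENGO_XHTML,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()
```

Writing the result of `build_minimal_epub()` to a `.epub` file gives something that can be fed to an importer for a quick smoke test.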

Special Notes:

The Flask WTF framework uses a SpooledTemporaryFile for uploaded files. Before Python 3.11, these are not directly compatible with the zipfile standard library module that openepub uses internally. So there is a safeguard: if the Python version is older than 3.11, the content is rewritten into a TemporaryFile, which is then passed to openepub. The check looks for the seekable method, which was added to SpooledTemporaryFile in the Python 3.11 release.
Since the upload is a single action, users don't get a chance to edit the text before the book is created. Also, Lute currently doesn't support editing the book after creation. As a result, some preamble text at the start of books gets included. See this image:
If users upload an incorrectly structured EPUB, then an error message is flashed and they are sent back to the root page.
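
The safeguard from the first note can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `ensure_zip_readable` and `list_epub_entries` are hypothetical names, and the sketch calls zipfile directly instead of going through openepub.

```python
import tempfile
import zipfile


def ensure_zip_readable(upload):
    """Return a file object that zipfile can open (hypothetical helper).

    If the upload lacks a `seekable` method, which is the case for
    SpooledTemporaryFile before Python 3.11, copy its content into a
    plain TemporaryFile instead.
    """
    if hasattr(upload, "seekable"):
        return upload
    copy = tempfile.TemporaryFile()
    copy.write(upload.read())
    copy.seek(0)  # rewind so the reader starts at the beginning
    return copy


def list_epub_entries(upload):
    """List the archive members of an uploaded EPUB (illustration only)."""
    with zipfile.ZipFile(ensure_zip_readable(upload)) as zf:
        return zf.namelist()
```

Checking for the method, instead of comparing version numbers, keeps the sketch working for any file-like object that only offers `read()`.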

jzohrab · 2023-12-27T03:52:47Z

Thank you very much, this looks super!

Nice that you covered the different python versions. I wasn’t even aware of the spooled temp file class …

Thanks for handling the error case and activating the test too.

Lute does let you edit individual pages in the book, there’s a menu item in the read pane sidebar. But currently you can’t delete full pages, that would be needed if the preamble is longer than one page.

Is there anything users should be aware of for their EPUB imports? Eg no DRM or whatever, etc? I can update the manual with any notes.

cheers!

sakolkar · 2023-12-27T04:26:36Z

I stumbled on the spooled file thing since my default Python version is 3.10 and I was receiving an error, but when I looked at the code for the spooled file it seemed to have the seekable etc. methods. Then I found that it was a shortcoming the Python team shored up recently.

Ah I didn't realize you could edit the pages! Just tried it and was able to remove the preamble content that I didn't want. Should be fine for users too.

Yes, for EPUB imports they need to be DRM free. Also, I guess it's worth mentioning they can edit the first few pages to remove the preamble content if they don't want it.

The way the DRM works is that the file structure remains intact, but the content inside an "xhtml" file is encrypted. I tested a DRMed EPUB just now and there's an unhandled error from openepub (UnicodeDecodeError). This should be recast to EpubError so that downstream users of the library can handle it.

I'll push a fix to openepub for this. The one change I'll also push to Lute for it is to bump up to openepub>=0.0.5
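
The recast could look something like the sketch below. It is an assumption about the shape of the fix, not the actual openepub change: `EpubError` is redefined locally as a stand-in, `decode_xhtml` is a hypothetical function name, and the error message is made up.

```python
class EpubError(Exception):
    """Stand-in for the library's error type, defined here only for the sketch."""


def decode_xhtml(raw: bytes) -> str:
    """Decode the bytes of an xhtml entry, hiding low-level decode errors.

    Encrypted (DRM) content is usually not valid UTF-8, so decoding fails;
    callers then only need to handle EpubError.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Chain the original exception so the root cause stays visible.
        raise EpubError(
            "Could not decode EPUB content; the file may be DRM-protected."
        ) from exc
```

A caller such as Lute could then catch `EpubError` in one place and flash a message to the user.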

sakolkar · 2023-12-27T04:30:19Z

Other than those two notes, I guess one more mention is that it doesn't support the .mobi or .azw types that Kindles favour. But those should in theory be caught by the form, since it checks the file extension. It should handle EPUB files fine, since I had openepub follow the minimum required spec for EPUB files, and we're not worried about image content or hyperlinked remote content, which in theory are possible in an EPUB.

sakolkar · 2023-12-27T04:36:54Z

Now, this is the error message when a user tries to upload an EPUB with DRM enabled.

jzohrab · 2023-12-27T11:17:31Z

Thank you very much, looks great. Trying it out now.

jzohrab · 2023-12-27T11:37:10Z

Hi @sakolkar - I tried the import with an epub3 of A Tale of Two Cities from Gutenberg -- https://www.gutenberg.org/ebooks/61887.

The import is breaking the epub into many short lines:

which results in several separate sentences:

The text in the underlying epub is not broken into these lines, it's continuous:

I'll try it with another epub.

jzohrab · 2023-12-27T11:41:14Z

Same result with the older epub format:

Did you have the same issues with your epubs you tried?

jzohrab

Putting this on hold until the epub parsing stuff is worked out in the underlying library. Currently, books are broken into many short lines. Happy to look at that with you if I can add any value.

sakolkar · 2023-12-28T03:51:09Z

Interesting find! Looks like the inside of that book has <p> tags with additional newlines inside:

<p>Erase el mejor de los tiempos
y el más detestable de los tiempos;
la época de la sabiduría y la
época de la bobería, el período
....
entre uno y otro, tanto en lo que
al bien se refiere como en lo que
toca al mal, sólo en grado superlativo
es aceptable la comparación.</p>

I've put together a fix in openepub==0.0.6. It no longer uses the Beautiful Soup get_text() method but instead iterates over the tags and for string content does a bit of whitespace cleanup.
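
To illustrate the idea (not the actual openepub code), here is a sketch that uses the standard library's `html.parser` in place of Beautiful Soup so it runs without extra dependencies. The class name, the `extract_text` function, and the set of block-level tags are all assumptions.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph boundaries in this sketch (an assumed list).
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphText(HTMLParser):
    """Collect text per block element, collapsing whitespace inside each."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._parts = []

    def flush(self):
        # Newlines and runs of spaces inside a block become single spaces.
        text = re.sub(r"\s+", " ", "".join(self._parts)).strip()
        if text:
            self.paragraphs.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._parts.append(data)


def extract_text(xhtml: str) -> str:
    """Return one line of text per block element (hypothetical helper)."""
    parser = ParagraphText()
    parser.feed(xhtml)
    parser.close()
    parser.flush()  # pick up any trailing text outside a block tag
    return "\n".join(parser.paragraphs)
```

With this approach a `<p>` whose source is wrapped across several lines comes out as one continuous line, so it is no longer split into separate sentences at each line break.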

jzohrab · 2023-12-28T04:53:50Z

Tried with a few epubs and this looks great. Loads them successfully, and with the test coverage in the underlying library I think this is a great addition.

Tale of Two Cities test with updated lib:

Great work, thanks! I'll add a small note to the "Text file" control on the book import page too so ppl know it's available.

sakolkar added 3 commits December 24, 2023 17:55

LuteOrgGH-19: Add openepub library to extract text for EPUBs

847b798

LuteOrgGH-19: sample ebook for acceptance test

0105382

LuteOrgGH-19: error handling on bad EPUB files

2785872

sakolkar changed the base branch from master to develop December 27, 2023 02:15

LuteOrgGH-19: remove extra print statement

f47dbd3

LuteOrgGH-19: bump openepub version

5a6ca19

jzohrab added the question Further information is requested label Dec 27, 2023

jzohrab self-assigned this Dec 27, 2023

jzohrab self-requested a review December 27, 2023 11:41

jzohrab requested changes Dec 27, 2023

jzohrab mentioned this pull request Dec 27, 2023

Add import .epub files #19

Closed

sakolkar added 2 commits December 27, 2023 20:54

LuteOrgGH-19: bump openepub version to v0.0.6

f8c216a

Merge branch 'develop' into LuteOrgGH-19

b99cbc4

jzohrab merged commit ba12ea4 into LuteOrg:develop Dec 28, 2023

sakolkar deleted the GH-19 branch January 3, 2024 05:48
