[Feature Request] Differentiate actual paragraphs and LUTE's page splitting #475

Closed
lef-est opened this issue Aug 30, 2024 · 8 comments
Labels
fixed Fixed in develop or master, to be launched.

Comments


lef-est commented Aug 30, 2024

Is your feature request related to a problem? Please describe.

Where to break a paragraph is a stylistic choice by the author and may aid in storytelling. This information is lost because LUTE splits paragraphs to meet the token count per page.

Describe the solution you'd like

Indent original paragraphs, while leaving the "new paragraphs" that result from LUTE's splitting unchanged, treated the same way they are now.
If possible, I'd like LUTE to not remove empty lines (or all whitespaces) for the same reason. Many authors use 2 or 3 lines for sub-chapter breaks. Some even differentiate the use of 2 lines or 3 lines.

Describe alternatives you've considered

During book creation, give users the option of "don't split paragraphs" (unchecked by default). Not the best solution performance-wise and probably takes more work.

Additional context

With LUTE we already lose a lot of information such as bold, italic, underline, images, tables. I understand the difficulty of incorporating them, which entails fundamental changes. But paragraph breaks and empty lines are probably the easiest ones to handle and deserve a chance.

@lef-est lef-est changed the title from "Differentiate actual paragraphs and LUTE's page splitting" to "[Feature Request] Differentiate actual paragraphs and LUTE's page splitting" Aug 30, 2024
Collaborator

jzohrab commented Aug 30, 2024

Lute splits by sentence to keep the token count reasonable: some paragraphs can get really long, or the formatting can get weird when people copy/paste text from different places. It's hard to say what the best solution is here, I still feel that splitting paragraphs up is necessary, and I don't have a great answer for this ... unless I do something like first try to group by paragraphs, and then check the page size and only split the paragraph if the page is, say, 50% longer than the maximum size. It's a bit convoluted and tricky, but maybe that would suffice. Does that seem reasonable?

> If possible, I'd like LUTE to not remove empty lines (or all whitespaces) for the same reason.

Yes this sounds reasonable and I'm not sure why I did this in the first place. :-)

> bold, italic, underline, images, tables.

Yeah that's so tough! Would be nice though, I agree. There's an issue to allow for markdown on import of text files (though tables still wouldn't be possible, b/c it's bananas)

Let me know re the paragraph grouping and page size threshold idea ... it could be tricky-ish.
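
To make the two-step idea above concrete, here is a minimal sketch, not Lute's actual code. It keeps a paragraph whole as long as the page stays within the maximum plus a tolerated overflow (50% here), and falls back to sentence splitting only when a paragraph would push the page past that limit. The function names, the `overflow` parameter, the naive sentence splitter, and the choice to close a page once it reaches `max_words` are all assumptions made for illustration.

```python
import re


def split_sentences(paragraph):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def word_count(text):
    return len(text.split())


def paginate(paragraphs, max_words, overflow=0.5):
    """Group whole paragraphs into pages; split a paragraph by sentence
    only if keeping it whole would exceed max_words * (1 + overflow)."""
    limit = max_words * (1 + overflow)
    pages = []
    current = []  # paragraph fragments on the page being built
    count = 0     # words on the page being built

    def flush():
        nonlocal current, count
        if current:
            pages.append("\n".join(current))
        current, count = [], 0

    for para in paragraphs:
        if count + word_count(para) <= limit:
            # Step 1: the whole paragraph fits within the tolerated overflow.
            current.append(para)
            count += word_count(para)
        else:
            # Step 2: too long, so add it sentence by sentence instead.
            fragment = []
            for sentence in split_sentences(para):
                fragment.append(sentence)
                count += word_count(sentence)
                if count >= max_words:
                    current.append(" ".join(fragment))
                    flush()
                    fragment = []
            if fragment:
                current.append(" ".join(fragment))
        if count >= max_words:
            flush()
    flush()
    return pages
```

With `max_words=10`, a page holding a 6-word and a 7-word paragraph (13 words) is kept intact because 13 is within the limit of 15, whereas a 24-word paragraph is split across pages by sentence.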

@LangNerd23

Although the question was not addressed to me: I personally like this idea of a two-step process (first grouping by paragraphs, and only splitting further if a page exceeds a certain threshold).

At the moment I still have quite a lot of work in pre-processing books, often setting manual separators (---) to avoid parts of the problems mentioned above.

Collaborator

jzohrab commented Oct 23, 2024

I've had annoying paragraph splits as well, in places that didn't make sense and complicated the reading of a tough book.

Collaborator

jzohrab commented Oct 29, 2024

I've been thinking about this one on and off for a few days, and don't have a perfect solution for it yet, but probably good enough: I'm considering doing grouping based on paragraphs, and just not bothering to split paragraphs at all, even if the paragraph is huge. For the most part, that should be fine ... if someone is reading James Joyce or Faulkner with Lute, they're probably outside of the target user base anyway. And such users could just add a new page and split a long paragraph manually if they wanted to.

With this change, the "max words" thing would become a "threshold", and paragraphs would be added to the current page until the threshold is crossed, at which point a new page would be started. The threshold could be exceeded by any number of tokens -- but it's probably good enough.
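
A rough sketch of the threshold behaviour described in this comment, under stated assumptions: paragraphs are never split, each one is appended to the current page, and the page is closed as soon as its word count reaches the threshold. The function name is hypothetical, words are counted by whitespace splitting rather than by Lute's tokenizer, and treating "crossed" as "greater than or equal to" is a guess.

```python
def pages_by_paragraph(paragraphs, threshold):
    """Add whole paragraphs to a page until the threshold is reached,
    then start a new page. A page may exceed the threshold by any amount."""
    pages = []
    current = []  # paragraphs on the page being built
    count = 0     # words on the page being built
    for para in paragraphs:
        current.append(para)
        count += len(para.split())
        if count >= threshold:  # assumption: "crossed" means >=
            pages.append("\n".join(current))
            current, count = [], 0
    if current:
        # Whatever is left over becomes the final, possibly short, page.
        pages.append("\n".join(current))
    return pages
```

Note that a single huge paragraph simply becomes one oversized page here, which matches the trade-off accepted in the comment.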

Author

lef-est commented Dec 28, 2024

> It's hard to say what the best solution is here, I still feel that splitting paragraphs up is necessary, and I don't have a great answer for this ...

I agree. That's why I put "do not split" under "alternative solution you've considered" :)

> I've been thinking about this one on and off for a few days, and don't have a perfect solution for it yet, but probably good enough: I'm considering doing grouping based on paragraphs, and just not bothering to split paragraphs at all, even if the paragraph is huge. For the most part, that should be fine ... if someone is reading James Joyce or Faulkner with Lute, they're probably outside of the target user base anyway. And such users could just add a new page and split a long paragraph manually if they wanted to.

> With this change, the "max words" thing would become a "threshold", and paragraphs would be added to the current page until the threshold is crossed, at which point a new page would be started.

Yes! I'd love to see this implemented!

> unless I do something like first try to group by paragraphs, and then check the page size and only split the paragraph if the page is, say, 50% longer than the maximum size. It's a bit convoluted and tricky, but maybe that would suffice. Does that seem reasonable?

> Let me know re the paragraph grouping and page size threshold idea ... it could be tricky-ish.

It'll give the best looking pages but as you said it is convoluted. It sounds like a lot to ask, so I proposed what I feel like is the least-effort solution (inject different styling) haha. I'd be happy with any improvement.
Sorry for the late reply. I forgot to check notifications.

@jzohrab jzohrab added this to Lute-v3 Jan 8, 2025
@jzohrab jzohrab moved this to Todo in Lute-v3 Jan 8, 2025
@jzohrab jzohrab moved this from Todo to In Progress in Lute-v3 Jan 8, 2025
@jzohrab jzohrab moved this from In Progress to Todo in Lute-v3 Jan 8, 2025
@jzohrab jzohrab added the blocked label Jan 8, 2025
Collaborator

jzohrab commented Jan 8, 2025

Blocked by #562, "refactor book creation". When that's done, this will be clearer to work on. Update: done now.

@jzohrab jzohrab removed the blocked label Jan 9, 2025
@jzohrab jzohrab moved this from Todo to In Progress in Lute-v3 Jan 9, 2025
Collaborator

jzohrab commented Jan 9, 2025

Have added a "split by" drop down box ("paragraphs", "sentences") to the book creation form. For example, given the below text:

Para one sentence one.  Para one sentence two.  Para one sentence three.
Para two sentence one.  Para two sentence two.  Para two sentence three.

Different split settings will return different pages:

By Sentences, threshold 7 words per page:

Para one sentence one. Para one sentence two.
Para one sentence three.
Para two sentence one.
Para two sentence two. Para two sentence three.

By Paragraphs, threshold 7 words per page:

Para one sentence one. Para one sentence two. Para one sentence three.
Para two sentence one. Para two sentence two. Para two sentence three.

This seems good, tests passing etc. Merged into develop, will launch soon.
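
The example above can be reproduced with a small sketch. This is an illustration only, not the code merged into Lute: `make_pages`, its `split_by` values, and the sentence splitter are hypothetical names, words are counted by whitespace, and a page is assumed to close once its count reaches the threshold. It also assumes that in the "By Sentences" output the second and third lines belong to the same page, with the original paragraph break kept inside it.

```python
import re


def split_sentences(paragraph):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def make_pages(text, split_by, threshold):
    """Build pages from text, adding sentences or whole paragraphs
    until the word threshold is reached."""
    if split_by not in ("sentences", "paragraphs"):
        raise ValueError("split_by must be 'sentences' or 'paragraphs'")
    paragraphs = [p for p in text.splitlines() if p.strip()]
    pages = []
    current = []  # lines (paragraph fragments) on the page being built
    count = 0     # words on the page being built
    for para in paragraphs:
        if split_by == "sentences":
            units = split_sentences(para)
        else:
            units = [" ".join(para.split())]  # whole paragraph, spaces normalized
        line = []  # units from this paragraph placed on the current page
        for unit in units:
            line.append(unit)
            count += len(unit.split())
            if count >= threshold:
                current.append(" ".join(line))
                pages.append("\n".join(current))
                current, line, count = [], [], 0
        if line:
            current.append(" ".join(line))
    if current:
        pages.append("\n".join(current))
    return pages


text = (
    "Para one sentence one.  Para one sentence two.  Para one sentence three.\n"
    "Para two sentence one.  Para two sentence two.  Para two sentence three."
)
for page in make_pages(text, "sentences", 7):
    print(page)
    print("--- page break ---")
```

Under these assumptions, splitting by sentences yields three pages of two sentences each, and splitting by paragraphs yields two pages of one paragraph each.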

@jzohrab jzohrab added the fixed Fixed in develop or master, to be launched. label Jan 9, 2025
@jzohrab jzohrab moved this from In Progress to Done in Lute-v3 Jan 9, 2025
Collaborator

jzohrab commented Jan 10, 2025

Launched in 3.9.2

@jzohrab jzohrab closed this as completed Jan 10, 2025