Rework FormattedText model to better support USX3/USFM3 import #93

Open
schierlm opened this issue Aug 3, 2024 · 10 comments

@schierlm
Owner

schierlm commented Aug 3, 2024

The current FormattedText model, which is used as the intermediate format for every conversion (except conversions between two Paratext formats), has been there since the beginning of BibleMultiConverter. Yet, other Bible formats have evolved. Therefore, rework the internal model.

Some ideas:

  1. FormattingInstructionKind: Add new constants

    • PSALM_TITLE (titles of Psalms which sometimes are part of verse 1, sometimes before it)
    • ADDED_TEXT (text added by the translator which is not linked to the original source, often conjunctions)

    When exporting those to a format that does not support them, treat both as ITALIC.

  2. Add Speaker markup to mark text spoken by a person other than Jesus. Speakers can be identified
    by labels (e.g. "Moses") or Strongs numbers (e.g. "H4872").

  3. Rework LineBreakKind based on ExtendedLineBreakKind used for Paratext export

  4. GrammarInformation: Add suffix letters for Strongs numbers (optional), also add a way to add
    arbitrary key-value pairs (like in OSIS or Paratext). Values need not be ASCII only (e.g. Greek Lemma).

  5. Links: Support

    • Anchors in the text (by id)
    • Links to those anchors
    • Links to external hyperlinks
    • Links to external images (which may be displayed inline if supported by the format)

  6. Footnotes: Add a flag indicating whether a footnote contains text or cross references. For now, this is done by adding XREF_MARKER to the beginning of the footnote text, but many new formats have this distinction and parsing for magic strings gets cumbersome.

  7. Cross References: Support cross references that span more than one book; also support cross references that do not reference individual verses, but whole chapters or books.
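
To make ideas 1 and 6 more concrete, here is a minimal sketch of the ITALIC fallback and of an explicit footnote flag. All type and member names (FormattingKindSketch, FootnoteKindSketch, FootnoteSketch, PLACEHOLDER_MARKER) are hypothetical and only illustrate the idea; they are not the existing FormattedText API, and the marker string is a placeholder rather than the real XREF_MARKER value.

```java
import java.util.Set;

// Hypothetical stand-in for FormattingInstructionKind; only a few constants are listed.
enum FormattingKindSketch {
    BOLD, ITALIC, PSALM_TITLE, ADDED_TEXT;

    // Returns this kind if the target format supports it. The two new kinds
    // fall back to ITALIC; any other unsupported kind is returned unchanged
    // and left for the caller to handle.
    FormattingKindSketch fallbackFor(Set<FormattingKindSketch> supported) {
        if (supported.contains(this)) {
            return this;
        }
        if (this == PSALM_TITLE || this == ADDED_TEXT) {
            return ITALIC;
        }
        return this;
    }
}

// Idea 6: an explicit flag instead of a magic prefix in the footnote text.
enum FootnoteKindSketch { TEXT, CROSS_REFERENCES }

final class FootnoteSketch {
    // Placeholder only; NOT the real XREF_MARKER constant.
    static final String PLACEHOLDER_MARKER = "[xref] ";

    // Migration helper: derive the flag from a footnote that still uses the prefix.
    static FootnoteKindSketch kindOfLegacy(String footnoteText) {
        return footnoteText.startsWith(PLACEHOLDER_MARKER)
                ? FootnoteKindSketch.CROSS_REFERENCES
                : FootnoteKindSketch.TEXT;
    }
}
```

With something like this, an exporter for an older format would call fallbackFor with its own set of supported kinds, and only importers of legacy data would still need to look at the prefix.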

As this is a major task (it needs to touch most of the modules), my plan is to first update only the roundtrip formats and make the other formats "just" work again (using fallbacks or ignoring the new options). I will keep a list of each module's status (e.g. compiles again, tested, compared against format spec), trying not to make any format worse than before at any point in the process.

When exporting other features from USFM to FormattedText, use ExtraAttributes wherever possible. This should also include custom tags and custom milestones. There should be an option to convert UBXF alignment milestones (for a single alignment source) to GrammarInformation instead of extra attributes.

Did I miss anything? To qualify, a feature should be present in both USFM3/USX3 and in more than one other format.

// cc @Rolf-Smit @Michahel @shadow-light @paul1149

@Rolf-Smit
Contributor

Lately I have not been actively working with this tool; I mostly use it to convert from USX (2/3) to USFM (3) (which my application, surprisingly, can parse faster than XML).

From my perspective I don't have any remarks about your plans to rework FormattedText. I think it would be good if the intermediate format supports as many features as possible, in a sensible and generic way.

@schierlm
Owner Author

Updated the issue to not forget to add support for UBXF alignment milestones.

@schierlm
Owner Author

@Rolf-Smit just a heads up: in a553d4b I changed the intermediate format used by Paratext formats by moving Figure, VerseStart and VerseEnd to be BookContent instead of CharacterContent (none of the Paratext formats supported so far allow those nested in character tags or footnotes anyway). This makes some parsing easier and removes some ugly workarounds that made extending the format harder.

Not sure if that affects your use cases.

schierlm added a commit that referenced this issue Sep 25, 2024
@schierlm
Owner Author

UBXF alignment milestones are now implemented, see 93eeed0 (part of main branch)

A very early alpha of the new FormattedText model is available in the newmodel branch (supported formats).
Nightly build: https://nightly.link/schierlm/BibleMultiConverter/workflows/main.yaml/newmodel/BibleMultiConverter-AllInOneEdition-Release.zip

schierlm added this to the v0.1 milestone Oct 20, 2024
@shadow-light

Hi Michael, thanks for working on this. Seems to run ok. Has USFM<->USX changed at all, or just when converting to/from other formats?

@schierlm
Owner Author

schierlm commented Nov 22, 2024

> Has USFM<->USX changed at all, or just when converting to/from other formats?

USFM<->USX has changed a lot since I added support for the USFM 3 format. I also found a few bugs while implementing this feature, but I backported the fixes to the normal version.

As a consequence, the current nightly build has many changes in USFM<->USX conversion compared to the previously released version; on the other hand, USFM<->USX behaviour in the normal nightly version and the one from the newmodel branch should be identical.

@shadow-light

So just to clarify, I'll get the USFM3 upgrade if I use builds from master here: https://github.com/schierlm/BibleMultiConverter/actions ?

@schierlm
Owner Author

> So just to clarify, I'll get the USFM3 upgrade if I use builds from master here: https://github.com/schierlm/BibleMultiConverter/actions ?

Correct.

@shadow-light

Am I right in saying USFM1-2 is also valid USFM3, whereas USX1-2 is not valid USX3 because it lacks end verse markers?

@schierlm
Owner Author

Yes, all USFM formats are also valid in later versions of the standard. USX is not, first because of the end markers, and second because of the different XML schema which dropped some deprecated attributes and values.
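
As a rough illustration of the end-marker difference, the following sketch checks whether a USX document contains any verse element with an eid attribute, which is how USX 3 marks verse ends. The class and method names are hypothetical and not part of BibleMultiConverter. This is only a heuristic, not a schema validation, so it says nothing about the deprecated attributes and values.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of BibleMultiConverter.
final class UsxVerseMarkerCheck {

    // Returns true if at least one <verse> element carries an eid attribute,
    // i.e. the document contains verse end markers.
    static boolean hasVerseEndMarkers(String usxXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(usxXml.getBytes(StandardCharsets.UTF_8)));
            NodeList verses = doc.getElementsByTagName("verse");
            for (int i = 0; i < verses.getLength(); i++) {
                if (((Element) verses.item(i)).hasAttribute("eid")) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            // Malformed XML or parser setup problems are reported unchecked.
            throw new IllegalStateException("Could not parse USX input", e);
        }
    }
}
```

A USX 2 file with only verse start elements would return false here, while a USX 3 file with paired sid/eid verse elements would return true.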

schierlm pinned this issue Dec 30, 2024