Implement fine-grained extraction of translatable text #25

Merged
merged 1 commit into from
May 1, 2023

@mgeisler mgeisler commented May 1, 2023

Before, we would extract text based on the byte offsets in the original document. As a consequence of this, the extracted text would look precisely like the original: the Markdown was copied directly from the original. In particular, text from a block quote would contain the leading ‘>’ characters and paragraphs in list items would contain leading whitespace.

Now, we instead extract text by grouping the Markdown parse events into those which should be translated and those which should be skipped. We use this in two ways:

  • When extracting messages in ‘mdbook-xgettext’, we turn the translatable events back into Markdown. The structure of the document (headings, lists, block quotes, …) is no longer present in the extracted messages: only the text content itself is extracted.

  • When translating, we replace the sequence of translatable events with the events from the translation. We do this while leaving the structure of the document unchanged.

The result of this is a much more robust system: editing one list item no longer impacts adjacent list items, and moving a paragraph into a block quote no longer changes the paragraph.
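
The grouping step described above can be sketched roughly as follows. This is a minimal illustration only: the `Event` enum, `Group`, `is_translatable`, and `group_events` here are simplified, hypothetical stand-ins and not the types or rules used in this PR. In particular, the sketch assumes that only text and inline code are translatable, which is cruder than what a real implementation needs.

```rust
/// Simplified, hypothetical stand-in for a Markdown parse event.
/// A real parser's event type has many more variants.
#[derive(Clone, Debug, PartialEq)]
enum Event {
    /// Start of a structural element such as a paragraph or block quote.
    Start(&'static str),
    /// End of a structural element.
    End(&'static str),
    /// Plain text.
    Text(String),
    /// Inline code.
    Code(String),
}

/// A run of consecutive events that are either all translated or all skipped.
#[derive(Debug, PartialEq)]
enum Group {
    Translate(Vec<Event>),
    Skip(Vec<Event>),
}

/// Assumption for this sketch: only text and inline code are translatable.
fn is_translatable(event: &Event) -> bool {
    matches!(event, Event::Text(_) | Event::Code(_))
}

/// Splits the event stream into alternating runs of translatable and
/// skipped events, preserving the original order.
fn group_events(events: &[Event]) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    for event in events {
        let translatable = is_translatable(event);
        // Does this event belong to the same kind of run as the last group?
        let extend_last = match groups.last() {
            Some(Group::Translate(_)) => translatable,
            Some(Group::Skip(_)) => !translatable,
            None => false,
        };
        if !extend_last {
            groups.push(if translatable {
                Group::Translate(Vec::new())
            } else {
                Group::Skip(Vec::new())
            });
        }
        if let Some(Group::Translate(run) | Group::Skip(run)) = groups.last_mut() {
            run.push(event.clone());
        }
    }
    groups
}
```

With such a grouping, extraction would render each `Translate` run back to Markdown, while translation would swap each `Translate` run for the events of the translated text and pass the `Skip` runs through untouched.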

As a side effect of how we turn events into messages, links are now all expanded. This makes the messages larger, but it removes a common source of errors where ‘[foo][1]’ would end up pointing to the wrong location if the reference link was updated.
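
To illustrate the effect of link expansion on a message (not the mechanism used in this PR, where it falls out of rendering events back to Markdown), here is a small sketch. The function `expand_reference_links` and the definitions map are hypothetical, and the matching is deliberately naive: it only handles the full `[text][label]` form.

```rust
use std::collections::HashMap;

/// Rewrites `[text][label]` into `[text](url)` using the given link
/// definitions. Links with unknown labels are left unchanged.
/// Hypothetical helper for illustration only.
fn expand_reference_links(text: &str, definitions: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(open) = rest.find('[') {
        let after_open = &rest[open + 1..];
        // Try to match "text][label]" right after the opening bracket.
        let expanded = after_open.find("][").and_then(|mid| {
            let link_text = &after_open[..mid];
            // Reject matches that span several bracket pairs.
            if link_text.contains(|c| c == '[' || c == ']') {
                return None;
            }
            let after_mid = &after_open[mid + 2..];
            let close = after_mid.find(']')?;
            let url = definitions.get(&after_mid[..close])?;
            // Bytes consumed: up to and including the final ']'.
            let consumed = open + 1 + mid + 2 + close + 1;
            Some((format!("[{}]({})", link_text, url), consumed))
        });
        match expanded {
            Some((link, consumed)) => {
                out.push_str(&rest[..open]);
                out.push_str(&link);
                rest = &rest[consumed..];
            }
            None => {
                // Not a reference link: copy the bracket and move on.
                out.push_str(&rest[..open + 1]);
                rest = &rest[open + 1..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

Because the expanded message carries its own URL, it no longer depends on a reference definition elsewhere in the document staying in sync.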

Part of #19.

@mgeisler mgeisler requested a review from djmitche May 1, 2023 07:16

@djmitche djmitche left a comment

Very nice! How will you handle migrating existing translations?

Comment on lines +213 to +231
let new_state = cmark_resume_with_options(
    events.clone(),
    String::new(),
    state.clone(),
    options.clone(),
)
.unwrap();

// Block quotes and lists add padding to the state. This is
// reflected in the rendered Markdown. We want to capture the
// Markdown without the padding to remove the effect of these
// structural elements.
let state_without_padding = state.map(|state| State {
    padding: Vec::new(),
    ..state
});
cmark_resume_with_options(events, &mut markdown, state_without_padding, options).unwrap();

djmitche commented:

It's not clear to me why this calls cmark_resume_with_options twice.

Is the idea to return an accurate state (new_state) but return markdown rendered without the padding?

mgeisler replied:

> Is the idea to return an accurate state (new_state) but return markdown rendered without the padding?

Yes, precisely! The padding is the "> " and list indents and I'm trying to avoid putting that into the .po file.
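
A toy model of why clearing the padding changes the rendered output: `RenderState`, `render`, and `without_padding` below are hypothetical stand-ins written for this illustration, and only loosely mirror the `State { padding: Vec::new(), ..state }` expression in the snippet under review.

```rust
/// Hypothetical stand-in for the renderer state. Only the parts needed
/// for the illustration are modelled.
#[derive(Clone, Debug, Default, PartialEq)]
struct RenderState {
    /// Prefixes contributed by enclosing block quotes and lists,
    /// for example "> " or "  ".
    padding: Vec<String>,
    /// Extra field so the struct update syntax below has something to copy.
    is_in_code_block: bool,
}

/// Renders lines of text, prefixing each line with the accumulated padding.
fn render(lines: &[&str], state: &RenderState) -> String {
    let prefix = state.padding.concat();
    lines
        .iter()
        .map(|line| format!("{}{}", prefix, line))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Returns a copy of the state with the padding cleared and every other
/// field kept as it was.
fn without_padding(state: &RenderState) -> RenderState {
    RenderState {
        padding: Vec::new(),
        ..state.clone()
    }
}
```

Rendering the same lines with the original state yields the `> ` prefixes, while rendering with `without_padding(&state)` yields the bare text that would go into the .po file. The original state is left untouched so it can still be used to continue rendering the document.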

mgeisler commented May 1, 2023

> Very nice!

Thanks!

> How will you handle migrating existing translations?

My next step is to write a little normalization tool. It should be enough to go through a .po file and run both the msgid and msgstr fields through the extract_messages function. Hopefully, we end up with n messages from the msgid and n messages from the msgstr, which gives us n new pairs. If we get a different number of messages, we can mark some (or all) of the pairs as fuzzy in the .po file.
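
A rough sketch of that normalization idea, under stated assumptions: `Entry` is a hypothetical, reduced model of a .po entry, and `split_messages` is a placeholder that splits on blank lines where the real tool would call extract_messages. When the counts differ, this sketch keeps the original pair and marks it fuzzy, which is just one of the possible policies.

```rust
/// Hypothetical, reduced model of one entry in a .po file.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    msgid: String,
    msgstr: String,
    fuzzy: bool,
}

/// Placeholder for the real extraction function: splits text into
/// messages at blank lines. An assumption made for this sketch only.
fn split_messages(text: &str) -> Vec<String> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|message| !message.is_empty())
        .map(String::from)
        .collect()
}

/// Splits one old entry into fine-grained entries. If the source and the
/// translation yield the same number of messages, they are paired up in
/// order. Otherwise the original entry is kept and marked fuzzy.
fn normalize(entry: &Entry) -> Vec<Entry> {
    let ids = split_messages(&entry.msgid);
    let strs = split_messages(&entry.msgstr);
    if ids.len() == strs.len() {
        ids.into_iter()
            .zip(strs)
            .map(|(msgid, msgstr)| Entry {
                msgid,
                msgstr,
                fuzzy: entry.fuzzy,
            })
            .collect()
    } else {
        vec![Entry {
            fuzzy: true,
            ..entry.clone()
        }]
    }
}
```

Pairing by position assumes the translation keeps the same paragraph order as the source, which is why a count mismatch is treated as a signal for human review rather than resolved automatically.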

@mgeisler mgeisler enabled auto-merge May 1, 2023 20:16
@mgeisler mgeisler merged commit 44b4b46 into main May 1, 2023
@mgeisler mgeisler deleted the fine-grained-extraction branch May 1, 2023 20:16