Improve "title" page in docx by using DocProperties instead of simple text #5839

Open
agusmba opened this issue Oct 21, 2019 · 21 comments

Comments

@agusmba
Contributor

agusmba commented Oct 21, 2019

This is an enhancement request related to how pandoc creates the "title" page in Word.

Currently the metadata values used in docx's title page are inserted as text (title, author, date, subtitle, abstract), and they are also included as document properties.

While this looks good if your workflow stops there, it's not so convenient if you modify the docx later on and want to change any of those properties. Basically you'd need to change them twice (once in the text on the title page and again as a document property).

The request would be to make pandoc write the title page using DocProperty references for these values instead of simple text, so that the docx can later be updated by changing only the docx metadata (no need to re-type the title, etc.).

If it's needed, I could analyze the differences between the new and current approach at the xml level.

Thanks!

@jgm
Owner

jgm commented Oct 21, 2019

Sounds like a good idea: please do let us know what changes would be needed in the XML.

@jgm
Owner

jgm commented Oct 21, 2019

Second thoughts: one issue may be that titles can contain formatting, but DocProperties not.
If your title contains complex formatting (boldface, italics, math, etc.), would we really be able to substitute a DocProperties field?

@agusmba
Contributor Author

agusmba commented Oct 22, 2019

> Second thoughts: one issue may be that titles can contain formatting, but DocProperties not.
> If your title contains complex formatting (boldface, italics, math, etc.), would we really be able to substitute a DocProperties field?

You may be right, and we could lose complex formatting on the title if we inserted it as a DocProperty.

Personally I think it is more valuable to have the title as a Property than to support complex formatting there (very simple formatting can be achieved with the Title style, which applies to the whole title), but I understand others could need complex formatting in the title text (so this would be a breaking change).

Unless we find a sensible solution for both I guess this is currently a no-go 🤔

@jgm jgm closed this as completed Oct 23, 2019
@tstenner
Contributor

> Unless we find a sensible solution for both I guess this is currently a no-go 🤔

How about inserting the property if it's an unformatted string and the formatted title as-is otherwise?
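A minimal sketch of what such a check could look like, written in Python against pandoc's JSON AST rather than the Haskell writer. The helper names `is_plain` and `to_text` are hypothetical, and treating only `Str`, `Space` and `SoftBreak` as "unformatted" is an assumption:

```python
# Inline types that carry no formatting (assumption: anything else counts as formatted).
PLAIN_INLINES = {"Str", "Space", "SoftBreak"}

def is_plain(inlines):
    """Return True if every inline is bare text or whitespace."""
    return all(inline["t"] in PLAIN_INLINES for inline in inlines)

def to_text(inlines):
    """Flatten plain inlines to a string; spaces and soft breaks become ' '."""
    return "".join(i["c"] if i["t"] == "Str" else " " for i in inlines)

# "My title" as pandoc represents it in JSON.
plain_title = [{"t": "Str", "c": "My"}, {"t": "Space"}, {"t": "Str", "c": "title"}]
# "My *title*": the emphasis makes it unsuitable for a plain-text property.
rich_title = [{"t": "Str", "c": "My"}, {"t": "Space"},
              {"t": "Emph", "c": [{"t": "Str", "c": "title"}]}]

print(is_plain(plain_title))  # True
print(is_plain(rich_title))   # False
```

With a check like this, the plain case would get the property reference and the formatted case would keep the title as-is.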

@jgm
Owner

jgm commented Oct 23, 2019

I'm not sure. That might lead to some unpredictability -- people expecting the things to be in sync based on past behavior, and then this breaking when a bit of formatting is added.

How does it work, anyway? If we insert the property, does that mean they can no longer manually edit the title? Or does editing the title affect the property? In the latter case, what happens if they do try to add formatting? If editing the title isn't possible, that would be bad I think.

@agusmba
Contributor Author

agusmba commented Oct 24, 2019

Quick testing with a Word document (no pandoc involved):

  • If we insert the Title property, editing it in the text modifies the property and vice versa. However, styling is very limited: trying to bold or emphasize a word in the title affects the whole title (I guess it modifies the style for the whole title, since internally the content is simple text).
  • If we insert a custom DocProperty (like date or subtitle), the text can be edited and styled completely; however, changes in the Word document are not reflected in the property. This means that if the field is updated, we lose any editing or styling done in the text. Editing can be done in the document metadata, and that is reflected in the text.

It seems that if we want rich and complex styling in the document text, the values cannot be tied to their document properties. On the other hand, if they are not tied, the document metadata and the text could diverge unless you are careful to always modify both at the same time (when editing in Word).

@jgm
Owner

jgm commented Oct 24, 2019

Given this, I think we should just keep things as they are.

@bjornbm
Contributor

bjornbm commented May 12, 2021

I would like to reopen discussion of this point with the suggestions below. I would be willing to take a stab at the coding if a pull request would be likely to be accepted.

Extension

How about adding an extension for the docx writer (for example -t docx+metadata_fields, disabled by default) that would insert title, subtitle, author(s), date and possibly abstract as document property fields, as suggested in the original issue description?

XML

With the extension enabled, the generated XML would look something like:

<w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr><w:sdt><w:sdtPr><w:alias w:val="Title"/><w:dataBinding w:prefixMappings="xmlns:ns0='http://purl.org/dc/elements/1.1/' xmlns:ns1='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' " w:xpath="/ns1:coreProperties[1]/ns0:title[1]" /></w:sdtPr></w:sdt></w:p>

<w:p ><w:pPr><w:pStyle w:val="Subtitle"/></w:pPr><w:fldSimple w:instr=" DOCPROPERTY &quot;Subtitle&quot; \* MERGEFORMAT "><w:r ><w:t>THE_SUBTITLE</w:t></w:r></w:fldSimple></w:p>

<w:p><w:pPr><w:pStyle w:val="Author"/></w:pPr><w:sdt><w:sdtPr><w:alias w:val="Author"/><w:dataBinding w:prefixMappings="xmlns:ns0='http://purl.org/dc/elements/1.1/' xmlns:ns1='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' " w:xpath="/ns1:coreProperties[1]/ns0:creator[1]" /></w:sdtPr></w:sdt></w:p>

<w:p ><w:pPr><w:pStyle w:val="Date"/></w:pPr><w:fldSimple w:instr=" DOCPROPERTY &quot;Date&quot; \* MERGEFORMAT "><w:r ><w:t>THE_DATE</w:t></w:r></w:fldSimple></w:p>

<w:p ><w:pPr><w:pStyle w:val="Abstract"/></w:pPr><w:fldSimple w:instr=" DOCPROPERTY &quot;abstract&quot; \* MERGEFORMAT "><w:r ><w:t>THE_ABSTRACT</w:t></w:r></w:fldSimple></w:p>

Author formatting

Multiple authors will not be on individual lines, but separated by semicolons (as per how pandoc populates the author docproperty). I think this is an acceptable trade-off.
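For illustration, a tiny Python sketch of how the author list from the metadata would collapse into a single property string. The function name `author_property` is hypothetical and the exact separator (`"; "`) is an assumption:

```python
def author_property(authors):
    """Join author names into one string for the author property."""
    return "; ".join(name.strip() for name in authors)

print(author_property(["Author One", "Author Two"]))  # Author One; Author Two
```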

Abstract formatting

I feel that dropping formatting for title, subtitle, author, date is a worthwhile trade-off (and expected for anyone who has use for this extension). For abstract I am not so sure. Generally speaking, since the abstract is included in the docproperties I think it would be nice to be able to link the representation to the property. But perhaps removing formatting in the abstract is too heavy-handed for many cases, in particular since it also removes paragraph separations.

Perhaps a second extension abstract_field could be added. Should this one be implicitly enabled by metadata_fields, so that to disable it (and keep formatting/paragraphs) one would use -t docx+metadata_fields-abstract_field?

As a side note: paragraphs in the abstract get mashed together without any separator in the docproperties. That is, the abstract in the example below becomes "This is the abstract.It consists of two paragraphs." It seems to me pandoc should insert a space character between paragraphs in the abstract. This probably deserves an issue of its own.
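To make the difference concrete, here is a small Python sketch (not pandoc's actual code; `flatten_paragraphs` is a hypothetical helper) that flattens paragraphs with and without a separator:

```python
def flatten_paragraphs(paragraphs, sep=" "):
    """Flatten abstract paragraphs into one property string, joined by `sep`."""
    return sep.join(p.strip() for p in paragraphs if p.strip())

paragraphs = ["This is the abstract.", "It consists of two paragraphs."]

# No separator: the behaviour described above.
print(flatten_paragraphs(paragraphs, sep=""))
# Single space between paragraphs: the suggested behaviour.
print(flatten_paragraphs(paragraphs))
```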

Example

Here is a document for testing purposes. Just run it through pandoc to a docx.

---
title:  'This is the title: it contains a colon'
subtitle:  'This is the subtitle'
author:
- Author One
- Author Two
date: 2021-05-11
abstract: |
    This is the abstract.

    It consists of two paragraphs.
...

```{=openxml}
<w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr><w:sdt><w:sdtPr><w:alias w:val="Title"/><w:dataBinding w:prefixMappings="xmlns:ns0='http://purl.org/dc/elements/1.1/' xmlns:ns1='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' " w:xpath="/ns1:coreProperties[1]/ns0:title[1]" /></w:sdtPr></w:sdt></w:p>

<w:p ><w:pPr><w:pStyle w:val="Subtitle"/></w:pPr><w:fldSimple w:instr=" DOCPROPERTY &quot;Subtitle&quot; \* MERGEFORMAT "><w:r ><w:t>THE_SUBTITLE</w:t></w:r></w:fldSimple></w:p>

<w:p><w:pPr><w:pStyle w:val="Author"/></w:pPr><w:sdt><w:sdtPr><w:alias w:val="Author"/><w:dataBinding w:prefixMappings="xmlns:ns0='http://purl.org/dc/elements/1.1/' xmlns:ns1='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' " w:xpath="/ns1:coreProperties[1]/ns0:creator[1]" /></w:sdtPr></w:sdt></w:p>

<w:p ><w:pPr><w:pStyle w:val="Date"/></w:pPr><w:fldSimple w:instr=" DOCPROPERTY &quot;Date&quot; \* MERGEFORMAT "><w:r ><w:t>THE_DATE</w:t></w:r></w:fldSimple></w:p>

<w:p ><w:pPr><w:pStyle w:val="Abstract"/></w:pPr><w:fldSimple w:instr=" DOCPROPERTY &quot;abstract&quot; \* MERGEFORMAT "><w:r ><w:t>THE_ABSTRACT</w:t></w:r></w:fldSimple></w:p>
```

In the resulting document the dataBinding fields for title and author auto-update, but the DOCPROPERTY fields for subtitle, date and abstract have to be updated manually (F9). This will not be a problem with a pandoc extension, since the writer can substitute the appropriate values (for example, for THE_DATE).
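A rough Python sketch of that substitution, assuming the writer simply fills the field's cached result with the escaped metadata value. The function `docproperty_paragraph` is hypothetical and only mirrors the hand-written XML above; it is not how pandoc's docx writer is implemented:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docproperty_paragraph(style, prop, value):
    """Build a <w:p> with a DOCPROPERTY field whose cached result is `value`."""
    instr = f' DOCPROPERTY "{prop}" \\* MERGEFORMAT '
    return (
        f"<w:p><w:pPr><w:pStyle w:val={quoteattr(style)}/></w:pPr>"
        f"<w:fldSimple w:instr={quoteattr(instr)}>"
        f"<w:r><w:t>{escape(value)}</w:t></w:r>"
        "</w:fldSimple></w:p>"
    )

# Usage: the date from the example metadata takes the place of THE_DATE.
date_xml = docproperty_paragraph("Date", "Date", "2021-05-11")

# Check that the fragment is well-formed once the w: prefix is declared.
root = ET.fromstring(f'<root xmlns:w="{W_NS}">{date_xml}</root>')
print(root.find(f".//{{{W_NS}}}t").text)  # 2021-05-11
```

Escaping the value matters because titles and abstracts can contain characters such as `&` or `<`.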

FYI this is what you will get (after updating the fields), with the frontmatter generated by pandoc appearing first followed by the openxml contents:

image

@bjornbm
Contributor

bjornbm commented May 17, 2021

@jgm, my comment/proposal seems to have gone unnoticed. Should I open a new issue to get attention? (I guess I will if I don't hear anything within the next few days.) Thanks!

@jgm
Owner

jgm commented May 17, 2021

Sorry, I haven't had a chance to think about this. But we can reopen the issue.

@jgm jgm reopened this May 17, 2021
@jgm
Owner

jgm commented May 17, 2021

Thanks for the test file; it's good to know how to use these properties.
But I'm less sanguine than you about being content with unformatted titles. What about titles that have equations in them, for example, or sub/superscripts?

Citing #3109, #3034 and #7256 as related issues.

@bjornbm
Contributor

bjornbm commented May 17, 2021

> But I'm less sanguine than you about being content with unformatted titles. What about titles that have equations in them, for example, or sub/superscripts?

My thinking was that in such cases one simply wouldn't use +metadata_fields. If one wants to use this extension, one will have to accept the restrictions that already apply to the docproperties populated by pandoc.

@jgm
Owner

jgm commented May 17, 2021

I'm still unclear about the motivation for the change. The stated motivation is to allow one to change the title, author, or abstract in just one place, rather than having to change it both in the properties and in the document itself. This is an issue that would come up only if you use pandoc to generate the docx and then do further work on the docx itself (rather than regenerating again from a markdown source).

I tried inserting the Title property into a document using Word. When I then modified the property, the document didn't update. And modifying the rendered field in the document didn't update the property. I must be missing something.

@agusmba
Contributor Author

agusmba commented May 18, 2021

You need to insert the special smart title for it to work, as in:

image

If you just insert the document property with the generic field selector it won't auto update:

image

@bjornbm
Contributor

bjornbm commented May 18, 2021

> I tried inserting the Title property into a document using Word. When I then modified the property, the document didn't update. And modifying the rendered field in the document didn't update the property. I must be missing something.

What @agusmba said. I struggle to find where to do that on Mac; it might be one of those things where the “insert title” function is available only on Windows, but the field itself actually works on Mac. Anyway, you can use pandoc with the example in #5839 (comment) (just edited to correct indentation in the YAML metadata) to see what it looks like. Here is a screenshot on Mac of the editable field you get for title (and authors):

image

@bjornbm
Contributor

bjornbm commented May 18, 2021

> I'm still unclear about the motivation for the change. The stated motivation is to allow one to change the title, author, or abstract in just one place, rather than having to change it both in the properties and in the document itself. This is an issue that would come up only if you use pandoc to generate the docx and then do further work on the docx itself (rather than regenerating again from a markdown source).

Yes. Unfortunately, in a professional environment I often generate a docx with pandoc that is then shared and further edited by non-pandoc folks.

As a side note: even if fields such as date do not auto-sync with the property, the fact that it is a field will indicate to collaborators that they should update in properties and then update the field (and any date fields in headers/footers) rather than doing “inline changes” of each date. At least with a modicum of instruction. My motivation is that it makes it more likely that people will work with the document in Word in such a way that the document contents and properties stay in sync.

@jgm
Owner

jgm commented May 18, 2021

Thanks for explaining the motivation further. I would prefer not to add an extension. What about a convention like this?

---
title: '[This is my title]{.usefield}'
---

@jgm
Owner

jgm commented May 18, 2021

Or perhaps:

---
title: My title
author: Me
usefields: true
---
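To compare the two conventions, here is a hedged Python sketch of how either one could be detected in pandoc's JSON metadata. The names `span_classes` and `wants_field` are hypothetical, as is the rule that the value must be exactly one span; nothing here reflects an actual pandoc implementation:

```python
def span_classes(meta_value):
    """Classes of a MetaInlines value that is exactly one Span, else []."""
    if meta_value.get("t") != "MetaInlines":
        return []
    inlines = meta_value["c"]
    if len(inlines) == 1 and inlines[0]["t"] == "Span":
        # A Span's content is [[identifier, classes, key-value pairs], inlines].
        (_ident, classes, _kvs), _content = inlines[0]["c"]
        return classes
    return []

def wants_field(meta, key, marker="usefield"):
    """True if `usefields: true` is set or the value is wrapped in a marked span."""
    flag = meta.get("usefields", {})
    if flag.get("t") == "MetaBool" and flag["c"] is True:
        return True
    return marker in span_classes(meta.get(key, {}))

# Per-field convention: only the title is wrapped in a .usefield span.
meta = {
    "title": {"t": "MetaInlines", "c": [
        {"t": "Span", "c": [["", ["usefield"], []],
                            [{"t": "Str", "c": "My"}, {"t": "Space"},
                             {"t": "Str", "c": "title"}]]}]},
    "author": {"t": "MetaInlines", "c": [{"t": "Str", "c": "Me"}]},
}
print(wants_field(meta, "title"))   # True
print(wants_field(meta, "author"))  # False

# Global convention: a single boolean switches every field on.
meta_global = dict(meta, usefields={"t": "MetaBool", "c": True})
print(wants_field(meta_global, "author"))  # True
```

The span form allows per-field selection, while the boolean is shorter when every field should be linked.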

@bjornbm
Contributor

bjornbm commented May 18, 2021

Sure. The first one is nice in that it allows one to be selective about which properties to make into fields, although it will be a bit verbose when desired for all/most fields. I think I could live with that. :)

@jgm
Owner

jgm commented May 18, 2021

Is .usefield the best option? Would .linkproperty make more sense? Or something else?

@bjornbm
Contributor

bjornbm commented May 18, 2021

Maybe .linkproperty is better, since the title and author(s) aren't the same types of fields as others. Don't have any other suggestions off the top of my head.
