Ability to convert yaml front matter into custom properties in docx and vice versa #3034

Open
rdwatters opened this issue Jul 20, 2016 · 27 comments

Comments

@rdwatters

I think this is pretty self-explanatory, but the ability to set custom properties in a Word document would be a total game-changer, especially for those of us working for companies so entrenched in the MS garden that we use SharePoint as a DMS. I have hundreds of md files that I'd love to convert to Word and have them retain the custom key in the key-value pair. Then it's just a drag and drop into a SharePoint library, which should pick up those custom properties and map them correctly for list views, etc.

I appreciate this is a very big ask. Cheers.

@jgm
Owner

jgm commented Jul 20, 2016

@jkr


@jkr
Collaborator

jkr commented Jul 20, 2016

@rdwatters Would you be able to upload a docx file with some of these properties, and a pointer to which particular properties you're interested in? What you're requesting might well be possible, but I'd have to see where in the maze of xml those properties reside.

@rdwatters
Author

Absolutely @jkr. I'll put together some screenshots at work later today.

@rdwatters
Author

@jkr @jgm

Here is a link to a docx created in MS Word for Mac Version 15.24; my hope is that these custom properties do not change depending on whether it's Windows/OSX:

https://www.dropbox.com/s/86qs1hfng6l601r/pandoc-sample.docx?dl=0

Not sure if it helps, but...

  • "Content" was not auto-populated when I selected text in the document, styled it as "Title" and saved it. Only after I manually entered the title into the "Summary" field in properties did it show up.
  • My "User Information," which is saved in preferences, doesn't really seem to transfer over, so this might be different for those who have Office365.
  • I actually set up a temporary SharePoint instance (hence why it took me so long to get back to you), created a new document content type that inherited from "Dublin Core Columns", an out-of-box content type for SP13 +, and then added "Pandoc1" to my views in SharePoint lists and the word "awesome" appeared from within the SP UI. If you're not familiar with SharePoint, in other words, saving it locally means I can surface it within the SP environment:) Woohoo!

pandoc-word-properties1
pandoc-word-properties2
pandoc-word-properties3
pandoc-word-properties4

I guess I would have to dig deeper to see how these different key-values would map, but things like "keywords" in the .docx property (usually "tags" in a .md .yml front matter) seem to be equivalent.

I know I'm throwing out a ton in this single comment, so let me know how I can help/clarify.

@jkr
Collaborator

jkr commented Sep 3, 2016

So, just so I understand this -- would you like to have the top-level metadata map into the properties?

---
title: My Title
keywords: publishing, microsoft, etc.
...

Blah blah blah.

or do you want to have the ability to have a different title, like you do in the supplied document?

---
title: My Title
docx-props: 
   title: This wasn't added...
   keywords: publishing, microsoft, etc..
...

Blah blah blah

I don't know what Sharepoint is, or how 365 works -- are they important to this, or are we just trying to get the metadata into props?

@jkr
Collaborator

jkr commented Sep 3, 2016

Or is it actually the info under "custom" (pandoc1="wicked", pandoc2="awesom") that you're most concerned with?

There are a lot of properties here, so I want to make sure I'm looking at the right thing.

@rdwatters
Author

rdwatters commented Sep 5, 2016

@jkr Both excellent questions that demonstrate how terrible my examples were. The following is long-winded but thorough. I really, really appreciate you taking the time to look into this.

SharePoint is significant in that it's the #1 most popular intranet tool (>50% of Fortune 500 companies) and is MS's native document management system. It is also notoriously difficult to extract content from. Companies invest enormous amounts of time tagging and cataloguing Word documents with metadata that is embedded directly into the document but then is only easily digestible by a further SharePoint instance. Ideally, Pandoc would be able to do these conversions bidirectionally between Word and MD - again, I appreciate that this is a huge ask.

Here is a temp repo that houses the word versions used in the following screenshots:
https://github.com/rdwatters/pandoc-word-samples

1. (see word-only.docx). So here are the properties of the doc created in Word. These properties (Title, Subject, Author, Manager, Company, Category, Keywords, Comments) are all out-of-the-box:

word-only_docx_properties_part1

Here is a shot of the added Custom Property (for this example, the property is "CustomProperty", data type is text, and value is "Hello Pandoc"):

word-only_docx_properties_part2

As an aside, I'm using two separate titles to demonstrate how (a) Pandoc pulls from the title in page copy during a .docx => .md conversion, whereas (b) MS (for the purpose of its document management system/SharePoint) pulls the title from the properties pane of the .docx. A minor improvement for Pandoc in its .docx => .md conversion would be to first check for a "Title" property in the properties pane to add to the yaml of a markdown file and then, if the properties pane does not include the title, pull the text styled as "Title" in the word document. Per your question re: adding two separate titles, I don't think adding two separate titles adds much value. Right now, when pandoc converts .md => .docx, it takes title: from the markdown file's yaml and adds it to the body copy (with appropriate styling) and also to the title field in properties, which is AWESOME.

2. So now that we have a Word document (created locally) with the out-of-box properties and one custom property added, the doc can be added to a SharePoint list, which in the admin view looks like the following screenshot. Note how "Comments" has been converted to "Description" and the "Creator" column (I'm using DCMI metadata for this example, which is an out-of-the-box content type in SharePoint) is auto-populated from the Author.

word-only_docx_properties_part3

3. You can see in the last screenshot that "Publisher" doesn't have a value. These values can be updated within the SharePoint UI and are embedded in the Word document itself:

sharepoint-added_docx_properties_part1

4. And here is the resulting update in the SharePoint list (keep in mind these lists will often have 10ks of Word documents):

sharepoint-added_docx_properties_part2

5. (see word-with-properties-added-in-sp.docx) Now that these properties are in the document, I'll download a copy of the doc and retitle it to word-with-properties-added-in-sp.docx. When I open this local copy and go to properties, everything remains, BUT both "Publisher" and "Contributors" cannot be accessed from within Word. They are in the document and will automatically update in a SharePoint list, but they cannot be accessed from the property panes within the Word app:

sharepoint-added_docx_properties_part3

6. (COPY-word-with-properties-added-in-sp.docx) And finally, just to double check that both the custom property (ie, CustomProperty) and the properties we added via SharePoint directly (ie, the DCMI "Contributors" and "Publisher") are actually embedded in the document, I'll upload the locally renamed version of the original document as well as a copy to the SharePoint list, which shows the properties (ie, metadata) actually travel with the document:

sharepoint-added_docx_properties_part4

So what would be the ideal workflow?

Having the document I created above in Word write back to a markdown file with all the out-of-the-box properties (listed above), the custom properties (in this case, just "CustomProperty"), and the other content type properties (in this case, "Publisher" and "Contributors") as typical key-value pairs....and vice versa.

All this insanity because the last 5 years of markdown evangelism have led to exactly 0 converts in my last three companies.

Hopefully that makes sense. Thanks again!

@jkr
Collaborator

jkr commented Sep 7, 2016

Having the document I created above in Word write back to a markdown file with all the out-of-the-box properties (listed above), the custom properties (in this case, just "CustomProperty"), and the other content type properties (in this case, "Publisher" and "Contributors") as typical key-value pairs....and vice versa.

So now I'm confused. It seems like you're asking for the ability to go docx -> markdown, while the title suggests you're interested in going markdown -> docx. While I appreciate the completeness of your description, I'm still unsure of what you want pandoc to do. A simple input file with expected output file would definitely help.

Now, assuming you want something bidirectional, let me mention some concerns.

  1. docx -> md: Most Word users don't know, or particularly care, what the metadata is on their document. Most are missing titles, and those that aren't often have the first line of the file when it was first saved, whatever that was. Having that unknown metadata transfer into yaml could be annoying.

    That being said, in this direction, it would be quite simple to write a python script that would produce full yaml from a docx file, and insert it into the resulting markdown file. (It would just require unzipping with zipfile, and parsing xml with etree or the like). I could write a version of that for you when I get a chance.

  2. md -> docx: this is a bit easier, but it requires that we come up with a convention for which values in the yaml metadata get transferred over to the docx metadata. And then document those. So the issues here aren't technical so much as user-interface-based. But that's still an important consideration.
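
The docx -> md script described in point 1 could look roughly like the following. This is a sketch, not something from pandoc itself: the function names are made up, only docProps/core.xml and docProps/custom.xml are read, and the YAML rendering is naive (every value becomes a double-quoted string, keys are written unquoted, and a custom property silently wins over a core property of the same name).

```python
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_docx_properties(path_or_file):
    """Return (core, custom) dicts mapping property name to its text value."""
    core, custom = {}, {}
    with zipfile.ZipFile(path_or_file) as docx:
        names = set(docx.namelist())
        if "docProps/core.xml" in names:
            root = ET.fromstring(docx.read("docProps/core.xml"))
            for child in root:
                # Skip empty fields so they never reach the front matter.
                if child.text and child.text.strip():
                    core[local_name(child.tag)] = child.text.strip()
        if "docProps/custom.xml" in names:
            root = ET.fromstring(docx.read("docProps/custom.xml"))
            for prop in root.findall(CUSTOM_NS + "property"):
                # The single child (vt:lpwstr, vt:i4, ...) holds the value.
                value = prop[0].text if len(prop) else None
                custom[prop.get("name")] = value or ""
    return core, custom


def to_yaml_block(core, custom):
    """Render both dicts as one YAML metadata block (hypothetical layout)."""
    def quote(text):
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["---"]
    for key, value in {**core, **custom}.items():
        lines.append(f"{key}: {quote(value)}")
    lines.append("---")
    return "\n".join(lines)
```

Calling `to_yaml_block(*read_docx_properties("word-only.docx"))` would give a block that can be prepended to the markdown pandoc produces. Which keys should actually be kept is the user-interface question raised in point 2.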

@rdwatters changed the title from "[FEATURE REQUEST] Ability to convert yaml front matter into custom properties in docx" to "[FEATURE REQUEST] Ability to convert yaml front matter into custom properties in docx and vice versa" on Sep 7, 2016
@rdwatters
Author

Agreed w/r/t title of this feature. I have updated it accordingly. Now in terms of your responses:

  1. I agree that most people don't care about the metadata in Word docs, but that's not the point nor do I think it should prevent this feature since a) if they don't care, the metadata fields would be empty, which would mean not actually writing the respective yaml to the front matter of the markdown; hence, it wouldn't be annoying, and b) I'm not talking about a single-person use case as much as I'm talking about companies that have tremendous amounts of IP wrapped up in DOCX and housed, specifically, in SharePoint (see my above comments and statistics). In the case of such a DMS, most users are required to add the appropriate fields to a document when checking it in.
  2. Also agree. Does Pandoc already have a set of base front matter key-values it prefers? If so, I can use this to put together a simple input and output. My inclination is the several I mentioned above would work just fine, but if additional metadata (eg, contributors above) is housed elsewhere in the document and not necessarily accessible from the Word interface (but still read in the DMS), having the ability to still write that metadata could be handy, but I could see that being a separate feature request in the future.

@Sealatron

I've stumbled across this issue today after struggling with docx Custom Properties. It would be amazing if pandoc could handle them. Much like @rdwatters I work at a company that has all of its documentation in docx format with plenty of custom properties.

If it helps clarify things, here's my immediate use case:

I currently use markdown with YAML at the top of the document like so:

---
title: MyTitle
author: Sealatron
...


# Heading
## Sub Heading

I convert these to docx using a command like this:
pandoc -s --reference-docx reference.docx -f markdown+yaml_metadata_block -t docx input.md -o output.docx

My issue is that there are custom properties in reference.docx that appear in the headers and footers. Just now they transfer across to my output.docx with their original values (thus giving me erroneous values w.r.t. the doc I'm writing), but what I'd ideally like to be able to do is override them somehow in my original input.md like so:

---
title: MyTitle
author: Sealatron
custom1: MyCustom1Value
custom2: MyCustom2Value
...

This is of course just an example, I don't know what the correct syntax would be. I'd expect the output to look like:

  1. If any custom properties as defined in YAML don't exist in reference.docx, add them to output.docx
  2. If any custom properties as defined in YAML do exist in reference.docx, override the values with the new ones in output.docx.
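
The two rules above amount to a dictionary merge in which the YAML values win. A minimal sketch, assuming both sides have already been read into plain dicts; the function name and the sample value `OldValue` are made up for illustration:

```python
def merge_custom_properties(reference_props, yaml_props):
    """Merge custom properties; values from the YAML front matter win."""
    merged = dict(reference_props)  # start from what reference.docx defines
    merged.update(yaml_props)       # rule 1 adds new keys, rule 2 overrides
    return merged


reference = {"custom1": "OldValue"}
front_matter = {"custom1": "MyCustom1Value", "custom2": "MyCustom2Value"}
merged = merge_custom_properties(reference, front_matter)
```

Properties that exist only in reference.docx are kept unchanged, which matches the current behaviour described above.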

Obviously like @rdwatters said, this is a big ask (just glancing at the raw xml of the reference.docx told me how weird the format was) but I'd love to see something like this in pandoc in future!

@mb21
Collaborator

mb21 commented Apr 13, 2017

I guess what would really help is the exact XML output needed for different cases...

@Sealatron

Agreed, sorry I left that out of my example.

I first created reference.docx by converting an example markdown document to docx. Within Word, I created 'Custom Properties' as shown in @rdwatters' comments above. Here's what that looks like within reference.docx::docProps\custom.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Custom1">
        <vt:lpwstr>MyCustom1Value</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Custom2">
        <vt:lpwstr>MyCustom2Value</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="4" name="Custom3">
        <vt:lpwstr>MyCustom3Value</vt:lpwstr>
    </property>
</Properties>

These DocProperty values are inserted using Word as 'Fields' in the headers/footers. The XML for those looks like this:

        <w:r>
            <w:fldChar w:fldCharType="begin"/>
        </w:r>
        <w:r>
            <w:instrText xml:space="preserve"> DOCPROPERTY  Custom1  \* MERGEFORMAT </w:instrText>
        </w:r>
        <w:r>
            <w:fldChar w:fldCharType="separate"/>
        </w:r>
        <w:r>
            <w:t>MyCustom1Value</w:t>
        </w:r>
        <w:r>
            <w:fldChar w:fldCharType="end"/>
        </w:r>

After running the pandoc command in my previous comment, these fields remain in the headers/footers of the output.docx, but the corresponding custom.xml doesn't exist.

What I'd like to be able to do is specify custom properties like the above somehow in my original markdown document so that the fields in my output.docx headers/footers will update correctly. This might be as straightforward as creating a custom.xml from the provided YAML?
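
Generating a custom.xml like the one shown above from a flat set of key-value pairs is short. The sketch below makes several assumptions: all values are written as `vt:lpwstr` strings, `pid` numbering starts at 2 because that is what the sample file does, and the function name is made up. It needs Python 3.8+ for the `xml_declaration` argument, and it does not emit `standalone="yes"`; whether Word cares about that has not been checked here.

```python
import xml.etree.ElementTree as ET

CP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
# fmtid copied from the custom.xml sample above
FMTID = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}"


def build_custom_xml(props):
    """Serialize a dict of name -> value as docProps/custom.xml bytes."""
    # Make ElementTree write xmlns="..." and xmlns:vt="..." instead of ns0/ns1.
    ET.register_namespace("", CP_NS)
    ET.register_namespace("vt", VT_NS)
    root = ET.Element(f"{{{CP_NS}}}Properties")
    for pid, (name, value) in enumerate(props.items(), start=2):
        prop = ET.SubElement(
            root,
            f"{{{CP_NS}}}property",
            {"fmtid": FMTID, "pid": str(pid), "name": name},
        )
        ET.SubElement(prop, f"{{{VT_NS}}}lpwstr").text = str(value)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


custom_xml = build_custom_xml({"Custom1": "MyCustom1Value",
                               "Custom2": "MyCustom2Value"})
```

The file on its own is probably not sufficient: the package also has to declare and reference it, so this covers only the part that comes from the YAML.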

@jgm changed the title from "[FEATURE REQUEST] Ability to convert yaml front matter into custom properties in docx and vice versa" to "Ability to convert yaml front matter into custom properties in docx and vice versa" on Apr 15, 2017
@reneknuvers

I too would like this feature to be added to the pandoc-docx-writer. However I don't think the issue is 'bite-sized' at this moment. May I suggest the first step towards the full implementation suggested in the above posts?

Step 1 could be implementation of all default metadata in the docx-writer, in a similar way to how the epub writer uses it. Microsoft uses Dublin Core in core.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <dc:title>rkn-titel</dc:title>
    <dc:subject>rkn-onderwerp</dc:subject>
    <dc:creator>rkn-auteur</dc:creator>
    <cp:keywords>rkn-trefwoord1 rkntrefwoord2</cp:keywords>
    <dc:description>rkn-opmerking</dc:description>
    <cp:lastModifiedBy>René Knuvers</cp:lastModifiedBy>
    <cp:revision>1</cp:revision>
    <dcterms:created xsi:type="dcterms:W3CDTF">2018-01-11T18:20:00Z</dcterms:created>
    <dcterms:modified xsi:type="dcterms:W3CDTF">2018-01-11T18:24:00Z</dcterms:modified>
    <cp:category>rkn-categorie</cp:category>
</cp:coreProperties>

so it would be nice to populate those values from within the YAML-metadatablock in a markdown file, or a separate xml as could be used for EPUB.

---
Title: rkn-title
Subject: rkn-onderwerp
Author: rkn-auteur  # note that this will populate the "dc:creator" field
Keywords:
- rkntrefwoord1
- rkntrefwoord2
Revision: 1
Description: rkn-opmerking
---

Supporting all other fields from DCMI (http://dublincore.org/documents/dc-xml-guidelines/) would be nice. This would also include fields for a unique document identifier and a status field.

Step 2 could be implementation of populating 'custom.xml':

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Afdeling">
        <vt:lpwstr>rkn-afdeling</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Documentnummer">
        <vt:lpwstr>rkn-documentnummer</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="4" name="Gecontroleerd door">
        <vt:lpwstr>rkn-gecontroleerd door</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="5" name="Project">
        <vt:lpwstr>rkn-project</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="6" name="Status">
        <vt:lpwstr>rkn-status</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="7" name="Taal">
        <vt:lpwstr>rkn-taal</vt:lpwstr>
    </property>
</Properties>

... but that may be too hard for a simple implementation. Some of the custom fields are actually prepopulated but seem language dependent. I'm Dutch and use a multilingual MAC-Word 16.8 (171210) set to Dutch (nl-NL).

@agusmba
Contributor

agusmba commented Apr 24, 2018

This would be a very nice feature.

At least in the md -> docx direction, as it would add the template-variable-substitution that we have on the template based writers (all save docx, odt, ppt...).

We would be removing the need to edit headers and footers in word after conversion to enter the correct "title" or "department" etc.

@agusmba
Contributor

agusmba commented Oct 9, 2018

Under #2839 a commit has been included to support odt custom properties (on the writer). It would be great to have something similar for docx.

@jgm
Owner

jgm commented Oct 9, 2018

Looks like we'd need, in [Content_Types].xml:

<Override PartName="/docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/></Types>

In _rels/.rels:

<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties" Target="docProps/custom.xml"/>

and then we'd need docProps/custom.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Foo bar">
<vt:lpwstr>hello there</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Zoopie">
<vt:lpwstr>1123</vt:lpwstr>
</property>
</Properties>
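
For experimenting outside pandoc, the three pieces listed above can be applied to an existing .docx with a short script. This is only a sketch and says nothing about how the Haskell writer does it: the function names are made up, the relationship Id is simply the first free `rIdN` rather than a fixed `rId5`, and the content of custom.xml is taken as ready-made bytes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CUSTOM_CT = "application/vnd.openxmlformats-officedocument.custom-properties+xml"
CUSTOM_REL = ("http://schemas.openxmlformats.org/officeDocument/2006/"
              "relationships/custom-properties")


def _serialize(root, default_ns):
    # Register the namespace as default so the output has no ns0: prefixes.
    ET.register_namespace("", default_ns)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def add_content_type(xml_bytes):
    """Add the Override for /docProps/custom.xml unless it already exists."""
    root = ET.fromstring(xml_bytes)
    tag = f"{{{CT_NS}}}Override"
    if not any(o.get("PartName") == "/docProps/custom.xml" for o in root.iter(tag)):
        ET.SubElement(root, tag, {"PartName": "/docProps/custom.xml",
                                  "ContentType": CUSTOM_CT})
    return _serialize(root, CT_NS)


def add_relationship(xml_bytes):
    """Add the custom-properties relationship under the first free rIdN."""
    root = ET.fromstring(xml_bytes)
    tag = f"{{{REL_NS}}}Relationship"
    rels = root.findall(tag)
    if not any(r.get("Type") == CUSTOM_REL for r in rels):
        used = {r.get("Id") for r in rels}
        n = 1
        while f"rId{n}" in used:
            n += 1
        ET.SubElement(root, tag, {"Id": f"rId{n}", "Type": CUSTOM_REL,
                                  "Target": "docProps/custom.xml"})
    return _serialize(root, REL_NS)


def attach_custom_xml(docx_bytes, custom_xml):
    """Return new .docx bytes with docProps/custom.xml added or replaced."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "[Content_Types].xml":
                data = add_content_type(data)
            elif item.filename == "_rels/.rels":
                data = add_relationship(data)
            elif item.filename == "docProps/custom.xml":
                continue  # dropped here, the new version is written below
            dst.writestr(item.filename, data)
        dst.writestr("docProps/custom.xml", custom_xml)
    return out.getvalue()
```

Running it twice on the same file should not duplicate the Override or the Relationship, since both helpers check for an existing entry first.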

jgm added a commit that referenced this issue Oct 9, 2018
So far, we don't actually write any custom properties,
but we have the infrastructure to add this.

See #3034.
@jgm
Owner

jgm commented Oct 9, 2018

A general question affecting this and #2839: how should we identify custom properties?

In #2839 I just took everything except title, author, date, and lang as a custom property.

But perhaps instead we should look for a special section custom:

custom:
  prop1: foo
  prop2: bar

This would avoid getting properties like toc: true and so on.
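
To make the difference between the two options concrete, here is a small sketch with metadata modelled as a plain dict. Both function names and the sample metadata are made up for illustration; the set of excluded keys is the one mentioned above for #2839.

```python
STANDARD_KEYS = {"title", "author", "date", "lang"}


def custom_by_exclusion(meta):
    """Option 1: every key except a few standard ones is a custom property."""
    return {k: v for k, v in meta.items() if k not in STANDARD_KEYS}


def custom_by_section(meta, section="custom"):
    """Option 2: only keys nested under a dedicated section are custom."""
    return dict(meta.get(section) or {})


meta = {
    "title": "My Title",
    "toc": True,
    "custom": {"prop1": "foo", "prop2": "bar"},
}
by_exclusion = custom_by_exclusion(meta)
by_section = custom_by_section(meta)
```

With this input, the first option picks up `toc` (and the nested `custom` mapping itself), while the second returns only `prop1` and `prop2`.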

@mb21
Collaborator

mb21 commented Oct 9, 2018

Well, what makes these metadata keys more custom than others? If anything, they seem to belong to the top level, together with title, author, etc. And toc: true would seem to belong to a sub-level, like template: (of course, backwards-compatibility...)

Maybe a question to anybody in this thread: would it hurt to put all properties in the docx? Even toc: true, header-includes, etc? I'm guessing it wouldn't hurt... maybe people using SharePoint in their company have a naming-scheme already with a custom prefix?

@agusmba
Contributor

agusmba commented Oct 10, 2018

Good question.

Personally I like the simplicity of having everything that is not an "official" property of the target format as a custom property, like what was done for #2839

I guess having a special section would work too, although some property duplication in the front matter could occur in this case

Whatever is decided in the end should be the same for ODT and DOCX

Thank you for tackling this!!

@rdwatters
Author

Maybe a question to anybody in this thread: would it hurt to put all properties in the docx? Even toc: true, header-includes, etc? I'm guessing it wouldn't hurt...

Probably not. As long as the custom property isn't mapping to any sort of library/site column in SharePoint, it shouldn't matter. (At some point, it has to be the Pandoc user's responsibility to match these.)

Personally I like the simplicity of having everything that is not an "official" property of the target format as a custom property, like what was done for #2839

If I understand this correctly, yes, I agree. The standard properties, I believe, are everything in the "summary," as shown in the screenshot above.

One might, after some digging, start to suspect that Microsoft made this unnecessarily complex intentionally 😉

@agusmba
Contributor

agusmba commented Dec 20, 2018

@jgm How could I help move this one forward?

I see a couple "core" (core.xml) properties probably missing:

  • subject
  • description

The extended properties (app.xml) seem to include:

  • Manager
  • Company

I guess any additional property should go to the custom.xml

Regarding the use of a special section for custom properties, I think it wouldn't be necessary, but I'm not against it either. We could add all not-ignored-properties (not ending in _) to the custom bag, or just the contents of the custom section in the yaml. What would make more sense taking into account the rest of the writers?

BTW docx custom properties can be strings, but also numbers, dates and booleans:

<property name="AMBCustomKey" pid="4" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:lpwstr>AMB Custom Value</vt:lpwstr>
</property>

<property name="Número de documento" pid="5" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:i4>42</vt:i4>
</property>

<property name="Fecha de registro" pid="6" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:filetime>2019-01-30T23:00:00Z</vt:filetime>
</property>

<property name="AMBBinary" pid="7" fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}">
<vt:bool>true</vt:bool>
</property>

In case we want to take that into account.
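
If types were taken into account, choosing the `vt:` element could follow the type of the metadata value. A rough sketch, using Python types as a stand-in for whatever the metadata parser yields; the function name is made up, only the four element types shown above are covered, and `vt:i4` would overflow for integers outside the 32-bit range:

```python
import datetime
import xml.etree.ElementTree as ET

VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"


def vt_element(value):
    """Return a vt:* element chosen from the Python type of value."""
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        kind, text = "bool", "true" if value else "false"
    elif isinstance(value, int):
        kind, text = "i4", str(value)
    elif isinstance(value, datetime.datetime):
        # Naive datetimes are interpreted as local time by astimezone().
        utc = value.astimezone(datetime.timezone.utc)
        kind, text = "filetime", utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    else:
        kind, text = "lpwstr", str(value)
    elem = ET.Element(f"{{{VT_NS}}}{kind}")
    elem.text = text
    return elem
```

Anything that is not recognised falls back to `vt:lpwstr`, so a string-only implementation is the degenerate case of this mapping.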

@agusmba
Contributor

agusmba commented Jan 11, 2019

I added some basic support for this in a PR, trying to replicate what was done for ODT.
Even though we might end up changing which properties to write, both ODT and DOCX writers should be consistent.

@agusmba
Contributor

agusmba commented Jan 14, 2019

Speaking about the docx writer:

I think I'll ignore extended properties for now (app.xml), but having support for core.xml already in pandoc, I think it would be useful to take advantage of it instead of putting all as custom properties.

Docx core properties title, creator, keywords, created, modified are already supported by pandoc (creator is author in pandoc properties, and the last two are calculated automatically).

However there are additional properties available which we could equate with some pandoc ones:

docx core property | pandoc property | notes
-------------------|-----------------|------
subject            | subtitle        | similar to other writers? No, see remarks in the PR
description        | abstract        |
language           | lang            | does Word use it?

There are additional docx core properties that we could parse from pandoc directly without changing their names:

  • description
  • subject
  • category
  • identifier
  • revision
  • version

This is a non-exhaustive list, but I think I got the main entries.

I'll try to modify the PR accordingly.

UPDATE:

Identifier and version in core.xml seem to get lost or are not accessible from within Word. Revision is also giving me grief in core.xml; I can completely break Word with it. I will remove these three from core (note that they can still be used as custom properties).

I'm not sure if language is doing anything in core, I do not see a difference one way or another. Should I also move it to custom?

Subject is quite different from subtitle and should not be made equivalent. Thanks @HeirOfNorton

@agusmba
Contributor

agusmba commented Jan 26, 2019

I've improved the PR #5252 in order to align writing document properties and custom properties in docx, odt and pptx

@jgm
Owner

jgm commented Feb 2, 2019

@agusmba can this issue now be closed? if not, what remains to be done?

@agusmba
Contributor

agusmba commented Feb 4, 2019

Well, I only implemented the writer part, and the title of this issue includes "and vice versa", so I guess it also requests the reader part. We could open a different issue for the reader part and close this one, or keep this one open.

@jgm
Owner

jgm commented Feb 4, 2019 via email


7 participants