Ability to convert yaml front matter into custom properties in docx and vice versa #3034
Comments
+++ Ryan Watters [Jul 20 16 10:14 ]:
|
@rdwatters Would you be able to upload a docx file with some of these properties, and a pointer to which particular properties you're interested in? What you're requesting might well be possible, but I'd have to see where in the maze of XML those properties reside. |
Absolutely @jkr. I'll put together some screenshots at work later today. |
Here is a link to a docx created in MS Word for Mac Version 15.24; my hope is that these custom properties do not change depending on whether it's Windows/OSX: https://www.dropbox.com/s/86qs1hfng6l601r/pandoc-sample.docx?dl=0 Not sure if it helps, but...
I guess I would have to dig deeper to see how these different key-values would map, but things like "keywords" in the .docx property (usually "tags" in a I know I'm throwing out a ton in this single comment, so let me know how I can help/clarify. |
So, just so I understand this -- would you like to have the top-level metadata map into the properties?
or do you want to have the ability to have a different title, like you do in the supplied document?
I don't know what Sharepoint is, or how 365 works -- are they important to this, or are we just trying to get the metadata into props? |
Or is it actually the info under "custom" (pandoc1="wicked", pandoc2="awesom") that you're most concerned with? There are a lot of properties here, so I want to make sure I'm looking at the right thing. |
@jkr Both excellent questions that demonstrate how terrible my examples were. The following is long-winded but thorough. I really, really appreciate you taking the time to look into this.
SharePoint is significant in that it's the #1 most popular intranet tool (>50% of Fortune 500 companies) and is MS's native document management system. It is also notoriously difficult to extract content from. Companies invest enormous amounts of time tagging and cataloguing Word documents with metadata that is embedded directly into the document but is then only easily digestible by a further SharePoint instance. Ideally, Pandoc would be able to do these conversions bidirectionally between Word and MD. Again, I appreciate that this is a huge ask.
Here is a temp repo that houses the Word versions used in the following screenshots:
1. (see word-only.docx) So here are the properties of the doc created in Word. These properties (Title, Subject, Author, Manager, Company, Category, Keywords, Comments) are all out-of-the-box: Here is a shot of the added Custom Property (for this example, the property is "CustomProperty", data type is text, and value is "Hello Pandoc"):
2. So now that we have a Word document (created locally) with the out-of-box properties and one custom property added, the doc can be added to a SharePoint list, which in the admin view looks like the following screenshot. Note how "Comments" has been converted to "Description" and the "Creator" column (I'm using DCMI metadata for this example, which is an out-of-the-box content type in SharePoint) is auto-populated from the Author.
3. You can see in the last screenshot that "Publisher" doesn't have a value. These values can be updated within the SharePoint UI and are embedded in the Word document itself:
4. And here is the resulting update in the SharePoint list (keep in mind these lists will often have tens of thousands of Word documents):
5. (see word-with-properties-added-in-sp.docx) Now that these properties are in the document, I'll download a copy of the doc and retitle it to
6. (COPY-word-with-properties-added-in-sp.docx) And finally, just to double-check that both the custom property (i.e., CustomProperty) and the properties we added via SharePoint directly (i.e., the DCMI "Contributors" and "Publisher") are actually embedded in the document, I'll upload the locally renamed version of the original document as well as a copy to the SharePoint list, which shows the properties (i.e., metadata) actually travel with the document:
So what would be the ideal workflow? Having the document I created above in Word write back to a markdown file with all the out-of-the-box properties (listed above), the custom properties (in this case, just "CustomProperty"), and the other content type properties (in this case, "Publisher" and "Contributors") as typical key-value pairs, and vice versa. All this insanity because the last 5 years of markdown evangelism have led to exactly 0 converts in my last three companies. Hopefully that makes sense. Thanks again! |
So now I'm confused. It seems like you're asking for the ability to go docx -> markdown, while the title suggests you're interested in going markdown -> docx. While I appreciate the completeness of your description, I'm still unsure of what you want pandoc to do. A simple input file with expected output file would definitely help. Now, assuming you want something bidirectional, let me mention some concerns.
|
Agreed w/r/t title of this feature. I have updated it accordingly. Now in terms of your responses:
|
I've stumbled across this issue today after struggling with docx Custom Properties. It would be amazing if pandoc could handle them. Much like @rdwatters I work at a company that has all of its documentation in If it helps clarify things, here's my immediate use case: I currently use markdown with YAML at the top of the document like so:
I convert these to docx using a command like this: My issue is that there are custom properties in
This is of course just an example, I don't know what the correct syntax would be. I'd expect the output to look like:
Obviously like @rdwatters said, this is a big ask (just glancing at the raw xml of the |
I guess what would really help is the exact XML output needed for different cases... |
Agreed, sorry I left that out of my example. I first created
These DocProperty values are inserted using word as 'Fields' in the headers/footers. The xml for those looks like this:
After running the pandoc command in my previous comment, these fields remain in the headers/footers of the What I'd like to be able to do is specify custom properties like the above somehow in my original markdown document so that the fields in my |
I too would like this feature to be added to the pandoc docx writer. However, I don't think the issue is 'bite-sized' at this moment. May I suggest a first step towards the full implementation suggested in the above posts: implementing all default metadata in the docx writer in a similar way as the EPUB writer uses it. Microsoft uses Dublin Core in core.xml
so it would be nice to populate those values from within the YAML-metadatablock in a markdown file, or a separate xml as could be used for EPUB.
Supporting all other fields from DCMI (http://dublincore.org/documents/dc-xml-guidelines/) would be nice. This would also include fields for a unique document identifier and a status field. Step 2 could be implementation of populating 'custom.xml':
... but that may be too hard for a simple implementation. Some of the custom fields are actually prepopulated but seem language-dependent. I'm Dutch and use a multilingual MAC-Word 16.8 (171210) set to Dutch (nl-NL). |
This would be a very nice feature. At least in the We would be removing the need to edit headers and footers in word after conversion to enter the correct "title" or "department" etc. |
Under #2839 a commit has been included to support odt custom properties (on the writer). It would be great to have something similar for docx. |
Looks like we'd need, in
<Override PartName="/docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/></Types>
In
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties" Target="docProps/custom.xml"/>
and then we'd need
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Foo bar">
<vt:lpwstr>hello there</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Zoopie">
<vt:lpwstr>1123</vt:lpwstr>
</property>
</Properties> |
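For anyone who wants to experiment with the part shown above, here is a minimal Python sketch that produces the same shape of custom.xml from a plain dict. It is not pandoc's implementation. The function name `build_custom_xml` is made up for illustration, and it assumes every value is written as a `vt:lpwstr` string with `pid` values counting up from 2, as in the sample.

```python
import xml.etree.ElementTree as ET

CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
# Format ID copied from the sample XML above.
FMTID = "{D5CDD505-2E9C-101B-9397-08002B2CF9AE}"


def build_custom_xml(props):
    """Serialize a dict of properties into a custom.xml string (hypothetical helper)."""
    # Emit the custom-properties namespace as the default one and "vt" as a prefix.
    ET.register_namespace("", CUSTOM_NS)
    ET.register_namespace("vt", VT_NS)
    root = ET.Element(f"{{{CUSTOM_NS}}}Properties")
    # pid starts at 2, matching the sample.
    for pid, (name, value) in enumerate(props.items(), start=2):
        prop = ET.SubElement(
            root,
            f"{{{CUSTOM_NS}}}property",
            {"fmtid": FMTID, "pid": str(pid), "name": name},
        )
        # Assumption: every value is stored as a string (vt:lpwstr).
        ET.SubElement(prop, f"{{{VT_NS}}}lpwstr").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_custom_xml({"Foo bar": "hello there", "Zoopie": "1123"}))
```

The output omits the XML declaration line, and the content-type override and relationship entries mentioned above would still have to be added separately.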
So far, we don't actually write any custom properties, but we have the infrastructure to add this. See #3034.
A general question affecting this and #2839: how should we identify custom properties? In #2839 I just took everything except title, author, date, and lang as a custom property. But perhaps instead we should look for a special section
This would avoid getting properties like |
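To make the two options in the comment above concrete, here is a rough Python sketch of both selection rules. It only illustrates the design question and is not how pandoc does it. `split_metadata` and the `custom_section` parameter are hypothetical names, and the metadata is assumed to be an already-parsed dict.

```python
# Keys treated as standard in the #2839 approach described above.
STANDARD_KEYS = {"title", "author", "date", "lang"}


def split_metadata(meta, custom_section=None):
    """Return (standard, custom) dicts from parsed front matter (hypothetical helper).

    With custom_section=None, every key outside STANDARD_KEYS is custom.
    With a section name, only the keys nested under that section are custom.
    """
    if custom_section is not None:
        custom = dict(meta.get(custom_section, {}))
        standard = {k: v for k, v in meta.items() if k != custom_section}
        return standard, custom
    standard = {k: v for k, v in meta.items() if k in STANDARD_KEYS}
    custom = {k: v for k, v in meta.items() if k not in STANDARD_KEYS}
    return standard, custom


if __name__ == "__main__":
    meta = {"title": "Report", "author": "A", "department": "QA"}
    print(split_metadata(meta))
```

The first rule needs no extra syntax but turns every unrelated key into a document property. The second is explicit but asks authors to nest their keys.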
Well, what makes these metadata keys more custom than others? If anything, they seem to belong to the top level, together with Maybe a question to anybody in this thread: would it hurt to put all properties in the docx? Even |
Good question. Personally I like the simplicity of having everything that is not an "official" property of the target format as a custom property, like what was done for #2839. I guess having a special section would work too, although some property duplication in the front matter could occur in this case. Whatever is decided in the end should be the same for ODT and DOCX. Thank you for tackling this!! |
Probably not. As long as the custom property isn't mapping to any sort of library/site column in SharePoint, it shouldn't matter. (At some point, it has to be the Pandoc user's responsibility to match these.)
If I understand this correctly, yes, I agree. The standard properties, I believe, are everything in the "summary," as shown in the screenshot above. One might, after some digging, start to suspect that Microsoft made this unnecessarily complex intentionally 😉 |
@jgm How could I help move this one forward? I see a couple "core" (core.xml) properties probably missing:
The extended properties (app.xml) seem to include:
I guess any additional property should go to the custom.xml. Regarding the use of a special section for custom properties, I think it wouldn't be necessary, but I'm not against it either. We could add all not-ignored-properties (not ending in BTW docx custom properties can be strings, but also numbers, dates and booleans:
In case we want to take that into account. |
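If typed values were taken into account, the choice of value element could look roughly like the Python sketch below. It is an illustration only. `vt_element` is a hypothetical helper, the mapping to `vt:bool`, `vt:i4`, `vt:r8`, `vt:filetime` and `vt:lpwstr` is one plausible choice, and datetimes are assumed to be naive values already in UTC.

```python
import datetime


def vt_element(value):
    """Pick a docPropsVTypes tag and text for a Python value (hypothetical helper)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "vt:bool", "true" if value else "false"
    if isinstance(value, int):
        # Assumption: the value fits in a 32-bit signed integer.
        return "vt:i4", str(value)
    if isinstance(value, float):
        return "vt:r8", repr(value)
    if isinstance(value, datetime.datetime):
        # Assumption: naive datetime that is already in UTC.
        return "vt:filetime", value.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Everything else falls back to a string.
    return "vt:lpwstr", str(value)


if __name__ == "__main__":
    for sample in (True, 1123, 1.5, "hello there"):
        print(sample, vt_element(sample))
```

Falling back to a string for anything unrecognized keeps the simple all-strings behaviour as the default.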
I added some basic support for this in a PR, trying to replicate what was done for ODT. |
Speaking about the docx writer: I think I'll ignore extended properties for now (app.xml), but having support for core.xml already in pandoc, I think it would be useful to take advantage of it instead of putting all as custom properties. Docx core properties title, creator, keywords, created, modified are already supported by pandoc (creator is author in pandoc properties, and the last two are calculated automatically). However there are additional properties available which we could equate with some pandoc ones:
There are additional docx core properties that we could parse from pandoc directly without changing their names:
This is a non-exhaustive list, but I think I got the main entries. I'll try to modify the PR accordingly. UPDATE: Identifier and version in core.xml seem to get lost or are not accessible from within Word. Revision is also giving me grief in core.xml; I can completely break Word with it. I will remove these three from core (note that they can still be used as custom properties). I'm not sure if language is doing anything in core; I do not see a difference one way or another. Should I also move it to custom? Subject is quite different from subtitle and should not be made equivalent. Thanks @HeirOfNorton |
I've improved the PR #5252 in order to align writing document properties and custom properties in docx, odt and pptx |
@agusmba can this issue now be closed? if not, what remains to be done? |
Well, I only implemented the writer part, and the title of this issue includes and vice versa, so I guess it also requests the reader part. We could open a different issue for the reader part and close this one, or keep this one open. |
That's fine, we can keep this open -- I just wanted to
summarize what was still needed: reading custom
properties in docx into pandoc metadata.
|
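As a starting point for the remaining reader part, the Python sketch below shows how custom properties could be pulled out of a docx into a flat dict outside of pandoc. It is not pandoc code. `read_custom_properties` is a hypothetical name, and it assumes the properties live in the `docProps/custom.xml` part with one value element per property, as in the sample XML earlier in this thread. All values come back as strings.

```python
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
CUSTOM_PART = "docProps/custom.xml"


def read_custom_properties(docx):
    """Return {name: text} for the custom properties in a docx (hypothetical helper).

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        # Documents without custom properties have no such part.
        if CUSTOM_PART not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read(CUSTOM_PART))
    props = {}
    for prop in root.findall(f"{{{CUSTOM_NS}}}property"):
        # The first child is the typed value element, e.g. vt:lpwstr.
        value = prop[0] if len(prop) else None
        props[prop.get("name")] = value.text if value is not None else None
    return props
```

The resulting dict could then be merged into YAML front matter by whatever rule is chosen for telling custom keys from standard ones.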
I think this is pretty self-explanatory, but the ability to set custom properties in a Word document would be a total game-changer, especially for those of us working for companies so entrenched in the MS garden that we use SharePoint as a DMS. I have hundreds of md files that I'd love to convert to Word and have them retain the custom key in the key-value pair. Then it's just a drag and drop into a SharePoint library, which should pick up those custom properties and map them correctly for list views, etc.
I appreciate this is a very big ask. Cheers.