Feature Request: DOCX header/footer converted to <header/> and <footer/> in HTML #5211
Comments
Docx headers and footers are supposed to be repeated on each page, right? That's not the case with the HTML elements. Also, from http://pandoc.org/MANUAL.html#description:
And pandoc's intermediate AST is not currently able to represent headers and footers.
Indeed. Regardless, pandoc's intermediate language not supporting headers and footers is the bigger hurdle here. Are there plans to incorporate headers and footers into it, by any chance?
No, currently not. Changing the AST is always a major undertaking, as all readers and writers need to be adjusted. That includes having a good way of outputting this to LaTeX etc. It could also be argued that headers/footers are usually used to display current chapter name, page number etc. and headers/footers are thus part of the layout, not part of the document body – and pandoc doesn't aim to do layout conversions.
I hear you on that; if the target format is missing a comparable representation of some concept in the source format, then you'll be unable to reliably do round-trip conversion between the two... So, in that sense, I agree with your sentiment in the linked ticket; I see no need to force target formats to represent everything available if they have no comparable concept of it. That said, just because not ALL potential formats have a comparable representation of a given concept, I don't agree that the internal, intermediary representation for that concept needn't exist. Rather, I would contend that EVERY (within reason) concept which is represented in ANY 2+ source formats should still be represented in the intermediate format (provided it's content of some kind, as per the guidelines). That way, those target formats which can utilize it will, and those that cannot won't. I wouldn't consider headers and footers to be layout/styling. They certainly do contain content, and thus should be "in-scope" according to the guidelines.
Headers and footers are definitely layout/styling, as I see it. Yes, they contain some content, but just storing this content usually won't be too helpful in reproducing the original document. (Unless you store a great deal of structural information too: page number goes on the right on right-side pages, on the left on left-side pages, company logo is centered, chapter name on the right on left-side pages, section name on the left on right-side pages, ...)
It's the simplicity of our basic document model that allows us to handle so many different formats. If the AST were complex enough to represent everything that can be expressed in docx, LaTeX, or groff, well, I can't even imagine what that would look like, and I certainly wouldn't want to write a reader or writer for that. Lossless conversion (including round-trips) between very expressive formats is not a design goal. So, I consider this out of scope as a request for an AST change. The most I would consider would be parsing header or footer content into metadata fields (which would not require an AST change). However, I suspect that most headers/footers make heavy use of fields for things like page numbers, and these aren't going to be representable in pandoc anyway.
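To make the metadata idea concrete, here is a minimal Python sketch of how already-extracted header and footer paragraphs could be written out as a YAML metadata block. It is not pandoc code, and the field names `header` and `footer` are assumptions chosen for illustration; pandoc defines no such fields. Strings are emitted with `json.dumps`, relying on JSON strings being valid double-quoted YAML scalars.

```python
import json


def to_yaml_front_matter(header_paragraphs, footer_paragraphs):
    """Render header/footer paragraphs as a YAML metadata block.

    The keys "header" and "footer" are hypothetical field names.
    """
    lines = ["---"]
    for key, paragraphs in (("header", header_paragraphs),
                            ("footer", footer_paragraphs)):
        if not paragraphs:
            continue  # leave out a field that has no content
        lines.append(f"{key}:")
        for text in paragraphs:
            # A JSON string literal doubles as a quoted YAML scalar.
            lines.append(f"- {json.dumps(text, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_yaml_front_matter(["Transcribed by A. Person"], ["Draft"]), end="")
```

Page-number fields and similar dynamic content are not handled here, which matches the concern above that they have no representation in pandoc.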
I think the main use case for wanting to extract information from the docx header/footer is extracting some kind of title/organization information (or similar), since company logos or page numbers are probably unnecessary. This could be best achieved by having one or several document properties in the docx with said info, and having pandoc convert that into metadata (which could be rendered into YAML front matter, or HTML, etc.). Having this would improve the round-trip experience if we also had #3034.
As an alternative to putting these things in metadata fields, we could add a Div with class
@jkr is the expert on the docx reader; maybe he could comment on how hard it would be to add a feature that extracts this information. Note also that in many cases it will be undesirable to have the header and footer put into the converted document, since these will consist of nothing more than page numbers or maybe a repetition of the document's title. So if we did add a feature like this, we should probably put it under an extension.
It shouldn't be hard technically -- there's a header.xml and footer.xml, with associated relations. But I would be pretty wary of putting it into a div as part of the document proper. It would take some judgement on our part to figure out whether things were meaningful or presentation. So if we were to do this, I'd say it should be metadata. But I'm still not sure how something like an author's name (in the header in MLA format) would appear. Maybe differentiating between auto fields (name, page) and custom fields?
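As a rough illustration of how accessible those parts are, here is a minimal Python sketch (not the pandoc docx reader, which is written in Haskell) that pulls paragraph text out of the header and footer parts of a .docx archive. It assumes the parts are stored as `word/header*.xml` and `word/footer*.xml`, as Word normally writes them. It ignores the relationships that say which header belongs to which section, and it skips field instructions such as page-number codes, so only literal text runs (including any cached field results) come through. The function name is made up for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by header and footer parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_header_footer_text(docx_file):
    """Return {"header": [...], "footer": [...]}, one string per paragraph.

    docx_file may be a path or a binary file-like object.
    """
    result = {"header": [], "footer": []}
    with zipfile.ZipFile(docx_file) as zf:
        for name in sorted(zf.namelist()):
            match = re.fullmatch(r"word/(header|footer)\d*\.xml", name)
            if not match:
                continue
            root = ET.fromstring(zf.read(name))
            for para in root.iter(f"{{{W_NS}}}p"):
                # Join the text runs (w:t) that make up one paragraph.
                text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
                if text.strip():
                    result[match.group(1)].append(text)
    return result
```

Deciding which of the extracted strings are meaningful (a transcriber's name) and which are presentation (a repeated title) is the judgement call mentioned above; this sketch does not attempt it.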
A few of the doc converters that I have come across (Tika, LibreOffice, etc.) put the header as text at the top of the document, the footer at the end of the document, and text boxes where they appear in the document (albeit only for HTML and txt output). This tends to be easiest, as you know where to look, know all the content is there, and can move it to header sections in the resulting document when required. It suits conversions to HTML and text in particular, where the headers are generally only required at the top and bottom as there are no "pages". Perhaps you could have a parameter in the converter such as --docx-header=top --docx-footer=bottom --doc-textbox=inplace. At the moment the text is lost, and it is an important part of the document. On the rare occasion that there are different headers per page, could they be put in place where they relate to the text, or at the top with some kind of indicative text?
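The placement behaviour proposed here can be sketched in a few lines of Python. This is only an illustration of the suggestion: the options are the commenter's proposal rather than existing pandoc flags, and the function name, the "omit" value, and the list-of-blocks representation are assumptions made for the example.

```python
def assemble_document(body, header=None, footer=None,
                      header_pos="top", footer_pos="bottom"):
    """Return the body blocks with header/footer placed around them.

    Positions are "top", "bottom", or "omit" (hypothetical values that
    mirror the proposed --docx-header / --docx-footer options).
    """
    placements = {"top": [], "bottom": []}
    for block, pos in ((header, header_pos), (footer, footer_pos)):
        if block is None or pos == "omit":
            continue
        if pos not in placements:
            raise ValueError(f"unknown position: {pos!r}")
        placements[pos].append(block)
    return placements["top"] + list(body) + placements["bottom"]
```

With the defaults, the header lands before the first body block and the footer after the last one, which is the behaviour described for Tika and LibreOffice above. Per-page headers and in-place text boxes would need position information that this sketch does not model.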
My use case is that I convert docx to tei-simple with pandoc, and there is essential info in the page header that is nowhere in the page content: the name of the transcriber. |
I'm curious as to why the header and footer info in a .docx file are not represented in an HTML converted output? Seems to me that the header and footer information from a .docx file should be added into either <div id="header/footer" /> tags or the corresponding <header/> and <footer/> tags in the HTML output.