Feature Request: DOCX header/footer converted to <header/> and <footer/> in HTML #5211

Open
hyprhare opened this issue Jan 10, 2019 · 10 comments

Comments

@hyprhare

hyprhare commented Jan 10, 2019

I'm curious as to why the header and footer info in a .docx file is not represented in the converted HTML output?

Seems to me that the header and footer information from a .docx file should be added into either <div id="header/footer" /> tags or the corresponding <header/> and <footer/> tags in the HTML output.

@mb21
Collaborator

mb21 commented Jan 10, 2019

Docx headers and footers are supposed to be repeated on each page, right? That's not the case with the HTML elements.

Also, from http://pandoc.org/MANUAL.html#description:

> Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other.

And pandoc's intermediate AST is not currently able to represent headers and footers.

@hyprhare
Author

Indeed, .docx headers and footers are repeated for every "page". When rendering .html files you are only ever viewing one "page". Thus the header and footer appear once, that's expected. I would say that's fine, at least the information would be carried over, rather than getting dropped on the floor... Perhaps, adding an option to generate multiple .html files for every page in a .docx would be more appropriate?

Regardless, pandoc's intermediate language not supporting headers and footers is the bigger hurdle here. Are there plans to incorporate headers and footers into it, per chance?

@mb21
Collaborator

mb21 commented Jan 10, 2019

> Are there plans to incorporate headers and footers into it, per chance?

No, currently not. Changing the AST is always a major undertaking, as all readers and writers need to be adjusted. That includes having a good way of outputting this to LaTeX etc.

It could also be argued that headers/footers are usually used to display current chapter name, page number etc. and headers/footers are thus part of the layout, not part of the document body – and pandoc doesn't aim to do layout conversions.

@hyprhare
Author

hyprhare commented Jan 10, 2019

I hear you on that; if the target format is missing a comparable representation of some concept in the source format, then you'll be unable to reliably do round-trip conversion between the two... So, in that sense, I agree with your sentiment in the linked ticket; I see no need to force target formats to represent everything available, if they have no comparable concept of it.

That said, the fact that not ALL potential formats have a comparable representation of a given concept doesn't mean that the internal, intermediary representation of that concept needn't exist. Rather, I would contend that EVERY (within reason) concept which is represented in ANY 2+ source formats should still be represented in the intermediate format (provided it's content of some kind as per the guidelines). That way, those target formats which can utilize it will, and those that cannot won't.

I wouldn't consider headers and footers to be layout/styling. They certainly do contain content, and thus should be "in-scope" according to the guidelines.

@jgm
Owner

jgm commented Jan 11, 2019

Headers and footers are definitely layout/styling, as I see it. Yes, they contain some content, but just storing this content usually won't be too helpful in reproducing the original document. (Unless you store a great deal of structural information too: page number goes on the right on right-side pages, on the left on left-side pages, company logo is centered, chapter name on the right on left-side pages, section name on the left on right-side pages, ...)

> Rather, I would contend that EVERY (within reason) concept which is represented in ANY 2+ source formats should still be represented in the intermediate format (provided it's content of some kind as per the guidelines).

It's the simplicity of our basic document model that allows us to handle so many different formats. If the AST were complex enough to represent everything that can be expressed in docx, LaTeX, or groff, well, I can't even imagine what that would look like, and I certainly wouldn't want to write a reader or writer for that. Lossless conversions (including round-trips) between very expressive formats are not a design goal.

So, I consider this out of scope as a request for an AST change. The most I would consider would be parsing header or footer content into metadata fields (which would not require an AST change). However, I suspect that most headers/footers make heavy use of fields for things like page numbers, and these aren't going to be representable in pandoc anyway.

@agusmba
Contributor

agusmba commented Jan 11, 2019

I think the main use case for wanting to extract information from the docx header/footer is related to extracting some kind of title/organization information (or similar), since company logos or page numbers are probably unnecessary.

This could be best achieved by having one or several document properties in the docx with said info, and having pandoc convert that into metadata (which could be rendered into yaml frontmatter, or html, etc.). Having this would improve the roundtrip experience if we also had #3034
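
To make the document-properties idea concrete, here is a minimal Python sketch that reads the core properties of a .docx and returns them as a plain dict of metadata. It is not how pandoc does (or would do) this. It assumes the properties live at the conventional `docProps/core.xml` path, and the function name `core_properties` and the output keys (`title`, `author`, ...) are made up for illustration. Custom properties are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical mapping from metadata key to core-property element.
FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
}


def core_properties(docx):
    """Return the non-empty core properties of a .docx as a dict.

    `docx` is a path or a binary file-like object. Assumes the part
    sits at the conventional docProps/core.xml location.
    """
    with zipfile.ZipFile(docx) as zf:
        if "docProps/core.xml" not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read("docProps/core.xml"))
    meta = {}
    for key, path in FIELDS.items():
        el = root.find(path, NS)
        # Skip properties that are absent or contain only whitespace.
        if el is not None and el.text and el.text.strip():
            meta[key] = el.text.strip()
    return meta
```

The resulting dict could then be written out as YAML front matter or passed on as metadata by whatever tooling wraps the conversion.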

@jgm
Owner

jgm commented Sep 26, 2019

As an alternative to putting these things in metadata fields, we could add a Div with class header at the beginning, and a Div with class footer at the end. (Here I'm referring to pandoc AST elements. If we wanted to, we could also make the HTML writer render the header Div as <header>, etc.)

@jkr is the expert on the docx reader; maybe he could comment on how hard it would be to add a feature that extracts this information.

Note also that in many cases it will be undesirable to have the header and footer put into the converted document, since these will consist of nothing more than page numbers or maybe a repetition of the document's title. So if we did add a feature like this we should probably put it under an extension.
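
For illustration, the Div idea can be sketched against pandoc's JSON representation of the AST, where a Div is encoded as `{"t": "Div", "c": [[identifier, classes, key-values], blocks]}`. The Python below is only a sketch of the shape such output might take, not an implementation of the docx reader. The helper names (`div`, `para`, `wrap_with_header_footer`) are invented here, and the `pandoc-api-version` value is a placeholder that would have to match the pandoc version in use.

```python
import json


def div(classes, blocks):
    """Build a pandoc JSON Div with an empty identifier and no key-value pairs."""
    return {"t": "Div", "c": [["", list(classes), []], list(blocks)]}


def para(text):
    """Build a Para from plain text, alternating Str and Space inlines."""
    inlines = []
    for i, word in enumerate(text.split()):
        if i:
            inlines.append({"t": "Space"})
        inlines.append({"t": "Str", "c": word})
    return {"t": "Para", "c": inlines}


def wrap_with_header_footer(doc, header_blocks, footer_blocks):
    """Return a copy of `doc` with a header Div first and a footer Div last.

    Empty header or footer content is skipped, so no empty Div is emitted.
    """
    blocks = list(doc["blocks"])
    if header_blocks:
        blocks.insert(0, div(["header"], header_blocks))
    if footer_blocks:
        blocks.append(div(["footer"], footer_blocks))
    return {**doc, "blocks": blocks}


# Placeholder API version: an assumption, adjust to your pandoc.
doc = {"pandoc-api-version": [1, 23, 1], "meta": {}, "blocks": [para("Body text")]}
wrapped = wrap_with_header_footer(doc, [para("Transcribed by A. Example")], [])
print(json.dumps(wrapped["blocks"][0]))
```

Because the header and footer arrive as ordinary Divs, a filter could drop or relocate them by class without any AST change.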

@jkr
Collaborator

jkr commented Sep 27, 2019

It shouldn't be hard technically -- there's a header.xml and footer.xml, with associated relations. But I would be pretty wary of putting it into a div as part of the document proper. It would take some judgement on our part to figure out whether things were meaningful or presentation. So if we were to do this, I'd say it should be metadata. But I'm still not sure how something like an author's name (in the header in MLA format) would appear. Maybe differentiating between auto fields (name, page) and custom fields?

@Zeppelin456

A few of the doc converters that I have come across (Tika, LibreOffice, etc.) put the header as text at the top of the document, the footer at the end of the document, and text boxes where they appear in the document (albeit only for HTML and TXT output). This tends to be easiest, as you know where to look, know all the content is there, and can move to header sections in the resulting document when required. It suits conversions to HTML and text in particular, as the headers are generally only required at the top and bottom when there are no "pages". Perhaps you could have parameters in the converter such as --docx-header=top --docx-footer=bottom --doc-textbox=inplace. At the moment the text is lost, and it is an important part of the document. On the rare occasion that there are different headers per page, could they be put in place where they relate to the text, or at the top with some kind of indicative text?

@dirkroorda

My use case is that I convert docx to tei-simple with pandoc, and there is essential info in the page header that is nowhere in the page content: the name of the transcriber.
Seeing that Pandoc does not transfer this information from docx to tei-simple, I think I have to extract it with custom code straight from the docx (which is not too difficult, since it is in the header_x_.xml) but it is a pity that Pandoc cannot preserve this information.
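
For anyone taking the same route, a minimal Python sketch of such custom extraction might look like the following. It uses only the standard library and assumes the header and footer parts are stored under `word/` with names like `header1.xml` or `footer2.xml`. The function name `header_footer_text` is made up for this example. It does not consult the relationship files, so it cannot tell which header belongs to which section, and any cached field results (for example a page number) come through as ordinary text.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed naming pattern for header/footer parts inside the package.
PART_RE = re.compile(r"word/(header|footer)\d*\.xml$")


def header_footer_text(docx):
    """Map each header/footer part name to its non-empty paragraph texts.

    `docx` is a path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(docx) as zf:
        for name in sorted(zf.namelist()):
            if not PART_RE.match(name):
                continue
            root = ET.fromstring(zf.read(name))
            paragraphs = []
            for p in root.iter(f"{{{W_NS}}}p"):
                # Join the text runs of one paragraph in document order.
                text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                if text.strip():
                    paragraphs.append(text)
            result[name] = paragraphs
    return result
```

Calling `header_footer_text("transcription.docx")` (a hypothetical file name) returns a dict keyed by part name, which can then be merged into the TEI output by hand or by a post-processing step.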

7 participants