Feature Request: Add support for chunked (multiple file) HTML and HTMLHelp. #6122
Comments
This is not really just HTML. You may want to chunk up a large Markdown file into smaller Markdown files too. |
Hm, maybe the first step would be writing a format-independent function splitIntoChunks :: FilePath -> Int -> Pandoc -> [(FilePath, Pandoc)] where the Int parameter is the heading level to split at, and the FilePath is a file path template to be used (e.g. This function would split up the document into sections and rewrite any internal links so that they point to the correct paths. Not a hard thing to write. Perhaps there should also be an option for adding "next," "previous," and "up" links to each chunk, as in the HTML output produced by texinfo? We could use arrows instead of the words "Next", "Previous", and "Top" to avoid English-centrism? Just adding this to Shared would be helpful. Then we'd need to think about how to integrate it onto the command line. Perhaps the simplest approach would be this: if the output file is |
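A minimal sketch of the splitting idea described above, written in Python on a toy document model rather than against pandoc's Haskell AST. Everything here is an assumption made for illustration: the `Block` type, the `{n}` placeholder in the path template, and the `[label](#id)` link syntax are hypothetical and are not pandoc's API.

```python
import re
from typing import NamedTuple


class Block(NamedTuple):
    level: int  # heading level; 0 means ordinary body text
    ident: str  # heading identifier ("" for body text)
    text: str   # heading title or body text; links look like [label](#id)


def split_into_chunks(template: str, level: int,
                      blocks: list[Block]) -> list[tuple[str, list[Block]]]:
    """Start a new chunk at each heading of `level` or shallower,
    then rewrite internal links that now cross a file boundary."""
    chunks: list[list[Block]] = [[]]
    for b in blocks:
        if 0 < b.level <= level and chunks[-1]:
            chunks.append([])
        chunks[-1].append(b)

    # Hypothetical template syntax: "{n}" is replaced by the chunk number.
    paths = [template.format(n=i) for i in range(len(chunks))]

    # Map each heading identifier to the file that now contains it.
    home = {b.ident: p for p, chunk in zip(paths, chunks)
            for b in chunk if b.ident}

    def fix(text: str, here: str) -> str:
        def repl(m: re.Match) -> str:
            target = home.get(m.group(1))
            if target is None or target == here:
                return m.group(0)  # unknown id or same file: leave as is
            return f"({target}#{m.group(1)})"
        return re.sub(r"\(#([^)]+)\)", repl, text)

    return [(p, [b._replace(text=fix(b.text, p)) for b in chunk])
            for p, chunk in zip(paths, chunks)]


doc = [
    Block(1, "intro", "Intro"),
    Block(0, "", "See [setup](#setup)."),
    Block(1, "setup", "Setup"),
    Block(0, "", "Back to [intro](#intro) or [here](#setup)."),
]
for path, chunk in split_into_chunks("ch{n}.html", 1, doc):
    print(path, [b.text for b in chunk])
```

Links that stay inside their own chunk are left untouched, so only cross-file references gain a file name.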
This would be very useful. Outputting a zip file is simple, but the first thing any makefile or batch file is going to have to do is unzip it in order to further process it. How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder. Or maybe just an option that means output a folder of files instead of one file (that is, chunked). This is more verbose, but it's clearer what you are doing. An option for next, previous, and up links (using arrows) would be nice. For HtmlHelp, we also need to create the project (.hhp), content (.hhc), and index (.hhk) files. Perhaps HtmlHelp is a separate issue, and if you want me to create a new issue specifically for it I can do so. But any HtmlHelp writer will need to make use of the chunked html output option, so it's good to think about how to integrate both of these into the command line. For that matter, epub output is related as well. |
That's a possibility. I like the idea of keeping the simple invariant that pandoc produces one file, but I can see this would be more convenient. |
An option for next, previous, and up links (using arrows) would be nice.
I think this is a job for the template, assuming each file would be run
through the template separately. Pandoc could add metadata fields
`this-file: NAME`, `prev-file: NAME`, `next-file: NAME` so that people can
include and design those links if and as they want them in the template.
|
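A small sketch of how the per-file metadata fields proposed above could be derived from an ordered list of output files. The field names `this-file`, `prev-file`, and `next-file` come from the comment; the function name and the sample file names are assumptions for illustration, and this is written in Python rather than as part of pandoc.

```python
def nav_metadata(files: list[str]) -> list[dict[str, str]]:
    """Return one metadata dict per output file, in document order.

    The first file gets no prev-file and the last gets no next-file,
    so a template can test for their presence before emitting a link.
    """
    meta = []
    for i, name in enumerate(files):
        fields = {"this-file": name}
        if i > 0:
            fields["prev-file"] = files[i - 1]
        if i + 1 < len(files):
            fields["next-file"] = files[i + 1]
        meta.append(fields)
    return meta


for fields in nav_metadata(["intro.html", "usage.html", "faq.html"]):
    print(fields)
```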
That makes sense to me! |
In order to facilitate building static sites (or dumping to templates used in static sites such as For example, I might want to run the command like:
pandoc -f markdown -t html5 \
  --chunks chapters --chunk-dest ~/projects/some-site/templates/my-book/ \
  {first,last,second}-chapter.md
And the output would be:
Or, there might be a way to specify other patterns so that someone could use config like chunk-name: '{{ section[0]["name"] }}/{{ section[1]["name"] }}{{ ext }}' To get output like |
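To make the naming-pattern idea concrete, here is a rough Python sketch that builds an output path from a chunk's chain of section names plus an extension, in the spirit of the `chunk-name` pattern above. The `slug` and `chunk_path` helpers and the sample titles are hypothetical; this is not an existing pandoc option, and the result is only one plausible reading of what such a pattern might produce.

```python
import re


def slug(title: str) -> str:
    """Lower-case a title and collapse runs of non-alphanumerics to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def chunk_path(section: list[str], ext: str) -> str:
    """Join the slugs of the enclosing sections into a nested file path."""
    return "/".join(slug(name) for name in section) + ext


print(chunk_path(["First Chapter", "Getting Started"], ".html"))
# prints: first-chapter/getting-started.html
```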
It may make sense for Pandoc to work with trees of files instead of single streams:
|
This functionality might also be useful in filters. |
I'm very excited for this possibility. Does being in "next release" mean that it is actually decided to implement it? |
I'm afraid the "next release" tag has been aspirational so far... |
Understood. That’s why I asked. Thanks so much for all your wonderful work. |
It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs The HTML output is split into different files and crossreferences work. I guess this tells me there's some way of doing this now ... any ideas how? |
It apparently happens here. I don't know R and it's 1100 lines ... there's a lot going on here. |
Fwiw, somebody made a pretty comprehensive filter-based version of multiple-output html files that fixes crossreference urls ... I haven't tested: |
I'm pretty sure this isn't the best path, but epub files are made of multiple chunks of .xhtml. For a while I've been doing this by generating .epub files with Pandoc and then using a task runner to automate the unzipping > extracting > parsing > processing > moving > fixing > renaming of the xhtml files as needed. That's an ugly hack I made a couple of years ago to solve this need for a very specific case; maybe something like that could work for you in the meantime. |
Thanks @barriteau -- do you have your code for that? As may be the case for others, I'm making large html docs and having performance issues. There's only so much improvement I can get out of lazy loading images and the like ... mostly it's MathJax. But there's no significant reason for it to be one-file other than Pandoc. A stop-gap solution until this feature is implemented would be most welcome :) |
Yup, but I'm afraid that in its current state it's of no use to you: it's an old Grunt task with a lot of extra, specific routines for other stuff. I'll take a look at it to see if it's worth cleaning up for sharing and reuse. I'll let you know :) |
I've looked at how Bookdown does it before. Part of the reason it is so complicated is because it supports a fair number of Pandoc options, which changes the output that it then has to process. In fact, I use Bookdown currently. One of the things that makes me hopeful about Pandoc making this change is that it might fix a couple of problems I've got with Bookdown related to its splitting process. |
I still think my Feb. So, rough plan would be
|
@jgm Still a somewhat vague idea of mine – do you think it’s possible to make your ideas more general? For example:
|
Something else to consider: In bookdown you can specify to split the HTML up by chapter, by section, or by file. I like that flexibility, fwiw, especially the split by file option. Split by chapter sometimes gives me way too long of webpages. Split by section sometimes leaves me with nothing but a chapter title on one webpage, and then you've got to go to the next webpage to get to the next section. Split by file lets me decide. |
@rauschma we already have a MediaBag to contain assets used by the document. These get passed through the plumbing in PandocMonad, so we shouldn't need to represent them explicitly. But I take the core of your idea to be that we might want to support "trees" (directories containing multiple documents) in both input and output (my proposal above is output only). This would require, at least, the change noted in https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ. |
@jtbayly - I don't know what "split by file" would really mean, when you're splitting up a Pandoc document. (It doesn't come chunked into files.) |
But it accepts multiple files as input, doesn't it? |
jgm, I think you are referring to your Feb 6 comment, not Feb 20. <rant>I detest github's "relative" dates. When I see "commented 22 days ago", I have no idea when that was without looking at a calendar. And "2 months ago" is meaningless.</rant> In terms of planning, how would the TOC be done, and could that be templated as well? I'm thinking formats such as epub and htmlhelp need a TOC file in one form or another, and it would be nice if the output zip file (or directory) contained the TOC information in a form that could be turned into the required file. Even if you only intend to use the chunked html as a static web-site, you probably want to generate a TOC someplace in your site, perhaps a banner or column on every page. This file should respect the Another question I have is how would I create an index. Here I am referring to an alphabetical index like you might see at the end of a book, not a TOC. Epub, HtmlHelp, and pdf all support such a concept. AFAIK Pandoc does not support an index natively. This may be a separate issue, and off-topic here, but I'd be interested in any thoughts you have about how to do this, even if it involves a filter and/or post-processing the output zip file/directory. |
@jtbayly Yes, you can specify multiple files as input; however, everything is concatenated before parsing, and the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ); moreover, the AST doesn't contain slots to represent source positions. A 'Pandoc' is an abstract representation of a document; you can get the same 'Pandoc' from multiple files or from one. |
@dm413 Yes, we need to figure out how to deal with the TOC. I think the simplest option is to generate a TOC for the whole document (tree) and put it in one of the generated files. But this may not be the best approach if you want the TOC in a side banner. As for an index, that's a separate issue in a way, since you could want an index even with non-chunked output. Currently there's no built-in way to construct one, but it's certainly possible to use a filter to define an indexing system. One difficulty with building in a general index system is that the requirements tend to be format-dependent. If you want, you can create a separate issue for indexes on this tracker (if there isn't one already). |
I did a quick search, there is issue #6415 Built-in support for indices? |
For chunking, I’d prefer:
Doing templating well is difficult and is maybe better done via an external general-purpose programming language (vs. via configuring Pandoc declaratively). Output:
Input – two options:
Open question:
|
This module provides functions to split Pandoc documents into chunks to be rendered in separate files, e.g. one per section. Internal identifiers are rewritten appropriately to point to the new locations. See #6122.
I've written an experimental Chunks module for generic chunk-ing (issue6122 branch). |
I'm working on this feature now in the |
Please see https://pandoc.org/chunkedhtml-demo/ for a demo of the current code. Comments welcome. |
Thanks for your work on this. The demo output looks great. The cross-page links work. There are navigation links at the top. It's quite usable. How does this work in practice? Is any of this template driven? What command line options exist? For example,
One issue for me: is the TOC available in a format that can be massaged into other formats? For example:
Can any of this be done with templates? You may not have gotten to this stuff yet -- which is fine. Just want to see where we are and how this might develop. Thanks! |
We have the open PR #8485: it needs adjustments if branch |
So far, the section splitting level is determined by For this demo I used
Yes, all the link rendering is done in the template, so you can remove them or change them.
<nav id="sitenav">
<div class="sitenav">
<span class="navlink">
$if(up.url)$
Up: <a href="$up.url$" accesskey="u" rel="up">$up.title$</a>
$endif$
</span>
<span class="navlink">
$if(top)$
Top: <a href="$top.url$" accesskey="t" rel="top">$if(toc-title)$$toc-title$$else$Contents$endif$</a>
$endif$
</span>
</div>
<div class="sitenav">
<span class="navlink">
$if(next.url)$
Next: <a href="$next.url$" accesskey="n" rel="next">$next.title$</a>
$endif$
</span>
<span class="navlink">
$if(previous.url)$
Previous: <a href="$previous.url$" accesskey="p" rel="previous">$previous.title$</a>
$endif$
</span>
</div>
</nav>
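For readers wondering where values such as `up.url`, `next.title`, or `top` might come from, here is an illustrative Python sketch that derives them from an ordered list of chunks. It is an assumption about one reasonable way to compute them, not a description of pandoc's actual logic; the tuple layout and the sample file names are made up.

```python
def nav_links(chunks: list[tuple[int, str, str]]) -> list[dict]:
    """chunks: (heading level, title, url) in document order.

    chunks[0] is treated as the top page. "up" is the nearest earlier
    chunk whose heading level is shallower than the current one.
    """
    def link(c: tuple[int, str, str]) -> dict:
        return {"title": c[1], "url": c[2]}

    out = []
    for i, chunk in enumerate(chunks):
        ctx = {"top": link(chunks[0])}
        if i > 0:
            ctx["previous"] = link(chunks[i - 1])
        if i + 1 < len(chunks):
            ctx["next"] = link(chunks[i + 1])
        for earlier in reversed(chunks[:i]):
            if earlier[0] < chunk[0]:
                ctx["up"] = link(earlier)
                break
        out.append(ctx)
    return out


pages = [
    (0, "Contents", "index.html"),
    (1, "Intro", "intro.html"),
    (2, "Setup", "setup.html"),
    (1, "Usage", "usage.html"),
]
for ctx in nav_links(pages):
    print(sorted(ctx))
```

Keys that do not apply are simply absent, which matches how the template guards each link with `$if(...)$`.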
I should implement sensitivity to
We do have a data structure with all of this. I could maybe provide it in JSON form as a template variable? Not sure what would be the best way to make it available. Perhaps having it accessible from Lua is best.
I plan to modify the templates to make it possible to include the TOC on every page if you want. (This would be the full TOC, though, not, say, just a section. I think that's what is most useful, no?) |
awesome.
I don't know enough about Pandoc templating to know whether we could use that to generate the files. If so, that would be a nice solution.
I'm not sure what's the best method either. I haven't done much with templates in Pandoc, so I'm probably not the best person to consult about this. It would be great if we could directly implement the HTMLHelp project and content files using templates, though I'm guessing we'd have to run Pandoc three times, once to generate the content (html files), and twice more to generate the project and content files (each time with a different template). I don't know if the template can do that -- the content file is kind-of xml, but the project file looks more like an ini file. Lua is also an option, and could presumably generate the content and project files in a single pass with the html.
I agree the full TOC is what most people would want. I'm not so sure about including it on every page though. Isn't that usually done by generating a separate navigation file, and referencing that in an iframe or something? I haven't done this sort of thing in many years, and html and css have changed quite a bit in the meantime, so maybe it's done differently now. |
I'm setting it up to produce a json hierarchical sitemap in the same directory. |
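As a rough illustration of what a hierarchical JSON sitemap could look like, the Python sketch below nests a flat, ordered list of sections by heading level. The key names (`title`, `path`, `children`) and the sample file names are assumptions for illustration and are not claimed to match the file pandoc actually writes.

```python
import json


def build_sitemap(entries: list[tuple[int, str, str]]) -> list[dict]:
    """entries: (heading level, title, path) in document order.

    Levels are assumed to start at 1. Each node is attached to the
    most recent node with a shallower level.
    """
    root = {"title": None, "path": None, "children": []}
    stack = [(0, root)]  # (level, node); the root sits at level 0
    for level, title, path in entries:
        node = {"title": title, "path": path, "children": []}
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


sections = [
    (1, "Intro", "intro.html"),
    (2, "Setup", "setup.html"),
    (1, "Usage", "usage.html"),
]
print(json.dumps(build_sitemap(sections), indent=2))
```

A tree in this shape could then be post-processed into other TOC formats, such as the HTMLHelp content file discussed earlier in the thread.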
I'm calling this issue closed. Please test using the nightly at |
Reopening to explore this idea: I had introduced It occurred to me that we might be able to simplify this. Suppose we said that the splitting of chunked HTML output was determined by So the question is whether there's any reason to allow chunking that is less fine-grained or more-fine-grained than the TOC-depth. For example,
|
I can say with certainty that the flexibility would be beneficial to me. In particular, it would look like the second scenario you outlined. Or it would look like something I described above, where I just want to be able to manually control where the splits happen. Sometimes in the same book a second level header is followed immediately by a third level header, without any paragraphs in between and other times it has what amounts to its own chapter. There’s no good way to assume when the 2nd level header should be broken out separately to its own page or be bundled in with the following 3rd level header and its content. Manual is how I want to be able to do it. |
Most of the time I want the TOC to match the chunking level ( But occasionally I have lower level sections that I want in the TOC but don't want to make into a separate chunk ( So I would prefer to keep both these command line options. If it would be possible to make Note that none of these options allow you to split at different levels in different parts of the document, or have different toc depths in different parts of the document. Which seems to be what @jtbayly is looking for? This has never been possible in pandoc; the I haven't had a chance to try out the nightly build yet. I'll do that in the next day or so. Thanks for your work on this. |
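The difference between the chunking level and the TOC depth can be sketched as follows. This is a toy Python model, not pandoc code: `chunk_level`, `toc_depth`, and the `chunkN.html` file names are hypothetical names chosen for the example.

```python
def toc_targets(headings: list[tuple[int, str]], chunk_level: int,
                toc_depth: int) -> list[tuple[str, str]]:
    """headings: (level, identifier) in document order.

    A heading at or above chunk_level starts a new file. A heading at
    or above toc_depth gets a TOC entry. Entries deeper than the chunk
    level link to an anchor inside the enclosing file.
    """
    toc = []
    current = "index.html"  # holds anything before the first split
    n = 0
    for level, ident in headings:
        starts_chunk = level <= chunk_level
        if starts_chunk:
            n += 1
            current = f"chunk{n}.html"
        if level <= toc_depth:
            href = current if starts_chunk else f"{current}#{ident}"
            toc.append((ident, href))
    return toc


outline = [(1, "a"), (2, "a1"), (3, "a1x"), (1, "b")]
print(toc_targets(outline, chunk_level=1, toc_depth=2))  # TOC finer than chunks
print(toc_targets(outline, chunk_level=2, toc_depth=1))  # chunks finer than TOC
```

In the first call the level-2 section appears in the TOC but stays inside its chapter's file; in the second it gets its own file but no TOC entry.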
I prefer the flexibility of being able to do chunking level independently of TOC level. I can imagine wanting more fine-grained and less fine-grained chunking than the TOC, depending on my goals with the TOC. (I'll try to test the nightly build soon—thanks for your work on this @jgm!) |
To clarify, I don't need variable or manually modifiable TOC depth. I just want to be able to specify which chunks go together on a single page, regardless of the TOC settings. This seems to be the same as you, @dm413, if I'm reading you correctly. |
Manual chunking isn't available -- I'm not even sure how that would be indicated. |
Gotcha. Sorry if I maybe misread something somewhere. I was imagining either a split tag of some sort, eg <!--- split here ---> right within the content, or chunking based on the files that were passed in. |
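Since manual chunking is not available in pandoc itself, one possible workaround is to pre-split the source on a marker comment and convert each piece separately. The Python sketch below assumes a marker line like the one suggested above; the exact marker text and the function name are hypothetical, and cross-file links would still need to be handled separately.

```python
import re

# Matches a line consisting only of an HTML comment such as
# "<!--- split here --->" (two or more dashes on each side).
SPLIT_MARKER = re.compile(
    r"^[ \t]*<!--+\s*split here\s*--+>[ \t]*$", re.MULTILINE
)


def split_on_markers(source: str) -> list[str]:
    """Return the pieces of `source` between marker lines, skipping empties."""
    parts = SPLIT_MARKER.split(source)
    return [p.strip("\n") for p in parts if p.strip()]


text = "# One\n\ntext\n\n<!--- split here --->\n\n# Two\n\nmore\n"
for piece in split_on_markers(text):
    print(repr(piece))
```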
This works great. I tested one of my documentation projects with the current nightly build (1186). The options
There is a A suggestion: If the output filename extension is If Info for others who want to try this out:
It's a nice new feature. Thank you! |
@jgm and @tarleb
Should I move this question to PR #8485 or to the mailing list? |
No. I will change that. |
Feedback wanted at #8581 about either changing the files in the chunk so they don't start with a section number, or possibly allowing you to specify a template for the filenames. |
It would be useful if Pandoc could produce multiple output files by splitting the output based on sections (header) levels. The output files should maintain links across files, and the table of contents should link to all files.
Chunked HTML output would produce a set or folder of HTML files. This is useful for generating static websites (for example).
HTMLHelp output is a compressed version of chunked HTML specific to Windows. The way this is done by other tools (such as doxygen) is to generate a folder of chunked HTML along with an HTMLHelp project file and content file. And perhaps an index file, but I don't think Pandoc has a built-in concept of index terms, so I would skip this for now. These files are then run through HTMLHelp Workshop, a Microsoft tool that is used to generate the HTMLHelp file.
HTMLHelp has its own pane for the TOC, generated from the content file. The content file should respect the pandoc toc-depth setting. Since there is a separate TOC pane, the normal TOC at the top of the file should be suppressed by default.
You could also consider adding these as input formats. For chunked HTML, the issues seem to be what order to read the files, and making sure the links are correctly handled. For HTMLHelp (on Windows), the HTMLHelp reader can split a HTMLHelp (chm) file into the original discrete files for further processing in the same way as chunked HTML.
Note that the already supported epub format is another version of a chunked html format.
This issue has been raised in the pandoc-discuss mailing list. Various ideas have been proposed, including:
Add "Next" and "Previous" links to each HTML output page. This probably needs to be an optional feature.
Extend the idea of chunked output to formats other than HTML. For example, individual chapters sent to separate ODT or DOCX files (or RST, markdown, etc.).