Fixed pdf export path and naming failures #986

MSeal · 2019-04-08T04:40:23Z

The fix isolates all file paths in a temp directory with safe characters. I tried something less intrusive, but the underlying latex command can't handle a variety of cases, so I had to wrap just about all pathing for the exporter in a temp_dir wrapper to get all cases covered.
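
As a rough illustration of the approach described above (not the PR's actual code), staging a file in a temporary directory under a LaTeX-safe name could look like the sketch below. The helper names `safe_name` and `stage_in_temp_dir`, and the choice of which characters count as "safe", are assumptions made for this example.

```python
import os
import re
import shutil
import tempfile


def safe_name(filename):
    """Replace characters outside [A-Za-z0-9_.-] with underscores (an assumed policy)."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", filename)


def stage_in_temp_dir(src_path, temp_dir):
    """Copy src_path into temp_dir under a sanitized name and return the new path."""
    dest = os.path.join(temp_dir, safe_name(os.path.basename(src_path)))
    shutil.copyfile(src_path, dest)
    return dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as build_dir:
        original = os.path.join(src_dir, "my notebook (v2).tex")
        with open(original, "w") as f:
            f.write("\\documentclass{article}")
        # The staged copy has no spaces or parentheses in its name.
        print(stage_in_temp_dir(original, build_dir))
```

The latex command would then be run against the staged copy, so it never sees the original path.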

Fixes #965 and #633

Might address #974, given that it sounded like a complication caused by path characters.

I believe I also unintentionally fixed #947 while digging into the original problem.

t-makaro · 2019-04-11T01:54:27Z

nbconvert/exporters/pdf.py

-                rc = self.run_bib(tex_file)
-            if rc:
+                # This can fail for non-conversion blocking reasons
+                self.run_bib(tex_file)
                rc = self.run_latex(tex_file)


This second latex run should only happen if the bibtex command is successful. Otherwise, the conversion will take almost twice as long as necessary. Maybe nest an if statement.

Good catch, will change that.

t-makaro · 2019-04-11T01:58:30Z

nbconvert/exporters/pdf.py

-
+            if not rc:
+                raise RuntimeError('Failed to run "{}" commands'.format(
+                    [cmd.format(filename=tex_file) for cmd in self.latex_command]))


This won't display the captured output, will it? I think displaying the captured output is necessary if this is to fix #896 / #947. Otherwise, the user will still have zero idea why latex failed.

It does emit a critical log that goes to stderr by default. I didn't put the full output in the error message because it was hundreds or thousands of lines long when I was testing. I'll look at truncating the output and putting a piece of it here so there's some visibility if log messages aren't being streamed to the client for some reason.

#988 will fix latex error messages being far too long and not user friendly. Together with the fix to notebook2.ipynb (which I see you did here) and the check of latex's return code alongside the check for the file (as in #947), that should nicely surface any latex error and resolve #896.

I'll rip out the message trimming I put in after #988 merges. I think this PR now implements #947 -- by raising on latex failure inside the run_command -- which the author hasn't been updating for the past week anyway.

Ah, I didn't see the fact that you moved the latex failure into run_command. I like that. It also leaves a nice way to give better bibtex errors in the future.
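
A minimal sketch of the idea discussed in this thread: raise from inside the command runner and include only the tail of the captured output. This is a standalone function written for illustration, not the exporter's actual `run_command` method; the `max_lines` parameter and the message wording are assumptions.

```python
import subprocess
import sys


def run_command(command, max_lines=20):
    """Run command; on a non-zero exit, raise with the last lines of its output."""
    proc = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    if proc.returncode != 0:
        lines = proc.stdout.decode("utf-8", "replace").splitlines()
        # Keep only the tail so the error stays readable for very long logs.
        tail = "\n".join(lines[-max_lines:])
        raise RuntimeError(
            "Command {!r} failed with exit code {}. Last lines of output:\n{}".format(
                command, proc.returncode, tail
            )
        )
    return proc.returncode


if __name__ == "__main__":
    noisy_failure = [
        sys.executable,
        "-c",
        "import sys\nfor i in range(100): print('line', i)\nsys.exit(1)",
    ]
    try:
        run_command(noisy_failure, max_lines=3)
    except RuntimeError as err:
        print(err)
```

Raising in one place also means a bibtex-specific wrapper could later catch the error and rephrase it.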

MSeal · 2019-04-11T08:28:56Z

The only test that still has an issue is test_linked_images. This is a bit tricky to fix -- I need to see if I can just map the relative paths to a quoted absolute path when building the tex.

MSeal · 2019-04-14T08:46:41Z

Fixed the relative pathing (and spaces in paths) issues for latex/pdf extraction. I did a lot of local testing with different combinations of pathing issues, and updated or added tests to capture those cases as concisely as possible.

xelatex is annoyingly unable to handle spaces or other special characters in paths. Even with grffile on, it only sort of works (depending on the characters), and it then renders all the parts of the path before the space character in a weird way alongside the actual image. To fix this I pass down path information to trigger a copy to the temp build directory during conversion. This could be extended to make non-local pathed files usable in conversion with a small follow-up PR.

MSeal · 2019-04-14T17:22:32Z

nbconvert/exporters/tests/files/notebook2.ipynb

@@ -178,7 +178,7 @@
   "metadata": {},
   "source": [
    "Make sure markdown parser doesn't crash with empty Latex formulas blocks\n",
-    "$$$$\n",
+    "$$ $$\n",


The rest of the file changes were just rerunning the notebook -- this was the only real change

MSeal · 2019-04-14T17:35:22Z

nbconvert/templates/latex/base.tplx

@@ -20,6 +20,8 @@ This template does not define a docclass, the inheriting class must define this.
    % Basic figure setup, for now with no caption control since it's done
    % automatically by Pandoc (which extracts ![](path) syntax from Markdown).
    \usepackage{graphicx}
+    % protect against spaces in asset names
+    \usepackage[space]{grffile}


Only sort of works, but it keeps things from hard crashing and just introduces rendering idiosyncrasies for when any code paths might leave a space in a rendered asset path.

This seems like it's redundant with the addition of grffile below.

grffile only protects against spaces in names when using pdflatex, but not XeLaTeX.

In the example I provided at https://github.com/mpacer/tex_ex/ grffile protects against spaces in names when using XeLaTeX. But, it requires adding the fix from https://github.com/ho-tex/oberdiek/issues/31#issuecomment-441094438 to get it to work.

MSeal · 2019-04-14T17:37:22Z

nbconvert/templates/latex/document_contents.tplx

@@ -35,7 +35,7 @@

 % Display markdown
 ((* block data_markdown -*))
-    ((( output.data['text/markdown'] | citation2latex | strip_files_prefix | convert_pandoc('markdown+tex_math_double_backslash', 'latex'))))
+    ((( output.data['text/markdown'] | citation2latex | strip_files_prefix | convert_pandoc('markdown+tex_math_double_backslash', 'latex', extra_args=[], relative_path_replacement=resources.working_directory, build_path_replacement=resources.output_files_dir))))


These values are set in _fake_output_files_dir in exporters/pdf.py. This was the easiest way for the wrapping process to inform the conversion that we're operating in a different directory than the file, and to provide a latex-safe path for conversions to take place within.
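
For readers unfamiliar with the pattern, here is a hedged sketch of a context manager that temporarily points `resources` at a temp build directory. The two keys come from the template change above; the function body, and the choice to restore the previous values on exit, are assumptions rather than the PR's real `_fake_output_files_dir`.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def fake_output_files_dir(resources):
    """Temporarily point resources at a temp build dir, restoring them afterwards."""
    keys = ("working_directory", "output_files_dir")
    saved = {key: resources.get(key) for key in keys}
    with tempfile.TemporaryDirectory() as build_dir:
        # Where the notebook lives, so relative links can be resolved.
        resources["working_directory"] = os.getcwd()
        # A path with only safe characters for latex to build in.
        resources["output_files_dir"] = build_dir
        try:
            yield resources
        finally:
            for key, value in saved.items():
                if value is None:
                    resources.pop(key, None)
                else:
                    resources[key] = value


if __name__ == "__main__":
    res = {"output_files_dir": "notebook_files"}
    with fake_output_files_dir(res) as patched:
        print(patched["output_files_dir"])
    print(res)
```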

MSeal · 2019-04-14T17:38:12Z

nbconvert/tests/test_nbconvertapp.py

@@ -444,7 +444,7 @@ def test_linked_images(self):
        """
        Generate PDFs with an image linked in a markdown cell
        """
-        with self.create_temp_cwd(['latex-linked-image.ipynb', 'testimage.png']):
+        with self.create_temp_cwd(['latex-linked-image.ipynb', 'test image.png']):


Makes the test cover more edges by having a space in the asset.

meeseeksmachine · 2019-04-15T18:22:29Z

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/request-for-help-for-a-new-nbconvert-release/811/1

MSeal · 2019-04-15T19:08:55Z

nbconvert/utils/pandoc.py

+        # However the will still render but will double render some path
+        # characters alongside the image. To avoid this also use build_path_replacement.
+
+        # => (!\[(?:[iI]mage|[aA]lt [tT]ext|)\]\()((?![^:\.]+:\/\/)[^\/][^)\"]*\.[^)\"]*[^ ])( )?(\"[^)\"]+\")?(\))


add a link to regex 101 here
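
To make the intent of the rewrite easier to follow, here is a deliberately simplified sketch. It is not the pattern from the diff above: it only handles inline `![alt](path "title")` images, and both `IMAGE_RE` and `rewrite_relative_image_paths` are hypothetical names introduced for this example.

```python
import os
import re

# Simplified, assumed pattern: prefix, path, optional quoted title, closing paren.
IMAGE_RE = re.compile(r'(!\[[^\]]*\]\()([^)"]+?)(\s*"[^"]*")?(\))')


def rewrite_relative_image_paths(source, working_directory):
    """Prefix relative image paths in markdown with working_directory."""

    def _rewrite(match):
        prefix, path, title, close = match.groups()
        # Leave URLs and absolute paths untouched.
        if "://" in path or os.path.isabs(path):
            return match.group(0)
        absolute = os.path.join(working_directory, path)
        return prefix + absolute + (title or "") + close

    return IMAGE_RE.sub(_rewrite, source)


if __name__ == "__main__":
    print(rewrite_relative_image_paths("Some ![image](test image.png)", "/work"))
```

Reference-style links such as `[reference]: ref.png` are out of scope for this sketch and would need a second pattern.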

MSeal · 2019-04-15T19:13:38Z

nbconvert/filters/tests/test_markdown.py

+    @onlyif_cmds_exist('pandoc')
+    def test_markdown2latex_path_reference(self):
+        """markdown2latex replace_relative_path against referenced path test"""
+        s = 'Some ![image](reference)\n[reference]: ref.png'


todo try with 2 newlines and see if it gets fixed

mpacer · 2019-04-15T21:19:19Z

nbconvert/exporters/pdf.py

+            self.run_latex(tex_file)
+            # This can fail for non-conversion blocking reasons
+            if self.run_bib(tex_file):
+                self.run_latex(tex_file)


Technically — if we're running bibTeX, we should be running LaTeX one more time. That is it should look like

    xelatex file.tex
    bibtex file.tex
    xelatex file.tex
    xelatex file.tex

For those who want to know why, https://tex.stackexchange.com/questions/53235/why-does-latex-bibtex-need-three-passes-to-clear-up-all-warnings explains it. Basically, because LaTeX was designed to keep little in memory, it needs to separately create references (LaTeX run 1), create a file to house all the bibliography references (bibTeX), add the reference file contents to the page (LaTeX run 2), and resolve all references (LaTeX run 3).

I believe the run_latex command actually runs latex 3 times by default, so it should be fine.
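
The control flow under discussion can be sketched as follows. This is an illustration with injected callables rather than the exporter's real methods: `build_pdf` is a hypothetical name, and each `run_latex()` call stands for however many LaTeX passes the exporter performs per call.

```python
def build_pdf(run_latex, run_bib):
    """Run LaTeX, then bibtex; re-run LaTeX only if bibtex succeeded.

    run_latex and run_bib are callables that return True on success.
    Returns the list of steps that were executed, in order.
    """
    steps = []
    if not run_latex():
        raise RuntimeError("initial LaTeX run failed")
    steps.append("latex")
    # bibtex can fail for non-blocking reasons, e.g. a document with no citations.
    if run_bib():
        steps.append("bibtex")
        run_latex()
        steps.append("latex")
    return steps


if __name__ == "__main__":
    print(build_pdf(lambda: True, lambda: True))
    print(build_pdf(lambda: True, lambda: False))
```

Skipping the second LaTeX step when bibtex fails avoids roughly doubling the conversion time for notebooks without a bibliography.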

mpacer · 2019-04-22T01:34:30Z

nbconvert/filters/markdown.py

-    return convert_pandoc(source, markup, 'latex', extra_args=extra_args)
+    return convert_pandoc(source, markup, 'latex',
+        extra_args=extra_args,
+        relative_path_replacement=relative_path_replacement,


I would not add these keyword arguments to convert_pandoc. I think it would be cleaner to do it in either:

- a jinja filter that is run prior to calling convert_pandoc
- a preprocessor that changes these links before jinja templating is applied

Going into the details of this would be too long for this little comment, but I'll add more details about the implications of both later.

The same point should be made for build_path_replacement=build_path_replacement below.

mpacer · 2019-04-22T01:34:51Z

nbconvert/filters/markdown.py

@@ -33,7 +33,8 @@ def markdown2html_mistune(source):
 ]


-def markdown2latex(source, markup='markdown', extra_args=None):
+def markdown2latex(source, markup='markdown', extra_args=None,
+                   relative_path_replacement=None, build_path_replacement=None):


I also wouldn't add these keyword arguments here.

mpacer · 2019-04-22T01:35:08Z

nbconvert/filters/pandoc.py

@@ -1,7 +1,8 @@
 from nbconvert.utils.pandoc import pandoc


-def convert_pandoc(source, from_format, to_format, extra_args=None):
+def convert_pandoc(source, from_format, to_format, extra_args=None,
+                   relative_path_replacement=None, build_path_replacement=None):


See other comment about these flags.

mpacer · 2019-04-22T01:35:20Z

nbconvert/filters/pandoc.py

@@ -18,9 +19,21 @@ def convert_pandoc(source, from_format, to_format, extra_args=None):
    to_format : string
        Pandoc format for output.

+    extra_args : list (optional)


Many thanks for adding this!

mpacer · 2019-04-22T03:25:18Z

A high-level point: this is doing a lot of cool things, but how many of the issues are fixed by adding the following to the preamble?

\usepackage{grffile}

\makeatletter
\def\Gread@@xetex#1{%
\IfFileExists{"\Gin@base".bb}%
    {\Gread@eps{\Gin@base.bb}}%
    {\Gread@@xetex@aux#1}%
}
\makeatother

I got that from https://github.com/ho-tex/oberdiek/issues/31#issuecomment-441094438 which is a patch while maintainers of latex core push this fix out more broadly.

If you want to see a working example of this in a directory structure, you can clone https://github.com/mpacer/tex_ex/ and run xelatex on its base.tex, and it should work for you.

mpacer · 2019-04-22T03:56:21Z

nbconvert/templates/latex/base.tplx

@@ -161,7 +163,7 @@ This template does not define a docclass, the inheriting class must define this.

    ((* block commands *))
    % Prevent overflowing lines due to hard-to-break entities
-    \sloppy 
+    \sloppy


I would see if you can turn off your auto-formatter for .tplx and .tpl files. There are situations in which having spaces can be semantically meaningful in LaTeX.

mpacer · 2019-04-22T04:25:38Z

So @MSeal and I were working through this today.

My major suggestion is to keep the pandoc APIs unchanged.

Instead, switch the current code into its own jinja filter (in the nbconvert/filters/ directory) and ensure that it is added to the list of filters available in the environment (you can follow along with https://github.com/jupyter/nbconvert/blob/master/nbconvert/exporters/latex.py#L48-L51).

Fortunately, the model you're already using here

nbconvert/nbconvert/utils/pandoc.py

Lines 79 to 81 in 7cf79f8

    
    if 'markdown' in fmt:
        source = replace_markdown_paths(
            source, relative_path_replacement, build_path_replacement)

is literally equivalent to adding a filter before the convert_pandoc call in the template here

nbconvert/nbconvert/templates/latex/document_contents.tplx

Line 67 in 7cf79f8

    
           ((( cell.source | citation2latex | strip_files_prefix | convert_pandoc('markdown+tex_math_double_backslash', 'json', extra_args=[], relative_path_replacement=resources.working_directory, build_path_replacement=resources.output_files_dir) | resolve_references | convert_pandoc('json','latex'))))

What might be even cleaner would be to have two separate filters rather than one: replace_relative_paths and replace_build_paths. These could reasonably live inside the existing filter_links.py

so then the code would look something like

# inside filters/filter_links.py
def replace_relative_paths(source, working_directory):
    …
def replace_build_paths(source, output_files_dir):
    …

# Inside exporters/latex.py (default_filters is a method of LatexExporter)
from nbconvert.filters.filter_links import resolve_references
from nbconvert.filters.filter_links import replace_relative_paths
from nbconvert.filters.filter_links import replace_build_paths

    def default_filters(self):
        for x in super(LatexExporter, self).default_filters():
            yield x
        # Each entry maps the name used in templates to the filter function.
        latex_filters = (
            ('resolve_references', resolve_references),
            ('replace_relative_paths', replace_relative_paths),
            ('replace_build_paths', replace_build_paths),
        )
        for filter_tuple in latex_filters:
            yield filter_tuple

% Inside templates/latex/document_contents.tplx
    ((( cell.source | citation2latex | strip_files_prefix | replace_relative_paths(resources.working_directory) | replace_build_paths(resources.output_files_dir) | convert_pandoc('markdown+tex_math_double_backslash', 'json', extra_args=[]) | resolve_references | convert_pandoc('json','latex'))))

If you wanted to make a change that you couldn't make by manipulating the markdown itself, but that you needed to solve before it was written out into the LaTeX, you can do that by manipulating a JSON representation of the pandoc AST that pandoc knows how to produce and consume. This is what resolve_references is doing today. If you want to learn more about how to implement that, you can check out https://github.com/jgm/pandocfilters.
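
As a hedged sketch of what such an AST manipulation might look like, the function below walks a pandoc JSON document with plain Python and rewrites image targets. It assumes the element layout used by recent pandoc versions, where an `Image` node's `c` field is `[attr, alt_inlines, [url, title]]`; `rewrite_image_targets` is a hypothetical name and the sample document is hand-written, not real pandoc output.

```python
import json


def rewrite_image_targets(node, rewrite):
    """Recursively walk a pandoc JSON AST and apply rewrite() to every Image URL."""
    if isinstance(node, list):
        for child in node:
            rewrite_image_targets(child, rewrite)
    elif isinstance(node, dict):
        if node.get("t") == "Image":
            # Assumed layout: c = [attr, alt_inlines, [url, title]]
            target = node["c"][2]
            target[0] = rewrite(target[0])
        for child in node.values():
            rewrite_image_targets(child, rewrite)
    return node


if __name__ == "__main__":
    image = {
        "t": "Image",
        "c": [["", [], []], [{"t": "Str", "c": "alt"}], ["test image.png", ""]],
    }
    doc = {"pandoc-api-version": [1, 17], "meta": {}, "blocks": [{"t": "Para", "c": [image]}]}
    rewrite_image_targets(doc, lambda url: "/build/" + url.replace(" ", "_"))
    print(json.dumps(doc["blocks"][0]["c"][0]["c"][2]))
```

In a template this would sit between `convert_pandoc(..., 'json')` and `convert_pandoc('json', 'latex')`, with the JSON string parsed before the walk and serialized again afterwards.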

However, given that you haven't needed to do this for your fix, we can probably just get away with creating two new filters.

If you'd like, I could push these changes on top of your PR. Or we could merge this and then I could make a new one with these changes in place — your call.

t-makaro · 2019-06-19T00:00:52Z

nbconvert/exporters/pdf.py

-        with TemporaryWorkingDirectory():
+        if not resources:
+            resources = {}
+        with self._fake_output_files_dir(resources):


Modifying the path locations should occur at the level of the latex exporter. This set-up removes the ability to use the latex exporter + manually compiling with latex as a debugging tool, since the intermediate latex will be different.

mpacer · 2019-07-18T00:13:48Z

nbconvert/templates/latex/base.tplx

@@ -53,8 +55,8 @@ This template does not define a docclass, the inheriting class must define this.
    \usepackage[mathletters]{ucs} % Extended unicode (utf-8) support
    \usepackage[utf8x]{inputenc} % Allow utf-8 characters in the tex document
    \usepackage{fancyvrb} % verbatim replacement that allows latex
-    \usepackage{grffile} % extends the file name processing of package graphics 
-                         % to support a larger range 
+    \usepackage{grffile} % extends the file name processing of package graphics


This package is being added above in addition to already existing here.

mpacer · 2019-07-18T00:15:17Z

nbconvert/exporters/pdf.py

@@ -6,23 +6,25 @@
 import subprocess
 import os
 import sys
+import shutil


It doesn't look like this import is being used.

willingc · 2019-07-29T21:28:03Z

@MSeal Going to bump this to 6.0

MSeal · 2020-06-17T01:33:25Z

@t-makaro I was going to take a stab at updating this PR for the 6.0 milestone. Do you have any thoughts on it since we last worked on this?

t-makaro · 2020-06-18T03:28:40Z

@MSeal I'm not sure that this is needed anymore. My understanding of this PR was that it created a temp directory for output images that were extracted by the extractoutputpreprocessor. The extractoutputpreprocessor was sometimes creating file names that LaTeX could not handle.

Due to improvements in LaTeX's packages, and my fix in #1193 (which was a fix to my earlier fix), I think that this should work fine. In fact, the LaTeX package fixes should also improve file path problems for files referenced in markdown.

MSeal · 2020-06-18T15:54:14Z

Yeah I think you're right! I tested out a few of the failing patterns and we're at a working place for the ones I was targeting with this PR. I'll close this out :)

MSeal requested review from minrk and mpacer April 8, 2019 04:40

MSeal mentioned this pull request Apr 8, 2019

spaces in filename break jupyter notebook pdf conversion sagemathinc/cocalc#3722

Closed

t-makaro reviewed Apr 11, 2019

MSeal requested a review from SylvainCorlay April 11, 2019 16:15

MSeal force-pushed the pdfExportPathFix branch from 0c43a41 to 7501e5a Compare April 13, 2019 21:34

MSeal commented Apr 14, 2019

MSeal force-pushed the pdfExportPathFix branch from 45b8414 to dd740b0 Compare April 15, 2019 01:00

MSeal commented Apr 15, 2019

mpacer reviewed Apr 15, 2019

MSeal added 9 commits April 15, 2019 20:30

Fixed pdf export path and naming failures

5c6096f

Fixed a variety of issues and applied PR feedback

07709d9

Removed pdf command exception output trimmer

beb6f6a

Fixed latex/pdf extraction from remote file paths, including with spaces in the path

64e5da5

Removed typo in test

28d1233

Attempt to fix markdown tests in jenkins

4b8c2cf

Reused sensitive_filename_cleanup in pandoc conversion

c99fef4

Added alt text support for latex image path rewrites. Moved markdown-to-latex into separate function.

1c54aa5

Temporarily added extensive regex matching for markdown reference and better image tag matching

bcf1461

MSeal changed the title ~~Fixed pdf export path and naming failures~~ [WIP] Fixed pdf export path and naming failures Apr 16, 2019

MSeal added 3 commits April 16, 2019 19:32

Fixed failed test

943813e

Added test suite for image markdown conversion patterns

73916f3

Finished pandoc markdown link manipluations

d802bd5

MSeal force-pushed the pdfExportPathFix branch from dd740b0 to d802bd5 Compare April 17, 2019 07:10

MSeal changed the title ~~[WIP] Fixed pdf export path and naming failures~~ Fixed pdf export path and naming failures Apr 17, 2019

Fixed references with escaped label quotes

7cf79f8

mpacer reviewed Apr 22, 2019

MSeal mentioned this pull request Apr 26, 2019

Using relative image paths breaks PDF generation #136

Open

MSeal mentioned this pull request Jun 17, 2019

NBConvert 5.6 Release #1052

Closed

t-makaro suggested changes Jun 19, 2019

mpacer reviewed Jul 18, 2019

willingc added this to the 6.0 milestone Jul 29, 2019

MSeal mentioned this pull request Jun 10, 2020

HTML & Slides export not converting HTML-bracketed Markdown URLs #1280

Open

MSeal closed this Jun 18, 2020
