Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed pdf export path and naming failures #986

Closed
wants to merge 13 commits into from

Conversation

MSeal
Copy link
Contributor

@MSeal MSeal commented Apr 8, 2019

The fix involves isolating all file paths to a temp directory with safe characters. I tried doing something less intrusive but the underlying latext command can't handle a variety of cases and I had to wrap just about all pathing for the exporter in a temp_dir wrapper to get all cases covered.

Fixes #965 and #633

Might address #974 given it sounded link complication by path characters.

Also I unintentionally fixed #947 in digging into the original problem I believe.

rc = self.run_bib(tex_file)
if rc:
# This can fail for non-conversion blocking reasons
self.run_bib(tex_file)
rc = self.run_latex(tex_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second latex should only be run if the bibtex command is successful. Otherwise, the conversion will take almost twice as long as necessary. Maybe nest an if statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will change that.


if not rc:
raise RuntimeError('Failed to run "{}" commands'.format(
[cmd.format(filename=tex_file) for cmd in self.latex_command]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't display the captured output, will it? I think displaying the captured output is necessary if this is to fix #896 / #947. Otherwise, the user will still have zero idea why latex failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does emit a critical log that goes to stderr by default. I didn't put the full output in the error message because it can be 100's or 1000's or lines long when I was testing. I'll look at truncating the output and putting a piece of it here so there's some visibility if log messages aren't being streamed to the client for some reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#988 Will fix latex error messages being far too long and not user friendly. I think that along with the fix to the notebook2.ipynb (which I see you did here), and the check for latex's return code beside the check for the file (like in #947) should nicely return any latex error and resolve #896.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rip out the message trimming I put in after #988 merges. I think this PR now implements #947 -- by raising on latex failure inside the run_command -- which the author hasn't been updating for the past week anyway.

Copy link
Contributor

@t-makaro t-makaro Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't see the fact that you moved the latex failure into run_command. I like that. It also leaves a nice way to give better bibtex errors in the future.

@MSeal
Copy link
Contributor Author

MSeal commented Apr 11, 2019

Only test with an issue still is the test_linked_images one. This is a bit tricky to fix -- need to see if I can just map the relative paths to a quoted absolute path when building the tex.

@MSeal
Copy link
Contributor Author

MSeal commented Apr 14, 2019

Fixed the relative pathing (and spaces in paths) issues for latex/pdf extraction, I did a lot of testing locally to try different combinations of pathing issues and tried to update or add tests to capture those cases most concisely.

xelatex is annoyingly unable to handle spaces or other special characters in paths. Even with grffile on it only sorta works (depending on characters) but then will render all the parts of the path before the space character in a weird way alongside the actual image. To fix this I pass down path information to trigger a copy to the temp build directory during conversion. This could be extended to make non-local pathed files usable in convertion with a small follow-up PR.

@@ -178,7 +178,7 @@
"metadata": {},
"source": [
"Make sure markdown parser doesn't crash with empty Latex formulas blocks\n",
"$$$$\n",
"$$ $$\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the file changes were just rerunning the notebook -- this was the only real change

@@ -20,6 +20,8 @@ This template does not define a docclass, the inheriting class must define this.
% Basic figure setup, for now with no caption control since it's done
% automatically by Pandoc (which extracts ![](path) syntax from Markdown).
\usepackage{graphicx}
% protect against spaces in asset names
\usepackage[space]{grffile}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only sort of works, but it keeps things from hard crashing and just introduces rendering idiosyncrasies for when any code paths might leave a space in a rendered asset path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it's redundant with the addition of grffile below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grffile only protects against spaces in names when using pdflatex, but not XeLaTeX.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the example I provided at https://github.com/mpacer/tex_ex/ grffile protects against spaces in names when using XeLaTeX. But, it requires adding the fix from https://github.com/ho-tex/oberdiek/issues/31#issuecomment-441094438 to get it to work.

@@ -35,7 +35,7 @@

% Display markdown
((* block data_markdown -*))
((( output.data['text/markdown'] | citation2latex | strip_files_prefix | convert_pandoc('markdown+tex_math_double_backslash', 'latex'))))
((( output.data['text/markdown'] | citation2latex | strip_files_prefix | convert_pandoc('markdown+tex_math_double_backslash', 'latex', extra_args=[], relative_path_replacement=resources.working_directory, build_path_replacement=resources.output_files_dir))))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are set in _fake_output_files_dir in exporters/pdf.py. This was the easiest way for the wrapping process to inform the conversion that we're operating in a different directory than the file, and to provide a latex-safe path for conversions to take place within.

@@ -444,7 +444,7 @@ def test_linked_images(self):
"""
Generate PDFs with an image linked in a markdown cell
"""
with self.create_temp_cwd(['latex-linked-image.ipynb', 'testimage.png']):
with self.create_temp_cwd(['latex-linked-image.ipynb', 'test image.png']):
Copy link
Contributor Author

@MSeal MSeal Apr 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes the test cover more edges by having a space in the asset.

@meeseeksmachine
Copy link

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/request-for-help-for-a-new-nbconvert-release/811/1

# However the will still render but will double render some path
# characters alongside the image. To avoid this also use build_path_replacement.

# => (!\[(?:[iI]mage|[aA]lt [tT]ext|)\]\()((?![^:\.]+:\/\/)[^\/][^)\"]*\.[^)\"]*[^ ])( )?(\"[^)\"]+\")?(\))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a link to regex 101 here

@onlyif_cmds_exist('pandoc')
def test_markdown2latex_path_reference(self):
"""markdown2latex replace_relative_path against referenced path test"""
s = 'Some ![image](reference)\n[reference]: ref.png'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo try with 2 newlines and see if it gets fixed

self.run_latex(tex_file)
# This can fail for non-conversion blocking reasons
if self.run_bib(tex_file):
self.run_latex(tex_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically — if we're running bibTeX, we should be running LaTeX one more time. That is it should look like

xelatex file.tex
bibtex file.tex
xelatex file.tex
xelatex file.tex

For those who want to know why — https://tex.stackexchange.com/questions/53235/why-does-latex-bibtex-need-three-passes-to-clear-up-all-warnings explains it. But basically because LaTeX was designed in a way that it kept little in memory, it needs to separately create references (LaTeX run 1) , create a file to house all the bibliography references (bibTeX), add the reference file contents to the page (LaTeX run 2), resolve all references (LaTeX run 3).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the run_latex command actually runs latex 3 times by default, so it should be fine.

@MSeal MSeal changed the title Fixed pdf export path and naming failures [WIP] Fixed pdf export path and naming failures Apr 16, 2019
@MSeal MSeal changed the title [WIP] Fixed pdf export path and naming failures Fixed pdf export path and naming failures Apr 17, 2019
return convert_pandoc(source, markup, 'latex', extra_args=extra_args)
return convert_pandoc(source, markup, 'latex',
extra_args=extra_args,
relative_path_replacement=relative_path_replacement,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not add these keyword arguments to convert_pandoc. I think it would be cleaner to do it in either:

  1. a jinja filter that is run prior to calling convert_pandoc
  2. a preprocessor that changes these links before jinja templating is applied

going into the details of this would be too long for this little comment, but I'll add more details about the implications of both later.

The same point should be made for build_path_replacement=build_path_replacement below.

@@ -33,7 +33,8 @@ def markdown2html_mistune(source):
]


def markdown2latex(source, markup='markdown', extra_args=None):
def markdown2latex(source, markup='markdown', extra_args=None,
relative_path_replacement=None, build_path_replacement=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wouldn't add these keyword arguments here.

@@ -1,7 +1,8 @@
from nbconvert.utils.pandoc import pandoc


def convert_pandoc(source, from_format, to_format, extra_args=None):
def convert_pandoc(source, from_format, to_format, extra_args=None,
relative_path_replacement=None, build_path_replacement=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment about these flags.

@@ -18,9 +19,21 @@ def convert_pandoc(source, from_format, to_format, extra_args=None):
to_format : string
Pandoc format for output.

extra_args : list (optional)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for adding this!

@mpacer
Copy link
Member

mpacer commented Apr 22, 2019

A high level point — this is doing a lot of cool things, but how much of the issues are fixed by adding the following to the preamble?

    \usepackage{grffile}

\makeatletter
\def\Gread@@xetex#1{%
\IfFileExists{"\Gin@base".bb}%
    {\Gread@eps{\Gin@base.bb}}%
    {\Gread@@xetex@aux#1}%
}
\makeatother

I got that from https://github.com/ho-tex/oberdiek/issues/31#issuecomment-441094438 which is a patch while maintainers of latex core push this fix out more broadly.

If you want to see a working example of this in a directory structure you can clone https://github.com/mpacer/tex_ex/ and run xelatex on it's base.tex and it should work for you.

@@ -161,7 +163,7 @@ This template does not define a docclass, the inheriting class must define this.

((* block commands *))
% Prevent overflowing lines due to hard-to-break entities
\sloppy
\sloppy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would see if you can turn off your auto-formatter for .tplx and .tpl files. There are situations in which having spaces can be semantically meaningful in LaTeX.

@mpacer
Copy link
Member

mpacer commented Apr 22, 2019

So @MSeal and I were working through this today.

My major suggestion is to keep the pandoc APIs unchanged.

Instead, switch the current code into its own jinja filter (in the nbconvert/filters/ directory) and ensure that it is added to the list of filters available in the environment (you can follow along with https://github.com/jupyter/nbconvert/blob/master/nbconvert/exporters/latex.py#L48-L51).

Fortunately, the model you're already using here

if 'markdown' in fmt:
source = replace_markdown_paths(
source, relative_path_replacement, build_path_replacement)
is literally equivalent to adding a filter before the convert_pandoc call in the template here
((( cell.source | citation2latex | strip_files_prefix | convert_pandoc('markdown+tex_math_double_backslash', 'json', extra_args=[], relative_path_replacement=resources.working_directory, build_path_replacement=resources.output_files_dir) | resolve_references | convert_pandoc('json','latex'))))

What might be even cleaner would be to have two separate filters rather than one: replace_relative_paths and replace_build_paths. These could reasonably live inside the existing filter_links.py

so then the code would look something like

# inside filters/filter_links.py
def replace_relative_paths(source, working_directory):
    …
def replace_build_paths(source, output_files_dir):
    …
# Inside exporters/pdf.py
from nbconvert.filters.filter_links import resolve_references
from nbconvert.filters.filter_links import replace_relative_paths
from nbconvert.filters.filter_links import replace_build_paths

    def default_filters(self):
        for x in super(LatexExporter, self).default_filters():
            yield x 
        latex_filters = (('resolve_references', resolve_references), ('replace_relative_paths', replace_relative_paths), ('replace_build_paths', resolve_references))
        for filter_tuple in latex_filters: 
             yield filter_tuple
% Inside templates/latex/document_contents.tplx
    ((( cell.source | citation2latex | strip_files_prefix | replace_relative_paths(resources.working_directory)| replace_build_paths(resources.output_files_dir) | convert_pandoc('markdown+tex_math_double_backslash', 'json', extra_args=[]) | resolve_references | convert_pandoc('json','latex'))))

If you wanted to make a change that you couldn't make from manipulating the markdown itself but that you needed to solve before it was written out into the LaTeX, you can do that by manipulating the a JSON representation of the pandoc AST that pandoc knows how to produce and consume. This is what resolve_references is doing today. If you want to learn more about how to implement that you can check out: https://github.com/jgm/pandocfilters.

However, given that you haven't needed to do this for your fix, we can probably just get away with creating two new filters.

If you'd like, I could push these changes on top of your PR. Or we could merge this and then I could make a new one with these changes in place — your call.

with TemporaryWorkingDirectory():
if not resources:
resources = {}
with self._fake_output_files_dir(resources):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the path locations should occur at the level of the latex exporter. This set-up removes the ability to use the latex exporter + manually compiling with latex as a debugging tool, since the intermediate latex will be different.

@@ -53,8 +55,8 @@ This template does not define a docclass, the inheriting class must define this.
\usepackage[mathletters]{ucs} % Extended unicode (utf-8) support
\usepackage[utf8x]{inputenc} % Allow utf-8 characters in the tex document
\usepackage{fancyvrb} % verbatim replacement that allows latex
\usepackage{grffile} % extends the file name processing of package graphics
% to support a larger range
\usepackage{grffile} % extends the file name processing of package graphics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being added above in addition to existing here.

@@ -6,23 +6,25 @@
import subprocess
import os
import sys
import shutil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like this import is being used.

@willingc
Copy link
Member

@MSeal Going to bump this to 6.0

@MSeal
Copy link
Contributor Author

MSeal commented Jun 17, 2020

@t-makaro I was going to take a stab at updating this PR for the 6.0 milestone, if you had thoughts about it since we last were working on this.

@t-makaro
Copy link
Contributor

@MSeal I'm not sure that this is needed anymore. My understanding of this PR was that it created a temp directory for output images that were extracted by the extractoutputpreprocessor. The extractoutputpreprocessor was sometimes creating file names that LaTeX could not handle.

Due to improvements in LaTeX's packages, and my fix in #1193 (which was a fix to my earlier fix), I think that this should work fine. In fact, the LaTeX package fixes should also improve file path problems for files referenced in markdown.

@MSeal
Copy link
Contributor Author

MSeal commented Jun 18, 2020

Yeah I think you're right! I tested out a few of the failing patterns and we're at a working place for the ones I was targeting with this PR. I'll close this out :)

@MSeal MSeal closed this Jun 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

convert partial if name got space inside
5 participants