
Update hook-gi.repository.GdkPixbuf.py for Great Glory #1843

Merged
merged 8 commits into pyinstaller:develop from leycec:lazyload
Mar 28, 2016

Conversation

leycec
Contributor

@leycec leycec commented Feb 27, 2016

Greetings and wintry salutations, PyInstaller goodfellows.

This pull request generalizes wailinygn's open pull request for the hook-gi.repository.GdkPixbuf hook – now with 43% more support for:

  • Python 3 under OS X.
  • Efficient writes without extraneous decodes and encodes.
  • A new functional test exercising this hook.

What? But... There's Already a Pull Request.

Yes. Yes, there is. And it's awe-inspiring in its simple simplicity. Unfortunately, it also:

  • Fails for Python 3 under OS X (as does the current version of this hook, to be fair).
  • Has yet to be revised to write without extraneous decodes and encodes under Windows and Linux.
  • Has no functional tests.

This is a blocker for my lazy hook loading work. The current version of this hook causes other seemingly unrelated tests (e.g., test_matplotlib()) to fail, making it difficult for me to differentiate tests failing due to my work from tests failing due to this erroneous hook. It's also a blocker for my Matplotlib-based application.

Since the original pull request was moving a little... lethargically, I thought I'd pick up the slack, amp up the volume, and jack into a new reality. Thanks for laying the groundwork, wailinygn.

Sane File Handling

Performing safe file handling across both Python 2 and 3 is insane.

The crux of the insanity is Python 2's open() built-in. It sucks; it's fragile; it's unreliable; it's uninformative; it's broken by design. To fix the suckage, Python >= 2.6 backported Python 3's open() built-in under a different name: io.open(). Under Python 2.6 and 2.7, io.open() should typically be called in lieu of open(). Under Python 3, io.open() and open() are the same function. Hence:

# Under Python 3:
>>> io.open is open
True

# Under Python 2:
>>> io.open is open
False

To sanitize this insanity, a new compat.open_file() utility function simplifying file handling has been added. Under:

  • Python 3, this aliases the open() built-in.
  • Python 2, this aliases the io.open() function.

Of course, since open() is io.open() under Python 3, compat.open_file() is technically io.open() under both Python 3 and 2. But no one ever remembers to call io.open() (as evidenced by the absence of io.open() calls in the PyInstaller codebase), including that fat, lazy Canadian leycec. By the thunder hammer of Thor, may the existence of compat.open_file() be a light in the darkness encouraging others to stop calling the open() built-in!
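
For concreteness, here's a minimal sketch of what compat.open_file() boils down to – untested, using only the names described above; the real compat.py may spell it differently:

import io
import sys

if sys.version_info[0] >= 3:
    # Python 3's open() already *is* io.open().
    open_file = open
else:
    # Python 2.6/2.7: io.open() is the backport of Python 3's open().
    open_file = io.open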

Woah!

Where's Thor's right foot? I don't even know.

Python 3 under OS X

codewarrior0 astutely notes that the hook-gi.repository.GdkPixbuf hook currently fails hard on Python 3 under OS X:

The whole block under is_darwin needs to be fixed for Python 3. Decode cachedata, do the string operations, and then re-encode it.

Inspired by that, this pull request writes cachedata as follows under OS X (a sketch follows the list):

  1. Passes the universal_newlines=True option to the subprocess.check_output() call, silently decoding cachedata with the user's preferred locale (e.g., utf-8) into a:
    • unicode object on Python 2.
    • str object on Python 3.
  2. Munges cachedata in a type-preserving manner. The original version of this hook rejoined cachedata with '\n'.join(cd), which accidentally converted cachedata from a unicode to str object on Python 2. Whoops. This pull request changes that to u'\n'.join(cd), preserving cachedata as a unicode object.
  3. Writes cachedata using our newly defined compat.open_file() helper, ensuring that string is implicitly encoded with the same character encoding that the subprocess.check_output() call used to decode that string.
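
In other words (a hand-rolled sketch of the OS X branch, not the hook's verbatim code – the cache-file path below is a placeholder):

import subprocess
from PyInstaller.compat import open_file

# Decode loader cache output with the user's preferred locale encoding:
# a unicode object under Python 2, a str object under Python 3.
cachedata = subprocess.check_output(
    ['gdk-pixbuf-query-loaders'], universal_newlines=True)

# Munge each line in a type-preserving manner...
cd = [line for line in cachedata.splitlines()]

# The u'' prefix keeps the result a unicode object under Python 2.
cachedata = u'\n'.join(cd)

# open_file() implicitly re-encodes with the same locale encoding used above.
cache_file = 'loaders.cache'  # placeholder path for illustration only
with open_file(cache_file, 'w') as f:
    f.write(cachedata)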

Did I mention that file handling is insane? 'cause... it is. Insert sad face.

Python 2 or 3 under Linux or Windows

Under all other supported platforms, codewarrior0 again observes that cachedata is not modified and hence may be written to disk as is (i.e., as a raw encoded byte string) without extraneous decodes or encodes.

The prophet is right, of course. And we have little choice but to implement his sage wisdom.
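
A correspondingly tiny sketch of the non-OS X branch (again with a placeholder cache-file path):

import subprocess

# No munging needed: capture raw bytes and write them back out verbatim.
cachedata = subprocess.check_output(['gdk-pixbuf-query-loaders'])
with open('loaders.cache', 'wb') as f:  # placeholder path for illustration only
    f.write(cachedata)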

Test Us Up the PyGObject Bomb

To ensure this never, ever happens again, a new test_gi_gdkpixbuf_binding() functional test has been added. Like every other PyGObject test, it's boring to the point of eye-gouging: it just attempts to import the gi.repository.GdkPixbuf module and hence load the hook-gi.repository.GdkPixbuf hook. That's it.

For maintainability, all PyGObject tests have been shifted from the test_libraries.py grab bag into a new dedicated test_hooks/test_gi.py script. Because! It must be done. (Cue furious hand-waving.)

That's It – But That's Enough

Thus ends this sordid chapter in the unsung saga of PyInstaller.

@leycec
Contributor Author

leycec commented Feb 27, 2016

It would appear that Travis utterly hates dynamically defined tests (i.e., tests defined at runtime via exec() rather than statically defined). Hear my incomprehensible rage! :rage2:

Very well, Travis; you win this round. I'll fix this shortly.

@codewarrior0
Contributor

To fix the suckage, Python >= 2.6 backported Python 3's open() built-in under a different name: io.open().

Good observation! I knew about the io module, but didn't notice that io.open was compatible with Python 3's open. On Python 2, I was always wrapping the result of open with a decoder from the codecs module. This makes compat.open_file a great addition.

It would appear that Travis utterly hates dynamically defined tests (i.e., tests defined at runtime via exec() rather than statically defined).

I know that pytest.mark.parametrize accepts skipif markers on the individual parameter values, but I don't know if our importorskip works with that. Did you try marking the parameter values to parametrize? Actually, I remember some rumblings on the pytest issue tracker about something called importorskip, let me check...

Aha! pytest has its own pytest.importorskip function. It's not a marker, but just a function you call during the test function, raising a Skip exception if the module is not importable, causing the test to be skipped. It looks like this function is meant to be called in lieu of the import statement, since it just returns the module if it was importable.
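
A hypothetical usage, just to illustrate the call-in-lieu-of-import pattern (the test name is made up):

import pytest

def test_gdkpixbuf_importable():
    # Skips (rather than errors) this test if the module is unimportable;
    # returns the imported module otherwise.
    GdkPixbuf = pytest.importorskip('gi.repository.GdkPixbuf')
    assert GdkPixbuf is not None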

@leycec
Contributor Author

leycec commented Feb 28, 2016

Did you try marking the parameter values to parametrize?

Yup! But py.test said: "Nope." It failed hard, perhaps relating to this open issue also attempting to mark parametrized values as skipif. I've begrudgingly agreed to brute-force this... for now.

Perusing the PyGObject docs, any brute-force solution will fall down in the face of sheer overwhelming numbers at some point. There are an absurd number of gi.repository subpackages. We'll probably end up hooking most of them for one disagreeable reason or another. (It boggles human consciousness.)

Aha! pytest has its own pytest.importorskip function.

Sweet leaf incarnate! That function improves upon our own by also supporting version checking, à la:

docutils = pytest.importorskip("docutils", minversion="0.3")

Our importorskip decorator accepts a variadic list of module names rather than a single module name and optional version number. The former is trivially convertible to the latter (by nesting decorators), but the reverse is not the case. The latter strikes me as more useful. Version-specific tests could be spiffy. I only note two instances in our test suite of @importorskip decorators being passed multiple module names, simplifying matters further.

I propose refactoring our importorskip decorator forthwith as follows:

def importorskip(modname, minversion=None):
    """
    This decorator skips the currently decorated test if the module with
    the passed fully-qualified name is unimportable _or_ importable but
    of a version less than the passed minimum version if any.

    :param modname: Fully-qualified name of the required module.
    :param minversion: Optional minimum version required of that module.
    :return: A skipif decorator carrying the skip reason if the requirement
        is unmet, else a no-op decorator.
    """
    try:
        pytest.importorskip(modname, minversion)
    except Skipped as exc:
        return skipif(True, reason=str(exc))
    else:
        return noop

def noop(obj):
    """
    Don't ask. Don't tell.
    """
    return obj

I absolutely have not tested this. But it might work. The worst that could happen is Great Cthulhu rising from the ancient depths of R'lyeh.

@leycec
Contributor Author

leycec commented Feb 28, 2016

Brute force for the violent win. Travis now passes. AppVeyor we ignore, of course. Windows. Always Windows.

@leycec
Contributor Author

leycec commented Feb 28, 2016

Nailed it. The existing test_matplotlib() parametrized test provided the crusty key to the door of correctness. The suite of PyGObject >= 2.0 tests is now parametrized as Odin intended.

Additionally, I've refactored our @importorskip decorator as described above. It's a drop-in replacement for our existing @importorskip decorator, except with the added sugar of skipping tests requiring sufficiently new versions of modules. While no existing tests use this sugar yet, they really should.

We shall see how Travis feels about all this. I'm watching you, Travis.

@leycec
Contributor Author

leycec commented Feb 28, 2016

Houston, we have ignition.

@leycec
Contributor Author

leycec commented Feb 29, 2016

And that should do it.

Our suite of Matplotlib tests was also heart-breakingly broken. Despite being parametrized, the test_matplotlib() function ignored the passed parameters by:

  • Always excluding all Qt packages (PyQt4, PyQt5, PySide), rather than excluding all Qt packages other than the current Qt package being tested as a Matplotlib backend.
  • Always non-deterministically enabling the default Matplotlib backend (...whatever that might be), rather than enabling the Matplotlib backend corresponding to the current Qt package.

If the default Matplotlib backend is a Qt package (as it is on my machine), these tragicomic bullet points imply that test_matplotlib() will attempt to enable a backend whose package has been excluded. Since this cannot be, eye-gouging test failures ensue.

Travis silently accepted this, which is almost worse. I haven't investigated why. Maybe Travis refuses to install Qt packages and hence silently skips these tests. Maybe? Anyway, they work now.

Three cheers for the Japanese goblin! 👺

@leycec
Contributor Author

leycec commented Mar 1, 2016

The Windows test failures appear to be unrelated issues. On Windows, the test_pandas_extension[onedir] test (which this pull request did not molest) and test_matplotlib[onefile-PySide] test (which this pull request did molest) both fail at runtime with:

ImportError: No module named 'dateutil.tz.__init__.tz'

Since dateutil.tz.__init__ is an existing module rather than a package, there should indeed never exist a dateutil.tz.__init__.tz module. Why does the dateutil.tz package attempt to import a module guaranteed to never exist on Windows? I am a morose guppy fish with no answers. 🐠

In theory, these test failures should be ignorable with respect to this pull request.

@leycec
Contributor Author

leycec commented Mar 1, 2016

O.K.! That should do it.

Our SciPy test was broken for SciPy >= 0.16.0. The scipy.lib package imported by the test_scipy() test has been privatized by SciPy 0.16.0 to scipy._lib. This test now conditionally imports the desired package depending on SciPy version.

Godslayer and baby narwhals willing, no further commits will be appended to this pull request. 🐳

@leycec
Contributor Author

leycec commented Mar 24, 2016

@htgoebel I summon thee, Teutonic wizard!

@springermac Thanks for heroically tackling the Python 3-specific hook-gi.repository.GdkPixbuf encoding issue in your humongous pull request. Unfortunately, another issue arises from the murky depths.

It's not your fault. You did a bang-up job. (That's good.) Your elegant solution calls our homebrew PyInstaller.compat.exec_command() function rather than the standard subprocess.check_output() function. That's great. Right?

exec_command(): It Kinda Sucks

Except, exec_command() is fundamentally broken. Outstanding issues include:

  • This function's return value is:
    • Under Python 2.7, an encoded byte string (i.e., an object of type str) rather than a decoded Unicode string (i.e., an object of type unicode). This value is not safely usable by callers for most purposes. Attempting to do so will succeed only for ASCII strings containing no encoded multibyte characters – which, admittedly, is probably the common case.
    • Under Python 3.x, a decoded Unicode string (i.e., an object of type str). This value is safely usable by callers for most purposes. However...
  • Under Python 3.x, if no encoding is explicitly passed to this function, this function's return value is decoded with the filesystem encoding for the current platform. Sadly, this is insane. There exists no correlation between filesystem and command output encodings. This is guaranteed to fail in common edge cases. As the official documentation for the sys.getfilesystemencoding() function suggests:
    • On Windows, the filesystem encoding is always unconditionally mbcs.
    • On OS X, the filesystem encoding is always unconditionally utf-8.

However, the command output encoding conditionally depends on the currently active locale for the shell session in which PyInstaller is running. This encoding is given by the sys.stdout.encoding attribute and the locale.getpreferredencoding() function. On POSIX-compliant platforms, this encoding is determined by the standard LC_* family of environment variables. On Windows... nobody even knows.

The point is: the filesystem encoding has nothing to do with the command output encoding.

Except under Linux, which has no concept of a "filesystem encoding." Under Linux, locale.getpreferredencoding() and sys.getfilesystemencoding() (typically) return the same encoding. Edge cases. Always edge cases.

Solutions? I Have Many

The ideal solution to these interlocking issues is to refactor exec_command() to call the higher-level subprocess.check_output() function with the universal_newlines=True parameter, which automatically handles decoding issues in a version-portable manner. While untested, only five lines of code should suffice:

import subprocess

def exec_command(*cmdargs, **kwargs):
    encoding = kwargs.pop('encoding', None)

    # If no encoding was specified, the current locale is defaulted to. Else, an
    # encoding was specified. To ensure this encoding is respected, the
    # "universal_newlines" option is disabled if also passed. Nice, eh?
    kwargs['universal_newlines'] = encoding is None

    # Standard output captured from this command as a decoded Unicode string if
    # "universal_newlines" is enabled or an encoded byte array otherwise.
    out = subprocess.check_output(cmdargs, **kwargs)

    # Return a Unicode string decoded, if needed, from this byte array.
    return out if encoding is None else out.decode(encoding)
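
Hypothetical call sites under that refactoring (the command is only an example):

# Decoded with the current locale's preferred encoding.
out = exec_command('gdk-pixbuf-query-loaders')

# Decoded with an explicitly passed encoding instead.
out = exec_command('gdk-pixbuf-query-loaders', encoding='utf-8')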

Excuses? I Have Many

I don't feel comfortable committing that refactoring to this pull request, which is already too long in the tooth. I've already had to resolve several merge conflicts. (Cue grumbling.) I'd rather not push this pull request any deeper, harder, or longer.

In the meanwhile, I've documented the above solution in a FIXME comment preceding exec_command(). To get hook-gi.repository.GdkPixbuf reliably working, I've also temporarily reverted that hook to call subprocess.check_output() with universal_newlines=True rather than exec_command(). :bowtie:

Once exec_command() has been rejuvenated, I solemnly promise to call that function again in this hook.

This Is The End

Friends and code comrades, this concludes this pull request.

Tests pass. Hands wave. leycec sleeps soundly tonight.

@codewarrior0
Contributor

Filenames: str or unicode?

My main objection to changing exec_command() to return a decoded unicode type on Python 2.7 is that, for the most part, PyInstaller on Python 2.7 operates on filenames as str. The filesystem functions accept str, filenames in a specfile are str, and the arguments passed in from the command line are also str. I would someday like to change PyInstaller to use unicode filenames all around - especially on Windows, where str filenames are limited to those that can be represented using the legacy ANSI codepages.

The problem is that Python 2.7 allows you to mix str and unicode freely, and only raises a UnicodeDecodeError whenever a unicode is combined with a str that is not ASCII-encoded. So as long as there are only str types and no unicode in the mix, encoding errors should not come up. This is why I decided to let exec_command() return str when running on Python 2.7.

To change over to unicode filenames, we would have to go over the entire codebase, identify every single point where a filename enters PyInstaller's care, and make sure the filename is either decoded using the correct encoding or obtained as unicode in the first place. (For example, on Python 2.7, os.listdir will return a list of unicode if given a unicode as the directory name, and str otherwise.)
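
An illustrative Python 2.7 session (the filename is an example):

>>> import os
>>> os.listdir('.')    # str argument: str (byte string) results
['caf\xc3\xa9.txt']
>>> os.listdir(u'.')   # unicode argument: unicode results
[u'caf\xe9.txt']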

And going over the entire codebase looking for places-where-PyInstaller-gets-a-filename is... kind of a big task.

Unicode in Python 2.7 on Windows is fundamentally broken

That's an understatement, by the way.

However, the command output encoding conditionally depends on the currently active locale for the shell session in which PyInstaller is running

It isn't even that simple! The active locale for the shell session (i.e. the Windows console host process) is only brought to bear when the process running in the session tries to write text (in the form of UTF-16-or-otherwise-encoded bytes) to its output stream, which is only even possible when the output stream is connected to a console host.

The Windows API offers two different write-type functions for writing to an output stream: WriteFile, which takes raw bytes with no specified encoding, and WriteConsole. WriteConsole comes in both ANSI and Unicode variants, which means the calling process can pass either codepage-encoded text or Unicode-encoded text respectively. The text will then be transparently re-encoded to the console host's preferred encoding.

Note that WriteFile writes bytes to an output stream, while WriteConsole writes text to a console window. WriteConsole cannot be used when standard output isn't connected to a console host, which will be the case whenever its output is redirected to a file or a pipe. So the console host's codepage setting doesn't even come into play when calling subprocess.check_output.

To make matters worse, the default codepage for the console host is not the usual kind of ANSI codepage (which the legacy Windows filesystem functions will accept), but is actually an OEM codepage - the kind of codepage used in the MS-DOS days, for both screen text and for filenames. Text encoded with an OEM codepage can't be passed to Windows filesystem functions, and back in Python-land, a str encoded with OEM can't be passed to things like open() and listdir().

Thus, you're correct that using the filesystem encoding to decode the output is wrong. Well, wrong on Windows, because we can't decode OEM-encoded text using the filesystem encoding (which is always ANSI) or the locale's preferred encoding (which is also always ANSI).

So universal_newlines is right out, because it uses the locale preferred encoding. To get the correct OEM encoding, you have to go all the way to sys.stdin.encoding, which may not even be available if PyInstaller isn't running inside a console host (say, because it was called via CreateProcess in a GUI process), and will not ever be available on Python 2.7.

...In fact, the only way I know of to get the correct OEM encoding on Python 2.7 is to go to ctypes and call the Windows API GetConsoleCP.
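
For the record, that ctypes incantation looks roughly like this (Windows only, untested sketch):

import ctypes

# GetConsoleCP() returns the console's current (OEM) codepage, e.g. 437 or 850.
oem_codepage = ctypes.windll.kernel32.GetConsoleCP()
oem_encoding = 'cp%d' % oem_codepage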

It's actually really easy to be wrong about the output's encoding on Windows, since the process will usually be writing bytes directly to the output stream using WriteFile, which means that it can output text in whatever encoding it wants, regardless of the console host's preference.

So about the encoding used by Windows command-line processes in general, we can only conjecture. If the process is using the legacy Windows API to do filesystem stuff, then it will get a bunch of ANSI-encoded strings and write those to stdout, thus when we decode with the filesystem encoding in Python, we'll get the right text out. But if the process is using the modern Unicode Windows API to do filesystem stuff, it might write UTF-16 to stdout. (Because in Windows API-speak, "Unicode" always means "UTF-16"). Or, it might find out what the current ANSI codepage is and transcode to that... or, it might do the same for the OEM codepage, thinking there's a console (or perhaps an MS-DOS era command line program) on the other end of the pipe. It might even try to be smart and use WriteConsole only if there's a console on stdout, and use god-knows-what encoding otherwise.

We don't know what any particular Windows command-line program will output its text as. Which is why there's an encoding argument to exec_command.

For contrast...

... everything is much simpler on Linux and OS X. The filesystem functions only accept and return bytes, and make no distinction between bytes and text. The write function only comes in one flavor, the flavor of bytes. A called process can just write out the bytes it gets from the FS functions without spending a millisecond worrying about what encoding they are in.

When Python 2 gets its hands on those bytes, it can just keep them in a str and send them right to the filesystem functions, and everything just werks. When Python 3 gets these bytes, it will insist on decoding them using the filesystem encoding before putting them in a str, later re-encoding that str to bytes before sending them to a filesystem function, which usually just werks (as long as your LANG environ is set correctly. I think that was fixed in Python 3.3 by way of the surrogateescape error handler.)

Wait, I know some more things

Command-line Windows programs can be written using Microsoft's libc (better known as the MSVCRT) rather than using the Windows API directly. This is often the case for Unix programs that have been made to work on Windows. Microsoft's libc is implemented using the legacy Windows APIs, which means filenames are ANSI encoded. Furthermore, the standard filehandles (the FILE * kind) in this libc are bytes-flavored, so there is no hidden transcoding done by the libc or by the console host.

So, these programs will write filenames to stdout as ANSI, and decoding their output using os.fsdecode() will just werk. (That is, as long as said filenames can even be represented in the current ANSI codepage.)

In the Windows port of the PyInstaller bootloader, we go out of our way to call special "wide-character" versions of the libc functions, many of which are not even part of the C standard. These functions are implemented using the modern Windows API, which means they return UTF-16-encoded filenames, which can represent all possible filenames.

The mbcs encoding on Windows is just an alias for whatever encoding locale.getpreferredencoding() returns. This encoding cannot encode all possible filenames on Windows, because it is always going to be a codepage, and will never be a UTF encoding. No, not even if you change your console host's encoding to CP65001, which means "UTF-8". Why?

C:\Users\Rio>chcp 65001
Active code page: 65001

C:\Users\Rio>python
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale

LookupError: unknown encoding: cp65001
>>> import sys

LookupError: unknown encoding: cp65001
>>> ^Z

C:\Users\Rio>python -c "import locale; print locale.getpreferredencoding()"
cp1252

Because Unicode in Python 2.7 on Windows is fundamentally broken.

@codewarrior0
Contributor

Sorry for the wall of text. The executive summary is that the fundamental problem you found with exec_command is just the tip of a much, much larger iceberg.

@leycec
Contributor Author

leycec commented Mar 25, 2016

Save us, Jabberwocky. This rabbit hole is deep and nefarious. I appreciate the Shield-Wall of Text and raise my rusty vorpal sword as a salute in your general direction.

My main objection to changing exec_command() to return a decoded unicode type on Python 2.7 is because, for the most part, PyInstaller on Python 2.7 operates on filenames as str.

Backwards compatibility! Sense is made. I whole-heartedly agree. Existing workiness must be preserved. Nonetheless, exec_command() is widely called throughout the codebase to capture general-purpose output rather than filesystem-encoded pathnames. This includes:

  • The compat.test_UPX() function, calling exec_command() to capture upx standard output – which is then stripped, split, and parsed apart as if it were a decoded unicode string. It isn't. Since this output appears to be strictly ASCII, this technically works. Until it doesn't, someday.
  • The hooks/hook-gi.repository.GdkPixbuf.py hook, calling exec_command() to capture gdk-pixbuf-query-loaders standard output under OS X – which is then stripped, split, and parsed apart as if it were a decoded unicode string. It isn't. Since this output appears to possibly contain multibyte characters, this is unlikely to behave as expected.
  • The depend.utils.load_ldconfig_cache() and depend.bindepend._getImports_ldd() functions, calling exec_command() to capture ldconfig and ldd standard output – which is then stripped, split, and parsed apart as if they were decoded unicode strings. While this output does contain pathnames, this output isn't entirely pathnames. Even if it was, are encoded str strings containing pathnames with multibyte characters really safely parseable?
  • The utils.git.get_repo_revision() function, calling exec_command() to capture git describe --long standard output – which is then stripped, split, and parsed apart as if it were a decoded unicode string. It isn't. Since tag names can contain multibyte characters, this is unlikely to behave as expected.
  • Perhaps most importantly, the utils.hooks.__init__.exec_statement() function. This function ultimately defers to exec_command() to capture python standard output. Some calls to this function capture pathnames; other calls capture general-purpose text, which is then operated upon as if that text were decoded unicode strings. It's a complete mess, really.

Naturally, exec_command() is also widely called throughout the codebase to capture filesystem-encoded pathnames. This includes:

  • The hooks/hook-PyQt5.QtWebEngineWidgets.py hook.
  • The utils.hooks.__init__.qt5_qml_dir() function.
  • Various utils.osx utility functions.

I would someday like to change PyInstaller to use unicode filenames all around...

I would someday like Python 2.7 to go away. Permanently. Until then, I giddily support a gradual transition from str to unicode strings. Ideally, str strings should only be used as temporary containers for:

  1. Raw input read from elsewhere.
  2. Raw output to be written elsewhere.

For all other purposes (...pretty much everything), only unicode strings should be used. str strings cannot be safely operated upon, generally. Yet PyInstaller blithely attempts to operate upon str strings at every available opportunity. This is a bewitching recipe for failure. Why do I even wake up in the morning?

I have no code and I must scream.

...especially on Windows, where str filenames are limited to those that can be represented using the legacy ANSI codepages.

...that's insane. See: "I have no code and I must scream."

So as long as there are only str types and no unicode in the mix, encoding errors should not come up.

It's not simply encoding errors, however.

str strings are effectively encoded byte arrays accessible via a conventional string API. They appear to be strings, but they're not (...not in the meaningful sense of the term "string," anyway); they're just contiguous runs of raw, low-level, encoded bytes.

str strings are not safely usable in string manipulation logic, generally.

To change over to unicode filenames, we would have to go over the entire codebase...

I am vomiting into my mouth as we speak. Converting to unicode filenames appears to be infeasible, particularly as Python 2.7 is – for all intents and purposes – a dead language. As it should be. Not much motivation or incentive there.

That said, a very gradual conversion to unicode filenames (and strings in general) would be unadulterated awesomeness. We can do this! Maybe not.

Unicode in Python 2.7 on Windows is fundamentally broken.

Everything you just wrote blew up my precarious mind.

So the console host's codepage setting doesn't even come into play when calling subprocess.check_output.

wat

Thus, you're correct that using the filesystem encoding to decode the output is wrong.

Yay!

Wait. I'm correct about a thing? Galactic implosion imminent.

...we can't decode OEM-encoded text using... the locale preferred encoding.

O.K., so I'm actually incorrect. Galactic stability restored.

We don't know what any particular Windows command-line program will output its text as.

"...as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know." – an unpleasant human being

All of which is to say...

You've Convinced Me.

You're also awesome. (You know. Thought you should know.)

exec_command() hereby retains its broken state, because no one has the time, manic energy, or insatiable desire to fix it – especially including me. A new exec_command_stdout() function that actually works as intended will be created from the proposed refactoring above and then called by the hooks/hook-gi.repository.GdkPixbuf.py hook. exec_command() documentation will be updated to describe its currently obscure behaviour and advise callers to call exec_command_stdout() instead.

For orthogonality, I should probably create a new exec_statement_stdout() function. But, jeez. I feel depleted merely contemplating it. Let's leave that odious task for another hapless developer. Muha!

tl;dr

A new exec_command_stdout() function will be created. The existing exec_command() function will remain unmolested. Thus is satisfied both backward compatibility and leycec.

@codewarrior0
Contributor

This function ultimately defers to exec_command() to capture python standard output.

Thankfully, this is the one case where we can absolutely, positively know how the called process' text output will be encoded. For Python 3, at least. In Python 3, the stdio streams are wrapped with encoders that convert incoming text to bytes using a given encoding. The encoding defaults to the locale's preferred encoding, and can be overridden using the PYTHONIOENCODING environ, which I think we do in one of those exec_* functions.
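
A hypothetical illustration of pinning the child Python's stdio encoding so its output can be decoded deterministically (not the actual exec_statement() code):

import os
import subprocess

# Force the child interpreter to encode its stdout as UTF-8, regardless of locale.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(['python', '-c', 'print("hello")'], env=env)
text = out.decode('utf-8')  # safe: the child's stdout was forced to UTF-8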

str strings are not safely usable in string manipulation logic, generally.

This is true. str strings (as text encoded with something other than ASCII) can't be safely used as text. For one, the length of the string is its length in bytes, and getting the length in characters relies on knowing the text encoding, plus a lot of murky details involving combining characters and whatnot.

I blame the C language for this hell. Who thought it was a good idea to use char as the same type for both bytes and text?

Nonetheless, exec_command() is widely called throughout the codebase to capture general-purpose output rather than filesystem-encoded pathnames.

I'd like to find out more about these cases, and about any specific problems that arise on Python 2.7 because the output is handled as str. They shouldn't be coming up on OS X or Linux, where UTF-8 is generally the encoding used, since every byte of a UTF-8 multibyte sequence has its high bit set. Thus, none of the bytes in a multibyte sequence will appear to be a / or . (or any other ASCII character) to the os.path functions.
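
To illustrate that last point with a Python 2 session (the filename is an example):

>>> import os.path
>>> name = u'caf\xe9.txt'.encode('utf-8')
>>> name
'caf\xc3\xa9.txt'
>>> os.path.splitext(name)  # the multibyte \xc3\xa9 never masquerades as '.'
('caf\xc3\xa9', '.txt')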

@leycec
Contributor Author

leycec commented Mar 25, 2016

But for simple path manipulation, because all widely-used codepages share the same codepoint for the / and . characters, path routines such as split, join, splitext usually work as expected when given a str.

You're probably right. But that "usually" concerns me, a bit.

Though I have no specific counterexamples at the moment, I can't help but involuntarily cringe whenever I see a str object being directly munged in Python 2.7. I've been burned a fair amount by that in the past (e.g., when parsing Wikipedia dumpfiles). And buuurn it did. 🚒

I was especially concerned about the Asiatic encodings still in common use (e.g., Shift-JIS, GBK), but you've destroyed that fear. I'm delighted to learn they sensibly avoid conflict with the lower half of the ASCII spectrum. How noble! How sane!

I then realized with a gnawing horror in my soft intestine that there are extremely common encodings that don't avoid conflict with the lower half of the ASCII spectrum. Indeed, their codepoints are explicitly incompatible with ASCII codepoints.

UTF-16 and UTF-32: We're Summarily Fucked

Unlike UTF-8, UTF-16 and UTF-32 are both ASCII-incompatible. Since neither shares the ASCII codepoints for the / and . characters (or any ASCII characters, for that matter), path routines do not generally work as expected on UTF-16- and UTF-32-encoded str strings.

The ASCII codepoints for / and . are 0x2F and 0x2E. The corresponding Unicode codepoints are 0x00002F and 0x00002E, where Unicode codepoints range from 0x000000 to 0x10FFFF. For a Unicode codepoint residing in the Basic Multilingual Plane (BMP), that codepoint's UTF-16 encoding is simply its codepoint as a 2-byte code unit. ASCII codepoints reside in the BMP. Ergo, the UTF-16 encodings for / and . are simply 0x002F and 0x002E.

"Lookin' good. What's the problem, pops?"

Pops shall now tell you. The upper byte of the UTF-16 encoding for any ASCII character is 0x00, the null byte. All filesystems prohibit the null byte in pathnames.
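
An illustrative Python 2 session (big-endian first, to make the null upper byte obvious; little-endian second):

>>> u'/'.encode('utf-16-be')
'\x00/'
>>> u'/etc'.encode('utf-16-le')
'/\x00e\x00t\x00c\x00'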

Attempting to pass UTF-16-encoded str pathnames to any filesystem function (e.g., os.path.exists(), os.listdir()) is guaranteed to fail with TypeError exceptions resembling:

TypeError: file() argument 1 must be encoded string without NULL bytes, not str

Interestingly, it could be worse. UTF-16 was explicitly designed to avoid conflict between BMP- and non-BMP encodings. Neither 0x002E, 0x002F, nor any UTF-16 encodings for ASCII codepoints are valid high surrogates. Hence, all 4-byte values in the range [0x002E0000, 0x002EFFFF] and in the range [0x002F0000, 0x002FFFFF] are invalid UTF-16 encodings. This means that 0x002E and 0x002F are always guaranteed to signify the corresponding ASCII codepoints, regardless of variable length characters. This means that you can safely split UTF-16 encoded pathnames on both / and ..

That's good. You just can't do anything with the resulting paths in Python, except feed them to something else that isn't Python. That's bad.

I expect similarly bad issues with UTF-32-encoded pathnames.

You Volunteering to Fix It?

Nope. I ain't volunteering for nuthin' except my head on this pillow. 😴

@codewarrior0
Contributor

Unlike UTF-8, UTF-16 and UTF-32 are both ASCII-incompatible.

Absolutely. I might have glossed that bit over because of how unlikely it is that we would ever get anything in either of those encodings, because...

All filesystems prohibit the null byte in pathnames.

Therefore, you will never find a Unix-like system where LANG="en-US.UTF-16" or similar, because UTF-16 cannot represent English text as a valid filename. But on Windows, all bets are off. You can, in fact, get UTF-16 output from the simple command interpreter by running it with cmd.exe /U.

Anyway, we're getting a bit too far off topic. I'll keep this discussion in mind for when I eventually try to convert everything over to unicode (assuming Python 2.7 isn't killed before then).

This PR looks good to me!

@leycec
Contributor Author

leycec commented Mar 25, 2016

Actually, I was wrong. You can't even safely split UTF-16 encoded pathnames on either / or .. Why? Because all values in the ranges [0x2E00, 0x2EFF] and [0x2F00, 0x2FFF] are valid UTF-16 encodings for Unicode codepoints that are neither / nor ..
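
For example, U+2F00 (the Kangxi radical "⼀") encodes to a code unit whose first byte, big-endian, is the / byte, so a naive byte-level split on / would cut that character in half:

>>> u'\u2f00'.encode('utf-16-be')
'/\x00'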

The pain. It hurts.

Therefore, you will never find a Unix-like system where LANG="en-us.UTF-16" or similar, because UTF-16 cannot represent English text as a valid filename.

Yup. What's oddball about that is OS X. As a BSD derivative, it's Unix-like and thus prohibits UTF-16 locales. But its default filesystem, HFS+, encodes pathnames as UTF-16.

Bizarro World strikes again.

I'll keep this discussion in mind for when I eventually try to convert everything over to unicode (assuming Python 2.7 isn't killed before then).

We both secretly know which of those two eventualities is likelier. 😉

Thanks for the erudite commentary, again. You rocked my encoding world, especially with all of the extremely fine-grained Windows effluvium. What is it with that platform? Don't answer that.

@leycec leycec force-pushed the lazyload branch 2 times, most recently from ac7a741 to 39b49a9 on March 26, 2016 05:22
@leycec
Contributor Author

leycec commented Mar 26, 2016

Tests fail, because #1885 broke the build.

It's not my fault. Even though I incited htgoebels to do it with my FIXME comments of rage. It's all my fault.

@leycec
Contributor Author

leycec commented Mar 26, 2016

This pull request is now good to go.

A new compat.exec_command_stdout() function has been added and is now called by hook-gi.repository.GdkPixbuf to portably parse command output. Unlike the legacy compat.exec_command() and compat.exec_command_all() functions, this new function only returns Unicode strings decoded in a (hopefully) robust manner.

To the end of all bugs, and beyond! 👽

@htgoebel htgoebel added the area:hooks (Caused by or effecting some hook) label on Mar 26, 2016
@leycec
Contributor Author

leycec commented Mar 27, 2016

Rebased without modification in hope of passing tests. Let's see!

@leycec leycec force-pushed the lazyload branch 2 times, most recently from 4f60d4d to 2f5f629 on March 27, 2016 03:47
@leycec
Contributor Author

leycec commented Mar 27, 2016

Tests pass. 👻

@leycec
Contributor Author

leycec commented Mar 28, 2016

Fixed the related issue #1867, just because.

This completes the noble tale of this pull request. No additional commits will be added. No existing commits will be revised. All's well that finally ends.

@codewarrior0 You know what to do, my friend.

@codewarrior0
Contributor

All right, I looked over the changes again and everything looks good. I'll wait for the CI to finish before I merge this in.

I'm going to add a note to the docstring for exec_command_stdout, advising that if you parse filenames from the command's output, you will have to check if you are running under Python 2.7, and if so, encode the filename with the filesystem encoding before passing the filename to PyInstaller.

@leycec
Contributor Author

leycec commented Mar 28, 2016

I'm going to add a note to the docstring for exec_command_stdout, advising that if you parse filenames from the command's output, you will have to check if you are running under Python 2.7, and if so, encode the filename with the filesystem encoding before passing the filename to PyInstaller.

Yeah, that's pretty essential. Apologies for neglecting that. I suppose a pragmatic compromise between these two warring camps might be to:

  • Note in the exec_command() and exec_command_all() docstrings that the return values are safely usable only when pathname output is expected.
  • Note in the exec_command_stdout() docstring that the return value is not safely usable when pathname output is expected.

Want me to edit that in? I hate stepping on toes, but I could probably whip through that carnage in a jiffy.

@codewarrior0
Contributor

Want me to edit that in?

Sure, be my guest! FYI, if you write [ci skip] in any part of the commit message, it will avoid triggering a rebuild on either CI system. Perfect for docs-only changes.

@codewarrior0
Contributor

The failures on AppVeyor are due to failing to build Numpy from source. It looks like Numpy 1.11.0 was just released hours ago, and they haven't got around to publishing Windows binary wheels (which they had for the previous version). I'll trust that the tests pass on Travis.

When the numpy guys get around to publishing a Windows binary wheel, I'll give it a rebuild.

@codewarrior0
Contributor

(Did you rebase on the wrong branch?)

@leycec
Contributor Author

leycec commented Mar 28, 2016

Ugh. I didn't rebase on the wrong branch, so I'm not quite sure why the insanity. Let's try a pull and see if that cleans my HEAD up.

The eight commits in this pull request:

  • A new compat.open_file() utility function simplifying file handling across both Python 2 and 3 has been added. Under Python 3, this aliases the open() built-in; under Python 2, this aliases the io.open() built-in, which is a backport of the Python 3 open() built-in and hence usually preferred to the mostly obsolete Python 2 open() built-in.
  • A new compat.exec_command_stdout() utility function reliably capturing standard output from external commands in a portable manner has been added. The legacy compat.exec_command() and compat.exec_command_all() utility functions do not reliably capture either standard output or error. For backward compatibility, these functions have been preserved as is but are now deprecated when called for their return values. Docstrings have been significantly improved. [ci skip]
  • A new compat.which() utility function has been added. For Python 3.x, this is synonymous with the shutil.which() function introduced by Python 3.3. For Python 2.7, this is a backported copy of the shutil.which() function defined by the most recent stable version of Python 3 as of this commit: Python 3.5.1.
  • The "gi.repository.GdkPixbuf" hook now supports Python 3 under all supported platforms by decoding loader cache output only when necessary (as under OS X) in a robust, portable manner, enabling the "universal_newlines" option and calling the Python 3-style open() built-in. This hook is also now ignored with a non-fatal warning if GDK (and hence GdkPixbuf) is unavailable on the current system. This fixes pyinstaller#1867.
  • All PyGObject functional tests have been moved from the existing "test_libraries" module into a new "test_gi" module. All tests requiring PyGObject >= 2.0 are now parametrized, trivializing test creation and maintenance – including the addition of a new "gi.repository.GdkPixbuf" test.
  • The test_matplotlib() tests repeatedly and implicitly tested the default backend rather than the currently parametrized backend. These tests now test the latter by explicitly configuring Matplotlib to do so, and have been moved from the existing "test_libraries" module into a new "test_matplotlib" module.
  • The @importorskip test decorator now supports fine-grained skipping of tests requiring minimum module versions by internally deferring to the official pytest.importorskip() helper. Doing so requires @importorskip to accept only one rather than multiple module names. Existing tests passing multiple module names have been refactored in the trivial way to pass only one such name.
  • The "scipy.lib" package imported by the test_scipy() test has been privatized by SciPy 0.16.0 to "scipy._lib". This test now conditionally imports the desired package depending on SciPy version.
@leycec
Contributor Author

leycec commented Mar 28, 2016

That looks sane(r). <phew/>

@leycec
Contributor Author

leycec commented Mar 28, 2016

Ugh.

Thanks for the [ci skip] black magic, too. I embedded that in the only changed commit, but... maybe the rebase terrors above did a whack-job on it. That commit patches up the exec_command*() docstrings as noted above. Feel free to suggest a few more, or edit them to the bone yourself. 🍖

@leycec
Contributor Author

leycec commented Mar 28, 2016

Assuming Travis tests pass, we can probably safely close #1819 and #1867 as well. Cheers, man.

@codewarrior0 codewarrior0 merged commit 32759b6 into pyinstaller:develop Mar 28, 2016
@codewarrior0
Contributor

As always, your help is much appreciated, @leycec.

@htgoebel
Member

Looks like you hunted down another beast of legacy. Thanks @leycec (and of course @codewarrior0 ).
