
GH-99554: marshal bytecode more efficiently #99555

Closed
wants to merge 9 commits

Conversation

@brandtbucher (Member) commented Nov 17, 2022

This shrinks the size of .pyc files by about 25%. It also removes an intermediate copy of the bytecode created during marshalling.

Next steps will be removing the intermediate bytes object created during unmarshalling and performing the work of _PyCode_Quicken as part of this same move.
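For context, here is a minimal Python sketch of the compression idea discussed below. Inline CACHE entries are dropped on disk, and the oparg byte is written only for opcodes that take an argument. The names (`compress`, `code_units`, `cache_counts`) are illustrative, with `cache_counts` standing in for `opcode._inline_cache_entries`; the real implementation is C code in Python/marshal.c:

```python
import opcode

def compress(code_units, cache_counts):
    """code_units: iterable of (opcode, oparg) pairs, CACHE entries included."""
    out = bytearray()
    skip = 0
    for op, arg in code_units:
        if skip:              # elide the inline CACHE entries
            skip -= 1
            continue
        out.append(op)
        if op >= opcode.HAVE_ARGUMENT:
            out.append(arg)   # oparg byte only when the opcode uses one
        skip = cache_counts.get(op, 0)
    return bytes(out)
```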

@brandtbucher added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels on Nov 17, 2022
@brandtbucher brandtbucher requested a review from tiran as a code owner November 17, 2022 00:29
@brandtbucher brandtbucher self-assigned this Nov 17, 2022
@brandtbucher (Member Author):

Hm, that's annoying. It appears that umarshal is used for deepfreeze, but it needs to be able to run with a version of Python other than the one being built... that makes this whole thing a lot harder, since those Pythons don't have access to the opcode module of the Python we're unmarshalling.

I might need to find a ~~hack~~ clever way of importing the opcode module from Lib.
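One conventional way to do that kind of import is loading the module by file path with importlib.util. A minimal sketch, where the `lib_dir` parameter and helper name are assumptions (not necessarily what the PR ended up doing):

```python
import importlib.util

def load_opcode(lib_dir):
    # Load Lib/opcode.py directly, without relying on the host
    # Python's own copy of the opcode module.
    spec = importlib.util.spec_from_file_location("opcode", f"{lib_dir}/opcode.py")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```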

@mdboom (Contributor) left a comment

Just one drive-by comment ;)

Python/marshal.c Outdated
assert(0x00 <= oparg && oparg < 0x100);
buffer[i++] = _Py_MAKECODEUNIT(opcode, oparg);
for (int j = 0; j < _PyOpcode_Caches[opcode]; j++) {
    buffer[i++] = _Py_MAKECODEUNIT(CACHE, oparg);
mdboom (Contributor):

Maybe assert that i < size here, since there's potential for a buffer overrun.

@gvanrossum (Member) left a comment

TBH, I'm a bit torn. This looks brilliant, but there's one thing that worries me -- unless you have the correct table of inline cache entries and know the value of HAVE_ARGUMENT you can't even skip the bytecode, which means you can't do anything else with the code object (e.g. looking at co_consts or co_names or even co_firstlineno).

That would be easier if the size you wrote were the number of bytes written instead of the number of code units including caches; but then r_bytecode() would have to guess at the size of the bytes object it is allocating and perhaps reallocate it later, which isn't ideal either.

How would you feel about writing both numbers to the file? That wastes 4 bytes per code object, but makes decoding more robust.

FWIW: Here's a wild scheme for avoiding the extra copy of the bytecode (which would presumably save a little time and memory on all imports). Once you have read the size of the bytecode you could just allocate a blank PyCodeObject object of the right size, and then you could deposit the expanded bytecode directly into that. Then the struct _PyCodeConstructor-based API would have to allow a field where you pass in that object and it would just initialize the rest based on the various given fields and return it (or dealloc it if there's another error). A little fiddly and not something for this PR, but given how heroic this all already sounds, perhaps worth the effort in the future.
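To make the concern concrete, here is a rough Python sketch of the decoding loop under review (the snippet below shows the real code): even just skipping the bytecode requires HAVE_ARGUMENT and the inline-cache table. `cache_counts` again stands in for `opcode._inline_cache_entries`, and the sketch assumes a 3.11+ opcode table where CACHE exists:

```python
import opcode

def expand(read_byte, nbytes, cache_counts):
    """read_byte: callable returning the next byte; nbytes: expanded size."""
    cache = opcode.opmap["CACHE"]
    bytecode = bytearray()
    while len(bytecode) < nbytes:
        op = read_byte()
        arg = read_byte() if op >= opcode.HAVE_ARGUMENT else 0
        bytecode += bytes([op, arg])
        # Re-synthesize the CACHE entries that were stripped on disk:
        bytecode += bytes([cache, 0]) * cache_counts.get(op, 0)
    return bytes(bytecode)
```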

bytecode = bytearray()
while len(bytecode) < nbytes:
    opcode_byte = self.r_byte()
    if opcode.HAVE_ARGUMENT <= opcode_byte:
gvanrossum (Member):

This looks weird. I'm so used to if x >= N that seeing if N <= x just feels wrong.

Suggested change:
-    if opcode.HAVE_ARGUMENT <= opcode_byte:
+    if opcode_byte >= opcode.HAVE_ARGUMENT:

Member:

const == variable is a C-ism to prevent accidentally writing = instead of ==. No need in Python.

brandtbucher (Member Author):

I have a weird habit where I always use < and <= for comparisons, regardless of the placement of constants, etc. I guess I like the parallel with chained comparisons like lo <= x < hi, which almost always use < and <=.

I'll change it, though.

@markshannon (Member) left a comment

Looks good.
Any performance numbers? (not that it matters, the 25% space saving is worth it).

@@ -17,13 +17,13 @@
 from typing import Dict, FrozenSet, TextIO, Tuple

 import umarshal
+import opcode_for_build as opcode
markshannon (Member):

This isn't a normal import so it shouldn't look like one. Maybe something like opcode = opcode_finder.get_opcodes()

Python/marshal.c Outdated
Comment on lines 974 to 978
-    buffer[i++] = _Py_MAKECODEUNIT(opcode, oparg);
+    buffer[i].opcode = opcode;
+    buffer[i++].oparg = oparg;
     for (int j = 0; j < _PyOpcode_Caches[opcode]; j++) {
-        buffer[i++] = _Py_MAKECODEUNIT(CACHE, oparg);
+        buffer[i].opcode = CACHE;
+        buffer[i++].oparg = 0;
Member:

Honestly I'd rather see the i++ on separate lines instead of this hybrid approach.

@brandtbucher (Member Author):
> How would you feel about writing both numbers to the file? That wastes 4 bytes per code object, but makes decoding more robust.

We can do that. The only awkward part is that it requires walking over the bytecode twice when marshalling: once to count the bytes to be written (for the header), and a second time to actually write them.
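A sketch of what that counting pass might look like, mirroring the compression sketch above (names are illustrative):

```python
import opcode

def count_compressed(code_units, cache_counts):
    """First pass: compute the on-disk size without writing anything."""
    nbytes = skip = 0
    for op, _ in code_units:
        if skip:              # CACHE entries contribute nothing on disk
            skip -= 1
            continue
        nbytes += 2 if op >= opcode.HAVE_ARGUMENT else 1
        skip = cache_counts.get(op, 0)
    return nbytes
```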

@brandtbucher (Member Author) commented Dec 22, 2022

(Another option could be to terminate the serialized bytecode with two zero bytes. That sequence can't appear anywhere in the compressed string, since a zero opcode byte, CACHE, is never written, but it is a tad hacky.)

@brandtbucher (Member Author):

> Any performance numbers?

1.00x faster:

Slower (23):
- regex_dna: 208 ms +- 1 ms -> 219 ms +- 1 ms: 1.05x slower
- pickle_list: 3.97 us +- 0.04 us -> 4.12 us +- 0.07 us: 1.04x slower
- genshi_text: 20.6 ms +- 0.2 ms -> 21.3 ms +- 0.2 ms: 1.03x slower
- json: 4.68 ms +- 0.12 ms -> 4.80 ms +- 0.13 ms: 1.03x slower
- pickle: 10.00 us +- 0.08 us -> 10.2 us +- 0.1 us: 1.03x slower
- json_dumps: 9.28 ms +- 0.14 ms -> 9.49 ms +- 0.13 ms: 1.02x slower
- json_loads: 23.5 us +- 0.3 us -> 24.0 us +- 0.4 us: 1.02x slower
- regex_effbot: 3.51 ms +- 0.02 ms -> 3.58 ms +- 0.01 ms: 1.02x slower
- unpickle_list: 5.03 us +- 0.04 us -> 5.12 us +- 0.04 us: 1.02x slower
- djangocms: 32.1 us +- 1.5 us -> 32.6 us +- 1.4 us: 1.02x slower
- fannkuch: 366 ms +- 5 ms -> 372 ms +- 7 ms: 1.02x slower
- django_template: 32.4 ms +- 0.4 ms -> 32.9 ms +- 1.1 ms: 1.02x slower
- chameleon: 6.32 ms +- 0.09 ms -> 6.43 ms +- 0.09 ms: 1.02x slower
- spectral_norm: 94.9 ms +- 1.0 ms -> 96.1 ms +- 1.1 ms: 1.01x slower
- mdp: 2.51 sec +- 0.01 sec -> 2.54 sec +- 0.01 sec: 1.01x slower
- mako: 9.63 ms +- 0.08 ms -> 9.72 ms +- 0.05 ms: 1.01x slower
- xml_etree_generate: 76.5 ms +- 0.7 ms -> 77.2 ms +- 0.8 ms: 1.01x slower
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.39 sec +- 0.02 sec: 1.01x slower
- pyflate: 405 ms +- 4 ms -> 407 ms +- 3 ms: 1.01x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.50 ms +- 0.01 ms: 1.00x slower
- python_startup_no_site: 6.06 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- 2to3: 252 ms +- 1 ms -> 253 ms +- 1 ms: 1.00x slower
- generators: 77.4 ms +- 0.5 ms -> 77.6 ms +- 0.4 ms: 1.00x slower

Faster (33):
- xml_etree_iterparse: 107 ms +- 2 ms -> 102 ms +- 1 ms: 1.04x faster
- chaos: 69.0 ms +- 0.6 ms -> 66.1 ms +- 0.6 ms: 1.04x faster
- coroutines: 25.8 ms +- 0.4 ms -> 25.0 ms +- 0.3 ms: 1.04x faster
- unpack_sequence: 42.7 ns +- 0.5 ns -> 41.6 ns +- 0.7 ns: 1.03x faster
- async_tree_memoization: 650 ms +- 42 ms -> 635 ms +- 37 ms: 1.02x faster
- scimark_sor: 107 ms +- 2 ms -> 105 ms +- 2 ms: 1.02x faster
- deepcopy: 334 us +- 4 us -> 326 us +- 2 us: 1.02x faster
- deepcopy_reduce: 2.95 us +- 0.04 us -> 2.88 us +- 0.04 us: 1.02x faster
- meteor_contest: 106 ms +- 4 ms -> 104 ms +- 2 ms: 1.02x faster
- pidigits: 192 ms +- 0 ms -> 189 ms +- 0 ms: 1.02x faster
- logging_simple: 5.77 us +- 0.10 us -> 5.69 us +- 0.10 us: 1.01x faster
- deepcopy_memo: 33.9 us +- 0.5 us -> 33.5 us +- 0.6 us: 1.01x faster
- scimark_sparse_mat_mult: 4.09 ms +- 0.05 ms -> 4.03 ms +- 0.07 ms: 1.01x faster
- xml_etree_parse: 150 ms +- 6 ms -> 149 ms +- 3 ms: 1.01x faster
- telco: 6.25 ms +- 0.18 ms -> 6.17 ms +- 0.20 ms: 1.01x faster
- genshi_xml: 47.6 ms +- 0.7 ms -> 47.0 ms +- 0.6 ms: 1.01x faster
- logging_format: 6.35 us +- 0.08 us -> 6.28 us +- 0.08 us: 1.01x faster
- pathlib: 17.8 ms +- 0.3 ms -> 17.6 ms +- 0.2 ms: 1.01x faster
- nbody: 94.5 ms +- 2.5 ms -> 93.6 ms +- 1.3 ms: 1.01x faster
- regex_v8: 22.4 ms +- 0.1 ms -> 22.2 ms +- 0.1 ms: 1.01x faster
- async_tree_io: 1.33 sec +- 0.03 sec -> 1.32 sec +- 0.03 sec: 1.01x faster
- sqlglot_normalize: 105 ms +- 1 ms -> 104 ms +- 1 ms: 1.01x faster
- sympy_sum: 162 ms +- 2 ms -> 161 ms +- 2 ms: 1.01x faster
- scimark_fft: 317 ms +- 4 ms -> 315 ms +- 4 ms: 1.01x faster
- pycparser: 1.14 sec +- 0.02 sec -> 1.13 sec +- 0.01 sec: 1.01x faster
- go: 136 ms +- 2 ms -> 135 ms +- 1 ms: 1.01x faster
- sqlglot_parse: 1.40 ms +- 0.02 ms -> 1.39 ms +- 0.01 ms: 1.01x faster
- sympy_integrate: 20.3 ms +- 0.1 ms -> 20.2 ms +- 0.2 ms: 1.00x faster
- logging_silent: 92.6 ns +- 0.6 ns -> 92.1 ns +- 0.7 ns: 1.00x faster
- crypto_pyaes: 74.4 ms +- 0.8 ms -> 74.1 ms +- 0.7 ms: 1.00x faster
- dulwich_log: 62.3 ms +- 0.4 ms -> 62.0 ms +- 0.5 ms: 1.00x faster
- sqlglot_optimize: 50.6 ms +- 0.3 ms -> 50.5 ms +- 0.2 ms: 1.00x faster
- bench_thread_pool: 777 us +- 3 us -> 774 us +- 3 us: 1.00x faster

Benchmark hidden because not significant (28): async_generators, async_tree_none, async_tree_cpu_io_mixed, bench_mp_pool, coverage, deltablue, docutils, float, hexiom, html5lib, mypy, nqueens, pickle_dict, pickle_pure_python, pprint_safe_repr, raytrace, regex_compile, richards, scimark_lu, scimark_monte_carlo, sqlglot_transpile, sqlite_synth, sympy_expand, sympy_str, thrift, unpickle, unpickle_pure_python, xml_etree_process

Geometric mean: 1.00x faster

Comment on lines +308 to +312
// Terminate with two zero bytes, so that programs scanning .pyc files can
// skip over the bytecode (even if they don't know the compression scheme).
// This is simpler than writing the compressed size in the header, which
// requires two loops (one to count the bytes, then one to write them):
w_short(0, p);
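For reference, this is what the sentinel buys external tools: skipping the bytecode with no opcode metadata at all. A minimal sketch (hypothetical helper; this is exactly the property questioned in the next comment):

```python
def skip_bytecode(buf: bytes, start: int) -> int:
    # Scan forward to the two-zero-byte terminator and return the offset
    # just past it. Assumes the sentinel never occurs mid-stream; raises
    # ValueError if no terminator is found.
    end = buf.index(b"\x00\x00", start)
    return end + 2
```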
gvanrossum (Member):

Ugh, sorry, I can't stomach this. It feels certain that at some point in the future we'll have a corner case where we encounter \0\0 in the middle. If an extra pass is too expensive let's just go with what you had before. Or let's not do this -- I am feeling pretty lukewarm about this scheme, especially since the register VM will require a whole new approach anyway.

@gvanrossum (Member) left a comment

Personally my vote is to abandon this.

@brandtbucher (Member Author):

> Personally my vote is to abandon this.

Hm, that's a shame.

I'm still not personally sold on the motivation for including this extra metadata at all. Are there really that many packages reading raw .pycs without using the marshal module? Only Tools/build/umarshal.py comes to mind.

If so, these tools already need to understand how the other members of the code object are laid out, and in what formats (something that changes between Python versions). I don't think that the overhead of those tools needing opcode.HAVE_ARGUMENT and the opcode._inline_cache_entries table is enough to kill this, especially since we're so adamant that marshal is an unstable, internal format. Even marshal.version has mostly lost its usefulness, since we only bump the importlib magic number when the serialization of code objects changes.

> I am feeling pretty lukewarm about this scheme, especially since the register VM will require a whole new approach anyway.

I wouldn't call it a "whole new approach". The cache compression will still be very valuable, I think.

(I'm also not willing to die on this hill, though. 25% savings on disk is certainly nice, but it's probably not a game-changer for most users.)

@gvanrossum (Member):

To be clear, the choice isn't between adding the byte count at the cost of an extra pass and requiring people to parse the bytecode in order to implement their own marshal. Most tools doing that would be able to just use the marshal module (the code object format there changes in every release, so they'd have to anyway) or copy umarshal.py and friends from our Tools directory.

It's more that this adds a fair amount of code complexity and what that buys us is merely slightly smaller PYC files, which I'm not sure anybody cares for. Plus (as I mentioned) I worry that we'll have to redo this for the register machine, which completely changes the opcode format.

@brandtbucher (Member Author):

Okay, makes sense. I'll drop this for now, and we can maybe revisit after the opcode format has stabilized.
