Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279

Closed
neonene mannequin opened this issue Sep 6, 2021 · 100 comments
Labels
3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows performance Performance or resource usage topic-C-API

Comments

@neonene
Copy link
Mannequin

neonene mannequin commented Sep 6, 2021

BPO 45116
Nosy @malemburg, @gvanrossum, @rhettinger, @pfmoore, @vstinner, @tjguk, @markshannon, @zware, @zooba, @animalize, @pablogsal, @brandtbucher, @neonene, @erlend-aasland, @Fidget-Spinner
PRs
  • bpo-45116: Add the Py_ALWAYS_INLINE macro #28390
  • bpo-45116: Py_DEBUG ignores Py_ALWAYS_INLINE #28419
  • bpo-45116: Use Py_ALWAYS_INLINE in object.h #28427
  • [3.10] bpo-45116: Shrink interpreter by targetted revert of #25244 #28475
  • bpo-45116 Fix another performance regression on Windows #28630
  • bpo-45116 Add a warm-up for PGO training on Windows #28631
  • bpo-45116: Make --inlinestat option available in build.bat for diagnostic logs #31436
  • [3.10] bpo-45116: Fix inlining regressions on Windows Release build #31459
  • gh-89279: In ceval.c, redefine some macros for speed #32387
  • Files
  • 310rc1_confirm_overhead.patch
  • ceval_310rc1_patched.c
  • b98e-no-inline-in-all.diff
  • b98e-no-inline-in-eval.diff
  • b98e-no-inline-in-the-others.diff
  • pyproject_inlinestat.patch
  • x64_28d2.log
  • x64_b98e.log
  • 310rc2_benchmarks.txt
  • 310a7_vs_310rc2_bench.txt
  • PR28475_inline.log
  • PR28475_vs_310rc2_vs_310a7.txt
  • PR28475_skip1test_bench.txt
  • 310rc2patched_vs_310rc2notrace.txt
  • switch-case_unarranged_bench.txt
  • ceval_PR29565_split_func.c
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-09-06.15:27:18.216>
    labels = ['interpreter-core', '3.10', 'performance', 'expert-C-API', '3.11', 'OS-windows']
    title = 'Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)'
    updated_at = <Date 2022-04-08.11:04:12.476>
    user = 'https://github.com/neonene'

    bugs.python.org fields:

    activity = <Date 2022-04-08.11:04:12.476>
    actor = 'steve.dower'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core', 'Windows', 'C API']
    creation = <Date 2021-09-06.15:27:18.216>
    creator = 'neonene'
    dependencies = []
    files = ['50263', '50264', '50271', '50272', '50273', '50274', '50275', '50276', '50280', '50286', '50291', '50293', '50296', '50315', '50363', '50452']
    hgrepos = []
    issue_num = 45116
    keywords = ['patch']
    message_count = 82.0
    messages = ['401143', '401152', '401154', '401182', '401183', '401319', '401329', '401346', '401364', '401623', '401624', '401628', '401743', '401964', '401970', '401972', '402025', '402040', '402043', '402044', '402063', '402064', '402065', '402067', '402068', '402071', '402090', '402091', '402092', '402098', '402099', '402117', '402135', '402143', '402189', '402190', '402217', '402229', '402230', '402287', '402289', '402296', '402307', '402308', '402320', '402480', '402856', '402857', '402858', '402864', '402867', '402871', '402878', '402886', '402891', '402893', '402928', '402930', '402954', '403403', '403409', '403430', '403432', '403464', '403559', '403587', '403609', '404089', '406354', '406386', '406407', '406416', '406471', '406474', '406479', '406487', '406613', '407188', '415378', '416911', '416950', '416977']
    nosy_count = 15.0
    nosy_names = ['lemburg', 'gvanrossum', 'rhettinger', 'paul.moore', 'vstinner', 'tim.golden', 'Mark.Shannon', 'zach.ware', 'steve.dower', 'malin', 'pablogsal', 'brandtbucher', 'neonene', 'erlendaasland', 'kj']
    pr_nums = ['28390', '28419', '28427', '28475', '28630', '28631', '31436', '31459', '32387']
    priority = None
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue45116'
    versions = ['Python 3.10', 'Python 3.11']

    @neonene
    Copy link
    Mannequin Author

    neonene mannequin commented Sep 6, 2021

    pyperformance on Windows shows some gap between 3.10a7 and 3.10b1.
    The following are the ratios compared with 3.10a7 (the higher the slower).

    -------------------------------------------------
    Windows x64 | PGO release official-binary
    ----------------+--------------------------------
    20210405 |
    3.10a7 | 1.00 1.24 1.00 (PGO?)
    20210408-07:58 |
    b98eba5 | 0.98
    20210408-10:22 |

    • PR25244 | 1.04
      20210503 |
      3.10b1 | 1.07 1.21 1.07
      -------------------------------------------------
      Windows x86 | PGO release official-binary
      ----------------+--------------------------------
      20210405 |
      3.10a7 | 1.00 1.25 1.27 (release?)
      20210408-07:58 |
      b98eba5 | 1.00
      20210408-10:22 |
    • PR25244 | 1.11
      20210503 |
      3.10b1 | 1.14 1.28 1.29

    Since PR25244 (28d28e0),
    _PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
    At least the functions below have become un-inlined there at all.

    (1) _Py_DECREF() (from Py_DECREF,Py_CLEAR,Py_SETREF)
    (2) _Py_XDECREF() (from Py_XDECREF,SETLOCAL)
    (3) _Py_IS_TYPE() (from PyXXX_CheckExact)
    (4) _Py_atomic_load_32bit_impl() (from CHECK_EVAL_BREAKER)

    I tried in vain other linker options like thread-safe-profiling, agressive-code-generation, /OPT:NOREF.
    3.10a7 can inline them in the eval-loop even if profiling only test_array.py.

    I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

    -----------------------------------------------------------------
    Windows x64 | PGO patched overhead in eval-loop
    ----------------+------------------------------------------------
    3.10a7 | 1.00
    20210802 |
    3.10rc1 | 1.09 1.05 4% (slow 43, fast 5, same 10)
    20210831-20:42 |
    863154c | 0.95 0.90 5% (slow 48, fast 3, same 7)
    (3.11a0+) |
    -----------------------------------------------------------------
    Windows x86 | PGO patched overhead in eval-loop
    ----------------+------------------------------------------------
    3.10a7 | 1.00
    20210802 |
    3.10rc1 | 1.15 1.13 2% (slow 29, fast 14, same 15)
    20210831-20:42 |
    863154c | 1.05 1.02 3% (slow 44, fast 7, same 7)
    (3.11a0+) |

    @neonene neonene mannequin added performance Performance or resource usage 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows topic-C-API labels Sep 6, 2021
    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    Rather than defining again functions as macro, you should consider using __forceinline function attribute: see bpo-45094.

    @rhettinger
    Copy link
    Contributor

    Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    Raymond:

    Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

    These functions were not converted recently to static inline function. For example, Py_INCREF() was already a static inline function in Python 3.9. I don't think that any decision should be taken before performances have been analyzed in depth.

    I'm not convinced that there is really a performance regression. I'm not sure how benchmarks are run on Windows.

    neonene:

    I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

    I don't understand how to read the table.

    Usually, I expect a comparison between a reference build and a patch build, but here you seem to use 3.10a7 as the reference to compare results. I'm not sure that geometric means can be compared this way.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 6, 2021

    Since PR25244 (28d28e0),
    _PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
    At least the functions below have become un-inlined there at all.

    I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

    The compiler is free to inline or not depending on the pressure on registers and stack memory. I'm not sure that inlining always make the code faster, since inlining have many side effects.

    I don't know well MSC compiler.

    @neonene
    Copy link
    Mannequin Author

    neonene mannequin commented Sep 7, 2021

    @vstinner: __forceinline suggestion

    Since PR25244 (mentioned above), it seems link.exe has got to get stuck on python310.dll.
    Before the PR, it took 10x~ longer to link than without __forceinline function.
    I can confirm with _Py_DECREF() and _Py_XDECREF() and one training-job (the more fucntions forced/jobs used, the slower to link).
    Have you tried __forceinline on PGO ?

    I don't understand how to read the table.

    Overhead field is the output of pyperf command, not subtraction (the answers are the same just luckily).

    ex) 3.10rc1x86 PGO:
    PGO : pyperf compare_to 3.10a7 left
    patched : pyperf compare_to 3.10a7 right
    overhead : pyperf compare_to right left
    are
    1.15x slower (slower 52, faster 4, not significant 2)
    1.13x slower (slower 50, faster 4, not significant 4)
    1.02x slower (slower 29, faster 14, not significant 15)

    I'm not sure if PGO builds are reproducible,

    MSVC does not produce the same code. Inlining (all or nothing) might be a quite special case in the hottest section.
    I suspect the profiler doesn't work well only for _PyEval_EvalFrameDefault(), including branch/align optimization.
    So my posted macro or inlining is just for a mesureing, not the solution.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 7, 2021

    I don't understand if adding __forceinline on the mentioned static inline functions has an impact on the performance of a PGO build on Windows.

    @zooba
    Copy link
    Member

    zooba commented Sep 7, 2021

    I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

    I'm 99% sure it's a tracing PGO rather than sampling, so provided everything runs in the same order and follows the same codepaths, wall-clock timings shouldn't have a significant effect.

    Without looking at the generated code, it may be more effective to try and force the functions called from the macro to never be inlined. The optimiser may be missing that the calls are uncommon, and trying to inline them first, then deciding that the whole merged function is too big to inline in places where it would be useful.

    There's also no particular reason we need to use these tests as the profile, it's just that nobody has taken the time to come up with anything better. I'd rather see us invest time in generating a good profile rather than trying to hand-optimise inlining. (Though the test suite is good in many ways, because it makes sure all the extension modules are covered. We definitely want as close to full code coverage as we can get, at least for happy paths, but may need more eval loop in the profile if that's what needs help.)

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Sep 8, 2021

    This article briefly introduces the inlining decisions in MSVC.
    https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/

    @neonene
    Copy link
    Mannequin Author

    neonene mannequin commented Sep 11, 2021

    Thanks for all suggestions. I focused on my bisected commit and the previous.

    I run pyperformance with 4 functions never inlined in the sections below.

       _Py_DECREF()
       _Py_XDECREF()
       _Py_IS_TYPE()
       _Py_atomic_load_32bit_impl()

    are

    (1) never inlined in _PyEval_EvalFrameDefault().
    (2) never inlined in the other funcitons.
    (3) never inlined in all functions.

    slow downs [4-funcs never inlined section]
    --------------------------------------------------------------
    Windows x64 PGO (44job) (*) (1) (2) (3)
    rebuild none eval others all
    --------------------------------------------------------------
    b98eba5 (4 funcs inlined in eval) 1.00 1.05 1.09 1.14
    PR25244 ( not inlined in eval) 1.06 1.07 1.18 1.17

    pyperf compare_to upper lower:
    (*) 1.06x slower (slower 45, faster 4, not not significant 9)
    (1) 1.02x slower (slower 33, faster 13, not not significant 12)
    (2) 1.08x slower (slower 48, faster 6, not not significant 4)
    (3) 1.03x slower (slower 39, faster 6, not not significant 13)

    --------------------------------------------------------------
    Windows x86 PGO (44job) (*) (1) (2) (3)
    rebuild none eval others all
    --------------------------------------------------------------
    b98eba5 (4 funcs inlined in eval) 1.00 1.03 1.06 1.15
    PR25244 ( not inlined in eval) 1.13 1.13 1.22 1.24

    pyperf compare_to upper lower:
    (*) 1.13x slower (slower 54, faster 2, not not significant 2)
    (1) 1.10x slower (slower 47, faster 3, not not significant 8)
    (2) 1.14x slower (slower 54, faster 1, not not significant 3)
    (3) 1.08x slower (slower 43, faster 3, not not significant 12)

    In both x64 and x86, it looks column (2) and (*) has similar gaps.
    So, I would like to simply focus on the eval-loop.

    I built PGO with "/d2inlinestats" and "/d2inlinelogfull:_PyEval_EvalFrameDefault" according to the blog.

    I posted logs. As for PR25244, the logsize is 3x smaller than the previous and pgo rejects the 4 funcs above. I will look into it later.

    Collecting:

    Before the PR, it took 10x~ longer to link than without __forceinline function.

    Current build is 10x~ shorter than before to link.
    Before the PR, __forceinline had no impact to me.

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Sep 11, 2021

    MSVC 2019 has a /Ob3 option:
    https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion

    From the experience of another project, I conjecture /Ob3 increase the "global budget" mentioned in the blog.
    I used /Ob3 for the 3.10 branch, and there seems to be no significant performance change. If you have time, neonene welcome to verify this.

    @neonene
    Copy link
    Mannequin Author

    neonene mannequin commented Sep 11, 2021

    According to:
    https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-160

    PGO seems to override /Ob3.
    Around this issue, I posted benchmark on bpo-44381.
    On python building, /Ob3 works for only non-pgo-supported dlls like,

    _ctypes_test
    _freeze_importlib
    _msi
    _testbuffer
    _testcapi
    _testconsole
    _testembed
    _testimportmultiple
    _testinternalcapi
    _testmultiphase
    _uuid
    liblzma
    pylauncher
    pyshellext
    pywlauncher
    sqlite3
    venvlauncher
    venvwlauncher
    winsound

    I use this option in _msvccompiler.py for my pyd.
    I will try and report when PGO with /Ob3 makes difference in the log.

    @vstinner vstinner changed the title Performance regression 3.10b1 and later on Windows Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build Sep 13, 2021
    @vstinner vstinner changed the title Performance regression 3.10b1 and later on Windows Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build Sep 13, 2021
    @neonene
    Copy link
    Mannequin Author

    neonene mannequin commented Sep 13, 2021

    With msvc 16.10.3 and 16.11.2 (latest),
    PR25244 told me the amount of code in _PyEval_EvalFrameDefault() is over the limit of PGO.
    In the old version of _PyEval_EvalFrameDefault (b98eba5), the same issue can be caused adding any-code anywhere with more than 20 expressions/statements. For example, at the top/middle/end of the function, repeating "if (0) {}" 10times, or "if (0) {19 statements}". As for python3.9.7, more than 800 expressions/statements.

    Here is just a workaround for 3.10rc2 on windows.
    ==================================================

    --- Python/ceval.c
    +++ Python/ceval.c
    @@ -1306,9 +1306 @@
    -#define DISPATCH() \
    -    { \
    -        if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
    -            goto tracing_dispatch; \
    -        } \
    -        f->f_lasti = INSTR_OFFSET(); \
    -        NEXTOPARG(); \
    -        DISPATCH_GOTO(); \
    -    }
    +#define DISPATCH() goto tracing_dispatch
    @@ -1782,4 +1774,9 @@
         tracing_dispatch:
         {
    +        if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
    +            f->f_lasti = INSTR_OFFSET();
    +            NEXTOPARG();
    +            DISPATCH_GOTO();
    +        }
             int instr_prev = f->f_lasti;
             f->f_lasti = INSTR_OFFSET();

    ==================================================

    This patch becomes ineffective just adding one expression to DISPATCH macro as below

       #define DISPATCH() {if (1) goto tracing_dispatch;}

    And this approach is not sufficient for 3.11 with bigger eval-func.
    I don't know a cl/link option to lift such restriction of function size.

    3.10rc2 x86 pgo : 1.00
    patched : 1.09x faster (slower 5, faster 48, not significant 5)

    3.10rc2 x64 pgo : 1.00 (roughly the same speed as official bin)
    patched : 1.07x faster (slower 5, faster 47, not significant 6)
    patched(/Ob3) : 1.07x faster (slower 7, faster 45, not significant 6)

    x64 results are posted.

    Fixing inlining rejection also made __forceinline buildable with normal processing time and memory usage.

    @neonene
    Copy link
    Mannequin Author

    neonene mannequin commented Sep 16, 2021

    I reported this issue to developercommunity of microsoft.

    https://developercommunity.visualstudio.com/t/1531987

    @rhettinger
    Copy link
    Contributor

    These should be changed back to macros where inlining is guaranteed.

    @zooba
    Copy link
    Member

    zooba commented Sep 16, 2021

    I agree with Raymond. Let's stop throwing more code at this until we've figured out what's going on and revert the change for now.

    @zooba
    Copy link
    Member

    zooba commented Apr 8, 2022

    __assume(0) should be replaced with other function, inside the eval switch-case or in the inlined paths of callees. This is critical with PGO.

    Out of interest, have you done other experiments confirming this? The reference linked is talking about compiler throughput (i.e. how long it takes to compile), and while it hints that using __assume(0) may interfere with other optimisations, that isn't supported with any detail or analysis in the post.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @neonene
    Copy link
    Contributor

    neonene commented Apr 10, 2022

    have you done other experiments confirming this?

    My benchmark results are left in faster-cpython/ideas#321 (comment).

    __assume(0) is problematic only where the substitute function is inlined.


    Correction of my previous post:

    MSVC's stuck can be avoided in many ways, ...
    unless tp_dealloc does not create a inlined callsite

    -unless
    +if

    @gvanrossum
    Copy link
    Member

    gvanrossum commented Apr 11, 2022

    For Py_UNREACHABLE() then maybe we should just remove these two lines from, pymacro.h?

    #elif defined(_MSC_VER)
    #  define Py_UNREACHABLE() __assume(0)
    

    Then the code will fall back to

    #else
    #  define Py_UNREACHABLE() \
        Py_FatalError("Unreachable C code path reached")
    #endif
    

    @gvanrossum
    Copy link
    Member

    __assume(0) is problematic only where the substitute function is inlined.

    Can you elaborate? What is the "substitute function"? The macro definition is

    #  define Py_UNREACHABLE() __assume(0)
    

    so there is no inlined function. Are you referring to the code containing the call to Py_UNREACHABLE()? That wouldn't affect the ceval.c main loop in _PyEval_EvalFrameDefault because that function is definitely to large to be inlined. :-)

    What am I missing?

    @neonene
    Copy link
    Contributor

    neonene commented Apr 13, 2022

    Sorry for the lack of explanation.

    I encountered a measurable slowdown several months ago when Py_RETURN_RICHCOMPARE macro is inlined in the eval-loop.
    However, that may be x86 only.

    If I understand correctly, x86 official binaries are non-PGO builds. Then, a Py_FatalError() only for TARGET(CACHE) branch would be enough for now.


    When I change the current version as below:

    1. Substitute void Py_UNREACHABLE(void) {} or Py_FatalError() for __assume(0) in pymacro.h

    2. Make PyObject_RichCompare() called through a function pointer, adding
      this above _PyEval_EvalFrameDefault().

    static const richcmpfunc PyObject_RichCompare_PTR = PyObject_RichCompare;
    #define PyObject_RichCompare PyObject_RichCompare_PTR
    

    Then, PGO decides to inline PyObject_RichCompare(), based on its profile.
    This seems to affect "Function Layout optimization" even if it is not inlined. (under verification)

        PyObject_RichCompare (pgu decision)
          _PyThreadState_GET (pgu decision)
            _PyRuntimeState_GetThreadState (pgu decision)
          _PyErr_Occurred (pgu decision)
          -_PyErr_BadInternalCall (pgo hard reject)
          _Py_EnterRecursiveCall (pgu decision)
            _Py_MakeRecCheck (pgu decision)
            -_Py_CheckRecursiveCall (pgo hard reject)
          do_richcompare (pgu decision)
            PyType_IsSubtype (pgu decision)
              -type_is_subtype_base_chain (pgo hard reject)
            Py_DECREF (pgu decision)
              _Py_Dealloc (pgu decision)
            long_richcompare (pgu decision)
              ...
    >>>>>     Py_UNREACHABLE (pgu decision)  // or _Py_FatalErrorFunc (pgo hard reject)
    

    As for other place (_Py_FatalErrorFunc() never gets inlined anywhere):

        dict_get (pgu decision)
          -_PyArg_CheckPositional (pgo hard reject)
          dict_get_impl (pgu decision)
            unicode_get_hash (pgu decision)
            PyObject_Hash (pgu decision)
              _Py_HashPointer (pgu decision)
                _Py_HashPointerRaw (pgu decision)
              -PyType_Ready (pgo hard reject)
              -PyObject_HashNotImplemented (pgo hard reject)
            _Py_dict_lookup (pgu decision)
              -unicodekeys_lookup_unicode (pgo hard reject)
              -unicodekeys_lookup_generic (pgo hard reject)
              dictkeys_generic_lookup (pgu decision)
                dictkeys_get_index (pgu decision)
                Py_INCREF (pgu decision)
                -PyObject_RichCompareBool (pgo hard reject)
                Py_DECREF (pgu decision)
                  _Py_Dealloc (pgu decision)
    >>>>>       -Py_UNREACHABLE (pgo hard reject)  // (no harm)
    
        Py_DECREF (force inline)
          -_Py_Dealloc (initial scan: soft depth exceeded)
        -_Py_Specialize_BinaryOp (pgo hard reject)
    >>>  Py_UNREACHABLE (pgu decision)            // @TARGET(CACHE) inlined
        _PyFrame_SetStackPointer (pgu decision)
        -trace_function_entry (pgo hard reject)
        _PyFrame_GetStackPointer (pgu decision)
        PyDTrace_FUNCTION_ENTRY_ENABLED (pgu decision)
        -dtrace_function_entry (pgo hard reject)
        PyDTrace_LINE_ENABLED (pgu decision)
        -maybe_dtrace_line (pgo hard reject)
        _PyFrame_SetStackPointer (pgu decision)
        -maybe_call_line_trace (pgo hard reject)
        _PyFrame_GetStackPointer (pgu decision)
        -_PyInterpreterFrame_GetLine (pgo hard reject)
        -fprintf (initial scan: parameter mismatch, varargs, not eligible)
        -_PyErr_SetString (pgo hard reject)
    >>> -Py_UNREACHABLE (pgo hard reject)         // out of switch (no harm)
    

    Benchmark after removal of TARGET(CACHE) branch:

    Py_UNREACHABLE at
    long_richcompare()
    x64 PGO x86 PGO
    __assume(0) 1.00 1.00
    Py_FatalError 1.02x slower 1.03x~ faster
    void foo(void) {} 1.02x slower 1.04x~ faster

    __assume(0) works well in the hot section on x64.

    EDIT: The gap on x86 can be increased depending on the amount of optimization.

    >pyperf compare_to  assume64  fatalerr64
    
    Slower (25):
    - dulwich_log: 195 ms +- 3 ms -> 231 ms +- 50 ms: 1.19x slower
    - xml_etree_process: 98.3 ms +- 1.7 ms -> 109 ms +- 1 ms: 1.11x slower
    - pickle_pure_python: 540 us +- 7 us -> 592 us +- 41 us: 1.10x slower
    - logging_silent: 163 ns +- 1 ns -> 177 ns +- 3 ns: 1.09x slower
    - xml_etree_generate: 138 ms +- 4 ms -> 148 ms +- 3 ms: 1.07x slower
    - unpickle_pure_python: 348 us +- 9 us -> 371 us +- 5 us: 1.07x slower
    - meteor_contest: 152 ms +- 1 ms -> 162 ms +- 2 ms: 1.06x slower
    - hexiom: 10.8 ms +- 0.1 ms -> 11.4 ms +- 0.2 ms: 1.06x slower
    - deltablue: 6.78 ms +- 0.23 ms -> 7.14 ms +- 0.09 ms: 1.05x slower
    - scimark_fft: 491 ms +- 13 ms -> 517 ms +- 12 ms: 1.05x slower
    - fannkuch: 627 ms +- 14 ms -> 660 ms +- 20 ms: 1.05x slower
    - crypto_pyaes: 127 ms +- 3 ms -> 134 ms +- 4 ms: 1.05x slower
    - sqlalchemy_imperative: 44.8 ms +- 2.3 ms -> 47.1 ms +- 5.3 ms: 1.05x slower
    - nqueens: 164 ms +- 1 ms -> 172 ms +- 3 ms: 1.05x slower
    - float: 138 ms +- 2 ms -> 144 ms +- 3 ms: 1.05x slower
    - logging_format: 20.4 us +- 0.4 us -> 21.3 us +- 0.9 us: 1.04x slower
    - nbody: 179 ms +- 6 ms -> 185 ms +- 7 ms: 1.04x slower
    - django_template: 85.4 ms +- 1.1 ms -> 88.6 ms +- 2.3 ms: 1.04x slower
    - go: 240 ms +- 4 ms -> 249 ms +- 7 ms: 1.03x slower
    - regex_compile: 231 ms +- 2 ms -> 238 ms +- 3 ms: 1.03x slower
    - json_dumps: 21.1 ms +- 0.2 ms -> 21.6 ms +- 0.2 ms: 1.02x slower
    - logging_simple: 18.9 us +- 0.3 us -> 19.3 us +- 0.4 us: 1.02x slower
    - sqlalchemy_declarative: 277 ms +- 8 ms -> 283 ms +- 8 ms: 1.02x slower
    - scimark_monte_carlo: 104 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x slower
    - pathlib: 183 ms +- 12 ms -> 187 ms +- 7 ms: 1.02x slower
    
    Faster (6):
    - pickle_dict: 44.7 us +- 0.5 us -> 41.9 us +- 0.9 us: 1.07x faster
    - scimark_sor: 215 ms +- 4 ms -> 203 ms +- 3 ms: 1.06x faster
    - telco: 11.3 ms +- 0.1 ms -> 10.8 ms +- 0.1 ms: 1.04x faster
    - unpickle: 23.0 us +- 0.3 us -> 22.2 us +- 0.2 us: 1.04x faster
    - pickle_list: 6.75 us +- 0.17 us -> 6.60 us +- 0.19 us: 1.02x faster
    - python_startup: 13.8 ms +- 1.2 ms -> 13.5 ms +- 0.1 ms: 1.02x faster
    
    Benchmark hidden because not significant (27): 2to3, chameleon, chaos, json_loads, mako, pickle, pid
    igits, pyflate, python_startup_no_site, raytrace, regex_dna, regex_effbot, regex_v8, richards, scima
    rk_lu, scimark_sparse_mat_mult, spectral_norm, sqlite_synth, sympy_expand, sympy_integrate, sympy_su
    m, sympy_str, tornado_http, unpack_sequence, unpickle_list, xml_etree_parse, xml_etree_iterparse
    
    Geometric mean: 1.02x slower
    
    >pyperf compare_to  assume64  emptyfunc64
    
    Slower (23):
    - nbody: 179 ms +- 6 ms -> 191 ms +- 4 ms: 1.07x slower
    - xml_etree_process: 98.3 ms +- 1.7 ms -> 105 ms +- 1 ms: 1.07x slower
    - meteor_contest: 152 ms +- 1 ms -> 163 ms +- 7 ms: 1.07x slower
    - scimark_fft: 491 ms +- 13 ms -> 523 ms +- 11 ms: 1.07x slower
    - xml_etree_generate: 138 ms +- 4 ms -> 146 ms +- 3 ms: 1.06x slower
    - fannkuch: 627 ms +- 14 ms -> 662 ms +- 15 ms: 1.06x slower
    - spectral_norm: 179 ms +- 2 ms -> 189 ms +- 3 ms: 1.05x slower
    - unpickle_pure_python: 348 us +- 9 us -> 366 us +- 5 us: 1.05x slower
    - telco: 11.3 ms +- 0.1 ms -> 11.8 ms +- 0.2 ms: 1.05x slower
    - scimark_lu: 153 ms +- 3 ms -> 160 ms +- 2 ms: 1.04x slower
    - django_template: 85.4 ms +- 1.1 ms -> 89.0 ms +- 1.2 ms: 1.04x slower
    - python_startup: 13.8 ms +- 1.2 ms -> 14.3 ms +- 1.8 ms: 1.04x slower
    - nqueens: 164 ms +- 1 ms -> 170 ms +- 4 ms: 1.04x slower
    - json_dumps: 21.1 ms +- 0.2 ms -> 21.9 ms +- 0.2 ms: 1.04x slower
    - deltablue: 6.78 ms +- 0.23 ms -> 7.01 ms +- 0.10 ms: 1.03x slower
    - crypto_pyaes: 127 ms +- 3 ms -> 131 ms +- 3 ms: 1.03x slower
    - unpack_sequence: 69.1 ns +- 3.0 ns -> 71.3 ns +- 6.4 ns: 1.03x slower
    - hexiom: 10.8 ms +- 0.1 ms -> 11.1 ms +- 0.2 ms: 1.03x slower
    - sympy_sum: 299 ms +- 4 ms -> 308 ms +- 6 ms: 1.03x slower
    - chameleon: 14.7 ms +- 0.2 ms -> 15.1 ms +- 0.8 ms: 1.03x slower
    - sqlalchemy_declarative: 277 ms +- 8 ms -> 284 ms +- 12 ms: 1.02x slower
    - raytrace: 543 ms +- 6 ms -> 556 ms +- 8 ms: 1.02x slower
    - scimark_sparse_mat_mult: 6.86 ms +- 0.35 ms -> 7.00 ms +- 0.30 ms: 1.02x slower
    
    Faster (4):
    - scimark_sor: 215 ms +- 4 ms -> 202 ms +- 2 ms: 1.06x faster
    - sqlite_synth: 5.44 us +- 0.13 us -> 5.21 us +- 0.02 us: 1.05x faster
    - pickle_list: 6.75 us +- 0.17 us -> 6.53 us +- 0.05 us: 1.03x faster
    - pickle_dict: 44.7 us +- 0.5 us -> 43.7 us +- 1.5 us: 1.02x faster
    
    Benchmark hidden because not significant (31): 2to3, chaos, dulwich_log, float, go, json_loads, logg
    ing_format, logging_silent, logging_simple, mako, pathlib, pickle, pickle_pure_python, pidigits, pyf
    late, python_startup_no_site, regex_compile, regex_dna, regex_effbot, regex_v8, richards, scimark_mo
    nte_carlo, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpickle,
    unpickle_list, xml_etree_parse, xml_etree_iterparse
    
    Geometric mean: 1.02x slower
    
    >pyperf compare_to  assume86  fatalerr86
    
    Slower (6):
    - sqlalchemy_imperative: 40.2 ms +- 2.2 ms -> 42.6 ms +- 2.9 ms: 1.06x slower
    - unpickle_list: 8.10 us +- 0.10 us -> 8.47 us +- 0.18 us: 1.05x slower
    - pickle: 21.1 us +- 0.6 us -> 21.8 us +- 0.2 us: 1.04x slower
    - scimark_monte_carlo: 148 ms +- 5 ms -> 153 ms +- 5 ms: 1.03x slower
    - sqlalchemy_declarative: 261 ms +- 5 ms -> 269 ms +- 14 ms: 1.03x slower
    - sqlite_synth: 5.57 us +- 0.03 us -> 5.71 us +- 0.03 us: 1.03x slower
    
    Faster (28):
    - unpack_sequence: 126 ns +- 11 ns -> 95.3 ns +- 5.7 ns: 1.32x faster
    - logging_silent: 225 ns +- 5 ns -> 191 ns +- 10 ns: 1.18x faster
    - spectral_norm: 279 ms +- 6 ms -> 244 ms +- 4 ms: 1.14x faster
    - hexiom: 14.3 ms +- 0.1 ms -> 12.6 ms +- 0.1 ms: 1.14x faster
    - scimark_lu: 230 ms +- 6 ms -> 205 ms +- 5 ms: 1.12x faster
    - richards: 101 ms +- 1 ms -> 90.4 ms +- 4.0 ms: 1.12x faster
    - scimark_sparse_mat_mult: 10.7 ms +- 0.3 ms -> 9.64 ms +- 0.27 ms: 1.11x faster
    - deltablue: 8.37 ms +- 0.10 ms -> 7.65 ms +- 0.31 ms: 1.09x faster
    - xml_etree_process: 129 ms +- 1 ms -> 120 ms +- 2 ms: 1.07x faster
    - scimark_sor: 275 ms +- 6 ms -> 257 ms +- 11 ms: 1.07x faster
    - float: 175 ms +- 2 ms -> 164 ms +- 2 ms: 1.07x faster
    - go: 308 ms +- 7 ms -> 288 ms +- 9 ms: 1.07x faster
    - xml_etree_generate: 176 ms +- 2 ms -> 165 ms +- 4 ms: 1.06x faster
    - pickle_pure_python: 652 us +- 11 us -> 621 us +- 16 us: 1.05x faster
    - raytrace: 695 ms +- 5 ms -> 663 ms +- 6 ms: 1.05x faster
    - pathlib: 189 ms +- 6 ms -> 180 ms +- 5 ms: 1.04x faster
    - pyflate: 963 ms +- 21 ms -> 923 ms +- 27 ms: 1.04x faster
    - nbody: 247 ms +- 8 ms -> 237 ms +- 7 ms: 1.04x faster
    - xml_etree_parse: 272 ms +- 4 ms -> 264 ms +- 7 ms: 1.03x faster
    - regex_compile: 272 ms +- 3 ms -> 263 ms +- 2 ms: 1.03x faster
    - mako: 22.0 ms +- 0.3 ms -> 21.3 ms +- 0.1 ms: 1.03x faster
    - chameleon: 19.9 ms +- 0.2 ms -> 19.4 ms +- 0.4 ms: 1.03x faster
    - chaos: 174 ms +- 2 ms -> 170 ms +- 2 ms: 1.03x faster
    - xml_etree_iterparse: 175 ms +- 3 ms -> 170 ms +- 4 ms: 1.03x faster
    - nqueens: 203 ms +- 1 ms -> 198 ms +- 2 ms: 1.03x faster
    - unpickle_pure_python: 456 us +- 6 us -> 445 us +- 16 us: 1.02x faster
    - unpickle: 25.9 us +- 0.4 us -> 25.3 us +- 0.4 us: 1.02x faster
    - json_loads: 47.1 us +- 0.4 us -> 46.2 us +- 0.5 us: 1.02x faster
    
    Benchmark hidden because not significant (24): 2to3, crypto_pyaes, django_template, dulwich_log, fan
    nkuch, json_dumps, logging_format, logging_simple, meteor_contest, pickle_dict, pickle_list, pidigit
    s, python_startup, python_startup_no_site, regex_dna, regex_effbot, regex_v8, scimark_fft, sympy_exp
    and, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http
    
    Geometric mean: 1.03x faster
    
    >pyperf compare_to  assume86  emptyfunc86
    
    Slower (3):
    - pickle: 21.1 us +- 0.6 us -> 22.2 us +- 0.5 us: 1.05x slower
    - python_startup: 13.2 ms +- 0.2 ms -> 13.8 ms +- 2.5 ms: 1.04x slower
    - regex_dna: 243 ms +- 1 ms -> 250 ms +- 6 ms: 1.03x slower
    
    Faster (38):
    - unpack_sequence: 126 ns +- 11 ns -> 94.1 ns +- 5.3 ns: 1.34x faster
    - hexiom: 14.3 ms +- 0.1 ms -> 12.5 ms +- 0.2 ms: 1.15x faster
    - deltablue: 8.37 ms +- 0.10 ms -> 7.43 ms +- 0.08 ms: 1.13x faster
    - spectral_norm: 279 ms +- 6 ms -> 249 ms +- 2 ms: 1.12x faster
    - logging_silent: 225 ns +- 5 ns -> 202 ns +- 7 ns: 1.11x faster
    - xml_etree_process: 129 ms +- 1 ms -> 118 ms +- 1 ms: 1.09x faster
    - unpickle_pure_python: 456 us +- 6 us -> 418 us +- 12 us: 1.09x faster
    - nqueens: 203 ms +- 1 ms -> 187 ms +- 2 ms: 1.09x faster
    - scimark_sparse_mat_mult: 10.7 ms +- 0.3 ms -> 9.85 ms +- 0.33 ms: 1.09x faster
    - scimark_lu: 230 ms +- 6 ms -> 213 ms +- 4 ms: 1.08x faster
    - xml_etree_generate: 176 ms +- 2 ms -> 163 ms +- 3 ms: 1.08x faster
    - richards: 101 ms +- 1 ms -> 93.7 ms +- 3.9 ms: 1.08x faster
    - float: 175 ms +- 2 ms -> 163 ms +- 3 ms: 1.08x faster
    - scimark_sor: 275 ms +- 6 ms -> 257 ms +- 6 ms: 1.07x faster
    - unpickle_list: 8.10 us +- 0.10 us -> 7.60 us +- 0.23 us: 1.07x faster
    - scimark_monte_carlo: 148 ms +- 5 ms -> 139 ms +- 5 ms: 1.06x faster
    - go: 308 ms +- 7 ms -> 290 ms +- 8 ms: 1.06x faster
    - raytrace: 695 ms +- 5 ms -> 661 ms +- 7 ms: 1.05x faster
    - crypto_pyaes: 175 ms +- 1 ms -> 166 ms +- 2 ms: 1.05x faster
    - nbody: 247 ms +- 8 ms -> 235 ms +- 8 ms: 1.05x faster
    - pyflate: 963 ms +- 21 ms -> 917 ms +- 15 ms: 1.05x faster
    - xml_etree_iterparse: 175 ms +- 3 ms -> 167 ms +- 6 ms: 1.04x faster
    - meteor_contest: 178 ms +- 2 ms -> 171 ms +- 5 ms: 1.04x faster
    - sympy_sum: 298 ms +- 11 ms -> 287 ms +- 2 ms: 1.04x faster
    - pickle_pure_python: 652 us +- 11 us -> 628 us +- 16 us: 1.04x faster
    - scimark_fft: 724 ms +- 10 ms -> 697 ms +- 15 ms: 1.04x faster
    - regex_compile: 272 ms +- 3 ms -> 263 ms +- 7 ms: 1.03x faster
    - unpickle: 25.9 us +- 0.4 us -> 25.1 us +- 0.3 us: 1.03x faster
    - chameleon: 19.9 ms +- 0.2 ms -> 19.3 ms +- 0.4 ms: 1.03x faster
    - sympy_expand: 912 ms +- 21 ms -> 885 ms +- 7 ms: 1.03x faster
    - mako: 22.0 ms +- 0.3 ms -> 21.4 ms +- 0.2 ms: 1.03x faster
    - chaos: 174 ms +- 2 ms -> 169 ms +- 2 ms: 1.03x faster
    - django_template: 93.7 ms +- 1.6 ms -> 91.1 ms +- 1.4 ms: 1.03x faster
    - fannkuch: 907 ms +- 19 ms -> 885 ms +- 12 ms: 1.02x faster
    - sympy_str: 542 ms +- 4 ms -> 529 ms +- 10 ms: 1.02x faster
    - xml_etree_parse: 272 ms +- 4 ms -> 266 ms +- 9 ms: 1.02x faster
    - json_dumps: 23.8 ms +- 0.2 ms -> 23.3 ms +- 0.3 ms: 1.02x faster
    - regex_v8: 38.7 ms +- 0.4 ms -> 38.0 ms +- 0.4 ms: 1.02x faster
    
    Benchmark hidden because not significant (17): 2to3, dulwich_log, json_loads, logging_format, loggin
    g_simple, pathlib, pickle_dict, pickle_list, pidigits, python_startup_no_site, regex_effbot, sqlalch
    emy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_integrate, telco, tornado_http
    
    Geometric mean: 1.04x faster
    

    @neonene
    Copy link
    Contributor

    neonene commented Apr 13, 2022

    Are you referring to the code containing the call to Py_UNREACHABLE()? That wouldn't affect the ceval.c main loop in _PyEval_EvalFrameDefault because that function is definitely to large to be inlined. :-)

    Here is MSVC's inlining decision on current Python3.10:
    https://bugs.python.org/file50291/PR28475_inline.log

    Weird, but faster than when only tiny functions are inlined.
    In the log, Py_DECREF(static) is expanded until Py_Dealloc(extern) stops its recursion. That looks to me too expansive.

    @gvanrossum
    Copy link
    Member

    gvanrossum commented Apr 13, 2022

    It looks like we may be looking at different builds? I'm only looking at x64 builds for 3.11.

    If I understand correctly, x86 official binaries are non-PGO builds.

    @zooba Is that so?

    @zooba
    Copy link
    Member

    zooba commented Apr 13, 2022

    If I understand correctly, x86 official binaries are non-PGO builds.

    Yeah, this is correct. We're more likely to deprecate and drop the 32-bit binaries before we make any major effort to optimise them - they run under an emulation layer in the OS (practically all supported OS installs are 64-bit native), so aren't really going to be recommended for people who care about performance anyway.

    gvanrossum added a commit to gvanrossum/cpython that referenced this issue Apr 19, 2022
    gvanrossum added a commit that referenced this issue Apr 22, 2022
    Macros Py_DECREF, Py_XDECREF, Py_IS_TYPE, _Py_atomic_load_32bit_impl
    and _Py_DECREF_SPECIALIZED are redefined as macros
    that completely replace the inline functions of the same name.
    These three came out in the top four of functions that (in MSVC)
    somehow weren't inlined.
    
    Co-authored-by: Victor Stinner <vstinner@python.org>
    Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
    @neonene
    Copy link
    Contributor

    neonene commented Apr 23, 2022

    I think this issue can be closed. (I can't after migration)

    Most of my experiences are invalid after Guido's #91718 corrected the quirks of MSVC.
    Another reasonable fix would be a good test which makes specialized sections hotter.

    Thanks.

    @Fidget-Spinner
    Copy link
    Member

    Closing as requested by OP. Thanks for your investigations @neonene ! Thanks to Guido too for the fix.

    @gvanrossum
    Copy link
    Member

    Thank you @neonene for your gentle pushes and encouragement and help to get this fixed!

    @vstinner
    Copy link
    Member

    @neonene:

    Most of my experiences are invalid after Guido's #91718 corrected the quirks of MSVC.

    Do you mean that this merged change 2f233fc is now useless?

    @gvanrossum
    Copy link
    Member

    No they are complementary.

    @neonene
    Copy link
    Contributor

    neonene commented Apr 25, 2022

    Do you mean that this merged change 2f233fc is now useless?

    No. What I said is about the optimization, not the (force) inlining. And what I suggested before have been already fixed by f8dc618 (and 2f233fc):

    • tp_* or cfunc pointer in the eval-loop can inline multiple callees without conflict.

    • Moving LOAD_FAST out of switch according to the scores below has no advantage now.

    TOP3 entries with current 44 tests
    case 124  132522464  // LOAD_FAST
    case 100   48956231  // LOAD_CONST
    case  45   48318813  // LOAD_FAST__LOAD_FAST
    

    @vstinner
    Copy link
    Member

    What I understand is that PGO build of Python 3.11 on Windows will be faster thanks to these changes, and the Windows python.org binaries only use PGO for 64-bit, not for 32-bit.

    @neonene
    Copy link
    Contributor

    neonene commented Apr 25, 2022

    You can read a bit more posts and links because you have changed this thread's title several times.

    @vstinner
    Copy link
    Member

    Can someone please try to write a summary of this long and complex issue? It seems like different but related topics have been discussed and it's hard to get an overview. I'm confused between sometimes someone said that a change fixed the fix and then wrote that no, it's not really fixing the issue.

    @gvanrossum
    Copy link
    Member

    Let me give it a quick try.

    That's it.

    @vstinner
    Copy link
    Member

    Thanks for the summary. I would add that marking performance critical function with __forceinline (Py_ALWAYS_INLINE) was tested, but it didn't work.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows performance Performance or resource usage topic-C-API
    Projects
    None yet
    Development

    No branches or pull requests

    10 participants