Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-117578: Fix inlining regression in PyType_GetModuleByDef() #123100

Closed
wants to merge 4 commits into from
Closed

gh-117578: Fix inlining regression in PyType_GetModuleByDef() #123100

wants to merge 4 commits into from

Conversation

neonene
Copy link
Contributor

@neonene neonene commented Aug 17, 2024

On main and 3.13, there are cases where the get_module_by_def function in typeobject.c is not inlined in its wrapper functions:

Wrappers Windows callee: get_module_by_def()
PyType_GetModuleByDef() Release /Ob2:called /Ob3:inlined
PGO inlined
_PyType_GetModuleByDef2() Release /Ob2:called /Ob3:inlined
PGO called

Non-builtin modules can have extra function-call overheads, where the wrappers cannot be inlined.

This PR specifies Py_ALWAYS_INLINE to the callee.

cc @encukou

@neonene neonene requested a review from markshannon as a code owner August 17, 2024 11:40
@neonene neonene changed the title gh-117578: Fix inlining regression in PyType_GetModuleByDef() family gh-117578: Fix inlining regression in PyType_GetModuleByDef() Aug 17, 2024
@encukou
Copy link
Member

encukou commented Aug 19, 2024

Hm, it doesn't sound right to override profile-guided optimization, especially since test_decimal (the only current caller of _PyType_GetModuleByDef2) is in the PGO test set.
Does this PR have a significant performance impact?

@neonene
Copy link
Contributor Author

neonene commented Aug 20, 2024

It is mentioned on the faster-cpython repo that the telco test has slowed down a lot.

According to MSVC, the module state access counts were:

Function / breakdown     entry-cnt  alternative access
-----------------------  ---------  ---------------------------
PyType_GetModuleByDef()    6852643
    convert_op             2971848  via context object
    PyDecType_New          1651188  via context object (partial)
    dec_addstatus          1651188  via context object (partial)
    current_context        1486221  via context object (partial)
    dec_mpd_qquantize       247731  METH_METHOD
    ctx_mpd_qquantize       165000  METH_METHOD
    ...

_PyType_GetModuleByDef2()  1073193
    nm_mpd_qadd             660462
    nm_mpd_qmul             412731

Tested with the /Ob3 option, switching the inlining specifier: f740a5d. My Release/PGO builds on Windows get slower using TLS version of PyThreadState_Get(), which is also observed at #103324 (comment). If *nix OSes are in good health with TLS, I guess I also need to run the telco with a good condition (without TLS):

sub
GetModuleByDef call inline inline
GetModuleByDef2 call call inline
normal 3.14 perf (the higher, the faster)
1.00x (base) 1.01x
1.03x 1.05x 1.08x respect alternative
less TLS overhead perf (experiment)
1.05x 1.05x 1.06x
1.08x 1.08x 1.08x respect alternative
module state access normal TLS-less
current (base) 1.05x
This PR 1.01x 1.06x
PyThreadState_Get() 1.05x 1.04x example patches
PR with alternatives 1.08x 1.08x
global state 1.08x 1.11x
static type (GC unused) 1.14x 1.14x taken from 3.12

This patch would need to be applied if we wanted as much speed as the global state access on Windows, which has little effect alone (1%) for some reason.

@neonene
Copy link
Contributor Author

neonene commented Aug 20, 2024

Windows PGO:
- telco: 9.21 ms +- 0.12 ms -> 9.02 ms +- 0.09 ms: 1.02x faster

@neonene
Copy link
Contributor Author

neonene commented Aug 21, 2024

Is it acceptable that test_decimal.py has a test case like below, instead of touching the C code?

@requires_cdecimal
class CArithmeticOperatorsTest(ArithmeticOperatorsTest, unittest.TestCase):
    ...
    @unittest.skipIf(not test.support.PGO, 'PGO training only')
    def test_excecise_binop(self):
        Decimal = self.decimal.Decimal
        d = Decimal('11.1')
        for i in range(500000):
            1 + d   # at least 300000 times

@neonene
Copy link
Contributor Author

neonene commented Aug 21, 2024

I'll try PyType_GetBaseByToken() version.

@neonene
Copy link
Contributor Author

neonene commented Aug 26, 2024

Closing in favor of proposing the PyType_GetBaseByToken() version, which can supersede _PyType_GetModuleByDef2() on PGO and Relase(/Ob3) builds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants