
PEP 669: Low Impact Instrumentation and Monitoring for CPython. #2070

Closed

Conversation

markshannon (Member)

A PEP for low impact debugging, profiling, coverage, etc. using quickening.


@brettcannon brettcannon left a comment


Please update .github/CODEOWNERS as part of this PR.


@brandtbucher brandtbucher left a comment


Very cool! Found a few possible typos that you might want to address before merging:


Using a profiler or debugger in CPython can have a severe impact on
performance. Slowdowns by an order of magnitude are not uncommon.
It does not have this bad.
Member

Suggested change
It does not have this bad.
It does not have to be this bad.

expect to be active for the entire lifetime of the program.

Monitoring can occur at any point in the program's life and be applied
anywhere in the program. Monitoring points are expected to few.
Member

Suggested change
anywhere in the program. Monitoring points are expected to few.
anywhere in the program. Monitoring points are expected to be few.

onto the control flow graph of the code object.
``BRANCH``, ``JUMPBACK``, ``START`` and ``RESUME`` events will inform which
basic blocks have started to execute.
The ``RAISE`` event with mark any blocks that did not complete.
Member

Suggested change
The ``RAISE`` event with mark any blocks that did not complete.
The ``RAISE`` event will mark any blocks that did not complete.

Comment on lines +271 to +276
* ENTER
* EXIT
* UNWIND
* C_CALL
* C_RETURN
* RAISE
Member

Suggested change
* ENTER
* EXIT
* UNWIND
* C_CALL
* C_RETURN
* RAISE
* ``ENTER``
* ``EXIT``
* ``UNWIND``
* ``C_CALL``
* ``C_RETURN``
* ``RAISE``


@gvanrossum gvanrossum left a comment


Sorry for the very chatty review. I read the doc in a single pass and remarked upon anything that wasn't clear to me from what came before.

Comment on lines +22 to +23
Using the new API, code run under a debugger on 3.11 should easily outperform
code run without a debugger on 3.10.
Member

Presumably only when the debugger is not actually invoked? I can believe it will be faster for code in a frame containing a breakpoint that is never hit. But I cannot believe it would be faster if some Python code would be run during e.g. line tracing or when evaluating a condition each time a breakpoint is hit.

Comment on lines +27 to +35
Motivation
==========

Developers should not have to pay an unreasonable cost to use debuggers,
profilers and other similar tools.

C++ and Java developers expect to be able to run a program at full speed
(or very close to it) under a debugger.
Python developers should expect that too.
Member

This section is just rhetoric. There's already a bunch of that in the Abstract ("It does not have this bad", "at low cost", "easily outperform") -- I'd like to hear more about what approach this is actually taking sooner.

markshannon (Member Author)

It is the "motivation" section; it should be motivating 🙂
I've toned down the abstract to compensate.

Comment on lines +40 to +45
The quickening mechanism provided by PEP 659 provides a way to dynamically
modify executing Python bytecode. These modifications have no cost beyond
the parts of the code that are modified and a relatively low cost to those
parts that are modified. We can leverage this to provide an efficient
mechanism for instrumentation and monitoring that was not possible in 3.10
or earlier.
Member

This is something that you could lead with.

Instrumentation occurs early in a program's life cycle and persists through
the lifetime of the program. It is expected to be pervasive, but fixed.
Instrumentation is designed to support profiling and coverage tools that
expect to be active for the entire lifetime of the program.
Member

What if I want to profile only a specific function call? Is that not covered? I suppose it is, but your simplified description of instrumentation leaves little room for it.

markshannon (Member Author)

To profile calls to a specific function foo:

instrumentation.instrument(foo.__code__, instrumentation.ENTER)
instrumentation.register(instrumentation.ENTER, profiler_func)
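
For illustration only, a minimal sketch of what profiler_func might look like, assuming the draft's proposed instrumentation module and the ENTER callback signature func(code, offset) described later in the PEP; the timing dictionary is invented for the example:

import time
from types import CodeType

entry_times = {}   # code object name -> list of entry timestamps

def profiler_func(code: CodeType, offset: int) -> None:
    # Record when each instrumented code object is entered; a real profiler
    # would pair this with EXIT events to compute durations.
    entry_times.setdefault(code.co_name, []).append(time.monotonic())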


Monitoring can occur at any point in the program's life and be applied
anywhere in the program. Monitoring points are expected to few.
The capabilities of monitoring are a superset of that of profiling,
Member

Now I'm confused. The use cases that I'm familiar with are coverage, profiling and debugging (both stepping through code and breakpoints). How do these use cases map to the two parts, instrumentation and monitoring? You imply there's a mapping, but I'm confused by what the mapping is supposed to be, since debugging seems to fall in neither category.

markshannon (Member Author)

Debugging is a form of monitoring. The debugger is monitoring the program being run.
Any ideas for a better term than "monitoring"? I don't want to use "debugging" as that is just a particular use case.

Member

I think the terms are fine, all I'm asking for is that you clarify the mapping between the two concepts and the three or more use cases. 'Monitoring is a superset of profiling" doesn't do that for me, but "A profiler would use the monitoring API" would. Except your example of profiling a function (in the comment above) uses instrumentation, not profiling. So I'm still confused about the relationship between instrumentation and monitoring, and between those and the use cases.

ncoghlan (Contributor)

I don't think the draft terms are fine, as I initially thought the "instrumentation API" was going to be "How clients specify what to monitor" and the monitoring API was going to be "How clients are notified of monitored events".

Even after having the distinction explained, I don't think I'd be able to remember "instrumentation is low overhead, monitoring is high overhead". If anything, I would expect them to be the other way around due to the way somewhat similar terminology gets used in hardware testing (non-intrusively monitoring ordinary device outputs vs instrumenting the test rig with additional data collection points).

For the actual distinction being made, my suggestions would be:

  • "passive monitoring API" (pervasive collection of events, low overhead; e.g. coverage, profiling)
  • "dynamic monitoring API" (selective collection of events, potentially high overhead; e.g. break points, watching local variables for changes)

To avoid excessive overhead, a single call should be made to
``instrumentation.insert_monitors`` passing all the offsets at once.

Breakpoints can suspended with ``instrumentation.monitor_off``.
Member

Does that re-specialize the instruction at that offset?

Regardless of the answer it would seem this is deleting a breakpoint, not just suspending it. (If there was a difference, I'd think that suspending makes it easy to re-enable, but there doesn't seem to be a semantic difference in this case, so I prefer the simpler "deleting".)

markshannon (Member Author)

Does that re-specialize the instruction at that offset?

Maybe, but probably not.

Turning a monitor off and removing it have very different performance profiles.

Turning a monitor off is cheap, but will interfere with optimization.
Removing a monitor is very expensive, but will ultimately allow full optimization.

It is up to the implementer of the debugger which to use, but if the user of the debugger turns a breakpoint off, rather than removing it, it implies they may want to turn it on again soon.
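
To make the trade-off concrete, a rough sketch using the draft API: insert_monitors and monitor_off appear in the PEP text, while monitor_on and remove_monitors are assumed counterparts that the draft does not spell out in the excerpts quoted here, and the code object and offsets are invented.

co = handler.__code__                           # hypothetical code object containing the breakpoint
offsets = (24,)                                 # hypothetical instruction offsets for the breakpoint line

instrumentation.insert_monitors(co, *offsets)   # expensive: set the breakpoint once
instrumentation.monitor_off(co, offsets[0])     # cheap: suspend it, but the checkpoint still limits optimization
instrumentation.monitor_on(co, offsets[0])      # cheap: re-enable it
instrumentation.remove_monitors(co, *offsets)   # expensive: delete it, allowing full optimization again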

Breakpoints can suspended with ``instrumentation.monitor_off``.

Debuggers can break on exceptions being raised by registering a callable
for ``RAISE``:
Member

So the RAISE event applies to any reason an exception is raised, not just a raise statement, right? This could bear clarification.

onto the control flow graph of the code object.
``BRANCH``, ``JUMPBACK``, ``START`` and ``RESUME`` events will inform which
basic blocks have started to execute.
The ``RAISE`` event with mark any blocks that did not complete.
Member

"with"?

* UNWIND
* C_CALL
* C_RETURN
* RAISE
Member

Why is RAISE needed here?

@markshannon markshannon Sep 14, 2021

To match C_CALL if the callee raises.

Comment on lines +297 to +300
References
==========

[A collection of URLs used as references through the PEP.]
Member

You can remove this if you're not planning to add any references.

@gvanrossum (Member)

(I should add that, like Brandt, I think this is very cool, and a nice win for Quickening.)

@fabioz fabioz left a comment


Hi Mark,

Sorry for the long comment here... I couldn't really find a way to fit those as separate comments to specific points of the PEP.

So, I currently have mixed feelings regarding this approach... I do see some potential performance wins here, but I think it'll still depend a lot on the actual implementation (so, I'm a bit wary that the abstract starts talking about the current slowdowns when there are currently already alternatives to do something similar with the existing frame evaluation).

i.e.: using the python frame evaluation mechanism to add breakpoints by changing bytecode is already done in debuggers nowadays and debuggers will still need to register a callback to know when a Python function will be entered and then monitor changes and potentially reapply the monitoring on breakpoint changes, so, I'm not 100% sure this will be a win performance wise if you're actually comparing with that case.

It's also a bit troublesome that it says that modifying the monitoring will be slow, especially if this will be the way to make stepping work after a breakpoint is reached. It could potentially be a net-negative if that's the case -- as far as I understood, for stepping debuggers should track all code entered and add monitoring for all line changes, so if that's slow compared to having the tracing function, the final result would be a net-negative when stepping (if I have misunderstood, please let me know).

As a note, I do see some cases where the approach presented here may be faster, especially if exceptions have to be tracked (which does require a tracing function even when using programmatic breakpoints in the current approach), but I'm also not sure how much would be due to improvements here or because the current version of tracing provided by CPython calls local frame updates at each step due -- see: https://bugs.python.org/issue42197 -- presumably having that fixed would automatically make debuggers in general an order of magnitude faster without these changes -- although specialized debuggers probably know this caveat already and do provide their own C-level tracing function ;)

It may be nice to have some reference implementation with some performance numbers comparing approaches, especially when having to do step-handling to know this won't be a net negative...

So, let me try to elaborate a bit in terms of what I see as pros/cons:

Pros:

  • Tracking of unhandled and handled exceptions through a single instrumentation point (it's hard to detect whether an unhandled exception is being the point of a return of a function right now and this would make the life of debugger implementors dealing with handled/unhandled exceptions much nicer).
  • Dealing with the current frame evaluation mode to be able to modify bytecode right now works, but the API isn't that nice (the instrumentation does seem like a nicer API).

Cons:

  • if instrumenting line breakpoints for step-tracking is slow after a breakpoint is hit, it may not be feasible for debuggers to use it.
  • Performance-wise, I see a similar overhead in debuggers that are currently optimized to deal with the frame eval mode (although they may need to fallback to tracing on breakpoint changes for the frames that are already currently running or when tracking exceptions, so, in a few situations this could be a win).

Some questions:

  • Will this tracing be disabled on recursion errors as happens with the current tracing function set in sys.settrace? This is a major annoyance because debuggers that do use tracing can get lost right now, so, it'd be nice to know how the recursion limit would work for debuggers in this case.
  • It seems there are no mentions on dealing with particular threads... it'd be nice if monitoring could be completely removed for some target thread (for instance, debuggers using internal threads could completely disable the tracing for a given thread -- although note that in general I do want to instrument for all threads, but it'd be nice to disable it specifically for some threads so that those don't have any additional overhead).
  • The PEP should probably make it clear that instrumentation calls will still be called when code is called from another instrumentation point (i.e.: when a user is in a breakpoint and calls custom evaluation code, a debugger may still want to hit breakpoints in that case).
  • Will it be possible for debuggers to change the line being executed? i.e.: one feature available for debuggers right now is that inside a tracing function it's possible to set the next line to be executed (which is a common feature) and there's no mention of that here (if that's not possible, debuggers will still have to fallback to tracing to actually handling breakpoints as is the case right now when using the frame evaluation mode).
  • One nice thing on java is the ability to drop frames and restart the execution with a changed code object (so, for instance, it's possible to change the code and restart the execution of a frame), do you see any way for that to work here?

@markshannon (Member Author)

Hi Mark,

Sorry for the long comment here... I couldn't really find a way to fit those as separate comments to specific points of the PEP.

So, I currently have mixed feelings regarding this approach... I do see some potential performance wins here, but I think it'll still depend a lot on the actual implementation (so, I'm a bit wary that the abstract starts talking about the current slowdowns when there are currently already alternatives to do something similar with the existing frame evaluation).

Correct me if I'm wrong, but a PEP 523 based debugger still needs to check if the callee function contains a breakpoint on every call, and replace the bytecode which will prevent quickening and specialization.

OOI, what's the slowdown of running under the PyDev debugger with a few breakpoints set (but not hit) for 3.9/3.10?

i.e.: using the python frame evaluation mechanism to add breakpoints by changing bytecode is already done in debuggers nowadays and debuggers will still need to register a callback to know when a Python function will be entered and then monitor changes and potentially reapply the monitoring on breakpoint changes, so, I'm not 100% sure this will be a win performance wise if you're actually comparing with that case.

With this PEP there will be no need for the debugger to change bytecodes, just set a checkpoint and let the VM handle it.
If you need to know when a particular function is entered then that can be handled using instrumentation.

Not only should it make your life much easier, it is faster because the VM knows which optimizations it can still apply.

It's also a bit troublesome that it says that modifying the monitoring will be slow, especially if this will be the way to make stepping work after a breakpoint is reached. It could potentially be a net-negative if that's the case -- as far as I understood, for stepping debuggers should track all code entered and add monitoring for all line changes, so if that's slow compared to having the tracing function, the final result would be a net-negative when stepping (if I have misunderstood, please let me know).

Stepping in a debugger requires human input, which takes hundreds of milliseconds at the least. It won't be any slower.

As a note, I do see some cases where the approach presented here may be faster, especially if exceptions have to be tracked (which does require a tracing function even when using programmatic breakpoints in the current approach), but I'm also not sure how much would be due to improvements here or because the current version of tracing provided by CPython calls local frame updates at each step due -- see: https://bugs.python.org/issue42197 -- presumably having that fixed would automatically make debuggers in general an order of magnitude faster without these changes -- although specialized debuggers probably know this caveat already and do provide their own C-level tracing function ;)

It may be nice to have some reference implementation with some performance numbers comparing approaches, especially when having to do step-handling to know this won't be a net negative...

So, let me try to elaborate a bit in terms of what I see as pros/cons:

Pros:

* Tracking of unhandled and handled exceptions through a single instrumentation point (it's hard to detect whether an unhandled exception is being the point of a return of a function right now and this would make the life of debugger implementors dealing with handled/unhandled exceptions much nicer).

* Dealing with the current frame evaluation mode to be able to modify bytecode right now works, but the API isn't that nice (the instrumentation does seem like a nicer API).

Cons:

* if instrumenting line breakpoints for step-tracking is slow after a breakpoint is hit, it may not be feasible for debuggers to use it.

The alternative is to be slow all the time, as debuggers currently are. Slow immediately after a breakpoint is much better than slow all the time.

* Performance-wise, I see a similar overhead in debuggers that are currently optimized to deal with the frame eval mode (although they may need to fallback to tracing on breakpoint changes for the frames that are already currently running or when tracking exceptions, so, in a few situations this could be a win).

Using PEP 523 prevents specializing and flattening of Python-to-Python calls. Rewriting the bytecode prevents quickening and all other specializations. A checkpoint only prevents specialization at the checkpoint. A debugger using checkpoints for breakpoints will easily outperform a PEP 523 based debugger, assuming both are implemented to the same standard.

Some questions:

* Will this tracing be disabled on recursion errors as happens with the current tracing function set in `sys.settrace`? This is a major annoyance because debuggers that do use tracing can get lost right now, so, it'd be nice to know how the recursion limit would work for debuggers in this case.

There is no tracing to be disabled, just checkpoints. The recursion limit will be unchanged.
You might need to temporarily increase the recursion limit in your callback functions when handling RecursionErrors.

Because we will flatten Python-to-Python calls in 3.11 (that's the plan anyway), you should have plenty of headroom, even on MacOS.

* It seems there are no mentions on dealing with particular threads... it'd be nice if monitoring could be completely removed for some target thread (for instance, debuggers using internal threads could completely disable the tracing for a given thread -- although note that in general I do want to instrument for all threads, but it'd be nice to disable it specifically for some threads so that those don't have any additional overhead).

There is no way to disable callbacks for specific threads. That is left to the user.
If you want to only break in a particular thread, then you'll need to check the thread in the callback.
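
A minimal sketch of that pattern; the callback signature follows the draft's func(code, offset) form, and DEBUGGER_THREADS / handle_breakpoint are invented for the example:

import threading

DEBUGGER_THREADS = set()   # idents of the debugger's own internal threads

def breakpoint_callback(code, offset):
    # Ignore events fired on the debugger's internal threads.
    if threading.get_ident() in DEBUGGER_THREADS:
        return
    handle_breakpoint(code, offset)   # hypothetical debugger logic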

* The PEP should probably make it clear that instrumentation calls will still be called when code is called from another instrumentation point (i.e.: when a user is in a breakpoint and calls custom evaluation code, a debugger may still want to hit breakpoints in that case).

I'll add that when I get a chance.

* Will it be possible for debuggers to change the line being executed? i.e.: one feature available for debuggers right now is that inside a tracing function it's possible to set the next line to be executed (which is a common feature) and there's no mention of that here (if that's not possible, debuggers will still have to fallback to tracing to actually handling breakpoints as is the case right now when using the frame evaluation mode).

We will probably need to allow frame.f_lineno = ... to work outside of trace functions, maybe add a special context manager to allow it.
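
Purely to illustrate the shape of that idea (nothing like this exists yet; the name and semantics are invented):

from contextlib import contextmanager

@contextmanager
def allow_line_change(frame):
    # Hypothetically tell the VM that assigning to frame.f_lineno is permitted
    # for this frame even though no trace function is active.
    yield frame

# A debugger's "set next statement" command might then look like:
# with allow_line_change(frame):
#     frame.f_lineno = target_line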

* One nice thing on java is the ability to drop frames and restart the execution with a changed code object (so, for instance, it's possible to change the code and restart the execution of a frame), do you see any way for that to work here?

That is outside the scope of this PEP. It seems a strange feature, is it used much?


fabioz commented Sep 16, 2021

Hi Mark,
Sorry for the long comment here... I couldn't really find a way to fit those as separate comments to specific points of the PEP.
So, I currently have mixed feelings regarding this approach... I do see some potential performance wins here, but I think it'll still depend a lot on the actual implementation (so, I'm a bit wary that the abstract starts talking about the current slowdowns when there are currently already alternatives to do something similar with the existing frame evaluation).

Correct me if I'm wrong, but a PEP 523 based debugger still needs to check if the callee function contains a breakpoint on every call, and replace the bytecode which will prevent quickening and specialization.

Well, the code object is cached and then it's set in the frame before the frame executes, but in a way, it's always the same code object which is set in a new frame, so, I'm not sure why that would prevent specialization of said frame/code object (I'm not as familiar with the area as you are, though, so if you say it'll prevent quickening and specialization I believe you).

Also, you lost me a bit... as far as I understood the debugger will still need to instrument things to check and update the monitors on every call so that it can reach all the code objects that'll be executed (there's a fast way out in which nothing is done after a few checks, but that's true for the existing approach too) -- I don't really see any other way (i.e.: how else would the debugger monitor things such as anonymous functions inside of functions?) or are you thinking about a different approach? So, the approach (as far as I can see) is mostly the same and the overhead would also be mostly the same (for line breakpoints)...

OOI, what's the slowdown of running under the PyDev debugger with a few breakpoints set (but not hit) for 3.9/3.10?

I just did some experiments in this (on 3.9).

I created a worst case for the debugger: one where method calls dominate the program time (since that's where it does all the computation; afterwards it just waits for a breakpoint to be hit) and the method itself does nothing:

import time

def method2():
    i = 1

def method():

    for i in range(200000):
        method2()

        if False:
            # Unreachable breakpoint here
            pass

def caller():
    start_time = time.time()
    method()
    print('TotalTime>>%s<<' % (time.time()-start_time,))

So, in my machine running without debugging this takes around 0.021s.
Just with the frame eval function (which does add the programmatic breakpoints when a call is made), it takes around 0.087s.
If the tracing is set (to a no-op) it takes around 0.156s.
If python with sys.settrace with a function that returns itself (without really doing anything and not under the debugger), it takes around 0.27s

Keep in mind that this is the absolute worst case -- but in a way also the simplest, with only line breakpoints -- if exception breakpoints are added, then the debugger can't forgo tracing for all the events and the time goes up to 0.91s, so, just the fact that exceptions would be notified from VM events could be a huge win in that particular case.

Now, having given those numbers (on the worst case), from what I've seen in the general case (on real world programs), the debugger makes programs 1.5-4 times slower when the user doesn't have exception breakpoints and 8-10 times slower if he does enable the use of exception breakpoints -- and for this case this PEP would help quite a bit, so, it's a good +1 from me for this case alone ;)

  • Will it be possible for debuggers to change the line being executed? i.e.: one feature available for debuggers right now is that inside a tracing function it's possible to set the next line to be executed (which is a common feature) and there's no mention of that here (if that's not possible, debuggers will still have to fallback to tracing to actually handling breakpoints as is the case right now when using the frame evaluation mode).

We will probably need to allow frame.f_lineno = ... to work outside of trace functions, maybe add a special context manager to allow it.

Seems good to me (it'd be good if this was a part of the PEP or if it's addressed in a related PEP -- this is a major feature in the debugger and without this the debugger wouldn't be able to use most of this PEP).

* One nice thing on java is the ability to drop frames and restart the execution with a changed code object (so, for instance, it's possible to change the code and restart the execution of a frame), do you see any way for that to work here?

That is outside the scope of this PEP. It seems a strange feature, is it used much?

In Java land it's used quite a lot to hot-swap code during a debug session, in Python users have to get by by calling custom evaluate functions while in a breakpoint and then setting the frame.f_lineno to try to mimic it if the code wouldn't change much (think of setting a variable and going inside some if), although many times it's not as convenient and there are cases where it's not possible to use that workflow as a replacement... Anyways, I agree, this is a feature outside of this PEP.

@markshannon (Member Author)

Hi Mark,
Sorry for the long comment here... I couldn't really find a way to fit those as separate comments to specific points of the PEP.
So, I currently have mixed feelings regarding this approach... I do see some potential performance wins here, but I think it'll still depend a lot on the actual implementation (so, I'm a bit wary that the abstract starts talking about the current slowdowns when there are currently already alternatives to do something similar with the existing frame evaluation).

Correct me if I'm wrong, but a PEP 523 based debugger still needs to check if the callee function contains a breakpoint on every call, and replace the bytecode which will prevent quickening and specialization.

Well, the code object is cached and then it's set in the frame before the frame executes, but in a way, it's always the same code object which is set in a new frame, so, I'm not sure why that would prevent specialization of said frame/code object (I'm not as familiar with the area as you are, though, so if you say it'll prevent quickening and specialization I believe you).

I hadn't realized that you cached the whole object (although that makes perfect sense), in which case most specializations would still work. However, PEP 523 will prevent specializations across calls.

Also, you lost me a bit... as far as I understood the debugger will still need to instrument things to check and update the monitors on every call so that it can reach all the code objects that'll be executed (there's a fast way out in which nothing is done after a few checks, but that's true for the existing approach too) -- I don't really see any other way (i.e.: how else would the debugger monitor things such as anonymous functions inside of functions?) or are you thinking about a different approach? So, the approach (as far as I can see) is mostly the same and the overhead would also be mostly the same (for line breakpoints)...

Because breakpoints are defined on the code objects, when setting a breakpoint you can do a one time search for the relevant code object instead of checking every call.

E.g.

def f(seq):
    return [
         x * x for x in seq    # Break point here
    ]

You'll need to be able to find the code object for the list comprehension. I assume that you can already do that by collecting line ranges on code objects with an import hook. Once you have the code object, you can find the offsets for the line with co.co_lines() and set the monitor. Once that is done the VM takes care of it.

co = find_code_object(file, line)
offsets = [offset for offset, _, l in co.co_lines() if line == l]
instrumentation.insert_monitors(co, *offsets)
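
As an aside, one way a find_code_object helper could be built on top of an import hook is to record each module's top-level code object per file and then walk its nested code objects via co_consts. A rough sketch, taking the already-recorded module code object rather than a filename:

from types import CodeType

def iter_code_objects(co):
    # Yield a code object and, recursively, every code object nested inside it.
    yield co
    for const in co.co_consts:
        if isinstance(const, CodeType):
            yield from iter_code_objects(const)

def find_code_object(module_code, line):
    # Return the innermost code object whose line range covers the given line.
    found = None
    for co in iter_code_objects(module_code):
        lines = [l for _, _, l in co.co_lines() if l is not None]
        if lines and min(lines) <= line <= max(lines):
            found = co   # nested code objects are yielded after their parents
    return found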

OOI, what's the slowdown of running under the PyDev debugger with a few breakpoints set (but not hit) for 3.9/3.10?

I just did some experiments in this (on 3.9).

Thanks, those numbers aren't as bad as I was expecting, but that's still quite a slowdown compared to normal execution.


fabioz commented Sep 16, 2021

You'll need to be able to find the code object for the list comprehension. I assume that you can already do that by collecting line ranges on code objects with an import hook. Once you have the code object, you can find the offsets for the line with co.co_lines() and set the monitor. Once that is done the VM takes care of it.

co = find_code_object(file, line)
offsets = [offset for offset, _, l in co.co_lines() if line == l]
instrumentation.insert_monitors(co, *offsets)

Well, an import hook isn't enough to collect all code-objects -- it may work on some cases, but I can think of cases where the user creates code to run without using an import -- for instance, he could create it in C/C++ and then execute it and we'd still want to track that or if a breakpoint is set after the import is done... so, the only reasonable way I see it working is checking the code right before it's executed (where some caching would be used to try to discover the fast path where a code object doesn't need to be tracked). Anyways, from the API you're proposing this would still be possible, so, it works for me ;)

As a note, not sure if you've seen, but maybe you're aware of https://mail.python.org/archives/list/python-dev@python.org/thread/CP2PTFCMTK57KM3M3DLJNWGO66R5RVPB/ (where line numbers are a bit odd -- it doesn't affect the tracing but I think it'd affect the instrumentation given those odd line numbers in the code object for the jump instruction).


@zooba zooba left a comment


A few minor comments, plus a big-ish one about threading, which I see coming up relatively frequently for the current tracing system. Would be great to figure out a neat way to deal with it while we're doing something better.


C++ and Java developers expect to be able to run a program at full speed
(or very close to it) under a debugger.
Python developers should expect that too.
Member

For those who think that pdb is the peak of debugging, and don't suffer from any performance issues with that, is there any motivation to support the change?

I'm obviously in agreement that a better approach would be great, just concerned that there may be push back from people who don't see the value when a few lines here about the current state of things (maybe the frequent checks for tracing?) would probably motivate things.


Code objects can be instrumented by calling::

instrumentation.instrument(codeobject, events)
Member

Is this a new module? Why not put it in sys? (Even as sys.instrumentation...)

instrumentation.instrument(codeobject, events)

Code objects must be instrumented before they are executed.
An exception will be raised if the code object has been executed before it
Member

It said earlier that quickening lets us modify "executing" bytecode. Is that not actually the case?

And if so, when do we have a chance to instrument new code objects? Do they all have to be done on creation? Or can we do it on first call?

ncoghlan (Contributor)

One place that should have the chance to instrument called code objects is the event handling in the parent function. I think there's a missing event type for that purpose though: there's currently no event that triggers just before a code object is invoked (receiving the code objects for both the caller and callee, and the offset in the caller).

Functions can be unregistered by calling
``instrumentation.register(event, None)``.

Callback functions can be registered at any time.
Member

Seconding the question about whether they're global/per-interpreter/per-thread/per-...whatever.

Right now, people hit issues, confusion and sometimes ideas due to settrace's single-threadedness (particularly when you start tracing an already-running application and don't have a chance to force your trace into every thread, which is something we also saw regularly building the debugger for VS). The main problem being that tracing on one thread doesn't give you any way to trace - or even pause execution of - the others.

e.g. once you hit a breakpoint in one thread, the first thing you do that releases the GIL is going to let other threads start executing. Having some way to also trigger an event in the context of every thread would let a thread-aware debugger control which of them can keep executing. (Or maybe some other kind of approach makes more sense, that's just an idea that jumps out at me.) It doesn't seem to be essential to this PEP for this to be supported, but it's a good opportunity to add it while we're defining new events.

At the very least, having clear statements about the threading behaviour for what this PEP does add is essential.

When an event occurs the registered function will be called.
The arguments provided are as follows:

* BRANCH: ``func(code: CodeType, offset: int, taken:bool)``
Member

Is the code object an important parameter? The frame is generally more useful (and implicitly includes the code object).

Member

Then again, I guess for a profiler the code object is enough as it's likely maintaining its own call stacks. Debuggers will need to walk the frame even if callers weren't instrumented, and so having them do an extra call to start walking the stack is better than preemptively passing it in.

If that's the case, a sentence explaining it would be nice, so the next reader doesn't have to guess.

fabioz

+1 to the case where the frame is needed for debuggers (i.e.: this PEP should have some note stating the recommended way to get the frame related to the instrumentation notification).

-- as a note, if performance-wise it'd be the same, I'd say that receiving the frame which contains the code object would be better. If there's some performance penalty, then another way to get it would also be reasonable since it's not always needed (but the recommended way to get it should be explicit in the PEP).

ncoghlan (Contributor)

It wouldn't be the same performance wise. The new eval loop mostly avoids creating full python frame objects, so this new API provides an opportunity to retain that performance improvement, whereas the old one loses it (the frame objects have to be created anyway in order to pass them to the installed trace hook).

ncoghlan (Contributor)

I do wonder if it might make sense to pass in a callable that requests the full frame object though. Otherwise hooks would have to use something similar to sys._getframes(), which feels awkward. That said, the callable wouldn't be cheap to create either, so @zooba's suggestion is probably the way to go (i.e. add text to the PEP explaining why the event callback API is the way it is)
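
For reference, a minimal sketch of that workaround; it assumes the instrumented frame is the immediate caller of the callback, which the draft does not guarantee, and inspect_frame is an invented helper:

import sys

def line_callback(code, offset):
    frame = sys._getframe(1)   # assumed: the instrumented frame is one level up
    inspect_frame(frame)       # hypothetical debugger logic: read f_locals, walk f_back, etc.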

Callback function arguments
'''''''''''''''''''''''''''

When an event occurs the registered function will be called.
Member

And events only occur if they were instrumented?

* EXIT: ``func(code: CodeType, offset: int)``
* C_CALL: ``func(code: CodeType, offset: int, value: object)``
* C_RETURN: ``func(code: CodeType, offset: int, value: object)``
* C_EXCEPT: ``func(code: CodeType, offset: int, exception: BaseException)``
fabioz

I believe that both the exception as well as the traceback would be important in the C_EXCEPT, RAISE, EXCEPT as well as the UNWIND cases.

As a note, if the idea is using exception.__traceback__, it'd be nice if that's noted in the PEP... as a note, I'm not sure if exception.__traceback__ is always available -- is it possible that some exception raised -- say some custom exception object from C/C++ doesn't have it? (in which case it'd need to be passed as a parameter?)


Code objects must be instrumented before they are executed.
An exception will be raised if the code object has been executed before it
is instrumented.
fabioz

I think I had misread this PEP... I thought that only by using instrumentation.register(instrumentation.ENTER, profiler_func) I'd start to listen to any function call and instrumentation.register(instrumentation.UNWIND, profiler_func) would make it listen to any unhandled exception, but now, rereading, I'm under the impression that I'd just listen to function calls for code objects previously instrumented and it wouldn't even be possible to change it for code objects already running...

I think that the PEP could make that clearer.

Now, if it's really the case that:

  • there's no callback to listen to any call just prior to a code object/frame being executed
  • there's no way to listen to any events for code objects/frames already running

I'd say I'm -1 on the PEP as it'd just not be usable for many use cases (the happy path where this API would work would be too narrow to be usable)...

For instance, it's very common for users to attach to a running program and this PEP wouldn't support that at all, and it'd be next to impossible to get a hold of all the code objects that'd need to be instrumented before they're actually run, even in the case where the program is started with the debugger in place (an import hook to instrument code objects isn't enough as there are many corner cases where it wouldn't be able to get it).

Now, if I misunderstood it, then I think the PEP should make it clearer that those are in fact supported and which API would be used for that...

ncoghlan (Contributor)

This is the distinction between the low overhead passive monitoring API ('instrumentation' in the current text) and the high overhead dynamic monitoring API ('monitoring' in the current text).

Turning on code coverage or profiling would have to be done early, so could be done dynamically only in the presence of dynamic compilation. But an interactive debugger would use the kinds of hooks that can be added to an existing code object, it wouldn't use the ones that are intended for code coverage and profiling.

@vstinner (Member)

FYI I added https://www.python.org/dev/peps/pep-0670/ and so I skipped the number 669 reserved to this PR ;-)


@ncoghlan ncoghlan left a comment


The general proposal here looks promising to me, but I think we want to try really hard to have the user facing runtime API just refer to "events" and "callbacks", and not require users to understand any new terminology that's specific to the new runtime hook implementation.



This PEP is fully backwards compatible.

We may seek to remove ``sys.settrace`` in the future once the APIs provided
by this PEP have been widely adopted, but that is for another PEP.

It would make sense to specify whether the new callback hooks are invoked before or after the existing trace hooks.



instrumentation.register(event, func)

Functions can be unregistered by calling
``instrumentation.register(event, None)``.

Could something more descriptive than "register" be used as the name here? If I've understood the purpose of the function correctly, then set_event_callback feels like it would be clearer.

Code objects can be instrumented by calling::

instrumentation.instrument(codeobject, events)


Could this function just be called enable_event_callbacks rather than requiring users to know what the term instrument means in the context of this PEP?


* BRANCH: ``func(code: CodeType, offset: int, taken:bool)``
* JUMPBACK: ``func(code: CodeType, offset: int)``
* ENTER: ``func(code: CodeType, offset: int)``
* EXIT: ``func(code: CodeType, offset: int)``

Are these with statement entry and exit or function entry and exit? Either way, it's likely worth making the event name longer to be clear about it.


The following functions are provided to insert monitoring points::

instrumentation.insert_monitors(codeobject, *offsets)

With the adjusted naming I proposed above, the associated names for these APIs would be:

  • insert_dynamic_callbacks
  • remove_dynamic_callbacks
  • disable_dynamic_callback
  • enable_dynamic_callback

The callback registration function monitor_register would instead become set_dynamic_callback. (Although it isn't clear to me why the registration API needs to be different from the regular event hook registration - couldn't DYNAMIC just be another event type for callback registration purposes, even if the methods for adding dynamic hooks are different from those for enabling other event types?)

@markshannon (Member Author)

I just want to say that I haven't abandoned this, but @fabioz's comments about attaching a debugger have forced me to do a redesign. I expect to push the new design early next month.

The new design will use the same approach internally (modifying the quickened bytecode at runtime) for performance, but provide a more event based interface to make it easier to work with.
The VM will be responsible for instrumentation, the user just needs to specify which events they are interested in.

Here are the events in my current whiteboard design:

  1. PY_CALL -- Call of a Python function (actually immediately after the call, the callee's frame will be on the stack)
  2. PY_RESUME -- Resumption of a Python function (for generator and coroutine functions), except for throw() calls.
  3. PY_THROW -- A Python function is resumed by a throw() call.
  4. PY_RETURN -- Return from a Python function
  5. PY_YIELD -- Yield from a Python function
  6. PY_UNWIND -- Exit from a Python function during exception unwinding.
  7. C_CALL -- Call of a builtin function (before the call in this case)
  8. C_RETURN -- Return from a builtin function (after the call in this case)
  9. RAISE -- An exception is raised
  10. HANDLED -- An exception is handled
  11. LINE -- A new line in the program is reached.
  12. INSTRUCTION -- A new VM instruction is about to be executed -- This is outrageously expensive, but does seem to have uses.
  13. BACKEDGE -- An unconditional backedge in the control flow graph is reached.
  14. BRANCH -- A conditional branch is about to be taken (or not).
  15. BREAKPOINT -- A breakpoint is hit

This is sufficient to support all the use cases I am aware of, and will allow us to support sys.settrace and sys.setprofile tracing on top of it.

Feel free to suggest others, if you need them.
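
As a purely illustrative sketch (the registration API for this redesign is not specified above, so the module name, registration call, and callback signature are placeholders), a coverage tool might consume just the LINE event:

covered = set()   # (filename, line) pairs seen at runtime

def on_line(code, line):
    covered.add((code.co_filename, line))

# Hypothetical registration, by analogy with the earlier draft API:
# monitoring.register(monitoring.LINE, on_line)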

@encukou (Member)

encukou commented Nov 24, 2021

@nedbat, you might want to take a look at the list: #2070 (comment)

@markshannon (Member Author)

Closing in favor of #2183 which describes the new design.

Thanks everyone for the feedback so far.

@markshannon markshannon deleted the high-performance-debugging branch March 23, 2022 12:17