
gh-107137: Add _PyTuple_NewNoTrack() internal C API #107183

Closed

Conversation

vstinner
Member

@vstinner vstinner commented Jul 24, 2023

Fix a crash in PySequence_Tuple() when the tuple being initialized is accessed via GC introspection, such as the gc.get_referrers() function. The tuple is now only tracked by the GC once it is fully initialized.

Add _PyTuple_NewNoTrack() and _PyTuple_ResizeNoTrack() functions to the internal C API: they are similar to PyTuple_New() and _PyTuple_Resize(), but the tuple is not tracked by the GC.

Modify PySequence_Tuple() to use these functions.
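For illustration, here is a minimal sketch of the pattern these functions enable, assuming _PyTuple_NewNoTrack() mirrors PyTuple_New()'s signature and is declared in the internal headers; the helper name is made up and this is not the PR's actual diff:

```c
#include <Python.h>
#include "pycore_tuple.h"   /* internal header; assumed to declare _PyTuple_NewNoTrack() */

/* Build a tuple that stays invisible to the GC until every slot is filled,
   then hand it over with PyObject_GC_Track(). */
static PyObject *
build_tuple_untracked(PyObject *const *items, Py_ssize_t n)
{
    PyObject *result = _PyTuple_NewNoTrack(n);   /* proposed internal function */
    if (result == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        PyTuple_SET_ITEM(result, i, Py_NewRef(items[i]));   /* steals the new reference */
    }
    PyObject_GC_Track(result);   /* only now can gc.get_referrers() see the tuple */
    return result;
}
```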

@vstinner
Member Author

cc @corona10 @gvanrossum

@vstinner
Member Author

The gc.get_referrers() documentation has a warning, added in 2003 by issue #39117:

Care must be taken when using objects returned by get_referrers() because some of them could still be under construction and hence in a temporarily invalid state. Avoid using get_referrers() for any purpose other than debugging.

@vstinner
Member Author

vstinner commented Jul 24, 2023

@pablogsal: in 2021, you wrote:

I don't know if is worth to add a new API just for tuples, given that this problem happens with many other objects

Are you still against fixing the GC bug in PySequence_Tuple()?

Note: I'm only proposing to add functions to the internal C API. Making such an API public would deserve its own discussion. Also, I proposed a different API to fix the issue, an internal "tuple builder" API: PR #107139.

@vstinner vstinner marked this pull request as draft July 24, 2023 17:26
@vstinner
Member Author

I marked this PR as a draft since, in 2021, issue #59313 was closed as "not a bug" by @pablogsal, who wrote:

I mainly agree on closing this issue as we already have a warning about this behaviour in gc.get_referrers and other friends.

On the other hand I am still a bit afraid of a crash that could happen if the GC does a pass and one of these tuples is in an inconsistent state. Is possible that this crashes Python without the user calling ever any function from the GC module. I need to do some investigation around this, but I agree that this is a separate issue, so I will open a new one if it happens to be relevant.

@gvanrossum
Member

This leaves me worried. The issue about the crash due to incomplete tuples (gh-59313) was closed, basically telling people that this is an acceptable risk, and that the risk comes from the existence of gc.get_referrers() (see also gh-39117).

It seems that using PyTuple_New() with a nonzero size argument is always unsafe, because the tuple is immediately tracked by GC.

An alternative way to deal with this is to not call _PyObject_GC_TRACK in PyTuple_New(), and document to users that if they create a tuple that contains mutable objects they should call PyObject_GC_Track() after fully initializing the tuple.

Introducing a new internal API, _PyTuple_NewNoTrack(), does nothing for extensions calling PyTuple_New().

I feel we have two choices here:

  • Either we remove the _PyObject_GC_TRACK() call from PyTuple_New() and document that users must call PyObject_GC_Track() or risk that the tuple becomes part of an immortal cycle.
  • Or we accept that the problem is with gc.get_referrers() and we refuse to fix it.

I don't feel that just fixing tuple(<iterator that calls gc.get_referrers()>) while not doing anything about the public API (PyTuple_New()) is an effective solution.

(This discussion should probably go into an issue, but the issues are all closed.)

@vstinner
Member Author

It seems that using PyTuple_New() with a nonzero size argument is always unsafe, because the tuple is immediately tracked by GC.

Correct

An alternative way to deal with this is to not call _PyObject_GC_TRACK in PyTuple_New(),

It sounds like a backward incompatible change. Such a subtle change reminds me of when instances of heap types had to start calling Py_DECREF(Py_TYPE(self)) in their tp_dealloc function, implementing tp_traverse and tp_clear, visiting the type in tp_traverse, and clearing the type in tp_clear :-(
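As a reminder of what that heap-type boilerplate looks like, here is a rough illustrative sketch (not part of this PR; the function names are made up):

```c
#include <Python.h>

/* Instances of heap types must visit their type in tp_traverse, and
   tp_dealloc must drop the reference the instance holds on its type. */
static int
obj_traverse(PyObject *self, visitproc visit, void *arg)
{
    Py_VISIT(Py_TYPE(self));        /* visit the heap type itself */
    return 0;
}

static void
obj_dealloc(PyObject *self)
{
    PyTypeObject *tp = Py_TYPE(self);
    PyObject_GC_UnTrack(self);       /* stop GC tracking before the object becomes invalid */
    tp->tp_free(self);
    Py_DECREF(tp);                   /* release the reference on the heap type */
}
```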

I'm not saying that so many changes are needed. My worry is that such an incompatible change can be silently ignored, and "suddenly" people will get new GC problems that they didn't have before, since their tuples used to be tracked by the GC.

Not tracking tuples by the GC sounds worse than the risk of a race condition in code calling gc.get_referrers().

Introducing a new internal API, _PyTuple_NewNoTrack(), does nothing for extensions calling PyTuple_New().

That's why I have a bigger plan: provide a new API to replace PyTuple_New() + PyTuple_SET_ITEM(): a "tuple builder" API, issue #107137.

The migration plan doesn't imply removing the old API immediately. It can be slow. Moreover, at this stage, I only propose an internal C API, to see how it goes in Python internals first, how the API is used, and whether it actually helps us fix the race conditions.

Or we accept that the problem is with gc.get_referrers() and we refuse to fix it.

Sure, that's also an option. But even if we decide not to provide a solution for 3rd party C extensions, does that mean we should not fix known bugs in Python internals when there is a simple solution?

(This discussion should probably go into an issue, but the issues are all closed.)

I created issue #107137. These days, most discussions happen on pull requests, issues are just created to get "an issue number" for the Changelog and What's New entries :-)

I don't fully understand the rationale against fixing PySequence_Tuple(). Do you mean that my internal API is too complicated to use?

@gvanrossum
Member

I don't fully understand the rationale against fixing PySequence_Tuple(). Do you mean that my internal API is too complicated to use?

APIs, even internal ones, are forever. I would like you to ask other core devs to chime in here rather than it just being you and me. After all, the problem has existed for decades, so why hurry now?

@vstinner
Member Author

APIs, even internal ones, are forever.

The difference is that we don't provide any backward compatibility guarantee for the internal C API. We have greater freedom to change it (rename functions, add parameters, remove functions, etc.).

I would like you to ask other core devs to chime in here rather than it just being you and me. After all the problem has existed for decades, so why hurry now?

Sure, I would like to hear more voices. I pinged multiple core devs on the issue, on my first PR and on this second PR. I only created this issue very recently, and other people seem to be busy (EuroPython just ended ;-)).

What makes you think that there is a hurry?

@gvanrossum
Member

What makes you think that there is a hurry?

Because you have a pattern of making changes, requesting a review (or not!), and merging before you actually get a review. See e.g. gh-106871.

@davidhewitt
Contributor

From PyO3's perspective, if you want us to replace PyTuple_New with something which doesn't risk GC crashes, then making this stable with a name like PyTuple_UnsafeNewNoTrack (if we want "unsafe" to hint that the result still needs to be tracked by the GC afterwards) would be helpful.

PyO3 can then use either PyTuple_SetItem or PyTuple_SET_ITEM depending on whether users are building for the stable API or the version-specific API.

@vstinner
Member Author

@gvanrossum:

Because you have a pattern of making changes, requesting a review (or not!), and merging before you actually get a review. See e.g. #106871.

Well. I created this PR to get reviews. 3 weeks later: no review. I pinged 3 core devs (including you). I tried to explain my problem with reviews a few times:

How can I get a review on this internal C API to fix a GC crash? :-)

@gvanrossum
Member

3 weeks later: no review.

Not true. I wrote two separate review comments, stating concerns about API bloat (even internal) and pointing out that the issue was closed as "won't fix" for a good (IMO) reason. What more do you want? We have one core dev (me) still against and nobody else in favor. In that case the status quo wins.

@vstinner
Member Author

Since issue #59313 was closed as "won't fix" two years ago, the C API has been discussed and a working group is trying to improve it and address C API problems. I'm proposing a concrete solution to one of these problems. I wrote a concrete PR to discuss the problem and my specific solution.

As explained in the issue related to this PR, issue #107137, the problem is that the C API allows creating incomplete and inconsistent Python objects: capi-workgroup/problems#56. We can incrementally fix the C API to reduce this problem and design a migration path to a safer API which doesn't allow that. Fixing the problem in the internal C API is a first step, and IMO it's worth it even if the work stops here (since it does fix a concrete crash).

What more do you want?

Well, it would be nice to hear other voices. It seems like @davidhewitt is interested in a fix for this C API issue for PyO3.

@gvanrossum
Member

It might also make sense to just wait until the core dev sprint in Brno, where the future of the C API will be discussed.

@nascheme
Member

FYI, I don't think the problem is limited to gc.get_referrers(). If the GC "sees" an object with an invalid pointer in it, it could crash. E.g. it calls tp_traverse() and that follows a NULL or invalid pointer. My guess is that PyTuple_New() is mostly okay since the GC doesn't get a chance to run until the tuple becomes valid. The original design intention of PyObject_GC_Track() and PyObject_GC_UnTrack() is that you only reveal objects to the GC when they are valid and you untrack them before they become invalid.

@gvanrossum
Member

But for tuple objects, tp_traverse is initialized with tupletraverse, which uses the standard Py_VISIT macro, which skips NULL. I assume that other types that leave NULL values after construction use this too. I don't think we have to worry about other garbage values, so I don't think there's any risk from tp_traverse.
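For illustration, a minimal traverse sketch modeled on that pattern (not the actual tupletraverse source):

```c
#include <Python.h>

/* Py_VISIT only calls the visit function for non-NULL pointers, so the
   NULL slots left behind by an unfinished PyTuple_New() are skipped. */
static int
items_traverse(PyObject *self, visitproc visit, void *arg)
{
    Py_ssize_t n = PyTuple_GET_SIZE(self);
    for (Py_ssize_t i = 0; i < n; i++) {
        Py_VISIT(PyTuple_GET_ITEM(self, i));
    }
    return 0;
}
```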

My guess is that PyTuple_New() is mostly okay since the GC doesn't get a chance to run until the tuple becomes valid.

This I'm not so sure about -- IIUC the bug reported in gh-59313 shows how to get the GC to run (note it was closed because it was deemed an acceptable risk, not because it was fixed). FWIW, this PR should reference that issue in its title, not gh-107137 (which is about adding a different tuple building API).

@davidhewitt
Contributor

davidhewitt commented Aug 15, 2023

While I'm not aware of crashes in the wild caused by this in PyO3, the issue in #59313 probably does affect Rust code using PyO3. We allow tuples to be built from Rust iterators, and users have the flexibility to run arbitrary Python code while the uninitialized tuple is under construction. We could work around this by first collecting the iterator into a Rust vector of PyObject*, but from some measurements that was a 20-50% slowdown.

If the fix to avoid the crash while keeping current performance is "just" exposing a new API which is PyTuple_New but where users have to manually call PyObject_GC_Track once the tuple is fully initialised, that'd be extremely easy for PyO3 to migrate to. So my ask is that if you merge this PR, please add this to the public API.

An alternative way to deal with this is to not call _PyObject_GC_TRACK in PyTuple_New(),

I think that's not feasible because stable API extensions built before this change will start producing tuples not tracked by GC, which sounds bad to me.

@markshannon
Member

As @vstinner said in capi-workgroup/problems#56, creating incomplete tuples is unsafe.

Rather than adding another unsafe API, would it not be better to promote the safe ones and deprecate the unsafe ones?

For smaller tuples, PyTuple_Pack() is reasonably efficient, and _PyTuple_FromArray() more so.
We should probably make PyTuple_FromArray() public.

For larger tuples, the safe thing to do is build a list, then call PyList_AsTuple().
If that is inefficient, we should fix that, rather than playing with broken tuples.
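For illustration, a minimal sketch of that list-then-convert route, using only public C API calls (the helper name is made up):

```c
#include <Python.h>

/* Build a tuple from an arbitrary iterable without ever exposing a
   partially initialized tuple: fill a list (which is always in a valid
   state), then convert it in one step. */
static PyObject *
tuple_from_iterable(PyObject *iterable)
{
    PyObject *list = PySequence_List(iterable);  /* may run arbitrary Python code */
    if (list == NULL) {
        return NULL;
    }
    PyObject *result = PyList_AsTuple(list);
    Py_DECREF(list);
    return result;
}
```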

@erlend-aasland
Contributor

For smaller tuples, PyTuple_Pack() is reasonably efficient, and _PyTuple_FromArray() more so. We should probably make PyTuple_FromArray() public.

I'd be in favour of making _PyTuple_FromArray() public. It is a neatly implemented API, and it has been used with great success in the code base. Also, I think Raymond's idea of filling newly created tuples with Py_None should be considered now that None is immortal.
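A rough sketch of how that pre-filling idea could look at a call site (illustrative only; the helper name is made up, and the actual proposal would be for PyTuple_New() itself to do the filling so the tuple is never observable with NULL slots):

```c
#include <Python.h>

/* Pre-fill every slot with None so the GC never sees NULL items; real
   values can then overwrite the placeholders. */
static PyObject *
new_tuple_prefilled(Py_ssize_t n)
{
    PyObject *t = PyTuple_New(n);
    if (t == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        /* Py_NewRef(Py_None) keeps refcounting correct even on versions
           where None is not immortal. */
        PyTuple_SET_ITEM(t, i, Py_NewRef(Py_None));
    }
    return t;
}
```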

@corona10
Member

I'd be in favour of making _PyTuple_FromArray() public. It is a neatly implemented API, and it has been used with great success in the code base.

Looks good to me too, because I am not sure people want to choose between APIs based on the number of elements.
For example, recommending one API for small tuples and another API for other scenarios doesn't seem good to me in most cases.
PyTuple_FromArray would be a good single way to create a tuple in most cases.

@vstinner
Member Author

@markshannon:

Rather than adding another unsafe API, would it not be better to promote the safe ones and deprecate the unsafe ones?

This change is related to issue #107137 which proposes adding a new safe API. But for the implementation, I need to add internal unsafe functions, right.

@vstinner
Member Author

Right, initializing tuple items to None would prevent crashes on repr(tuple) when devs play with GC introspection functions.

But my concern is more general than that. For me, PyTuple_SET_ITEM() is too different from all other C API functions: it doesn't clear the reference to the old item, since it makes the assumption that the function is only called on a newly created tuple with uninitialized items.

PyTuple_SET_ITEM() doc:

Like PyTuple_SetItem(), but does no error checking, and should only be used to fill in brand new tuples.

I would prefer to move towards an API where it's impossible to create uninitialized/incomplete objects, so that a function like PyTuple_SET_ITEM() would make no sense. Reference counting would be more regular and the API would be less error-prone.

PyTuple_SET_ITEM(tuple, index, item) is really just tuple->ob_item[index] = item;. It doesn't call Py_DECREF(tuple->ob_item[index]) (or Py_XDECREF()).

PyTuple_SetItem() is more regular: it clears the old item before setting the new item using Py_XSETREF().
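For readers less familiar with the two setters, here is a small illustrative sketch of the difference (the helper name is made up):

```c
#include <Python.h>

/* Contrast the two setters on a fresh one-element tuple. */
static PyObject *
setter_demo(void)
{
    PyObject *t = PyTuple_New(1);
    if (t == NULL) {
        return NULL;
    }
    /* PyTuple_SET_ITEM() steals the reference and blindly writes the slot;
       the old value is neither checked nor released, which is only safe
       because a new tuple's slots are known to be uninitialized (NULL). */
    PyTuple_SET_ITEM(t, 0, PyLong_FromLong(1));

    /* PyTuple_SetItem() also steals the reference, but it validates its
       arguments and releases whatever the slot held before (the "1"). */
    if (PyTuple_SetItem(t, 0, PyLong_FromLong(2)) < 0) {
        Py_DECREF(t);
        return NULL;
    }
    return t;
}
```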


On the implementation side, because the PyTuple_New() API exists, most Objects/tupleobject.c functions have to handle the case of NULL items. For example, PyTuple_SetItem() uses Py_XSETREF() instead of Py_SETREF(), and tupledealloc() uses Py_XDECREF() instead of Py_DECREF(). For me, it's weird to have to care about partially initialized tuples just for the uncommon code path (tuple partially initialized, and oops, an error occurred).

Other APIs like PyTuple_Pack() or the proposed "tuple builder" API don't have this problem: if we manage to remove PyTuple_New() in the long term, it will be possible to clean up Objects/tupleobject.c to remove the NULL-items handling.

@vstinner
Member Author

By the way, this change also fixes Lib/test/crashers/gc_inspection.py, which could then be removed.

@vstinner
Member Author

I'm closing my issue, see: #107139 (comment)

@vstinner vstinner closed this Aug 26, 2023
@vstinner vstinner reopened this Aug 26, 2023
@vstinner
Member Author

Oops, wrong window: I wanted to close issue #107137.

@vstinner
Member Author

Apparently, I failed to convince people that _PyTuple_NewNoTrack() is the right fix for the issue, and some people consider that this crash is not important enough to be fixed, so I'm just closing my PR.

@vstinner vstinner closed this Sep 13, 2023
@vstinner vstinner deleted the tuple_new_notrack branch September 13, 2023 16:13
@markshannon
Member

Could you open an issue for the crash?
The issue for the crash shouldn't be tied to a new API.
We should definitely fix the crash.

@vstinner
Member Author

Could you open an issue for the crash?

There was an issue closed by @pablogsal, and @rhettinger was also against it. I asked @pablogsal if he could reconsider his decision but he didn't reply. So I gave up. If you feel ready to reconsider this design choice, please go ahead and open a new issue :-) I'm done on this topic for now ;-)

@markshannon
Member

I think we should fix this. It is a crash in legal Python code.
#59313 (comment)

I don't think we need a new API; we need to fix PySequence_Tuple().

@vstinner
Member Author

I don't think we need a new API; we need to fix PySequence_Tuple().

My PR only added an internal C API. Is it now wrong to add internal C API functions?

@markshannon
Member

Is it now wrong to add internal C API functions?

No, but I don't think this is the way to go. _PyTuple_NewNoTrack might be less bug-prone than PyTuple_New, but it is still bug-prone.
We should be creating tuples all at once, so we don't have to worry about how to handle invalid tuples.
PyTuple_FromArray and PyTuple_FromList are the way to go.

@markshannon
Member

And PyTuple_Pack.

If the loss of efficiency in refcount handling is an obstacle to adoption, we could add variants that consume the references.

@markshannon
Member

See #117747
