Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PyUnicode_Equal() function #43

Open
2 of 6 tasks
vstinner opened this issue Sep 25, 2024 · 5 comments
Open
2 of 6 tasks

Add PyUnicode_Equal() function #43

vstinner opened this issue Sep 25, 2024 · 5 comments
Labels

Comments

@vstinner
Copy link

vstinner commented Sep 25, 2024

I propose to add a public PyUnicode_Equal(a, b) function to the limited C API 3.14 to replace the private _PyUnicode_EQ() function:

API: int PyUnicode_Equal(PyObject *a, PyObject *b)

  • Return 1 if a is equal to b.
  • Return 0 if a is not equal to b.
  • Set a TypeError exception and return -1 if a or b is not a Python str object.
  • The function always succeed if a and b are strings.

Python 3.13 moved the private _PyUnicode_EQ() function to internal C API. mypy and Pyodide are using it.

The existing PyUnicode_Compare() isn't enough and has an issue. PyUnicode_Compare() returns -1 for "less than" but also for the error case. The caller must call PyErr_Occurred() which is inefficient. It causes an ambiguous return value: capi-workgroup/problems#1

PyUnicode_Equal() has no such ambiguous return value (-1 only means error). Moreover, PyUnicode_Equal() may be a little bit faster than PyUnicode_Compare(), but I'm not sure about that.


Vote to add this API:

@zooba
Copy link

zooba commented Sep 25, 2024

API shape seems fine, I prefer PyUnicode_IsEqual as a name.

I find the performance argument fairly compelling, since ordering requires a lot of calculation (and realistically requires a number of options, including locale, to be correct). Having a fast equals method is likely to prevent people from relying on interning strings and comparing pointers.

However, if the motivation really is that PyUnicode_Compare has problems, then shouldn't we be creating a better compare function? One that can take an "equals only" flag to be quick, but can also gain more flags (e.g. case insensitive, treat numbers logically, locale-aware ordering, etc.).

Knowing whether the people using the private function were looking specifically for speed or just using the first function their IDE told them about would help us make an informed decision here. I don't like when the proposal doesn't actually solve the problem given as the motivation (but am happy to change the motivation in this case).

@zooba
Copy link

zooba commented Sep 25, 2024

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative (perhaps paired with PyUnicode_Check if the caller really doesn't know what types they've got, but almost certainly in most cases where they have enough calls to justify a performance optimisation they're going to be responsible for having created the values, and will know that they're strings already).

@ZeroIntensity
Copy link

Not that my opinion matters much here, but I might as well share my thoughts: I think a faster comparison protocol is a great idea (and one I'd be happy to help work on), but probably better suited for api-revolution rather than this proposal. PyUnicode_Equal/PyUnicode_IsEqual is simply a convenient drop in for extensions that were relying on the private API, nothing more, nothing less.

Though, maybe it's going a bit far to add this to the stable ABI -- the main beneficiaries from this are people who weren't using the limited API anyway, so really we're just adding another function to do the same thing, but with less version compatibility than PyUnicode_Compare. Why not mark this as an unstable API? That way, if we do add a better comparison interface in the future, then we won't have to continue maintaining PyUnicode_Equal.

@encukou
Copy link
Collaborator

encukou commented Sep 25, 2024

Both projects use it for handling keyword arguments. I expected either that or attribute names (e.g. implementing descriptor-like behaviour in tp_getattro).
IMO, that's a better fit for the C API than, say, PyLong_IsNegative, where the uses are more application-logic-y.

I think locale-aware comparisons have no place in the PyUnicode_* namespace. At this level, strings are sequences Unicode codepoints.
(FWIW, locale-aware equality involves a lot of computation too. If you want 'é' == 'é' to be True, you should not use a PyUnicode_* function.)

+1 for PyUnicode_Equal.
The docs should mention that the function works for str subclasses, but does not honor custom __eq__/tp_richcompare. (We don't have a lot of this kind of info in the docs, but we should start adding it.)

@vstinner
Copy link
Author

API shape seems fine, I prefer PyUnicode_IsEqual as a name.

As naming, PyUnicode_EqualToUTF8() function was added to Python 3.13. You might want to use a similar name. But I don't have a strong opinion, English is not my first language :-)

I find the performance argument fairly compelling

I ran a benchmark on strings of 10 characters:

  • PyUnicode_Equal: Mean +- std dev: 4.57 ns +- 0.11 ns
  • PyUnicode_Compare: Mean +- std dev: 5.87 ns +- 0.13 ns
  • PyUnicode_RichCompare: Mean +- std dev: 6.96 ns +- 0.22 ns

PyUnicode_Equal() is the fastest function: 1.3x faster than PyUnicode_Compare(). But we're talking about nanoseconds here: it's only 1.3 ns faster than PyUnicode_Compare().

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative

It returns PyObject* which is less convenient, requires calling Py_DECREF(), and is less efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants