Add `PyUnicode_Equal()` function #43

vstinner · 2024-09-25T14:58:15Z

I propose to add a public PyUnicode_Equal(a, b) function to the limited C API 3.14 to replace the private _PyUnicode_EQ() function:

API: int PyUnicode_Equal(PyObject *a, PyObject *b)

Return 1 if a is equal to b.
Return 0 if a is not equal to b.
Set a TypeError exception and return -1 if a or b is not a Python str object.
The function always succeed if a and b are strings.

Python 3.13 moved the private _PyUnicode_EQ() function to internal C API. mypy and Pyodide are using it.

The existing PyUnicode_Compare() isn't enough and has an issue. PyUnicode_Compare() returns -1 for "less than" but also for the error case. The caller must call PyErr_Occurred() which is inefficient. It causes an ambiguous return value: capi-workgroup/problems#1

PyUnicode_Equal() has no such ambiguous return value (-1 only means error). Moreover, PyUnicode_Equal() may be a little bit faster than PyUnicode_Compare(), but I'm not sure about that.

Vote to add this API:

The text was updated successfully, but these errors were encountered:

zooba · 2024-09-25T15:12:06Z

API shape seems fine, I prefer PyUnicode_IsEqual as a name.

I find the performance argument fairly compelling, since ordering requires a lot of calculation (and realistically requires a number of options, including locale, to be correct). Having a fast equals method is likely to prevent people from relying on interning strings and comparing pointers.

However, if the motivation really is that PyUnicode_Compare has problems, then shouldn't we be creating a better compare function? One that can take an "equals only" flag to be quick, but can also gain more flags (e.g. case insensitive, treat numbers logically, locale-aware ordering, etc.).

Knowing whether the people using the private function were looking specifically for speed or just using the first function their IDE told them about would help us make an informed decision here. I don't like when the proposal doesn't actually solve the problem given as the motivation (but am happy to change the motivation in this case).

zooba · 2024-09-25T15:14:36Z

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative (perhaps paired with PyUnicode_Check if the caller really doesn't know what types they've got, but almost certainly in most cases where they have enough calls to justify a performance optimisation they're going to be responsible for having created the values, and will know that they're strings already).

ZeroIntensity · 2024-09-25T15:40:20Z

Not that my opinion matters much here, but I might as well share my thoughts: I think a faster comparison protocol is a great idea (and one I'd be happy to help work on), but probably better suited for api-revolution rather than this proposal. PyUnicode_Equal/PyUnicode_IsEqual is simply a convenient drop in for extensions that were relying on the private API, nothing more, nothing less.

Though, maybe it's going a bit far to add this to the stable ABI -- the main beneficiaries from this are people who weren't using the limited API anyway, so really we're just adding another function to do the same thing, but with less version compatibility than PyUnicode_Compare. Why not mark this as an unstable API? That way, if we do add a better comparison interface in the future, then we won't have to continue maintaining PyUnicode_Equal.

encukou · 2024-09-25T16:45:23Z

Both projects use it for handling keyword arguments. I expected either that or attribute names (e.g. implementing descriptor-like behaviour in tp_getattro).
IMO, that's a better fit for the C API than, say, PyLong_IsNegative, where the uses are more application-logic-y.

I think locale-aware comparisons have no place in the PyUnicode_* namespace. At this level, strings are sequences Unicode codepoints.
(FWIW, locale-aware equality involves a lot of computation too. If you want 'é' == 'é' to be True, you should not use a PyUnicode_* function.)

+1 for PyUnicode_Equal.
The docs should mention that the function works for str subclasses, but does not honor custom __eq__/tp_richcompare. (We don't have a lot of this kind of info in the docs, but we should start adding it.)

vstinner · 2024-09-25T17:51:25Z

API shape seems fine, I prefer PyUnicode_IsEqual as a name.

As naming, PyUnicode_EqualToUTF8() function was added to Python 3.13. You might want to use a similar name. But I don't have a strong opinion, English is not my first language :-)

I find the performance argument fairly compelling

I ran a benchmark on strings of 10 characters:

PyUnicode_Equal: Mean +- std dev: 4.57 ns +- 0.11 ns
PyUnicode_Compare: Mean +- std dev: 5.87 ns +- 0.13 ns
PyUnicode_RichCompare: Mean +- std dev: 6.96 ns +- 0.22 ns

PyUnicode_Equal() is the fastest function: 1.3x faster than PyUnicode_Compare(). But we're talking about nanoseconds here: it's only 1.3 ns faster than PyUnicode_Compare().

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative

It returns PyObject* which is less convenient, requires calling Py_DECREF(), and is less efficient.

vstinner added the vote label Sep 25, 2024

This was referenced Sep 25, 2024

[C API] Add PyUnicode_Equal() function python/cpython#124502

Open

gh-124502: Add PyUnicode_Equal() function python/cpython#124504

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `PyUnicode_Equal()` function #43

Add `PyUnicode_Equal()` function #43

vstinner commented Sep 25, 2024 •

edited by encukou

Loading

zooba commented Sep 25, 2024

zooba commented Sep 25, 2024

ZeroIntensity commented Sep 25, 2024

encukou commented Sep 25, 2024

vstinner commented Sep 25, 2024

Add PyUnicode_Equal() function #43

Add PyUnicode_Equal() function #43

Comments

vstinner commented Sep 25, 2024 • edited by encukou Loading

zooba commented Sep 25, 2024

zooba commented Sep 25, 2024

ZeroIntensity commented Sep 25, 2024

encukou commented Sep 25, 2024

vstinner commented Sep 25, 2024

Add `PyUnicode_Equal()` function #43

Add `PyUnicode_Equal()` function #43

vstinner commented Sep 25, 2024 •

edited by encukou

Loading