gh-111089: Add PyUnicode_AsUTF8NoNUL() function #111688

vstinner · 2023-11-03T10:04:32Z

Revert the PyUnicode_AsUTF8() change: it no longer rejects embedded null characters. The PyUnicode_AsUTF8NoNUL() function should be used instead.

Issue: [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

📚 Documentation preview 📚: https://cpython-previews--111688.org.readthedocs.build/

vstinner · 2023-11-03T10:09:27Z

I'm not sure about the PyUnicode_AsUTF8Safe() name: "safe" might give a feeling of security which can be wrong. As written in the function documentation, it doesn't check for "dangerous Unicode characters". It implements a single check: it only tests whether the string contains null characters.

For example, if the string is used to open a file, the function doesn't reject the ../ substring, which is commonly used in directory traversal attacks. Further sanitization and validation are required depending on the domain where the string is used.
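
The scope of that single check can be sketched in plain C, without the CPython API. This is only an illustration under stated assumptions: `has_embedded_nul` and `check_examples` are hypothetical names, not functions added by this PR, and the sketch assumes the caller already holds the UTF-8 bytes together with their length in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if buf[0..size) contains a NUL byte.
   This is the only property checked; no other validation is done. */
static bool
has_embedded_nul(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Small self-check illustrating what the check does and does not catch. */
void
check_examples(void)
{
    /* An embedded NUL is detected. */
    assert(has_embedded_nul("abc\0def", 7));

    /* A directory traversal path contains no NUL, so it passes. */
    assert(!has_embedded_nul("../etc/passwd", 13));
}
```

The second assertion is the point of the naming concern: a string can pass the NUL check and still be unsuitable for the domain where it is used.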

Maybe the name should be more explicit about its purpose, such as PyUnicode_AsUTF8NoNul()? Could such a name be misunderstood as "the result cannot be NULL", that is, "the function cannot fail"? Or maybe PyUnicode_AsUTF8NoNulChar()?

AlexWaygood · 2023-11-03T10:11:27Z

the clinic.py changes look fine to me; I have no opinion on the C changes :)

erlend-aasland · 2023-11-03T11:49:41Z

Clinic changes are ok with me. I'll leave the C API discussion to the C API WG :)

vstinner · 2023-11-03T11:50:07Z

If this change is merged, it would be interesting to go through PyUnicode_AsUTF8() usage in the Python code base and check if switching to PyUnicode_AsUTF8Safe() would be worth it.

See the list in #111656 (comment), for example.

vstinner · 2023-11-03T11:53:32Z

> I'll leave the C API discussion to the C API WG :)

The working group doesn't exist yet; it's still a draft PEP: https://discuss.python.org/t/pep-731-c-api-working-group-charter/36117

vstinner · 2023-11-03T17:32:16Z

@encukou @gpshead @Yhg1s @serhiy-storchaka: Would you mind reviewing this change?

serhiy-storchaka · 2023-11-03T18:24:43Z

Are there strong objections against simply making PyUnicode_AsUTF8AndSize(unicode, NULL) check for embedded NULs?

serhiy-storchaka · 2023-11-03T18:29:54Z

Tools/clinic/clinic.py

@@ -4350,7 +4350,7 @@ def parse_arg(self, argname: str, displayname: str, *, limited_capi: bool) -> st
                    {bad_argument}
                    goto exit;
                }}}}
-                {paramname} = PyUnicode_AsUTF8({argname});
+                {paramname} = PyUnicode_AsUTF8Safe({argname});


There should be different code when limited_capi is true.

Oh. I wanted to add PyUnicode_AsUTF8Safe() to the limited C API in a separate PR. Maybe it's more convenient to do it in this PR, since a lot of code generated by Argument Clinic is impacted by these changes.

vstinner · 2023-11-03T19:58:13Z

> Are there strong objections against simply making PyUnicode_AsUTF8AndSize(unicode, NULL) check for embedded NULs?

I suppose that it's too late to change it, for the same reason that it's too late to change PyUnicode_AsUTF8().

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) is the most convenient API of the stable ABI to convert a Python str to char* (as UTF-8) without storing the size.

vstinner · 2023-11-04T21:32:48Z

I rebased the PR on the main branch to get the test_asyncio fix. I also squashed commits.

encukou · 2023-11-06T12:52:12Z

Wait until the WG can discuss this.

I agree that Safe is too generic.

I wouldn't mind naming it UTF8Z, with Z used for “zero-terminated string without NUL characters” for all new API from now on.

vstinner · 2023-11-06T14:08:36Z

If "Safe" is too generic, what about "AsUTF8NoNul"? I propose "Nul" instead of "Null" which looks like "NULL pointer". Or maybe "AsUTF8NoNullChars"?

"Z" looks too short to me, it's not easy to guess its intent.

vstinner · 2023-11-07T10:50:54Z

> If "Safe" is too generic, what about "AsUTF8NoNul"? I propose "Nul" instead of "Null" which looks like "NULL pointer". Or maybe "AsUTF8NoNullChars"?

Alternative short name: PyUnicode_AsUTF8NoNUL() since it's common to refer to "null characters" as NUL. It's commonly used in encoding tables such as the ASCII table. Example on Wikipedia: ASCII: Character Set.

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.

vstinner · 2023-11-07T11:02:20Z

@serhiy-storchaka @erlend-aasland: I renamed the function to PyUnicode_AsUTF8NoNUL(). Are you fine with this name?

erlend-aasland · 2023-11-07T11:15:28Z

> @serhiy-storchaka @erlend-aasland: I renamed the function to PyUnicode_AsUTF8NoNUL(). Are you fine with this name?

I didn't follow the discussion. My first impression is that I don't understand what that API is supposed to do, based on the name only.

serhiy-storchaka · 2023-11-07T11:27:51Z

I am fine with adding the embedded null character check to PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize(). If you ask for a char pointer, but have no way to get the size, you are asking for a null-terminated C string. If the Python string contains embedded nulls, the result is ambiguous, so it must raise an exception.

If people are not fine with this, ask them what they would be fine with.
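
The ambiguity argument can be sketched in plain C, without the CPython API. This is a minimal illustration under stated assumptions, not the implementation proposed in the PR: `as_c_string_checked` and `show_ambiguity` are hypothetical names, and the sketch assumes the caller holds a NUL-terminated UTF-8 buffer together with its length in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return buf if it can be used as a NUL-terminated
   C string without losing data, otherwise NULL.
   Assumes buf[size] == '\0', i.e. the buffer itself is NUL-terminated. */
static const char *
as_c_string_checked(const char *buf, size_t size)
{
    if (strlen(buf) != size) {
        return NULL;   /* embedded NUL: strlen() stopped early */
    }
    return buf;
}

void
show_ambiguity(void)
{
    const char with_nul[] = "abc\0def";   /* 7 bytes of data */
    const char plain[] = "abc";           /* 3 bytes of data */

    /* Through a bare char *, both are seen as the same string "abc". */
    assert(strcmp(with_nul, plain) == 0);

    /* The size-aware check tells them apart. */
    assert(as_c_string_checked(with_nul, sizeof(with_nul) - 1) == NULL);
    assert(as_c_string_checked(plain, sizeof(plain) - 1) == plain);
}
```

A caller that receives only the pointer cannot distinguish the two inputs, which is why the checked variant reports an error rather than silently truncating.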

vstinner · 2023-11-07T11:33:14Z

> If people are not fine with this, ask them what they would be fine with.

I asked @gpshead and @Yhg1s for a review ;-)

Yhg1s · 2023-11-07T12:31:37Z

I don't know why we need a new public function. For a new function, if you feel like it needs a strlen check, that's fine. I don't think it's a sensible way to deal with C strings (everyone who works with C strings should know about the importance of NUL), but for new APIs I don't really care either way.

However:

- Changing existing functions is a really, really bad idea.
- Does this really need to be a public API function? It seems stunningly trivial.
- There's nothing 'safer' about propagating NULs rather than truncating, so avoid that framing. It would be safer to remind people to learn about C strings when dealing with strings in C.
- I think these kinds of decisions should go to the C API WG that's being considered. It doesn't hurt to wait a few weeks to see if that is going to happen.

vstinner · 2023-11-07T22:37:08Z

I closed my issue: #111089 (comment)

vstinner requested review from erlend-aasland, AlexWaygood and berkerpeksag as code owners November 3, 2023 10:04

bedevere-app bot added the awaiting core review label Nov 3, 2023

bedevere-app bot mentioned this pull request Nov 3, 2023

[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

Closed

AlexWaygood removed their request for review November 3, 2023 10:11

vstinner mentioned this pull request Nov 3, 2023

gh-111089: Add PyUnicode_AsUTF8Unsafe() function #111672

Closed

vstinner requested a review from corona10 as a code owner November 3, 2023 11:39

erlend-aasland approved these changes Nov 3, 2023

bedevere-app bot added awaiting merge and removed awaiting core review labels Nov 3, 2023

serhiy-storchaka reviewed Nov 3, 2023

serhiy-storchaka approved these changes Nov 4, 2023

vstinner force-pushed the unicode_asutf8safe branch from 613243e to 96b2b8b Compare November 4, 2023 21:31

vstinner added 3 commits November 7, 2023 11:55

pythongh-111089: Add PyUnicode_AsUTF8Safe() function

36973d6

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.

Add to the limited C API

844a399

Rename to PyUnicode_AsUTF8NoNUL()

193c18b

vstinner force-pushed the unicode_asutf8safe branch from 96b2b8b to 193c18b Compare November 7, 2023 11:01

vstinner requested review from a team and encukou as code owners November 7, 2023 11:01

vstinner changed the title ~~gh-111089: Add PyUnicode_AsUTF8Safe() function~~ gh-111089: Add PyUnicode_AsUTF8NoNUL() function Nov 7, 2023

vstinner closed this Nov 7, 2023

vstinner deleted the unicode_asutf8safe branch November 7, 2023 22:37

encukou mentioned this pull request Nov 8, 2023

Soft-deprecate PyUnicode_AsUTF8 capi-workgroup/api-evolution#39

Open
