Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-111089: Add PyUnicode_AsUTF8NoNUL() function #111688

Closed
wants to merge 3 commits into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Nov 3, 2023

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8NoNUL() function should be used instead.


📚 Documentation preview 📚: https://cpython-previews--111688.org.readthedocs.build/

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

I'm not sure about PyUnicode_AsUTF8Safe() name: "safe" might give a feeling of security which can be wrong. As written in the function documentation, it doesn't check for "dangerous Unicode characters". It implements a single check: only tests if the string contains null characters.

For example, if the string is used to open a file, the function doesn't reject ../ substring which is commonly used for directory traversal vulnerabilities. Further sanitization and validation is required depending on the domain where the string is used.

Maybe the name should be more explicitly about its purpose, such as: PyUnicode_AsUTF8NoNul()? Can such name be misunderstood as "the result cannot be NULL", such as "the function cannot fail"? Or maybe PyUnicode_AsUTF8NoNulChar()?

@AlexWaygood
Copy link
Member

the clinic.py changes look fine to me; I have no opinion on the C changes :)

@erlend-aasland
Copy link
Contributor

Clinic changes are ok with me. I'll leave the C API discussion to the C API WG :)

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

If this change is merged, it would be interesting to go through PyUnicode_AsUTF8() usage in the Python code base and check if switching PyUnicode_AsUTF8Safe() would be worth it.

See #111656 (comment) list for example.

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

I'll leave the C API discussion to the C API WG :)

The working group doesn't exist yet, it's still a draft PEP: https://discuss.python.org/t/pep-731-c-api-working-group-charter/36117

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

@encukou @gpshead @Yhg1s @serhiy-storchaka: Would you mind to review this change?

@serhiy-storchaka
Copy link
Member

Are there strong objections against simply making PyUnicode_AsUTF8AdnSize(unicode, NULL) to check for embedded NULs?

@@ -4350,7 +4350,7 @@ def parse_arg(self, argname: str, displayname: str, *, limited_capi: bool) -> st
{bad_argument}
goto exit;
}}}}
{paramname} = PyUnicode_AsUTF8({argname});
{paramname} = PyUnicode_AsUTF8Safe({argname});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be different code for limited_capi is true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I wanted to add PyUnicode_AsUTF8Safe() to the limited C API in a separated PR. Maybe it's more convenient to do it in the PR, since there is a lot of code generated by Argument Clinic impacted by these changes.

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

Are there strong objections against simply making PyUnicode_AsUTF8AdnSize(unicode, NULL) to check for embedded NULs?

I suppose that it's too late to change it, same rationale than changing PyUnicode_AsUTF8().

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) is the most convenient API of the stable ABI to convert a Python str to char* (as UTF-8) without storing the size.

@vstinner
Copy link
Member Author

vstinner commented Nov 4, 2023

I rebased the PR on the main branch to get the test_asyncio fix. I also squashed commits.

@encukou
Copy link
Member

encukou commented Nov 6, 2023

Wait until the WG can discuss this.

I agree that Safe is too generic.

I wouldn't mind naming it UTF8Z, with Z used for “zero-terminated string without NUL characters” for all new API from now on.

@vstinner
Copy link
Member Author

vstinner commented Nov 6, 2023

If "Safe" is too generic, what about "AsUTF8NoNul"? I propose "Nul" instead of "Null" which looks like "NULL pointer". Or maybe "AsUTF8NoNullChars"?

"Z" looks too short to me, it's not easy to guess its intent.

@vstinner
Copy link
Member Author

vstinner commented Nov 7, 2023

If "Safe" is too generic, what about "AsUTF8NoNul"? I propose "Nul" instead of "Null" which looks like "NULL pointer". Or maybe "AsUTF8NoNullChars"?

Alternative short name: PyUnicode_AsUTF8NoNUL() since it's common to refer to "null characters" as NUL. It's commonly used in encoding tables such as the ASCII table. Example on Wikipedia: ASCII: Character Set.

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
@vstinner vstinner requested review from a team and encukou as code owners November 7, 2023 11:01
@vstinner
Copy link
Member Author

vstinner commented Nov 7, 2023

@serhiy-storchaka @erlend-aasland: I renamed the function to PyUnicode_AsUTF8NoNUL(). Are you fine with this name?

@vstinner vstinner changed the title gh-111089: Add PyUnicode_AsUTF8Safe() function gh-111089: Add PyUnicode_AsUTF8NoNUL() function Nov 7, 2023
@erlend-aasland
Copy link
Contributor

@serhiy-storchaka @erlend-aasland: I renamed the function to PyUnicode_AsUTF8NoNUL(). Are you fine with this name?

I didn't follow the discussion. My first impression is that I don't understand what that API is supposed to do, based on the name only.

@serhiy-storchaka
Copy link
Member

I am fine with added the embedded null character check in PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize(). If you ask for a char pointer, but have no way to get the size, you ask for a null-terminated C string. If the Python string contains embedded nulls, the result is ambiguous, therefore it must be exception.

If people not fine with this, ask them with what are they fine.

@vstinner
Copy link
Member Author

vstinner commented Nov 7, 2023

If people not fine with this, ask them with what are they fine.

I asked @gpshead and @Yhg1s for a review ;-)

@Yhg1s
Copy link
Member

Yhg1s commented Nov 7, 2023

I don't know why we need a new public function. For a new function, if you feel like it needs a strlen check, that's fine. I don't think it's a sensible way to deal with C strings (everyone who works with C strings should know about the importance of NUL), but for new APIs I don't really care either way.

However:

  • Changing existing functions is a really, really bad idea.
  • Does this really need to be a public API function? It seems stunningly trivial.
  • There's nothing 'safer' about propagating NULs rather than truncating, so avoid that framing. It would be safer to remind people to learn about C strings when dealing with strings in C.
  • I think these kinds of decisions should go to the C API WG that's being considered. It doesn't hurt to wait a few weeks to see if that is going to happen.

@vstinner
Copy link
Member Author

vstinner commented Nov 7, 2023

I close my issue: #111089 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants