-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL #111091
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this idea when added such checks in other functions, but at that time PyUnicode_AsUTF8()
was rather a simple accessor to internal data.
Most sites where it matters where changed to use PyUnicode_AsUTF8AndSize()
+ strlen()
. Now some of them can be changed back to use PyUnicode_AsUTF8()
.
Do you consider to include PyUnicode_AsUTF8()
in the limited C API?
extracted from the returned data. | ||
*/ | ||
|
||
// Returns a pointer to the default encoding (UTF-8) of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it necessary to change the type of the comment? Most comments here are /* */
.
It makes reviewing the changes more difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary, but I take this PR as an opportunity to change the comment style. IMO multi-line comments written with // comment
syntax are way easier to read.
It's my main motivation for this change. It's the most natural way to get a |
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
Once this change will be merged, we can review PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize() usage. But I chose to make this PR as small to possible to ease the review. |
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Oh. It seems like I removed the cast fix in my latest change. Thanks @serhiy-storchaka, I applied your suggestion. |
But why set size to 0, and not to -1? There is larger chance to miss error. For example |
Is this comment related to my PyUnicode_AsUTF8AndSize change, PR #111106? |
Yes. |
Merged, thanks for the review @serhiy-storchaka. |
Is this change really necessary? Are there any examples of this actually being a problem? How common are the problematic patterns in actual user code? Why is recommending a replacement of I use |
I wrote PR #111587 to avoid calling strlen() in most cases. |
Caching the result does not affect my use-cases. |
Can you please add a comment on PR #111587 to explain why your use case is not covered by the cache? |
Sorry, I made the assumption that the background of this change was a common knowledge. I wrote issue #111656 to describe the issue. |
If the string does not contain embedded null character, the code continues to work. But if it contains null characters, right, you now get an exception. In that case, you should use |
I've commented there, but just for clarity: it's because it only ever calls |
This is exactly the problem I'm pointing out. You are breaking existing code because you're making bad assumptions. The assumption is that the user didn't check for NULs before, or that handling NULs this way would be problematic. It does not have to be. You're forcing users who are correctly using the API to use a different API for no reason other than to signal to you that they're using it correctly. If users don't care about the length of the returned string, and they don't care about embedded NULs, why would they have to change anything? You're just introducing churn here. |
Using PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize() doesn't mean that you don't care about the length. You can use PyUnicode_AsUTF8() when you use any C API which expects a null terminated string. strlen() gives you the string length, but you don't need to explicitly pass the length, the null terminator is used for that. I'm not sure that PyUnicode_AsUTF8() implies that you don't care about embedded null characters. Would you mind to elaborate how you create a string with null characters and then truncate the string at the first null character on purpose? What does the string contain? Why not truncating at null when you create the string? Do you have examples of code doing that? |
python#111091)" This reverts commit d731579.
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)" This reverts commit d9b606b. * Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)" This reverts commit cde1071. * Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)" This reverts commit d731579. * Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)" This reverts commit d8f32be. * Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)" This reverts commit 37e4e20.
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
*size
to 0 on error to avoid undefined variable value.📚 Documentation preview 📚: https://cpython-previews--111091.org.readthedocs.build/