Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL #111091

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Oct 20, 2023

  • PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
  • PyUnicode_AsUTF8AndSize() now sets *size to 0 on error to avoid undefined variable value.
  • Update related C API tests (test_capi.test_unicode).
  • type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.

📚 Documentation preview 📚: https://cpython-previews--111091.org.readthedocs.build/

@vstinner
Copy link
Member Author

cc @serhiy-storchaka

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this idea when added such checks in other functions, but at that time PyUnicode_AsUTF8() was rather a simple accessor to internal data.

Most sites where it matters where changed to use PyUnicode_AsUTF8AndSize() + strlen(). Now some of them can be changed back to use PyUnicode_AsUTF8().

Do you consider to include PyUnicode_AsUTF8() in the limited C API?

extracted from the returned data.
*/

// Returns a pointer to the default encoding (UTF-8) of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it necessary to change the type of the comment? Most comments here are /* */.

It makes reviewing the changes more difficult.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary, but I take this PR as an opportunity to change the comment style. IMO multi-line comments written with // comment syntax are way easier to read.

Objects/unicodeobject.c Outdated Show resolved Hide resolved
@vstinner
Copy link
Member Author

Do you consider to include PyUnicode_AsUTF8() in the limited C API?

It's my main motivation for this change. It's the most natural way to get a char*. But I prefer to discuss it separetely.

* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.
@vstinner
Copy link
Member Author

Most sites where it matters where changed to use PyUnicode_AsUTF8AndSize() + strlen(). Now some of them can be changed back to use PyUnicode_AsUTF8().

Once this change will be merged, we can review PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize() usage. But I chose to make this PR as small to possible to ease the review.

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@vstinner
Copy link
Member Author

Oh. It seems like I removed the cast fix in my latest change. Thanks @serhiy-storchaka, I applied your suggestion.

@serhiy-storchaka
Copy link
Member

But why set size to 0, and not to -1? There is larger chance to miss error. For example PyUnicode_FromStringAndSize(NULL, 0) and PyBytes_FromStringAndSize(NULL, 0) return empty string and empty bytes object, but if size is negative, they raise exception.

@vstinner
Copy link
Member Author

But why set size to 0, and not to -1?

Is this comment related to my PyUnicode_AsUTF8AndSize change, PR #111106?

@serhiy-storchaka
Copy link
Member

Yes.

@vstinner vstinner merged commit d731579 into python:main Oct 20, 2023
@vstinner vstinner deleted the asutf8 branch October 20, 2023 15:59
@vstinner
Copy link
Member Author

Merged, thanks for the review @serhiy-storchaka.

@Yhg1s
Copy link
Member

Yhg1s commented Nov 2, 2023

Is this change really necessary? Are there any examples of this actually being a problem? How common are the problematic patterns in actual user code?

Why is recommending a replacement of PyUnicode_AsUTF8AndSize(s, NULL) any better than leaving PyUnicode_AsUTF8(s) as-is? Considering the two have been documented and recommended as alternatives for so long, is it a good idea to break the equivalence? How should users know that one is "safe" and the other is not?

I use PyUnicode_AsUTF8() in performance-critical situations, and the extra check is wholly unnecessary for my uses. I'm also concerned about the change of semantics. Here's at least one simple example that would cause user code to start failing: creating a string with PyUnicode_FromUTF8AndSize() and passing buffer size instead of string length, then later fetching the data with PyUnicode_AsUTF8().

@vstinner
Copy link
Member Author

vstinner commented Nov 2, 2023

I use PyUnicode_AsUTF8() in performance-critical situations, and the extra check is wholly unnecessary for my uses.

I wrote PR #111587 to avoid calling strlen() in most cases.

@Yhg1s
Copy link
Member

Yhg1s commented Nov 2, 2023

I wrote PR #111587 to avoid calling strlen() in most cases.

Caching the result does not affect my use-cases.

@vstinner
Copy link
Member Author

vstinner commented Nov 2, 2023

Caching the result does not affect my use-cases.

Can you please add a comment on PR #111587 to explain why your use case is not covered by the cache?

@vstinner
Copy link
Member Author

vstinner commented Nov 2, 2023

Is this change really necessary? Are there any examples of this actually being a problem? How common are the problematic patterns in actual user code?

Sorry, I made the assumption that the background of this change was a common knowledge. I wrote issue #111656 to describe the issue.

@vstinner
Copy link
Member Author

vstinner commented Nov 2, 2023

Here's at least one simple example that would cause user code to start failing: creating a string with PyUnicode_FromUTF8AndSize() and passing buffer size instead of string length, then later fetching the data with PyUnicode_AsUTF8().

If the string does not contain embedded null character, the code continues to work. But if it contains null characters, right, you now get an exception. In that case, you should use PyUnicode_AsUTF8AndSize(), PyUnicode_AsUTF8String() or any other existing API which returns the size of the UTF-8 encoded string (and don't raise an exception in case of embedded null character).

@Yhg1s
Copy link
Member

Yhg1s commented Nov 2, 2023

Caching the result does not affect my use-cases.

Can you please add a comment on PR #111587 to explain why your use case is not covered by the cache?

I've commented there, but just for clarity: it's because it only ever calls PyUnicode_AsUTF8() once per string.

@Yhg1s
Copy link
Member

Yhg1s commented Nov 2, 2023

Here's at least one simple example that would cause user code to start failing: creating a string with PyUnicode_FromUTF8AndSize() and passing buffer size instead of string length, then later fetching the data with PyUnicode_AsUTF8().

If the string does not contain embedded null character, the code continues to work. But if it contains null characters, right, you now get an exception. In that case, you should use PyUnicode_AsUTF8AndSize(), PyUnicode_AsUTF8String() or any other existing API which returns the size of the UTF-8 encoded string (and don't raise an exception in case of embedded null character).

This is exactly the problem I'm pointing out. You are breaking existing code because you're making bad assumptions. The assumption is that the user didn't check for NULs before, or that handling NULs this way would be problematic. It does not have to be. You're forcing users who are correctly using the API to use a different API for no reason other than to signal to you that they're using it correctly. If users don't care about the length of the returned string, and they don't care about embedded NULs, why would they have to change anything? You're just introducing churn here.

@vstinner
Copy link
Member Author

vstinner commented Nov 2, 2023

If users don't care about the length of the returned string, and they don't care about embedded NULs, why would they have to change anything? You're just introducing churn here.

Using PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize() doesn't mean that you don't care about the length. You can use PyUnicode_AsUTF8() when you use any C API which expects a null terminated string. strlen() gives you the string length, but you don't need to explicitly pass the length, the null terminator is used for that.

I'm not sure that PyUnicode_AsUTF8() implies that you don't care about embedded null characters.

Would you mind to elaborate how you create a string with null characters and then truncate the string at the first null character on purpose? What does the string contain? Why not truncating at null when you create the string? Do you have examples of code doing that?

vstinner added a commit to vstinner/cpython that referenced this pull request Nov 7, 2023
vstinner added a commit that referenced this pull request Nov 7, 2023
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)"

This reverts commit d9b606b.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)"

This reverts commit cde1071.

* Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)"

This reverts commit d731579.

* Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)"

This reverts commit d8f32be.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)"

This reverts commit 37e4e20.
hugovk pushed a commit to hugovk/cpython that referenced this pull request Nov 8, 2023
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…n#111091)

* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
…n#111091)

* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants