Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embedded null characters can lead to bugs or even security vulnerabilities #111656

Closed
vstinner opened this issue Nov 2, 2023 · 10 comments
Closed
Labels
type-security A security issue

Comments

@vstinner
Copy link
Member

vstinner commented Nov 2, 2023

I just modified PyUnicode_AsUTF8() of the C API to raise an exception if a string contains an embedded null character to reduce the risk of security vulnerabilities. PyUnicode_AsUTF8() caller expects a string terminated by a null byte. If the UTF-8 encoded string contains embedded null byte, the caller is likely to truncate the string without knowing that there are more bytes after "the first" null byte.

See: https://owasp.org/www-community/attacks/Embedding_Null_Code

It's not only about security issue, it can also just be seen as a bug: unwanted behavior.

Previous issues:

Discussions:


Example with Python 3.12:

import ctypes

libc = ctypes.cdll.LoadLibrary('libc.so.6')
printf = libc.printf
PyUnicode_AsUTF8 = ctypes.pythonapi.PyUnicode_AsUTF8
PyUnicode_AsUTF8.argtypes = (ctypes.py_object,)
PyUnicode_AsUTF8.restype = ctypes.c_char_p

my_string = "World\0truncated string"
printf(b"Hello %s\n", PyUnicode_AsUTF8(my_string))

Output:

Hello World

The truncated string part is silently ignored!


Multiple functions were modified in the past to prevent this problem. Examples:

  • _dbm.open(): check filename
  • _gdbm.open(): check filename
  • PyBytes_AsStringAndSize(str, NULL)
  • grp.getgrnam(): check name
  • pwd.getpwnam(): check name
  • _locale.strxfrm(): check argument
  • path_converter() of the os module: basically any filename and path
  • PyUnicode_AsWideCharString()
  • os.putenv()
  • _posixsubprocess.fork_exec(): executable_list
  • _struct.Struct: check format
  • _tkinter SetVar() and varname_converter()
  • _winapi.CreateProcess() getenvironment()
  • PyUnicode_EncodeLocale()
  • PyUnicode_EncodeFSDefault()
  • unicode_decode_locale()
  • PyUnicode_FSConverter()
  • PyUnicode_DecodeLocale()
  • PyUnicode_DecodeLocaleAndSize()
  • PyUnicode_FSDecoder()
  • PyUnicode_AsUTF8() -- recently modified
  • _Py_stat(): check path
  • getargs.c: 's', 'y' and 'z' formats

There are exceptions which accept embedded null bytes/characters:

  • socket: AF_UNIX socket name
@terryjreedy
Copy link
Member

Just curious, what action is proposed by this issue?

@Yhg1s
Copy link
Member

Yhg1s commented Nov 2, 2023

Since this came up specifically about PyUnicode_AsUTF8, do you have any examples of problems caused by that function in particular? I feel bad for hammering on this, but I have asked that very question what feels like ten times now, and I just get vague "embedded NULs can cause problems" as answer.

Even in this list of "things that were fixed", you're not showing actual security issues, although I'm sure a lot of the fixes were good for both security reasons and others. I just want to see concrete, actual, real problems the PyUnicode_AsUTF8 change is fixing.

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

Multiple functions of my list of fixed functions (examples) use UTF-8 and so likely called PyUnicode_AsUTF8() before being fixed. My point is that blocking embedded null characters in PyUnicode_AsUTF8() would fix indirectly all of these examples.


Some examples of Python 3.12 functions using directly PyUnicode_AsUTF8() without checking for embedded null characters. I doubt that developers using PyUnicode_AsUTF8() are aware of the embedded null issue and took it in account when writing the code. For most of these functions, truncating sounds between bad (surprising or wrong behavior) or very bad (security issue).

  • Modules/getpath.c getpath_warn() truncates the message
  • _hmac.new(key, msg, digestmod) truncates digestmod -- related to security
  • os.confstr(name), os.sysconf(name): truncate name
  • socket.getaddrinfo(): truncates port string
  • syslog.openlog(ident): truncate ident
  • timemodule.c gettmarg() truncates tm_zone (timezone string)
  • _xxsubinterpretersmodule.c _copy_raw_string() truncates the string -- that sounds like a bad bug :-(
  • PyModule_FromDefAndSpec2(): truncate spec.name
  • PyModule_GetName() truncates the module name
  • PyModule_GetFilename() truncates the module filename
  • PyModule_GetAttr() truncates name in the tp->tp_getattr code path
  • sys.audit() truncates auditEvent -- this sound bad

I didn't check for functions calling indirectly PyUnicode_AsUTF8(). The exhaustive list should be way longer.


Have fun with null characters.

$ ./python -c 'import _imp, types; spec=types.SimpleNamespace(name="sys\0ignored"); _imp.create_builtin(spec)'
python: Python/import.c:939: hashtable_key_from_2_strings: Assertion `strlen(key) == size - 1' failed.
Abandon (core dumped)

$ ./python -c 'import _imp, types; spec=types.SimpleNamespace(name="sys\0ignored", origin=""); _imp.create_dynamic(spec)'
python: Python/import.c:939: hashtable_key_from_2_strings: Assertion `strlen(key) == size - 1' failed.

$ ./python -c 'import _imp, types; name="zipimport\0ignored"; print(_imp.find_frozen(name))'
(None, False, 'zipimport')

$ ./python -c 'import hmac; print(hmac.new(b"data", digestmod="md5\0ignored").hexdigest())'
b60acef36887ece81fffd558f7a62378
$ ./python -c 'import hmac; print(hmac.new(b"data", digestmod="md5").hexdigest())'
b60acef36887ece81fffd558f7a62378

$ ./python -c 'import os; print(os.sysconf("SC_NPROCESSORS_ONLN\0ignored"))'
12

$ ./python -c 'import socket; print(socket.getaddrinfo("google.com", "80\0ignored"))'
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('142.250.179.78', 80)), ...]

$ ./python -c 'import syslog; syslog.openlog("pythontest\0ignored"); syslog.syslog("test")'
$ journalctl -r|head -n1
nov. 03 01:10:41 mona pythontest[149433]: test

Audit are related to security, right?

import sys
def hook(*args):
    print(args)
sys.addaudithook(hook)
sys.audit("event\0ignored")

Output:

('event', ())

@corona10
Copy link
Member

corona10 commented Nov 3, 2023

PyModule_FromDefAndSpec2(): truncate spec.name
PyModule_GetName() truncates the module name
PyModule_GetFilename() truncates the module filename
PyModule_GetAttr() truncates name in the tp->tp_getattr code path

The inputs of these APIs are dynamically created?
Most of them are not fixed inputs?

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

I'm sure that there are legit use cases to truncate a string at the first null character on purpose. The question is what is the behavior expected by the majority of users when it comes to null characters? Do developers think ahead of null characters when they write C code? Is the majority of developers aware of this issue?

If we add an hypothetical PyUnicode_AsUTF8Safe() function which raises an exception if a string contains a null character: should existing C extensions move towards this new function, at least for new code? Is it ok that existing C extensions (which are not updated) will continue truncating?

Should the C API be "safe" by default, or should developers opt-in to be safe? Apparently, it's still an open question. I thought that the question was already answered in the past since many functions have been already modified to reject embedded null characters. See the list at: #111089 (comment)

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

The inputs of these APIs are dynamically created?

My imp.create_builtin() example above calls PyModule_FromDefAndSpec2() with an arbitrary string.

You can create a module with arbitrary name and filename: import types; mod = types.ModuleType("sys\0ignored"); mod.__file__ = "safe_path\0unsafe".

I didn't check every single example to see how easy it is to play with embedded null characters. Maybe it's not possible to "exploit" all examples. I just gave multiple examples to show the diversity of the issue and see that many functions are impacted.

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

Java took a different way for this issue: DataOutputStream.writeUTF() encodes the null character as 2 bytes instead of 1 byte, and so doesn't respect the standard. DataInput.readUTF8() can read this format.

@Yhg1s
Copy link
Member

Yhg1s commented Nov 3, 2023

Victor, since this issue seems to be a response to me asking what actual security issue you're fixing by changing PyUnicode_AsUTF8()'s semantics, let me try to make sure I understand what you're saying here: "No, there are no known security issues caused by PyUnicode_AsUTF8()'s current behaviour." Is that right?

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

@terryjreedy:

Just curious, what action is proposed by this issue?

It's directly related to gh-111089 issue where I changed PyUnicode_AsUTF8() to reject embedded null characters.

I close this issue. I explained why in #111089 (comment). In short, PyUnicode_AsUTF8() is "too broad" or "too generic" to be treated as "security sensitive".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-security A security issue
Projects
None yet
Development

No branches or pull requests

4 participants