Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEW] Kill unkillable scripts #1783

Open
zuiderkwast opened this issue Feb 26, 2025 · 6 comments
Open

[NEW] Kill unkillable scripts #1783

zuiderkwast opened this issue Feb 26, 2025 · 6 comments

Comments

@zuiderkwast
Copy link
Contributor

The problem/use-case that the feature addresses

If a Lua script has done any writes, it's not killable using SCRIPT KILL, because it would violate the atomicity guarantee of Lua script. Only shutting down the server can kill the script.

The atomicity of a script isn't always that important though. In some workloads, it would be totally fine to violate it. At least, it's not as bad as having to restart the server.

Often when you have a hanging script, it's because of some infinite loop or other bug in the script. If you have a buggy script, it may not write correct stuff to the database anyway.

Description of the feature

Allow SCRIPT KILL (and FUNCTION KILL) to kill a script even if it has done some writes.

We could add some extra FORCE argument, or a config, if we're worried about changing the default behavior.

Alternatives you've considered

Some way to undo the writes done by a script, such as a redo-list, a frozen copy of the key space similar, a copy-on-write implemented in user space.

Additional information

@rjd15372
Copy link
Member

I would vote for adding the FORCE extra argument. That way the user is acknowledging that understands the consequences, which will be described in the documentation of the command.

@enjoy-binbin
Copy link
Member

Great, a FORCE LGTM, i think we can do it. btw, lets also talk about #866 (comment)?

Currently to ensure script atomicity, we can bypass maxmemory (data memory or lua heap memory). If we allow kill unkillable scripts, i think we can fix it. (we allow killing them in internal fork, ie. return OOM early)

EVAL "for i=1,10000000 do redis.call('SET', 'key-kkkk:' .. i, 'value-tttt:' .. i) end" 0
EVAL "for i=1,10000000 do redis.pcall('SET', 'key-kkkk:' .. i, 'value-tttt:' .. i) end" 0
EVAL "local a={}; local i; for i=1, 30000000 do a[i]=i; end" 0

@zuiderkwast
Copy link
Contributor Author

To return OOM early is an automatic kill, not SCRIPT KILL, right?

If we let the script be killed automatically by the memory limit, it's a behavior change that maybe should have a config, or at least be done in a major release.

@enjoy-binbin
Copy link
Member

To return OOM early is an automatic kill, not SCRIPT KILL, right?

yes, it is an automatic kill. There are always people messing around with scripts... We will give an extra 30MB of memory in case it is really lua tmp memory.

If we let the script be killed automatically by the memory limit, it's a behavior change that maybe should have a config, or at least be done in a major release.

yes, at least we're trying to take this step. This is usually the result of malicious script execution.

@madolson
Copy link
Member

madolson commented Feb 26, 2025

I would vote for adding the FORCE extra argument. That way the user is acknowledging that understands the consequences, which will be described in the documentation of the command.

Have we seen that much actual asks for this? I'm really concerned about comments like this, because the problem is people don't really understand the consequences and leave the datastore in a corrupted state. I like the alternatives that Viktor mentioned such as building rollbacks through a write ahead or an undo log. They're a lot more engineering effort, but they prevent inconsistent data. Inconsistent data is an extremely dangerous place to start wandering in to.

A second major concern I have here is that users basically never actually want to call script kill. Valkey is entirely single threaded and extremely latency sensitive. If you have a script running for 5 seconds, your application has already been down for 5 seconds. Paging an operator is at least 15 minutes away, and then they have to make a time pressure decision about what to do. If you're looking to automate, the safest thing is to build tooling around safely failing over to a replica or by reloading from the AOF.

This is usually the result of malicious script execution.

By malicious, do we mean like some type of hacker? If that is the case, I think the user has much larger problems than DOS. They could just intentionally corrupt all of the data. I think a long running lua script is actually fairly detectable.

@rjd15372
Copy link
Member

@madolson those are very good arguments against allowing to leave the database in an inconsistent state.

Last week I started to play a bit with the rollback stuff by implementing transactions that only guarantee atomicity and consistency (no isolation, or durability) on top of our hashtable implementation, but it requires changes to all data types code to support the rollback of transactions, and I'm not sure to be a good approach to follow.

Probably, solving the problem only for Lua scripts might be easier than making a generic infrastructure to support multi-command atomicity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants