[NEW] Kill unkillable scripts #1783

zuiderkwast · 2025-02-26T10:14:01Z

The problem/use-case that the feature addresses

If a Lua script has done any writes, it's not killable using SCRIPT KILL, because it would violate the atomicity guarantee of Lua script. Only shutting down the server can kill the script.

The atomicity of a script isn't always that important though. In some workloads, it would be totally fine to violate it. At least, it's not as bad as having to restart the server.

Often when you have a hanging script, it's because of some infinite loop or other bug in the script. If you have a buggy script, it may not write correct stuff to the database anyway.

Description of the feature

Allow SCRIPT KILL (and FUNCTION KILL) to kill a script even if it has done some writes.

We could add some extra FORCE argument, or a config, if we're worried about changing the default behavior.

Alternatives you've considered

Some way to undo the writes done by a script, such as a redo-list, a frozen copy of the key space similar, a copy-on-write implemented in user space.

Additional information

rjd15372 · 2025-02-26T10:20:12Z

I would vote for adding the FORCE extra argument. That way the user is acknowledging that understands the consequences, which will be described in the documentation of the command.

enjoy-binbin · 2025-02-26T11:49:29Z

Great, a FORCE LGTM, i think we can do it. btw, lets also talk about #866 (comment)?

Currently to ensure script atomicity, we can bypass maxmemory (data memory or lua heap memory). If we allow kill unkillable scripts, i think we can fix it. (we allow killing them in internal fork, ie. return OOM early)

EVAL "for i=1,10000000 do redis.call('SET', 'key-kkkk:' .. i, 'value-tttt:' .. i) end" 0
EVAL "for i=1,10000000 do redis.pcall('SET', 'key-kkkk:' .. i, 'value-tttt:' .. i) end" 0
EVAL "local a={}; local i; for i=1, 30000000 do a[i]=i; end" 0

zuiderkwast · 2025-02-26T12:03:29Z

To return OOM early is an automatic kill, not SCRIPT KILL, right?

If we let the script be killed automatically by the memory limit, it's a behavior change that maybe should have a config, or at least be done in a major release.

enjoy-binbin · 2025-02-26T12:07:53Z

To return OOM early is an automatic kill, not SCRIPT KILL, right?

yes, it is an automatic kill. There are always people messing around with scripts... We will give an extra 30MB of memory in case it is really lua tmp memory.

If we let the script be killed automatically by the memory limit, it's a behavior change that maybe should have a config, or at least be done in a major release.

yes, at least we're trying to take this step. This is usually the result of malicious script execution.

madolson · 2025-02-26T19:13:08Z

I would vote for adding the FORCE extra argument. That way the user is acknowledging that understands the consequences, which will be described in the documentation of the command.

Have we seen that much actual asks for this? I'm really concerned about comments like this, because the problem is people don't really understand the consequences and leave the datastore in a corrupted state. I like the alternatives that Viktor mentioned such as building rollbacks through a write ahead or an undo log. They're a lot more engineering effort, but they prevent inconsistent data. Inconsistent data is an extremely dangerous place to start wandering in to.

A second major concern I have here is that users basically never actually want to call script kill. Valkey is entirely single threaded and extremely latency sensitive. If you have a script running for 5 seconds, your application has already been down for 5 seconds. Paging an operator is at least 15 minutes away, and then they have to make a time pressure decision about what to do. If you're looking to automate, the safest thing is to build tooling around safely failing over to a replica or by reloading from the AOF.

This is usually the result of malicious script execution.

By malicious, do we mean like some type of hacker? If that is the case, I think the user has much larger problems than DOS. They could just intentionally corrupt all of the data. I think a long running lua script is actually fairly detectable.

rjd15372 · 2025-02-27T12:23:07Z

@madolson those are very good arguments against allowing to leave the database in an inconsistent state.

Last week I started to play a bit with the rollback stuff by implementing transactions that only guarantee atomicity and consistency (no isolation, or durability) on top of our hashtable implementation, but it requires changes to all data types code to support the rollback of transactions, and I'm not sure to be a good approach to follow.

Probably, solving the problem only for Lua scripts might be easier than making a generic infrastructure to support multi-command atomicity.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NEW] Kill unkillable scripts #1783

[NEW] Kill unkillable scripts #1783

zuiderkwast commented Feb 26, 2025

rjd15372 commented Feb 26, 2025

enjoy-binbin commented Feb 26, 2025

zuiderkwast commented Feb 26, 2025

enjoy-binbin commented Feb 26, 2025

madolson commented Feb 26, 2025 •

edited

Loading

rjd15372 commented Feb 27, 2025

[NEW] Kill unkillable scripts #1783

[NEW] Kill unkillable scripts #1783

Comments

zuiderkwast commented Feb 26, 2025

rjd15372 commented Feb 26, 2025

enjoy-binbin commented Feb 26, 2025

zuiderkwast commented Feb 26, 2025

enjoy-binbin commented Feb 26, 2025

madolson commented Feb 26, 2025 • edited Loading

rjd15372 commented Feb 27, 2025

madolson commented Feb 26, 2025 •

edited

Loading