Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid memory address or nil pointer when write commands in function _backgroundWrite #702

Closed
gj3233 opened this issue Dec 18, 2024 · 7 comments

Comments

@gj3233
Copy link

gj3233 commented Dec 18, 2024

Hello,
Yesterday when testing with redis cluster, my application had been panic at line ( (of rueidis v1.0.46) :
return c.cs.s which in function Commands():
func (c *Completed) Commands() []string { return c.cs.s }
which is call by err = writeCmd(p.w, cmd.Commands())
image

The complete code is:
`func (p *pipe) _backgroundWrite() (err error) {
var (
ones = make([]Completed, 1)
multi []Completed
ch chan RedisResult
flushDelay = p.maxFlushDelay
flushStart = time.Time{}
)

for err == nil {
	if ones[0], multi, ch = p.queue.NextWriteCmd(); ch == nil {
		if flushDelay != 0 {
			flushStart = time.Now()
		}
		if p.w.Buffered() != 0 {
			if err = p.w.Flush(); err != nil {
				break
			}
		}
		ones[0], multi, ch = p.queue.WaitForWrite()
		if flushDelay != 0 && atomic.LoadInt32(&p.waits) > 1 { // do not delay for sequential usage
			// Blocking commands are executed in dedicated client which is acquired from pool.
			// So, there is no sense to wait other commands to be written.
			// https://github.com/redis/rueidis/issues/379
			var blocked bool
			for i := 0; i < len(multi) && !blocked; i++ {
				blocked = multi[i].IsBlock()
			}
			if !blocked {
				time.Sleep(flushDelay - time.Since(flushStart)) // ref: https://github.com/redis/rueidis/issues/156
			}
		}
	}
	if ch != nil && multi == nil {
		multi = ones
	}
	for _, cmd := range multi {
		err = writeCmd(p.w, cmd.Commands())
	}
}
return

}`

Redis topology:
cluster enabled
3 master 3 replica

Action: Node is starting.

Could you help on this? Thanks. It seems cmd passed to writeCmd() has something wrong that lead to the crash. It confused because cmd is read from range multi, so I think something wrong happened to multi.

@rueian
Copy link
Collaborator

rueian commented Dec 18, 2024

Hi @gj3233, do you have a reproducible program on this? And what do you mean by Action: Node is starting?

@gj3233
Copy link
Author

gj3233 commented Dec 19, 2024

Hi @gj3233, do you have a reproducible program on this? And what do you mean by Action: Node is starting?

HI @rueian, I do not have a reproducible program on this, Action: Node is starting means the crash happened when the node just restarted, during the time the system is very busy for querying data from Redis DB with Client.DoMulti(ctx, cmds...), the crash is seen when there are amount of querying DB request send but not seen when querying is not that frequent. I attach the complete back trace file for your reference, thanks.
go-panic.txt

@rueian
Copy link
Collaborator

rueian commented Dec 19, 2024

Hi @gj3233,

From the clues you gave me:

  1. It panicked on c.cs.s where the only possibility is that the cs is invalid and this can only happen with false recycling.
  2. There was a lot of doretry in back your traces.

I think the panic was caused by an incorrect error checking that falsely recycled an intermediate struct here:

rueidis/cluster.go

Lines 703 to 705 in 9ff4c60

if ctx.Err() == nil {
retryp.Put(re)
}

#706 should fix the issue. You can try it with v1.0.52-alpha.3

@gj3233
Copy link
Author

gj3233 commented Dec 20, 2024

HI @rueian much thanks! I will test with the fix later and update to you.

@rueian
Copy link
Collaborator

rueian commented Dec 31, 2024

v1.0.52 was released last week. I am going to close the issue for now. Please feel free to reopen if you are still suffering from this.

@rueian rueian closed this as completed Dec 31, 2024
@nikhillimaje
Copy link

Hi @rueian, as mentioned in this thread, our application also recently encountered this issue. We're trying to understand the trigger scenario for this bug. In our case, read qps wasn't high neither the redis node/cluster were restarting. However, there was a connection storm to the Redis cluster (as the pods were crash looping) and we're assuming that should be the primary reason. Is there a way to confirm this? Just to ensure that this bug doesn't occur again in case of connection storm in the future.

@rueian
Copy link
Collaborator

rueian commented Feb 20, 2025

Hi @nikhillimaje,

As you know, non-pinned commands built by the command builder can’t be shared across multiple Do() and DoMulti() because they will be recycled by rueidis for saving memory allocations. The bug was caused by a wrong check for recycling errored commands internally.

Connection storms, depending on your use cases and setup, might be an expected behavior. But it indeed could cause network errors on the client side and further triggered the previous wrong check and made those commands be falsely recycled.

I can only say that was the only wrong check we know so far and has been fixed in the newer version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants