gohbase stuck retrying RPCs after region server temporarily down #186

Closed
rodicus opened this issue May 3, 2022 · 2 comments · Fixed by #228 · May be fixed by #227
rodicus commented May 3, 2022

We are experiencing an issue where Put RPCs (and possibly other types of RPCs) for a specific region may end up continuously failing with a ServerError.
This seems to be triggered by HBase moving regions after one of the region servers goes down temporarily.
The current workaround is to restart the gohbase client process, which indicates that the issue is with gohbase and not HBase itself.

Our guess is that a region client connection is closed, but the client remains associated with one or more region info instances in the gohbase cache. Most likely the region client is not properly cleaned up after a disconnection and ends up being reused for RPC operations.
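To make the suspected failure mode concrete, here is a minimal Go sketch modelling the two caches involved. The types and names are hypothetical simplifications, not gohbase's actual code; they only illustrate how a cached region can keep pointing at a region client that has already been closed and evicted:

```go
package main

import "fmt"

// regionClient is a stand-in for gohbase's region client; closed models the
// done channel that makes QueueRPC return ErrClientClosed.
type regionClient struct {
	addr   string
	closed bool
}

// regionInfo is a stand-in for a cached region, holding a pointer to the
// client currently believed to serve it.
type regionInfo struct {
	name   string
	client *regionClient
}

func main() {
	// Simplified stand-in for the clients cache: address -> region client.
	clients := map[string]*regionClient{}

	rc := &regionClient{addr: "rs1:16020"}
	clients[rc.addr] = rc
	reg := &regionInfo{name: "t1,,1651500000000.abcdef.", client: rc}

	// Region server goes down: the client is closed and evicted from the
	// clients cache, but the cached region still holds the stale pointer.
	rc.closed = true
	delete(clients, rc.addr)

	// Every subsequent RPC for this region reuses the closed client, gets
	// ErrClientClosed, and calls clientDown, which finds nothing to remove.
	fmt.Println("client closed:", reg.client.closed) // true
	_, known := clients[reg.client.addr]
	fmt.Println("client still in cache:", known) // false: clientDown is a no-op
}
```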

We have seen the issue on a production HBase cluster that was periodically having sync failures making region servers restart multiple times a day.
We have tried reproducing this issue in a lab by bringing region servers down temporarily, but the cluster and client process always recover.

The gohbase versions where we have seen these failed RPCs include:
6acaa09
3489911
which are relatively old, but there don't appear to be any newer changes in the repo that could have resolved the issue in question.

Below is a breakdown of the goroutines from version 3489911 while RPCs are failing. In this setup there are 3 HBase region servers and 8000 workers issuing Put RPCs:

3 goroutines in the "send" loop (one per server?)
/go/src/arista/vendor/github.com/tsuna/gohbase/region/new.go:69 +0x c.Dial go c.processRPCs()
/go/src/arista/vendor/github.com/tsuna/gohbase/region/client.go:364 +0x c.processRPCs select // wait for the next batch

3 goroutines in the "receive" loop (one per server?)
/go/src/arista/vendor/github.com/tsuna/gohbase/region/new.go:71 +0x c.Dial go c.receiveRPCs()
/go/src/arista/vendor/github.com/tsuna/gohbase/region/client.go:444 +0x c.receiveRPCs c.receive()
/go/src/arista/vendor/github.com/tsuna/gohbase/region/client.go:464 +0x c.receive c.readFully(sz[:])
/go/src/arista/vendor/github.com/tsuna/gohbase/region/client.go:580 +0x c.readFully io.ReadFull

3 "Put" goroutines queueing an RPC
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:237 +0x c.Put c.mutate(p)
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:264 +0x c.mutate c.SendRPC(m)
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:77 +0x  c.SendRPC c.sendRPCToRegion(rpc, reg) <- ALL ON SAME REGION interface
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:141 +0x c.sendRPCToRegion sendBlocking(client, rpc)
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:111 +0x sendBlocking rc.QueueRPC(rpc)
/go/src/arista/vendor/github.com/tsuna/gohbase/region/client.go:206 +0x c.QueueRPC <-c.done: returnResult(rpc, nil, ErrClientClosed)

1 "Put" goroutines starting to mark region client as "down"
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:237 +0x c.Put c.mutate(p)
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:264 +0x c.mutate c.SendRPC(m)
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:77 +0x  c.SendRPC c.sendRPCToRegion(rpc, reg) <- ALL ON SAME REGION interface
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:167 +0x c.sendRPCToRegion ServerError: c.clientDown(client)
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:177 +0x c.clientDown(client hrpc.RegionClient)

7995 "Put" goroutines waiting (contending) to mark a server client as "down"
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:237 +0x c.Put c.mutate(p)
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:264 +0x c.mutate c.SendRPC(m)
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:77 +0x  c.SendRPC c.sendRPCToRegion(rpc, reg) <- ALL ON SAME REGION interface
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:167 +0x c.sendRPCToRegion ServerError: c.clientDown(client)
/go/src/arista/vendor/github.com/tsuna/gohbase/caches.go:84 +0x rcc.clientDown rcc.m.Lock()

1 "Put" goroutines finishing marking a server client as "down"
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:237 +0x c.Put c.mutate(p)
/go/src/arista/vendor/github.com/tsuna/gohbase/client.go:264 +0x c.mutate c.SendRPC(m)
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:77 +0x  c.SendRPC c.sendRPCToRegion(rpc, reg) <- ALL ON SAME REGION interface
/go/src/arista/vendor/github.com/tsuna/gohbase/rpc.go:167 +0x c.sendRPCToRegion ServerError: c.clientDown(client)
/go/src/arista/vendor/github.com/tsuna/gohbase/caches.go:87 +0x rcc.clientDown rcc.m.Unlock()

Some observations from the stacks:

  • The Put calls have a context timeout of 10 minutes
  • No goroutine has been stuck for more than 2 minutes, indicating that there is no deadlock but rather that gohbase is retrying in a loop
  • All 8000 calls to sendRPCToRegion are referencing the same region info instance
  • All calls to c.clients.clientDown are referencing the same region client instance

goroutine stacks attached:
gohbase-goroutines.txt
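For reference, a goroutine dump in this text format can be captured with the standard runtime/pprof package. The sketch below is a generic way to produce such a file and is not necessarily how the attached dump was generated:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes full stack traces of all goroutines to path.
func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=2 produces one full stack trace per goroutine, the same format
	// as an unrecovered panic or a SIGQUIT dump.
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}

func main() {
	if err := dumpGoroutines("gohbase-goroutines.txt"); err != nil {
		log.Fatal(err)
	}
}
```

The same output is also available from the net/http/pprof handler at /debug/pprof/goroutine?debug=2 when that package is imported.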


dethi commented May 3, 2022

Thanks for the detailed bug report. I will have a look today.


dethi commented May 4, 2022

Additional info (may or may not be relevant):

  • HBase 2.4.1
  • RS crashing because of:
2022-03-28 01:41:38,880 ERROR [regionserver/XXXXXX:16201.logRoller] regionserver.HRegionServer: ***** ABORTING region server XXXXXXXX,16201,1648415126068: IOE in log roller

dethi added the bug label Feb 9, 2023
aaronbee added a commit to aaronbee/gohbase that referenced this issue Aug 1, 2023
The clients cache can get out of sync with what regions are actually
associated with a region server. A case has been seen where a region
points to a region client that doesn't exist in the clients map. When
this occurs, subsequent calls to clientDown are no-ops, and the bad
region continues to point to the incorrect region server.

clientDown will now ensure that the given region is made unavailable,
even if it is not found in the clients list of regions.

closes: tsuna#186
aaronbee added a commit that referenced this issue Aug 10, 2023
(same commit message as above; closes: #186)
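A minimal sketch of the fix direction described in the commit message, reusing the hypothetical regionInfo/regionClient types from the earlier sketch (so it belongs in that same illustrative file, not in gohbase, and is not the actual change in #228): clientDown marks the region that reported the error unavailable even when its client is no longer present in the clients cache, breaking the stale region-to-client association so the next RPC performs a fresh lookup.

```go
// clientDown sketch using the hypothetical types from the earlier example.
// Previously, if c was already gone from the clients cache, nothing was
// invalidated and reg kept pointing at the dead client; here reg is always
// invalidated so it gets re-resolved on the next RPC.
func clientDown(clients map[string]*regionClient, regions []*regionInfo,
	c *regionClient, reg *regionInfo) {
	// Invalidate every region still associated with this client.
	for _, r := range regions {
		if r.client == c {
			r.client = nil // forces a fresh region lookup on the next RPC
		}
	}
	// Ensure the region that observed the ServerError is invalidated even if
	// it was not found among the client's regions.
	reg.client = nil
	delete(clients, c.addr)
}
```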