-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KYUUBI #647] Capture exceptions in async ExecuteStatement #646
Conversation
kyuubi-main/src/main/scala/org/apache/kyuubi/operation/ExecuteStatement.scala
Outdated
Show resolved
Hide resolved
kyuubi-main/src/main/scala/org/apache/kyuubi/operation/KyuubiOperation.scala
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #646 +/- ##
==========================================
+ Coverage 79.81% 80.09% +0.28%
==========================================
Files 120 120
Lines 4692 4693 +1
Branches 567 568 +1
==========================================
+ Hits 3745 3759 +14
+ Misses 627 610 -17
- Partials 320 324 +4
Continue to review full report at Codecov.
|
cc @ulysses-you, we shall fail 1.2.0-rc3 for this |
please link the target issue |
nice catch ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
<!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html 2. If the PR is related to an issue in https://github.com/NetEase/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'. 3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'. --> ### _Why are the changes needed?_ <!-- Please clarify why the changes are needed. For instance, 1. If you add a feature, you can talk about the use case of it. 2. If you fix a bug, you can clarify why it is a bug. --> When ExecuteStatement running in async, if engine broken(e.g. killed by yarn), the thread just crash without `setOperationException(ke)`, then client(like `beeline`) will hang. Close #647 ### _How was this patch tested?_ - [x] Add some test cases that check the changes thoroughly including negative and positive cases if possible - [ ] Add screenshots for manual tests if appropriate - [x] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request Manual tested on our yarn cluster: kill yarn app when executing statement. 1. `beeline` connect hiveserver2. Result: exit with error code 2 2. `beeline` connect kyuubi(before this patch). Result: hang 3. `beeline` connect kyuubi(after this patch). Result: exit with error code 2 Closes #646 from pan3793/async-hang. Closes #647 418fca9 [Cheng Pan] move engine crash test to KyuubiOperationPerConnectionSuite 03e5d73 [Cheng Pan] simplify ut 1651991 [Cheng Pan] tune test name 392dd03 [Cheng Pan] ut 728c040 [Cheng Pan] address comments 8031528 [Cheng Pan] Capture exceptions in async ExecuteStatement Authored-by: Cheng Pan <379377944@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8ce6d01) Signed-off-by: Kent Yao <yao@apache.org>
…ngine lost ### _Why are the changes needed?_ As described in #4457, when the backend engine becomes unresponsive, the beeline will hang because `client.getOperationStatus(_remoteOpHandle)` in `ExecuteStatement #waitStatementComplete()` would never receive any response. https://github.com/apache/kyuubi/blob/43309b86f1997b028e8fde5cb4e6449d818f4f73/kyuubi-server/src/main/scala/org/apache/kyuubi/operation/ExecuteStatement.scala#L101-L105 While the EngineAliveProbe can identify engine failure, it does not resolve the underlying issue of the thrift client waiting for a response, causing the beeline to remain unresponsive. The only way to resolve this state of suspension is by interrupting the thrift client thread. https://github.com/apache/kyuubi/blob/3d65f2711faa5dc9173130557e2d33adab04b5c7/kyuubi-server/src/main/scala/org/apache/kyuubi/client/KyuubiSyncThriftClient.scala#L84 This pull request introduces a new logic to the EngineAliveProbe feature, whereby upon detecting an engine failure, it deliberately closes the thrift connection. This creates an exception that can then be handled by the error handling mechanism introduced in #646, thereby allowing for a graceful shutdown of the session handle. ### _How was this patch tested?_ - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible I am very happy to add a test case but I am not aware of how to simulate a state of unresponsive. Otherwise, I can implement test like this: https://github.com/apache/kyuubi/blob/43309b86f1997b028e8fde5cb4e6449d818f4f73/kyuubi-server/src/test/scala/org/apache/kyuubi/operation/KyuubiOperationPerUserSuite.scala#L169 - [x] Add screenshots for manual tests if appropriate 1. Execute a query with beeline ``` $ beeline -u 'jdbc:hive2://<KyuubiServer>:10009/default?socketTimeout=60000;#spark.yarn.queue=shared' -n jpz3032 -p <password> -f test.sql ``` 2. Go to ResourceManager and Identify the ApplicationMaster <img width="2091" alt="Screenshot 2023-03-08 at 12 13 31" src="https://user-images.githubusercontent.com/4378066/223610552-6a35dc6a-891a-415b-b142-a2b104da2c1f.png"> 3. ssh into the host and restart the NodeManager ``` [<user><NodeManager> ~]$ sudo -i [root<NodeManager> :~]# /usr/sbin/reboot Connection to <NodeManager> closed by remote host. Connection to <NodeManager> closed. ``` 4. Check the kyuubi server log: the session is terminated with error ``` 2023-03-14 10:56:10.330 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:56:15.332 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:56:20.333 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE ``` After 3 consecutive alive probe fails, Engine is marked dead. ``` 2023-03-14 10:56:40.089 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: The engine[Some(application_1676285123186_0447)] alive probe fails org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetInfo(TCLIService.java:222) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetInfo(TCLIService.java:209) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1(KyuubiSyncThriftClient.scala:93) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1$adapted(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.run(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_362] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_362] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_362] at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_362] at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_362] at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_362] at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3] ... 23 more 2023-03-14 10:57:00.100 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: The engine[Some(application_1676285123186_0447)] alive probe fails org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetInfo(TCLIService.java:222) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetInfo(TCLIService.java:209) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1(KyuubiSyncThriftClient.scala:93) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1$adapted(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.run(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_362] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_362] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_362] at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_362] at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_362] at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_362] at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3] ... 23 more 2023-03-14 10:57:20.111 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: The engine[Some(application_1676285123186_0447)] alive probe fails org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetInfo(TCLIService.java:222) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetInfo(TCLIService.java:209) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1(KyuubiSyncThriftClient.scala:93) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1$adapted(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.run(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_362] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_362] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_362] at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_362] at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_362] at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_362] at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3] ... 23 more 2023-03-14 10:57:20.112 ERROR org.apache.kyuubi.client.KyuubiSyncThriftClient: Mark the engine[Some(application_1676285123186_0447)] not alive with no recent alive probe success: 60034 ms exceeds timeout 60000 ms ``` Now force closing the thrift client to generate an exception. ``` 2023-03-14 10:58:10.112 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: Removing Clients for TSessionHandle(sessionId:THandleIdentifier(guid:10 94 8A 85 00 DD 43 5D 85 58 91 84 61 81 A5 A4, secret:C2 EE 5B 97 3E A0 41 FC AC 16 9B D7 08 ED 8F 38)) 2023-03-14 10:58:10.114 INFO org.apache.kyuubi.operation.ExecuteStatement: Processing jpz3032's query[4b23657a-cacb-4ead-8abd-50920dd73de9]: RUNNING_STATE -> ERROR_STATE, time taken: 139.861 seconds 2023-03-14 10:58:10.119 WARN org.apache.kyuubi.operation.ExecuteStatement: Error closing THandleIdentifier(guid:65 95 B2 11 0C F1 45 7E B4 0B F8 D3 21 26 C1 05, secret:C2 EE 5B 97 3E A0 41 FC AC 16 9B D7 08 ED 8F 38): connection does not exist org.apache.kyuubi.KyuubiSQLException: connection does not exist at org.apache.kyuubi.KyuubiSQLException$.connectionDoesNotExist(KyuubiSQLException.scala:90) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient.withLockAcquired(KyuubiSyncThriftClient.scala:139) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient.withLockAcquiredAsyncRequest(KyuubiSyncThriftClient.scala:145) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient.closeOperation(KyuubiSyncThriftClient.scala:393) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.operation.KyuubiOperation.liftedTree3$1(KyuubiOperation.scala:136) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.operation.KyuubiOperation.close(KyuubiOperation.scala:135) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.operation.OperationManager.closeOperation(OperationManager.scala:126) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.session.AbstractSession.$anonfun$closeOperation$1(AbstractSession.scala:224) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.session.AbstractSession.withAcquireRelease(AbstractSession.scala:82) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.session.AbstractSession.closeOperation(AbstractSession.scala:222) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.service.AbstractBackendService.closeOperation(AbstractBackendService.scala:188) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.KyuubiServer$$anon$1.org$apache$kyuubi$server$BackendServiceMetric$$super$closeOperation(KyuubiServer.scala:138) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.BackendServiceMetric.$anonfun$closeOperation$1(BackendServiceMetric.scala:169) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.metrics.MetricsSystem$.timerTracing(MetricsSystem.scala:111) ~[kyuubi-metrics_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.BackendServiceMetric.closeOperation(BackendServiceMetric.scala:169) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.BackendServiceMetric.closeOperation$(BackendServiceMetric.scala:167) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.KyuubiServer$$anon$1.closeOperation(KyuubiServer.scala:138) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.service.TFrontendService.CloseOperation(TFrontendService.scala:498) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.hive.service.rpc.thrift.TCLIService$Processor$CloseOperation.getResult(TCLIService.java:1797) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Processor$CloseOperation.getResult(TCLIService.java:1782) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] 2023-03-14 10:58:10.130 INFO org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Received request of closing SessionHandle [7c5bd429-49a2-46c2-ba50-02825140e4c6] 2023-03-14 10:58:10.130 INFO org.apache.kyuubi.session.KyuubiSessionManager: jpz3032's session with SessionHandle [7c5bd429-49a2-46c2-ba50-02825140e4c6] is closed, current opening sessions 0 2023-03-14 10:58:10.131 INFO org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Finished closing SessionHandle [7c5bd429-49a2-46c2-ba50-02825140e4c6] ``` 5. Beeline Logs Beeline finished with errors (as expected) ``` 2023-03-14 10:56:15.332 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:56:20.333 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:58:10.114 INFO org.apache.kyuubi.operation.ExecuteStatement: Processing jpz3032's query[4b23657a-cacb-4ead-8abd-50920dd73de9]: RUNNING_STATE -> ERROR_STATE, time taken: 139.861 seconds Error: org.apache.kyuubi.KyuubiSQLException: Error operating ExecuteStatement: org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:475) at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:462) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$getOperationStatus$1(KyuubiSyncThriftClient.scala:377) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$withLockAcquiredAsyncRequest$2(KyuubiSyncThriftClient.scala:151) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ... 20 more at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69) at org.apache.kyuubi.operation.KyuubiOperation$$anonfun$onError$1.applyOrElse(KyuubiOperation.scala:76) at org.apache.kyuubi.operation.KyuubiOperation$$anonfun$onError$1.applyOrElse(KyuubiOperation.scala:57) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.kyuubi.operation.ExecuteStatement.waitStatementComplete(ExecuteStatement.scala:144) at org.apache.kyuubi.operation.ExecuteStatement.$anonfun$runInternal$1(ExecuteStatement.scala:161) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:475) at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:462) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$getOperationStatus$1(KyuubiSyncThriftClient.scala:377) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$withLockAcquiredAsyncRequest$2(KyuubiSyncThriftClient.scala:151) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ... 3 more Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ... 20 more (state=,code=0) Closing: 0: jdbc:hive2://lndcndevms1510.nhnjp.ism:10009/default?socketTimeout=60000;#spark.yarn.queue=shared ``` - [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request The kyuubi-server suite wasn't working even before this PR. Closes #4480 from risyomei/feature/kill-when-engine-fail. Closes #4480 f611022 [risyomei] Fixed Test messages(3) 3ed6886 [risyomei] Fixed Test messages(2) 30675d0 [risyomei] Fixed Test messages d80cfcb [Xieming LI] Merge remote-tracking branch 'origin/master' into feature/kill-when-engine-fail 3b320e1 [Xieming LI] Make sure thread is shut down 5ee9ddf [Xieming LI] Commit for work 8c854cc [Cheng Pan] Update kyuubi-server/src/main/scala/org/apache/kyuubi/client/KyuubiSyncThriftClient.scala 564e09c [Xieming LI] Fixed based on Comment ae1a025 [Xieming LI] Fixed unused import 1b50a59 [Xieming LI] rebase origin/master d9f9e3c [Xieming LI] Optimized logic based on PR comment a542285 [Xieming LI] Fixe based on comment 04b9ff3 [Xieming LI] Added an option to kill the thrift connection when Engine is considered dead 3571d68 [Xieming LI] Fixe based on comment e40fcfc [Xieming LI] Added an option to kill the thrift connection when Engine is considered dead Lead-authored-by: Xieming LI <risyomei@gmail.com> Co-authored-by: risyomei <risyomei@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: fwang12 <fwang12@ebay.com>
…ngine lost ### _Why are the changes needed?_ As described in #4457, when the backend engine becomes unresponsive, the beeline will hang because `client.getOperationStatus(_remoteOpHandle)` in `ExecuteStatement #waitStatementComplete()` would never receive any response. https://github.com/apache/kyuubi/blob/43309b86f1997b028e8fde5cb4e6449d818f4f73/kyuubi-server/src/main/scala/org/apache/kyuubi/operation/ExecuteStatement.scala#L101-L105 While the EngineAliveProbe can identify engine failure, it does not resolve the underlying issue of the thrift client waiting for a response, causing the beeline to remain unresponsive. The only way to resolve this state of suspension is by interrupting the thrift client thread. https://github.com/apache/kyuubi/blob/3d65f2711faa5dc9173130557e2d33adab04b5c7/kyuubi-server/src/main/scala/org/apache/kyuubi/client/KyuubiSyncThriftClient.scala#L84 This pull request introduces a new logic to the EngineAliveProbe feature, whereby upon detecting an engine failure, it deliberately closes the thrift connection. This creates an exception that can then be handled by the error handling mechanism introduced in #646, thereby allowing for a graceful shutdown of the session handle. ### _How was this patch tested?_ - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible I am very happy to add a test case but I am not aware of how to simulate a state of unresponsive. Otherwise, I can implement test like this: https://github.com/apache/kyuubi/blob/43309b86f1997b028e8fde5cb4e6449d818f4f73/kyuubi-server/src/test/scala/org/apache/kyuubi/operation/KyuubiOperationPerUserSuite.scala#L169 - [x] Add screenshots for manual tests if appropriate 1. Execute a query with beeline ``` $ beeline -u 'jdbc:hive2://<KyuubiServer>:10009/default?socketTimeout=60000;#spark.yarn.queue=shared' -n jpz3032 -p <password> -f test.sql ``` 2. Go to ResourceManager and Identify the ApplicationMaster <img width="2091" alt="Screenshot 2023-03-08 at 12 13 31" src="https://user-images.githubusercontent.com/4378066/223610552-6a35dc6a-891a-415b-b142-a2b104da2c1f.png"> 3. ssh into the host and restart the NodeManager ``` [<user><NodeManager> ~]$ sudo -i [root<NodeManager> :~]# /usr/sbin/reboot Connection to <NodeManager> closed by remote host. Connection to <NodeManager> closed. ``` 4. Check the kyuubi server log: the session is terminated with error ``` 2023-03-14 10:56:10.330 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:56:15.332 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:56:20.333 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE ``` After 3 consecutive alive probe fails, Engine is marked dead. ``` 2023-03-14 10:56:40.089 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: The engine[Some(application_1676285123186_0447)] alive probe fails org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetInfo(TCLIService.java:222) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetInfo(TCLIService.java:209) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1(KyuubiSyncThriftClient.scala:93) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1$adapted(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.run(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_362] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_362] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_362] at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_362] at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_362] at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_362] at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3] ... 23 more 2023-03-14 10:57:00.100 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: The engine[Some(application_1676285123186_0447)] alive probe fails org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetInfo(TCLIService.java:222) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetInfo(TCLIService.java:209) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1(KyuubiSyncThriftClient.scala:93) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1$adapted(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.run(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_362] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_362] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_362] at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_362] at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_362] at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_362] at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3] ... 23 more 2023-03-14 10:57:20.111 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: The engine[Some(application_1676285123186_0447)] alive probe fails org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetInfo(TCLIService.java:222) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetInfo(TCLIService.java:209) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1(KyuubiSyncThriftClient.scala:93) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.$anonfun$run$1$adapted(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.client.KyuubiSyncThriftClient$$anon$1.run(KyuubiSyncThriftClient.scala:87) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_362] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_362] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_362] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_362] at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_362] at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_362] at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_362] at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_362] at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3] ... 23 more 2023-03-14 10:57:20.112 ERROR org.apache.kyuubi.client.KyuubiSyncThriftClient: Mark the engine[Some(application_1676285123186_0447)] not alive with no recent alive probe success: 60034 ms exceeds timeout 60000 ms ``` Now force closing the thrift client to generate an exception. ``` 2023-03-14 10:58:10.112 WARN org.apache.kyuubi.client.KyuubiSyncThriftClient: Removing Clients for TSessionHandle(sessionId:THandleIdentifier(guid:10 94 8A 85 00 DD 43 5D 85 58 91 84 61 81 A5 A4, secret:C2 EE 5B 97 3E A0 41 FC AC 16 9B D7 08 ED 8F 38)) 2023-03-14 10:58:10.114 INFO org.apache.kyuubi.operation.ExecuteStatement: Processing jpz3032's query[4b23657a-cacb-4ead-8abd-50920dd73de9]: RUNNING_STATE -> ERROR_STATE, time taken: 139.861 seconds 2023-03-14 10:58:10.119 WARN org.apache.kyuubi.operation.ExecuteStatement: Error closing THandleIdentifier(guid:65 95 B2 11 0C F1 45 7E B4 0B F8 D3 21 26 C1 05, secret:C2 EE 5B 97 3E A0 41 FC AC 16 9B D7 08 ED 8F 38): connection does not exist org.apache.kyuubi.KyuubiSQLException: connection does not exist at org.apache.kyuubi.KyuubiSQLException$.connectionDoesNotExist(KyuubiSQLException.scala:90) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient.withLockAcquired(KyuubiSyncThriftClient.scala:139) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient.withLockAcquiredAsyncRequest(KyuubiSyncThriftClient.scala:145) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.client.KyuubiSyncThriftClient.closeOperation(KyuubiSyncThriftClient.scala:393) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.operation.KyuubiOperation.liftedTree3$1(KyuubiOperation.scala:136) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.operation.KyuubiOperation.close(KyuubiOperation.scala:135) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.operation.OperationManager.closeOperation(OperationManager.scala:126) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.session.AbstractSession.$anonfun$closeOperation$1(AbstractSession.scala:224) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.session.AbstractSession.withAcquireRelease(AbstractSession.scala:82) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.session.AbstractSession.closeOperation(AbstractSession.scala:222) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.service.AbstractBackendService.closeOperation(AbstractBackendService.scala:188) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.KyuubiServer$$anon$1.org$apache$kyuubi$server$BackendServiceMetric$$super$closeOperation(KyuubiServer.scala:138) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.BackendServiceMetric.$anonfun$closeOperation$1(BackendServiceMetric.scala:169) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.17.jar:?] at org.apache.kyuubi.metrics.MetricsSystem$.timerTracing(MetricsSystem.scala:111) ~[kyuubi-metrics_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.BackendServiceMetric.closeOperation(BackendServiceMetric.scala:169) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.BackendServiceMetric.closeOperation$(BackendServiceMetric.scala:167) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.server.KyuubiServer$$anon$1.closeOperation(KyuubiServer.scala:138) ~[kyuubi-server_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.kyuubi.service.TFrontendService.CloseOperation(TFrontendService.scala:498) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.hive.service.rpc.thrift.TCLIService$Processor$CloseOperation.getResult(TCLIService.java:1797) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.hive.service.rpc.thrift.TCLIService$Processor$CloseOperation.getResult(TCLIService.java:1782) ~[hive-service-rpc-3.1.3.jar:3.1.3] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36) ~[kyuubi-common_2.12-1.7.0-SNAPSHOT.jar:1.7.0-SNAPSHOT] at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362] 2023-03-14 10:58:10.130 INFO org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Received request of closing SessionHandle [7c5bd429-49a2-46c2-ba50-02825140e4c6] 2023-03-14 10:58:10.130 INFO org.apache.kyuubi.session.KyuubiSessionManager: jpz3032's session with SessionHandle [7c5bd429-49a2-46c2-ba50-02825140e4c6] is closed, current opening sessions 0 2023-03-14 10:58:10.131 INFO org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Finished closing SessionHandle [7c5bd429-49a2-46c2-ba50-02825140e4c6] ``` 5. Beeline Logs Beeline finished with errors (as expected) ``` 2023-03-14 10:56:15.332 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:56:20.333 INFO org.apache.kyuubi.operation.ExecuteStatement: Query[4b23657a-cacb-4ead-8abd-50920dd73de9] in RUNNING_STATE 2023-03-14 10:58:10.114 INFO org.apache.kyuubi.operation.ExecuteStatement: Processing jpz3032's query[4b23657a-cacb-4ead-8abd-50920dd73de9]: RUNNING_STATE -> ERROR_STATE, time taken: 139.861 seconds Error: org.apache.kyuubi.KyuubiSQLException: Error operating ExecuteStatement: org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:475) at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:462) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$getOperationStatus$1(KyuubiSyncThriftClient.scala:377) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$withLockAcquiredAsyncRequest$2(KyuubiSyncThriftClient.scala:151) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ... 20 more at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69) at org.apache.kyuubi.operation.KyuubiOperation$$anonfun$onError$1.applyOrElse(KyuubiOperation.scala:76) at org.apache.kyuubi.operation.KyuubiOperation$$anonfun$onError$1.applyOrElse(KyuubiOperation.scala:57) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.kyuubi.operation.ExecuteStatement.waitStatementComplete(ExecuteStatement.scala:144) at org.apache.kyuubi.operation.ExecuteStatement.$anonfun$runInternal$1(ExecuteStatement.scala:161) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_GetOperationStatus(TCLIService.java:475) at org.apache.hive.service.rpc.thrift.TCLIService$Client.GetOperationStatus(TCLIService.java:462) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$getOperationStatus$1(KyuubiSyncThriftClient.scala:377) at org.apache.kyuubi.client.KyuubiSyncThriftClient.$anonfun$withLockAcquiredAsyncRequest$2(KyuubiSyncThriftClient.scala:151) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ... 3 more Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ... 20 more (state=,code=0) Closing: 0: jdbc:hive2://lndcndevms1510.nhnjp.ism:10009/default?socketTimeout=60000;#spark.yarn.queue=shared ``` - [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request The kyuubi-server suite wasn't working even before this PR. Closes #4480 from risyomei/feature/kill-when-engine-fail. Closes #4480 f611022 [risyomei] Fixed Test messages(3) 3ed6886 [risyomei] Fixed Test messages(2) 30675d0 [risyomei] Fixed Test messages d80cfcb [Xieming LI] Merge remote-tracking branch 'origin/master' into feature/kill-when-engine-fail 3b320e1 [Xieming LI] Make sure thread is shut down 5ee9ddf [Xieming LI] Commit for work 8c854cc [Cheng Pan] Update kyuubi-server/src/main/scala/org/apache/kyuubi/client/KyuubiSyncThriftClient.scala 564e09c [Xieming LI] Fixed based on Comment ae1a025 [Xieming LI] Fixed unused import 1b50a59 [Xieming LI] rebase origin/master d9f9e3c [Xieming LI] Optimized logic based on PR comment a542285 [Xieming LI] Fixe based on comment 04b9ff3 [Xieming LI] Added an option to kill the thrift connection when Engine is considered dead 3571d68 [Xieming LI] Fixe based on comment e40fcfc [Xieming LI] Added an option to kill the thrift connection when Engine is considered dead Lead-authored-by: Xieming LI <risyomei@gmail.com> Co-authored-by: risyomei <risyomei@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: fwang12 <fwang12@ebay.com> (cherry picked from commit cee2f00) Signed-off-by: fwang12 <fwang12@ebay.com>
Why are the changes needed?
When ExecuteStatement running in async, if engine broken(e.g. killed by yarn), the thread just crash without
setOperationException(ke)
, then client(likebeeline
) will hang.Close #647
How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before make a pull request
Manual tested on our yarn cluster: kill yarn app when executing statement.
beeline
connect hiveserver2. Result: exit with error code 2beeline
connect kyuubi(before this patch). Result: hangbeeline
connect kyuubi(after this patch). Result: exit with error code 2