IPsec pluto process crashes when the remote endpoint is unstable #2516

Closed
sridhargaddam opened this issue Jun 2, 2023 · 1 comment · Fixed by #2517
Labels
backport This change requires a backport to eligible release branches bug Something isn't working

Comments

@sridhargaddam
Member

What happened:

When a Kubernetes (K8s) cluster is under high resource utilization and the K8s infrastructure restarts the submariner-gateway pods on the remote cluster, the submariner-gateway pods on the local cluster crash. The same crash can also occur when the remote IPsec endpoint fails to respond to the IKE messages that are part of the IPsec negotiation process.

Jun  1 12:36:51.335934: | kernel: get_sa_bundle_info esp.de4b790b@3.133.126.34
Jun  1 12:36:51.335937: | xfrm: sendrecv_xfrm_msg() sending 18
Jun  1 12:36:51.335946: |] #33: "submariner-cable-sgaddam-aws-spoke2-20-0-31-85-1-1", type=ESP, add_time=1685623000, inBytes=968, outBytes=968, maxBytes=2^63B, id='20.0.31.85'
Jun  1 12:36:51.335953: | whack: stop: trafficstatus (fd@0x56173adbe4c8)
Jun  1 12:36:51.335966: | freeref fd@0x56173adbe4c8 (whack_handle_cb() +1038 /programs/pluto/rcv_whack.c)
Jun  1 12:36:55.041709: | timer_event_cb: processing EVENT_RETRANSMIT-event@0x56173ad8c7e8 for IKE SA #13 in state ESTABLISHED_IKE_SA
Jun  1 12:36:55.041727: | #13 deleting EVENT_RETRANSMIT
Jun  1 12:36:55.041736: | IKEv2 retransmit event
Jun  1 12:36:55.041743: | handling event EVENT_RETRANSMIT for 3.133.126.34 "submariner-cable-sgaddam-aws-spoke2-20-0-31-85-1-1" #13 attempt 1 of 0
Jun  1 12:36:55.041747: | and parent for 3.133.126.34 "submariner-cable-sgaddam-aws-spoke2-20-0-31-85-1-1" #13 keying attempt 0 of 0; retransmit 8
Jun  1 12:36:55.041752: | #13 STATE_V2_ESTABLISHED_IKE_SA: retransmits: current time 17492.78414
Jun  1 12:36:55.041755: | #13 STATE_V2_ESTABLISHED_IKE_SA: retransmits: retransmit count 7 exceeds limit? NO
Jun  1 12:36:55.041757: | #13 STATE_V2_ESTABLISHED_IKE_SA: retransmits: deltatime 64 exceeds limit? YES
Jun  1 12:36:55.041759: | #13 STATE_V2_ESTABLISHED_IKE_SA: retransmits: monotime 64.013435 exceeds limit? YES
Jun  1 12:36:55.041763: "submariner-cable-sgaddam-aws-spoke2-20-0-31-85-1-1" #13: STATE_V2_ESTABLISHED_IKE_SA: 60 second timeout exceeded after 7 retransmits.  No response (or no acceptable response) to our IKEv2 message
Jun  1 12:36:55.041768: ABORT: ASSERTION FAILED: switch (ike->sa.st_connection->config->dpd.action) case 0 (0x0) unexpected (retransmit_timeout_action() +100 /programs/pluto/ikev2_retry.c)
2023-06-01T12:36:55.297Z FTL ...3/pkg/log/logger.go:67 libreswan            Pluto exited: signal: aborted (core dumped)
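
The failure signature in the excerpt above can be picked out of a saved pluto log with a small filter. This is a minimal sketch, assuming the log is available as a file or on standard input; the helper name `pluto_abort_lines` is hypothetical and not part of Libreswan or Submariner.

```shell
#!/bin/sh
# Print only the log lines that carry the assertion-failure signature seen
# in this issue. Reads the files given as arguments, or standard input when
# no argument is passed.
# pluto_abort_lines is a hypothetical helper name used for illustration.
pluto_abort_lines() {
    cat "$@" | grep -F 'ABORT: ASSERTION FAILED'
}
```

For example, `pluto_abort_lines pluto.log` (with `pluto.log` standing in for wherever the log was saved) prints any matching lines and returns a non-zero status when there are none.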

What you expected to happen:

Pluto (and, indirectly, submariner-gateway) should not crash even when the remote endpoint is unresponsive or unstable.

How to reproduce it (as minimally and precisely as possible):
Deploy two KIND/OCP clusters and connect them via Submariner. Once the connections are established, run the following command on one of the clusters to periodically restart the submariner-gateway pod, then observe the submariner-gateway pod on the other cluster.

watch -n 20 "kubectl delete pod -n submariner-operator -l app=submariner-gateway"
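
A bounded variant of the command above can make the reproduction easier to observe from one terminal. This is only a sketch under stated assumptions: the kubeconfig context names `cluster1` and `cluster2` are hypothetical, and it assumes the pluto crash shows up as an increased container restart count on the other cluster.

```shell
#!/bin/sh
# Repeatedly delete the submariner-gateway pod on one cluster, then print the
# gateway container restart counts on the other cluster.
# REMOTE_CTX and LOCAL_CTX are hypothetical kubeconfig context names.
REMOTE_CTX="${REMOTE_CTX:-cluster2}"
LOCAL_CTX="${LOCAL_CTX:-cluster1}"
NS=submariner-operator
SELECTOR=app=submariner-gateway

# Same delete as the watch command above, aimed at the remote cluster.
restart_remote_gateway() {
    kubectl --context "$REMOTE_CTX" delete pod -n "$NS" -l "$SELECTOR"
}

# Restart counts of the gateway containers on the local cluster.
local_gateway_restarts() {
    kubectl --context "$LOCAL_CTX" get pod -n "$NS" -l "$SELECTOR" \
        -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
}

# Run $1 rounds, waiting $2 seconds after each delete
# (20 seconds matches the watch interval above).
reproduce() {
    rounds="${1:-10}"
    interval="${2:-20}"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        restart_remote_gateway
        sleep "$interval"
        echo "round $((i + 1)): local gateway restarts: $(local_gateway_restarts)"
        i=$((i + 1))
    done
}
```

Calling `reproduce 10 20` runs ten rounds at the 20-second interval; a rising restart count on the local cluster would be consistent with the crash described in this issue.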

Environment:

  • Cloud provider or hardware configuration: Can be reproduced on KIND as well as OCP Clusters
  • Others: Reproduced using Submariner 0.14 release images.
@sridhargaddam sridhargaddam added the bug Something isn't working label Jun 2, 2023
@sridhargaddam sridhargaddam added the backport This change requires a backport to eligible release branches label Jun 2, 2023
sridhargaddam added a commit to sridhargaddam/submariner that referenced this issue Jun 2, 2023
Currently, submariner-gateway pod while invoking the whack
commands does not set any dpdaction flags. So the default
dpdaction of disabled was applied. While using this action,
when the remote endpoint is not responding within a certain
duration, some problematic code path in Libreswan was getting
executed and leading to crash. The proper fix would be to use
an updated Libreswan, but as a workaround we can explicitly
set the dpdaction=hold to avoid hitting the problematic code
paths.

Related PR in libreswan:
libreswan/libreswan@c7a6113

Fixes: submariner-io#2516
Signed-off-by: Sridhar Gaddam <sgaddam@redhat.com>
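
The workaround described in the commit message amounts to making sure every whack invocation carries an explicit dpdaction. The sketch below is illustrative only and is not the Submariner change itself: the helper name `with_dpd_hold` is hypothetical, and the `--dpdaction=hold` spelling is an assumption derived from the commit message's "dpdaction=hold".

```shell
#!/bin/sh
# Print the given whack arguments one per line, appending --dpdaction=hold
# unless the caller already chose a dpdaction.
# with_dpd_hold is a hypothetical helper; the flag spelling is an assumption
# based on the commit message above.
with_dpd_hold() {
    has_dpd=no
    for arg in "$@"; do
        case "$arg" in
            --dpdaction|--dpdaction=*) has_dpd=yes ;;
        esac
        printf '%s\n' "$arg"
    done
    if [ "$has_dpd" = no ]; then
        printf '%s\n' '--dpdaction=hold'
    fi
}
```

Leaving an explicitly chosen dpdaction untouched keeps the helper from overriding a deliberate setting; it only fills in the value when none was given, which is the case the commit message identifies as leading to the crash.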
sridhargaddam added a commit that referenced this issue Jun 2, 2023
Currently, submariner-gateway pod while invoking the whack
commands does not set any dpdaction flags. So the default
dpdaction of disabled was applied. While using this action,
when the remote endpoint is not responding within a certain
duration, some problematic code path in Libreswan was getting
executed and leading to crash. The proper fix would be to use
an updated Libreswan, but as a workaround we can explicitly
set the dpdaction=hold to avoid hitting the problematic code
paths.

Related PR in libreswan:
libreswan/libreswan@c7a6113

Fixes: #2516
Signed-off-by: Sridhar Gaddam <sgaddam@redhat.com>
Co-authored-by: Yossi Boaron <yboaron@redhat.com>
@sridhargaddam
Member Author

Backport PRs:
release-0.15: #2520
release-0.14: #2521
release-0.13: #2522

tpantelis pushed a commit that referenced this issue Jun 2, 2023
Currently, submariner-gateway pod while invoking the whack
commands does not set any dpdaction flags. So the default
dpdaction of disabled was applied. While using this action,
when the remote endpoint is not responding within a certain
duration, some problematic code path in Libreswan was getting
executed and leading to crash. The proper fix would be to use
an updated Libreswan, but as a workaround we can explicitly
set the dpdaction=hold to avoid hitting the problematic code
paths.

Related PR in libreswan:
libreswan/libreswan@c7a6113

Fixes: #2516
Signed-off-by: Sridhar Gaddam <sgaddam@redhat.com>
Co-authored-by: Yossi Boaron <yboaron@redhat.com>