Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SIGTERM can't terminate PubSub::listen issue by add cancellation token support. #606

Merged
merged 19 commits into from
May 19, 2022

Conversation

liuh-80
Copy link
Contributor

@liuh-80 liuh-80 commented Apr 19, 2022

Why I did it

There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
https://github.com/Azure/sonic-swss-common/issues/603

How I did it

Add following class:
1. CancellationToken: this class will help exist the infinite loops when SIGTERM or other signal happen.
2. SignalHandlerHelper: Provide a native signal handler.

How to verify it

  1. manually test.
  2. Pass all test case.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liuh-80 liuh-80 changed the title Add cancellation token Fix SIGTERM can't terminate PubSub::listen issue by add cancellation token support. Apr 19, 2022
@liuh-80 liuh-80 marked this pull request as ready for review April 20, 2022 08:41
common/cancellationtoken.h Outdated Show resolved Hide resolved
common/cancellationtoken.cpp Outdated Show resolved Hide resolved
common/cancellationtoken.h Outdated Show resolved Hide resolved
common/signalhandlerhelper.cpp Outdated Show resolved Hide resolved

void SignalHandlerHelper::RegisterSignalHandler(int signalNumber)
{
signal(signalNumber, SignalHandlerHelper::OnSignal);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it override the signal handler defined by the application? In case of python, does it mean that the user defined signal handler in python won't be executed?

Copy link
Contributor Author

@liuh-80 liuh-80 Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because linux can only have 1 signal handler per-application.
If customer need define their signal handler, then customer also need cancel the token when signal happen in their handler, this class just a helper class.

common/select.cpp Outdated Show resolved Hide resolved
common/select.cpp Outdated Show resolved Hide resolved
@liuh-80 liuh-80 requested a review from qiluo-msft April 21, 2022 02:11
common/configdb.h Outdated Show resolved Hide resolved
common/configdb.h Outdated Show resolved Hide resolved
@qiluo-msft
Copy link
Contributor

qiluo-msft commented Apr 22, 2022

ret = poll_descriptors(c, timeout);

Did you change old behavior?


In reply to: 1106076032


Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

common/select.h Outdated Show resolved Hide resolved
@liuh-80
Copy link
Contributor Author

liuh-80 commented Apr 22, 2022

ret = poll_descriptors(c, timeout);

Did you change old behavior?

Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

No, I didn't. when timeout happen, current code still will try read again.

@qiluo-msft
Copy link
Contributor

ret = poll_descriptors(c, timeout);

For non terminating signals (like siguser), original behavior is keeping while loop, but the new behavior is return SIGNALINTR. I think this is unexpected.


In reply to: 1106076032


Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

@liuh-80
Copy link
Contributor Author

liuh-80 commented Apr 27, 2022

ret = poll_descriptors(c, timeout);

For non terminating signals (like siguser), original behavior is keeping while loop, but the new behavior is return SIGNALINTR. I think this is unexpected.

In reply to: 1106076032

Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

Will discussion offline, the reason is:

  1. without a customize signal handler we don't know which signal break the epoll_wait.
  2. When poll_descriptors break by siguser, it will return to python code and python code will re-run poll_descriptors method, so from the python code side, the behavior not change.

However. there are some UT failed, seems related with this change, will check reason.

Update: in latest code, only sigint can break the while loop, all other signal will not break the while loop.

@liuh-80
Copy link
Contributor Author

liuh-80 commented Apr 28, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Copy link
Contributor Author

liuh-80 commented Apr 28, 2022

Add SignalHandlerHelper class back because:

  1. There are UT break because in last version any signal will break poll_descriptors() method.
  2. So the solution should only allow SIGTERM to break poll_descriptors().
  3. However, according to linux signal system, it's impossible to check which signal happening when there is no signal handler.

So we have to add this class back.

Also here is demo code in python3:

from swsscommon import swsscommon
swsscommon.SignalHandlerHelper.registerSignalHandler(swsscommon.SIGNAL_INT)
c=swsscommon.ConfigDBConnector()
c.subscribe('A', lambda a: None)
c.connect()
c.listen(None)
^C>>>

common/pubsub.cpp Outdated Show resolved Hide resolved
common/pubsub.h Outdated Show resolved Hide resolved
common/select.cpp Outdated Show resolved Hide resolved
@liuh-80
Copy link
Contributor Author

liuh-80 commented May 11, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Copy link
Contributor Author

liuh-80 commented May 11, 2022

Code coverage does not reach target, will add UT to cover new code.

liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this pull request May 11, 2022
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this pull request May 11, 2022
…ystemd (#2133)" (#2166)

- What I did
This reverts commit a5f55aa.

Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.

- How I did it
git revert a5f55aa

- How to verify it
Run tests
@liuh-80
Copy link
Contributor Author

liuh-80 commented May 16, 2022

@stepanblyschak, could you please check this PR again? all issue fixed.

common/pubsub.cpp Outdated Show resolved Hide resolved
@qiluo-msft
Copy link
Contributor

ret = poll_descriptors(c, timeout);

Could you fix for non terminating signals (like siguser)?


In reply to: 1109033970


Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

@liuh-80
Copy link
Contributor Author

liuh-80 commented May 18, 2022

ret = poll_descriptors(c, timeout);

Could you fix for non terminating signals (like siguser)?

In reply to: 1109033970

Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

This already fixed, currently only sigint can break the while loop in select.cpp, all other signal will not break while loop.

@liuh-80 liuh-80 merged commit 7ae22be into sonic-net:master May 19, 2022
itamar-talmon pushed a commit to itamar-talmon/sonic-swss-common that referenced this pull request Jul 19, 2022
…token support. (sonic-net#606)

Why I did it
    There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
    sonic-net#603

How I did it
    Add following class:
    1. CancellationToken: this class will help exist the infinite loops when SIGTERM or other signal happen.
    2. SignalHandlerHelper: Provide a native signal handler.

How to verify it
   1. manually test.
   2. Pass all test case.
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this pull request Aug 3, 2023
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants