Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sock_async_event: fix race-condition #13438

Merged

Conversation

miri64
Copy link
Member

@miri64 miri64 commented Feb 21, 2020

Contribution description

If a new event is fired during the execution of the event callback, event->type might change. However as event->type is set to 0 after the execution of the callback, that leads to it being 0 on the next round of the event handler's execution, leading in the event getting lost.

Testing procedure

  • make -C tests/gnrc_sock_async_event flash test should still work with a board of your choice.
  • To confirm the race condition
    • try tests/lwip with lwip: provide sock_async support #13427 merged and start a TCP server with tcp server start 1337
    • connect to that server from your host machine using netcat (e.g. nc fe80::0200:dead:beef:affe%tapbr0 1337 when using native)
    • With this PR the closing of the connection when hitting Ctrl+C in netcat is registered and printed to stdout. You can reconnect to the TCP server by starting a new netcat process.
    • Without this PR the closing of the connection when hitting Ctrl+C in netcat is not registered. Reconnecting to the TCP server is not possible (accept will error with an -ENOMEM).

Issues/PRs references

Discovered in #13427.

If a new event is fired during the execution of the event callback,
`event->type` might change. However as `event->type` is set to 0 after
the execution of the callback, that leads to it being 0 on the next
round of the event handler's execution, leading in the event getting
lost.
@miri64 miri64 added Area: network Area: Networking Process: needs backport Integration Process: The PR is required to be backported to a release or feature branch Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Feb 21, 2020
@miri64 miri64 added this to the Release 2020.04 milestone Feb 21, 2020
@miri64 miri64 added Area: pkg Area: External package ports and removed Process: needs backport Integration Process: The PR is required to be backported to a release or feature branch labels Feb 21, 2020
@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

Backport for 2020.01 probably not needed as no support for TCP for sock_async exists there

Copy link
Contributor

@gschorcht gschorcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@miri64 miri64 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Feb 21, 2020
Copy link
Contributor

@gschorcht gschorcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot reproduce the race conditions. Therefore I have set it to 'request changes' to avoid accidental merging.

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

I cannot reproduce the race conditions. Therefore I have set it to 'request changes' to avoid accidental merging.

I don't understand your reasoning for blocking this then... Even if you can't reproduce, isn't it obvious that a race condition can occur and that this PR is a fix for that? Why should a merge be bad, even if accidentally?

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

(can't reproduce it with #13427 either btw, most likely because my latest fixes hardened the backend itself against the race condition. Nevertheless, it still can occur and should be fixed.)

@gschorcht
Copy link
Contributor

I cannot reproduce the race conditions. Therefore I have set it to 'request changes' to avoid accidental merging.

Maybe, I missed something but I can't reproduce the race condition with PR #13427, neither with esp_wifi nor with native. It works as expected.

Without this PR I can observe following behavior when I leave nc with Ctrl-C.

TCP client [fe80::d4e9:caff:fe69:da71]:36076 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36076 reset

I can reconnect afterwards.

With this PR on top of PR #13427, I observer the following:

TCP client [fe80::d4e9:caff:fe69:da71]:36294 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36294 reset
TCP connection to [100:0:100:0:200:0:44ef:5f56]:22112 reset

But, at least I can reconnect afterwards.

@gschorcht
Copy link
Contributor

(can't reproduce it with #13427 either btw, most likely because my latest fixes hardened the backend itself against the race condition. Nevertheless, it still can occur and should be fixed.)

As said in #13438 (comment), the behavior seems to be more strange with this PR than before.

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

With #13427 resetted to aaa644b I still can reproduce the race condition.

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

With this PR on top of PR #13427, I observer the following:

TCP client [fe80::d4e9:caff:fe69:da71]:36294 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36294 reset
TCP connection to [100:0:100:0:200:0:44ef:5f56]:22112 reset

But, at least I can reconnect afterwards.

This should have been fixed with 24ce2ba. Are you sure you used the most recent version?

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

So... to prevent the garbage output and still have the race condition you need to revert 54069b1...

@gschorcht
Copy link
Contributor

gschorcht commented Feb 21, 2020

With this PR on top of PR #13427, I observer the following:

TCP client [fe80::d4e9:caff:fe69:da71]:36294 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36294 reset
TCP connection to [100:0:100:0:200:0:44ef:5f56]:22112 reset

But, at least I can reconnect afterwards.

This should have been fixed with 24ce2ba. Are you sure you used the most recent version?

OK, that's my fault, the last commit in PR #13427 was 54069b1. I will check again.

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

Here is what I got:

$ git log --oneline
dcc7f8832 Revert "fixup! lwip: provide sock_async support"
24ce2ba16 (origin/lwip/enh/sock_async) fixup! tests/lwip: adapt test to use sock_async
54069b14b fixup! lwip: provide sock_async support
aaa644b38 fixup! tests/lwip: adapt test to use sock_async
386c3f1b1 lwip: fix style errors found by Vera++
53e6a8190 tests/lwip: adapt test to use sock_async
b56479427 lwip: provide sock_async support
188cc20ac sock_async_event: fix typo in tcp_queue function
5f938bd49 lwip: introduce sock base-type

[…]

RIOT native @ tap0
/home/mlenders/Repositories/RIOT-OS/RIOT/tests/lwip/bin/native/tests_lwip.elf tap0 
RIOT native interrupts/signals initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

main(): This is RIOT! (Version: 2020.04-devel-593-gdcc7f-lwip/enh/sock_async)
RIOT lwip test application
> ifconfig
ifconfig
ET_00:  inet6 fe80::682e:4fff:fe3d:dfcb
> tcp server start 1337
tcp server start 1337
Success: started TCP server on port 1337
> TCP client [fe80::98be:cbff:fe6f:6cae]:34928 connected
Error on TCP accept [-12]

after I did

$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C
$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C

Then

$ git cherry-pick d26355fbb41cdc0cb6a5af17901cd7e95a58bd8a
Auto-merging sys/net/sock/async/event/sock_async_event.c
[lwip/enh/sock_async 7647e854b] sock_async_event: fix race-condition
 Date: Fri Feb 21 15:50:30 2020 +0100
 1 file changed, 7 insertions(+), 1 deletion(-)

I get

RIOT native @ tap0
make: Entering directory '/home/mlenders/Repositories/RIOT-OS/RIOT/tests/lwip'
/home/mlenders/Repositories/RIOT-OS/RIOT/tests/lwip/bin/native/tests_lwip.elf tap0 
RIOT native interrupts/signals initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

main(): This is RIOT! (Version: 2020.04-devel-594-g7647e-lwip/enh/sock_async)
RIOT lwip test application
> ifconfig
ifconfig
ET_00:  inet6 fe80::682e:4fff:fe3d:dfcb

> tcp server start 1337
tcp server start 1337
Success: started TCP server on port 1337
> TCP client [fe80::98be:cbff:fe6f:6cae]:35086 connected
TCP connection to [fe80::98be:cbff:fe6f:6cae]:35086 reset
TCP client [fe80::98be:cbff:fe6f:6cae]:35092 connected
TCP connection to [fe80::98be:cbff:fe6f:6cae]:35092 reset

Again, after I used nc twice

$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C
$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

(edited comment as I copy-pasted something wrongly 😅)

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

Note the version strings that should make clear that I use the version I am actually claiming to use. Is this justification enough to have a bug fix for my own code merged 😉?

@gschorcht
Copy link
Contributor

I can confirm that #13427 resetted to aaa644b gives an error and I can't reconnect:

TCP client [fe80::d4e9:caff:fe69:da71]:38516 connected
Error on TCP accept [-12]

I can further confirm that this PR on top of aaa644b gives some garbage but I can reconnect:

TCP client [fe80::d4e9:caff:fe69:da71]:38620 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:38620 reset
TCP connection to [100:0:100:0:200:0:44df:6356]:22116 reset

I can also confirm that this PR on top of the last commit of PR #13427 works as expected.

TCP client [fe80::d4e9:caff:fe69:da71]:38664 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:38664 reset

@gschorcht
Copy link
Contributor

gschorcht commented Feb 21, 2020

BTW, I saw this Error on TCP accept [-12] with an earlier version of PR #13427 when I was playing a bit with two instances of nc. I can still reproduce it. Could you check please whether it is releated this this PR or PR #13427?

riot#> tcp server start 1337

term1> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> TCP client [fe80::d4e9:caff:fe69:da71]:41788 connected

term2> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> Error on TCP accept [-12]
term2> ^C

term1> ^C
riot#> TCP connection to [fe80::d4e9:caff:fe69:da71]:41788 reset

term1> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> TCP client [fe80::d4e9:caff:fe69:da71]:41792 connected

term1> ^C
riot#> NOTHING HAPPENS

term1> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> Error on TCP accept [-12]

Copy link
Contributor

@gschorcht gschorcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense. The problem desribed in #13438 (comment) seems not to be related to this PR. ACK

@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

I am pretty sure this PR has nothing to do with it or in the worst case just reveals a problem with #13427. From a code perspective this PR should be ready to go.

@gschorcht
Copy link
Contributor

I am pretty sure this PR has nothing to do with it or in the worst case just reveals a problem with #13427. From a code perspective this PR should be ready to go.

Yes, but would you try to reproduce the behavior described in #13438 (comment)?

@gschorcht gschorcht merged commit 8dc53cf into RIOT-OS:master Feb 21, 2020
@miri64
Copy link
Member Author

miri64 commented Feb 21, 2020

I am pretty sure this PR has nothing to do with it or in the worst case just reveals a problem with #13427. From a code perspective this PR should be ready to go.

Yes, but would you try to reproduce the behavior described in #13438 (comment)?

Already was able to :-). It is definitely something being wrong with #13427 (the callback_arg member of the netconn at the second connect stays unset which is why the callback is skipped, as the sock can't be determined... still trying to figure out why).

@miri64 miri64 deleted the sock_async_event/fix/race-condition branch February 21, 2020 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Area: pkg Area: External package ports CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants