socket.cc recv() timeout #156

Closed
extf33 opened this issue Mar 30, 2023 · 23 comments

@extf33 commented Mar 30, 2023

I implemented a custom mutator, and it takes about 1-5 minutes for one mutation cycle.

When I run wtf with the custom mutator, I get the following error:

Could not receive size (-1)
Receive failed

wtf/src/wtf/socket.cc, lines 241 to 245 at 3ccad88:

if (const int R = recv(Fd, (char *)CurrentBuffer, sizeof(uint32_t), 0);
R == -1 || R != sizeof(uint32_t)) {
fmt::print("Could not receive size ({})\n", R);
return std::nullopt;
}

Is there any part of the code that sets a timeout or a flag such as O_NONBLOCK?

Actually, I already checked O_NONBLOCK.

int is_block = !(fcntl(Fd, F_GETFL, 0) & O_NONBLOCK);
// Result: 1, i.e. the socket is blocking
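
For reference, an explicit receive timeout would normally be set through the SO_RCVTIMEO socket option, like the hypothetical sketch below; I couldn't find anything like this in socket.cc either:

// Hypothetical sketch, not code from wtf: setting an explicit recv()
// timeout; once it elapses, recv() fails with EAGAIN/EWOULDBLOCK.
#include <sys/socket.h>
#include <sys/time.h>

bool SetRecvTimeout(const int Fd, const long Seconds) {
  timeval Tv = {};
  Tv.tv_sec = Seconds;
  return setsockopt(Fd, SOL_SOCKET, SO_RCVTIMEO, &Tv, sizeof(Tv)) == 0;
}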
@0vercl0k (Owner)

Oh interesting - if you start a server w/ no clients and let it run for 1-5 min, does the receive fail as well? If not, I don't think it comes from timing. Also, on what OS are you running into this?

And no, that recv call is meant to block (so O_NONBLOCK isn't set anywhere afaiaa).

Cheers

@0vercl0k (Owner)

Okay actually, just spinning up a server is not enough; I added a Sleep(5*60*1000); inside the fuzzer_dummy.cc module and it seems to work fine on Windows (no timeout).

Any special thing about your set-up?

Cheers

@extf33 (Author) commented Mar 31, 2023

I'm working on Ubuntu 22.04 with KVM; hmm, I'm gonna test it with sleep().

@extf33 (Author) commented Mar 31, 2023

I tested with a sleep(5*60*1000); and the same error occurred.

The strace output is as follows:

recvfrom(8, 0x563e8545c340, 4, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
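
So the recv() isn't timing out on its own; it's being interrupted by a SIGALRM and failing with EINTR. For reference, a signal-resilient read would retry on EINTR; a minimal sketch of that generic pattern (an illustration only, not what wtf does today):

// Illustration only, not wtf's actual code: a recv() wrapper that
// retries when a signal such as SIGALRM interrupts the call.
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t RecvRetryOnEintr(const int Fd, void *Buffer, const size_t Size) {
  ssize_t R = -1;
  do {
    R = recv(Fd, Buffer, Size, 0);
  } while (R == -1 && errno == EINTR);
  return R;
}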

@0vercl0k (Owner) commented Mar 31, 2023 via email

@0vercl0k (Owner) commented Apr 1, 2023

Okay this is weird - I can't seem to repro this on an Ubuntu running in WSL2:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

I wonder what's going on 🤔

@extf33 (Author) commented Apr 5, 2023

My bad, it was me.
I passed --limit 100 and thought it only applied to execution time.

@0vercl0k (Owner) commented Apr 5, 2023 via email

@0vercl0k (Owner)

Okay sorry for the lag, I just started to look into this.

Do you remember which backend you were using?

From what I can see in the code, the bxcpu backend stops executing and sends a timeout to the server (as expected), the KVM backend uses a SIGALRM to stop KVM, and the WHV backend uses a TimerQ_t to cancel the VCPU.
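
For reference, the KVM path looks roughly like the sketch below: an alarm(2) delivers a SIGALRM once the limit elapses, and the handler sets immediate_exit on the shared kvm_run structure so that KVM_RUN bails out with -EINTR. This is a simplified assumption of the shape of the backend code, not the actual kvm_backend.cc:

// Simplified, assumed sketch of a SIGALRM-based time limit for KVM;
// not the actual kvm_backend.cc code.
#include <csignal>
#include <linux/kvm.h>
#include <unistd.h>

static kvm_run *Run = nullptr; // mmap'ed from the vCPU fd elsewhere.

static void AlarmHandler(int) {
  // KVM polls this flag and makes KVM_RUN return -EINTR; KVM never
  // clears it, so user space is responsible for resetting it to 0.
  Run->immediate_exit = 1;
}

void ArmTimeLimit(const unsigned Seconds) {
  signal(SIGALRM, AlarmHandler);
  alarm(Seconds); // Delivers SIGALRM once Seconds have elapsed.
}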

Cheers

@extf33 (Author) commented Apr 21, 2023

I only tested on the KVM backend.

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

If I remember correctly, on the fuzzer side.

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

That's right, it matches what I saw :)

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

As with bochscpu, I thought about having the KVM/WHv backend (time) limit apply only to execution (not execution + mutation), but I think it will be fine as it is now.

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

Yes, that's the issue here.

> Okay so that makes sense that the fuzzer side receives a SIGALRM; this should stop KVM from running the testcase and the code will send a timeout result to the server; but the server shouldn't stop.

In the above scenario, SIGALRM does not occur while KVM is 'running' the testcase, but while the server is performing the mutation (before KVM executes the testcase).

@0vercl0k (Owner) commented Apr 21, 2023 via email

@0vercl0k (Owner) commented Apr 21, 2023 via email

@0vercl0k (Owner)

Okay, I set up a VM w/ nested virtualization to try this out, and I played with it some more. At first, I thought I had a repro when simulating a long InsertTestcase using a sleep like I recommended earlier. When doing that, the first testcase executed as expected, but the next ones would not sleep the full time and would get interrupted, which was weird. After reading the sleep manual more, I realized it doesn't play nice w/ signals, so it's not a good way to fake slow testcase generation.
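
A signal-proof delay can be built on nanosleep(2) instead, since it reports the unslept time when a signal interrupts it; a quick sketch of that (a hypothetical helper, not code from the repo):

// Hypothetical helper to fake slow testcase generation: it keeps
// sleeping through signal interruptions by resuming with the
// remaining time nanosleep() reports.
#include <cerrno>
#include <time.h>

void SleepThroughSignals(const time_t Seconds) {
  timespec Req = {Seconds, 0};
  timespec Rem = {};
  while (nanosleep(&Req, &Rem) == -1 && errno == EINTR) {
    Req = Rem; // Resume where the signal cut the delay short.
  }
}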

At that point, I removed the sleep and made a long loop to simulate a long workload. The first testcase executes as expected: the generation completes, the testcase is run and returns a cr3 result. Now here's what I think you observed: the second testcase takes a long time to generate like the first one, which is expected, but this time when it gets executed it ends up timing out before hitting the expected cr3 result and generates a timeout result.

# Server log
$ ../../src/build/wtf master --runs=100 --max_len=10000 --name dummy --target .
Seeded with 1181810723460784320
Iterating through the corpus..
Sorting through the 1 entries..
Running server on tcp://localhost:31337..
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 9223372036854.8m (1 nodes) lastcov: 2.0s crash: 0 timeout: 0 cr3: 0 uptime: 2.0s
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 0 uptime: 1.4min
#1 cov: 13 (+13) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 1 uptime: 2.9min
#2 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 2.8min crash: 0 timeout: 1 cr3: 1 uptime: 4.2min
#3 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 4.2min crash: 0 timeout: 2 cr3: 1 uptime: 5.6min

# Fuzzer log
$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x55a4bf491970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.4min
Generating the testcase, sleeping for 60s..
Done
#2 cov: 13 exec/s: 0.0 lastcov: 1.4min crash: 0 timeout: 1 cr3: 1 uptime: 2.8min
Generating the testcase, sleeping for 60s..
Done
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 2 cr3: 1 uptime: 4.2min
Generating the testcase, sleeping for 60s..
Done
#4 cov: 13 exec/s: 0.0 lastcov: 4.2min crash: 0 timeout: 3 cr3: 1 uptime: 5.6min
Generating the testcase, sleeping for 60s..

At this point I think there's definitely a bug related to the limit feature in KVM; maybe it comes from the way I handled SIGALRM or something. Maybe the signal from testcase n-1 fires while testcase n is executing, which shuts it down earlier than it should.

Anyways, now that I have a repro I can investigate :)

Cheers

@0vercl0k (Owner)

Boy, this is a bad bug actually. The issue might be that I never reset immediate_exit to 0 🤦🏽‍♂️, which means any testcase that triggers a timeout will make all the following testcases trigger timeouts as well 😬.

If you still have your environment to repro, I'd love it if you could verify that the patch below fixes your issue:

--- a/src/wtf/kvm_backend.cc
+++ b/src/wtf/kvm_backend.cc
@@ -1454,7 +1454,7 @@ std::optional<TestcaseResult_t> KvmBackend_t::Run(const uint8_t *Buffer,
   Stop_ = false;
   TestcaseRes_ = Ok_t();
   Coverage_.clear();
-
+  Run_->immediate_exit = 0;
   while (!Stop_) {

It fixes it on my test bench:

$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
[sudo] password for over:
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x563171252970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.6min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#2 cov: 13 exec/s: 0.0 lastcov: 1.5min crash: 0 timeout: 0 cr3: 2 uptime: 3.1min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 0 cr3: 3 uptime: 4.4min

Cheers

@0vercl0k (Owner)

FYI I will land #161 & close this issue next week if I don't hear anything back :)

Cheers
