socket.cc recv() timeout #156

Closed
extf33 opened this issue Mar 30, 2023 · 23 comments

@extf33 commented Mar 30, 2023

I implemented a custom mutator, and it takes about 1-5 minutes for one mutation cycle.

When I run wtf with the custom mutator, I get the following error:

Could not receive size (-1)
Receive failed

wtf/src/wtf/socket.cc, lines 241 to 245 at 3ccad88:

if (const int R = recv(Fd, (char *)CurrentBuffer, sizeof(uint32_t), 0);
R == -1 || R != sizeof(uint32_t)) {
fmt::print("Could not receive size ({})\n", R);
return std::nullopt;
}

Is there any part of the code that sets a timeout or a flag such as O_NONBLOCK?

Actually, I already checked O_NONBLOCK.

int is_block = !(fcntl(Fd, F_GETFL, 0) & O_NONBLOCK);
// Result: 1, i.e. the socket is blocking
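
For reference, an explicit receive timeout would normally be set through the SO_RCVTIMEO socket option, like the hypothetical sketch below; I couldn't find anything like this in socket.cc either:

// Hypothetical sketch, not code from wtf: setting an explicit recv()
// timeout; once it elapses, recv() fails with EAGAIN/EWOULDBLOCK.
#include <sys/socket.h>
#include <sys/time.h>

bool SetRecvTimeout(const int Fd, const long Seconds) {
  timeval Tv = {};
  Tv.tv_sec = Seconds;
  return setsockopt(Fd, SOL_SOCKET, SO_RCVTIMEO, &Tv, sizeof(Tv)) == 0;
}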
@0vercl0k (Owner)

Oh interesting - if you start a server w/ no clients and let it run for 1-5 min, does the receive fail as well? If not, I don't think it comes from timing. Also, on what OS are you running into this?

And no, that recv call is meant to block (so O_NONBLOCK isn't set anywhere afaiaa).

Cheers

@0vercl0k (Owner)

Okay actually, just spinning up a server is not enough; I added a Sleep(5*60*1000); inside the fuzzer_dummy.cc module and it seems to work fine on Windows (no timeout).

Any special thing about your set-up?

Cheers

@extf33 (Author) commented Mar 31, 2023

I'm working on Ubuntu 22.04 with KVM; hmm, I'm gonna test it with sleep().

@extf33 (Author) commented Mar 31, 2023

I tested with a sleep(5*60*1000); and the same error occurred.

The strace output is as follows:

recvfrom(8, 0x563e8545c340, 4, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
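
So the recv() isn't timing out on its own; it's being interrupted by a SIGALRM and failing with EINTR. For reference, a signal-resilient read would retry on EINTR; a minimal sketch of that generic pattern (an illustration only, not what wtf does today):

// Illustration only, not wtf's actual code: a recv() wrapper that
// retries when a signal such as SIGALRM interrupts the call.
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t RecvRetryOnEintr(const int Fd, void *Buffer, const size_t Size) {
  ssize_t R = -1;
  do {
    R = recv(Fd, Buffer, Size, 0);
  } while (R == -1 && errno == EINTR);
  return R;
}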

@0vercl0k (Owner) commented Mar 31, 2023 via email

@0vercl0k (Owner) commented Apr 1, 2023

Okay this is weird - I can't seem to repro this on an Ubuntu running in WSL2:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

I wonder what's going on 🤔

@extf33 (Author) commented Apr 5, 2023

My bad, it was me.
I passed --limit 100 and thought it only applied to execution time.

@0vercl0k (Owner) commented Apr 5, 2023 via email

@0vercl0k (Owner)

Okay sorry for the lag, I just started to look into this.

Do you remember which backend you were using?

From what I can see in the code, the bxcpu backend stops executing and sends a timeout to the server (as expected), the KVM backend uses a SIGALRM to stop KVM, and the WHV backend uses a TimerQ_t to cancel the VCPU.
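
For reference, the KVM path looks roughly like the sketch below: an alarm(2) delivers a SIGALRM once the limit elapses, and the handler sets immediate_exit on the shared kvm_run structure so that KVM_RUN bails out with -EINTR. This is a simplified assumption of the shape of the backend code, not the actual kvm_backend.cc:

// Simplified, assumed sketch of a SIGALRM-based time limit for KVM;
// not the actual kvm_backend.cc code.
#include <csignal>
#include <linux/kvm.h>
#include <unistd.h>

static kvm_run *Run = nullptr; // mmap'ed from the vCPU fd elsewhere.

static void AlarmHandler(int) {
  // KVM polls this flag and makes KVM_RUN return -EINTR; KVM never
  // clears it, so user space is responsible for resetting it to 0.
  Run->immediate_exit = 1;
}

void ArmTimeLimit(const unsigned Seconds) {
  signal(SIGALRM, AlarmHandler);
  alarm(Seconds); // Delivers SIGALRM once Seconds have elapsed.
}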

Cheers

@extf33 (Author) commented Apr 21, 2023

I only tested on the KVM backend.

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

If I remember correctly, on the fuzzer side.

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

That's right, it matches what I saw :)

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

As with bochscpu, I thought about having the KVM/WHv backend (time) limit apply only to execution (not execution + mutation), but I think it will be fine as it is now.

@0vercl0k (Owner) commented Apr 21, 2023 via email

@extf33 (Author) commented Apr 21, 2023

Yes, that's the issue here.

> Okay so that makes sense that the fuzzer side receives a SIGALRM; this should stop KVM from running the testcase and the code will send a timeout result to the server; but the server shouldn't stop.

In the above scenario, SIGALRM does not occur while KVM is 'running' the testcase, but while the server is performing the mutation (before KVM executes the testcase).

@0vercl0k (Owner) commented Apr 21, 2023 via email

@0vercl0k (Owner) commented Apr 21, 2023 via email

@0vercl0k (Owner)

Okay, I set up a VM w/ nested virtualization to try this out, and I played with it some more. At first, I thought I had a repro when simulating a long InsertTestcase using a sleep like I recommended earlier. When doing that, the first testcase executed as expected, but the next ones would not sleep the full time and would get interrupted, which was weird. After reading the sleep manual more, I realized it doesn't play nice w/ signals, so it's not a good way to fake slow testcase generation.
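
A signal-proof delay can be built on nanosleep(2) instead, since it reports the unslept time when a signal interrupts it; a quick sketch of that (a hypothetical helper, not code from the repo):

// Hypothetical helper to fake slow testcase generation: it keeps
// sleeping through signal interruptions by resuming with the
// remaining time nanosleep() reports.
#include <cerrno>
#include <time.h>

void SleepThroughSignals(const time_t Seconds) {
  timespec Req = {Seconds, 0};
  timespec Rem = {};
  while (nanosleep(&Req, &Rem) == -1 && errno == EINTR) {
    Req = Rem; // Resume where the signal cut the delay short.
  }
}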

At that point, I removed the sleep and made a long loop to simulate a long workload. The first testcase executes as expected: the generation completes, the testcase is run and returns a cr3 result. Now here's what I think you observed: the second testcase takes a long time to generate like the first one, which is expected, but this time when it gets executed it ends up timing out before hitting the expected cr3 result and generates a timeout result.

# Server log
$ ../../src/build/wtf master --runs=100 --max_len=10000 --name dummy --target .
Seeded with 1181810723460784320
Iterating through the corpus..
Sorting through the 1 entries..
Running server on tcp://localhost:31337..
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 9223372036854.8m (1 nodes) lastcov: 2.0s crash: 0 timeout: 0 cr3: 0 uptime: 2.0s
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 0 uptime: 1.4min
#1 cov: 13 (+13) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 1.4min crash: 0 timeout: 0 cr3: 1 uptime: 2.9min
#2 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 2.8min crash: 0 timeout: 1 cr3: 1 uptime: 4.2min
#3 cov: 13 (+0) corp: 1 (4.0b) exec/s: 0.0 (1 nodes) lastcov: 4.2min crash: 0 timeout: 2 cr3: 1 uptime: 5.6min

# Fuzzer log
$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x55a4bf491970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.4min
Generating the testcase, sleeping for 60s..
Done
#2 cov: 13 exec/s: 0.0 lastcov: 1.4min crash: 0 timeout: 1 cr3: 1 uptime: 2.8min
Generating the testcase, sleeping for 60s..
Done
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 2 cr3: 1 uptime: 4.2min
Generating the testcase, sleeping for 60s..
Done
#4 cov: 13 exec/s: 0.0 lastcov: 4.2min crash: 0 timeout: 3 cr3: 1 uptime: 5.6min
Generating the testcase, sleeping for 60s..

At this point I think there's definitely a bug related to the limit feature in KVM; maybe it comes from the way I handled SIGALRM or something. Maybe the signal from testcase n-1 fires while testcase n is executing, which shuts it down earlier than it should.

Anyways, now that I have a repro I can investigate :)

Cheers

@0vercl0k (Owner)

Boy, this is a bad bug actually. The issue might be that I never reset immediate_exit to 0 🤦🏽‍♂️, which means any testcase that triggers a timeout will make all the following testcases trigger timeouts as well 😬.

If you still have your environment to repro, I'd love it if you could verify that the patch below fixes your issue:

--- a/src/wtf/kvm_backend.cc
+++ b/src/wtf/kvm_backend.cc
@@ -1454,7 +1454,7 @@ std::optional<TestcaseResult_t> KvmBackend_t::Run(const uint8_t *Buffer,
   Stop_ = false;
   TestcaseRes_ = Ok_t();
   Coverage_.clear();
-
+  Run_->immediate_exit = 0;
   while (!Stop_) {

It fixes it on my test bench:

$ sudo ../../src/build/wtf fuzz --backend kvm --name dummy --limit 1
[sudo] password for over:
The debugger instance is loaded with 6 items
Parsing coverage/HEVD.cov..
Applied 432 code coverage breakpoints
Setting debug register status to zero.
Setting debug register status to zero.
Setting mxcsr_mask to 0xffbf.
Resolved breakpoint 0xfffff8046f1c3880 at GPA 0x27c3880 aka HVA 0x563171252970
Dialing to tcp://localhost:31337/..
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#1 cov: 13 exec/s: 0.0 lastcov: 0.0s crash: 0 timeout: 0 cr3: 1 uptime: 1.6min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#2 cov: 13 exec/s: 0.0 lastcov: 1.5min crash: 0 timeout: 0 cr3: 2 uptime: 3.1min
Generating the testcase, sleeping for 60s..
Done
nt!SwapContext
#3 cov: 13 exec/s: 0.0 lastcov: 2.8min crash: 0 timeout: 0 cr3: 3 uptime: 4.4min

Cheers

@0vercl0k (Owner)

FYI I will land #161 & close this issue next week if I don't hear anything back :)

Cheers
