Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msquic crashing on ARM32 machines #3958

Closed
1 of 4 tasks
wfurt opened this issue Nov 14, 2023 · 9 comments · Fixed by #4208
Closed
1 of 4 tasks

msquic crashing on ARM32 machines #3958

wfurt opened this issue Nov 14, 2023 · 9 comments · Fixed by #4208
Labels
Bug: Platform A code bug in platform/TLS specific code. OS: Ubuntu Partner: .NET By or For the .NET team
Milestone

Comments

@wfurt
Copy link
Member

wfurt commented Nov 14, 2023

Describe the bug

It seems like regression from #3826 @nibanks.
I did binary search on changes and previous commit (cd49fbb) works fine.
This is likely root cause for dotnet/runtime#91757 causing .NET CI failures.
While reported on Alpine, I can reproduce on my Raspberry PI with Debian 10.

Affected OS

  • Windows
  • Linux
  • macOS
  • Other (specify below)

Additional OS information

I tried main and commit 972e677
Comment cd49fbb seems ok.

MsQuic version

main

Steps taken to reproduce bug

I can reproduce it running .NET Quic test suite.

Expected behavior

no crash and everything runs as before

Actual outcome

Thread 14 ".NET Long Runni" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x678da440 (LWP 10187)]
0x683dc210 in InterlockedDecrement (Addend=0x60) at /home/furt/github/msquic/src/inc/quic_platform_posix.h:115
115	    return __sync_sub_and_fetch(Addend, (long)1);
(gdb) bt
#0  0x683dc210 in InterlockedDecrement (Addend=0x60) at /home/furt/github/msquic/src/inc/quic_platform_posix.h:115
#1  RecvDataReturn (RecvDataChain=0x0) at /home/furt/github/msquic/src/platform/datapath_epoll.c:2067
#2  0x683ce240 in CxPlatRecvDataReturn (RecvDataChain=<optimized out>) at /home/furt/github/msquic/src/platform/datapath_linux.c:338
#3  0x68381814 in QuicConnRecvDatagrams (Connection=Connection@entry=0x73398d58, Packets=0x0, Packets@entry=0x6ef34450, PacketChainCount=PacketChainCount@entry=2, PacketChainByteCount=PacketChainByteCount@entry=2324, IsDeferred=<optimized out>,
    IsDeferred@entry=0 '\000') at /home/furt/github/msquic/src/core/connection.c:5848
#4  0x68381d0c in QuicConnFlushRecv (Connection=Connection@entry=0x73398d58) at /home/furt/github/msquic/src/core/connection.c:5929
#5  0x68385844 in QuicConnDrainOperations (Connection=Connection@entry=0x73398d58) at /home/furt/github/msquic/src/core/connection.c:7676
#6  0x68359900 in QuicWorkerProcessConnection (Worker=Worker@entry=0x72921c10, Connection=0x73398d58, ThreadID=<optimized out>, TimeNow=TimeNow@entry=0x678d9e10) at /home/furt/github/msquic/src/core/worker.c:510
#7  0x6835a668 in QuicWorkerLoop (Context=0x72921c10, State=0x678d9e10) at /home/furt/github/msquic/src/core/worker.c:668
#8  0x683c8654 in CxPlatRunExecutionContexts (Worker=Worker@entry=0x7290aad0, State=State@entry=0x678d9e10) at /home/furt/github/msquic/src/platform/platform_worker.c:395
#9  0x683c8a74 in CxPlatWorkerThread (Context=0x7290aad0) at /home/furt/github/msquic/src/platform/platform_worker.c:492
#10 0x76f7c310 in start_thread (arg=0x678da440) at pthread_create.c:477
#11 0x76cc6598 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Additional details

I don't tank if this is because of speed tf the machine or if this is because ARM has weak memory model. (or something completely else)

@wfurt
Copy link
Member Author

wfurt commented Nov 14, 2023

cc: @ManickaP @CarnaViire

@nibanks nibanks added this to the Future milestone Nov 14, 2023
@nibanks nibanks added OS: Ubuntu Bug: Platform A code bug in platform/TLS specific code. labels Nov 14, 2023
@nibanks
Copy link
Member

nibanks commented Nov 14, 2023

I'm not sure why there is a problem. As far as I can tell we aren't packing any structs or doing anything that should mess with alignment. The struct in question:

typedef struct DATAPATH_RX_IO_BLOCK {
    //
    // The pool owning this recv block.
    //
    CXPLAT_POOL* OwningPool;

    //
    // Represents the network route.
    //
    CXPLAT_ROUTE Route;

    //
    // Ref count of receive data/packets that are using this block.
    //
    long RefCount;

And the code that is crashing:

void
RecvDataReturn(
    _In_ CXPLAT_RECV_DATA* RecvDataChain
    )
{
    CXPLAT_RECV_DATA* Datagram;
    while ((Datagram = RecvDataChain) != NULL) {
        RecvDataChain = RecvDataChain->Next;
        DATAPATH_RX_PACKET* Packet =
            CXPLAT_CONTAINING_RECORD(Datagram, DATAPATH_RX_PACKET, Data);
        if (InterlockedDecrement(&Packet->IoBlock->RefCount) == 0) {     <==== HERE
            CxPlatPoolFree(Packet->IoBlock->OwningPool, Packet->IoBlock);
        }
    }
}

RefCount should be aligned properly for atomic access.

But this opens a bigger question about the importance of 32-bit ARM support (perhaps 32-bit support in general). For instance, Windows has completely dropped support. What is the practical expectation on Posix operating systems?

@nibanks
Copy link
Member

nibanks commented Nov 14, 2023

According to ChatGPT adding __attribute__((aligned(4))) to the field might help, but I have my doubts because it's already a 4-byte field and should be aligned as such. But @wfurt if you have the repro, perhaps you can try adding it to the struct?

typedef struct DATAPATH_RX_IO_BLOCK {
    //
    // The pool owning this recv block.
    //
    CXPLAT_POOL* OwningPool;

    //
    // Represents the network route.
    //
    CXPLAT_ROUTE Route;

    //
    // Ref count of receive data/packets that are using this block.
    //
    long __attribute__((aligned(4))) RefCount;

@wfurt
Copy link
Member Author

wfurt commented Nov 14, 2023

I'll give it try. It more looked like some null pointer to me. the Addend=0x60 looks pretty suspicious to me as pointer.

@wfurt
Copy link
Member Author

wfurt commented Nov 14, 2023

added assert

dotnet: /home/furt/github/msquic/src/platform/datapath_epoll.c:2069: RecvDataReturn: Assertion `Packet->IoBlock != NULL' failed.

when IoBlock is NULL, it does not fail with dereferencing as we are getting address of RefCount e.g. NULL + offsetoff(RefCount)

--- a/src/platform/datapath_epoll.c
+++ b/src/platform/datapath_epoll.c
@@ -2064,6 +2064,9 @@ RecvDataReturn(
         RecvDataChain = RecvDataChain->Next;
         DATAPATH_RX_PACKET* Packet =
             CXPLAT_CONTAINING_RECORD(Datagram, DATAPATH_RX_PACKET, Data);
+
+        assert(Packet != NULL);
+        assert(Packet->IoBlock != NULL);
         if (InterlockedDecrement(&Packet->IoBlock->RefCount) == 0) {
             CxPlatPoolFree(Packet->IoBlock->OwningPool, Packet->IoBlock);
         }

@ami-GS
Copy link
Contributor

ami-GS commented Mar 6, 2024

I tried run msquictest on ARM32 env (raspberry pi), but it immediately aborted from CxPlatWatchdog::WatchdogThreadCallback.
image

@ami-GS
Copy link
Contributor

ami-GS commented Mar 7, 2024

msquicplatform as well. msquiccoretest works. This means that there seems issue on platform layer
image

@ami-GS
Copy link
Contributor

ami-GS commented Mar 11, 2024

By shortening the Watchdog to Timeout/2 can avoid the "Aborted".

watchdog = new CxPlatWatchdog(Timeout);

msquicplatformtest passed with one error (TlsTest.HandshakeCertFromFile) with this change, but msquictest seems having more issues.
@nibanks Do we really need to debug this? I heard that we don't validate ARM32 environment even though we build the binary. And the prebuilt binary doesn't work on this environment due to libc version mismatch

image

@nibanks
Copy link
Member

nibanks commented Mar 11, 2024

What about the assert Tomas was hitting in datapath_epoll.c? Does that not repro any more?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug: Platform A code bug in platform/TLS specific code. OS: Ubuntu Partner: .NET By or For the .NET team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants