Message is lost in dlt-daemon. #631

Open
tesaki opened this issue May 14, 2024 · 4 comments
@tesaki

tesaki commented May 14, 2024

Messages are lost when more than 65,536 bytes of data accumulate in the kernel's receive buffer while the dlt-daemon is receiving message data.
When the message type is APP_MSG, the dlt-daemon reads up to 65,535 bytes from the kernel's receive buffer in one pass (see the dlt_receiver_init_global_buffer() function).
The read is cut off at that size without regard to message boundaries, so the last message in the data that was read is often incomplete. The daemon treats this incomplete message as an error and discards it.

The following patch introduced a change to discard incomplete messages:
commit: dlt-daemon: Handle partial message parsing in receiver buffer

Reverting this patch resolves this issue.
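
To illustrate the failure mode described above, here is a minimal sketch of the alternative to discarding: parse only the complete messages and keep the incomplete tail for the next read. It assumes a hypothetical framing (a 2-byte big-endian payload length followed by the payload), which is NOT the real DLT header layout, and the function names consume_complete() and keep_tail() are made up for this example. It is not the daemon's actual code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing for illustration only (NOT the real DLT header):
 * 2-byte big-endian payload length, followed by the payload itself. */
#define HDR_SIZE 2u

/* Count the complete messages in buf[0..len) and return how many bytes
 * they occupy. Everything after that offset is an incomplete tail. */
size_t consume_complete(const uint8_t *buf, size_t len, size_t *n_msgs)
{
    size_t off = 0;

    *n_msgs = 0;
    while (len - off >= HDR_SIZE) {
        size_t payload = ((size_t)buf[off] << 8) | buf[off + 1];

        if (len - off - HDR_SIZE < payload)
            break; /* header is present, but the payload was cut off */
        off += HDR_SIZE + payload;
        (*n_msgs)++;
    }
    return off;
}

/* Move the incomplete tail to the start of the buffer so that the next
 * read can append the missing bytes. Returns the number of bytes kept. */
size_t keep_tail(uint8_t *buf, size_t len, size_t consumed)
{
    size_t tail = len - consumed;

    memmove(buf, buf + consumed, tail);
    return tail;
}
```

With this approach, a read that ends in the middle of a message loses nothing: the caller passes the next read() the address buf + tail, and the message completes on the following pass. Discarding the tail instead loses one message per truncated read.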

To confirm this behavior, set DaemonFIFOSize to 1MiB in dlt.conf and run the following log-sending program.

#include <stdio.h>
#include <stdlib.h>
#include <dlt.h>

DLT_DECLARE_CONTEXT(con_exa1);

int main(void)
{
    DLT_REGISTER_APP("TAPP", "Test App");

    DLT_REGISTER_CONTEXT(con_exa1, "TAPP", "Test Context");

    /* Send 5000 messages back-to-back so that more than 64 KiB of data
     * can accumulate in the kernel's receive buffer. */
    int i;
    for (i = 0; i < 5000; i++) {
        DLT_LOG(con_exa1, DLT_LOG_INFO, DLT_UINT32(i),
         DLT_STRING("TEST ################################################"));
    }

    DLT_UNREGISTER_CONTEXT(con_exa1);

    DLT_UNREGISTER_APP();

    return 0;
}
@tesaki
Author

tesaki commented Feb 27, 2025

I initially proposed increasing the dlt-daemon read size, but I realized this was incorrect. Therefore, I have updated the issue description accordingly.

Hello @minminlittleshrimp and @Suprathik-N
I would like to ask for some clarification regarding the patch "commit: dlt-daemon: Handle partial message parsing in receiver buffer":
This patch description states that partial messages may be received by the daemon and should be ignored. However, our investigation (*1) has indicated that this situation does not occur in a Linux environment.

Could you kindly provide details on the environment in which this patch was tested?
Additionally, could you share some background on how this patch was incorporated?

Based on our findings, this issue does not appear to occur in practice, leading me to believe that the patch may not be necessary.

*1: Investigation Findings
In a Linux environment, we confirmed that writev() and write() do not perform partial writes.

  • Case 1: Pipe (FIFO)
    According to man 7 pipe, for non-blocking mode:

    • If the data size ≤ PIPE_BUF, the entire message is written or an error occurs.
    • If the data size > PIPE_BUF, a partial write may occur.

    On Linux, PIPE_BUF = 4096 bytes. Additionally, the dlt-daemon library limits the message size to 1,424 bytes or less.
    → Since the message size is always within PIPE_BUF, partial messages will never be received when using a pipe.

  • Case 2: Unix Domain Socket
    According to man 2 write, a write may transfer only part of the data. However, an inspection of the Linux kernel code reveals that write() either writes all bytes or fails entirely. The dlt-daemon uses a non-blocking, stream-type Unix domain socket, following this call chain:
    writev() → ..(snip).. → net/unix/af_unix.c:unix_stream_sendmsg() → net/core/sock.c:sock_alloc_send_pskb()
    In sock_alloc_send_pskb(), available space is checked, and memory pages are allocated. When allocating a page, if there is any free space, the implementation ensures the full message size is allocated, regardless of buffer capacity constraints. (Reference: Linux kernel v6.13.3 source)
    This behavior has been in place since Linux kernel v4.0 (link).
    → Therefore, even when using a Unix domain socket on Linux, partial messages should never be received by dlt-daemon.

@minminlittleshrimp
Collaborator

Hello @lti9hc @Bichdao021195
Kindly have a look.
Thanks

@Suprathik-N
Contributor

@tesaki @minminlittleshrimp The issue was observed on QNX, and with FIFO as the IPC mechanism.
I'm afraid we overlooked testing with UDS enabled, on Linux.

The error flow is as follows -

  1. The PIPE_BUF of the FIFO is 4k bytes, and an App is generating messages with high frequency
  2. The rate at which PIPE_BUF is freed depends on how quickly the message data is read into the DLT-Daemon receive-buffer
  3. Because the max message size in DLT is 1424 bytes, one can at best put 2 full messages + partial data of the 3rd message into PIPE_BUF
  4. When this happens, the check between the message length & number of bytes written returned by writev() fails
  5. The partial message is put back into the DLT-library ringbuffer, and the housekeeper thread resends it next time
  6. But the partial data that is already inside the DLT-Daemon receive-buffer was not being handled

Here's a small snippet that you can add in line 166 of dlt_user_shared.c to reproduce the error scenario -

#if 1
    static int cnt = 0;
    cnt++;
    if (cnt % 5 == 0) {
        // shorten the data length to reproduce the scenario that all data could not be sent
        iov[2].iov_len = len3 / 2;

        // send partial message
        writev(handle, iov, 3);

        // force error state. Will be written into ring buffer and resent from housekeeper thread
        return DLT_RETURN_PIPE_FULL;
    }
#endif

@tesaki
Author

tesaki commented Feb 28, 2025

Hello @Suprathik-N

Thank you for your response. I see, the issue occurred on QNX.

However, based on the QNX documentation, it does not seem likely that partial messages would be sent.

According to the QNX documentation, the behavior of 'write()' is described as follows:
(Could this behavior vary depending on the QNX version?)

If the O_NONBLOCK flag is set, write requests are handled differently in the following ways:

  • The write() function does not block.
  • Write requests for PIPE_BUF bytes or less either succeed completely and return nbytes, or return -1 with errno set to EAGAIN.

Since the FIFO is set to O_NONBLOCK, and 1424 < PIPE_BUF = 4k, the above conditions are satisfied, meaning that partial writes should not occur. The write of the third message should therefore result in an error rather than a partial write.

Unless the application packs multiple messages into a single write that fully occupies the PIPE_BUF size, it seems unlikely that a message would be partially sent. How does the application send its messages?
