Message loss: message fragments written across terms boundary are discarded in re-assembly #527

Closed
fredfp opened this issue Nov 4, 2024 · 2 comments

@fredfp
Contributor

fredfp commented Nov 4, 2024

Description

Artio version 0.154

When working as initiator, some incoming FIX messages are lost/discarded. Symptoms are:

  • in the context of a connected and working session
  • some incoming messages are lost:
    • logged as Enqueued in uk.co.real_logic.artio.protocol.GatewayPublication#saveMessage(org.agrona.DirectBuffer, int, int, int, long, long, int, long, uk.co.real_logic.artio.messages.MessageStatus, int, long, org.agrona.DirectBuffer, int)
    • never received by the library, i.e. not logged as Received in uk.co.real_logic.artio.library.LibraryPoller#onMessage
  • however, the next message (with the following sequence number) is correctly logged as Enqueued and then Received, which triggers a resend request.

Analysis

  • The same happens when replaying from the archive
  • when digging deeper:
    • the "lost" message is present in the archive
    • however it is discarded in io.aeron.FragmentAssembler#handleFragment
    • the second fragment is not considered because the header.termOffset() == builder.nextTermOffset() test fails, see more debug details below
  • Artio handles the fragmentation, while reassembling the fragments is handled by Aeron. I suspect the two no longer agree with each other. For instance, the failing check was added recently in Aeron: real-logic/aeron@54e8c73
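
To make the failing check concrete, here is a minimal sketch of a reassembler that only accepts a continuation fragment when it starts exactly where the previous one ended. It is a simplified model written for this issue, not Aeron's actual FragmentAssembler: the class name, the counters, the 32-byte header and alignment constants, and the 1000-byte first fragment are all assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;

/** Simplified model of fragment reassembly; NOT Aeron's FragmentAssembler. */
final class ContiguityCheckSketch
{
    static final int BEGIN_FLAG = 0x80;      // marks the first fragment of a message
    static final int END_FLAG = 0x40;        // marks the last fragment of a message
    static final int HEADER_LENGTH = 32;     // assumed frame header size
    static final int FRAME_ALIGNMENT = 32;   // assumed frame alignment

    private final ByteArrayOutputStream builder = new ByteArrayOutputStream();
    private boolean assembling;
    private int nextTermOffset;

    int delivered;  // messages fully reassembled
    int dropped;    // partial messages discarded

    static int align(final int value, final int alignment)
    {
        // alignment must be a power of two
        return (value + alignment - 1) & -alignment;
    }

    void onFragment(final int flags, final int termOffset, final byte[] payload)
    {
        final boolean begin = (flags & BEGIN_FLAG) != 0;
        final boolean end = (flags & END_FLAG) != 0;

        if (begin)
        {
            builder.reset();
            assembling = true;
        }
        else if (!assembling || termOffset != nextTermOffset)
        {
            // The continuation does not start where the previous fragment
            // ended, so the partially built message is thrown away.
            if (assembling)
            {
                dropped++;
            }
            assembling = false;
            builder.reset();
            return;
        }

        builder.write(payload, 0, payload.length);
        nextTermOffset = termOffset + align(HEADER_LENGTH + payload.length, FRAME_ALIGNMENT);

        if (end)
        {
            delivered++;
            assembling = false;
        }
    }

    public static void main(final String[] args)
    {
        final ContiguityCheckSketch assembler = new ContiguityCheckSketch();
        // Hypothetical first fragment ending at offset 4193728 of the old term.
        assembler.onFragment(BEGIN_FLAG, 4192672, new byte[1000]);
        // Second fragment shows up at offset 0 of the next term.
        assembler.onFragment(END_FLAG, 0, new byte[555]);
        System.out.println("delivered=" + assembler.delivered + " dropped=" + assembler.dropped);
    }
}
```

In this model, a padding frame sitting between the two fragments is enough to make the second one look non-contiguous, which is consistent with the message being silently discarded.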

More debug details

The data below seems to indicate that the problem does happen at a term boundary. For the discarded fragment, in io.aeron.FragmentAssembler#handleFragment, I get the following values:

  • termLength = 4194304
  • builder.nextTermOffset() = 4193728
  • header.termOffset() = 0, which indicates we just started a new term
  • length = 555 might be a clue:
    • builder.nextTermOffset() + length < termLength (4193728 + 555 = 4194283 < 4194304), which doesn't suggest reaching the end of the term (and having to fill the rest of it with padding).
    • could it be that Artio assumes the current message's fragments will fit in the rest of the term when that's not the case?
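
The arithmetic behind that hunch can be checked with a small sketch. It assumes a 32-byte frame header and 32-byte frame alignment, which I believe match Aeron's data frame layout but are not taken from the values above; the class and method names are made up for this example.

```java
/** Worked arithmetic for the values reported above; constants are assumptions. */
final class TermBoundaryArithmetic
{
    static final int HEADER_LENGTH = 32;     // assumed frame header size
    static final int FRAME_ALIGNMENT = 32;   // assumed frame alignment

    static int align(final int value, final int alignment)
    {
        // alignment must be a power of two
        return (value + alignment - 1) & -alignment;
    }

    /** Bytes left between the given offset and the end of the term. */
    static int remaining(final int termLength, final int termOffset)
    {
        return termLength - termOffset;
    }

    /** Space a frame occupies in the term: header plus payload, aligned. */
    static int alignedFrameLength(final int payloadLength)
    {
        return align(HEADER_LENGTH + payloadLength, FRAME_ALIGNMENT);
    }

    /** Whether a whole frame (not just its payload) fits in the rest of the term. */
    static boolean frameFits(final int termLength, final int termOffset, final int payloadLength)
    {
        return alignedFrameLength(payloadLength) <= remaining(termLength, termOffset);
    }

    public static void main(final String[] args)
    {
        final int termLength = 4194304;
        final int nextTermOffset = 4193728;
        final int length = 555;

        System.out.println(remaining(termLength, nextTermOffset));         // 576
        System.out.println(nextTermOffset + length < termLength);          // true
        System.out.println(alignedFrameLength(length));                    // 608
        System.out.println(frameFits(termLength, nextTermOffset, length)); // false
    }
}
```

Under those assumptions the 555-byte payload alone fits in the 576 bytes left, but the full frame needs 608 bytes and does not, so a check that undercounts the header could wrongly conclude the message fits.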
@fredfp fredfp changed the title Message loss: message fragments written across terms boundary are discorded in re-assembly Message loss: message fragments written across terms boundary are discarded in re-assembly Nov 4, 2024
@wojciech-adaptive
Collaborator

Thanks for the detailed bug report, fixed. Will do a release soon.

@fredfp
Contributor Author

fredfp commented Nov 5, 2024

Nice one, thank you!

grafstrom pushed a commit to VermicFinTech/artio that referenced this issue Nov 20, 2024
…ic#527)

The code to calculate required length was using the wrong HEADER_LENGTH
constant and undercounting it. In unlucky cases it might have looked like
a message would fit in the remaining space at the end of the term, while
actually it wouldn't. So instead of writing padding and then the fragmented
message, the first fragment would get written, then padding, and then the
remaining fragments in the next term. On the subscription side, a fragment
assembler would ignore such messages.

Fixes: 75b9fde ("fix term padding at the end of term buffers")
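
The write ordering described in the commit message can be sketched as follows. This is a toy model, not Artio's code: the 32-byte real header and alignment are assumptions, the undercounted header value of 8 is purely hypothetical (the commit does not say which constant was used), and the 1000-byte first fragment is invented so that the offsets line up with the values in the report.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the publisher-side padding decision; NOT Artio's implementation. */
final class TermPaddingSketch
{
    static final int FRAME_ALIGNMENT = 32;      // assumed frame alignment
    static final int REAL_HEADER_LENGTH = 32;   // assumed actual frame header size
    static final int WRONG_HEADER_LENGTH = 8;   // hypothetical undercounted value

    static int align(final int value, final int alignment)
    {
        // alignment must be a power of two
        return (value + alignment - 1) & -alignment;
    }

    /** Space the writer believes all fragments need, given a header length. */
    static int requiredLength(final int[] fragmentLengths, final int headerLength)
    {
        int total = 0;
        for (final int length : fragmentLengths)
        {
            total += align(headerLength + length, FRAME_ALIGNMENT);
        }
        return total;
    }

    /** Returns the sequence of writes, e.g. "PAD@offset" and "FRAG@offset". */
    static List<String> layout(
        final int termLength, final int termOffset, final int[] fragmentLengths, final int assumedHeaderLength)
    {
        final List<String> writes = new ArrayList<>();
        int offset = termOffset;

        // Up-front check: pad now if the whole message would not fit.
        if (requiredLength(fragmentLengths, assumedHeaderLength) > termLength - offset)
        {
            writes.add("PAD@" + offset);
            offset = 0;
        }

        for (final int length : fragmentLengths)
        {
            final int frameLength = align(REAL_HEADER_LENGTH + length, FRAME_ALIGNMENT);
            if (frameLength > termLength - offset)
            {
                // A single frame that does not fit forces padding mid-message.
                writes.add("PAD@" + offset);
                offset = 0;
            }
            writes.add("FRAG@" + offset);
            offset += frameLength;
        }
        return writes;
    }

    public static void main(final String[] args)
    {
        final int[] fragments = {1000, 555};
        // [PAD@4192672, FRAG@0, FRAG@1056]
        System.out.println(layout(4194304, 4192672, fragments, REAL_HEADER_LENGTH));
        // [FRAG@4192672, PAD@4193728, FRAG@0]
        System.out.println(layout(4194304, 4192672, fragments, WRONG_HEADER_LENGTH));
    }
}
```

With the correct header length the padding comes first and both fragments land contiguously in the next term. With the undercounted one, the first fragment is written, then padding at 4193728, then the second fragment at offset 0, which matches the nextTermOffset and termOffset values from the report.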