Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug that causes sources to stall #1136

Merged
merged 7 commits into from
Nov 10, 2020
Merged

Fix a bug that causes sources to stall #1136

merged 7 commits into from
Nov 10, 2020

Conversation

dominiklohmann
Copy link
Member

@dominiklohmann dominiklohmann commented Nov 3, 2020

📔 Description

To reproduce the bug, run the following commands without this fix:

vast start
while true; do
  sleep 30
  cat integration/data/suricat/eve.json
done | vast import suricata

PR to be taken over by @tobim, who as per our call has an idea for another improvement.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

n/t

@dominiklohmann dominiklohmann added the bug Incorrect behavior label Nov 3, 2020
@dominiklohmann dominiklohmann requested a review from a team November 3, 2020 10:45
@tobim tobim marked this pull request as ready for review November 3, 2020 14:10
@tobim
Copy link
Member

tobim commented Nov 3, 2020

It took me a while to understand the failed test. What happens is the reader returns with stalled after reading the initial lines. It is expected to return normally afterwards because the batch-timeout should trigger. With the changes in the first commit, that does not happen any more, because the subsequent rounds don't produce any data.

The original change was made to fix the blocking of the source actor. It was achieved by making read re-entrant. But while the time-point for last_batch_sent_ was moved to the readers state, the same wasn't done for the produced counter.

I believe the implementation of the readers can be simplified if read() gets a time-point at which it must exit from it's loop, and this time-point is managed in the source.

@dominiklohmann
Copy link
Member Author

I can't review this as I've opened the PR, but here's some feedback on your changes @tobim:

I'd like to see better documentation on ec::stalled and ec::timeout. I did not immediately grok the difference when reading the comments currently above their definition.

Generally, this works aside from the unit tests as you mentioned.

For the proposed reader change, can you show the desired API first?

@mavam
Copy link
Member

mavam commented Nov 3, 2020

For the proposed reader change, can you show the desired API first?

I would like to avoid spending more cycles here right now. We have the input load-balancing epic in the backlog and we'll revisit the source and reader design at that point.

dominiklohmann and others added 7 commits November 10, 2020 12:51
To reproduce the bug, run the following commands without this fix:

```
vast start
```

```
while true; do
  sleep 30
  cat integration/data/suricat/eve.json
done | vast import suricata
```
This was a bit too noisy for verbose logs.
Under certain conditions, events would only be pushed to the
importer after a long delay. This happens specifically when
starting a new souce.

This is caused by the high initial event request for the stream source.
Internally, CAF repeatedly invokes the generator callback until
all buffers are filled to the desired level, or no new data is
put on the downstream buffer. The initial hint for the generator is
100, which multiplied by the default table slice size means the
reader needs to produce 100000 events before the slices are forwarded
to the node.
@tobim
Copy link
Member

tobim commented Nov 10, 2020

I implemented a workaround that fixes the immediate problems at the cost of a bit extra complexity. We plan to change the interaction between the source and reader to make it more straightforward soon, I hope this is acceptable in the meantime.

@dominiklohmann
Copy link
Member Author

I tested this locally. @tobim feel free to approve. Changelog entry lgtm.

@tobim tobim merged commit 0beaa48 into master Nov 10, 2020
@tobim tobim deleted the topic/source-stalling branch November 10, 2020 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants