-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix VM logs websocket endpoint getting stuck #736
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
olethanh
changed the title
Jira ticket: ALEPH-316
Fix VM logs websocket endpoint getting stuck
Jan 8, 2025
Jira ticket: ALEPH-316 At some point the websocket endpoint stopped sending log entries to client Original report from Arnaud: "aleph instance logs" output gets stuck VM started and is accessible, but logs seem frozen when querying them via CLI. Analysis: The log entries are in journald and retrievable but they don't get put at on the internal queue. My guess is that there is a lock bug between the journal polling and the queue max length, probably that if the queue is full and get emptied, the polling doesn't reset unless new message arrive in journald. Solution: Rework the journald to Queue implementation. Note: I have tried a lot of combinaison and variation of the logic and rewrote that code block multiple times, this version works as far as I can tell. The interaction between asyncio select handling, journald fd reset and async task is complicated and confusing. Other versions broke in some way or others, sometimes subtle. For examples: - Past logs working but new log entries not getting consumed - Queue (re)filling starting only when a new entrie was created - infinite tasks always popping up - multiple task consuming the same fd - Polling never resetting and consumer being called in a loop Some of theses issues get hidden by the TCP buffer on the websocket and only popup after some time or depending on the network or how the client interact. So beware if you try to rewrite that logic. Also DO NOT TRUST ChatGPT or Deepseek on this, they will produce nice looking code that do not works properly. To test: Start an instance or a program on your dev CRN. Fetch the logs using `aleph instance logs` with the --domain option. For testing you can insert additional log entries using the systemd-cat command ```bash echo 1 $(date)|systemd-cat -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stdout ``` or ```bash for ((i=1; i<=400; i++)); do echo $i echo $i |systemd-cat -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stdout sleep 1 done ``` Log server side can be checked using ``` journalctl -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stderr -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stdout -f ``` For reproducing or debugging issues adding an asyncio.sleep(0.2) inside the websocket handling code usually helps, most inner loop of `stream_logs` method inside src/aleph/vm/orchestrator/views/operator.py
olethanh
force-pushed
the
ol-ALEPH-316-instance-log-get-stuck
branch
from
January 8, 2025 13:04
0b2f7f4
to
e4ae222
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #736 +/- ##
==========================================
+ Coverage 62.81% 63.07% +0.25%
==========================================
Files 70 71 +1
Lines 6314 6334 +20
Branches 516 517 +1
==========================================
+ Hits 3966 3995 +29
+ Misses 2190 2179 -11
- Partials 158 160 +2 ☔ View full report in Codecov by Sentry. |
Psycojoker
reviewed
Jan 8, 2025
Psycojoker
reviewed
Jan 8, 2025
Psycojoker
reviewed
Jan 8, 2025
Psycojoker
reviewed
Jan 8, 2025
olethanh
force-pushed
the
ol-ALEPH-316-instance-log-get-stuck
branch
from
January 9, 2025 14:34
dab36ae
to
61642de
Compare
hoh
approved these changes
Jan 14, 2025
Psycojoker
approved these changes
Jan 15, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Jira ticket: ALEPH-316
At some point the websocket endpoint stopped sending log entries to client
Original report from Arnaud:
"aleph instance logs" output gets stuck
VM started and is accessible, but logs seem frozen when querying them via CLI.
Analysis:
The log entries are in journald and retrievable but they don't get put at on the internal queue. My guess is that there is a lock bug between the journal polling and the queue max length, probably that if the queue is full and get emptied, the polling doesn't reset unless new message arrive in journald.
Solution:
Rework the journald to Queue implementation.
Note:
I have tried a lot of combinaison and variation of the logic and rewrote that code block multiple times, this version works as far as I can tell.
The interaction between asyncio select handling, journald fd reset and async task is complicated and confusing.
Other versions broke in some way or others, sometimes subtle. For examples:
Some of theses issues get hidden by the TCP buffer on the websocket and only popup after some time or depending on the network or how the client interact.
So beware if you try to rewrite that logic.
Also DO NOT TRUST ChatGPT or Deepseek on this, they will produce nice looking code that do not works properly.
To test:
Start an instance or a program on your dev CRN. Fetch the logs using
aleph instance logs
with the --domain option.For testing you can insert additional log entries using the systemd-cat command
or
Log server side can be checked using
For reproducing or debugging issues adding an asyncio.sleep(0.2) inside the websocket handling code
usually helps, most inner loop of
stream_logs
method inside src/aleph/vm/orchestrator/views/operator.py