

Fix VM logs websocket endpoint getting stuck #736

Merged
olethanh merged 3 commits into main from ol-ALEPH-316-instance-log-get-stuck
Jan 15, 2025

Conversation

@olethanh (Collaborator) commented on Jan 8, 2025

Jira ticket: ALEPH-316

At some point, the websocket endpoint stops sending log entries to the client.

Original report from Arnaud:

"aleph instance logs" output gets stuck
VM started and is accessible, but logs seem frozen when querying them via CLI.

Analysis:

The log entries are in journald and retrievable, but they don't get put onto the internal queue. My guess is that there is a locking bug between the journal polling and the queue's max length: if the queue fills up and is then emptied, the polling doesn't resume unless a new message arrives in journald.

Solution:

Rework the journald-to-Queue implementation.
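
For reference, here is a minimal sketch of the general journald-to-queue pattern, assuming the python-systemd bindings (`systemd.journal.Reader`) and a bounded `asyncio.Queue`. It illustrates the approach, not the exact code in this PR:

```python
import asyncio

from systemd import journal  # python-systemd bindings


def attach_journal_to_queue(identifier: str, maxsize: int = 1000) -> tuple[asyncio.Queue, journal.Reader]:
    """Forward journald entries matching `identifier` to a bounded asyncio.Queue."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    reader = journal.Reader()
    reader.add_match(SYSLOG_IDENTIFIER=identifier)
    reader.seek_tail()
    reader.get_previous()  # position the cursor just before the next new entry

    def on_journal_readable() -> None:
        # process() must be called every time the fd wakes us up, otherwise the
        # poll state is never reset and we either busy-loop or stop being notified.
        if reader.process() != journal.APPEND:
            return
        entry = reader.get_next()
        while entry:
            try:
                queue.put_nowait(entry)
            except asyncio.QueueFull:
                # Drop the oldest entry rather than blocking the callback,
                # so draining the queue later cannot wedge the polling.
                queue.get_nowait()
                queue.put_nowait(entry)
            entry = reader.get_next()

    loop.add_reader(reader.fileno(), on_journal_readable)
    return queue, reader
```

The consumer then simply awaits `queue.get()`; the key point is that the fd callback always resets the journal poll state with `process()` and never blocks on a full queue.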

Note:

I have tried many combinations and variations of the logic and rewrote that code block multiple times; this version works as far as I can tell.

The interaction between asyncio select handling, journald fd resets, and async tasks is complicated and confusing.

Other versions broke in one way or another, sometimes subtly. For example:

  • Past logs working but new log entries not getting consumed
  • Queue (re)filling starting only when a new entry was created
  • Infinite tasks constantly popping up
  • Multiple tasks consuming the same fd
  • Polling never resetting and the consumer being called in a loop

Some of these issues are hidden by the TCP buffer on the websocket and only pop up after some time, depending on the network or how the client interacts.

So beware if you try to rewrite that logic.
Also, DO NOT TRUST ChatGPT or DeepSeek on this: they will produce nice-looking code that does not work properly.

To test:

Start an instance or a program on your dev CRN. Fetch the logs using `aleph instance logs` with the `--domain` option.

For testing, you can insert additional log entries using the `systemd-cat` command:

```bash
echo 1 $(date) | systemd-cat -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stdout
```

or

```bash
for ((i=1; i<=400; i++)); do
  echo $i
  echo $i | systemd-cat -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stdout
  sleep 1
done
```

The logs can be checked on the server side using:

```bash
journalctl -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stderr -t vm-63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace-stdout -f
```

To reproduce or debug issues, adding an `asyncio.sleep(0.2)` inside the websocket handling code usually helps, typically in the innermost loop of the `stream_logs` method in `src/aleph/vm/orchestrator/views/operator.py`.
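
As a hypothetical illustration (the real loop lives in `stream_logs`; the names and websocket API below are stand-ins), the artificial delay goes into the send loop like this:

```python
import asyncio


async def stream_logs(ws, queue: asyncio.Queue) -> None:
    # Simplified stand-in for the real websocket send loop.
    while True:
        entry = await queue.get()
        await asyncio.sleep(0.2)  # artificial delay to surface stuck-queue issues
        await ws.send_json({"message": entry})
```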

@olethanh requested a review from hoh on January 8, 2025 13:03
@olethanh changed the title from "Jira ticket: ALEPH-316" to "Fix VM logs websocket endpoint getting stuck" on Jan 8, 2025
@olethanh force-pushed the ol-ALEPH-316-instance-log-get-stuck branch from 0b2f7f4 to e4ae222 on January 8, 2025 13:04
codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.

Project coverage is 63.07%. Comparing base (c2ad82a) to head (7a50834).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aleph/vm/utils/logs.py | 72.00% | 7 Missing ⚠️ |
| tests/supervisor/test_log.py | 81.81% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
+ Coverage   62.81%   63.07%   +0.25%
==========================================
  Files          70       71       +1
  Lines        6314     6334      +20
  Branches      516      517       +1
==========================================
+ Hits         3966     3995      +29
+ Misses       2190     2179      -11
- Partials      158      160       +2
```


@olethanh force-pushed the ol-ALEPH-316-instance-log-get-stuck branch from dab36ae to 61642de on January 9, 2025 14:34
@olethanh requested a review from Psycojoker on January 10, 2025 08:58
@olethanh merged commit 732aa1a into main on Jan 15, 2025
22 checks passed