Update to 1.2.0 failed #275
Comments
Ehm... why?
Please keep the default value. Please share |
Explanation: this used to be a workaround for an issue with the DB cleanup that delayed the heartbeat to systemd. Cleanup is now more aggressive and takes place in a sub-process, so this is no longer an issue. The daemon MUST send its heartbeats regularly, and we want it to fail if it doesn't |
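For illustration, here is a minimal sketch of how a daemon can send such heartbeats to systemd over the notify socket. This shows the generic sd_notify protocol only, not the vSphereDB daemon's actual PHP implementation, and the interval is purely illustrative:

```python
import os
import socket
import time

def sd_notify(message: str) -> None:
    """Send a notification datagram to systemd's notify socket, if one is set."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return  # not started by systemd with Type=notify / WatchdogSec
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.sendall(message.encode())

sd_notify("READY=1")          # tell systemd the service is up
while True:
    sd_notify("WATCHDOG=1")   # heartbeat; must arrive well within WatchdogSec
    time.sleep(5)             # illustrative interval
```

If a heartbeat does not arrive within the configured WatchdogSec, systemd kills and restarts the service, which is exactly the "fail if it doesn't" behaviour described above.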
I had an old, half-deleted test vCenter in the DB and other problems with the previous version.
Changed back, reloaded and restarted.
|
Ok, got it :-) And you really removed ALL of them? The log should then contain a hint that the schema has been applied/migrated.
Looks good to me. Is there something in the directory while the daemon is running? What happens if you stop the daemon and run it manually in the foreground? |
I had to use force ;) I think I saw something like this in the log.
I did run it as root before I ran it as icingavsphered, and now I see data in the web interface. The only error I still see is one I already had before the update. While running as icingavsphered I got:
|
Memory limit is now 1024M and |
Storage Pods: that's fine. It shouldn't log an error - but this doesn't cause problems. I'm wondering about the memory usage: how big are your vCenters? And what's that "daemon keep alive message" you're talking about? |
The message was on the module's "front page" and
36 hosts |
CentOS/RHEL apply a memory limit to the CLI as well; that's not the case on other systems. For that amount of objects, 1 GB is not enough. With v1.2 we already changed how we fetch data from your vCenter (or ESXi hosts?): we're now fetching it in chunks. However, processing/DB sync still happens with all objects in memory. There are plans to improve this, to get a much lower memory footprint. All we need to keep in memory is a reference to object names/IDs, to be able to purge outdated objects after all chunks have been processed (a rough sketch of this idea follows after this comment). Please raise the limit to 4 GB (it shouldn't need that much) and watch its usage for a few days. Open questions:
|
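As a rough illustration of the plan described above (keep only object IDs in memory while syncing chunk by chunk, then purge whatever was not seen), here is a minimal Python sketch; it is not the module's actual code, and every name in it is hypothetical:

```python
from typing import Iterable, List, Set

def sync_in_chunks(chunks: Iterable[List[dict]], db) -> None:
    """Upsert objects chunk by chunk, remembering only their IDs, then purge."""
    seen_ids: Set[str] = set()

    for chunk in chunks:                 # each chunk is a small batch of objects
        for obj in chunk:
            db.upsert(obj)               # hypothetical: insert or update one row
            seen_ids.add(obj["id"])      # keep only the identifier in memory
        # the chunk itself can be released before the next one is fetched

    # After all chunks have been processed, purge objects that no longer exist.
    for stale_id in set(db.all_ids()) - seen_ids:   # hypothetical helper
        db.delete(stale_id)
```

The point is that peak memory is bounded by one chunk plus the set of IDs, instead of the full object list.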
This message appears when the daemon stops. If this happens while it is running its normal tasks, then this needs some tuning. That shouldn't be the case here, because, as you can see, after modifying 9722 elements the next log entry appears one second later. Initially, the problem was for sure the memory limit. If you believe that this alone does not explain the symptoms, it could be that filling the DB with the initial data took too long. When the daemon gets terminated in an ungraceful way (by systemd or because of a memory limit), we won't see this in the logs shown in the UI. Could you please share the log lines from |
Could use some tuning then. Where can I share the logs in private? |
Aye aye, sir!
Feel free to send them to thomas@gelf.net, but I can also share an upload-only link, if you prefer that. |
The last message was 46 s ago, then it vanished again. Since I increased the memory_limit from 1 GB (somehow it still worked while using 2.14 GB?) to 4 GB as you suggested, I also tried running it via systemd again, but had no success. Now it keeps cycling: it doesn't work for about a minute and a few seconds, then it gets killed and is dead for about 30 seconds.
While it isn't working, I get this message. Stopping is also not free of "scary" messages; I noticed that while I was running it manually and used Ctrl+C to exit, and also now using
|
It's getting killed by systemd again and again; I just do not understand why. I have been able to simulate this by locking some DB tables, which gives me the same effect (sketched below). Do you consider your DB fast or slow? Are there other heavy services running on it? Could you please restart it and run |
NB: |
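The simulation mentioned above can be reproduced with plain SQL: holding a write lock on a table the daemon needs makes its queries stall, which is enough to trip the watchdog. A hedged sketch; the table name is hypothetical and not the module's real schema:

```sql
-- In a separate client session, while the daemon is running:
LOCK TABLES some_vspheredb_table WRITE;   -- hypothetical table name
SELECT SLEEP(120);                        -- hold the lock for two minutes
UNLOCK TABLES;
```

While the lock is held, the daemon's reads and writes on that table block, its event loop stalls, and the heartbeat to systemd stops arriving.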
I created a branch with more debug logs; please install this one: feature/more-debug-logs. Then replace the
...and then:
|
I don't know how to measure the DB performance. It's a MariaDB Galera Cluster as far as I know.
I have sent you the new log. |
For those who are wondering: the main problem here is a DB with very high latency. The setup uses a Galera cluster, which exists to satisfy specific guarantees; speed and low latency are not what it has been built for. As a result, what was expected to take only a fraction of a second took 5-10 seconds to complete. This was longer than the allowed timeout for our heartbeats to systemd. While we raised those limits for this setup, we consider doing so a bad thing: queries running that long block the whole event loop, and this messes up our scheduling. The linked issues #281, #287 and #288 are related, and have been solved to reduce pressure in such a scenario. While not officially supporting Galera, we never stated that we don't - even if I would always discourage using it. Similar problems might also occur on supported databases with very slow disks (or with high network latency). That's why I also created #282 as a task for a future major release. Closing this one, as it seems to be solved so far. @slalomsk8er: it would be great if you could give the current master branch a try. I'll probably release v1.2.1 tomorrow morning. Please keep your higher WatchDog timeout in place (see the sketch below). |
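If you need to keep a higher watchdog timeout, the usual way is a systemd drop-in rather than editing the packaged unit file. A sketch, assuming the unit is named icinga-vspheredb.service (check the name on your system) and that 90 seconds suits your environment:

```ini
# /etc/systemd/system/icinga-vspheredb.service.d/override.conf
# (created via: systemctl edit icinga-vspheredb.service)
[Service]
WatchdogSec=90
```

After saving, run systemctl daemon-reload and restart the service so the new timeout takes effect.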
@Thomas-Gelf I can check at 09:15 tomorrow. |
Thank you!! |
I have seen a similar problem after an upgrade from v1.1.0 to v1.2.1. The watchdog timer constantly killed the daemon, SQL queries queued up and the system load went high. Even with a watchdog timer of 360 s the daemon was killed. Looking at the queued SQL queries, I found two suspects (the vcenter_uuid value is an example):
We have a total of over one million rows in these tables. EXPLAIN gives the hint that both queries do not use an index and have to read half a million rows. After adding indices
both queries are answered directly. The daemon has been running stably with the original watchdog timer value (10 s) for a couple of hours now. Maybe you can test this approach too. By the way, we do not use Galera. |
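The exact index definitions are not quoted above, so as a purely hypothetical sketch of the kind of change described (the table name is illustrative, not the real vspheredb schema, and the UUID literal is just an example):

```sql
-- Add an index so lookups by vcenter_uuid no longer scan half the table:
ALTER TABLE some_vsphere_history_table
    ADD INDEX idx_vcenter_uuid (vcenter_uuid);

-- Verify with EXPLAIN that the query now uses the new index:
EXPLAIN SELECT *
  FROM some_vsphere_history_table
 WHERE vcenter_uuid = UNHEX('0123456789abcdef0123456789abcdef');
```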
Good catch, thank you! |
Expected Behavior
Run the setup script and everything works.
Current Behavior
I can't get the new version 1.2.0 to work.
In the web interface I got socket connect errors and now I see the following:
Possible Solution
Steps to Reproduce (for bugs)
Your Environment