
Update to 1.2.0 failed #275

Closed
slalomsk8er opened this issue Nov 16, 2021 · 23 comments

@slalomsk8er
Contributor

Expected Behavior

Run the setup script and everything works.

Current Behavior

Can't get the new version 1.2.0 to work.

In the web interface I got socket connect errors and now I see the following:

15:20:56: [localdb] database connection has been closed
15:20:56: Failed. Will try to reconnect to the Database
15:20:56: [localdb] component should be failed, but is now ready. Race condition?
15:20:56: [localdb] component changed from starting to failed
15:20:56: [localdb] component changed from stopped to starting
15:20:56: [db] component changed from idle to ready
15:20:56: [db] DB has been cleaned up
15:20:56: [db] Running DB cleanup (this could take some time)
15:20:56: [db] sending DB config to child process
15:20:56: [configwatch] DB configuration loaded
15:20:56: [db] component changed from starting to idle
15:20:56: [db] component changed from stopped to starting
15:20:56: [socket] launching control socket in /run/icinga-vspheredb/vspheredb.sock

Possible Solution

Steps to Reproduce (for bugs)

  1. run the update script as described
  2. delete all tables in the DB
  3. run the update script again
  4. increase the watchdog timeout in icinga-vspheredb.service
  5. enable the debug line in icinga-vspheredb.service
  6. reload and restart icinga-vspheredb.service

Your Environment

  • VMware vCenter®/ESXi™-Version:
  • Version/GIT-Hash of this module: 1.2.0
  • Icinga Web 2 version: 2.9.3
  • Operating System and version: CentOS Linux 7
  • Webserver, PHP versions: Apache/2.4.6, PHP-Version 7.3.29
@Thomas-Gelf
Contributor

> delete all tables in DB

Ehm... why?

> increase watchdog timeout in icinga-vspheredb.service

Please keep the default value.

Please share `ls -ld /run/icinga-vspheredb/` and `grep User /etc/systemd/system/icinga-vspheredb.service`.

@Thomas-Gelf
Contributor

> Please keep the default value.

Explanation: this used to be a workaround for an issue with the DB cleanup and a delayed heartbeat to systemd. Cleanup is now more aggressive and takes place in a sub-process, so this is no longer an issue. The daemon MUST send its heartbeats regularly, and we want it to fail if it doesn't.
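For context: the heartbeat mechanism here is systemd's watchdog. The unit declares a watchdog interval, and the daemon must notify systemd (sd_notify with "WATCHDOG=1") more often than that, or systemd kills and restarts it. A minimal sketch of the relevant unit directives (the values are illustrative, not necessarily what the module ships):

[Service]
Type=notify
# the daemon must send WATCHDOG=1 within this interval, otherwise
# systemd aborts the process and restarts it (with Restart=on-failure)
WatchdogSec=10
Restart=on-failure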

@slalomsk8er
Contributor Author

> > delete all tables in DB
>
> Ehm... why?

I had an old, half-deleted test vCenter in the DB, and other problems with the previous version.
So I wanted a clean start.

> > increase watchdog timeout in icinga-vspheredb.service
>
> Please keep the default value.

Changed back, reloaded and restarted.

> Please share `ls -ld /run/icinga-vspheredb/` and `grep User /etc/systemd/system/icinga-vspheredb.service`.

drwxr-xr-x 2 icingavspheredb icingaweb2 60 Nov 16 16:50 /run/icinga-vspheredb/

User=icingavspheredb

@Thomas-Gelf
Contributor

> So I wanted a clean start.

Ok, got it :-) And you really removed ALL of them? The log should then contain a hint that the schema has been applied/migrated.

> icingavspheredb

Looks good to me. Is there something in the directory while the daemon is running? What happens if you stop the daemon and run it manually in the foreground with icingacli vspheredb daemon run --debug --trace? Does it work when executed as icingavspheredb (su - icingavspheredb -s /bin/bash)? Does anything change if you run it as root?

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

> > So I wanted a clean start.
>
> Ok, got it :-) And you really removed ALL of them? The log should then contain a hint that the schema has been applied/migrated.

060 | 14:54:12 | DROP TABLE `icinga_vspheredb`.`alarm_history`, `icinga_vspheredb`.`compute_resource`, `icinga_vspheredb`.`counter_300x5`, `icinga_vspheredb`.`datastore`, `icinga_vspheredb`.`distributed_virtual_portgroup`, `icinga_vspheredb`.`distributed_virtual_switch`, `icinga_vspheredb`.`host_list`, `icinga_vspheredb`.`host_list_member`, `icinga_vspheredb`.`host_monitoring_hoststate`, `icinga_vspheredb`.`host_monitoring_property`, `icinga_vspheredb`.`host_pci_device`, `icinga_vspheredb`.`host_physical_nic`, `icinga_vspheredb`.`host_quick_stats`, `icinga_vspheredb`.`host_sensor`, `icinga_vspheredb`.`host_system`, `icinga_vspheredb`.`host_virtual_nic`, `icinga_vspheredb`.`monitoring_connection`, `icinga_vspheredb`.`new_table`, `icinga_vspheredb`.`object`, `icinga_vspheredb`.`performance_collection_interval`, `icinga_vspheredb`.`performance_counter`, `icinga_vspheredb`.`performance_group`, `icinga_vspheredb`.`performance_unit`, `icinga_vspheredb`.`storage_pod`, `icinga_vspheredb`.`vcenter`, `icinga_vspheredb`.`vcenter_event_history_collector`, `icinga_vspheredb`.`vcenter_server`, `icinga_vspheredb`.`vcenter_session`, `icinga_vspheredb`.`vcenter_sync`, `icinga_vspheredb`.`virtual_machine`, `icinga_vspheredb`.`vm_datastore_usage`, `icinga_vspheredb`.`vm_disk`, `icinga_vspheredb`.`vm_disk_usage`, `icinga_vspheredb`.`vm_event_history`, `icinga_vspheredb`.`vm_hardware`, `icinga_vspheredb`.`vm_list`, `icinga_vspheredb`.`vm_list_member`, `icinga_vspheredb`.`vm_monitoring_hoststate`, `icinga_vspheredb`.`vm_monitoring_property`, `icinga_vspheredb`.`vm_network_adapter`, `icinga_vspheredb`.`vm_quick_stats`, `icinga_vspheredb`.`vm_snapshot`, `icinga_vspheredb`.`vspheredb_daemon`, `icinga_vspheredb`.`vspheredb_daemonlog`, `icinga_vspheredb`.`vspheredb_schema_migration` | Error Code: 1451. Cannot delete or update a parent row: a foreign key constraint fails | 0.188 sec
061 | 14:54:31 | DROP TABLE `icinga_vspheredb`.`host_list`, `icinga_vspheredb`.`perfdata_subscription`, `icinga_vspheredb`.`perfdata_consumer`, `icinga_vspheredb`.`vcenter`, `icinga_vspheredb`.`vcenter_server`, `icinga_vspheredb`.`virtual_machine`, `icinga_vspheredb`.`vm_list` | Error Code: 1451. Cannot delete or update a parent row: a foreign key constraint fails | 0.047 sec
362 | 14:55:23 | DROP TABLE `icinga_vspheredb`.`vcenter` | 0 row(s) affected | 0.016 sec

I had to use force ;)
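For anyone reading along: one way to "force" such drops in MariaDB/MySQL is to disable foreign key checks for the session. This is an assumption about what was done here, not a recommendation:

-- temporarily ignore FK constraints so the DROP order no longer matters
SET FOREIGN_KEY_CHECKS = 0;
DROP TABLE `icinga_vspheredb`.`vcenter`;  -- repeat for the remaining tables
SET FOREIGN_KEY_CHECKS = 1;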

I think I saw something like this in the log.

SELECT * FROM icinga_vspheredb.vspheredb_schema_migration; returns:

35 | 2021-11-16 14:55:52

> > icingavspheredb
>
> Looks good to me. Is there something in the directory while the daemon is running? What happens if you stop the daemon and run it manually in the foreground with icingacli vspheredb daemon run --debug --trace? Does it work when executed as icingavspheredb (su - icingavspheredb -s /bin/bash)? Does anything change if you run it as root?

ls -lah /run/icinga-vspheredb/
total 0
drwxr-xr-x  2 icingavspheredb icingaweb2  60 Nov 16 17:08 .
drwxr-xr-x 34 root            root       980 Nov 16 14:22 ..
srwxrwxrwx  1 icingavspheredb icingaweb2   0 Nov 16 17:08 vspheredb.sock

I ran it as root before I ran it as icingavspheredb, and now I see data in the web interface.

The only error I still see is one I already had before the update: error: [api vcenter (id=3)] Storage Pods: Got no returnval / RetrieveResult

While running as icingavspheredb I got:

PHP Fatal error:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 692224 bytes) in /usr/share/icinga-php/vendor/vendor/guzzlehttp/psr7/src/Message.php on line 126
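For reference, raising the limit means changing the PHP CLI configuration; on CentOS 7 that usually lives in /etc/php.ini (or a file under /etc/php.d/), though the exact file may differ on your system. A minimal sketch:

; raise the PHP CLI memory limit for the daemon
memory_limit = 1024M

The effective value can be checked with php -r 'echo ini_get("memory_limit"), PHP_EOL;'.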

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

The memory limit is now 1024M and icingacli vspheredb daemon run --debug --trace still runs, but I got a "daemon keep alive" message in Icinga Web.

@Thomas-Gelf
Contributor

Storage Pods: that's fine. It shouldn't log an error, but this doesn't cause problems. I'm wondering about the memory usage: how big are your vCenters? And what's that "daemon keep alive message" you're talking about?

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

The message was on the module's front page, and icingacli vspheredb daemon run --debug --trace was and still is running.

PID 7380  /bin/icingacli vspheredb daemon run --debug --trace  1.1 GiB

36 hosts
9722 Virtual Machines

@Thomas-Gelf
Contributor

CentOS/RHEL enforce a PHP memory limit for the CLI too; that's not the case on other systems. For that amount of objects, 1 GB is not enough. With v1.2 we already changed how we fetch data from your vCenter (or ESXi hosts?): we now fetch it in chunks. However, processing/DB sync still happens with all objects in memory. There are plans to improve this, to get a far lower memory footprint. All we need to keep in memory is a reference to object names/IDs, to be able to purge outdated objects after all chunks have been processed.

Please raise the limit to 4 GB (it shouldn't need that much) and watch its usage for some days.
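A simple way to do that is to sample the daemon's resident set size periodically; a sketch, assuming the pgrep pattern below matches your setup:

# append the RSS (in KiB) and command line once a minute;
# -o picks the oldest (main) process if the daemon forked children
while sleep 60; do
  ps -o rss=,cmd= -p "$(pgrep -of 'vspheredb daemon run')" >> /tmp/vspheredb-rss.log
done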

Open questions:

  • is it now running as root, or as icingavspheredb?
  • are all tasks (apart from StoragePod) running fine?

@slalomsk8er
Contributor Author

This time I was fast enough to catch the message:
[screenshot of the message]

It still runs manually as icingavspheredb, but, by chance or not, it only started working after I ran it once as root first.

I think it looks good, but there are no Events and Alarms yet.

18:25:23: [api vCenter (id=3)] Data Stores: 0 new, 5 modified, 0 deleted (got 148 from DB, 148 from API)
18:25:23: [api vCenter  (id=3)] Host Quick Stats: 0 new, 36 modified, 0 deleted (got 36 from DB, 36 from API)
18:25:23: [api vCenter  (id=3)] Compute Resources: 0 new, 2 modified, 0 deleted (got 2 from DB, 2 from API)
18:25:23: [api vCenter  (id=3)] Storage Pods: Got no returnval / RetrieveResult
18:25:22: [api vCenter  (id=3)] VM Quick Stats: 0 new, 9722 modified, 0 deleted (got 9722 from DB, 9722 from API)
18:24:19: [api vCenter  (id=3)] Host Quick Stats: 0 new, 36 modified, 0 deleted (got 36 from DB, 36 from API)
18:24:18: [api vCenter  (id=3)] VM Quick Stats: 0 new, 9722 modified, 0 deleted (got 9722 from DB, 9722 from API)
18:23:43: [api vCenter  (id=3)] Virtual Machines: 0 new, 15 modified, 0 deleted (got 9722 from DB, 9722 from API)
18:23:15: [api vCenter  (id=3)] Host Quick Stats: 0 new, 36 modified, 0 deleted (got 36 from DB, 36 from API)
18:23:13: [api vCenter  (id=3)] VM Quick Stats: 0 new, 9722 modified, 0 deleted (got 9722 from DB, 9722 from API)

@Thomas-Gelf
Contributor

Thomas-Gelf commented Nov 16, 2021

This message appears when the daemon stops. If this happens while it is running its normal tasks, then this needs some tuning. That shouldn't be the case here, because as you can see, after modifying 9722 elements the next log entry appears one second later.

Initially, the problem was for sure the memory limit. If you believe that this alone does not explain the symptoms, it could be that filling the DB with the initial data took too long. When the daemon gets terminated in an ungraceful way (by systemd or because of a memory limit), we'll not see this in the logs shown in the UI.

Could you please share the log lines from journalctl -u icinga-vspheredb.service from your first attempts?
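Something like the following should capture them (the date matches this report; adjust as needed):

# dump the full service journal since the day of the first attempts
journalctl -u icinga-vspheredb.service --since "2021-11-16" --no-pager > vspheredb-journal.log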

@slalomsk8er
Contributor Author

Could use some tuning then.

Where can I share the logs in private?

@Thomas-Gelf
Contributor

> Could use some tuning then.

Aye aye, sir!

> Where can I share the logs in private?

Feel free to send them to thomas@gelf.net, but I can also share an upload-only link, if you prefer that.

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

> > Could use some tuning then.
>
> Aye aye, sir!

The last message showed 46s, then it vanished again.

Because I increased the memory_limit from 1 GB to 4 GB as you suggested (somehow it still worked while using 2.14 GB?), I also tried running it via systemd again, but had no success.

Now it alternates: it runs for about a minute and some seconds, then it gets killed and stays dead for about 30 seconds.

[screenshot]

● icinga-vspheredb.service - Icinga vSphereDB Daemon
   Loaded: loaded (/etc/systemd/system/icinga-vspheredb.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: signal) since Tue 2021-11-16 21:53:11 CET; 17s ago
     Docs: https://icinga.com/docs/icinga-module-for-vsphere/latest/
  Process: 39408 ExecStart=/usr/bin/icingacli vspheredb daemon run (code=killed, signal=ABRT)
 Main PID: 39408 (code=killed, signal=ABRT)
   Status: "ready, db: ready, localdb: ready, api: ready"

Nov 16 21:53:11 icinga  systemd[1]: icinga-vspheredb.service: main process exited, code=killed, status=6/ABRT
Nov 16 21:53:11 icinga  systemd[1]: Unit icinga-vspheredb.service entered failed state.
Nov 16 21:53:11 icinga  systemd[1]: icinga-vspheredb.service failed.

While it isn't working, I get this message a lot: Events: The operation is not allowed in the current state.

Stopping doesn't go without "scary" messages either. I noticed it while running it manually and exiting with Ctrl+C, and also now when using systemctl stop icinga-vspheredb.service.

Nov 16 22:21:04 icinga  systemd[1]: Stopping Icinga vSphereDB Daemon...
Nov 16 22:21:04 icinga  icinga-vspheredb[58138]: Shutting down
Nov 16 22:21:04 icinga  icinga-vspheredb[58138]: [localdb] database connection has been closed
Nov 16 22:21:30 icinga  icinga-vspheredb[58138]: Failed to safely shutdown, stopping anyways: Timed out after 5 seconds
Nov 16 22:21:30 icinga  systemd[1]: Stopped Icinga vSphereDB Daemon.

@Thomas-Gelf
Contributor

It's getting killed by systemd again and again; I just do not understand why. I have been able to simulate this by locking some DB tables, which gives me the same effect. Do you consider your DB fast or slow? Are there other heavy services running on it? Could you please restart it and run SHOW PROCESSLIST a couple of times, until vspheredb gets killed? If there is a lot of unrelated noise, please run it with the DB user used by vspheredb.
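One way to repeat that automatically, assuming the MariaDB client and a client config file holding the vspheredb credentials (the path below is a placeholder):

# refresh the full process list every 2 seconds until interrupted
watch -n 2 'mysql --defaults-file=/etc/icingaweb2/.my.cnf -e "SHOW FULL PROCESSLIST"'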

@Thomas-Gelf
Contributor

NB: `rm -rf /tmp/iwebVsphere*` and restart the daemon to resolve your "The operation is not allowed" issue.

@Thomas-Gelf
Contributor

I created a branch with more debug logs; please install this one: feature/more-debug-logs. Then replace the ExecStart line in /etc/systemd/system/icinga-vspheredb.service as follows:

ExecStart=/usr/local/bin/icingacli vspheredb daemon run --debug --trace

...and then:

systemctl daemon-reload
systemctl restart icinga-vspheredb.service
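If you prefer not to edit the shipped unit file directly, a systemd drop-in achieves the same; this is generic systemd usage, not specific to this module:

# creates an override file under /etc/systemd/system/icinga-vspheredb.service.d/
systemctl edit icinga-vspheredb.service

# in the editor: clear ExecStart first, then set the new one
[Service]
ExecStart=
ExecStart=/usr/local/bin/icingacli vspheredb daemon run --debug --trace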

@slalomsk8er
Contributor Author

I don't know how to measure the DB performance. It's a MariaDB Galera Cluster as far as I know.
I only have a user for icinga and I get this:

Id       | User   | Host         | db               | Command | Time | State    | Info             | Progress
14795566 | icinga | dbhost:54874 | icinga_director  | Sleep   | 2    |          |                  | 0.000
15036984 | icinga | dbhost:61688 | icinga_vspheredb | Sleep   | 286  |          |                  | 0.000
15036985 | icinga | dbhost:61689 | icinga_vspheredb | Query   | 0    | starting | SHOW PROCESSLIST | 0.000
15037872 | icinga | dbhost:33054 | icinga_vspheredb | Sleep   | 1    |          |                  | 0.000

`rm -rf /tmp/iwebVsphere*` worked, thanks.

I sent you the new log.

@Thomas-Gelf
Contributor

For those who are wondering: the main problem here was a DB with very high latency. The setup uses a Galera cluster, which exists to satisfy specific guarantees; speed and low latency are not what it has been built for. As a result, what was expected to take only a fraction of a second took 5-10 seconds to complete. That was longer than the allowed timeout for our heartbeats to systemd.

While we raised those limits for this setup, we consider doing so a bad thing. Queries running that long block the whole event loop, and this messes up our scheduling. The linked issues #281, #287 and #288 are related and have been solved to reduce pressure in such a scenario.

While we do not officially support Galera, we never stated that we don't support it either, even if I would always discourage using it. Similar problems might also occur on supported databases with very slow disks (or high network latency). That's why I also created #282 as a task for a future major release.
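A crude way to get a feel for such latency (the hostname and the defaults-file path are placeholders for this setup):

# round-trip time for a trivial query; run it a few times and compare.
# credentials come from a defaults file so the password prompt doesn't skew the timing
time mysql --defaults-file=/etc/icingaweb2/.my.cnf -h dbhost icinga_vspheredb -e 'SELECT 1'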

Closing this one, as it seems to be solved so far. @slalomsk8er: it would be great if you could give the current master branch a try. I'll probably release v1.2.1 tomorrow morning. Please keep your higher WatchDog timeout in place.

@Thomas-Gelf added this to the v1.2.1 milestone Nov 22, 2021
@slalomsk8er
Contributor Author

@Thomas-Gelf I can check at 09:15 tomorrow.

@Thomas-Gelf
Contributor

Thank you!!

@idandr

idandr commented Nov 24, 2021

I have seen a similar problem after an upgrade from v1.1.0 to v1.2.1. The watchdog timer constantly killed the daemon, SQL queries queued up, and the system load went high. Even with a watchdog timeout of 360 s the daemon was killed. Looking at the queued SQL queries, I found two suspects (the vcenter_uuid value is an example):

SELECT MAX(event_key) FROM (
    SELECT MAX(event_key) AS event_key FROM vm_event_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
    UNION ALL SELECT MAX(event_key) AS event_key FROM alarm_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
) AS u;
SELECT MAX(ts_event_ms) FROM (
    SELECT MAX(ts_event_ms) AS ts_event_ms FROM vm_event_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
    UNION ALL SELECT MAX(ts_event_ms) AS ts_event_ms FROM alarm_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
) AS u;

We have a total of over one million rows in these tables. EXPLAIN shows that both queries do not use an index and have to read half a million rows. After adding indices

ALTER TABLE vm_event_history ADD KEY bugfix_search_vc_idx (vcenter_uuid, ts_event_ms);
ALTER TABLE vm_event_history ADD KEY bugfix2_search_vc_idx (vcenter_uuid, event_key);
ALTER TABLE alarm_history ADD KEY bugfix_search_vc_idx (vcenter_uuid, ts_event_ms);
ALTER TABLE alarm_history ADD KEY bugfix2_search_vc_idx (vcenter_uuid, event_key);

both queries are answered immediately. The daemon has now been running stable with the original watchdog timeout (10 s) for a couple of hours. Maybe you can test this approach too.
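To verify that the new indices are actually picked up, an EXPLAIN before and after the ALTER TABLEs should do; the UUID below is the example value from above:

EXPLAIN SELECT MAX(event_key) AS event_key
FROM vm_event_history
WHERE vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7;
-- after adding the indices, the key column should show bugfix2_search_vc_idx
-- and the examined rows should drop from ~half a million to a handful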

By the way, we do not use Galera.

@Thomas-Gelf
Contributor

Good catch, thank you!
