
Update to 1.2.0 failed #275

Closed
slalomsk8er opened this issue Nov 16, 2021 · 23 comments

@slalomsk8er
Contributor

Expected Behavior

Run the setup script and everything works.

Current Behavior

Can't get the new version 1.2.0 to work.

In the web interface I got socket connect errors and now I see the following:

15:20:56: [localdb] database connection has been closed
15:20:56: Failed. Will try to reconnect to the Database
15:20:56: [localdb] component should be failed, but is now ready. Race condition?
15:20:56: [localdb] component changed from starting to failed
15:20:56: [localdb] component changed from stopped to starting
15:20:56: [db] component changed from idle to ready
15:20:56: [db] DB has been cleaned up
15:20:56: [db] Running DB cleanup (this could take some time)
15:20:56: [db] sending DB config to child process
15:20:56: [configwatch] DB configuration loaded
15:20:56: [db] component changed from starting to idle
15:20:56: [db] component changed from stopped to starting
15:20:56: [socket] launching control socket in /run/icinga-vspheredb/vspheredb.sock

Possible Solution

Steps to Reproduce (for bugs)

  1. run the update script as described
  2. delete all tables in the DB
  3. run the update script again
  4. increase the watchdog timeout in icinga-vspheredb.service
  5. enable the debug line in icinga-vspheredb.service
  6. reload and restart icinga-vspheredb.service

Your Environment

  • VMware vCenter®/ESXi™-Version:
  • Version/GIT-Hash of this module: 1.2.0
  • Icinga Web 2 version: 2.9.3
  • Operating System and version: CentOS Linux 7
  • Webserver, PHP versions: Apache/2.4.6, PHP-Version 7.3.29
@Thomas-Gelf
Contributor

> delete all tables in DB

Ehm... why?

> increase watchdog timeout in icinga-vspheredb.service

Please keep the default value.

Please share `ls -ld /run/icinga-vspheredb/` and `grep User /etc/systemd/system/icinga-vspheredb.service`.

@Thomas-Gelf
Contributor

> Please keep the default value.

Explanation: this used to be a workaround for an issue with the DB cleanup and a delayed heartbeat to systemd. Cleanup is now more aggressive and takes place in a sub-process, so this is no longer an issue. The daemon MUST send its heartbeats regularly, and we want it to fail if it doesn't.
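For context: the heartbeat mechanism here is systemd's watchdog. The unit declares a watchdog interval, and the daemon must notify systemd (sd_notify with "WATCHDOG=1") more often than that, or systemd kills and restarts it. A minimal sketch of the relevant unit directives (the values are illustrative, not necessarily what the module ships):

[Service]
Type=notify
# the daemon must send WATCHDOG=1 within this interval, otherwise
# systemd aborts the process and restarts it (with Restart=on-failure)
WatchdogSec=10
Restart=on-failure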

@slalomsk8er
Contributor Author

> > delete all tables in DB
>
> Ehm... why?

I had an old, half-deleted test vCenter in the DB, and other problems with the previous version.
So I wanted a clean start.

> > increase watchdog timeout in icinga-vspheredb.service
>
> Please keep the default value.

Changed back, reloaded and restarted.

> Please share `ls -ld /run/icinga-vspheredb/` and `grep User /etc/systemd/system/icinga-vspheredb.service`.

drwxr-xr-x 2 icingavspheredb icingaweb2 60 Nov 16 16:50 /run/icinga-vspheredb/

User=icingavspheredb

@Thomas-Gelf
Contributor

> So I wanted a clean start.

Ok, got it :-) And you really removed ALL of them? The log should then contain a hint that the schema has been applied/migrated.

> icingavspheredb

Looks good to me. Is there something in the directory while the daemon is running? What happens if you stop the daemon and run it manually in the foreground with icingacli vspheredb daemon run --debug --trace? Does it work when executed as icingavspheredb (su - icingavspheredb -s /bin/bash)? Does anything change if you run it as root?

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

> > So I wanted a clean start.
>
> Ok, got it :-) And you really removed ALL of them? The log should then contain a hint that the schema has been applied/migrated.

060 | 14:54:12 | DROP TABLE `icinga_vspheredb`.`alarm_history`, `icinga_vspheredb`.`compute_resource`, `icinga_vspheredb`.`counter_300x5`, `icinga_vspheredb`.`datastore`, `icinga_vspheredb`.`distributed_virtual_portgroup`, `icinga_vspheredb`.`distributed_virtual_switch`, `icinga_vspheredb`.`host_list`, `icinga_vspheredb`.`host_list_member`, `icinga_vspheredb`.`host_monitoring_hoststate`, `icinga_vspheredb`.`host_monitoring_property`, `icinga_vspheredb`.`host_pci_device`, `icinga_vspheredb`.`host_physical_nic`, `icinga_vspheredb`.`host_quick_stats`, `icinga_vspheredb`.`host_sensor`, `icinga_vspheredb`.`host_system`, `icinga_vspheredb`.`host_virtual_nic`, `icinga_vspheredb`.`monitoring_connection`, `icinga_vspheredb`.`new_table`, `icinga_vspheredb`.`object`, `icinga_vspheredb`.`performance_collection_interval`, `icinga_vspheredb`.`performance_counter`, `icinga_vspheredb`.`performance_group`, `icinga_vspheredb`.`performance_unit`, `icinga_vspheredb`.`storage_pod`, `icinga_vspheredb`.`vcenter`, `icinga_vspheredb`.`vcenter_event_history_collector`, `icinga_vspheredb`.`vcenter_server`, `icinga_vspheredb`.`vcenter_session`, `icinga_vspheredb`.`vcenter_sync`, `icinga_vspheredb`.`virtual_machine`, `icinga_vspheredb`.`vm_datastore_usage`, `icinga_vspheredb`.`vm_disk`, `icinga_vspheredb`.`vm_disk_usage`, `icinga_vspheredb`.`vm_event_history`, `icinga_vspheredb`.`vm_hardware`, `icinga_vspheredb`.`vm_list`, `icinga_vspheredb`.`vm_list_member`, `icinga_vspheredb`.`vm_monitoring_hoststate`, `icinga_vspheredb`.`vm_monitoring_property`, `icinga_vspheredb`.`vm_network_adapter`, `icinga_vspheredb`.`vm_quick_stats`, `icinga_vspheredb`.`vm_snapshot`, `icinga_vspheredb`.`vspheredb_daemon`, `icinga_vspheredb`.`vspheredb_daemonlog`, `icinga_vspheredb`.`vspheredb_schema_migration` | Error Code: 1451. Cannot delete or update a parent row: a foreign key constraint fails | 0.188 sec
061 | 14:54:31 | DROP TABLE `icinga_vspheredb`.`host_list`, `icinga_vspheredb`.`perfdata_subscription`, `icinga_vspheredb`.`perfdata_consumer`, `icinga_vspheredb`.`vcenter`, `icinga_vspheredb`.`vcenter_server`, `icinga_vspheredb`.`virtual_machine`, `icinga_vspheredb`.`vm_list` | Error Code: 1451. Cannot delete or update a parent row: a foreign key constraint fails | 0.047 sec
362 | 14:55:23 | DROP TABLE `icinga_vspheredb`.`vcenter` | 0 row(s) affected | 0.016 sec

I had to use force ;)
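For anyone reading along: one way to "force" such drops in MariaDB/MySQL is to disable foreign key checks for the session. This is an assumption about what was done here, not a recommendation:

-- temporarily ignore FK constraints so the DROP order no longer matters
SET FOREIGN_KEY_CHECKS = 0;
DROP TABLE `icinga_vspheredb`.`vcenter`;  -- repeat for the remaining tables
SET FOREIGN_KEY_CHECKS = 1;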

I think I saw something like this in the log.

SELECT * FROM icinga_vspheredb.vspheredb_schema_migration; returns:

35 | 2021-11-16 14:55:52

> > icingavspheredb
>
> Looks good to me. Is there something in the directory while the daemon is running? What happens if you stop the daemon and run it manually in the foreground with icingacli vspheredb daemon run --debug --trace? Does it work when executed as icingavspheredb (su - icingavspheredb -s /bin/bash)? Does anything change if you run it as root?

ls -lah /run/icinga-vspheredb/
total 0
drwxr-xr-x  2 icingavspheredb icingaweb2  60 Nov 16 17:08 .
drwxr-xr-x 34 root            root       980 Nov 16 14:22 ..
srwxrwxrwx  1 icingavspheredb icingaweb2   0 Nov 16 17:08 vspheredb.sock

I ran it as root before I ran it as icingavspheredb, and now I see data in the web interface.

The only error I still see is one I already had before the update: error: [api vcenter (id=3)] Storage Pods: Got no returnval / RetrieveResult

While running as icingavspheredb I got:

PHP Fatal error:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 692224 bytes) in /usr/share/icinga-php/vendor/vendor/guzzlehttp/psr7/src/Message.php on line 126
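For reference, raising the limit means changing the PHP CLI configuration; on CentOS 7 that usually lives in /etc/php.ini (or a file under /etc/php.d/), though the exact file may differ on your system. A minimal sketch:

; raise the PHP CLI memory limit for the daemon
memory_limit = 1024M

The effective value can be checked with php -r 'echo ini_get("memory_limit"), PHP_EOL;'.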

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

The memory limit is now 1024M and icingacli vspheredb daemon run --debug --trace still runs, but I got a "daemon keep alive" message in Icinga Web.

@Thomas-Gelf
Contributor

Storage Pods: that's fine. It shouldn't log an error, but this doesn't cause problems. I'm wondering about the memory usage: how big are your vCenters? And what's that "daemon keep alive message" you're talking about?

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

The message was on the module's front page, and icingacli vspheredb daemon run --debug --trace was and still is running.

PID 7380  /bin/icingacli vspheredb daemon run --debug --trace  1.1 GiB

36 hosts
9722 Virtual Machines

@Thomas-Gelf
Contributor

CentOS/RHEL enforce a PHP memory limit for the CLI too; that's not the case on other systems. For that amount of objects, 1 GB is not enough. With v1.2 we already changed how we fetch data from your vCenter (or ESXi hosts?): we now fetch it in chunks. However, processing/DB sync still happens with all objects in memory. There are plans to improve this, to get a far lower memory footprint. All we need to keep in memory is a reference to object names/IDs, to be able to purge outdated objects after all chunks have been processed.

Please raise the limit to 4 GB (it shouldn't need that much) and watch its usage for some days.
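A simple way to do that is to sample the daemon's resident set size periodically; a sketch, assuming the pgrep pattern below matches your setup:

# append the RSS (in KiB) and command line once a minute;
# -o picks the oldest (main) process if the daemon forked children
while sleep 60; do
  ps -o rss=,cmd= -p "$(pgrep -of 'vspheredb daemon run')" >> /tmp/vspheredb-rss.log
done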

Open questions:

  • is it now running as root, or as icingavspheredb?
  • are all tasks (apart from StoragePod) running fine?

@slalomsk8er
Contributor Author

This time I was fast enough to catch the message:
[screenshot of the message]

It still runs manually as icingavspheredb, but, by chance or not, it only started working after I ran it once as root first.

I think it looks good, but there are no Events and Alarms yet.

18:25:23: [api vCenter (id=3)] Data Stores: 0 new, 5 modified, 0 deleted (got 148 from DB, 148 from API)
18:25:23: [api vCenter  (id=3)] Host Quick Stats: 0 new, 36 modified, 0 deleted (got 36 from DB, 36 from API)
18:25:23: [api vCenter  (id=3)] Compute Resources: 0 new, 2 modified, 0 deleted (got 2 from DB, 2 from API)
18:25:23: [api vCenter  (id=3)] Storage Pods: Got no returnval / RetrieveResult
18:25:22: [api vCenter  (id=3)] VM Quick Stats: 0 new, 9722 modified, 0 deleted (got 9722 from DB, 9722 from API)
18:24:19: [api vCenter  (id=3)] Host Quick Stats: 0 new, 36 modified, 0 deleted (got 36 from DB, 36 from API)
18:24:18: [api vCenter  (id=3)] VM Quick Stats: 0 new, 9722 modified, 0 deleted (got 9722 from DB, 9722 from API)
18:23:43: [api vCenter  (id=3)] Virtual Machines: 0 new, 15 modified, 0 deleted (got 9722 from DB, 9722 from API)
18:23:15: [api vCenter  (id=3)] Host Quick Stats: 0 new, 36 modified, 0 deleted (got 36 from DB, 36 from API)
18:23:13: [api vCenter  (id=3)] VM Quick Stats: 0 new, 9722 modified, 0 deleted (got 9722 from DB, 9722 from API)

@Thomas-Gelf
Contributor

Thomas-Gelf commented Nov 16, 2021

This message appears when the daemon stops. If this happens while it is running its normal tasks, then this needs some tuning. That shouldn't be the case here, because as you can see, after modifying 9722 elements the next log entry appears one second later.

Initially, the problem was for sure the memory limit. If you believe that this alone does not explain the symptoms, it could be that filling the DB with the initial data took too long. When the daemon gets terminated in an ungraceful way (by systemd or because of a memory limit), we'll not see this in the logs shown in the UI.

Could you please share the log lines from journalctl -u icinga-vspheredb.service from your first attempts?
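Something like the following should capture them (the date matches this report; adjust as needed):

# dump the full service journal since the day of the first attempts
journalctl -u icinga-vspheredb.service --since "2021-11-16" --no-pager > vspheredb-journal.log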

@slalomsk8er
Contributor Author

Could use some tuning then.

Where can I share the logs in private?

@Thomas-Gelf
Contributor

> Could use some tuning then.

Aye aye, sir!

> Where can I share the logs in private?

Feel free to send them to thomas@gelf.net, but I can also share an upload-only link, if you prefer that.

@slalomsk8er
Contributor Author

slalomsk8er commented Nov 16, 2021

> > Could use some tuning then.
>
> Aye aye, sir!

The last message showed 46s, then it vanished again.

Because I increased the memory_limit from 1 GB to 4 GB as you suggested (somehow it still worked while using 2.14 GB?), I also tried running it via systemd again, but had no success.

Now it alternates: it runs for about a minute and some seconds, then it gets killed and stays dead for about 30 seconds.

[screenshot]

● icinga-vspheredb.service - Icinga vSphereDB Daemon
   Loaded: loaded (/etc/systemd/system/icinga-vspheredb.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: signal) since Tue 2021-11-16 21:53:11 CET; 17s ago
     Docs: https://icinga.com/docs/icinga-module-for-vsphere/latest/
  Process: 39408 ExecStart=/usr/bin/icingacli vspheredb daemon run (code=killed, signal=ABRT)
 Main PID: 39408 (code=killed, signal=ABRT)
   Status: "ready, db: ready, localdb: ready, api: ready"

Nov 16 21:53:11 icinga  systemd[1]: icinga-vspheredb.service: main process exited, code=killed, status=6/ABRT
Nov 16 21:53:11 icinga  systemd[1]: Unit icinga-vspheredb.service entered failed state.
Nov 16 21:53:11 icinga  systemd[1]: icinga-vspheredb.service failed.

While it isn't working, I get this message a lot: Events: The operation is not allowed in the current state.

Stopping doesn't go without "scary" messages either. I noticed it while running it manually and exiting with Ctrl+C, and also now when using systemctl stop icinga-vspheredb.service.

Nov 16 22:21:04 icinga  systemd[1]: Stopping Icinga vSphereDB Daemon...
Nov 16 22:21:04 icinga  icinga-vspheredb[58138]: Shutting down
Nov 16 22:21:04 icinga  icinga-vspheredb[58138]: [localdb] database connection has been closed
Nov 16 22:21:30 icinga  icinga-vspheredb[58138]: Failed to safely shutdown, stopping anyways: Timed out after 5 seconds
Nov 16 22:21:30 icinga  systemd[1]: Stopped Icinga vSphereDB Daemon.

@Thomas-Gelf
Contributor

It's getting killed by systemd again and again; I just do not understand why. I have been able to simulate this by locking some DB tables, which gives me the same effect. Do you consider your DB fast or slow? Are there other heavy services running on it? Could you please restart it and run SHOW PROCESSLIST a couple of times, until vspheredb gets killed? If there is a lot of unrelated noise, please run it with the DB user used by vspheredb.
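One way to repeat that automatically, assuming the MariaDB client and a client config file holding the vspheredb credentials (the path below is a placeholder):

# refresh the full process list every 2 seconds until interrupted
watch -n 2 'mysql --defaults-file=/etc/icingaweb2/.my.cnf -e "SHOW FULL PROCESSLIST"'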

@Thomas-Gelf
Contributor

NB: `rm -rf /tmp/iwebVsphere*` and restart the daemon to resolve your "The operation is not allowed" issue.

@Thomas-Gelf
Contributor

I created a branch with more debug logs; please install this one: feature/more-debug-logs. Then replace the ExecStart line in /etc/systemd/system/icinga-vspheredb.service as follows:

ExecStart=/usr/local/bin/icingacli vspheredb daemon run --debug --trace

...and then:

systemctl daemon-reload
systemctl restart icinga-vspheredb.service
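If you prefer not to edit the shipped unit file directly, a systemd drop-in achieves the same; this is generic systemd usage, not specific to this module:

# creates an override file under /etc/systemd/system/icinga-vspheredb.service.d/
systemctl edit icinga-vspheredb.service

# in the editor: clear ExecStart first, then set the new one
[Service]
ExecStart=
ExecStart=/usr/local/bin/icingacli vspheredb daemon run --debug --trace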

@slalomsk8er
Contributor Author

I don't know how to measure the DB performance. It's a MariaDB Galera Cluster as far as I know.
I only have a user for icinga and I get this:

Id       | User   | Host         | db               | Command | Time | State    | Info             | Progress
14795566 | icinga | dbhost:54874 | icinga_director  | Sleep   | 2    |          |                  | 0.000
15036984 | icinga | dbhost:61688 | icinga_vspheredb | Sleep   | 286  |          |                  | 0.000
15036985 | icinga | dbhost:61689 | icinga_vspheredb | Query   | 0    | starting | SHOW PROCESSLIST | 0.000
15037872 | icinga | dbhost:33054 | icinga_vspheredb | Sleep   | 1    |          |                  | 0.000

`rm -rf /tmp/iwebVsphere*` worked, thanks.

I sent you the new log.

@Thomas-Gelf
Contributor

For those who are wondering: the main problem here was a DB with very high latency. The setup uses a Galera cluster, which exists to satisfy specific guarantees; speed and low latency are not what it has been built for. As a result, what was expected to take only a fraction of a second took 5-10 seconds to complete. That was longer than the allowed timeout for our heartbeats to systemd.

While we raised those limits for this setup, we consider doing so a bad thing. Queries running that long block the whole event loop, and this messes up our scheduling. The linked issues #281, #287 and #288 are related and have been solved to reduce pressure in such a scenario.

While we do not officially support Galera, we never stated that we don't support it either, even if I would always discourage using it. Similar problems might also occur on supported databases with very slow disks (or high network latency). That's why I also created #282 as a task for a future major release.
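A crude way to get a feel for such latency (the hostname and the defaults-file path are placeholders for this setup):

# round-trip time for a trivial query; run it a few times and compare.
# credentials come from a defaults file so the password prompt doesn't skew the timing
time mysql --defaults-file=/etc/icingaweb2/.my.cnf -h dbhost icinga_vspheredb -e 'SELECT 1'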

Closing this one, as it seems to be solved so far. @slalomsk8er: it would be great if you could give the current master branch a try. I'll probably release v1.2.1 tomorrow morning. Please keep your higher WatchDog timeout in place.

@Thomas-Gelf added this to the v1.2.1 milestone Nov 22, 2021
@slalomsk8er
Contributor Author

@Thomas-Gelf I can check at 09:15 tomorrow.

@Thomas-Gelf
Contributor

Thank you!!

@idandr

idandr commented Nov 24, 2021

I have seen a similar problem after an upgrade from v1.1.0 to v1.2.1. The watchdog timer constantly killed the daemon, SQL queries queued up, and the system load went high. Even with a watchdog timeout of 360 s the daemon was killed. Looking at the queued SQL queries, I found two suspects (the vcenter_uuid value is an example):

SELECT MAX(event_key) FROM (
    SELECT MAX(event_key) AS event_key FROM vm_event_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
    UNION ALL SELECT MAX(event_key) AS event_key FROM alarm_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
) AS u;
SELECT MAX(ts_event_ms) FROM (
    SELECT MAX(ts_event_ms) AS ts_event_ms FROM vm_event_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
    UNION ALL SELECT MAX(ts_event_ms) AS ts_event_ms FROM alarm_history WHERE (vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7)
) AS u;

We have a total of over one million rows in these tables. EXPLAIN shows that both queries do not use an index and have to read half a million rows. After adding indices

ALTER TABLE vm_event_history ADD KEY bugfix_search_vc_idx (vcenter_uuid, ts_event_ms);
ALTER TABLE vm_event_history ADD KEY bugfix2_search_vc_idx (vcenter_uuid, event_key);
ALTER TABLE alarm_history ADD KEY bugfix_search_vc_idx (vcenter_uuid, ts_event_ms);
ALTER TABLE alarm_history ADD KEY bugfix2_search_vc_idx (vcenter_uuid, event_key);

both queries are answered immediately. The daemon has now been running stable with the original watchdog timeout (10 s) for a couple of hours. Maybe you can test this approach too.
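To verify that the new indices are actually picked up, an EXPLAIN before and after the ALTER TABLEs should do; the UUID below is the example value from above:

EXPLAIN SELECT MAX(event_key) AS event_key
FROM vm_event_history
WHERE vcenter_uuid = 0x7EA584BCB28F458B8C3D45C0783420F7;
-- after adding the indices, the key column should show bugfix2_search_vc_idx
-- and the examined rows should drop from ~half a million to a handful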

By the way, we do not use Galera.

@Thomas-Gelf
Contributor

Good catch, thank you!
