Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Servicegroup creation error 503 on large environment #2048

Closed
lucafwp opened this issue Dec 11, 2019 · 10 comments
Closed

Servicegroup creation error 503 on large environment #2048

lucafwp opened this issue Dec 11, 2019 · 10 comments
Assignees
Labels
Milestone

Comments

@lucafwp
Copy link

lucafwp commented Dec 11, 2019

Describe the bug

Adding a new servicegroup, after clicking on Add and have waited more than 30 seconds, you get ad error "Service Unavailable":

After this error the servicegroup is created correctly anyway.

To Reproduce

  1. Set up an environment with about 90000 services and more than 10000 hosts
  2. Create a new servicegroup

Expected behavior

The servicegroup should be created straight forward without performing a lot of, apparently, useless query on db.

Screenshots

image

Your Environment

  • Icinga Web 2 version and modules (System - About): 2.7.3
  • Web browser used: Chrome 78.0.3904.108
  • Icinga 2 version used (icinga2 --version): 2.10.5-1
  • PHP version used (php --version): 7.1.8
  • Server operating system and version: CentOS 7.6.1810
@dnsmichi dnsmichi transferred this issue from Icinga/icingaweb2 Dec 11, 2019
@Thomas-Gelf
Copy link
Contributor

I tried it right now on a system with 36k Hosts and 200k Services. Took about five seconds and succeeded. However, only about 5k of those Services are single ones, the vast majority is being generated by Apply Rules and (applied) Service Sets.

Are you sure about your assumption regarding those DB queries? What kind of queries are you experiencing while this is running?

I see one thing that's going wrong: it loads all Services to pre-calculate group membership for applied Service Groups - even when that Group isn't applied at all. This "should" waste memory and CPU resources, but (in theory) it should fire only a very few fast queries.

The whole behavior here will change in future, as we'll shift these kind of calculations to the background daemon. Still, I'd love to see this fixed also for the current code-base. 10k hosts shouldn't be an issue at all, neither should 90k services. More insight and details would therefore be highly appreciated

As you're running CentOS 7 I guess you're running an old MySQL/MariaDB version? What InnoDB-related settings did you apply? Is the system under heavy load?

@lucafwp
Copy link
Author

lucafwp commented Dec 19, 2019

Hi Thomas, we are running CentOS 7 with MariaDB 5.5.60.
We discovered some issues with innodb settings not accepted by system.
These are the settings we would like to use:

large-pages = true
innodb_buffer_pool_size = 32GB
innodb_log_buffer_size = 4M
innodb_additional_mem_pool_size = 20M
innodb_thread_concurrency = 0

I'll report you back as soon as those settings are validated.

@lucafwp
Copy link
Author

lucafwp commented Jan 29, 2020

Hi @Thomas-Gelf ,

we installed MariaDB 10 on our system and configured the following settings:

innodb_buffer_pool_size=32G
tmp_table_size = 1G
max_heap_table_size = 1G
innodb_sort_buffer_size=16000000
sort_buffer_size=32M

The problem persists.

@astam
Copy link

astam commented Jul 24, 2020

Hi @Thomas-Gelf and @lucafwp ,

I am seeing the same thing happening. If I add a new servicegroup it triggers 34396 queries on the database. On our production and acceptance environments this can then take more then 10 seconds.

To be able to analyze this (I don have enough rights on the mysql cluster we are using) I have created a playground by creating a docker container using the jordan/icinga2:2.11.4 image. The data directory of mysql I have mounted to a tmpfs. The creation time is now 1 second.

In this director I have put the same as on our PRD and ACC environments:

Count Object
1919 Host
451 HostTemplate
6127 Service
1401 ServiceTemplate
44 ServiceApplyRule
44 Command
236 ExternalCommand
573 HostGroup
451 HostTemplate
374 ServiceGroup
125 ServiceSet
7 TimePeriod
2 Basket
1582 Datafield

A small base.

In mysql I've enabled logging during the creation of the servicegroup. It logs to a table.

When I run this query (where I group the queries together and replace id's with hashes):

select count(*) as count, regexp_replace(argument, '[0-9]', '#' ) as arg
from mysql.general_log 
where thread_id = 31799
group by arg
order by count desc;

I get a nice overview over which queries are executed:

'Count','Query'
'6473', 'SELECT g.* FROM icinga_servicegroup_service AS go\n INNER JOIN icinga_servicegroup AS g ON go.servicegroup_id = g.id WHERE (go.service_id = \'####\') ORDER BY g.object_name ASC'
'6473', 'SELECT v.varname, v.varvalue, v.format FROM icinga_service_var AS v WHERE (v.service_id = \'####\')'
'6171', 'SELECT icinga_timeperiod.* FROM icinga_timeperiod WHERE (id = #)'
'5026', 'SELECT icinga_host.* FROM icinga_host WHERE (id = ####)'
'4660', 'SELECT icinga_command.* FROM icinga_command WHERE (id = ###)'
'1511', 'SELECT icinga_command.* FROM icinga_command WHERE (id = ##)'
'821', 'SELECT icinga_service.* FROM icinga_service WHERE (id = ###)'
'821', 'SELECT g.* FROM icinga_servicegroup_service AS go\n INNER JOIN icinga_servicegroup AS g ON go.servicegroup_id = g.id WHERE (go.service_id = \'###\') ORDER BY g.object_name ASC'
'821', 'SELECT v.varname, v.varvalue, v.format FROM icinga_service_var AS v WHERE (v.service_id = \'###\')'
'517', 'SELECT icinga_host.* FROM icinga_host WHERE (id = ###)'
'395', 'SELECT icinga_service_set.* FROM icinga_service_set WHERE (id = ##)'
'302', 'SELECT icinga_service.* FROM icinga_service WHERE (id = ####)'
'112', 'SELECT icinga_service_set.* FROM icinga_service_set WHERE (id = ###)'
'79', 'SELECT icinga_service.* FROM icinga_service WHERE (id = ##)'
'79', 'SELECT v.varname, v.varvalue, v.format FROM icinga_service_var AS v WHERE (v.service_id = \'##\')'
'77', 'SELECT icinga_service_set.* FROM icinga_service_set WHERE (id = #)'
'24', 'SELECT g.* FROM icinga_servicegroup_service AS go\n INNER JOIN icinga_servicegroup AS g ON go.servicegroup_id = g.id WHERE (go.service_id = \'##\') ORDER BY g.object_name ASC'
'9', 'SELECT icinga_service.* FROM icinga_service WHERE (id = #)'
'9', 'SELECT v.varname, v.varvalue, v.format FROM icinga_service_var AS v WHERE (v.service_id = \'#\')'
'1', 'icinga#@localhost as anonymous on icingaweb#_director'
'1', 'SET SESSION SQL_MODE=\'STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,ANSI_QUOTES,PIPES_AS_CONCAT,NO_ENGINE_SUBSTITUTION\', NAMES utf#, time_zone=\'+#:##\''
'1', 'SET time_zone = \'+#:##\''
'1', 'SELECT o.id, o.object_name AS name, o.object_type, p.id AS parent_id, p.object_name AS parent_name FROM icinga_servicegroup AS o\n LEFT JOIN icinga_servicegroup_inheritance AS i ON o.id = i.servicegroup_id\n LEFT JOIN icinga_servicegroup AS p ON p.id = i.parent_servicegroup_id AND p.object_type = \'template\' WHERE (o.object_type = \'template\') ORDER BY o.id ASC, i.weight ASC'
'1', 'SELECT i.servicegroup_id AS object, i.parent_servicegroup_id AS parent FROM icinga_servicegroup_inheritance AS i ORDER BY i.weight ASC'
'1', 'SELECT icinga_servicegroup.* FROM icinga_servicegroup WHERE (object_name = \'xxxxxx\')'
'1', 'INSERT INTO icinga_servicegroup (object_name, object_type, disabled, display_name, assign_filter) VALUES (\'xxxxxx\', \'object\', \'n\', \'xxxxx\', NULL)'
'1', 'SELECT checksum FROM (SELECT * FROM (SELECT # AS pos, LOWER(HEX(checksum)) AS checksum FROM director_activity_log ORDER BY id DESC LIMIT #) a UNION SELECT # AS pos, \'\' AS checksum) u ORDER BY pos LIMIT #'
'1', 'SELECT s.setting_name, s.setting_value FROM director_setting AS s'
'1', 'INSERT INTO director_activity_log (object_name, action_name, object_type, old_properties, new_properties, author, change_time, checksum, parent_checksum) VALUES (\'xxxxxx\', \'create\', \'icinga_servicegroup\', NULL, \'{\\\"display_name\\\":\\\"xxxxx\\\",\\\"object_name\\\":\\\"xxxxxx\\\",\\\"object_type\\\":\\\"object\\\"}\', \'icingaadmin\', \'####-##-## ##:##:##\', \'\\x##\\xA#\\xA#\\rջq\\xF#\\xE#F\\xCBP\\x##\\xA#F)\\xF#\\xB#G�\', \'#\\x#Fl#~ZK\\xECm[\\xE#\\xB#\\xC#JJߒAPj\')'
'1', 'SELECT icinga_service.* FROM icinga_service'
'1', 'SELECT o.id, o.object_name AS name, o.object_type, p.id AS parent_id, p.object_name AS parent_name FROM icinga_service AS o\n LEFT JOIN icinga_service_inheritance AS i ON o.id = i.service_id\n LEFT JOIN icinga_service AS p ON p.id = i.parent_service_id AND p.object_type = \'template\' WHERE (o.object_type = \'template\') ORDER BY o.id ASC, i.weight ASC'
'1', 'SELECT i.service_id AS object, i.parent_service_id AS parent FROM icinga_service_inheritance AS i ORDER BY i.weight ASC'
'1', 'SELECT icinga_servicegroup.object_name, icinga_servicegroup.id FROM icinga_servicegroup'
'1', 'SELECT hgh.servicegroup_id AS group_id, hgh.service_id AS object_id FROM icinga_servicegroup_service_resolved AS hgh WHERE (servicegroup_id = ###) OR (servicegroup_id IN (#, #, ##, ##, ##, ##, #, ##, #, #, ##, #, ###, ##, ##, ###, #, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, #, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ##, ###, ##, ###, ###, ###, ###, ##, ##, ##, ##, ###, ##, ##, ##, ##, ##, ##, ###, ###, ##, ###, ##, ##, ##, ###, ###, ###, ##, ###, ##, ###, ##, ###, ##, ###, ##, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ##, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ##, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ##, ###, ###, ###, ###, ###, ###, ###, ###, ###, ##, ##, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ##, ###, ###, ##, ###, ##, ###, ###, ##, ###, ##, ###, ###, ##, ###, ###, ###, ###, ###, ###, ###, ###, ##, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###, ###))'
'1', ''

If I look at the queries without replacing the id's:

select count(*) as count, argument as arg
from mysql.general_log 
where thread_id = 31799
group by arg
order by count desc;

I get this: logs_grouped.xlsx

Looks like there are a lot of queries called unnecessary.

I hope this will help.

Your Environment

  • Director version (System - About): 1.7.0
  • Icinga Web 2 version and modules (System - About): 2.8.0
  • Icinga 2 version (icinga2 --version): r2.11.4-1
  • Operating System and version: jordan/icinga2:2.11.4 docker image
  • Webserver, PHP versions: Apache 2.4.38-3+deb10u3, PHP 7.3.14-1~deb10u1

@w1ll-i-code
Copy link

w1ll-i-code commented Jan 16, 2023

Hi, @Thomas-Gelf. We are having the same issue. This happens mostly on systems with a "high" count on manually created services. To reproduce this I set up a new icinga environment and created the services as follows:

icingacli director host create 'test-host.wp.bz.it' --check_command 'dummy'
for i in $(seq 1 100); do
    icingacli director service create "service-$i" --host 'test-host.wp.bz.it' --check_command 'dummy'
done

Your Environment

  • Director version (System - About): 1.9.1
  • Icinga Web 2 version and modules (System - About): 2.10.1
  • Icinga 2 version (icinga2 --version): r2.11.9-1
  • Operating System and version: RHEL 8.7
  • Webserver, PHP versions: Apache/2.4.37, PHP 7.4.30

Problem

As @astam already pointed out, the problem lies in the dependency resolution of the icingaweb2 objects that get fetched from the DB. To be specific here: library/Director/Objects/IcingaObject.php:2857. To generate the group it will fetch all icingaweb2 service objects with all it's dependencies. But each dependency will be it's own call to the database, resulting in a huge overhead. Therefore it will time out.

I will attach a cachegrind file I generated while profiling this behaviour. As you will see, 6.8 seconds of the total 10.4 seconds this request took was spent resolving all the dependencies and running for each attribute to the database.
cachegrind.tar.gz

@w1ll-i-code
Copy link

@Thomas-Gelf I have a few ideas on how to improve the loading times, each differing in required work, achieved speed up and code quality.

Proposition 1

This is IMO the easiest to implement and use, however the code will probably not be very clean. The easiest solution would be to implement a small cache that saves already resolved items. So if for example a host is used for multiple services, it can be shared amongst them and does not need to be fetched for each individually.

Proposition 2

To externalize the caching process, a redis cache between the database and the icingaweb2 service would catch all lowhanging fruits in terms of optimization. It could resolve most of the requests in a far shorter time span and could also cache results over several requests, speeding up all other requests in the process. This is a cleaner option that Proposition 1 IMO because the caching is done by a dedicated system. It can also improve the overall speed of most tasks, if the object is in cache. The only tricky part will be the cache invalidation on update.

Proposition 3:

Since the IcingaObject already always resolves all dependencies, let's generate one single query that fetches all necessary data and then map that internally to the objects. This will probably be the biggest commitment in terms of required work, but would also reap the most benefits. I would go even go far as to claim that it will get rid of many future performance problems related to database access as well, improving the performance of almost all parts of the application, all while also massively simplifying the code base in the process.

I already have an architecture in mind that should do the trick and would allow to reduce the overhead caused by round-trip times and repeated database calls compared to the current state to almost zero.

I'd love to have a more involved conversation about all of this if you think this is a path worth pursuing, but it will probably require a rewrite of parts the current logic to condense everything into a single query.

@w1ll-i-code
Copy link

I personally prefer my third proposition, since it delegates the dependency resolving to the database, which it was designed for. Let me go into a bit more detail explaining what I have in mind. The Query building relies on three steps:

  1. For trivial queries (Queries without foreign keys) we generate a simple query.
  2. For queries with foreign keys, we do a LEFT OUTER JOIN with the query generated by that object, to let the database resolve the dependencies.
  3. For each table we join in this way, we prefix the field name with the table name or object name (whichever we prefer)

I constructed three already working queries to demonstrate how this will look like. (I ignored the zone id for this example, but that will work in the same way.) So for a trivial query we will get:

SELECT icinga_command.id FROM icinga_command;

If we want to fetch this command as part of a host, we can simply integrate that query and rename the field:

SELECT icinga_host.id, icinga_command.id AS `check_command.id`
FROM icinga_host
LEFT OUTER JOIN (
    SELECT icinga_command.id
    FROM icinga_command
) AS icinga_command
ON icinga_command.id = icinga_host.check_command_id;

This will get in my test environment the result:

+----+------------------+
| id | check_command.id |
+----+------------------+
|  1 |             NULL |
|  3 |             NULL |
|  5 |             NULL |
...
|  2 |              192 |
|  4 |              192 |
|  6 |              192 |
+----+------------------+

To make an example how we can take this even further, lets create a query with the same three basic steps:

SELECT icinga_service.id, icinga_host.id AS `host.id`, icinga_host.`check_command.id` AS `host.check_command.id`
FROM icinga_service
LEFT OUTER JOIN (
    SELECT icinga_host.id, icinga_command.id AS `check_command.id`
    FROM icinga_host
    LEFT OUTER JOIN (
        SELECT icinga_command.id
        FROM icinga_command
    ) AS icinga_command
    ON icinga_command.id = icinga_host.check_command_id
) AS icinga_host
ON icinga_service.host_id = icinga_host.id;

with that the result looks as follows:

+-------+---------+-----------------------+
| id    | host.id | host.check_command.id |
+-------+---------+-----------------------+
...
|     3 |       3 |                  NULL |
|     5 |       3 |                  NULL |
|     7 |       3 |                  NULL |
|     9 |       3 |                  NULL |
|   122 |       4 |                   192 |
|   124 |       4 |                   192 |
|    16 |       5 |                  NULL |
|    17 |       5 |                  NULL |
...
+-------+---------+-----------------------+

Then once we get the query result, we can map the received fields with two simple steps:

  1. If the field has NO prefix, it is a member of this class.
  2. If the field HAS a prefix, it is a member of a child class

We can then construct the child classes with the following steps:

  1. Collect all fields with the same prefix.
  2. If the id is NULL skip the creation of the child class.
  3. Else strip that prefix and pass the result to the child class.
  4. Each child class will then to the same recursively until all dependencies are resolved.

I am well aware that this will be a big change to the process logic. It might have some downstream effects, and will have to be thoroughly test, but i do firmly believe that the long term benefits and the performance benefits are worth the work.

@bobapple
Copy link
Member

ref/IP/43981

@Thomas-Gelf Thomas-Gelf added this to the v1.11.0 milestone Apr 13, 2023
@Thomas-Gelf Thomas-Gelf self-assigned this Apr 13, 2023
@mcodato
Copy link
Contributor

mcodato commented May 9, 2023

Hi @Thomas-Gelf, we noticed that with the fix it is no longer possible to assign a service to a service group even if the filter is set/changed.
The problem seems to be related to this line


the onStore method is called after the modifiedProperties have been emptied, which causes the memeberShipShpuldBeRefreshed method to return always false.

What we propose is to call the onStore method before emptying the modifiedProperties. Let us know what you think and if in your opinion the proposal is valid, if so we will be happy to provide PR.

Mattia

@Thomas-Gelf Thomas-Gelf reopened this Jun 1, 2023
Thomas-Gelf added a commit that referenced this issue Jul 30, 2023
@Thomas-Gelf
Copy link
Contributor

Should now be fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants