Servicegroup creation error 503 on large environment #2048
Comments
I tried it right now on a system with 36k Hosts and 200k Services. It took about five seconds and succeeded. However, only about 5k of those Services are single ones; the vast majority are generated by Apply Rules and (applied) Service Sets. Are you sure about your assumption regarding those DB queries? What kind of queries do you see while this is running?
I see one thing that's going wrong: it loads all Services to pre-calculate group membership for applied Service Groups, even when that Group isn't applied at all. This "should" waste memory and CPU resources, but (in theory) it should fire only a very few fast queries. The whole behavior here will change in the future, as we'll shift this kind of calculation to the background daemon. Still, I'd love to see this fixed for the current code base as well. 10k hosts shouldn't be an issue at all, and neither should 90k services. More insight and details would therefore be highly appreciated.
As you're running CentOS 7, I guess you're running an old MySQL/MariaDB version? What InnoDB-related settings did you apply? Is the system under heavy load? |
Hi Thomas, we are running CentOS 7 with MariaDB 5.5.60. large-pages = true I'll report back as soon as those settings are validated. |
Hi @Thomas-Gelf , we installed MariaDB 10 on our system and configured the following settings:
The problem persists. |
Hi @Thomas-Gelf and @lucafwp, I am seeing the same thing happening. If I add a new servicegroup, it triggers 34396 queries on the database. On our production and acceptance environments this can then take more than 10 seconds. To be able to analyze this (I don't have enough rights on the MySQL cluster we are using), I have created a playground by creating a Docker container using the jordan/icinga2:2.11.4 image. I have mounted the MySQL data directory on a tmpfs. The creation time is now 1 second. In this Director I have put the same as on our PRD and ACC environments:
A small base. In MySQL I've enabled query logging during the creation of the servicegroup; it logs to a table. When I run this query (where I group the queries together and replace IDs with hashes):
select count(*) as count, regexp_replace(argument, '[0-9]', '#') as arg
from mysql.general_log
where thread_id = 31799
group by arg
order by count desc;
I get a nice overview of which queries are executed:
If I look at the queries without replacing the IDs:
select count(*) as count, argument as arg
from mysql.general_log
where thread_id = 31799
group by arg
order by count desc;
I get this: logs_grouped.xlsx
It looks like a lot of queries are being run unnecessarily. I hope this will help.
Your Environment
|
Hi, @Thomas-Gelf. We are having the same issue. This happens mostly on systems with a "high" number of manually created services. To reproduce this, I set up a new Icinga environment and created the services as follows:
icingacli director host create 'test-host.wp.bz.it' --check_command 'dummy'
for i in $(seq 1 100); do
icingacli director service create "service-$i" --host 'test-host.wp.bz.it' --check_command 'dummy'
done
Your Environment
Problem
As @astam already pointed out, the problem lies in the dependency resolution of the icingaweb2 objects that get fetched from the DB. To be specific here: I will attach a cachegrind file I generated while profiling this behaviour. As you will see, 6.8 seconds of the total 10.4 seconds this request took were spent resolving all the dependencies and going to the database for each attribute. |
@Thomas-Gelf I have a few ideas on how to improve the loading times, each differing in required work, achieved speed-up, and code quality.
Proposition 1
This is IMO the easiest to implement and use; however, the code will probably not be very clean. The simplest solution would be to implement a small cache that saves already resolved items. So if, for example, a host is used for multiple services, it can be shared amongst them and does not need to be fetched for each individually.
Proposition 2
To externalize the caching process, a Redis cache between the database and the icingaweb2 service would catch all the low-hanging fruit in terms of optimization. It could resolve most of the requests in a far shorter time span and could also cache results across several requests, speeding up all other requests in the process. This is a cleaner option than Proposition 1 IMO, because the caching is done by a dedicated system. It can also improve the overall speed of most tasks, if the object is in the cache. The only tricky part will be cache invalidation on update.
Proposition 3
Since the IcingaObject already always resolves all dependencies, let's generate one single query that fetches all necessary data and then map that internally to the objects. This will probably be the biggest commitment in terms of required work, but would also reap the most benefits. I would even go so far as to claim that it will get rid of many future performance problems related to database access as well, improving the performance of almost all parts of the application, all while massively simplifying the code base in the process. I already have an architecture in mind that should do the trick and would reduce the overhead caused by round-trip times and repeated database calls, compared to the current state, to almost zero.
I'd love to have a more involved conversation about all of this if you think this is a path worth pursuing, but it will probably require a rewrite of parts of the current logic to condense everything into a single query. |
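To make Proposition 1 more concrete, here is a minimal sketch of a per-request cache for resolved objects. It is written in Python purely for illustration and is not Director code; `CachingLoader`, `fake_fetch`, and the `(table, id)` cache key are hypothetical names and choices.

```python
class CachingLoader:
    """Memoize loaded objects by (table, id) for the lifetime of one request."""

    def __init__(self, fetch):
        # fetch is any callable(table, object_id) that performs the real DB lookup.
        self._fetch = fetch
        self._cache = {}
        self.queries = 0  # number of lookups that actually reached the backend

    def load(self, table, object_id):
        key = (table, object_id)
        if key not in self._cache:
            self.queries += 1
            self._cache[key] = self._fetch(table, object_id)
        return self._cache[key]


def fake_fetch(table, object_id):
    """Stand-in for a database round trip (hypothetical)."""
    return {"table": table, "id": object_id}


loader = CachingLoader(fake_fetch)

# 100 services that all reference the same host: only the first lookup hits the backend.
hosts = [loader.load("icinga_host", 1) for _ in range(100)]
print(loader.queries)
```

A cache like this lives only as long as the request, so it avoids the invalidation problem mentioned for Proposition 2, at the cost of not helping across requests.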
I personally prefer my third proposition, since it delegates the dependency resolution to the database, which is what it was designed for. Let me go into a bit more detail explaining what I have in mind. The query building relies on three steps:
I constructed three working queries to demonstrate what this will look like. (I ignored the zone id for this example, but that will work in the same way.) So for a trivial query we will get:
SELECT icinga_command.id FROM icinga_command;
If we want to fetch this command as part of a host, we can simply integrate that query and rename the field:
SELECT icinga_host.id, icinga_command.id AS `check_command.id`
FROM icinga_host
LEFT OUTER JOIN (
    SELECT icinga_command.id
    FROM icinga_command
) AS icinga_command
ON icinga_command.id = icinga_host.check_command_id;
In my test environment this returns the following result:
To show how we can take this even further, let's create a query with the same three basic steps:
SELECT icinga_service.id, icinga_host.id AS `host.id`, icinga_host.`check_command.id` AS `host.check_command.id`
FROM icinga_service
LEFT OUTER JOIN (
    SELECT icinga_host.id, icinga_command.id AS `check_command.id`
    FROM icinga_host
    LEFT OUTER JOIN (
        SELECT icinga_command.id
        FROM icinga_command
    ) AS icinga_command
    ON icinga_command.id = icinga_host.check_command_id
) AS icinga_host
ON icinga_service.host_id = icinga_host.id;
With that, the result looks as follows:
Then once we get the query result, we can map the received fields with two simple steps:
We can then construct the child classes with the following steps:
I am well aware that this will be a big change to the process logic. It might have some downstream effects and will have to be thoroughly tested, but I do firmly believe that the long-term benefits and the performance gains are worth the work. |
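The idea in this comment can be tried out on a toy schema. The sketch below is an illustration under several assumptions, not the proposed implementation: it uses Python with an in-memory SQLite database instead of MariaDB, the tables contain only the columns referenced in the queries above, flat LEFT JOINs stand in for the nested subqueries, and the helper names (`nest`, `load_per_object`, `load_joined`) are hypothetical. It compares per-object lookups with one joined query and maps the dotted column names back into nested structures.

```python
import sqlite3


def nest(row):
    """Map dotted column names to nested dicts,
    e.g. 'host.check_command.id' -> row['host']['check_command']['id']."""
    result = {}
    for column, value in row.items():
        *parents, leaf = column.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Toy schema: only the columns used in this thread's example queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE icinga_command (id INTEGER PRIMARY KEY);
    CREATE TABLE icinga_host (id INTEGER PRIMARY KEY, check_command_id INTEGER);
    CREATE TABLE icinga_service (id INTEGER PRIMARY KEY, host_id INTEGER);
    INSERT INTO icinga_command (id) VALUES (1);
    INSERT INTO icinga_host (id, check_command_id) VALUES (10, 1);
""")
conn.executemany(
    "INSERT INTO icinga_service (id, host_id) VALUES (?, ?)",
    [(i, 10) for i in range(1, 101)],  # 100 services on one host
)


def load_per_object(conn):
    """Resolve each dependency with its own query (one round trip per lookup)."""
    queries = 1
    services = conn.execute(
        "SELECT id, host_id FROM icinga_service ORDER BY id").fetchall()
    result = []
    for service_id, host_id in services:
        host = conn.execute(
            "SELECT id, check_command_id FROM icinga_host WHERE id = ?",
            (host_id,)).fetchone()
        command = conn.execute(
            "SELECT id FROM icinga_command WHERE id = ?", (host[1],)).fetchone()
        queries += 2
        result.append({"id": service_id,
                       "host": {"id": host[0], "check_command": {"id": command[0]}}})
    return result, queries


JOINED_QUERY = """
    SELECT s.id AS "id", h.id AS "host.id", c.id AS "host.check_command.id"
    FROM icinga_service s
    LEFT OUTER JOIN icinga_host h ON s.host_id = h.id
    LEFT OUTER JOIN icinga_command c ON c.id = h.check_command_id
    ORDER BY s.id
"""


def load_joined(conn):
    """Fetch everything in one query, then rebuild the nesting from column names."""
    cursor = conn.execute(JOINED_QUERY)
    columns = [description[0] for description in cursor.description]
    return [nest(dict(zip(columns, row))) for row in cursor], 1


per_object, per_object_queries = load_per_object(conn)
joined, joined_queries = load_joined(conn)
print(per_object_queries, joined_queries)
```

Both loaders produce the same nested objects; the difference is only in how many statements are sent to the database, which is where the round-trip overhead discussed above comes from.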
ref/IP/43981 |
Hi @Thomas-Gelf, we noticed that with the fix it is no longer possible to assign a service to a service group, even if the filter is set or changed.
The onStore method is called after the modifiedProperties have been emptied, which causes the memberShipShouldBeRefreshed method to always return false.
What we propose is to call the Mattia |
Should now be fine |
Describe the bug
When adding a new servicegroup, after clicking Add and waiting more than 30 seconds, you get a "Service Unavailable" error:
After this error the servicegroup is created correctly anyway.
To Reproduce
Expected behavior
The servicegroup should be created straightforwardly, without running a large number of apparently unnecessary queries against the DB.
Screenshots
Your Environment
icinga2 --version: 2.10.5-1
php --version: 7.1.8