Reactor - No matching sls found for 'orch.minion-onBoot' in env 'base' #47539
Comments
I'm having a hard time replicating this. Does this occur every time you create 2 minions? If you just restart two minions at the same time, do you see the same behavior? That should trigger the reactor as well. |
Hi @Ch3LL, thank you for your involvement in this case. To be honest I was surprised that this issue came up now, but then I figured out that a week ago I upgraded Salt to version 2018.3.0 (where a problem with token generation for Vault was reported) and then downgraded to 2017.7.5, at which point the problem showed up. After moving back to the initial version, 2017.7.4, everything worked normally, so I think the bug was introduced in version 2017.7.5. I will try to prepare some easy steps to reproduce it, because I noticed the same problem when there were 10 hits to the API, which also calls the orchestration. |
Yeah, if you can include an easy use case that would be awesome. I tried replicating by using the states you provided and starting up 2 minions at the same time to try to hit the reactor at the exact same time. In the meantime, while you are preparing the use case: @saltstack/team-core any ideas on why this might be occurring? |
ping @DmitryKuzmenko can you take a look here? |
I can confirm that the problem was introduced in 2017.7.5. Works fine in 2017.7.4. |
I can confirm getting the same error fairly consistently on 2018.3.0 when running an orchestration in reaction to cloud 'created' events while deploying a cloud in parallel. The error seems to go away if I configure |
I did a bit of digging into how this all works, and the problem might be that something global is being shared across threads. Edit: By this I mean in the ReactWrap class, which is specifically using threads. Changing this line to the following appears to alleviate my issues:
|
@calc84maniac This sounds like the issue here. In my case, I am trying to create 2 cloud servers in quick succession, where both salt/cloud/*/created events kick off orchestrations via the reactor. I have tried to set 'multiprocessing' to False like you suggested, but my reactor still dies. Our case relies heavily on the reactor for post-cloud-instance-creation processing, which now fails because of the reactor throwing the error above. We also have other reactor events that are now not being processed :( The only thing that fixes the situation is a salt-master restart. Where would I set |
I am in the same boat and have disabled multiprocessing for now, but I cannot really say yet whether the problem comes back. |
Sorry, after further testing it does look like that didn't fully fix the problem after all. But I feel like it's still some kind of race condition with the loader, regardless.
|
Hi @Ch3LL, I am sorry for the late answer; for the last few weeks I have been very busy implementing a new project.
First way to reproduce:
I prepared some steps to reproduce this issue, but I am not sure if this is the best approach, because it may be related to another issue/bug. I mentioned before that I had this issue earlier with hits to the API (which also uses the reactor), but I solved it with a workaround, so maybe this is useless; someone needs to verify whether this approach is valid.
Config of reactor
rest_cherrypy:
  port: 4080
  host: 0.0.0.0
  webhook_url: /hook
  webhook_disable_auth: True
  ssl_crt: /etc/pki/tls/certs/salt01-tech.inet.crt
  ssl_key: /etc/pki/tls/private/salt01-tech.inet.key
  log_access_file: /var/log/salt/access_api.log
  log_error_file: /var/log/salt/error_api.log
reactor:
  ### bug test
  - 'salt/netapi/hook/bugtest':
    - /opt/salt/salt-pillar/reactor/bugtest.sls
Reactor
bugtest_event:
  runner.state.orchestrate:
    - args:
      - mods: orch.bugtest
      - pillar:
          event_data: {{ data|json }}
orchestration
{%- set fqdn = salt.grains.get("fqdn") %}
{%- set hook = salt.pillar.get("event_data",False) %}
{%- import "system/module_slack.sls" as slack %}
{%- set service = "Salt-Orchestration - Percona" %}
{%- if hook != False %}
{%- set RemoteAddr = hook["headers"]["Remote-Addr"] %}
cmd.run:
  salt.function:
    - tgt: 'salt01-tech.inet'
    - arg:
      - echo test
{%- else %}
Echo_percona:
  cmd.run:
    - name: echo "No Data"
{% endif %}
I started the following command on 2 VMs:
for i in {1..100}; do curl -sSk https://salt01-tech.inet:4080/hook/bugtest -H 'Accept: application/x-yaml' -d 'cluster: sql-prd-shared1'; done
Log from debug
2018-06-10 18:59:08,358 [salt.utils.lazy :97 ][DEBUG ][24046] LazyLoaded jinja.render
2018-06-10 18:59:08,358 [salt.utils.event :728 ][DEBUG ][24046] Sending event: tag = salt/run/20180610185908347361/new; data = {'fun': 'runner.state.orchestrate', 'fun_args': [{'pillar': OrderedDict([('event_data', OrderedDict([('_stamp', '2018-06-10T16:54:34.530484'), ('body', ''), ('headers', OrderedDict([('Accept', 'application/x-yaml'), ('Content-Length', '24'), ('Content-Type', 'application/x-www-form-urlencoded'), ('Host', 'salt01-tech.inet:4080'), ('Remote-Addr', '10.201.34.125'), ('User-Agent', 'curl/7.29.0')])), ('post', OrderedDict([('cluster: sql-prd-shared1', '')]))]))]), 'mods': 'orch.bugtest'}], 'jid': '20180610185908347361', 'user': 'Reactor', '_stamp': '2018-06-10T16:59:08.357805'}
2018-06-10 18:59:08,362 [salt.loaded.int.rawmodule.state:2036][DEBUG ][25798] Remaining event matches: -975
2018-06-10 18:59:08,361 [salt.utils.lazy :97 ][DEBUG ][24046] LazyLoaded yaml.render
2018-06-10 18:59:08,361 [salt.config :1954][DEBUG ][24046] Reading configuration from /etc/salt/master
2018-06-10 18:59:08,364 [salt.fileclient :1072][DEBUG ][24046] Could not find file 'salt://orch/bugtest.sls' in saltenv 'base'
2018-06-10 18:59:08,369 [salt.fileclient :1072][DEBUG ][24046] Could not find file 'salt://orch/bugtest/init.sls' in saltenv 'base'
2018-06-10 18:59:08,369 [salt.template :48 ][DEBUG ][24046] compile template: False
2018-06-10 18:59:08,370 [salt.template :62 ][ERROR ][24046] Template was specified incorrectly: False
2018-06-10 18:59:08,377 [salt.utils.lazy :97 ][DEBUG ][24046] LazyLoaded config.get
2018-06-10 18:59:08,378 [salt.loaded.int.render.yaml:76 ][DEBUG ][24046] Results of YAML rendering:
OrderedDict([('include', ['directory.consul', 'directory.consul_cache']), ('directory', OrderedDict([('UID', OrderedDict([('saltmaster', OrderedDict([('uid', 'root')]))]))]))])
2018-06-10 18:59:08,378 [salt.template :26 ][PROFILE ][24046] Time (in seconds) to render '/opt/salt/salt-pillar/pillar/directory/saltmaster.sls' using 'yaml' renderer: 0.0239288806915
2018-06-10 18:59:08,380 [salt.template :48 ][DEBUG ][24046] compile template: /opt/salt/salt-pillar/pillar/directory/consul.sls
2018-06-10 18:59:08,379 [salt.utils.event :728 ][DEBUG ][24046] Sending event: tag = salt/run/20180610185848992206/ret; data = {'fun_args': [{'pillar': OrderedDict([('event_data', OrderedDict([('_stamp', '2018-06-10T16:54:33.650618'), ('body', ''), ('headers', OrderedDict([('Accept', 'application/x-yaml'), ('Content-Length', '24'), ('Content-Type', 'application/x-www-form-urlencoded'), ('Host', 'salt01-tech.inet:4080'), ('Remote-Addr', '10.201.34.125'), ('User-Agent', 'curl/7.29.0')])), ('post', OrderedDict([('cluster: sql-prd-shared1', '')]))]))]), 'mods': 'orch.bugtest'}], 'jid': '20180610185848992206', 'return': {'outputter': 'highstate', 'data': {'salt01-tech.inet_master': ["No matching sls found for 'orch.bugtest' in env 'base'"]}, 'retcode': 1}, 'success': False, '_stamp': '2018-06-10T16:59:08.379624', 'user': 'Reactor', 'fun': 'runner.state.orchestrate'}
state.event
[DEBUG ] Remaining event matches: -975
salt/run/20180610185848992206/ret {
"_stamp": "2018-06-10T16:59:08.379624",
"fun": "runner.state.orchestrate",
"fun_args": [
{
"mods": "orch.bugtest",
"pillar": {
"event_data": {
"_stamp": "2018-06-10T16:54:33.650618",
"body": "",
"headers": {
"Accept": "application/x-yaml",
"Content-Length": "24",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "salt01-tech.omarsys.inet:4080",
"Remote-Addr": "10.201.34.125",
"User-Agent": "curl/7.29.0"
},
"post": {
"cluster: sql-prd-shared1": ""
}
}
}
}
],
"jid": "20180610185848992206",
"return": {
"data": {
"salt01-tech.omarsys.inet_master": [
"No matching sls found for 'orch.bugtest' in env 'base'"
]
},
"outputter": "highstate",
"retcode": 1
},
"success": false,
"user": "Reactor"
}
Please do not judge my code and its logic here; I just wanted to create something to reproduce the issue :)
What is important:
Second way to reproduce:
My idea is that you can put a task into the crontab on 2 VMs that will execute hits to the Reactor at the same time (if NTP is synchronized), let's say every 2 minutes. This should work, but I didn't test it. I will try to do it in the next few days and let you know about the results. |
Thanks for the additional details, that will help us dive in. Ping @DmitryKuzmenko, just a friendly reminder to add this to your bug list to fix :) |
@Ch3LL thank you. It's already in my icebox, just not the top priority at this moment, sorry. I'll work on it right after a few critical tasks. |
No worries, just wanted to make sure it was on your radar. Thanks :) |
ZD-2593 |
There is an issue [1] with the reactor in salt 2017; one of the workarounds is to set reactor workers to 1. [1] saltstack/salt#47539 Change-Id: I47d76cc1dc5d0afe6d8b215e2d32cdbab3ac1a8c Related-Prod: https://mirantis.jira.com/browse/PROD-21463
Is there any update on this issue please? Our infrastructure heavily relies on the reactor and I have to restart the salt-master process multiple times a day to get it working again :( |
There is not currently an update, but it is assigned out to an engineer. I don't believe @DmitryKuzmenko is working on it currently, as there are higher priority issues. |
Is there any more information needed? |
Nope, you're right, I need to re-label this. Thanks :) |
Currently salt doesn't allow getting confirmation on a minion upon successful reactor execution on an event. However, there can be issues with the reactor in salt 2017.7 [1], or the reactor register state can fail if pillar failed to render, so node registration confirmation may be needed. In order to enable this functionality, add the node_confirm_registration parameter to the event data with value true. [1] saltstack/salt#47539 Change-Id: I1abc4fdf172e018dcdb48abcf61532fb51ad3660 Related-Prod: https://mirantis.jira.com/browse/PROD-21463
@Ch3LL @DmitryKuzmenko is there any chance someone could look at this please? |
Unfortunately after creating many more minions in the cloud for my salt master, I have to report that setting |
Thank you very much for the updates @anitakrueger. This is definitely still on our radar and @DmitryKuzmenko will be working on this as soon as he can. |
@anitakrueger I believe this fix may help: https://github.com/saltstack/salt/pull/46641/files Are you willing to test it? |
This change was made between 2017.7.2 and 2017.7.3: #43817. Are any of y'all experiencing this issue using multiple environments for your state or orchestrate files? Thanks, |
I have replicated this issue. |
@cachedout I've looked at https://github.com/saltstack/salt/pull/46641/files, but we are running 2018.3.2 and this seems to be already merged in there. @gtmanfred We only use the base environment. I could never work multiple environments out properly for our use case. |
So, here is what I have found. What is happening is that the master worker threads are getting overloaded. These are the ones that the master uses to respond to fileserver requests from the minions and anything else that uses the remote fileserver. So there is really only one workaround: increase worker_threads so that the salt master can handle all of the fileserver requests that you expect to come in at one time. As for solutions that we can do, we are talking about two.
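As a minimal sketch of the worker_threads workaround described above, together with the reactor-workers-to-1 workaround mentioned earlier in the thread (the numbers are illustrative, not recommendations), both settings go in the master config, e.g. /etc/salt/master:

  # run more master workers so fileserver requests from minions and the reactor can be served in parallel
  worker_threads: 10
  # alternative workaround from earlier in the thread: serialize the reactor onto a single worker thread
  reactor_worker_threads: 1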
|
If I can figure out why this has to call get_state twice, make that not bad, and get it down to a single call, then using the local fileclient for state.orchestrate called from the master should be easy to implement.
|
I have been hit by this too. I've put in place a simple check to notify me when this happens until the fix is in place; I am sharing it here in case someone might want to make use of it:
reactor.conf:
salt/reactor/pong.sls:
check script:
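As a purely illustrative sketch of this kind of liveness check (the event tags, file paths, and state ID below are made up, not the author's), the reactor could map a ping tag to a state that fires a pong event back, and an external check script would then send the ping tag and alert if no pong event arrives within a timeout:

  # reactor.conf (hypothetical tag and path)
  reactor:
    - 'monitoring/reactor/ping':
      - /srv/reactor/pong.sls

  # /srv/reactor/pong.sls: answer the ping so the check script can see the reactor is alive
  reactor_pong:
    runner.event.send:
      - args:
        - tag: monitoring/reactor/pong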
|
Ok, so the other part of this problem is that, with all these calls happening at exactly the same time, they are all trying to open file handles to cache the same file to the same location, so we need to add some sort of file lock to make sure multiple reactor threads do not write to the cache at the same time. Then that should solve this issue completely. |
If the __role is master, then we should already have all the configs needed to get the files from the local machine, and from gitfs instead of having to request them from the fileserver on the master. Fixes the first part of saltstack#47539
Ok, we have figured out what is happening, and hopefully a fix for it, but it is going to take a bit to finish. @terminalmage is working on a fix. To render pillar files, we use the same logic that is used to render file_roots, so what we do is munge the data in opts so that file_roots becomes the pillar_roots and run it through the same code; those options are overwritten here. As you can see above that line, we do a deep copy of the opts dictionary, so it should not affect the opts that is used when pulling states. However, even with this, it appears that this assignment is still affecting the rendering of states for the reactor sometimes. Here is what we found yesterday: in one thread, when trying to grab the orchestration file, and sometimes the reactor state file, we see it trying to pull from the pillar roots.
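For context, file_roots and pillar_roots are separate master settings, and the race described above makes the reactor's state rendering look under the pillar roots instead of the file roots. A minimal sketch of the distinction, where the file_roots path is an assumption and the pillar path follows the one visible in the debug log earlier:

  file_roots:
    base:
      - /opt/salt/salt-states          # assumed location from which orch.bugtest / orch.minion-onBoot should be resolved
  pillar_roots:
    base:
      - /opt/salt/salt-pillar/pillar   # path seen in the debug log above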
So, our plan is to break this out so that it does not need to overwrite the file_roots to render pillar data; that is something Erik started earlier this year but hasn't finished yet. We hope to put that code into 2018.3 for the 2018.3.4 release. Now, for a workaround, there are two ways to get around this bug.
If you put your reactor SLS files into gitfs too, you can reference them with salt://, for example:
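A minimal sketch of that kind of reference, assuming the minion-onStart reactor SLS from the issue description is also exposed through gitfs under a reactor/ directory (that layout is an assumption):

  reactor:
    - 'salt/minion/*/start':
      - salt://reactor/minion-onStart.sls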
Thanks for your patience. |
Daniel, thank you SOOOO much for figuring this out! This has really been bugging us as we use the reactor heavily. I will give workaround #1 a go and report back. |
👍 happy to help. Erik has a commit that seems to fix the issue, and it should be included in 2018.3.4. |
I've opened #49987, see that PR for details. |
The above fix has been merged and it will be in the 2018.3.4 release. Thank you for your patience with helping us troubleshoot this. |
Description of Issue/Question
After a deep investigation into why I randomly receive the following message:
I figured out that it happens only when 2 or more minions hit the reactor at the same time.
It looks like some cache is cleared or something and is not regenerated in time when the second minion hits the Reactor.
I have an Auto Scaling Group in AWS which deploys VMs to the cloud and does the basic setup for the minion (updates the master address, sets up grains, and starts the service).
After that, the Salt master runs the highstate. The problem is that the Reactor works perfectly when there is only 1 new minion, but if I deploy 2 or more new machines at the same time, then the Reactor works only for the first one that makes the attempt; for the others, the Reactor throws an error that some SLS was not found.
To me it looks like after a hit on the Reactor there is some cache clearing, and the cache is not regenerated in time.
Flow of how the "communication" looks:
minion -> reactor -> orchestrate -> highstate for minion
Setup
Reactor config
reactor:
  - 'salt/minion/*/start':
    - /opt/salt/salt-pillar/reactor/minion-onStart.sls
Reactor state:
Orchestrate
Steps to Reproduce Issue
Create 2-3 new VMs
Versions Report