[fast-reboot] fast-reboot downtime period exceeds 30 seconds on master/202012 branches #7140
Comments
@kcudnik can you please check this issue?
@yxieca hey, can you help here also?
There are 3 recording files: sairedis.rec-201911, sairedis.rec-202012-sync_off, and sairedis.rec-202012-sync_on.
yes, this is exactly the meaning.
From 201911:
from init view at 15:53:06 to the 1st GET response after switch create at 15:53:26, the entire process takes about 20 seconds, and switch create about 18 seconds.

In 202012 sync off:
from init view at 09:57:25 to the first GET response at 09:57:47, it takes 22 seconds, of which create switch is about 19 seconds (PS. init view takes about 3-4 seconds here, which seems strange, since during that time syncd is basically clearing a std::map which is empty at that point; maybe it's not running yet? syslog would be helpful).

In 202012 sync on:
from init view at 10:48:08 to the first GET response at 10:48:30, it takes about 22 seconds, and switch create takes about 18 seconds.

In all 3 cases switch create takes less than 30 seconds, and I assume the create_switch API, when called, is what actually causes the dataplane to be disrupted (unless sai_api_initialize is also causing something else). I don't have a deeper understanding of how those disruptions are measured, but from the create_switch perspective everything seems fine, and after that everything should already be available to the user: GET APIs should work, ports should be up, etc. PS. it would also be helpful if we could get syslog recordings from that.
I'm assuming we are talking about a warm boot scenario here, since dataplane downtime is considered in warm boot scenarios, but no "warm boot" was mentioned in the main post. Also, how is this time 44.903033 actually measured?
I don't have the syslogs, but I will re-run the tests and add them.
I think we will need @yxieca to jump in here and explain the fast boot scenario to me.
Maybe this document will help to understand the scenario.
To help solve this I will need the exact moments when the dataplane is disrupted and when it gets fully operational again, and whether those 2 moments can be determined from the sairedis.rec files.
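For what it's worth, here is a minimal sketch of how those two moments could be pulled out of a recording, assuming we can name a start and an end marker line (the marker arguments and the recdelta name are placeholders, not existing tooling; the timestamp format matches the attached recordings):

```cpp
// recdelta.cpp - print the time between the first line matching a start
// marker and the last line matching an end marker in a sairedis.rec file.
// Marker strings are placeholders; timestamps look like
// "2021-06-20.14:03:18.867744" in the attached recordings.
// strptime/timegm are POSIX/glibc, which is fine on SONiC.
#include <fstream>
#include <iostream>
#include <string>
#include <ctime>

static double toEpoch(const std::string& ts)
{
    std::tm tm{};
    strptime(ts.substr(0, 19).c_str(), "%Y-%m-%d.%H:%M:%S", &tm);
    return timegm(&tm) + std::stod("0" + ts.substr(19)); // fractional part
}

int main(int argc, char** argv)
{
    if (argc < 4)
    {
        std::cerr << "usage: recdelta <sairedis.rec> <startMarker> <endMarker>\n";
        return 1;
    }

    std::ifstream rec(argv[1]);
    std::string line, startTs, endTs;

    while (std::getline(rec, line))
    {
        std::string ts = line.substr(0, line.find('|'));
        if (startTs.empty() && line.find(argv[2]) != std::string::npos)
            startTs = ts;              // first occurrence of the start marker
        if (line.find(argv[3]) != std::string::npos)
            endTs = ts;                // last occurrence of the end marker wins
    }

    if (startTs.empty() || endTs.empty())
    {
        std::cerr << "marker not found\n";
        return 1;
    }

    std::cout << toEpoch(endTs) - toEpoch(startTs) << " seconds\n";
    return 0;
}
```

Once we know which marker lines actually bound the disruption window, this could be run with, e.g., the switch create line as the start marker and the last hostif oper-status line as the end marker.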
@kcudnik I have added to the ticket the test log and full system dump with both 'synchronous_mode' enabled and disabled.
In both cases I see:

which takes about 2 seconds, so those messages seem not to reflect the data plane downtime. I would need some syslog indicators from which I could see when the dataplane goes down, and when it comes up again after this long absence; then I could investigate what else is going on. Also, from syncd process start (before api_initialize) to returning the first GET to orchagent, syncd takes less than 30 seconds in both sync and async mode.

In async mode it takes about 18 seconds, and in sync mode about 19 seconds, to return the first GET. After this I assume the dataplane is already up, since the SAI API was initialized, the switch was created, and syncd was already responding to GETs; this would suggest that the dataplane is down during a different time period than just the syncd process (API initialize and switch create). I don't fully understand what determines whether the data plane is running or not; do the ports need to be explicitly brought up? I would need @yxieca to jump in here and explain.
@shi-su, does this match your observation?
The fact that 202012 has longer dataplane downtime than 201911 and some of the operations are taking longer matches my observation. On the other hand, I am not sure if the synchronous mode makes a difference by 5 seconds. I did see that sync mode is slower, but the difference appears to be smaller. I checked the recording file in the ticket. I am not sure why there are no route entries in the recording of async mode, so I took the creation of the first neighbor with IP 10.0.0.x as a milestone. The difference between sync and async mode seems to be below 2 seconds.
Some of the downtime data I collected from 202012 fast reboot tests is below.
So how is this time measured? Based on port notification arrival?
What I noticed is that in results.gz, in the 201911 sairedis recording, there is not a single route created; the same holds in 202012 when sync mode is off, while when sync mode is on, routes are actually created. Why didn't OA create routes at all? And how did the dataplane then recover enough to pass the test?
My suspect here is part of OA or the vendor SAI execution APIs (including switch create). Based on the virtual switch on amd64 arch, the sairedis/syncd pipeline is able to push 13k routes per second in sync mode (end to end); on top of that there will be added processing time in vendor SAI, as well as how quickly OA can push data to the sairedis library. I can add performance timers in syncd for each API to see how long it takes; the route_entry APIs already have those performance timers and display them in syslog for every 10k routes pushed. But for the OA side, I will need the maintainers of that code to take a look.
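As an illustration of the shape such a per-API timer could take (names here are mine, not the actual syncd code): accumulate elapsed time per call and emit a syslog line every 10k operations.

```cpp
// Hypothetical sketch of a per-API performance timer, in the spirit of the
// existing route_entry timers in syncd; names here are illustrative only.
#include <chrono>
#include <cstdint>
#include <syslog.h>

class ApiTimer
{
public:
    explicit ApiTimer(const char* apiName, uint64_t reportEvery = 10000)
        : m_name(apiName), m_reportEvery(reportEvery) {}

    // Wrap each SAI API invocation; returns whatever the call returns.
    template <typename Fn>
    auto measure(Fn&& fn)
    {
        auto start = std::chrono::steady_clock::now();
        auto status = fn();
        m_total += std::chrono::steady_clock::now() - start;

        if (++m_count % m_reportEvery == 0)
        {
            double sec = std::chrono::duration<double>(m_total).count();
            syslog(LOG_NOTICE, "%s: %lu calls took %.3f sec (%.0f ops/sec)",
                   m_name, (unsigned long)m_count, sec, m_count / sec);
        }
        return status;
    }

private:
    const char* m_name;
    uint64_t m_reportEvery;
    uint64_t m_count = 0;
    std::chrono::steady_clock::duration m_total{};
};
```

Wrapping each route_entry create in something like timer.measure(...) would then surface end-to-end throughput in syslog, the same way the existing 10k-route counters do.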
@mlnx, one observation with the 201911 image is that create switch takes 9 sec vs ~18 sec in 202012. Would you please help us understand why there is such a big delay in create switch? Logs from 201911:
Adding some more logs based on:
- ACS-MSN2700 - 201911
- ACS-MSN4600C - 201911
- ACS-MSN2700 - 202012
- ACS-MSN4600C - 202012
Can you describe which ones are in sync/async mode?
They are all with sync mode enabled.
@shlomibitton hey, if you have time this week, we can set up another session and maybe try running syncd + saiplayer only, to eliminate OA and other stuff, and just see how fast all notifications arrive. This will take some time but we can do it over a meeting session. What do you think?
Hey @kcudnik, I will try to find a free slot this week and update you.
Hey, can you specify a slot this week?
@kcudnik sorry, I didn't find time this week.
sure
Hey @kcudnik, sorry for the late response, I couldn't find an empty time slot...
Sure, could you send an invite?
Sent
Update on the latest findings for this issue: I measured timings and profiled this process. Profiling the daemon proves that:

From bootchart we can see this is running exactly when orchagent resets the ASIC; probably on the 201911 branch the timing is different, and this is why we don't see this issue. The conclusion is that delaying this daemon for 90 seconds (the fast-reboot control plane and data plane maximum recovery time) will improve the fast-reboot flow by ~6-7 seconds.
Thanks for the in-depth investigation. Just to clarify: that 660 ms is the process running, and this time is not "included" in OA/syncd time, it just slows down the entire system. It would be nice to have a timeline of each run of this process, plus CPU load and each notification arrival, plotted on the same graph, so we could visually observe the CPU load between the 1st and last port UP notification. Also, not sure, but hostcfgd could probably be slowed down by running it with some fixed sleep time to give time for OA; or maybe better, set nice -19 on this process, and for syncd and OA we could add nice +19 for, say, 1 min, to grant them the highest priority on start, and after 1 min bring nice back to default. Actually maybe even syncd alone would be enough. What do you think, shall we try OA+syncd on nice +20?
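As a sketch of the priority idea, assuming we adjust the process's own nice value at startup and restore it after a fixed window (note that on Linux a lower nice value means higher priority, -20 being the highest, and boosting priority requires root or CAP_SYS_NICE):

```cpp
// Sketch of the priority-boost idea for syncd/orchagent startup.
// Purely illustrative, not existing SONiC code.
#include <sys/resource.h>
#include <thread>
#include <chrono>

static void boostPriorityForStartup(std::chrono::seconds window)
{
    // Highest priority while the fast-boot critical path runs.
    setpriority(PRIO_PROCESS, 0 /* self */, -20);

    std::thread([window]() {
        std::this_thread::sleep_for(window);
        // Back to the default nice value once startup settles.
        setpriority(PRIO_PROCESS, 0 /* self */, 0);
    }).detach();
}

// e.g. near the top of main() in syncd/orchagent:
// boostPriorityForStartup(std::chrono::seconds(60));
```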
Looks like the delay is related to https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L930
@shlomibitton thanks for providing the 4 dumps of 202012 logs from the 2700, with sync mode on and off. In the async mode case, all hostif oper status updates come one after another; the scheduler_group GETs still happen, in the same amount, but before all the hostif oper status SETs, between the last PORT_ATTR_ADMIN_STATE=true and the last HOSTIF_OPER_STATUS=true.
201911, just for comparison, assuming that this and all previously attached logs were taken on the same 2700 device (different from the one on which the issue was first reported on GitHub). Here create_switch takes 12 seconds (which is significantly more compared to ~8 sec previously; it could be due to a different number of ports). From create switch to last oper status:

2021-06-22.12:40:43.416433|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000

Where create switch takes 12 seconds, and the rest another 9 seconds, in total about 21 seconds.

202012 async mode:

2021-06-21.14:23:47.540164|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000

Where create switch takes 14 seconds, and the rest another 14 seconds, in total about 28 seconds.

And 202012 sync mode:

2021-06-20.14:03:18.867744|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000

Where create switch takes 14 seconds, the rest about 19 seconds, in total about 34 seconds.

During a debug session with Shlomi, we noticed that
I have a suspicion here that, because this fdb recreate blocks the entire OA flow for 2 whole seconds, some producers could already have produced something in the redis database; this wakes up the "select" which listens on a list of sub-orchagents in the main orch, and potentially epoll/select will not guarantee order. When a sub-orch is popped it is processed, and then the entire list is processed in order by doTask(), and tasks are added by the refillToSync() and addToSync() methods in OA. I'm investigating the sairedis and swss logs to see whether this is the case, and comparing the 201911 branch with 202012 to check whether we use the same features on both branches, to see if it makes sense to compare async 201911 directly to async 202012. Will share results later.
One big thing that I noticed, as mentioned: because OA is listening on notifications from sairedis, and not relying on the port_state_change notification switch callback, the notification will be processed by OA only after select returns, and that happens much later, after processing all doTask() operations.

Sync mode 202012 (28 notifications in total, I just recorded the first and last):

2021-06-20.14:03:41.139289|n|port_state_change|[{"port_id":"oid:0x100000000088b","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
…

In OA, notifications are processed in void PortsOrch::doTask(NotificationConsumer &consumer) (portsorch.cpp), and a log line shows up in syslog for each port:

Jun 20 14:03:47.494839 arc-switch1004 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000088b status:1
…

And here is where hostif is brought up:

2021-06-20.14:03:47.495207|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd00000000097a|SAI_HOSTIF_ATTR_OPER_STATUS=true
…

Notice here: all notifications arrived between seconds 41-42 (sairedis record) but are processed between seconds 47-52 in doTask (which is also what brings the hostif interface up).

In swss.log there is an 11 second gap after this:

2021-06-20.14:03:40.379744|PORT_TABLE:Ethernet78|SET|alias:etp20b|description:etp20b|index:20|lanes:78,79|mtu:9100|pfc_asym:off|speed:50000|admin_status:down

and OA is busy during that time (the lack of any logs is probably because all necessary tasks were already queued and it is now only processing them).

You can also see where the 500 fdb entries are queued, and when they are processed (in sync mode it takes 2 sec to push them):

2021-06-20.14:03:45.346041|c|SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000009c0","mac":"72:06:00:01:00:00","switch_id":"oid:0x21000000000000
…

I think this seems normal, since all the "known" configuration will be queued right away during OA start, and it will start accepting new entries once it has processed the current queue. A similar thing happens to SCHEDULER_GROUPS:

2021-06-20.14:03:38.344780|QUEUE|Ethernet32|0|SET|scheduler:[SCHEDULER|scheduler.0]
…

And here is the main difference between async/sync mode: in sync mode we are waiting for each specific API status code to return from syncd, while in async mode all create/remove/set are almost instantaneous (like 200 ms for fdb entries vs 2 seconds), so OA is able to process those tasks much faster, since it doesn't need to wait for a response.

This OA processing is actually by design, as seen in orchdaemon.cpp:

571 auto *c = (Executor *)s;

For each execute() on a specific select result, it executes doTask() on all of m_orchList, and it takes some time until all of that is processed and the first notification is picked up from the queue.

And comparing the longest create/set flow:

201911:
2021-06-22.12:40:58.996114|g|SAI_OBJECT_TYPE_PORT:oid:0x1000000000429|SAI_PORT_ATTR_SUPPORTED_SPEED=25:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2.1k operations take 0.9 seconds + 2 seconds wait to finish execution on syncd =~ 3 sec total

202012 async:
2021-06-21.14:24:07.683322|g|SAI_OBJECT_TYPE_PORT:oid:0x10000000003a5|SAI_PORT_ATTR_SUPPORTED_SPEED=25:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2k operations take 0.8 seconds + 3.2 seconds wait to finish execution on syncd =~ 4 sec total

202012 sync:
2021-06-20.14:03:40.172097|g|SAI_OBJECT_TYPE_PORT:oid:0x10000000003a5|SAI_PORT_ATTR_SUPPORTED_SPEED=25:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2k operations take 7.3 seconds (there is no wait time, since all APIs are synchronous; we could try to use zmq sync mode to see if that would be faster,
but because of an OA bug (sonic-net/sonic-swss#1145) notifications with zmq will not trigger the OA select, so hostif would not be set to UP).
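To make the flow above concrete, here is a toy, compilable paraphrase of that loop (all types here are stubs; the real logic lives in sonic-swss orchdaemon.cpp):

```cpp
// Toy model of the OrchDaemon flow described above, not the literal
// sonic-swss source: every select() wakeup first runs the ready executor,
// then a full doTask() pass over every orch, so a queued port_state_change
// notification waits behind the whole backlog.
#include <cstdio>
#include <vector>

struct Selectable { virtual ~Selectable() = default; };

struct Executor : Selectable
{
    const char* event;
    explicit Executor(const char* e) : event(e) {}
    void execute() { std::printf("execute(): %s\n", event); }
};

struct Orch
{
    const char* name;
    int queued; // tasks already queued via refillToSync()/addToSync()
    void doTask()
    {
        std::printf("doTask(%s): %d queued tasks processed\n", name, queued);
        queued = 0;
    }
};

int main()
{
    std::vector<Orch> m_orchList = {
        {"PortsOrch", 2000}, {"FdbOrch", 500}, {"QosOrch", 1500}};

    // Wakeup 1: some producer's data is ready; the port notification that
    // arrives during this pass can only be served by a *later* select().
    Executor ready("APPL_DB consumer data");
    Selectable* s = &ready;

    auto* c = (Executor*)s;  // cf. orchdaemon.cpp:571 quoted above
    c->execute();            // handle the event that woke us up
    for (auto& o : m_orchList) o.doTask(); // full pass over every orch

    // Wakeup 2: only now is the notification consumer selected.
    Executor notif("port_state_change SAI_PORT_OPER_STATUS_UP");
    notif.execute(); // -> PortsOrch::doTask(NotificationConsumer&) sets hostif UP
}
```

The point of the toy: the port_state_change consumer is just another Selectable, so its execute() can only run on a wakeup after the current full doTask() pass, which is exactly the 41s-to-47s gap seen in the logs above.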
We spent some time with Shlomi figuring out what's happening during fast boot in sync mode, and why it was extending over 30 seconds. Here are our findings so far. Just looked at those logs (these are 202012 with qos orch removed from OA for testing), and in
PS. the above times could be even lower (an estimated 1-2 seconds lower, just by disabling our added XXX debug logs, since they pollute syslog, adding an unnecessary 30k syslog lines which impact overall system performance), so we are already under 30 seconds here (ideally we would like to be around 20-22 seconds, to have some headroom for any hiccups during fast boot). Previously (with qos enabled):
A little background, a high level overview of the OA daemon:
So what we found:
Execution stats for each orch (with at least 10 ms), based on the latest 202012 logs from Shlomi, with execution order preserved:

(ASYNC): time in seconds

(SYNC): time in seconds

In our case it so happens that we started to receive notifications during the QosOrch execution, and we must wait until it finishes. We need a meeting about OA to discuss how we could address this issue by processing notifications when they arrive (a sketch of what such an explicit check could look like follows below). I am also aware that processing port notifications as they arrive may not be beneficial (since we may be in the middle of port processing, etc.), but in that case the user would need to explicitly put checks inside doTask in each orch.

PS: @shlomibitton I found this interesting stuff in the async log:
Seems like syslog was not fast enough and some logs were lost, missing the "after doTask" lines (since from docker, syslog is sent via UDP), and 4 seconds of syslog disappeared.
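Regarding processing notifications as they arrive: purely as a sketch of the "explicit check during doTask" idea mentioned above (hypothetical code, not the existing OA implementation), a long doTask() loop could periodically drain the notification consumer:

```cpp
// Hypothetical shape of the "explicit check" idea; all types here are
// simplified stand-ins for the swss ones.
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

struct NotificationConsumer
{
    std::deque<std::string> pending;
    bool hasData() const { return !pending.empty(); }
    std::string pop() { auto n = pending.front(); pending.pop_front(); return n; }
};

void doTask(std::deque<std::string>& toSync,
            NotificationConsumer& notifications,
            const std::function<void(const std::string&)>& handleNotification)
{
    std::size_t processed = 0;
    while (!toSync.empty())
    {
        /* process one queued entry */ toSync.pop_front();

        // Every 1000 entries, service notifications that arrived meanwhile,
        // so a port-up event is not stuck behind e.g. a QosOrch backlog.
        if (++processed % 1000 == 0)
            while (notifications.hasData())
                handleNotification(notifications.pop());
    }
}
```

The caveat from above still applies: draining notifications mid-pass is only safe if the orch's own state is consistent at that point, which is exactly the check the user would have to add.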
Fixes provided in PRs:
In a meeting with Prince I noticed that there is a lot of redundancy:
This one is called 7 times!!! And not only this one; it's the same for all scheduler groups.
If this could be eliminated then we would cut QOS from 6 sec to less than 1 second, which would satisfy fast boot.
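For illustration, the deduplication could be as simple as memoizing GET results per scheduler-group OID (a hypothetical helper, not existing OA code; SchedulerGroupCache and the types below are made up for this sketch):

```cpp
// Hypothetical memoization of scheduler-group attribute GETs: repeated
// queries for the same OID return the cached value instead of another
// round trip through sairedis/syncd.
#include <cstdint>
#include <unordered_map>
#include <vector>

using ObjectId = std::uint64_t;
using AttrValues = std::vector<std::uint64_t>;

class SchedulerGroupCache
{
public:
    // realGet stands in for the actual SAI query (the expensive call that
    // the logs show being repeated 7 times per group).
    template <typename GetFn>
    const AttrValues& get(ObjectId oid, GetFn&& realGet)
    {
        auto it = m_cache.find(oid);
        if (it == m_cache.end())
            it = m_cache.emplace(oid, realGet(oid)).first;
        return it->second;
    }

    void clear() { m_cache.clear(); } // e.g. invalidate after reconfiguration

private:
    std::unordered_map<ObjectId, AttrValues> m_cache;
};
```

If the repeated queries after the first all hit the cache, most of the redundant round trips disappear, which is where the 6 sec to under 1 second estimate above would come from.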
Description
Fast-reboot downtime period exceeds 30 seconds when running 202012/master based images.
It works well when running on top of a 201911-based image.
Attached to this ticket are sairedis.rec files and bootchart graphs for the 201911/202012 branches; for 202012 there are two versions, with synchronous_mode enabled/disabled.
results.gz
When enabling synchronous_mode on 202012, the result is ~5 seconds more compared to disabling synchronous_mode.
A timing analysis based on the sairedis.rec files, comparing 201911 and 202012 with synchronous_mode disabled:
Steps to reproduce the issue:
Describe the results you received:
Summary:
Longest downtime period was 0:00:44.903033
Reboot time was 0:01:32.419481
Expected downtime is less than 0:00:30
How many packets were received back when control plane was down: 500 Expected: 500
Additional info:
INFO:10.213.84.68:PortChannel interface state changed 1 times
INFO:10.213.84.65:PortChannel interface state changed 1 times
INFO:10.213.84.67:PortChannel interface state changed 1 times
INFO:10.213.84.66:PortChannel interface state changed 1 times
Fails:
FAILED:dut:Longest downtime period must be less then 30 seconds. It was 0:00:44.903033
Describe the results you expected:
Downtime should be less than 30 seconds; on 201911 it is around 23 seconds.
dumps + test logs
sonic_dump_202012_sync_on.tar.gz
fast-reboot_test_202012_sync_on.log
sonic_dump_202012_sync_off.tar.gz
fast-reboot_test_202012_sync_off.log