[fast-reboot] fast-reboot downtime period exceeds 30 seconds on master/202012 branches #7140
Comments
@kcudnik can you please check this issue?
@yxieca hey, can you help here also?
There are 3 recording files: sairedis.rec-201911, sairedis.rec-202012-sync_off, and sairedis.rec-202012-sync_on.
yes, this is exactly the meaning.
From 201911:
from init view at 15:53:06 to the 1st GET response after switch create at 15:53:26, the entire process takes about 20 seconds, and switch create about 18 seconds.

In 202012 sync off:
from init view at 09:57:25 to the first GET response at 09:57:47, it takes 22 seconds, of which create switch is about 19 seconds (PS. init view takes about 3-4 seconds here, which seems strange, since during that time syncd is basically clearing a std::map which is empty at that point; maybe it's not running yet? syslog would be helpful).

In 202012 sync on:
from init view at 10:48:08 to the first GET response at 10:48:30, it takes about 22 seconds, and switch create takes about 18 seconds.

In all 3 cases switch create takes less than 30 seconds, and I assume the create_switch API, when called, is what actually causes the dataplane to be disrupted (unless sai_api_initialize is also causing something else). I don't have a deeper understanding of how those disruptions are measured, but from the create_switch perspective everything seems fine, and after that everything should already be available to the user: GET APIs should work, ports should be up, etc. PS. it would also be helpful if we could get syslog recordings from that.
I'm assuming we are talking about a warm boot scenario here, since dataplane downtime is considered in warm boot scenarios, but no "warm boot" was mentioned in the main post. Also, how is this time 44.903033 actually measured?
I don't have the syslogs, but I will re-run the tests and add them.
I think we will need @yxieca to jump in here and explain the fast boot scenario to me.
Maybe this document will help to understand the scenario.
To help solve this I will need the exact moments when the dataplane is disrupted and when it gets fully operational again, and whether those 2 moments can be determined from the sairedis.rec files.
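For what it's worth, here is a minimal sketch of how those two moments could be pulled out of a recording, assuming we can name a start and an end marker line (the marker arguments and the recdelta name are placeholders, not existing tooling; the timestamp format matches the attached recordings):

```cpp
// recdelta.cpp - print the time between the first line matching a start
// marker and the last line matching an end marker in a sairedis.rec file.
// Marker strings are placeholders; timestamps look like
// "2021-06-20.14:03:18.867744" in the attached recordings.
// strptime/timegm are POSIX/glibc, which is fine on SONiC.
#include <fstream>
#include <iostream>
#include <string>
#include <ctime>

static double toEpoch(const std::string& ts)
{
    std::tm tm{};
    strptime(ts.substr(0, 19).c_str(), "%Y-%m-%d.%H:%M:%S", &tm);
    return timegm(&tm) + std::stod("0" + ts.substr(19)); // fractional part
}

int main(int argc, char** argv)
{
    if (argc < 4)
    {
        std::cerr << "usage: recdelta <sairedis.rec> <startMarker> <endMarker>\n";
        return 1;
    }

    std::ifstream rec(argv[1]);
    std::string line, startTs, endTs;

    while (std::getline(rec, line))
    {
        std::string ts = line.substr(0, line.find('|'));
        if (startTs.empty() && line.find(argv[2]) != std::string::npos)
            startTs = ts;              // first occurrence of the start marker
        if (line.find(argv[3]) != std::string::npos)
            endTs = ts;                // last occurrence of the end marker wins
    }

    if (startTs.empty() || endTs.empty())
    {
        std::cerr << "marker not found\n";
        return 1;
    }

    std::cout << toEpoch(endTs) - toEpoch(startTs) << " seconds\n";
    return 0;
}
```

Once we know which marker lines actually bound the disruption window, this could be run with, e.g., the switch create line as the start marker and the last hostif oper-status line as the end marker.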
@kcudnik I have added to the ticket the test log and full system dump with both 'synchronous_mode' enabled and disabled.
In both cases I see:

which takes about 2 seconds, so those messages seem not to reflect the data plane downtime. I would need some syslog indicators from which I could see when the dataplane goes down, and when it comes up again after this long absence; then I could investigate what else is going on. Also, from syncd process start (before api_initialize) to returning the first GET to orchagent, syncd takes less than 30 seconds in both sync and async mode.

In async mode it takes about 18 seconds, and in sync mode about 19 seconds, to return the first GET. After this I assume the dataplane is already up, since the SAI API was initialized, the switch was created, and syncd was already responding to GETs; this would suggest that the dataplane is down during a different time period than just the syncd process (API initialize and switch create). I don't fully understand what determines whether the data plane is running or not; do the ports need to be explicitly brought up? I would need @yxieca to jump in here and explain.
@shi-su, does this match your observation?
The fact that 202012 has longer dataplane downtime than 201911 and some of the operations are taking longer matches my observation. On the other hand, I am not sure if the synchronous mode makes a difference by 5 seconds. I did see that sync mode is slower, but the difference appears to be smaller. I checked the recording file in the ticket. I am not sure why there are no route entries in the recording of async mode, so I took the creation of the first neighbor with IP 10.0.0.x as a milestone. The difference between sync and async mode seems to be below 2 seconds.
Some of the downtime data I collected from 202012 fast reboot tests is below.
So how is this time measured? Based on port notification arrival?
What I noticed is that in results.gz, in the 201911 sairedis recording, there is not a single route created; the same holds in 202012 when sync mode is off, while when sync mode is on, routes are actually created. Why didn't OA create routes at all? And how did the dataplane then recover enough to pass the test?
My suspect here is part of OA or the vendor SAI execution APIs (including switch create). Based on the virtual switch on amd64 arch, the sairedis/syncd pipeline is able to push 13k routes per second in sync mode (end to end); on top of that there will be added processing time in vendor SAI, as well as how quickly OA can push data to the sairedis library. I can add performance timers in syncd for each API to see how long it takes; the route_entry APIs already have those performance timers and display them in syslog for every 10k routes pushed. But for the OA side, I will need the maintainers of that code to take a look.
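As an illustration of the shape such a per-API timer could take (names here are mine, not the actual syncd code): accumulate elapsed time per call and emit a syslog line every 10k operations.

```cpp
// Hypothetical sketch of a per-API performance timer, in the spirit of the
// existing route_entry timers in syncd; names here are illustrative only.
#include <chrono>
#include <cstdint>
#include <syslog.h>

class ApiTimer
{
public:
    explicit ApiTimer(const char* apiName, uint64_t reportEvery = 10000)
        : m_name(apiName), m_reportEvery(reportEvery) {}

    // Wrap each SAI API invocation; returns whatever the call returns.
    template <typename Fn>
    auto measure(Fn&& fn)
    {
        auto start = std::chrono::steady_clock::now();
        auto status = fn();
        m_total += std::chrono::steady_clock::now() - start;

        if (++m_count % m_reportEvery == 0)
        {
            double sec = std::chrono::duration<double>(m_total).count();
            syslog(LOG_NOTICE, "%s: %lu calls took %.3f sec (%.0f ops/sec)",
                   m_name, (unsigned long)m_count, sec, m_count / sec);
        }
        return status;
    }

private:
    const char* m_name;
    uint64_t m_reportEvery;
    uint64_t m_count = 0;
    std::chrono::steady_clock::duration m_total{};
};
```

Wrapping each route_entry create in something like timer.measure(...) would then surface end-to-end throughput in syslog, the same way the existing 10k-route counters do.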
@mlnx, one observation with the 201911 image is that create switch takes 9 sec vs ~18 sec in 202012. Would you please help us understand why there is such a big delay in create switch? Logs from 201911:
Adding some more logs based on:
- ACS-MSN2700 - 201911
- ACS-MSN4600C - 201911
- ACS-MSN2700 - 202012
- ACS-MSN4600C - 202012
Can you describe which ones are in sync/async mode?
They are all with sync mode enabled.
@shlomibitton hey, if you have time this week, we can set up another session and maybe try running syncd + saiplayer only, to eliminate OA and other stuff, and just see how fast all notifications arrive. This will take some time but we can do it over a meeting session. What do you think?
Hey @kcudnik, I will try to find a free slot this week and update you.
Hey, can you specify a slot this week?
@kcudnik sorry, I didn't find time this week.
sure
Hey @kcudnik, sorry for the late response, I couldn't find an empty time slot...
Sure, could you send an invite?
Sent
Update on the latest findings for this issue: I measured timings and profiled this process. Profiling the daemon proves that:

From bootchart we can see this is running exactly when orchagent resets the ASIC; probably on the 201911 branch the timing is different, and this is why we don't see this issue. The conclusion is that delaying this daemon for 90 seconds (the fast-reboot control plane and data plane maximum recovery time) will improve the fast-reboot flow by ~6-7 seconds.
Thanks for the in-depth investigation. Just to clarify: that 660 ms is the process running, and this time is not "included" in OA/syncd time, it just slows down the entire system. It would be nice to have a timeline of each run of this process, plus CPU load and each notification arrival, plotted on the same graph, so we could visually observe the CPU load between the 1st and last port UP notification. Also, not sure, but hostcfgd could probably be slowed down by running it with some fixed sleep time to give time for OA; or maybe better, set nice -19 on this process, and for syncd and OA we could add nice +19 for, say, 1 min, to grant them the highest priority on start, and after 1 min bring nice back to default. Actually maybe even syncd alone would be enough. What do you think, shall we try OA+syncd on nice +20?
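As a sketch of the priority idea, assuming we adjust the process's own nice value at startup and restore it after a fixed window (note that on Linux a lower nice value means higher priority, -20 being the highest, and boosting priority requires root or CAP_SYS_NICE):

```cpp
// Sketch of the priority-boost idea for syncd/orchagent startup.
// Purely illustrative, not existing SONiC code.
#include <sys/resource.h>
#include <thread>
#include <chrono>

static void boostPriorityForStartup(std::chrono::seconds window)
{
    // Highest priority while the fast-boot critical path runs.
    setpriority(PRIO_PROCESS, 0 /* self */, -20);

    std::thread([window]() {
        std::this_thread::sleep_for(window);
        // Back to the default nice value once startup settles.
        setpriority(PRIO_PROCESS, 0 /* self */, 0);
    }).detach();
}

// e.g. near the top of main() in syncd/orchagent:
// boostPriorityForStartup(std::chrono::seconds(60));
```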
Looks like the delay is related to https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L930
@shlomibitton thanks for providing the 4 dumps of 202012 logs from the 2700, with sync mode on and off. In the async mode case, all hostif oper status updates come one after another; the scheduler_group GETs still happen, in the same amount, but before all the hostif oper status SETs, between the last PORT_ATTR_ADMIN_STATE=true and the last HOSTIF_OPER_STATUS=true.
201911, just for comparison, assuming that this and all previously attached logs were taken on the same 2700 device (different from the one on which the issue was first reported on GitHub). Here create_switch takes 12 seconds (which is significantly more compared to ~8 sec previously; it could be due to a different number of ports). From create switch to last oper status:

2021-06-22.12:40:43.416433|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000

Where create switch takes 12 seconds, and the rest another 9 seconds, in total about 21 seconds.

202012 async mode:

2021-06-21.14:23:47.540164|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000

Where create switch takes 14 seconds, and the rest another 14 seconds, in total about 28 seconds.

And 202012 sync mode:

2021-06-20.14:03:18.867744|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000

Where create switch takes 14 seconds, the rest about 19 seconds, in total about 34 seconds.

During a debug session with Shlomi, we noticed that
I have a suspicion here that, because this fdb recreate blocks the entire OA flow for 2 whole seconds, some producers could already have produced something in the redis database; this wakes up the "select" which listens on a list of sub-orchagents in the main orch, and potentially epoll/select will not guarantee order. When a sub-orch is popped it is processed, and then the entire list is processed in order by doTask(), and tasks are added by the refillToSync() and addToSync() methods in OA. I'm investigating the sairedis and swss logs to see whether this is the case, and comparing the 201911 branch with 202012 to check whether we use the same features on both branches, to see if it makes sense to compare async 201911 directly to async 202012. Will share results later.
One big thing that I noticed, as mentioned: because OA is listening on notifications from sairedis, and not relying on the port_state_change notification switch callback, the notification will be processed by OA only after select returns, and that happens much later, after processing all doTask() operations.

Sync mode 202012 (28 notifications in total, I just recorded the first and last):

2021-06-20.14:03:41.139289|n|port_state_change|[{"port_id":"oid:0x100000000088b","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
…

In OA, notifications are processed in void PortsOrch::doTask(NotificationConsumer &consumer) (portsorch.cpp), and a log line shows up in syslog for each port:

Jun 20 14:03:47.494839 arc-switch1004 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000088b status:1
…

And here is where hostif is brought up:

2021-06-20.14:03:47.495207|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd00000000097a|SAI_HOSTIF_ATTR_OPER_STATUS=true
…

Notice here: all notifications arrived between seconds 41-42 (sairedis record) but are processed between seconds 47-52 in doTask (which is also what brings the hostif interface up).

In swss.log there is an 11 second gap after this:

2021-06-20.14:03:40.379744|PORT_TABLE:Ethernet78|SET|alias:etp20b|description:etp20b|index:20|lanes:78,79|mtu:9100|pfc_asym:off|speed:50000|admin_status:down

and OA is busy during that time (the lack of any logs is probably because all necessary tasks were already queued and it is now only processing them).

You can also see where the 500 fdb entries are queued, and when they are processed (in sync mode it takes 2 sec to push them):

2021-06-20.14:03:45.346041|c|SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000009c0","mac":"72:06:00:01:00:00","switch_id":"oid:0x21000000000000
…

I think this seems normal, since all the "known" configuration will be queued right away during OA start, and it will start accepting new entries once it has processed the current queue. A similar thing happens to SCHEDULER_GROUPS:

2021-06-20.14:03:38.344780|QUEUE|Ethernet32|0|SET|scheduler:[SCHEDULER|scheduler.0]
…

And here is the main difference between async/sync mode: in sync mode we are waiting for each specific API status code to return from syncd, while in async mode all create/remove/set are almost instantaneous (like 200 ms for fdb entries vs 2 seconds), so OA is able to process those tasks much faster, since it doesn't need to wait for a response.

This OA processing is actually by design, as seen in orchdaemon.cpp:

571 auto *c = (Executor *)s;

For each execute() on a specific select result, it executes doTask() on all of m_orchList, and it takes some time until all of that is processed and the first notification is picked up from the queue.

And comparing the longest create/set flow:

201911:
2021-06-22.12:40:58.996114|g|SAI_OBJECT_TYPE_PORT:oid:0x1000000000429|SAI_PORT_ATTR_SUPPORTED_SPEED=25:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2.1k operations take 0.9 seconds + 2 seconds wait to finish execution on syncd =~ 3 sec total

202012 async:
2021-06-21.14:24:07.683322|g|SAI_OBJECT_TYPE_PORT:oid:0x10000000003a5|SAI_PORT_ATTR_SUPPORTED_SPEED=25:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2k operations take 0.8 seconds + 3.2 seconds wait to finish execution on syncd =~ 4 sec total

202012 sync:
2021-06-20.14:03:40.172097|g|SAI_OBJECT_TYPE_PORT:oid:0x10000000003a5|SAI_PORT_ATTR_SUPPORTED_SPEED=25:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2k operations take 7.3 seconds (there is no wait time, since all APIs are synchronous; we could try to use zmq sync mode to see if that would be faster,
but because of an OA bug (sonic-net/sonic-swss#1145) notifications with zmq will not trigger the OA select, so hostif would not be set to UP).
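To make the flow above concrete, here is a toy, compilable paraphrase of that loop (all types here are stubs; the real logic lives in sonic-swss orchdaemon.cpp):

```cpp
// Toy model of the OrchDaemon flow described above, not the literal
// sonic-swss source: every select() wakeup first runs the ready executor,
// then a full doTask() pass over every orch, so a queued port_state_change
// notification waits behind the whole backlog.
#include <cstdio>
#include <vector>

struct Selectable { virtual ~Selectable() = default; };

struct Executor : Selectable
{
    const char* event;
    explicit Executor(const char* e) : event(e) {}
    void execute() { std::printf("execute(): %s\n", event); }
};

struct Orch
{
    const char* name;
    int queued; // tasks already queued via refillToSync()/addToSync()
    void doTask()
    {
        std::printf("doTask(%s): %d queued tasks processed\n", name, queued);
        queued = 0;
    }
};

int main()
{
    std::vector<Orch> m_orchList = {
        {"PortsOrch", 2000}, {"FdbOrch", 500}, {"QosOrch", 1500}};

    // Wakeup 1: some producer's data is ready; the port notification that
    // arrives during this pass can only be served by a *later* select().
    Executor ready("APPL_DB consumer data");
    Selectable* s = &ready;

    auto* c = (Executor*)s;  // cf. orchdaemon.cpp:571 quoted above
    c->execute();            // handle the event that woke us up
    for (auto& o : m_orchList) o.doTask(); // full pass over every orch

    // Wakeup 2: only now is the notification consumer selected.
    Executor notif("port_state_change SAI_PORT_OPER_STATUS_UP");
    notif.execute(); // -> PortsOrch::doTask(NotificationConsumer&) sets hostif UP
}
```

The point of the toy: the port_state_change consumer is just another Selectable, so its execute() can only run on a wakeup after the current full doTask() pass, which is exactly the 41s-to-47s gap seen in the logs above.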
We spent some time with Shlomi figuring out what's happening during fast boot in sync mode, and why it was extending over 30 seconds. Here are our findings so far. Just looked at those logs (these are 202012 with qos orch removed from OA for testing), and in
PS. the above times could be even lower (an estimated 1-2 seconds lower, just by disabling our added XXX debug logs, since they pollute syslog, adding an unnecessary 30k syslog lines which impact overall system performance), so we are already under 30 seconds here (ideally we would like to be around 20-22 seconds, to have some headroom for any hiccups during fast boot). Previously (with qos enabled):
A little background, a high level overview of the OA daemon:
So what we found:
Execution stats for each orch (with at least 10 ms), based on the latest 202012 logs from Shlomi, with execution order preserved:

(ASYNC): time in seconds

(SYNC): time in seconds

In our case it so happens that we started to receive notifications during the QosOrch execution, and we must wait until it finishes. We need a meeting about OA to discuss how we could address this issue by processing notifications when they arrive (a sketch of what such an explicit check could look like follows below). I am also aware that processing port notifications as they arrive may not be beneficial (since we may be in the middle of port processing, etc.), but in that case the user would need to explicitly put checks inside doTask in each orch.

PS: @shlomibitton I found this interesting stuff in the async log:
Seems like syslog was not fast enough and some logs were lost, missing the "after doTask" lines (since from docker, syslog is sent via UDP), and 4 seconds of syslog disappeared.
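Regarding processing notifications as they arrive: purely as a sketch of the "explicit check during doTask" idea mentioned above (hypothetical code, not the existing OA implementation), a long doTask() loop could periodically drain the notification consumer:

```cpp
// Hypothetical shape of the "explicit check" idea; all types here are
// simplified stand-ins for the swss ones.
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

struct NotificationConsumer
{
    std::deque<std::string> pending;
    bool hasData() const { return !pending.empty(); }
    std::string pop() { auto n = pending.front(); pending.pop_front(); return n; }
};

void doTask(std::deque<std::string>& toSync,
            NotificationConsumer& notifications,
            const std::function<void(const std::string&)>& handleNotification)
{
    std::size_t processed = 0;
    while (!toSync.empty())
    {
        /* process one queued entry */ toSync.pop_front();

        // Every 1000 entries, service notifications that arrived meanwhile,
        // so a port-up event is not stuck behind e.g. a QosOrch backlog.
        if (++processed % 1000 == 0)
            while (notifications.hasData())
                handleNotification(notifications.pop());
    }
}
```

The caveat from above still applies: draining notifications mid-pass is only safe if the orch's own state is consistent at that point, which is exactly the check the user would have to add.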
Fixes provided in PRs:
In a meeting with Prince I noticed that there is a lot of redundancy:
This one is called 7 times!!! And not only this one; it's the same for all scheduler groups.
If this could be eliminated then we would cut QOS from 6 sec to less than 1 second, which would satisfy fast boot.
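For illustration, the deduplication could be as simple as memoizing GET results per scheduler-group OID (a hypothetical helper, not existing OA code; SchedulerGroupCache and the types below are made up for this sketch):

```cpp
// Hypothetical memoization of scheduler-group attribute GETs: repeated
// queries for the same OID return the cached value instead of another
// round trip through sairedis/syncd.
#include <cstdint>
#include <unordered_map>
#include <vector>

using ObjectId = std::uint64_t;
using AttrValues = std::vector<std::uint64_t>;

class SchedulerGroupCache
{
public:
    // realGet stands in for the actual SAI query (the expensive call that
    // the logs show being repeated 7 times per group).
    template <typename GetFn>
    const AttrValues& get(ObjectId oid, GetFn&& realGet)
    {
        auto it = m_cache.find(oid);
        if (it == m_cache.end())
            it = m_cache.emplace(oid, realGet(oid)).first;
        return it->second;
    }

    void clear() { m_cache.clear(); } // e.g. invalidate after reconfiguration

private:
    std::unordered_map<ObjectId, AttrValues> m_cache;
};
```

If the repeated queries after the first all hit the cache, most of the redundant round trips disappear, which is where the 6 sec to under 1 second estimate above would come from.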
Description
Fast-reboot downtime period exceeds 30 seconds when running 202012/master based images.
It works well when running on top of a 201911-based image.
Attached to this ticket are sairedis.rec files and bootchart graphs for the 201911/202012 branches; for 202012 there are two versions, with synchronous_mode enabled/disabled.
results.gz
When enabling synchronous_mode on 202012, the result is ~5 seconds more compared to disabling synchronous_mode.
A timing analysis based on the sairedis.rec files, comparing 201911 and 202012 with synchronous_mode disabled:
Steps to reproduce the issue:
Describe the results you received:
Summary:
Longest downtime period was 0:00:44.903033
Reboot time was 0:01:32.419481
Expected downtime is less than 0:00:30
How many packets were received back when control plane was down: 500 Expected: 500
Additional info:
INFO:10.213.84.68:PortChannel interface state changed 1 times
INFO:10.213.84.65:PortChannel interface state changed 1 times
INFO:10.213.84.67:PortChannel interface state changed 1 times
INFO:10.213.84.66:PortChannel interface state changed 1 times
Fails:
FAILED:dut:Longest downtime period must be less then 30 seconds. It was 0:00:44.903033
Describe the results you expected:
Downtime should be less than 30 seconds; on 201911 it is around 23 seconds.
dumps + test logs
sonic_dump_202012_sync_on.tar.gz
fast-reboot_test_202012_sync_on.log
sonic_dump_202012_sync_off.tar.gz
fast-reboot_test_202012_sync_off.log