-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[chassis][202405]: orchagent crash in NotificationSwitchAsicSdkHealthEvent::executeCallback while handling SAI notification #19760
Comments
seems to be triggerred from new feature - sairedis::NotificationSwitchAsicSdkHealthEvent::executeCallback |
@anamehra will debug more |
lower priority for 202405. Need to debug for master. |
Will investigate, in the backtrack number 5 and 6 are OA code? Symbols not loaded? |
If I remember correctly, 5 and 6 pointed somewhere in data segment pointing to, I think, notification function pointers. I loaded symbol files dbg debians for swss, libsairedis and meta but those did not help. |
callback execution on executeCallback suggests that deserialization succeeded correctly on the string, and later on crash happens on ::put, im invaestigating OA code 202405 which SAI is using ? just tested on sairedis callback execution and it works fine |
i reproduced that locally, problem is this line: time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S"); value tv_sec 172479515853275099 is too big and it crashes, thats about 5469289569 years, this is not correct |
We hit the issue on 202405 which is using SAI 1.14 |
no, there should not be compatibility issue with v1.13 and v1.14, can you check how timestamp is populated ? |
commit sonic-net/sonic-swss#3020 + @stephenxs can you take a look and handle this test case ? it's your code, and i made repro static void on_switch_asic_sdk_health_event(
_In_ sai_object_id_t switch_id,
_In_ sai_switch_asic_sdk_health_severity_t severity,
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t data,
_In_ const sai_u8_list_t description)
{
SWSS_LOG_ENTER();
printf("swtch: %lx %ld %d\n", switch_id, timestamp.tv_sec,timestamp.tv_nsec);
const std::time_t &t = (std::time_t)timestamp.tv_sec;
std::stringstream time_ss;
time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
}
TEST(NotificationSwitchAsicSdkHealthEvent, executeCallback)
{
sai_switch_notifications_t ntfs;
auto str = "{\"category\":\"SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_ASIC_HW\",\"data.data_type\":\"SAI_HEALTH_DATA_TYPE_GENERAL\",\"description\":\"16:123,10,9,34,100,97,116,97,34,58,32,34,48,34,10,125\",\"severity\":\"SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_WARNING\",\"switch_id\":\"oid:0x21000000000000\",\"timestamp\":\"{\\\"tv_nsec\\\":\\\"12\\\",\\\"tv_sec\\\":\\\"172479515853275099\\\"}\"}";
NotificationSwitchAsicSdkHealthEvent nn(str);
ntfs.on_switch_asic_sdk_health_event = &on_switch_asic_sdk_health_event;
nn.executeCallback(ntfs);
} |
Thanks @kcudnik ! I will check and fix the vendor side code. |
I would like to add an exception handler here |
ok,i think what could be also problem that causes this time sec to be so big, which SAI headers are used in swss at 202405 ? since sairedis i using v1.14, if swss is using later headers then this was breaking change: since it's added extra field sai_health_data_t data, which looks like that: typedef struct _sai_switch_health_data_t
{
/** Type of switch health data */
sai_health_data_type_t data_type;
/** @passparam data_type */
sai_health_data_t data;
} sai_switch_health_data_t; but that's after v1.14 on v1.14 it looks like this: typedef struct _sai_switch_health_data_t
{
/** Type of switch health data */
sai_health_data_type_t data_type;
} sai_switch_health_data_t; which changes struct size, and since this struct is used in param to callback: 3297 typedef void (*sai_switch_asic_sdk_health_event_notification_fn)(
3298 _In_ sai_object_id_t switch_id,
3299 _In_ sai_switch_asic_sdk_health_severity_t severity,
3300 _In_ sai_timespec_t timestamp,
3301 _In_ sai_switch_asic_sdk_health_category_t category,
3302 _In_ sai_switch_health_data_t data, /// <-here
3303 _In_ const sai_u8_list_t description); since sizeof (data) have 2 different sizes on 2 different SAI versions, but since stacks is pushed in reverse (or in x64 is put in registers) this could lead to load wrong value to the specific register,i will investigate what could be changed |
swss does not specify SAI header. It references the SAI header specified in sairedis. |
@anamehra @stephenxs i performed such a test: typedef struct _sai_switch_health_data_t_v14
{
/** Type of switch health data */
sai_health_data_type_t data_type;
/** @passparam data_type */
sai_health_data_t data;
} sai_switch_health_data_t_v14;
typedef struct _sai_switch_health_data_t_v13
{
/** Type of switch health data */
sai_health_data_type_t data_type;
} sai_switch_health_data_t_v13;
typedef void (*on_switch_asic_sdk_health_event_v13_fn)(
_In_ sai_object_id_t switch_id,
_In_ sai_switch_asic_sdk_health_severity_t severity,
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t_v13 data,
_In_ const sai_u8_list_t description);
static void on_switch_asic_sdk_health_event_v13(
_In_ sai_object_id_t switch_id,
_In_ sai_switch_asic_sdk_health_severity_t severity,
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t_v13 data,
_In_ const sai_u8_list_t description)
{
SWSS_LOG_ENTER();
printf("swtch v14: %lx %ld %d, count = %d\n", switch_id, timestamp.tv_sec,timestamp.tv_nsec,
description.count);
EXPECT_TRUE(timestamp.tv_sec == 1731848814);
const std::time_t &t = (std::time_t)timestamp.tv_sec;
std::stringstream time_ss;
time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
}
typedef void (*on_switch_asic_sdk_health_event_v14_fn)(
_In_ sai_object_id_t switch_id,
_In_ sai_switch_asic_sdk_health_severity_t severity,
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t_v14 data,
_In_ const sai_u8_list_t description);
static void on_switch_asic_sdk_health_event_v14(
_In_ sai_object_id_t switch_id,
_In_ sai_switch_asic_sdk_health_severity_t severity,
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t_v14 data,
_In_ const sai_u8_list_t description)
{
SWSS_LOG_ENTER();
printf("swtch v14: %lx %ld %d, count = %d\n", switch_id, timestamp.tv_sec,timestamp.tv_nsec,
description.count);
EXPECT_TRUE(timestamp.tv_sec == 1731848814);
const std::time_t &t = (std::time_t)timestamp.tv_sec;
std::stringstream time_ss;
time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
}
TEST(NotificationSwitchAsicSdkHealthEvent, version)
{
SWSS_LOG_ENTER();
sai_object_id_t switchId = 0x21000000000000;
sai_switch_asic_sdk_health_severity_t severity = SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_WARNING;
sai_switch_asic_sdk_health_category_t category = SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_ASIC_HW;
sai_timespec_t timestamp = {1731848814, 123456};
sai_switch_health_data_t_v13 data13 = { };
sai_switch_health_data_t_v14 data14 = { };
data13.data_type = SAI_HEALTH_DATA_TYPE_GENERAL;
data14.data_type = SAI_HEALTH_DATA_TYPE_GENERAL;
uint8_t list[] = {123,10,9,34,100,97,116,97,34,58,32,34,48,34,10,125};
sai_u8_list_t desc;
desc.count = 16;
desc.list = list;
{
on_switch_asic_sdk_health_event_v13_fn v13 = &on_switch_asic_sdk_health_event_v13;
v13(switchId, severity, timestamp, category, data13, desc); // v13 call v13
on_switch_asic_sdk_health_event_v14_fn v14 = &on_switch_asic_sdk_health_event_v14;
v14(switchId, severity, timestamp, category, data14, desc); // v14 call v14
}
{
printf("call other\n");
on_switch_asic_sdk_health_event_v13_fn v13 = reinterpret_cast<on_switch_asic_sdk_health_event_v13_fn>((void*)&on_switch_asic_sdk_health_event_v14);
v13(switchId, severity, timestamp, category, data13, desc); // v13 call v14
on_switch_asic_sdk_health_event_v14_fn v14 = reinterpret_cast<on_switch_asic_sdk_health_event_v14_fn>((void*)&on_switch_asic_sdk_health_event_v13);
v14(switchId, severity, timestamp, category, data14, desc); // v14 call v13
}
} and results are like this:
so the timestamp is correct event when v13 calls v14 and viceversa, but the count is atully affected, because of sizeof(data) depending of version for if in your case crash is in time/put and not deserialization of description, that means the headers are probably ok, but what's most likely is that device is returning wrong number of tv_sec in timestamp |
Hi @kcudnik , I figured out an issue with the timestamp in our platform SAI code for this API. I am working on fixing it. I think that should fix the timestamp issue. On Sonic side, we may add handling to prevent crash. |
@stephenxs can you add exception handling for this in swss as you mentioned before? |
Sure. will do it. |
@stephenxs i noticed taht put_time crashes after the tv_sec is around 67767710400000001, which causes to print Year over 2^31, which overlaps year and some internal crash, unfortunetly it does not throw any std exception and just segfault
i think you can assume that if tv_sec is more than 1k years, then this value is invalid or as you said, a year in future from now |
Related to sonic-net/sonic-buildimage#19760. Tests added to rule out v13 and v14 headers health_data stack shift influencing timestamp.
I added this tests in sairedis: sonic-net/sonic-sairedis#1463 to test v13 and v14 missmatch |
Related to sonic-net/sonic-buildimage#19760. Tests added to rule out v13 and v14 headers health_data stack shift influencing timestamp.
…ic-net#1463) Related to sonic-net/sonic-buildimage#19760. Tests added to rule out v13 and v14 headers health_data stack shift influencing timestamp.
Hi @stephenxs , do you have any PR for handling this case on Sonic side? |
@anamehra WIP and will finish in this month. |
…ic-net#1463) Related to sonic-net/sonic-buildimage#19760. Tests added to rule out v13 and v14 headers health_data stack shift influencing timestamp.
…ic-net#1463) Related to sonic-net/sonic-buildimage#19760. Tests added to rule out v13 and v14 headers health_data stack shift influencing timestamp.
|
Description
Intermittent orchagent crash seen on LCs during config reload while running sonic-mgmt nightly runs:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/orchagent -d /var/log/swss -b 1024 -s -f swss.asic1.rec -j sairedis.as'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f3551c6e61b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f3550c9e6c0 (LWP 181))]
(gdb) bt
#0 0x00007f3551c6e61b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f3551c70908 in strftime_l () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f3551f4c90d in std::__timepunct::_M_put(char*, unsigned long, char const*, tm const*) const () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007f3551fa4fdf in std::time_put<char, std::ostreambuf_iterator<char, std::char_traits > >::do_put(std::ostreambuf_iterator<char, std::char_traits >, std::ios_base&, char, tm const*, char, char) const ()
from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f3551fa36fb in std::time_put<char, std::ostreambuf_iterator<char, std::char_traits > >::put(std::ostreambuf_iterator<char, std::char_traits >, std::ios_base&, char, tm const*, char const*, char const*) const ()
from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00005623dc9812de in ?? ()
#6 0x00005623dc7d4302 in ?? ()
#7 0x00007f35529902d2 in sairedis::NotificationSwitchAsicSdkHealthEvent::executeCallback(_sai_switch_notifications_t const&) const () from /lib/x86_64-linux-gnu/libsaimeta.so.0
#8 0x00007f3552aa34b2 in sairedis::RedisRemoteSaiInterface::handleNotification(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const&) () from /lib/x86_64-linux-gnu/libsairedis.so.0
#9 0x00007f3552af2878 in sairedis::RedisChannel::notificationThreadFunction() () from /lib/x86_64-linux-gnu/libsairedis.so.0
#10 0x00007f3551f584a3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007f3551c2d134 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007f3551cad7dc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
syslogs:
orchagent_segfault.log
The following SAI notification from SDK triggers the crash:
SAI redis recoreds before the above notification related to this feature:
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Output of
show version
:Output of
show techsupport
:Additional information you deem important (e.g. issue happens only occasionally):
The text was updated successfully, but these errors were encountered: