Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notification NOS running SAI and SDK, about HW/FW failures #1777

Merged
merged 1 commit into from
Jun 20, 2023

Conversation

amandyuk
Copy link
Contributor

@amandyuk amandyuk commented Mar 9, 2023

Health event is a way to inform NOS about some HW issue in the switch. When some error on switch occur, SAI should in some way inform a NOS about the status. In order to provide flexibility depending on different types issues SAI generates health event with several parameters that describes that event. Hence, to inform NOS - sai_switch_asic_sdk_health_event_notification_fn callback is invoked.

inc/saiswitch.h Outdated
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t data,
_In_ sai_u8_list_t description);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably should be const since contains pointer to data

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now data is passed as value not as pointer
In sai_switch_health_data_t data,
Do you suggest changing to
In const sai_switch_health_data_t *data
?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was talking about sai_u8_list_t, since u8 list has pointer inside, which without const can be modified

@itaibaz
Copy link
Collaborator

itaibaz commented Mar 9, 2023

@JaiOCP
Can you please review
We added the ability to extend and provide per event data, so
#1307
for example, can be implemented with the health event
Please review the suggested extension in the .MD
Thanks

@JaiOCP
Copy link
Contributor

JaiOCP commented Mar 9, 2023

@JaiOCP
Can you please review
We added the ability to extend and provide per event data, so
#1307
for example, can be implemented with the health event
Please review the suggested extension in the .MD
Thanks

Thanks Itai


/** @passparam data_type */
sai_health_data_t data;
} sai_health_t;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't see this defined in header files

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not defined in the headers since SER is a separate PR
The intention is to show the health event can be used as infrastructure for additional future event
For each event, we will need separate PR. The struct with specific details will be defined in the PR, and will added to the union.
I am not trying to address every possible health event at the moment, just create an extendable infrastructure which can easily be extended

_In_ sai_switch_asic_sdk_health_severity_t severity,
_In_ sai_timespec_t timestamp,
_In_ sai_switch_asic_sdk_health_category_t category,
_In_ sai_switch_health_data_t data,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from definitions so far, this "data" is actually a data type, and only one type of data SAI_HEALTH_DATA_TYPE_GENERAL can be passed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes currently we don't need to use data for the general type
but we created extendable infrastructure, per Jai's request, so additional events can be added


2) For all types that have additional data, a struct should be created. The first time a new type that has actual struct data will be created, union should be created as well. Each struct should be added to that union.

For example if we want to add a SER item (https://github.com/opencomputeproject/SAI/pull/1307), the flow will be :
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example looks not complete. It should show how the attribute and parameters should be assigned and passed to function

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be addressed in SER specific PR
It is out of scope of this PR which is the general infrastructure

@rlhui
Copy link
Collaborator

rlhui commented Mar 13, 2023

Could we please look into supporting the data defined in json? sai_json_t is available.

@itaibaz
Copy link
Collaborator

itaibaz commented Mar 13, 2023

@rlhui
I am not sure I follow the requirement for json
In the mentioned PR
https://github.com/sonic-net/SONiC/blob/5677242770e3774a5eab0f951dc14642d81422ec/doc/event-alarm-framework/events-producer.md
Sonic is moving to this model, to define the contract between Sonic and external tools on top of Sonic

I don't see how it's related to SAI layer

The parameters for the health event are well typed, other than the description field which is free text and vendor specific
To generalize, we did add the category (on top of the initial severity we had), per community request
Can you please explain what exactly are you looking for ?

@rlhui
Copy link
Collaborator

rlhui commented Mar 16, 2023

@rlhui I am not sure I follow the requirement for json In the mentioned PR https://github.com/sonic-net/SONiC/blob/5677242770e3774a5eab0f951dc14642d81422ec/doc/event-alarm-framework/events-producer.md Sonic is moving to this model, to define the contract between Sonic and external tools on top of Sonic

I don't see how it's related to SAI layer

The parameters for the health event are well typed, other than the description field which is free text and vendor specific To generalize, we did add the category (on top of the initial severity we had), per community request Can you please explain what exactly are you looking for ?

Please see #1324 as an example. For the categories added, they can also be defined as part of json format.

@rlhui
Copy link
Collaborator

rlhui commented Mar 16, 2023

Can we please update PR to add description

@rlhui
Copy link
Collaborator

rlhui commented Mar 16, 2023

Can we pleae also update PR title ? Currently it's truncated.

@amandyuk amandyuk changed the title Implement a mechanism that to enable a NOS running SAI and SDK, to de… Notification NOS running SAI and SDK, about HW/FW failures Mar 16, 2023
@itaibaz
Copy link
Collaborator

itaibaz commented Mar 16, 2023

Added description to the PR
Also modified the title to avoid truncation

@rlhui
Copy link
Collaborator

rlhui commented Mar 17, 2023

build is failing

@kcudnik
Copy link
Collaborator

kcudnik commented Mar 17, 2023

WARNING: first line doxygen comment should be with '/**': saiswitch.h: /** 
WARNING: line ends in whitespace saiswitch.h 2875:      
WARNING: Header doesn't meet style requirements (most likely ident is not 4 or 8 spaces) saiswitch.h 2875:     
WARNING: line ends in whitespace saiswitch.h 2876:     /** 

Signed-off-by: Andrii Mandiuk <amandiuk@nvidia.com>
SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_WARNING,

/** Switch event severity notice */
SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_NOTICE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know if a specific HEALTH issue is cleared. Also if its correctable/non-correctable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re if it's correctable or not, the severity field can be used as indication
If it's fatal, then it's not correctable
If it's warning or notice, then it is correctable

Re clearing of health issue
The concept is for one time event
It's not an alarm where a certain threshold is crossed and then can later be cleared when going under the threshold

Copy link
Collaborator

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rlhui rlhui merged commit e06d81e into opencomputeproject:master Jun 20, 2023
typedef struct _sai_switch_health_data_t
{
/** Type of switch health data */
sai_health_data_type_t data_type;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this struct contains only one entry, is it not better to use data_type directly in notification ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an extendable placeholder
check out possible extension for SER in the .MD above
The ask in the community was to be able to extend and add info for specific features in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we want to make structs constant then this would not expand any more

JaiOCP pushed a commit to JaiOCP/SAI that referenced this pull request Jun 26, 2023
…teproject#1777)

Health event is a way to inform NOS about some HW issue in the switch. When some error on switch occur, SAI should in some way inform a NOS about the status. In order to provide flexibility depending on different types issues SAI generates health event with several parameters that describes that event. Hence, to inform NOS - sai_switch_asic_sdk_health_event_notification_fn callback is invoked.

Signed-off-by: Andrii Mandiuk <amandiuk@nvidia.com>
@JaiOCP JaiOCP mentioned this pull request Mar 6, 2024
@JaiOCP JaiOCP mentioned this pull request Mar 7, 2024
@JaiOCP JaiOCP mentioned this pull request Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants