[dualtor-aa][sanity_check] Add checker to verify dualtor-aa interfaces are active on both sides #11637

Merged 1 commit on Feb 12, 2024

@zjswhhh zjswhhh commented Feb 6, 2024

Description of PR

Summary:
Fixes # (issue)
Add a checker to verify that dualtor-aa interfaces are active on both sides. If they are not, try to recover by configuring the mux mode, restarting nic_simulator, etc.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205
  • 202305
  • 202311

Approach

What is the motivation for this PR?

Tests are failing at setup because dualtor-aa interfaces are left in a non-active state.

How did you do it?

  1. Added a checker to confirm that the mux status on both sides == 1 (active).
  2. Moved the mux config change before the config reload and saved the auto config explicitly.
  3. Skipped the mux_simulator restart when there are no active-standby interfaces, to avoid unnecessary errors.
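
The check in step 1 can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in the PR: the function name `find_inactive_active_active_ports`, the `active_active_ports` argument, and the sample status value `0` for a non-active port are assumptions made for the example. Only the `['status'] != 1` comparison mirrors the diff under review.

```python
def find_inactive_active_active_ports(upper_tor_mux_status,
                                      lower_tor_mux_status,
                                      active_active_ports):
    """Return the port indexes that are not active on both ToRs.

    Hypothetical helper: both status arguments map a port index to a
    dict with a 'status' key, where 1 means active (as in the PR diff).
    """
    inactive_ports = []
    for port_idx in active_active_ports:
        upper = upper_tor_mux_status[port_idx]['status']
        lower = lower_tor_mux_status[port_idx]['status']
        # An active-active port is healthy only if both sides report 1.
        if upper != 1 or lower != 1:
            inactive_ports.append(port_idx)
    return inactive_ports


# Sample data (assumed): port 2 is not active on the lower ToR.
upper = {1: {'status': 1}, 2: {'status': 1}}
lower = {1: {'status': 1}, 2: {'status': 0}}
result = find_inactive_active_active_ports(upper, lower, [1, 2])
print(result)
```

A non-empty result would correspond to the "Inconsistent mux status for active-active ports" failure that triggers recovery in the log below.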

How did you verify/test it?

Ran bgp/test_bgp_update_timer.py::test_bgp_update_timer_session_down:

  1. with mux config == auto, sanity check passed, test passed.
  2. with mux config == standby on both sides, sanity check triggered recover, test passed.
01:31:47 checks._check                            L0653 WARNING| Inconsistent mux status for active-active ports on dualtors,                                                    please check output of "show mux status"
01:32:00 recover.recover                          L0185 WARNING| Try to recover svcstr-7050-acs-1 using method adaptive
01:32:00 recover.adaptive_recover                 L0169 WARNING| Restoring {'failed': True, 'failed_reason': 'Inconsistent mux status for active-active ports on dualtors,                                                    please check output of "show mux status"', 'check_item': 'mux_simulator', 'action': <function check_mux_simulator.<locals>._recover at 0x7f7229443670>, 'hosts': ['svcstr-7050-acs-1', 'svcstr-7050-acs-2']} with proposed action: config_reload, final action: config_reload
01:32:00 config_reload.config_reload              L0093 INFO   | reloading running_golden_config
... ...
01:34:34 recover.recover                          L0185 WARNING| Try to recover svcstr-7050-acs-2 using method adaptive
01:34:34 recover.adaptive_recover                 L0169 WARNING| Restoring {'failed': True, 'failed_reason': 'Inconsistent mux status for active-active ports on dualtors,                                                    please check output of "show mux status"', 'check_item': 'mux_simulator', 'action': <function check_mux_simulator.<locals>._recover at 0x7f7229443670>, 'hosts': ['svcstr-7050-acs-1', 'svcstr-7050-acs-2']} with proposed action: config_reload, final action: config_reload
01:34:34 config_reload.config_reload              L0093 INFO   | reloading running_golden_config
... ...
01:36:55 __init__.sanity_check                    L0269 INFO   | Run sanity check again after recovery
01:36:55 checks._check_processes_on_dut           L0773 INFO   | Checking process status on svcstr-7050-acs-1...
01:36:55 checks._check_processes_on_dut           L0773 INFO   | Checking process status on svcstr-7050-acs-2...
01:36:56 checks._check_processes_on_dut           L0778 INFO   | networking_uptime=90 seconds, timeout=210 seconds, interval=20 seconds
01:36:56 checks._check_processes_on_dut           L0778 INFO   | networking_uptime=247 seconds, timeout=53 seconds, interval=20 seconds
01:37:07 checks._check_processes_on_dut           L0811 INFO   | Done checking processes status on svcstr-7050-acs-1
01:37:07 parallel.on_terminate                    L0085 INFO   | process _check_processes_on_dut--<MultiAsicSonicHost svcstr-7050-acs-1> terminated with exit code None
01:37:07 checks._check_processes_on_dut           L0811 INFO   | Done checking processes status on svcstr-7050-acs-2

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@zjswhhh zjswhhh requested review from lolyu and yxieca February 6, 2024 02:00

@lolyu lolyu left a comment


LGTM

@@ -519,6 +519,12 @@ def _verify_inconsistent_mux_status(duts_parsed_mux_status, dut_upper_tor, dut_l
err_msg_from_mux_status.append('Inconsistent mux status for active-standby ports on dualtors, \
please check output of "show mux status"')
dut_wrong_mux_status_ports.append(port_idx)
if cable_type == CableType.active_active:
logger.debug('Verify that active-active ports:{}'.format(duts_parsed_mux_status))
if (upper_tor_mux_status[port_idx]['status'] != 1 or lower_tor_mux_status[port_idx]['status'] != 1):

Not for this PR, but I suggest defining constants for active/standby/...
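
One possible shape for that follow-up, as a sketch only: the `MuxStatus` enum and the `both_tors_active` helper are hypothetical names, and `ACTIVE = 1` is the only value confirmed by this diff. The values for the other states are not shown in this PR, so they are left out here.

```python
from enum import IntEnum


class MuxStatus(IntEnum):
    """Hypothetical named constants for parsed mux status values."""
    ACTIVE = 1  # the only value confirmed by the diff in this PR
    # Other states (standby, ...) would be added with their real values.


def both_tors_active(upper_status, lower_status):
    """Return True when both ToRs report the active status."""
    # IntEnum members compare equal to plain ints, so values parsed as
    # integers can be compared without conversion.
    return (upper_status == MuxStatus.ACTIVE
            and lower_status == MuxStatus.ACTIVE)
```

With such a constant, the condition in the diff could read as `not both_tors_active(...)` instead of comparing against the literal `1` twice.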

@yxieca yxieca merged commit 77e2a67 into sonic-net:master Feb 12, 2024
13 checks passed
@zjswhhh zjswhhh deleted the bgp_update_timer_master_public branch February 12, 2024 19:01
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 22, 2024
@mssonicbld

Cherry-pick PR to 202311: #12091

@mssonicbld

Cherry-pick PR to 202305: #12092
