[xcvrd][sfputil] sfputil response is extremely slow when used on a xcvr of type CMIS #12133

Open
vivekrnv opened this issue Sep 20, 2022 · 4 comments
Labels: Arista, Help Wanted 🆘, Triaged

@vivekrnv
Contributor

Description

The inefficiency is in SfpBase, xcvr_mem_maps, and related code. It also affects xcvrd, since both xcvrd and sfputil use the same SfpBase APIs, such as get_transceiver_info, get_transceiver_bulk_status, and get_transceiver_threshold_info. On a device with 30 front-panel ports and 30 QSFP-DD xcvrs, I've seen pmon CPU usage reach up to 35%, recurring with a period of 10-20 sec. pmon usage can get progressively worse as the number of front-panel ports grows.

Steps to reproduce the issue:

  1. Plug in a CMIS cable, e.g. QSFP-DD
  2. Run sfputil, e.g. sfputil show eeprom -p Ethernet0

Describe the results you received:

root@r-leopard-58:/home/admin# time sfputil show eeprom -p Ethernet0
Cannot get Module EEPROM data: Invalid argument
Ethernet0: SFP EEPROM detected
        Active Firmware Version: 0.0
        CMIS Revision: 4.0
        Identifier: QSFP-DD Double Density 8X Pluggable Transceiver
                Specification compliance: passive_copper_media_interface
        Vendor Date Code(YYYY-MM-DD Lot): 2020-12-19
        Vendor Name: Mellanox
        Vendor OUI: 00-02-c9
        Vendor PN: MCP1660-W00AE30
        Vendor Rev: A3
        Vendor SN: MT2051VS03513

real    0m4.875s
user    0m1.179s
sys     0m0.562s

In comparison:

QSFP28:
root@r-leopard-58:/home/admin# time sfputil show eeprom -p Ethernet248
Ethernet248: SFP EEPROM detected
        Application Advertisement: N/A
        Connector: No separable connector
        Encoding: 64B/66B
        Extended Identifier: Power Class 1 Module (1.5W max.), No CLEI code present in Page 02h, No CDR in TX, No CDR in RX
        Extended RateSelect Compliance: Unknown
        Identifier: QSFP28 or later
        Length Cable Assembly(m): 2.0
        Nominal Bit Rate(100Mbs): 255
        Specification compliance:
                10/40G Ethernet Compliance Code: Unknown
                Extended Specification Compliance: 100GBASE-CR4, 25GBASE-CR CA-25G-L or 50GBASE-CR2 with RS
                Fibre Channel Link Length: Unknown
                Fibre Channel Speed: Unknown
                Fibre Channel Transmission Media: Unknown
                Fibre Channel Transmitter Technology: Unknown
                Gigabit Ethernet Compliant Codes: 1000BASE-CX
                SAS/SATA Compliance Codes: Unknown
                SONET Compliance Codes: Unknown
        Vendor Date Code(YYYY-MM-DD Lot): 2016-12-31
        Vendor Name: Mellanox
        Vendor OUI: 00-02-c9
        Vendor PN: MCP7H00-G01AR
        Vendor Rev: A1
        Vendor SN: MT1710VS04177

real    0m0.691s
user    0m0.275s
sys     0m0.110s

Triage

A single get_transceiver_info() call results in 31 calls to read_eeprom, and on many platforms read_eeprom uses either a subprocess call or file open/read operations, which makes it extremely slow. Calling get_transciever_domI() can add 40+ more calls to read_eeprom.
Note: These stats were taken on the MSN4700 platform.

root@r-leopard-01:/home/admin# python3 -m cProfile -s tottime /usr/local/bin/sfputil show eeprom -p Ethernet0  | grep eeprom
root@r-leopard-01:/home/admin# cat pre_opt.txt
       31    0.002    0.000    4.387    0.142 sfp.py:350(_read_eeprom_specific_bytes)
       29    0.001    0.000    4.059    0.140 xcvr_eeprom.py:15(read)
       31    0.000    0.000    4.387    0.142 sfp.py:374(read_eeprom)
        1    0.000    0.000    4.411    4.411 main.py:611(eeprom)
       29    0.000    0.000    0.000    0.000 xcvr_eeprom.py:29(<dictcomp>)
        1    0.000    0.000    0.000    0.000 eeprom_dts.py:3(<module>)
        1    0.000    0.000    0.000    0.000 xcvr_eeprom.py:10(__init__)
        1    0.000    0.000    0.000    0.000 xcvr_eeprom.py:1(<module>)
        1    0.000    0.000    0.000    0.000 xcvr_eeprom.py:9(XcvrEeprom)
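
A call count like the one in the profile above can also be collected programmatically. The following is a minimal sketch using cProfile and pstats from the standard library. FakeSfp, read_fields, and count_calls are hypothetical names used only for illustration; FakeSfp stands in for a platform sfp.py and is not the real class.

```python
import cProfile
import pstats


class FakeSfp:
    """Hypothetical stand-in for a platform SFP object (not the real sfp.py)."""

    def __init__(self, size=256):
        self._data = bytes(i % 256 for i in range(size))

    def read_eeprom(self, offset, num_bytes):
        # A real platform would open a sysfs file or spawn a subprocess here.
        return self._data[offset:offset + num_bytes]


def read_fields(sfp, fields):
    """Read each (offset, length) field with its own read_eeprom call."""
    return [sfp.read_eeprom(offset, length) for offset, length in fields]


def count_calls(func, name, *args):
    """Run func(*args) under cProfile and count calls to functions named `name`."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    total = 0
    # stats.stats maps (file, line, funcname) -> (cc, nc, tottime, cumtime, callers)
    for (_file, _line, funcname), entry in stats.stats.items():
        if funcname == name:
            total += entry[1]  # nc: total number of calls
    return total


if __name__ == "__main__":
    fields = [(0, 1), (1, 2), (3, 4)]
    print(count_calls(read_fields, "read_eeprom", FakeSfp(), fields))
```

Pointing count_calls at a real get_transceiver_info() on a target device should give a count comparable to the cProfile table, although that has not been verified here.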

SfpBase, Xcvr_Api, MemMap and the associated classes must be optimized.
The ideal optimization target is to drastically reduce the number of calls to read_eeprom.
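
One possible direction, sketched under stated assumptions rather than as the actual fix: read the EEPROM in 128-byte blocks (the size of a CMIS page) and serve individual field reads from a cache. PageCachedReader is a hypothetical wrapper, not an existing SfpBase or XcvrEeprom class. It assumes raw_read(offset, num_bytes) behaves like read_eeprom and returns a bytes-like object or None on failure.

```python
class PageCachedReader:
    """Hypothetical wrapper that caches 128-byte blocks of EEPROM data."""

    PAGE_SIZE = 128

    def __init__(self, raw_read):
        self._raw_read = raw_read  # assumed signature: raw_read(offset, num_bytes)
        self._pages = {}
        self.raw_calls = 0         # number of underlying reads, for measurement

    def _page(self, index):
        if index not in self._pages:
            self.raw_calls += 1
            data = self._raw_read(index * self.PAGE_SIZE, self.PAGE_SIZE)
            if data is None:
                raise IOError("EEPROM read failed for block %d" % index)
            self._pages[index] = bytes(data)
        return self._pages[index]

    def read(self, offset, num_bytes):
        """Return num_bytes starting at offset, touching each block at most once."""
        out = bytearray()
        end = offset + num_bytes
        while offset < end:
            index, start = divmod(offset, self.PAGE_SIZE)
            chunk = min(end - offset, self.PAGE_SIZE - start)
            out += self._page(index)[start:start + chunk]
            offset += chunk
        return bytes(out)

    def invalidate(self):
        """Drop cached blocks, e.g. after a module is removed or replaced."""
        self._pages.clear()
```

A cache like this is only safe for static fields such as vendor name or serial number. Live values (DOM readings, flags) would need to bypass it or invalidate it on every poll, so the cache lifetime is a design decision this sketch leaves open.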

@prgeor prgeor self-assigned this Sep 20, 2022
@prgeor prgeor added the Triaged label Sep 20, 2022
@dgsudharsan
Collaborator

@prgeor Can you please provide an ETA for the fix?

@prgeor
Contributor

prgeor commented Nov 7, 2022

@dgsudharsan there is an inherent issue where the mlnx platform makes several ethtool command calls via process calls, which makes sfputil much slower on the mlnx platform. Do you still see the issue after this fix?

@vivekrnv
Contributor Author

vivekrnv commented Nov 7, 2022

> @dgsudharsan there is an inherent issue where the mlnx platform makes several ethtool command calls via process calls, which makes sfputil much slower on the mlnx platform. Do you still see the issue after this fix?

That fix significantly reduces the response time but the current approach still involves making multiple file open and read calls. I think SfpBase and the others can be optimized to reduce read_eeprom calls.
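
As an illustration of reducing the number of reads, adjacent or nearby field ranges can be merged so that one raw read serves several fields. This is a minimal sketch, not how SfpBase currently works. coalesce and read_fields_coalesced are hypothetical helpers, and raw_read(offset, num_bytes) is assumed to behave like read_eeprom.

```python
def coalesce(fields, max_gap=0):
    """Merge (offset, length) ranges that overlap or lie within max_gap bytes."""
    spans = []
    for offset, length in sorted(fields):
        end = offset + length
        if spans and offset <= spans[-1][1] + max_gap:
            # Extend the previous span instead of starting a new read.
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([offset, end])
    return [(start, end - start) for start, end in spans]


def read_fields_coalesced(raw_read, fields, max_gap=0):
    """Return {(offset, length): bytes} using one raw read per merged span."""
    result = {}
    for start, size in coalesce(fields, max_gap):
        block = raw_read(start, size)
        for offset, length in fields:
            if start <= offset and offset + length <= start + size:
                lo = offset - start
                result[(offset, length)] = bytes(block[lo:lo + length])
    return result
```

With max_gap=0 only touching or overlapping fields are merged. A larger max_gap trades a few extra bytes per read for fewer open/read calls, which is likely the better trade when each call costs a file open or a subprocess.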

@prgeor
Contributor

prgeor commented Mar 1, 2023

@andywongarista let's discuss the fix for this issue, which was introduced by the SFP refactor.

@prgeor prgeor added the Arista label Mar 1, 2023