Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The timeout duration for starting/stopping a switch may be a little bit short. #58

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tang-mouren
Copy link

@tang-mouren tang-mouren commented Sep 3, 2023

I've encountered a problem when creating/stopping a topology containing 144 hosts and 13 switches with p4utils. There is a high probability that the program will raise a ChildProcessError which saying "P4 swich xx did not start/stop correctly.". I digged into the source code and found that the error was from the if sentence: "if not wait_condition(self.switch_status, True, timeout=SWITCH_START_TIMEOUT".
To my understanding, the time limit for a child process to create or stop a P4 switch may not be long enough when creating a larger topology, so I changed the global variables "SWITCH_START_TIMEOUT" and "SWITCH_STOP_TIMEOUT" from 10 to 60. This solution worked and the problem didn't occured again after I changed the code.
I'm not certained whether my understanding is correct, so I make this pull request. I will appreciate it if you can give me a reply, and I hope this adjustment to the code can help others meeting the same problem

@tang-mouren
Copy link
Author

In my commit, I made a compromise, so the variables value are chagned into 30.

@edgar-costa
Copy link
Collaborator

HI @tang-mouren,

Thanks for looking into this. I never had this problem, even with topologies of that size (probably with fewer hosts). I guess that is the problem, although given everything is started sequentially, I am surprised that 10 seconds are not enough.

The only problem I see with raising the timeout is that when the problem comes from something else (i.e., a bug in the switch code), the waiting times might be too high. Was that the reason why you went from 60 to 30 in the pull request?

@tang-mouren
Copy link
Author

Emm, later I changed the value from 60 to 30 and the problem didn't occur. I haven't done enough experiments to test the problem to get the most suitable time so I choose the value 30, in case that a bug might cause the infiniteloop letting developers waste more time waiting for the timeout.
Up till now, in my opinion, we can give the variable a larger size. And in the future, we can add more information in the error report, telling developers that changing the value of the time limit may address the problem

@tang-mouren
Copy link
Author

I found that the program may stick at a moment when allocating more interfaces to a switch, and this may cause some time. The timeout event happens randomly. It doesn't always occur when I run the program.

@tang-mouren
Copy link
Author

So, I suggest the value of the time limit can be changed larger to avoid this problem to a certain degree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants