-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adds ssh debugger for selenium test #498
Conversation
@robacarp I suspect you'll get what you need from this running in the PR, but if you need to merge into main for testing, go right ahead |
3fe08f4
to
9e162a5
Compare
d66f151
to
fb396f0
Compare
fb396f0
to
d0a02b5
Compare
f44b3a8
to
072e381
Compare
@krschacht Alright, here's what I think is happening. I don't have a solution yet, but it's a race condition somewhere between chromedriver and selenium. In #483 you specc'd for a bigger runner because there was suspicion that it was a memory issue, but that probably made the problem worse because now ruby boots faster but chromedriver still takes forever? A short term reduction in pain might be to spec-down the runner. The Net::ReadTimeout is coming from capybara (aka selenium) failing to hit chromedriver when attempting to set up the server. This stack trace is all sorts of interesting:
I added wait-for-it.sh but the problem isn't that chromedriver isn't ready, because it still fails when waitforit passes. This is also an interesting failure: https://github.com/AllYourBot/hostedgpt/pull/498/checks#step:9:37 It seems like everything booted up in the right order, but the tests just never ran (I cancelled the job after ~3min). It makes me wonder if there's some confusion in capybara connecting to chromedriver, and it's waiting for chromedriver? Is the PORT 9515 assumption in the chromedriver github action plugin wrong? (Or is it not correctly installed?) |
@robacarp another big clue in all this is that it all used to work fine, for many months, and then this all started one day. Does that suggest anything else to you or a possible solution? I even tried rolling back to the point in our commit history where everything was working reliably but it no longer does. I wonder if a dependency has a new version rolled out but if we could pin to an older version this might fix? But rolling back should have done that with gemfile.lock so maybe it’s a dependency that’s somehow outside of that, like part of GitHub action runner config? |
I don't think this is anything to do with the code here, unfortunately. I think the fact that it ever worked was luck. I added a DEBUG environment variable to the run here and there's a lot more output from many of the tools in the process, but nothing jumps out. Another thing I forgot to mention that suggests an inter-process race condition, when I SSH into the job mid-run, it sometimes fails or hangs for a bit. But if I interrupt the process (^c) and then re-run it, it goes fine. |
Dang, it still seems really odd that for ~6 months these tests never hit this issue and the one day it starts hitting the issue regularly. That's what makes me think that an updated chromedriver or ubuntu-latest system dependency—something outside of gemfile.lock—changed and that introduced this issue. That's the only way I can explain rolling back the repo and still getting the error. Also, I never get this issue locally on my dev machine. I only get it in github actions. Initially, I suspected this was because my dev machine has more resources (cores, etc) so that's what prompted me to upgrade the github actions server. And notably, that did help. When I was using the default In terms of working around this, would it be possible to study the stack trace and find a place within the selenium or capybara code where you could re-open the class, re-write a method to add a simple "sleep 10" or something hacky like that in the right place within the boot cycle to work around the race? |
This sort of works -- but if capybara doesn't start the server until after the tests finish, it still hangs. |
Sorry, this -- I didn't realize I rolled back the retry code: def self.test(name, &block)
begin
super(name, &block)
rescue Net::ReadTimeout => e
puts "(caught) Net::ReadTimeout: #{e}"
retry_count ||= 0
retry_count += 1
retry if retry_count < 3
end
end |
Is there some place you think you could introduce a delay to guarantee capybara gets the server started? |
Not that I've seen. I don't think it's a delay in chromedriver -- I added a wait to ensure that gets started, but it's still failing. It's so strange that it's attempting to run the tests before capybara has started the server. |
Might be worth looking at Cuprite to see if it may get around the issue. See this article and the cuprite project. |
Hi @robacarp, Matthew's suggestion is an interesting one but I also got another lead. I opened an issue on Selenium and someone explained how they resolved this by pinning a chromium version: |
Just to keep capture this idea, on that other Issue someone claims that he solved a bunch of issues by switching to Playwright and linked to an article: teamcapybara/capybara#2770 (comment) |
No description provided.