Session crashes (without errors in log) right after the O sends a tx #2152

Closed
0xVires opened this issue Dec 27, 2021 · 14 comments · Fixed by #2163
Assignees
Labels
status: core contributors working on it in progress type: bug Something isn't working

Comments

@0xVires

0xVires commented Dec 27, 2021

Seeing some strange behavior with the latest livepeer version:
All my sessions crash without anything showing up in the logs. All the transcoding just seems to stop:

Dec 26 23:53:24 livepeer[129136]: I1226 23:53:24.710594  129136 orchestrator.go:766] manifestID=19782171-0479-4561-be44-2c3417c1c94d seqNo=370 orchSessionID=304da1aa Successfully received results from remote transcoder=x:59400 segments=2 taskId=737779 fname=https://x:8935/stream/304da1aa/370.tempfile dur=463.42726ms
Dec 26 23:53:26 livepeer[129136]: I1226 23:53:26.322271  129136 orchestrator.go:766] manifestID=09dcsceke8mmnotk seqNo=135 orchSessionID=66d7a6b4 Successfully received results from remote transcoder=x:44522 segments=4 taskId=737781 fname=https://x:8935/stream/66d7a6b4/135.tempfile dur=109.798038ms
Dec 27 00:29:26 livepeer[129136]: 2021/12/27 00:29:26 http: TLS handshake error from 143.244.33.78:59807: EOF

Only in the T logs do I see a few "Transcode loop timed out for key=ebf7dc72_0" and "Deleted transcode session for key=ebf7dc72_0" messages, but no errors there either. Some more info here: https://discord.com/channels/423160867534929930/426114749370204170/924479708605861920

The weird thing is that both of those crashes happened just a minute or two after the O sent (or was supposed to send) a tx. The first time right after the start of a new round, the second time right after the O claimed a winning ticket.

Crash 1 (new round started 4min before 19:00):
image

Crash 2 (winning ticket tx was mined at 23:50):
image

Any guesses why my sessions crash without anything showing up in the logs?

@Strykar
Contributor

Strykar commented Dec 27, 2021

I see similar behavior approximately 6 minutes after a winning ticket was claimed.
Dec 26 16:32:21 i5 livepeer[3604]: Invoking transaction: "redeemWinningTicket". Inputs: "_ticket: { Recipient: 0x1a...
Screenshot from 2021-12-27 07-51-17

@ChuckChain

Just to add that we're seeing the same behavior. O/T "crashed" just after redeeming a winning ticket, but no errors in the log. Restarted the O/T this morning (about 2 hours ago), and it's having trouble keeping sessions since.
Screenshot 2021-12-27 102332

@0xVires
Author

0xVires commented Dec 27, 2021

Happened for the third time - again right after redeeming a winning ticket
image
@ChuckChain the sessions will eventually come back, but it might take a few hours to recover

@ChuckChain

@0xVires You're right, it took a couple of hours, but it went back to normal.

@yondonfu
Member

Just a heads up that this issue will be investigated this week, but progress may be a bit slow given that we're in a holiday period.

@Franck-UltimaRatio

Franck-UltimaRatio commented Jan 2, 2022

I can confirm this: our O crashed just after sending a tx to redeem our last ticket (T/O split, Ubuntu).

@yondonfu yondonfu assigned yondonfu and leszko and unassigned yondonfu Jan 3, 2022
@0xVires
Author

0xVires commented Jan 3, 2022

Adding to all the cases where it happens after redeeming a winning ticket: it also seems to happen in some cases when a new round starts and the O is supposed to send the reward claim tx.

I've already described this in my first post and it just happened again in round 2419. All my sessions were gone approx. a minute after the new round started (so when the O was supposed to send the reward claim tx - but it never did):
image

I don't know under which circumstances it happens. I've also had rounds where the O successfully sent the reward tx without session crashes, and rounds where I had to send it manually due to my maxGasPrice setting, also without the sessions crashing.

But this is the second time now so I don't think it's a coincidence...

@criticaltv
Contributor

criticaltv commented Jan 3, 2022

I experienced a complete loss of traffic today, which coincided with the start of the new round.

I also recently had problems restarting my O, only to find that the NVIDIA drivers had somehow uninstalled themselves. This also preceded the first time in a long while that my node did not automatically call reward. I reinstalled the latest drivers, re-applied the keylase patch, and everything is fine again now (aside from the complete loss of traffic mentioned above).

@leszko
Contributor

leszko commented Jan 4, 2022

I started working on the issue. The good news is that I managed to reproduce it locally with the following steps:

  • Set up local Livepeer Geth
  • Start Orchestrator
  • Update max gas price to 1000000010
  • After a few minutes there is a deadlock in TimeWatcher

The deadlock prevents lastSeenBlock from being refreshed, and as a result there is a TicketParams expired error.

I'm still not sure how the change in maxGasPrice is related to the deadlock, but I think I'll manage to nail it soon.

@yondonfu
Member

yondonfu commented Jan 5, 2022

This should be fixed in the v0.5.24 release. If anyone continues to encounter this issue please re-open the issue and provide a report.

@0xVires
Author

0xVires commented Jan 12, 2022

Had another session crash right after redeeming a winning ticket (https://etherscan.io/tx/0x59fd2abf5ec03aa1623d506e20a1bd8a5e2c358a8946333a8e1f72f9f27c3886).

image

Again, no errors in the logs. But with the newest version, the O recovered itself without needing a restart. Still odd, especially since another winning ticket a few minutes earlier (https://etherscan.io/tx/0xc943a4a30f76e1ab369949124587dbb8ad3581fdbf823ea0f5635962ac6c39ce) didn't cause any problems.

@yondonfu
Member

Hm might be related to #2168 cc @leszko

@leszko
Contributor

leszko commented Jan 13, 2022

Yeah, my rough guess is that it's related. Reopening the issue. The weird thing is that v0.5.24 reverted all the tx-related changes, so it should behave exactly the same as v0.5.22.

@0xVires do you experience this issue more in version v0.5.22 than you did in v0.5.24?

@leszko
Contributor

leszko commented Feb 4, 2022

Closing, since I believe it is fixed by #2208. Reopen if you still encounter the issue after the next go-livepeer release.

@leszko leszko closed this as completed Feb 4, 2022
@leszko leszko mentioned this issue Mar 4, 2022
8 participants