
Add lock around save in pusher to avoid race condition #1665

Closed · jules2689 wants to merge 2 commits

Conversation

jules2689

The problem

When gems are pushed at approximately the same time, two simultaneous requests are made. These requests end up both pushing an identical version to RubyGems. When Bundler tries to validate the version, things go 💥 and the SHA256 checksum check fails.

This often happens in Travis matrix builds.

How this fixes it

A lock around the functionality was discussed in the linked issues, and I chose to take this path.
I did not use an Active Record lock because we do not want to prevent other changes to the gem (like download stats, edits, etc.) while something is uploading. Instead, we lock the save functionality directly.

This means that a call to save must first acquire the lock, and responds with 409 CONFLICT otherwise.

Should the system crash during a save, the lock will expire automatically after 10 minutes as a fallback.
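
Roughly, the guarded save looks like this (a sketch only; the lock key format, the version.full_name accessor, and whether the lock is released explicitly or just left to expire are illustrative, not the exact diff):

    def save
      # Key is illustrative; it just needs to uniquely identify the
      # gem name + version being pushed.
      lock_key = "gem_push_lock:#{rubygem.name}:#{version.full_name}"

      # Refuse the push if another request is already processing this version.
      if Rails.cache.exist?(lock_key)
        notify("We are already processing this gem with the specified version.", 409)
        return false
      end

      # Take the lock; it expires on its own in case the process dies mid-save.
      Rails.cache.write(lock_key, true, expires_in: 10.minutes)

      begin
        # ... existing save work: upload the .gem to S3, persist the Version row ...
      ensure
        Rails.cache.delete(lock_key)
      end
    end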

Questions

  • Is there a locking library I should use instead?
  • Rails.cache locking is supposedly not the best; I have little experience with it. Maybe it gives us enough benefit without additional overhead?

Issues this solves

Fixes #1564
Fixes #1551

cc @indirect @dwradcliffe

@indirect
Member

indirect commented Aug 9, 2017

I think I'm okay with this. The race condition is decreased from "the entire time it takes to upload the .gem file to s3" down to the time between Rails.cache.read and Rails.cache.write, right here:

    if Rails.cache.exist?(lock_key)
      notify("We are already processing this gem with the specified version.", 409)
      return false
    end

    Rails.cache.write(lock_key, true, expires_in: 10.minutes)

It's possible for exist? to return false but for another web request to set the value before the call to write a few lines later. Since there's (apparently) no way to prevent this kind of race condition in Rails.cache, I assume that's the reason Rails.cache is viewed as not a very good lock.

Putting the lock in Rails.cache doesn't remove the race condition, but it shrinks the window in which the race can happen by a huge amount.
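
For what it's worth, a cache store with an atomic add could close that gap entirely; a sketch, assuming MemCacheStore, where write with unless_exist: true maps to an add and returns a falsy value when the key already exists:

    # Sketch: a single atomic write instead of a separate exist?/write pair.
    acquired = Rails.cache.write(lock_key, true,
                                 unless_exist: true,
                                 expires_in: 10.minutes)

    unless acquired
      notify("We are already processing this gem with the specified version.", 409)
      return false
    end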

@dwradcliffe are you okay with this?

@jules2689 can you run rubocop -a and then force push? That should get the tests green.

Final question: how should we handle checksum mismatches if this (much smaller) race condition gets triggered, and we end up with an s3/db mismatch?

@jules2689
Author

@indirect I'm not sure it's worth spending a ton of time preventing mismatches. I think it would be worth more to us to simply fix them automatically. If possible, we could detect when a checksum mismatch happens and fix it in a background job.

We could also run an hourly task that takes all gems uploaded in the last hour and checks the checksum?

@indirect
Member

indirect commented Aug 9, 2017

True. At that point, we could maybe even just have every save enqueue a background job for execution in 5 minutes that calls GemcutterTaskshelper.recalculate_sha256. 👍🏻
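
A sketch of that, assuming ActiveJob (the job class name and the argument passed to recalculate_sha256 are assumptions, not the real helper's signature):

    # Sketch: a delayed job that re-derives the checksum after the push settles.
    class RecalculateSha256Job < ApplicationJob
      queue_as :default

      def perform(version_full_name)
        GemcutterTaskshelper.recalculate_sha256(version_full_name)
      end
    end

    # Enqueued from the pusher after a successful save:
    RecalculateSha256Job.set(wait: 5.minutes).perform_later(version.full_name)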

@dwradcliffe
Member

dwradcliffe left a comment

LGTM

@arthurnn arthurnn self-assigned this Aug 28, 2017
@arthurnn
Member

I did not use an Active Record lock because we do not want to prevent other changes to the gem (like download stats, edits, etc.)

I am still thinking about why we can't use DB locking for this. What do you mean by that?
An exclusive lock on the versions table should only affect the updates on that version row, not anything else.

@jules2689
Author

jules2689 commented Aug 29, 2017

@arthurnn the problem we are solving here is that jobs (particularly from parallelized CI systems like Travis matrix builds) are attempting to upload the same version simultaneously. This means that we will not find and use a previous row, but create two new rows as per https://github.com/jules2689/rubygems.org/blob/ddc02e8951acb084e15fbadfa36fcaa379efbdb6/app/models/pusher.rb#L86-L109

This means that the row lock would not be effective, as we would just make two of them and they would be totally separate from each other - I think; I'm not well versed in DB locks so I could easily be mistaken :)

@arthurnn
Member

Indeed we don't want to upload two independent versions for the same spec.
When we upload a version of a gem, the spec has its version. So before trying to create it in the DB, we do a query to see if the version already exists.
If, on that query, we used an exclusive lock (this is a MySQL link, but I bet Postgres has that too) and held it until we save the version to the DB, another exclusive lock on the index would be blocked until the first transaction finished, making the second request find an entry for that version and return the 'Repushing of gem versions is not allowed.' error.

I think the code will have to change a bit, to wrap most of the process method inside a DB transaction. I guess only the save method would be outside the transaction, as that's the one that does the upload to S3 and we don't want to hold the lock for that long. However, we need to save the version before we get to that point, which will release the DB lock.
Also, something that might be simpler and might work: change the versions.new to a create call. That should create the row inside the transaction and put a gap lock on the index (at least that's what would happen in MySQL), so another process would be blocked trying to do the find for the same version (as long as we put an exclusive lock on the select).
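
A rough sketch of that shape (attribute names, the spec and platform variables, and the choice to lock the parent Rubygem row rather than rely on an index gap lock are all assumptions; Postgres has SELECT ... FOR UPDATE but not MySQL-style gap locks, so locking the parent row is one way to get the same serialization):

    # Sketch: serialize concurrent pushes of the same gem by locking the
    # parent Rubygem row for the duration of the existence check + insert.
    duplicate = false

    Rubygem.transaction do
      rubygem = Rubygem.lock.find_by!(name: spec.name)  # SELECT ... FOR UPDATE
      number  = spec.version.to_s

      if rubygem.versions.exists?(number: number, platform: platform)
        duplicate = true
      else
        rubygem.versions.create!(number: number, platform: platform)
      end
    end

    if duplicate
      notify("Repushing of gem versions is not allowed.", 409)
      return false
    end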

Sorry for the back and forth; the memcached solution is a good one and it works. I just think we can solve this with DB locking, which is the level we want to synchronize data at.
This process already relies on S3 and the DB, so adding memcached (which is a cache tool) seems unnecessary.

@jules2689
Author

jules2689 commented Aug 29, 2017

Sorry for the back and forth; the memcached solution is a good one and it works. I just think we can solve this with DB locking, which is the level we want to synchronize data at.

I originally tried that path, so if we can get this going then I'm 👍

This process already relies on S3 and the DB, so adding memcached (which is a cache tool) seems unnecessary.

Technically it already does, through background jobs - I think.

Happy to refactor this though

@bundlerbot
Collaborator

☔ The latest upstream changes (presumably 101ce1b) made this pull request unmergeable. Please resolve the merge conflicts.

@jules2689
Author

I've been away for the past few weeks at RubyKaigi. I intend to look at this again this week or next.

@jules2689
Author

jules2689 commented Oct 6, 2017

Superseded by #1681

@jules2689 jules2689 closed this Oct 6, 2017
@jules2689 jules2689 deleted the race-condition-version branch October 6, 2017 17:08
Successfully merging this pull request may close these issues:

  • race condition can lead to double gem push
  • Mismatched checksums on a couple gems