CKAN's GitHub downloads are breaking the rules #2210

Closed
HebaruSan opened this issue Dec 8, 2017 · 22 comments · Fixed by #3054
Labels
Cmdline Issues affecting the command line GUI Issues affecting the interactive GUI Netkan Issues affecting the netkan data Network Issues affecting internet connections of CKAN

Comments

@HebaruSan (Member) commented Dec 8, 2017

GitHub downloading needs a rewrite

(I debated whether to add this as a comment to #1817, but it seems like too much text and detail for that.)

Problems

Currently, if CKAN downloads many files from GitHub at the same time, they often fail with HTTP status 403 Forbidden. #1817 contains an example, but these reports are common, and I've definitely seen it happen myself several times.

Background

The GitHub API uses 403 codes for throttling; you get 60 unauthenticated requests per hour, and any beyond that return a 403. I encountered this while working on an unrelated project, and I had to use a GitHub token to allow 5000/hour, passed in the HTTP request headers:

Authorization: token <OAuth token here>

Currently CKAN's downloads do not go through the GitHub API, so this does not necessarily indicate exactly what's going on with them. However, it establishes that 403-Forbidden is sometimes used for throttling, and it becomes more relevant later in discussion of the API.
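
For reference, this is roughly what an authenticated API request looks like from .NET (a hypothetical sketch, not CKAN's actual code; the release URL is from the sample data below and the token is a placeholder):

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Minimal sketch of an authenticated GitHub API request.
// GitHub requires a User-Agent header; the token value is a placeholder.
static async Task<string> GetReleaseJson(string token)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.UserAgent.ParseAdd("CKAN");
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("token", token);
        return await client.GetStringAsync(
            "https://api.github.com/repos/HebaruSan/Astrogator/releases/7538924");
    }
}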

Sample API data for a release, minus the author and uploader fields since they're long and not relevant to this issue:

{
  "url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/7538924",
  "assets_url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/7538924/assets",
  "upload_url": "https://uploads.github.com/repos/HebaruSan/Astrogator/releases/7538924/assets{?name,label}",
  "html_url": "https://github.com/HebaruSan/Astrogator/releases/tag/v0.7.8",
  "id": 7538924,
  "tag_name": "v0.7.8",
  "target_commitish": "master",
  "name": "Frictionless toilet",
  "draft": false,
  "prerelease": false,
  "created_at": "2017-08-28T06:00:32Z",
  "published_at": "2017-08-28T06:09:07Z",
  "assets": [
    {
      "url": "https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631",
      "id": 4682631,
      "name": "Astrogator.zip",
      "label": null,
      "content_type": "application/zip",
      "state": "uploaded",
      "size": 114372,
      "download_count": 5243,
      "created_at": "2017-08-28T06:08:01Z",
      "updated_at": "2017-08-28T06:08:02Z",
      "browser_download_url": "https://github.com/HebaruSan/Astrogator/releases/download/v0.7.8/Astrogator.zip"
    }
  ],
  "tarball_url": "https://api.github.com/repos/HebaruSan/Astrogator/tarball/v0.7.8",
  "zipball_url": "https://api.github.com/repos/HebaruSan/Astrogator/zipball/v0.7.8",
  "body": "Fix glitches when the settings file is invalid"
}

The zip file that we want to download is associated with assets[0], and there are two fields for it, url and browser_download_url. This becomes important later.

Investigation summary

I used the "Contact GitHub" link to reach out to GitHub about how their download throttling works. Surprisingly, the person who replied understood exactly what I was talking about and how to fix it 👍. It turns out that these problems happen because CKAN is not using GitHub as intended. From my conversation with the very helpful support person:

If I understood your message correctly, it seems like you're programmatically downloading resources from github.com, is that right? If that's so, then you shouldn't be doing that. GitHub.com wasn't built for programmatic use like that, it was built for humans. For programmatic use, you should be using the API. The API has well defined rate limits and caching behavior you can rely on, while GitHub.com doesn't. That doesn't mean that github.com doesn't have any rate limits, it only means that you can be rate limited at any time and without warning.

So, we'd like to ask you to switch and use the API for downloading the data you need, and respect the defined rate limits (that's what a good citizen app should be doing, instead of hitting github.com):

("good citizen" was my phrasing in my original message, so don't take that as an unprovoked criticism of our civic virtues.)

If I'm interpreting that code snippet correctly, you're using the browser_download_url link, which, as the name suggests, is intended to be used by human users via a browser.

For downloading release assets via the API, you should be using this endpoint:

https://developer.github.com/v3/repos/releases/#get-a-single-release-asset

Notice this note: "To download the asset's binary content, set the Accept header of the request to application/octet-stream. The API will either redirect the client to the location, or stream it directly if possible. API clients should handle both a 200 or 302 response."

That would be the "url" field of a particular asset (which are listed when you fetch a release e.g. via https://developer.github.com/v3/repos/releases/#get-a-single-release), but with the addition of the special Accept header.

Key points:

  • The URL from the field we're using currently (browser_download_url) is for users and browsers only, not applications. It can be throttled, but there is no explicit policy or workaround.

  • We should be using the GitHub API for downloads. Currently we use it in the Netkan code that finds new releases, but for downloads we effectively impersonate a browser.

  • This can be done by requesting the url field instead of browser_download_url and setting a custom HTTP header:

    Accept: application/octet-stream

    I tested this with wget, and setting the Accept header did indeed give me the download. Without this header, it returns a JSON object describing the asset.
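
In client code, the equivalent of that wget test might look like this (a sketch, not CKAN's actual downloader; it assumes HttpClient, which follows the 302 redirect to the file host automatically, so both response cases from the note above are covered):

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch: download an asset's binary content through the API by
// requesting its "url" field with Accept: application/octet-stream.
static async Task DownloadAsset(string assetApiUrl, string savePath)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.UserAgent.ParseAdd("CKAN");
        client.DefaultRequestHeaders.Accept.ParseAdd("application/octet-stream");
        File.WriteAllBytes(savePath, await client.GetByteArrayAsync(assetApiUrl));
    }
}

With the sample data above, DownloadAsset("https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631", "Astrogator.zip") retrieves the same file as the browser URL.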

Changes needed to stop abusing browser_download_url

GitHub-specific downloading metadata & logic

When downloading from GitHub, we need to send the custom HTTP header. This cannot be accomplished simply by swapping out the bad URL for the good URL in the download metadata field.

Proposed new metadata field:

  • github_download - The assets[0].url value from the API

Specific changes:

  • The spec/schema would need to be updated to allow this field.
  • Netkan would need to be updated to generate this field.
  • CKAN would need to be updated to check for the presence of this field, which would then trigger an alternate download method that sets the custom header.
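
To illustrate (hypothetical metadata; github_download is the proposed field, not part of the current spec), a module generated under this scheme might contain:

{
  "identifier": "Astrogator",
  "download": "https://github.com/HebaruSan/Astrogator/releases/download/v0.7.8/Astrogator.zip",
  "github_download": "https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631"
}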

UI to handle 403 statuses

If a GitHub download returns a 403 status, we should handle the exception and notify the user that their downloads are being throttled. We could direct them to the setting (see below) and web page dealing with GitHub auth tokens, and/or advise them to wait 60 minutes for their limit to reset. https://api.github.com/rate_limit can be used to get the exact limit and timing numbers.
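
A sketch of such a check (not CKAN's code; response parsing omitted):

using System.Net.Http;
using System.Threading.Tasks;

// Sketch: fetch the remaining quota and reset time so the 403 message can
// show real numbers. The response is JSON; its "rate" object contains
// "limit", "remaining", and "reset" (a Unix timestamp).
static async Task<string> GetRateLimitJson()
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.UserAgent.ParseAdd("CKAN");
        return await client.GetStringAsync("https://api.github.com/rate_limit");
    }
}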

GitHub token handling

Users will be limited to 60 GitHub downloads per hour, because that is the GitHub API's unauthenticated limit. 140+ mod installs are pretty commonly mentioned on the forums, and reinstalling everything from scratch is a common method for dealing with compatible upgrades, so some users would probably encounter this limit and not appreciate the 60-minute wait to be able to download more. The only way around this is to use a GitHub auth token, which boosts the limit to 5000/hour per token.

It would be nice to ship a single internal auth token for all of CKAN, since then users would have the 5000/hour limit by default without having to worry about any of the details. More responses from the GitHub contact person:

Including a single pre-defined token with the app so that this token is used by all users of the app is possible. You could create a scopeless token here https://github.com/settings/tokens/new and include that. A scopeless token doesn't have any special permissions -- it can be used for read-only access of public data. So, it would be safe in that way. However, someone could easily take your token from the app, and then drain the API quota for the user who owns the token by making lots of unnecessary API requests. At that point, the app would stop working for everyone who uses the app.

Deliberate abuse like that is unlikely, but assuming 200 downloads per active user per hour, a 5000/hour limit across all CKAN users would support 25 active users in a given hour. The number of active users at a given time isn't known, but the latest CKAN release has over 60000 downloads, so it's probably more than 25. If we were able to determine the limit we needed per hour, we could divide it by 5000 and then generate that many tokens and pick one randomly per request, but that might not be in the spirit of the API's rules.

A setting

We could create a new settings field called GitHub Auth Token, where the user could fill in their own tokens to allow more downloads. This could be instead of or in addition to any built-in tokens we may or may not use, and it should support all the UIs.

Multi-pass approach

  1. Try with no authentication at all. This would succeed for the first 60 requests per user per hour, probably the majority but not all.
  2. For the remaining requests that fail, retry with a single hard-coded auth token (a sketch follows this list). As long as we only use this as a fallback, the 5000/hour limit would only apply to downloads in excess of the 60/hour.
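
A sketch of that fallback logic (hypothetical names, not CKAN's code):

using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Try anonymously first; only spend the shared token's quota when GitHub
// throttles us with a 403.
static async Task<HttpResponseMessage> GetWithFallback(
    HttpClient client, string url, string fallbackToken)
{
    HttpResponseMessage response = await client.GetAsync(url);
    if (response.StatusCode == HttpStatusCode.Forbidden)
    {
        var retry = new HttpRequestMessage(HttpMethod.Get, url);
        retry.Headers.Authorization =
            new AuthenticationHeaderValue("token", fallbackToken);
        response = await client.SendAsync(retry);
    }
    return response;
}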

Migration concerns

If Netkan was updated to use this new scheme tomorrow, current CKAN clients would break unless the old download field was still populated. So we should not remove support for the old metadata immediately; GitHub downloads should use both download and github_download until all clients are updated.

Or just download serially

The API docs say:

Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.

So even with a token, CKAN's parallel download method would still be in violation of the letter of the law.

As a halfway measure, we could try scaling back the parallelization of downloads.

  1. Check whether a download URL contains "github.com"
  2. If so, add it to a pool of downloads to be handled serially
  3. Handle all other downloads normally
  4. When a download finishes, if its URL contains "github.com", then start a new download from the pool

This might solve the problem without messing with all the API/token stuff. We would still technically be misusing GitHub, but users should no longer encounter failed downloads as frequently.
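
A sketch of how that pooling could look, inside some downloader class (hypothetical names; CKAN's real downloader is structured differently):

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// GitHub URLs share a one-slot semaphore so they go one at a time, while
// all other downloads still run fully in parallel.
static readonly SemaphoreSlim gitHubSlot = new SemaphoreSlim(1, 1);

static async Task<byte[]> Download(HttpClient client, Uri url)
{
    bool isGitHub = url.Host.EndsWith("github.com", StringComparison.OrdinalIgnoreCase);
    if (isGitHub)
        await gitHubSlot.WaitAsync();
    try
    {
        return await client.GetByteArrayAsync(url);
    }
    finally
    {
        if (isGitHub)
            gitHubSlot.Release();
    }
}

Kicking everything off with Task.WhenAll would then preserve parallelism for non-GitHub hosts while GitHub sees at most one connection at a time.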

@HebaruSan HebaruSan added Core (ckan.dll) Issues affecting the core part of CKAN Cmdline Issues affecting the command line GUI Issues affecting the interactive GUI Infrastructure Issues affecting everything around CKAN (the GitHub repos, build process, CI, ...) Metadata Netkan Issues affecting the netkan data Schema Issues affecting the schema Spec Issues affecting the spec labels Dec 8, 2017
@politas (Member) commented Dec 8, 2017

Well, we're already on track for a point release with a spec update, so it's a good time to do this. We've had projects before to go through old metadata and update it, so adding GitHub download info isn't impossible (though it may be impracticable). Even if we just force the current releases to be repopulated, and all new ones going forward, we'll be ahead of the game; and if our trigger for the different download behaviour is the existence of the github_download URL, the worst that will happen is that things keep working as they do now.

I like the multi-pass approach of starting with no token and failing over to a common token. The number of users that would still have issues is probably pretty small (there aren't that many that have issues currently), and adding the ability for users to add their own API token would probably work for that minority of users wanting three-figure mod downloads.

Could you post something on the CKAN thread explaining what you've found and what users can do to avoid the download throttling in the meantime? (i.e., download in blocks of 60 mods and wait in between)

@HebaruSan (Member Author)

I would love to give users more info; however, right now there is no useful info to give.

The 60/hour limit assumes use of the API, which is not how the currently released CKAN client works. It instead uses the browser download URL, which has secret and unknown throttling characteristics. This was what I was hoping to learn from the GitHub contact person in the first place, and he only confirmed that they're not going to explain it publicly.

There's also the bit at the end about parallel downloads; it's quite possible that the secret throttling rules would trigger for (far) fewer than 60 simultaneous downloads, and in fact I think I've seen it do that. So unfortunately I'm not able to outline an effective workaround for end users, other than downloading mods one at a time.

@techman83 (Member)

We'd want to be exceptionally careful about baking tokens into the client: firstly because they could be taken and abused by a third party, and secondly because rotating tokens to work around the limits may be against the terms of service.

In the interim, we could probably start generating URLs that point to the API download and add some logic to prefer them, but I'm guessing the unauthenticated limit is quite low?

We also have the client implementation part of #1682, which would allow FOSS mods to fall back to the Internet Archive for the download; that's one of the driving factors in building that system.

@politas (Member) commented Dec 14, 2017

Ok, if I'm synthesizing the data above correctly, we should be doing this:

  1. Split GitHub downloads into a separate list from other downloads.
  2. Attempt to download all non-GitHub mods in parallel.
  3. Use the GitHub API to download all GitHub files serially.
  4. For any failed download (GitHub or not), try to download from the Internet Archive.
  5. If any Internet Archive downloads fail, alert the user, asking whether they wish to retry the download attempts, cancel while maintaining the selections list, or cancel and clear the selections list.

And as a separate exercise, implement a way for users to enter their own GitHub token for GitHub downloads, and show a message advising of the option when we get a GitHub 403 error.

@HebaruSan HebaruSan added Network Issues affecting internet connections of CKAN and removed Core (ckan.dll) Issues affecting the core part of CKAN Infrastructure Issues affecting everything around CKAN (the GitHub repos, build process, CI, ...) labels Dec 16, 2017
@dbent (Member) commented Jan 22, 2018

Is the only reason to have a github_download field so that the HTTP client knows to set the Accept header? In that case I'd rather just bump the spec version, dump that information in the download field, and just always set the Accept header to application/zip, application/octet-stream.

@dbent (Member) commented Jan 22, 2018

Actually, since we store the content type in the metadata as download_content_type, the Accept header should default to: ${download_content_type}, application/octet-stream.

@dbent (Member) commented Jan 22, 2018

Actually, throw in some quality values to assert that we would prefer the specified content type over some arbitrary binary stream and we should have something that works for anything: ${download_content_type};q=1.0,application/octet-stream;q=0.9.
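
For illustration, a hypothetical helper that builds that header from the metadata:

// Prefer the module's recorded content type; fall back to a generic
// binary stream.
static string BuildAcceptHeader(string downloadContentType) =>
    $"{downloadContentType};q=1.0,application/octet-stream;q=0.9";

// BuildAcceptHeader("application/zip")
//   => "application/zip;q=1.0,application/octet-stream;q=0.9"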

@HebaruSan (Member Author)

Essentially it's because older clients will fail to download the API URLs, and the metadata is shared between old and new clients. If we switch all the URLs over in CKAN-meta, then everyone who hasn't upgraded to the newest CKAN will get download errors for all GitHub URLs, every time.

@dbent (Member) commented Jan 22, 2018

The way that I would roll it out would be as follows:

  1. Publish new CKAN and NetKAN clients that always send an appropriate Accept header for all downloads. Let's call this version 1.xx. Update the spec to say that clients that abide by spec > 1.xx must send appropriate Accept headers. Update documentation to state that all GitHub downloads should use a spec version of at least 1.xx.
  2. Wait a reasonable amount of time for people to upgrade.
  3. Do a mass edit on all .netkan files with github kref and set their spec to 1.xx. This will trigger a mass commit to the CKAN-meta repo for all the most recent versions.
  4. Do a mass edit on all remaining .ckan files with a download url that points to github.com to change their spec to 1.xx.
  5. Wait a reasonable amount of time for people who did not upgrade in step 2 to stop complaining about being unable to download anything from github.
  6. Upgrade NetKAN to generate API download URLs.
  7. Programmatically alter old .ckan files to use API download URLs.
  8. ???
  9. Profit!

One thing I've been meaning to do is add a SpecVersionTransformer to NetKAN that will automatically spit out the oldest possible spec version that supports the features used in the .netkan file. Shouldn't be that difficult. Standard practice for .netkan files would then be to not specify a spec_version and have it automatically be populated.
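
A rough sketch of the idea (all names hypothetical; this is not NetKAN's actual transformer API):

using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

// Map each spec feature (here, keyed by top-level field name) to the spec
// version that introduced it, then emit the highest version among the
// features a module actually uses.
static Version MinimumSpecVersion(
    JObject module,
    IDictionary<string, Version> featureVersions,
    Version oldest)
{
    return featureVersions
        .Where(kv => module[kv.Key] != null)
        .Select(kv => kv.Value)
        .Concat(new[] { oldest })
        .Max();
}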

@HebaruSan (Member Author)

Sounds like a good plan to me. We can probably add the Accept header in 1.24...

@HebaruSan (Member Author)

Wait, if the spec version requirement prevents them from downloading, then what do steps 3-5 buy us as compared to just switching the URLs immediately? GitHub downloads break either way, but changing the URLs would actually fix something as well.

@dbent (Member) commented Jan 22, 2018

I suppose we don't need step 5 and can do 3-7 pretty much simultaneously.

@HebaruSan (Member Author)

OK, sorry, I forgot some pieces of the puzzle since writing this up.

The github_download property isn't needed (only) for old clients. It's because of the OAuth token stuff to make API throttling less severe. If we want users to be able to download more than 60 API URLs from GitHub per hour, then we need some way of sending an auth token that is either hard coded, configured, or whatever. This would require special handling just for GitHub downloads, which would be tied to the github_download property.

@dbent (Member) commented Jan 22, 2018

Couldn't that be specially handled in the client by checking for the presence of an api.github.com host in the download URL, rather than having extra metadata?

@HebaruSan (Member Author)

Yeah, that makes sense. In fact, we can have a Dictionary<string, string> to store a mapping from any download host to an auth token, in case other servers add this in the future.
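
Something like this, as a sketch inside some downloader class (hypothetical names, not real CKAN code):

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;

// A host-to-token map consulted per request, so other hosts can get
// tokens later without GitHub-specific code paths.
static readonly Dictionary<string, string> authTokens =
    new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    {
        { "api.github.com", "<user's OAuth token>" },
    };

static HttpRequestMessage MakeRequest(Uri url)
{
    var request = new HttpRequestMessage(HttpMethod.Get, url);
    if (authTokens.TryGetValue(url.Host, out string token))
    {
        request.Headers.Authorization = new AuthenticationHeaderValue("token", token);
    }
    return request;
}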

Where would we store the tokens? registry.json or GUIConfig.xml would be instance-specific, which wouldn't be great.

@dbent (Member) commented Jan 22, 2018

Where would we store the tokens?

Right now it would be the registry, which is where user configuration data is stored. I'm not really a fan of that, however, and have been meaning to replace it with a file-based solution using JSON files.

@dbent (Member) commented Jan 22, 2018

Accept can take multiple comma-delimited values. Use application/zip;q=1.0,application/octet-stream;q=0.9, which tells the server "I prefer application/zip, but will also accept application/octet-stream if you can provide it".

 curl -L -H "Accept: application/zip;q=1.0,application/octet-stream;q=0.9" https://api.github.com/repos/HebaruSan/Astrogator/releases/assets/4682631 -o Astrogator.zip

Works as expected.

@HebaruSan (Member Author)

Yeah, I realized that after I posted and deleted that comment. It does indeed work even in a patched CKAN.

@judemille

Not sure how feasible this is, but maybe offer the option for users to use their own API keys?

@dbent (Member) commented Jan 27, 2018

Not sure how feasible this is, but maybe offer the option for users to use their own API keys?

#2263 will allow that.

@HebaruSan (Member Author)

As of today, most CKAN users will be on 1.24, which supports API URLs. We can consider migrating the metadata as we please now.

Asterisk: Mono users might still be able to run 1.22.6 and earlier, since it worked for me on Ubuntu even after #2293's GitHub Apocalypse. Maybe allow a few weeks for this group to upgrade.

@HebaruSan (Member Author)

Summary of current status:

Best of all, everyone now has these improvements, because #2293 broke all older clients.

Remaining to be done:

  • De-parallelize GitHub downloads (I have a prototype of this, but it's kind of a mess)
  • Migrate metadata to API URLs

However, complaints about the throttling have largely gone quiet lately. It may be wise to leave this as-is for a while lest we spoil a calm situation.

@HebaruSan HebaruSan removed Schema Issues affecting the schema Spec Issues affecting the spec labels Oct 2, 2018