
octohat as a service #34

Closed · glasnt opened this issue Sep 24, 2015 · 11 comments

glasnt commented Sep 24, 2015

*dun dun dun*

glasnt commented Sep 24, 2015

Some initial notes from brainstorms with @freakboy3742:

Octohat as a service will be interesting, because of the rate-limiting issue.

Per GitHub's rate limiting, you get 60 requests/hour unauthenticated, or 5,000 requests/hour authenticated. Authentication requires an unscoped personal access token. Trying to circumvent the rate limiting is bad.

I know I'm probably being slightly wasteful with requests at the moment and could cut back on duplicate/unneeded calls, but it would be interesting to see how many requests are required to get the full scope of a project. It'd be something like count(users) + count(issues) + overhead, where overhead is fixed and the number of users depends on who's found in the issues.
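
A back-of-envelope sketch of that estimate (Python, since that's what octohatrack is written in; the issue/user counts and the fixed overhead below are made-up placeholders, and the one-request-per-item assumption is only illustrative):

```python
# Rough request budget for one full pass over a repo, following the
# count(users) + count(issues) + overhead estimate above.
RATE_LIMIT_AUTH = 5000   # requests/hour with a personal access token
RATE_LIMIT_ANON = 60     # requests/hour unauthenticated

def estimated_requests(num_issues, num_users, overhead=10):
    """Assume one request per issue (for its comments), one per user, plus fixed overhead."""
    return num_issues + num_users + overhead

reqs = estimated_requests(num_issues=1200, num_users=300)
print(f"~{reqs} requests: "
      f"{reqs / RATE_LIMIT_AUTH:.1f}h authenticated, "
      f"{reqs / RATE_LIMIT_ANON:.1f}h unauthenticated")
```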

Also, for sufficiently large repos, you can't get all the information in one pass. I've already implemented --limit to check only the last x issues, but ideally we'd want a full check of a project, then update it at some regular interval.

I'd also like nice things like octohat.com/user/repo endpoints, where you could trigger a build and then come back later to see the data and when it was last updated.
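
A very rough sketch of what those endpoints could look like (Flask is an arbitrary choice, octohat.com is still hypothetical, and the contributor walk and storage are stubbed out with an in-memory dict):

```python
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)
RESULTS = {}  # in-memory stand-in for a real datastore


@app.route("/<user>/<repo>", methods=["GET"])
def show(user, repo):
    """Return cached results and when they were last updated, if a build has run."""
    key = f"{user}/{repo}"
    if key not in RESULTS:
        return jsonify({"status": "not built yet, POST to trigger"}), 404
    return jsonify(RESULTS[key])


@app.route("/<user>/<repo>", methods=["POST"])
def trigger(user, repo):
    """Kick off a build (inline here, for the sketch) and record the timestamp."""
    key = f"{user}/{repo}"
    RESULTS[key] = {
        "contributors": [],  # would come from the existing octohatrack API walk
        "last_updated": datetime.now(timezone.utc).isoformat(),
    }
    return jsonify({"status": "queued", "check_back": f"/{key}"}), 202
```

In practice the POST handler would hand the slow walk off to a background worker rather than doing it inline.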

Now, this is assuming I stick with the API-driven version.

What I could do is use the GitHub Archive. There are a few problems with this approach:

  • the archive is good for getting changes in a day, but historical parsing will be harder. If I want to use just the GitHub Archive, I'd have to parse all data ever collected. Plus, the data formats change over time; see the mothballing of the Open Source Report Card project.
  • the data is event-driven, so ensuring that all the event types (CommitComment, IssueComment, etc.) match up to what's expected will be difficult.

I could take a hybrid approach: collect all the information once from the API, then update at x frequency from the archives.
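
A sketch of the archive half of that hybrid (the hourly-dump URL format is per gharchive.org, the repo name is just an example, and mapping event types back to octohatrack's idea of a contribution is exactly the open problem from the second bullet above):

```python
import gzip
import io
import json

import requests

ARCHIVE_URL = "https://data.gharchive.org/{date}-{hour}.json.gz"


def events_for_repo(date, hour, full_repo_name):
    """Yield (event type, actor login) pairs for one repo from one hourly dump."""
    resp = requests.get(ARCHIVE_URL.format(date=date, hour=hour), timeout=60)
    resp.raise_for_status()
    with gzip.open(io.BytesIO(resp.content), "rt", encoding="utf-8") as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("repo", {}).get("name") == full_repo_name:
                yield event["type"], event["actor"]["login"]


for etype, login in events_for_repo("2016-02-01", 12, "glasnt/octohatrack"):
    print(etype, login)
```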

glasnt commented Oct 15, 2015

The hacktoberfest verify checker from @erikaheidi appears to do just what I want - GitHub grokking as a service \o/

https://github.com/erikaheidi/hacktoberfest-verify

Gotta think about capping things for the web model, for usability and not-exploding-the-server-ness

edunham commented Feb 3, 2016

To deal with rate limiting, could you just make the service's users plug in their own API tokens? This could be done pretty transparently by having them OAuth-login to GitHub when they start using the service, a la https://nightli.es/
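
Something like this minimal sketch, where every API call is made with the end user's own token so the 5,000 requests/hour budget is per user rather than shared (the token value is obviously a placeholder; /rate_limit is GitHub's own endpoint for checking what's left):

```python
import requests

GITHUB_API = "https://api.github.com"


def github_session(user_token):
    """Build a requests session that authenticates as the end user."""
    session = requests.Session()
    session.headers["Authorization"] = f"token {user_token}"
    session.headers["Accept"] = "application/vnd.github.v3+json"
    return session


def remaining_quota(session):
    """Ask GitHub how much of this user's rate limit is left."""
    resp = session.get(f"{GITHUB_API}/rate_limit")
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["limit"]


s = github_session("users-oauth-token-goes-here")
print(remaining_quota(s))
```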

glasnt commented Feb 7, 2016

Oh yes, of course! Setting up octohatrack as an application would solve the authentication and rate-limiting concerns from a centralisation standpoint!

glasnt commented Feb 7, 2016

Now the question is the time range of results.

I'm tempted to say something like "Only look at the last month, or 20 issues", then detail the exact CLI commands to run to get the full version. I'm very worried about the bounce rate when it takes minutes to go through a big repo in its entirety, not to mention the larger-scale caching issue.

software-opal commented

Could adding some degree of asynchrony (say through asyncio) improve the speed, possibly with the option of providing partial results as they become available? The major speed problem is loading the issues and all the comments associated with them.

We could also implement various caching policies to speed up re-requesting a repository:

  • Cache based on ETag or Last-Modified headers, which would prevent re-downloading chunks of data and save rate-limited requests (see the sketch after this list).
  • By checking the 'updated_at' key (on an issue), we can further reduce the number of requests needed. If the updated_at is the same as our cached one, then we know that no new comments have been made, so we don't even need to check them.
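
A sketch of the conditional-request idea from the first bullet (the dict-based cache is a stand-in for whatever the service would actually persist); a 304 reply means nothing changed, and GitHub doesn't count 304s against the rate limit:

```python
ETAG_CACHE = {}   # url -> (etag, cached_json)


def cached_get(session, url):
    """Fetch url, reusing the cached body when GitHub answers 304 Not Modified."""
    headers = {}
    if url in ETAG_CACHE:
        headers["If-None-Match"] = ETAG_CACHE[url][0]
    resp = session.get(url, headers=headers)
    if resp.status_code == 304:
        return ETAG_CACHE[url][1]          # unchanged, reuse cached data
    resp.raise_for_status()
    ETAG_CACHE[url] = (resp.headers.get("ETag"), resp.json())
    return ETAG_CACHE[url][1]
```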

To speed up providing information about recent contributors, we could sort the issues by 'updated' so we get the most recently updated ones first, prioritising loading those before the others.
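
The issues list endpoint already supports this ordering via sort=updated, so a refresh could stop paging as soon as it reaches an issue it has already cached. A minimal sketch (first page only, reusing the authenticated session from the earlier sketches):

```python
def recently_updated_issues(session, owner, repo, per_page=100):
    """Fetch the first page of issues for owner/repo, most recently updated first."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    params = {"state": "all", "sort": "updated", "direction": "desc",
              "per_page": per_page}
    resp = session.get(url, params=params)
    resp.raise_for_status()
    return resp.json()
```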

We could also store information about each contributor's last contribution to provide a quick mechanism to only display recent contributors.

Initial requests for data (where we have nothing cached) could be handed off to a task runner like Celery, with the response indicating that the data is being fetched and the client polling back later (or we could do some websocket magic, which would be cool and interesting but also hard).
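
A sketch of that hand-off, assuming Celery with a Redis broker (the broker URL and the body of the task are placeholders; the real task would call the existing octohatrack walk and write results to the datastore):

```python
from celery import Celery

celery_app = Celery("octohat", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")


@celery_app.task
def build_repo_report(owner, repo):
    """Do the full (slow) contributor walk for owner/repo and store the result."""
    # ... call the existing octohatrack logic here and persist to the datastore ...
    return {"repo": f"{owner}/{repo}", "status": "done"}
```

The web view would then just call build_repo_report.delay(owner, repo) and immediately return a "being fetched, poll back later" response.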

If we wanted to get really efficient we could simply request the first page of the /contributors and /issues APIs (a sketch follows this list):

  • If there are any changes in /issues, start a task running to update.
  • If there are any changes in /contributors, assess how many:
    • more than a page: pass off a task
    • less than a page: just update the DB and respond.
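
Roughly, that decision could look like this, reusing cached_get and build_repo_report from the sketches above (the last_seen structure and the page-size threshold are hypothetical):

```python
def maybe_enqueue_update(session, owner, repo, last_seen):
    """Compare the first page of /issues and /contributors with what we last saw."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    issues = cached_get(session, f"{base}/issues?state=all&sort=updated&direction=desc")
    contributors = cached_get(session, f"{base}/contributors")

    if issues != last_seen.get("issues"):
        build_repo_report.delay(owner, repo)            # something changed: full update
    elif contributors != last_seen.get("contributors"):
        if len(contributors) >= 30:                     # a full default page: hand off
            build_repo_report.delay(owner, repo)
        else:
            last_seen["contributors"] = contributors    # small change: update in place
    return last_seen
```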

glasnt commented Feb 10, 2016

Added context here (because I didn't update this issue fast enough):

Lee has been awesome and made a proof of concept at https://github.com/leesdolphin/js-hatrack
All I did was fork this repo and create a gh-pages branch for GitHub hosting, so it's completely usable here: glasnt.github.io/js-hatrack

glasnt commented Feb 10, 2016

@leesdolphin would you like me to make you an organisation contributor?

With that level of access, you can move your proof of concept into the labhr organisation, and we can consolidate our efforts on the 'as a service' model.

I want to mock up some UI that I've been mulling over. Also, I'd like to figure out how to make the OAuth things mentioned by @edunham work, because at the moment throwing raw API keys around is not best practice.

software-opal commented

Yeah, I'd love to move js-labhr under LABHR. And I'd love to help with the aaS model too.

glasnt commented Feb 12, 2016

You should now have enough rights to enact a transfer of the repo into the organisation. Let me know if you need a hand :)

glasnt commented Mar 4, 2016

glasnt closed this as completed Mar 4, 2016