Save commit data #5
Can you explain how they should be optimized?
So.. right now, for issues, milestones and pull requests we query GitHub for "all the entries", since all of these can change state (they can get closed, reopened, updated, etc.). What I suggest is that on every update we ask GitHub for the next 5000 commits from where we left off at the last update (the API endpoint supports pagination, but it will only give you at most 100 commits per page, so that means 50 requests). It seems that the API gives you commits in descending chronological order, but we want to start with the oldest first (so we can have all of them, and further updates only have to add the newest commits).
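The paged fetch described above can be sketched as follows. This is an illustrative outline, not the river's actual Java code; `fetch_page` is a hypothetical stand-in for the real HTTP call to the GitHub commits endpoint, which returns at most 100 commits per page, newest first.

```python
def fetch_recent_commits(fetch_page, limit=5000, per_page=100):
    """Collect up to `limit` commits, then reverse so the oldest come first."""
    commits = []
    page = 1
    while len(commits) < limit:
        batch = fetch_page(page=page, per_page=per_page)
        if not batch:
            break  # no more commits available
        commits.extend(batch)
        page += 1
    # GitHub returns newest-first; reverse so updates only append newer commits
    return list(reversed(commits[:limit]))
```

With `limit=5000` and `per_page=100` this is the 50-request budget mentioned above; the reversal gives the oldest-first processing order the comment asks for.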
Also, thinking more about this, we could just get the most recent 5000 commits every time we do an update and drop the old ones. I'm not sure really old commits provide us any value. What do you think?
👍 on newer commits
OK, then let's do that. :) Do you want to implement this?
Sure, I'll have a hack at it.
Great! Let me know if I can help.
I'm not sure that the regex for the header processing holds for the …
The commits endpoint returns the results based on sha order and branch.
The pagination aspect can be checked :). We can see how many we get and decide on a limit for that. Why do you want to join the events and the commits?
They look somewhat similar; aside from the URL, the steps are basically the same.
Wait a sec - are we talking about the same thing? getData has a type parameter already and it's used for all the endpoints:
Ok.
Well.. this is an issue in this repo. The github file you are referring to is there just with helpers for getting some data for prototype testing. I'll remove it since it's not used anymore.
Ok. Seems I was looking in the totally wrong place.
So do you want to get your hands dirty with some Java code? :D
Well, I could do it if there is nobody else up for it, but I don't know if it …
Ok, that's fine. Maybe something in the elasticboard repo catches your eye.
Any reason this only goes for commits? While other items can get updated, using the "since" parameter will both get new and updated documents, at least for issues.
For issues, we get all of them all the time and we just replace old instances. It could be optimized to save some API calls, of course.
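The `since` optimization being discussed amounts to passing an ISO 8601 timestamp so GitHub only returns issues created or updated after the last sync. A minimal sketch of building such a request URL (function and parameter names here are illustrative, not taken from the river's code):

```python
from urllib.parse import urlencode

def issues_url(owner, repo, last_sync_iso=None, state="all"):
    """Build an issues-list URL; pass `since` to fetch only new/updated issues."""
    base = f"https://api.github.com/repos/{owner}/{repo}/issues"
    params = {"state": state, "per_page": 100}
    if last_sync_iso:
        params["since"] = last_sync_iso  # only issues updated at or after this time
    return base + "?" + urlencode(params)
```

Without `since`, every call pulls the full issue list and re-replaces old instances, which is the API-call and index-space cost raised in the next comment.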
As well as some Elasticsearch space. My index seems to grow (slowly, though, even though I update every 10 minutes), probably because of ES' versioning. I guess this might also open up for quicker updates (e.g. every minute), as calls would be much smaller. This is probably not important to everyone, but e.g. seeing almost instant improvement of stats on a QA dashboard when fixing bugs is a good motivator. I might fork to play around and see if I can't contribute :)
That sounds great, please do! :)
@LosD please let me know if you need any help with the code or whatnot.
Not really, I'm more or less done for issues and events, I've just been out travelling for a while. I've settled for using the last seen event as the time to check updates from, mostly because all issue changes will fire an event, and because the Last-Modified header in the reply is not ISO 8601. Oh, there may be one thing: I'm considering a unit test for it; do you know of any pitfalls when unit testing from within river code? I guess not, since we just get passed a client.
Sounds good, as long as we don't miss events (and we shouldn't miss any, I think). I don't know about unit testing river code. This is my first ES plugin and my first Java code in a long time, so your knowledge is most probably above mine on this one. :)
The missing-stuff part is exactly why I'd like to unit test it. But it … Maybe the best way to test it properly would be to just start both … I'll figure something out :)
Here's an idea:
So, dropping the full issues/commits/etc. fetch and replacing it with very frequent polling should lead to getting all events. And it is fine, because we don't exhaust the call limit, and we do it fast enough to not miss anything. The Events endpoint gives the last 300 events. I think it's safe to assume no repo has more than 300 events per.. 5 minutes. Or 1 minute, something like that. :) What do you think?
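The polling idea above can be sketched as: fetch the (up to 300) most recent events every few minutes and keep only the ones not indexed yet. This is a hypothetical outline, not the plugin's code; `fetch_events` stands in for the real API call, and event ids are treated as opaque keys.

```python
def poll_new_events(fetch_events, seen_ids):
    """Return events we have not indexed yet, oldest first."""
    events = fetch_events()          # newest first, at most 300 from GitHub
    fresh = [e for e in events if e["id"] not in seen_ids]
    seen_ids.update(e["id"] for e in fresh)
    return list(reversed(fresh))    # index oldest first
```

As long as fewer than 300 events occur between two polls, the overlap between consecutive fetches guarantees nothing is missed, which is exactly the safety assumption stated above.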
I must admit that I completely missed that the Events endpoint works differently than the Issues endpoint. I had somehow convinced myself that the since parameter also worked for events. Anyway, sounds good!

And I guess there is no reason to do the Issue part much differently than I had considered: before running, check the time of the latest Event (or Issue, it doesn't really matter; the important part is that we get anything that has happened since we last checked) fetched, and set the since parameter to that. Oh, and skip entirely if there are no new events, as that would also mean no new or modified issues. In time, the events could be parsed so the rest of the data could be fetched intelligently, but I guess the amount of open stuff is usually so low that it wouldn't matter, at least not unless someone gets interested enough in closed stuff to do something about it.

Which reminds me: a Pull Request is also an Issue, so I guess it should be possible to find out if PRs have been closed by looking at the corresponding Issue? I'll take a look sometime tomorrow (it's 10 in the evening here).

By the way, this has gone pretty far away from the original scope of this issue (adding commits, which I'm not even looking at :)). Maybe we should move the discussion to a new issue?
Yep, the events endpoint is pretty smart. But we need to query it using the …

Sounds good, indeed. There would be one more thing that would be cool to have: supporting some amount of downtime (as much as possible). Basically, this means saving the last checked timestamp in the status entry of the river (which is in ES), or somewhere like that, such that, if the machine reboots for example, the river will carry on from where it stopped (within the limit of the last 300 events, of course).

Yes, a PR is also an issue and it has the same fields and actions (closed/opened).

No rush, it's even later here :). Sure, feel free to open a new issue! Might be worth your while to do a minor sketch/spec of what you plan to do. Thanks!
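The downtime-resilience idea amounts to a tiny checkpoint: persist the last successful check and read it back on startup. A minimal sketch, assuming any dict-like persistent store plays the role of the river's status entry in ES (all names here are illustrative):

```python
def save_checkpoint(store, timestamp_iso):
    """Record the last successfully processed point in time."""
    store["river_status"] = {"last_checked": timestamp_iso}

def load_checkpoint(store):
    """Return the saved timestamp, or None on a fresh start."""
    status = store.get("river_status")
    return status["last_checked"] if status else None
```

On reboot, a `None` checkpoint means "fetch everything"; otherwise the river resumes from the saved timestamp, within the 300-event window of the Events endpoint.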
Actually, we already have something close to that: The newest event stored. It's just a simple max aggregate on the event created_at. |
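The "simple max aggregate" mentioned here would look roughly like the following Elasticsearch query body, assuming events are indexed with a `created_at` date field (the aggregation name is illustrative):

```python
# Query body for "what is the newest event we have stored?"
latest_event_query = {
    "size": 0,  # no document hits needed, only the aggregation result
    "aggs": {
        "latest_event": {
            "max": {"field": "created_at"}
        }
    }
}
```

The response's `aggregations.latest_event.value` would then seed the `since` parameter for the next poll.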
Indeed, but not sure how long that ends up taking when you have many …
Hmmmm, that may be true. The ID is a simple integer, and as far as I know it is sequential. It should be very fast, so I guess it would be better to use it for the lookup. |
I actually got something working! Now I can load as often as GitHub permits, and new issues are loaded incrementally. Still have some issues:
I'll make a PR as soon as 1 and 3 have been taken care of. If you are curious, you can take a look in https://github.com/LosD/elasticsearch-river-github/tree/feature/incremental-update :) By the way, the ETag feature works for all resources, which would make it feasible to check all data for changes before reloading it, but I think it would require a bit of refactoring first. I already feel like I'm stretching the original design a bit too much with my changes.
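The ETag feature works by echoing the ETag from the previous response back in an `If-None-Match` header; GitHub answers `304 Not Modified` when nothing changed, and such responses do not count against the rate limit. A sketch under those assumptions (`http_get` is a hypothetical stand-in for the real HTTP client):

```python
def fetch_if_changed(http_get, url, etag_cache):
    """Return the response body if the resource changed, else None on a 304."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]  # conditional request
    status, resp_headers, body = http_get(url, headers)
    if status == 304:
        return None  # unchanged since last fetch; no quota spent
    if "ETag" in resp_headers:
        etag_cache[url] = resp_headers["ETag"]  # remember for next time
    return body
```

Keeping `etag_cache` per-URL is what would let this cover all resources, as suggested above, once the surrounding code is refactored to route every fetch through one helper.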
Nice, congrats!
We should definitely use the ETag stuff for all resources, don't worry about the design. This way is considerably better. I didn't use that approach in the first place because I went the MVP route. Now that we know this works and it is useful, it makes sense to improve the design. :) Thanks again for your help! |
Use http://developer.github.com/v3/repos/commits/#list-commits-on-a-repository together with the "since" parameter to perform optimized queries.
Should use the 'CommitData' type.