Learn how to parse the DOM of a web page by using your favourite coding community as an example.
This repository/project is intended for Educational Purposes ONLY.
The project and corresponding NPM module should not be used for any purpose other than learning.
Please do not use it for any other reason than to learn about DOM parsing, and definitely don't depend on it for anything important!
The nature of DOM parsing is that when the HTML/UI changes, the parser will inevitably fail ... GitHub have every right to change/improve their UI as they see fit. When they do change their UI the scraper will inevitably "break"! We have Travis-CI continuous integration to run our tests precisely to check that parsers for the various pages are working as expected. You can run the tests locally too, see "Run The Tests" section below.
Our initial reason for writing this set of scrapers was to satisfy the curiosity / question:
How can we discover which are the interesting people and projects on GitHub
(without manually checking dozens of GitHub profiles/repositories each day) ?
Our second reason for scraping data from GitHub is so that we can show people a "summary view" of all their issues in our Tudo project (which helps people track/manage/organise/prioritise their GitHub issues). See: dwyl/tudo#51
We needed a simple way of systematically getting data from GitHub (before people authenticate) and scraping is the only way we could think of.
We tried using the GitHub API to get records from GitHub, but sadly, it has quite a few limitations (see: "Issues with GitHub API" section below), the biggest being the rate-limiting on API requests.
Thirdly, we're building this project to scratch our own itch: scraping the pages of GitHub has given us a unique insight into the features of the platform, which has leveled-up our skills.
Don't you want to know what's "Hot" right now on GitHub...?
Having a way of extracting the essential data from GitHub is a solution to a surprisingly wide array of problems, here are a few:
- Who are the up-and-coming people (worth following) on GitHub?
- Which are the interesting projects (and why?!)
- What is the average age of an issue for a project?
- Is a project's popularity growing or plateaued?
- Are there (already) any similar projects to what I'm trying to build? (reduce duplication of effort which is rampant in Open Source!!)
- How many projects get started but never finished?
- Will my Pull Request ever get merged or is the module maintainer too busy and did I just waste 3 hours?
- insert your idea/problem here ...
- Associative Lists e.g: People who starred abc also liked xyz
This module fetches (public) pages from GitHub, "scrapes" the html to extract raw data and returns a JSON Object.
Install from npm and save to your package.json:
npm install github-scraper --save
var gs = require('github-scraper');
var url = '/iteles' // a random username
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
})
User profile has the following format https://github.com/{username}
example: https://github.com/iteles
var gs = require('github-scraper'); // require the module
var url = 'alanshaw' // a random username (of someone you should follow!)
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
})
Sample output:
{
"type": "profile",
"url": "/iteles",
"avatar": "https://avatars1.githubusercontent.com/u/4185328?s=400&v=4",
"name": "Ines Teles Correia",
"username": "iteles",
"bio": "Co-founder @dwyl | Head cheerleader @foundersandcoders",
"uid": 4185328,
"worksfor": "@dwyl",
"location": "London, UK",
"website": "http://www.twitter.com/iteles",
"orgs": {
"bowlingjs": "https://avatars3.githubusercontent.com/u/8825909?s=70&v=4",
"foundersandcoders": "https://avatars3.githubusercontent.com/u/9970257?s=70&v=4",
"docdis": "https://avatars0.githubusercontent.com/u/10836426?s=70&v=4",
"dwyl": "https://avatars2.githubusercontent.com/u/11708465?s=70&v=4",
"ladiesofcode": "https://avatars0.githubusercontent.com/u/16606192?s=70&v=4",
"TheScienceMuseum": "https://avatars0.githubusercontent.com/u/16609662?s=70&v=4",
"SafeLives": "https://avatars2.githubusercontent.com/u/20841400?s=70&v=4"
},
"repos": 28,
"projects": 0,
"stars": 453,
"followers": 341,
"following": 75,
"pinned": [
{ "url": "/dwyl/start-here" },
{ "url": "/dwyl/learn-tdd" },
{ "url": "/dwyl/learn-elm-architecture-in-javascript" },
{ "url": "/dwyl/tachyons-bootstrap" },
{ "url": "/dwyl/learn-ab-and-multivariate-testing" },
{ "url": "/dwyl/learn-elixir" }
],
"contribs": 878,
"contrib_matrix": {
"2018-04-08": { "fill": "#c6e48b", "count": 1, "x": "13", "y": "0" },
"2018-04-09": { "fill": "#c6e48b", "count": 2, "x": "13", "y": "12" },
"2018-04-10": { "fill": "#7bc96f", "count": 3, "x": "13", "y": "24" },
...etc...
"2019-04-11": { "fill": "#c6e48b", "count": 1, "x": "-39", "y": "48" },
"2019-04-12": { "fill": "#7bc96f", "count": 5, "x": "-39", "y": "60"}
}
}
How many people are following a given person on GitHub?
Url format: https://github.com/{username}/followers
example: https://github.com/iteles/followers
var gs = require('github-scraper'); // require the module
var url = 'iteles/followers' // a random username (of someone you should follow!)
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
})
Sample output:
{ entries:
[ 'tunnckoCore', 'OguzhanE', 'minaorangina', 'Jasonspd', 'muntasirsyed', 'fmoliveira', 'nofootnotes',
'SimonLab', 'Danwhy', 'kbocz', 'cusspvz', 'RabeaGleissner', 'beejhuff', 'heron2014', 'joshpitzalis',
'rub1e', 'nikhilaravi', 'msmichellegar', 'anthonybrown', 'miglen', 'shterev', 'NataliaLKB',
'ricardofbarros', 'boymanjor', 'asimjaved', 'amilvasishtha', 'Subhan786', 'Neats29', 'lottie-em',
'rorysedgwick', 'izaakrogan', 'oluoluoxenfree', 'markwilliamfirth', 'bmordan', 'nodeco', 'besarthoxhaj',
'FilWisher', 'maryams', 'sofer', 'joaquimserafim', 'vs4vijay', 'intool', 'edwardcodes', 'hyprstack',
'nelsonic' ],
url: 'https://github.com/iteles/followers' }
ok 1 iteles/followers count: 45
If the person has more than 51 followers they will have multiple pages of followers. The data will have a next_page key with a value such as: /nelsonic/followers?page=2. If you want to keep fetching these subsequent pages of followers, simply run the scraper again with the next_page value, e.g:
var url = 'alanshaw/followers' // a random username (of someone you should follow!)
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
if(data.next_page) {
gs(data.next_page, function(err2, data2) {
console.log(data2); // etc.
})
}
})
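The nested callback above only reaches page 2. A minimal sketch of following next_page until it runs out is shown below. The helper name fetchAllPages and the stubbed page data are hypothetical (not part of github-scraper); the scraper is passed in as an argument so the sketch can run offline.

```javascript
// Collect the entries from every page by following next_page links.
// `scrape` is any function with the (url, callback) signature used above,
// e.g. require('github-scraper'). It is injected so this can run offline.
function fetchAllPages(scrape, url, callback, collected) {
  collected = collected || [];
  scrape(url, function (err, data) {
    if (err) { return callback(err, collected); }
    collected = collected.concat(data.entries || []);
    if (data.next_page) {
      // keep going until a page arrives without a next_page key
      return fetchAllPages(scrape, data.next_page, callback, collected);
    }
    callback(null, collected);
  });
}

// Offline demonstration: a stub standing in for the real scraper (made-up data).
var stubPages = {
  'someone/followers': { entries: ['a', 'b'], next_page: 'someone/followers?page=2' },
  'someone/followers?page=2': { entries: ['c'] }
};
function stubScrape(url, cb) { cb(null, stubPages[url]); }

var allFollowers;
fetchAllPages(stubScrape, 'someone/followers', function (err, list) {
  allFollowers = list;
  console.log(list); // [ 'a', 'b', 'c' ]
});
```

With the real module you would pass `require('github-scraper')` as the first argument. Each page is a separate request to GitHub, so be considerate about how many pages you fetch.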
Want to know the list of people this person is following? That's easy too!
The url format is: https://github.com/{username}/following
e.g: https://github.com/iteles/following or https://github.com/nelsonic/following?page=2 (where the person is following more than 51 people ...)
Usage format is identical to followers (above) so here's an example of fetching page 3 of the results:
var gs = require('github-scraper'); // require the module
var url = 'nelsonic/following?page=3' // a random dude
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
})
Sample output:
{
entries:
[ 'kytwb', 'dexda', 'arrival', 'jinnjuice', 'slattery', 'unixarcade', 'a-c-m', 'krosti',
'simonmcmanus', 'jupiter', 'capaj', 'cowenld', 'FilWisher', 'tsop14', 'NataliaLKB',
'izaakrogan', 'lynnaloo', 'nvcexploder', 'cwaring', 'missinglink', 'alanshaw', 'olizilla',
'tancredi', 'Ericat', 'pgte', 'hyprstack', 'iteles' ],
url: 'https://github.com/nelsonic/following?page=3',
next_page: 'https://github.com/nelsonic/following?page=4'
}
The list of projects a person has starred is a fascinating source of insight. url format: https://github.com/stars/{username} e.g: /stars/iteles
var gs = require('github-scraper'); // require the module
var url = 'stars/iteles'; // starred repos for this user
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
})
Sample output:
{
entries:
[ '/dwyl/repo-badges', '/nelsonic/learn-testling', '/joshpitzalis/testing', '/gmarena/gmarena.github.io',
'/dwyl/alc', '/nikhilaravi/fac5-frontend', '/foundersandcoders/dossier', '/nelsonic/health', '/dwyl/alvo',
'/marmelab/gremlins.js', '/docdis/learn-saucelabs', '/rogerdudler/git-guide', '/tableflip/guvnor',
'/dwyl/learn-redis', '/foundersandcoders/playbook', '/MIJOTHY/FOR_FLUX_SAKE', '/NataliaLKB/learn-git-basics',
'/nelsonic/liso', '/dwyl/learn-json-web-tokens', '/dwyl/hapi-auth-jwt2', '/dwyl/start-here',
'/arvida/emoji-cheat-sheet.com', '/dwyl/time', '/docdis/learn-react', '/dwyl/esta', '/alanshaw/meteor-foam',
'/alanshaw/stylist', '/meteor-velocity/velocity', '/0nn0/terminal-mac-cheatsheet',
'/bowlingjs/bowlingjs.github.io' ],
url: 'https://github.com/stars/iteles?direction=desc&page=2&sort=created',
next_page: 'https://github.com/stars/iteles?direction=desc&page=3&sort=created'
}
The second tab on the personal profile page is "Repositories". This is a list of the personal projects the person is working on, e.g: https://github.com/iteles?tab=repositories
We crawl this page and return an array containing the repo properties:
var url = 'iteles?tab=repositories';
gs(url, function(err, data) {
console.log(data); // or what ever you want to do with the data
})
sample output:
{
entries: [
{ url: '/iteles/learn-ab-and-multivariate-testing',
name: 'learn-ab-and-multivariate-testing',
lang: '',
desc: 'Tutorial on A/B and multivariate testing',
info: '',
stars: '4',
forks: '0',
updated: '2015-07-08T08:36:37Z' },
{ url: '/iteles/learn-tdd',
name: 'learn-tdd',
lang: 'JavaScript',
desc: 'A brief introduction to Test Driven Development (TDD) in JavaScript',
info: 'forked from dwyl/learn-tdd',
stars: '0',
forks: '4',
updated: '2015-06-29T17:24:56Z' },
{ url: '/iteles/practical-full-stack-testing',
name: 'practical-full-stack-testing',
lang: 'HTML',
desc: 'A fork of @nelsonic\'s repo to allow for PRs',
info: 'forked from nelsonic/practical-js-tdd',
stars: '0',
forks: '36',
updated: '2015-06-06T14:40:43Z' },
{ url: '/iteles/styling-for-accessibility',
name: 'styling-for-accessibility',
lang: '',
desc: 'A collection of \'do\'s and \'don\'t\'s of CSS to ensure accessibility',
info: '',
stars: '0',
forks: '0',
updated: '2015-05-26T11:06:28Z' },
{ url: '/iteles/Ultimate-guide-to-successful-meetups',
name: 'Ultimate-guide-to-successful-meetups',
lang: '',
desc: 'The ultimate guide to organizing successful meetups',
info: '',
stars: '3',
forks: '0',
updated: '2015-05-19T09:40:39Z' },
{ url: '/iteles/Javascript-the-Good-Parts-notes',
name: 'Javascript-the-Good-Parts-notes',
lang: '',
desc: 'Notes on the seminal "Javascript the Good Parts: byDouglas Crockford',
info: '',
stars: '41',
forks: '12',
updated: '2015-05-17T16:39:35Z' }
],
url: 'https://github.com/iteles?tab=repositories' }
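Note that stars and forks arrive as strings in the sample output above, so they need converting before any numeric comparison. A small sketch of ranking a person's repos by stars follows; the helper name mostStarred is hypothetical and the input is a trimmed copy of the sample entries.

```javascript
// Return the `limit` most-starred repos from a list of repo entries.
// Star counts are strings in the scraper output, so parse them first.
function mostStarred(entries, limit) {
  return entries
    .map(function (repo) {
      return { name: repo.name, stars: parseInt(repo.stars, 10) || 0 };
    })
    .sort(function (a, b) { return b.stars - a.stars; })
    .slice(0, limit);
}

// Trimmed copy of the sample output above:
var sampleRepos = [
  { name: 'learn-ab-and-multivariate-testing', stars: '4' },
  { name: 'learn-tdd', stars: '0' },
  { name: 'Javascript-the-Good-Parts-notes', stars: '41' }
];

var topTwo = mostStarred(sampleRepos, 2);
console.log(topTwo);
// [ { name: 'Javascript-the-Good-Parts-notes', stars: 41 },
//   { name: 'learn-ab-and-multivariate-testing', stars: 4 } ]
```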
Every person on GitHub has an RSS feed for their recent activity; this is the 3rd and final tab of the person's profile page.
It can be viewed online by visiting:
https://github.com/{username}?tab=activity
e.g: /iteles?tab=activity
The activity feed is published as an .atom xml string which contains a list of entries.
We use xml2js (which in turn uses the sax xml parser) to parse the xml stream. This results in an object similar to the following example:
{ '$':
{ xmlns: 'http://www.w3.org/2005/Atom',
'xmlns:media': 'http://search.yahoo.com/mrss/',
'xml:lang': 'en-US' },
id: [ 'tag:github.com,2008:/iteles' ],
link: [ { '$': [Object] }, { '$': [Object] } ],
title: [ 'iteles’s Activity' ],
updated: [ '2015-07-22T23:31:25Z' ],
entry:
[ { id: [Object],
published: [Object],
updated: [Object],
link: [Object],
title: [Object],
author: [Object],
'media:thumbnail': [Object],
content: [Object] },
{ id: [Object],
published: [Object],
updated: [Object],
link: [Object],
title: [Object],
author: [Object],
'media:thumbnail': [Object],
content: [Object] }
]
}
Each call to the atom feed returns the latest 30 entries. We're showing 2 here for illustration (so you get the idea...)
From this we extract only the relevant info:
'2015-07-22T12:33:14Z alanshaw pushed to master at alanshaw/david-www',
'2015-07-22T12:33:14Z alanshaw created tag v9.4.3 at alanshaw/david-www',
'2015-07-22T09:23:28Z alanshaw closed issue tableflip/i18n-browserify#6',
'2015-07-21T17:08:19Z alanshaw commented on issue alanshaw/david#71',
'2015-07-21T08:24:13Z alanshaw pushed to master at tableflip/score-board',
'2015-07-20T17:49:59Z alanshaw deleted branch refactor-corp-events at tableflip/sow-api-client',
'2015-07-20T17:49:58Z alanshaw pushed to master at tableflip/sow-api-client',
'2015-07-20T17:49:58Z alanshaw merged pull request tableflip/sow-api-client#2',
'2015-07-20T17:49:54Z alanshaw opened pull request tableflip/sow-api-client#2',
'2015-07-18T07:30:36Z alanshaw closed issue alanshaw/md-tokenizer#1',
'2015-07-18T07:30:36Z alanshaw commented on issue alanshaw/md-tokenizer#1',
Instead of wasting (what will be Giga) Bytes of space with key:value pairs by storing the entries as JSON, we are storing the activity feed entries as strings in an array. Each item in the array can be broken down into:
{date-time} {username} {action} {link}
As we can see from this there are several event types:
- pushed to master at
- created tag v9.4.3 at
- opened issue
- commented on issue
- closed issue
- deleted branch
- opened pull request
- merged pull request
- starred username/repo-name
For now we are not going to parse the event types, we are simply going to store them in our list for later analysis.
We have a good pointer for when it's time to start interpreting the data: https://developer.github.com/v3/activity/events/types/
One thing worth noting is that the RSS feed is not real-time. Sadly, it only gets updated periodically, so we cannot rely on it to have the latest info.
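When the time comes to analyse the stored strings, the {date-time} {username} {action} {link} layout described above can be split back into its parts. This is only a sketch: the function name parseActivity is hypothetical, and it assumes the link is always the last space-separated token, which holds for the sample entries shown here but has not been checked against every event type.

```javascript
// Split one stored activity string into its four parts.
// Assumes: first token = date-time, second = username,
// last token = link, everything in between = the action.
function parseActivity(entry) {
  var parts = entry.split(' ');
  return {
    date: parts[0],
    username: parts[1],
    action: parts.slice(2, -1).join(' '),
    link: parts[parts.length - 1]
  };
}

var pushed = parseActivity('2015-07-22T12:33:14Z alanshaw pushed to master at alanshaw/david-www');
console.log(pushed.action); // pushed to master at
console.log(pushed.link);   // alanshaw/david-www

var closed = parseActivity('2015-07-22T09:23:28Z alanshaw closed issue tableflip/i18n-browserify#6');
console.log(closed.action); // closed issue
```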
Organization pages have the following url pattern: https://github.com/{orgname}
example: https://github.com/dwyl
var url = 'dwyl';
gs(url, function(err, data) {
console.log(data); // or do something way more interesting with the data!
});
sample data (entries truncated for brevity):
{
entries:
[ { name: 'hapi-auth-jwt2',
desc: 'Secure Hapi.js authentication plugin using JSON Web Tokens (JWT)',
updated: '2015-08-04T19:30:50Z',
lang: 'JavaScript',
stars: '59',
forks: '11' },
{ name: 'start-here',
desc: 'A Quick-start Guide for People who want to DWYL',
updated: '2015-08-03T11:04:14Z',
lang: 'HTML',
stars: '14',
forks: '9' },
{ name: 'summer-2015',
desc: 'Probably the best Summer Sun, Fun & Coding Experience in the World!',
updated: '2015-07-31T11:02:29Z',
lang: 'CSS',
stars: '16',
forks: '1' },
],
website: 'http://dwyl.io',
url: 'https://github.com/dwyl',
name: 'dwyl - do what you love',
desc: 'Start here: https://github.com/dwyl/start-here',
location: 'Your Pocket',
email: 'github@dwyl.io',
pcount: 24,
avatar: 'https://avatars3.githubusercontent.com/u/11708465?v=3&s=200',
next_page: '/dwyl?page=2'
}
Note #1: sadly, this has the identical url format to Profile. This gets handled by the switcher, which infers what is an org vs. profile page by checking for a known element on the page.
Note #2: when an organization has multiple pages of repositories you will see a next_page key/value in the data, e.g: /dwyl?page=2 (for the second page of repos)
This is where things start getting interesting ...
example: https://github.com/nelsonic/adoro
var url = 'nelsonic/adoro';
gs(url, function(err, data) {
console.log(data); // or do something way more interesting with the data!
});
sample data:
{
url: 'https://github.com/nelsonic/adoro',
desc: 'The little publishing tool you\'ll love using. [work-in-progress]',
website: 'http://www.dwyl.io/',
watchers: 3,
stars: 8,
forks: 1,
commits: 12,
branches: 1,
releases: 1,
langs: [ 'JavaScript 90.7%', 'CSS 9.3%' ]
}
Annoyingly, the number of issues, pull requests and contributors is only rendered after the page has loaded (via XHR), so we do not get these three stats on page load.
Clicking on the issues icon/link in any repository takes us to the list of all the issues.
A project with more than a page worth of issues has pagination at the bottom of the page,
which has a link to: https://github.com/dwyl/tudo/issues?page=2&q=is%3Aissue+is%3Aopen
List of issues for a repository:
var gs = require('github-scraper');
var url = '/dwyl/tudo/issues';
gs(url, function (err, data) {
console.log(data); // use the data how ever you like
});
sample output:
{ entries:
[
{
url: '/dwyl/tudo/issues/46',
title: 'discuss components',
created: '2015-07-21T15:34:22Z',
author: 'benjaminlees',
comments: 3,
assignee: 'izaakrogan',
milestone: 'I don\'t know what I\'m doing',
labels: [ 'enhancement', 'help wanted', 'question' ]
},
{
url: '/dwyl/tudo/issues/45',
title: 'Create riot components from HTML structure files',
created: '2015-07-21T15:24:58Z',
author: 'msmichellegar',
comments: 2,
assignee: 'msmichellegar',
labels: [ 'question' ]
}
], // truncated for brevity
open: 30,
closed: 20,
next: '/dwyl/tudo/issues?page=2&q=is%3Aissue+is%3Aopen',
url: '/dwyl/tudo/issues'
}
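The created timestamps in this output are enough to answer one of the questions from the introduction: what is the average age of an issue for a project? The sketch below is illustrative only; averageAgeInDays is a hypothetical helper, the dates are made up, and "now" is passed in explicitly so the result is reproducible.

```javascript
// Average age (in days) of a list of issue entries, measured at `now`.
// Each entry needs a `created` ISO 8601 timestamp, as in the output above.
function averageAgeInDays(entries, now) {
  if (entries.length === 0) { return 0; }
  var msPerDay = 24 * 60 * 60 * 1000;
  var totalMs = entries.reduce(function (sum, issue) {
    return sum + (now - new Date(issue.created));
  }, 0);
  return totalMs / entries.length / msPerDay;
}

// Made-up dates: one issue is 2 days old and the other 4 days old at `now`.
var issues = [
  { created: '2015-07-21T00:00:00Z' },
  { created: '2015-07-19T00:00:00Z' }
];
var avgAge = averageAgeInDays(issues, new Date('2015-07-23T00:00:00Z'));
console.log(avgAge); // 3
```

Remember that a single call only returns one page of issues, so a true project-wide average would need every page (see the next key in the output).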
Each issue in the list would create an entry in the crawler (worker) queue:
2015-07-22T12:33:14Z issue /dwyl/tudo/issues/77
Should we include the "all issues by this author" link?
- created_by https://github.com/dwyl/tudo/issues/created_by/iteles
- assignee (assigned to): https://github.com/dwyl/tudo/issues?q=assignee%3Aiteles+is%3Aopen
The result of scraping dwyl/tudo#51
var gs = require('github-scraper');
var url = '/dwyl/tudo/issues/51';
gs(url, function (err, data) {
console.log(data); // use the data how ever you like
});
sample output:
{ entries:
[ { id: 'issue-96442793',
author: 'nelsonic',
created: '2015-07-22T00:00:45Z',
body: 'instead of waiting for people to perform the steps to authorise Tudo (to access their GitHub orgs/issues we could request their GitHub username on the login page and initiate the retrieval of their issues while they are authenticating... That way, by the time they get back to Tudo their issues dashboard is already pre-rendered and loaded! This is a wow-factor people won\'t be expecting and thus our app immediately delivers on our first promise!\n\nThoughts?' },
{ id: 'issuecomment-123807796',
author: 'iteles',
created: '2015-07-22T17:54:12Z',
body: 'I\'d love to test this out, this will be an amazing selling point if we can get the performance to work like we expect!' },
{ id: 'issuecomment-124048121',
author: 'nelsonic',
created: '2015-07-23T10:20:15Z',
body: '@iteles have you watched the Foundation Episode featuring Kevin Systrom (instagram) ?\n\n\nhttps://www.youtube.com/watch?v=nld8B9l1aRE\n\n\nWhat were the USPs that contributed to instagram\'s success (considering how many photo-related-apps were in the app store at the time) ?\n\ncc: @besarthoxhaj' },
{ id: 'issuecomment-124075792',
author: 'besarthoxhaj',
created: '2015-07-23T11:59:31Z',
body: '@nelsonic love the idea! Let\'s do it!' } ],
labels: [ 'enhancement', 'help wanted', 'question' ],
participants: [ 'nelsonic', 'iteles', 'besarthoxhaj' ],
url: '/dwyl/tudo/issues/51',
title: 'Pre-fetch people\'s issues while they are authenticating with GitHub',
state: 'Open',
author: 'nelsonic',
created: '2015-07-22T00:00:45Z',
milestone: 'Minimal Usable Product',
assignee: 'besarthoxhaj' }
By contrast, to fetch this issue using the GitHub API, see: https://developer.github.com/v3/issues/#get-a-single-issue
format:
/repos/:owner/:repo/issues/:number
curl https://api.github.com/repos/dwyl/tudo/issues/51
Milestones are used to group issues into logical units.
var gs = require('github-scraper');
var url = '/dwyl/tudo/milestones';
gs(url, function (err, data) {
console.log(data); // use the data how ever you like
});
Sample output:
{ entries:
[ { name: 'Test Milestone - Please Don\'t Close!',
due: 'Past due by 16 days',
updated: 'Last updated 5 days ago',
desc: 'This Milestone in used in our e2e tests to check for an over-due milestone, so please don\'t close it!',
progress: '0%',
open: 1,
closed: 0 },
{ name: 'Minimal Usable Product',
due: 'Due by July 5, 2016',
updated: 'Last updated 2 days ago',
desc: 'What is the absolute minimum we can do to deliver value to people using the app?\n(and thus make them want to come back and use it!)',
progress: '0%',
open: 5,
closed: 0 } ],
url: 'https://github.com/dwyl/tudo/milestones',
open: 2,
closed: 1 }
All repositories have a set of standard labels (built-in to GitHub) e.g: https://github.com/dwyl/tudo/labels is (currently) only using the "standard" labels.
Whereas RethinkDB (which uses GitHub for all their project tracking) uses several custom labels: https://github.com/rethinkdb/rethinkdb/labels
We need to crawl these for each repo.
var gs = require('github-scraper');
var url = '/dwyl/time/labels';
gs(url, function (err, data) {
console.log(data); // use the data how ever you like
});
Here's the extraction of the standard labels:
[
{ name: 'bug',
style: 'background-color: #fc2929; color: #fff;',
link: '/dwyl/tudo/labels/bug',
count: 3 },
{ name: 'duplicate',
style: 'background-color: #cccccc; color: #333333;',
link: '/dwyl/tudo/labels/duplicate',
count: 0 },
{ name: 'enhancement',
style: 'background-color: #84b6eb; color: #1c2733;',
link: '/dwyl/tudo/labels/enhancement',
count: 11 },
{ name: 'help wanted',
style: 'background-color: #159818; color: #fff;',
link: '/dwyl/tudo/labels/help%20wanted',
count: 21 },
{ name: 'invalid',
style: 'background-color: #e6e6e6; color: #333333;',
link: '/dwyl/tudo/labels/invalid',
count: 1 },
{ name: 'question',
style: 'background-color: #cc317c; color: #fff;',
link: '/dwyl/tudo/labels/question',
count: 10 }
]
or a repo that has custom labels:
{ entries:
[ { name: '[alpha]',
style: 'background-color: #79CDCD; color: #1e3333;',
link: '/dwyl/time/labels/%5Balpha%5D',
count: 2 },
{ name: 'API',
style: 'background-color: #006b75; color: #fff;',
link: '/dwyl/time/labels/API',
count: 11 },
{ name: 'bug',
style: 'background-color: #fc2929; color: #fff;',
link: '/dwyl/time/labels/bug',
count: 5 },
{ name: 'chore',
style: 'background-color: #e11d21; color: #fff;',
link: '/dwyl/time/labels/chore',
count: 9 },
{ name: 'discuss',
style: 'background-color: #bfe5bf; color: #2a332a;',
link: '/dwyl/time/labels/discuss',
count: 43 },
{ name: 'Documentation',
style: 'background-color: #eb6420; color: #fff;',
link: '/dwyl/time/labels/Documentation',
count: 2 },
{ name: 'duplicate',
style: 'background-color: #cccccc; color: #333333;',
link: '/dwyl/time/labels/duplicate',
count: 0 },
{ name: 'enhancement',
style: 'background-color: #84b6eb; color: #1c2733;',
link: '/dwyl/time/labels/enhancement',
count: 27 },
{ name: 'external dependency',
style: 'background-color: #D1EEEE; color: #2c3333;',
link: '/dwyl/time/labels/external%20dependency',
count: 1 },
{ name: 'FrontEnd',
style: 'background-color: #f7c6c7; color: #332829;',
link: '/dwyl/time/labels/FrontEnd',
count: 26 },
{ name: 'help wanted',
style: 'background-color: #009800; color: #fff;',
link: '/dwyl/time/labels/help%20wanted',
count: 42 },
{ name: 'invalid',
style: 'background-color: #e6e6e6; color: #333333;',
link: '/dwyl/time/labels/invalid',
count: 0 },
{ name: 'investigate',
style: 'background-color: #fbca04; color: #332900;',
link: '/dwyl/time/labels/investigate',
count: 18 },
{ name: 'MVP',
style: 'background-color: #207de5; color: #fff;',
link: '/dwyl/time/labels/MVP',
count: 27 },
{ name: 'NiceToHave',
style: 'background-color: #fbca04; color: #332900;',
link: '/dwyl/time/labels/NiceToHave',
count: 7 },
{ name: 'Post MVP',
style: 'background-color: #fef2c0; color: #333026;',
link: '/dwyl/time/labels/Post%20MVP',
count: 24 },
{ name: 'question',
style: 'background-color: #cc317c; color: #fff;',
link: '/dwyl/time/labels/question',
count: 25 },
{ name: 'UI',
style: 'background-color: #bfdadc; color: #2c3233;',
link: '/dwyl/time/labels/UI',
count: 13 } ],
url: 'https://github.com/dwyl/time/labels' }
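Each label's style arrives as a raw inline CSS string. If you want the colours as separate values (for example to render the labels yourself), a sketch like the following would split them out. The helper name parseLabelStyle is hypothetical and it only handles simple "property: value;" declarations like the ones in the output above.

```javascript
// Turn an inline style string into an object of { property: value } pairs.
function parseLabelStyle(style) {
  var result = {};
  style.split(';').forEach(function (declaration) {
    var colon = declaration.indexOf(':');
    if (colon === -1) { return; } // skip the empty piece after the last ';'
    var prop = declaration.slice(0, colon).trim();
    var value = declaration.slice(colon + 1).trim();
    if (prop) { result[prop] = value; }
  });
  return result;
}

var bugStyle = parseLabelStyle('background-color: #fc2929; color: #fff;');
console.log(bugStyle); // { 'background-color': '#fc2929', color: '#fff' }
```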
A much more effective way of collating all the issues relevant to a person is to search for them!
example: https://github.com/search?type=Issues&q=author%3Aiteles&state=open&o=desc&s=created
{
entries:
[
{ title: 'Remove flexbox from CSS',
url: '/dwyl/dwyl.github.io/issues/29',
desc: 'To ensure the site works across all devices, particularly Kindle/e-readers.',
author: 'iteles',
created: '2015-07-25T22:57:20Z',
comments: 2 },
{ title: 'CSS | Add indentation back into main.css (disappeared from master)',
url: '/dwyl/tudo/issues/77',
desc: 'All indentation has been removed from main.css in the latest commit. \n\nThis needs to be put back in as originally written by @msmichellegar and @iteles.',
author: 'iteles',
created: '2015-07-25T16:27:59Z' },
{ title: 'CSS | Investigate styling of issue label colours',
url: '/dwyl/tudo/issues/72',
desc: 'Labels can be given any colour so there is no predictable set that we can code into the CSS file.\n\nWe need to investigate what the best way to ensure we can provide the right colour of background to the ...',
author: 'iteles',
created: '2015-07-23T17:49:02Z',
comments: 4 }
],
next: '/search?o=desc&p=2&q=author%3Aiteles&s=created&state=open&type=Issues'
}
For the issues created across all their personal repositories use a search query of the form:
https://github.com/search?q=user%3A{username|org}
&state={state}
&type=Issues&s={relevance}
&o={order}
e.g: https://github.com/search?q=user%3Aiteles&state=open&type=Issues&s=updated&o=asc
Or to find all the issues where the person is the author use a query of the following format:
https://github.com/search?q=author%3A{username|org}
&state={state}
&type=Issues&s={relevance}
&o={order}
Or to find all the issues assigned to the person use a query of the following format:
https://github.com/search?q=assignee%3A{username|org}
&state={state}
&type=Issues&s={relevance}
&o={order}
&s={filter}
We can use a mentions (search) query to discover all the issues where a given person (username) was mentioned:
https://github.com/search?q=mentions%3A{username}&type=Issues&state={state}
e.g: https://github.com/search?q=mentions%3Aiteles&type=Issues&state=open
This could be more than the issues in the person's (own) repos or the repos the person has access to (via org). e.g: if Sally asks a clarifying question on a project she has not yet contributed to, the issue will not appear when we crawl the repos on her profile or orgs she has access to ...
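The user, author, assignee and mentions queries above all share the same shape, so the url can be assembled by one small function. This is a sketch under assumptions: issueSearchUrl and its option names are hypothetical, and the parameter order differs from some of the examples above (query parameter order should not matter to GitHub, but that is not verified here).

```javascript
// Build a GitHub issue-search path such as
// /search?q=mentions%3Aiteles&type=Issues&state=open
// qualifier: 'user', 'author', 'assignee' or 'mentions'
// options:   { state, sort, order } (all optional)
function issueSearchUrl(qualifier, username, options) {
  options = options || {};
  var params = [
    'q=' + encodeURIComponent(qualifier + ':' + username), // ':' becomes %3A
    'type=Issues'
  ];
  if (options.state) { params.push('state=' + encodeURIComponent(options.state)); }
  if (options.sort) { params.push('s=' + encodeURIComponent(options.sort)); }
  if (options.order) { params.push('o=' + encodeURIComponent(options.order)); }
  return '/search?' + params.join('&');
}

var mentionsUrl = issueSearchUrl('mentions', 'iteles', { state: 'open' });
console.log(mentionsUrl);
// /search?q=mentions%3Aiteles&type=Issues&state=open

var authorUrl = issueSearchUrl('author', 'iteles', { state: 'open', sort: 'created', order: 'desc' });
console.log(authorUrl);
// /search?q=author%3Aiteles&type=Issues&state=open&s=created&o=desc
```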
There are many filters we can use to find issues, here are a few:
- created https://github.com/search?q=author%3Aiteles&s=created&type=Issues&o=desc&state=open
- updated: https://github.com/search?q=author%3Aiteles&s=updated&type=Issues&o=desc&state=open
- date range: https://github.com/dwyl/time/issues?q=is%3Aissue+is%3Aopen+updated%3A%3C2015-06-28
For way more details on searching & filters see:
- https://help.github.com/articles/searching-issues/
- https://help.github.com/articles/searching-github/#types-of-searches
- https://help.github.com/articles/search-syntax/
If you want even more examples of the pages you can scrape, take a look at our end-to-end tests where we test all the scrapers!
Would it be interesting to see/track:
- who makes the most commits to the project
- when (what time of day/night) people do their work
- what did the person contribute? (docs, code improvement, tests, typo, dependency update?)
Show your interest in this feature: #17
Contributions are always welcome!
We have a backlog of features (many pages we want to parse)
please see: https://github.com/nelsonic/github-scraper/issues
If anything interests you, please leave a comment on the issue.
Your first step to contributing to this project is to run it on your localhost.
In your terminal, clone the repository from GitHub:
git clone https://github.com/nelsonic/github-scraper.git && cd github-scraper
Ensure you have Node.js installed, see https://nodejs.org
Then run the following command to install the project dependencies:
npm install
You should see output in your terminal similar to the following:
added 162 packages from 177 contributors and audited 265 packages in 4.121s
That tells you that the dependencies were successfully installed.
In your terminal execute the following command:
npm test
You should see output similar to the following:
> github-scraper@6.7.1 test /Users/n/code/github-scraper
> istanbul cover ./node_modules/tape/bin/tape ./test/*.js | node_modules/tap-spec/bin/cmd.js
read list of followers for @jupiter (single page of followers)
- - - GitHub Scraper >> /jupiter/followers >> followers - - -
✔ jupiter/followers data.type: followers
✔ @jupiter/followers has 34 followers
✔ Nelson in jupiter/followers
✔ @jupiter/followers only has 1 page of followers
read list of followers for @iteles (multi-page)
- - - GitHub Scraper >> /iteles/followers >> followers - - -
✔ "followers": 51 on page 1
✔ iteles/followers multi-page followers
... etc ...
=============================================================================
Writing coverage object [/Users/n/code/github-scraper/coverage/coverage.json]
Writing coverage reports at [/Users/n/code/github-scraper/coverage]
=============================================================================
=============================== Coverage summary ===============================
Statements : 100% ( 192/192 )
Branches : 100% ( 63/63 )
Functions : 100% ( 22/22 )
Lines : 100% ( 192/192 )
================================================================================
total: 102
passing: 102
duration: 31.6s
The tests take around 30 seconds to run on my localhost, but your test execution time will vary depending on your location (the further you are from GitHub's servers the slower the tests will run...).
Don't panic if you see some red in your terminal while the tests are running.
We have to simulate failure (404 and 403 errors) to ensure that we can handle them.
Pages sometimes disappear, e.g: a user leaves GitHub or deletes a project,
and our script needs to not freak out when that happens.
This is good practice in DOM parsing, the web changes a lot!
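The same caution applies to code that uses the scraper: check err before touching data. Below is a sketch of scraping a list of urls while tolerating pages that have disappeared. The helper name scrapeEach is hypothetical, and the stub signals a missing page with a bare 404 value purely for illustration; the actual shape of the module's error value is not described in this section.

```javascript
// Scrape several urls, recording failures instead of stopping on them.
// `scrape` is any function with the (url, callback) signature used in this README.
function scrapeEach(scrape, urls, done) {
  var results = [];
  var failed = [];
  var pending = urls.length;
  if (pending === 0) { return done(results, failed); }
  urls.forEach(function (url) {
    scrape(url, function (err, data) {
      if (err) { failed.push(url); } else { results.push(data); }
      pending -= 1;
      if (pending === 0) { done(results, failed); }
    });
  });
}

// Offline stub (assumption: an error is any truthy first argument).
function stubScraper(url, cb) {
  if (url === 'deleted-user') { return cb(404); }
  cb(null, { url: url });
}

var crawl;
scrapeEach(stubScraper, ['iteles', 'deleted-user'], function (results, failed) {
  crawl = { results: results, failed: failed };
  console.log(crawl);
});
```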
When the tests pass on your localhost, you know everything is working as expected.
Time to move on to the fun bit!
Note: This project follows Test Driven Development (TDD) because it's the only way we can maintain our sanity ... If we didn't have tests it would be chaos and everything would "break" all the time. If you are contributing to the project, please be aware that tests are required and any Pull Requests without tests will not be considered. (please don't take it personally, it's just a rule we have).
If you are new to TDD, please see: github.com/dwyl/learn-tdd
Once you have the project running on your localhost, it's time to pick a page to parse!
There are a bunch of features in the backlog. see: https://github.com/nelsonic/github-scraper/issues
Pick one that interests you and write a comment on it to show your interest in contributing.
We use Travis-CI (Continuous Integration), to ensure that our code works and all tests pass whenever a change is made to the code. This is essential in any project and even more so in a DOM parsing one.
If you are new to Travis-CI, please see: github.com/dwyl/learn-travis
When you attempt to commit code on your localhost, the tests will run before your commit will register.
This is a precaution to ensure that the code we write is always tested.
There is no point writing code that is not being tested
as it will "break" almost immediately and be unmaintainable.
Simply wait a few seconds for the tests to pass and then push your work to GitHub.
If you are new to pre-commit hooks, please see: github.com/dwyl/learn-pre-commit
If you are the kind of person that likes to understand how something works, this is your section.
lib/switcher.js handles inference.
We wanted to use a switch > case construct but ended up using if/else because there are two types of checks we need to do, so if/else seemed simpler.
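To illustrate the kind of inference involved, here is a rough sketch that guesses the page type from the url patterns documented in this README. It is not the code in lib/switcher.js: the function name guessPageType and the returned labels are hypothetical. Because a profile and an organization share the same url format, the url alone cannot separate them, which is why the real switcher also has to look at the page itself.

```javascript
// Guess which kind of GitHub page a url points at, using only the url shape.
function guessPageType(url) {
  var path = url.replace(/^https:\/\/github\.com/, '');
  var query = '';
  var q = path.indexOf('?');
  if (q > -1) {
    query = path.slice(q + 1);
    path = path.slice(0, q);
  }
  var parts = path.split('/').filter(Boolean); // drop empty segments
  if (parts[0] === 'search') { return 'search'; }
  if (parts[0] === 'stars') { return 'stars'; }
  if (parts.length === 1) {
    if (query.indexOf('tab=repositories') > -1) { return 'repos'; }
    if (query.indexOf('tab=activity') > -1) { return 'activity'; }
    return 'profile-or-org'; // needs a check against the page content
  }
  if (parts.length === 2) {
    if (parts[1] === 'followers' || parts[1] === 'following') { return parts[1]; }
    return 'repo';
  }
  if (parts[2] === 'issues') { return parts.length === 4 ? 'issue' : 'issues'; }
  if (parts[2] === 'milestones' || parts[2] === 'labels') { return parts[2]; }
  return 'unknown';
}

console.log(guessPageType('iteles/followers'));       // followers
console.log(guessPageType('/dwyl/tudo/issues/51'));   // issue
console.log(guessPageType('https://github.com/dwyl')); // profile-or-org
```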
- GitHub has 10.3 Million users (at last count)
- yet the most followed person Linus Torvalds "only" has 28k followers (so it's a highly distributed network)
- https://www.githubarchive.org/ attempts to archive all of GitHub
- http://octoboard.com/ shows stats for the past 24h
Must read up about http://en.wikipedia.org/wiki/Inverted_index so I understand how to use: https://www.npmjs.org/package/level-inverted-index
- GitHub stats (node module): https://github.com/apiengine/ghstats (no tests or recent work/activity, but interesting functionality)
- Hard Drive reliability stats: https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014 (useful when selecting which drives to use in the storage array - Clear Winner is Hitachi 3TB)
- RAID explained in layman's terms: http://uk.pcmag.com/storage-devices-reviews/7917/feature/raid-levels-explained
- RAID Calculator: https://www.synology.com/en-global/support/RAID_calculator (if you don't already know how much space you get)
- SQLite limits: https://www.sqlite.org/limits.html
- Summary of Most Active GitHub users: http://git.io/top
- Intro to web-scraping with cheerio: https://www.digitalocean.com/community/tutorials/how-to-use-node-js-request-and-cheerio-to-set-up-simple-web-scraping
- GitHub background info: http://en.wikipedia.org/wiki/GitHub
- GitHub Event Types: https://developer.github.com/v3/activity/events/types/
- GitHub Stats API: https://developer.github.com/v3/repos/statistics/
- GitHub Followers API: https://developer.github.com/v3/users/followers/
Example:
curl -v https://api.github.com/users/pgte/followers
[
{
"login": "methodmissing",
"id": 379,
"avatar_url": "https://avatars.githubusercontent.com/u/379?v=2",
"gravatar_id": "",
"url": "https://api.github.com/users/methodmissing",
"html_url": "https://github.com/methodmissing",
"followers_url": "https://api.github.com/users/methodmissing/followers",
"following_url": "https://api.github.com/users/methodmissing/following{/other_user}",
"gists_url": "https://api.github.com/users/methodmissing/gists{/gist_id}",
"starred_url": "https://api.github.com/users/methodmissing/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/methodmissing/subscriptions",
"organizations_url": "https://api.github.com/users/methodmissing/orgs",
"repos_url": "https://api.github.com/users/methodmissing/repos",
"events_url": "https://api.github.com/users/methodmissing/events{/privacy}",
"received_events_url": "https://api.github.com/users/methodmissing/received_events",
"type": "User",
"site_admin": false
},
etc...]
- The API only returns 30 results per query.
- X-RateLimit-Limit: 60 (can only make 60 requests per hour) ... 1440 queries per day (60 per hour x 24 hours) sounds like ample on the surface. But if we assume the average person has at least 2 pages' worth of followers (more than 30), a single instance/server can only track 720 people (1440 / 2). Not really enough to do any sort of trend analysis. If we are tracking people with hundreds of followers (and growing fast), e.g. 300 or more, we need 10 requests to fetch each complete list of followers, so the number of users we can track comes down to 1440 / 10 = 144 people ... we burn through 1440 requests pretty quickly.
- There's no guarantee which order the followers will be in (e.g. most recent first?)
- Results are cached so they are not real-time like they are on the web (seems daft, but it's true). Ideally they would have a Streaming API but sadly, GitHub is built in Ruby-on-Rails which is "RESTful" (not real-time).
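The request-budget arithmetic from the list above can be written out as a small calculation. This is only a sketch: the function names are made up for illustration, and it assumes each person's full follower list is fetched once per day with no authentication.

```javascript
// Figures taken from the limitations listed above
const PER_PAGE = 30;          // followers returned per API request
const REQUESTS_PER_HOUR = 60; // X-RateLimit-Limit

// how many requests it takes to fetch one person's complete follower list
function requestsNeeded (followers) {
  return Math.max(1, Math.ceil(followers / PER_PAGE));
}

// how many people one server can track per day,
// ASSUMING every tracked person has the same number of followers
function trackablePerDay (followers) {
  const dailyBudget = REQUESTS_PER_HOUR * 24; // 1440 requests
  return Math.floor(dailyBudget / requestsNeeded(followers));
}

console.log(trackablePerDay(60));  // 720 (2 requests each)
console.log(trackablePerDay(300)); // 144 (10 requests each)
```

Plugging in larger follower counts shows how quickly the budget shrinks, which is the main reason we looked at scraping instead.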
Once we know who we should be following, we can use
- https://developer.github.com/v3/users/followers/#follow-a-user
- https://developer.github.com/v3/users/followers/#check-if-one-user-follows-another
e.g:
curl -v https://api.github.com/users/pgte/following/visionmedia
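The same check can be made from Node.js. The endpoint answers with an empty body, so the result is carried by the HTTP status code: 204 means "follows" and 404 means "does not follow". The sketch below uses only Node's built-in https module; the function names and the User-Agent string are made up for illustration, and every call counts against the rate limit described above.

```javascript
const https = require('https');

// translate the status code of the "check if one user follows another"
// endpoint into a boolean
function interpretFollowStatus (statusCode) {
  if (statusCode === 204) { return true; }  // user follows target
  if (statusCode === 404) { return false; } // user does not follow target
  throw new Error('Unexpected status code: ' + statusCode);
}

// hypothetical helper: calls back with (error, boolean)
function checkFollows (user, target, callback) {
  const options = {
    hostname: 'api.github.com',
    path: '/users/' + user + '/following/' + target,
    headers: { 'User-Agent': 'github-scraper-example' } // the API requires a User-Agent header
  };
  https.get(options, function (res) {
    res.resume(); // the body is empty, discard it
    let follows;
    try {
      follows = interpretFollowStatus(res.statusCode);
    } catch (e) {
      return callback(e); // e.g. 403 when the rate limit is exhausted
    }
    callback(null, follows);
  }).on('error', callback);
}
```

Usage would look like: checkFollows('pgte', 'visionmedia', function (err, follows) { console.log(err || follows); });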
The fact that scraping or "crawling" is Google's Business Model suggests that scraping is at least "OK" ...
Started typing this into Google and saw:
I read a few articles and was not able to locate a definitive answer ...
- Legal Issues: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
- It depends: http://resources.distilnetworks.com/h/i/53822104-is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is/181642
- Screen scraping: How to profit from your rival's data: http://www.bbc.com/news/technology-23988890
- Web Scraping For Fun and Profit: https://blog.hartleybrody.com/web-scraping/