A tool for parsing the Wikipedia article graph.
It takes the provided initial Wikipedia links, parses them, extracts the child links, and adds them to the parsing queue. It also saves each parsed article's name.
You can export the collected links as a CSV table (see usage below) to be used in external software. For example, you can import that file directly into Cosmograph to visualize the graph.

First, install the tool:

go install github.com/tymbaca/wikigraph@latest
Launch the parse process like this:
wikigraph parse my_graph.db https://en.wikipedia.org/wiki/Kingdom_of_Greece
Or:
wikigraph parse my_graph.db https://en.wikipedia.org/wiki/Kingdom_of_Greece https://en.wikipedia.org/wiki/Christmas
The program will begin parsing Wikipedia, starting from the provided URLs. You can exit
the program at any moment by pressing <Ctrl-c>
(see Graceful Shutdown below).
You can continue by launching the program with an already existing database file:
wikigraph parse my_graph.db
It will continue parsing as expected (without losing progress). It will also retry all links that it failed to parse in the previous attempt.
Now you can export the graph to a CSV file:
wikigraph export my_graph.db my_graph.csv
Note that you can run this at any time; you don't need to parse every article on Wikipedia :)
The exported graph will look something like this:
from,to
"Heil unserm König, Heil!",Kingdom of Greece
"Heil unserm König, Heil!",Hymn to Liberty
"Heil unserm König, Heil!",Greece
Constitutional monarchy,Absolute monarchy
Constitutional monarchy,State religion
Constitutional monarchy,Unitary state
...
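If you want to consume the exported file in your own code rather than a visualization tool, a minimal reader is easy to write. The sketch below is only an illustration: it assumes nothing beyond the from,to header shown above and uses Go's standard encoding/csv package.

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

// edge mirrors one row of the exported CSV: a link from one article to another.
type edge struct{ From, To string }

func main() {
	f, err := os.Open("my_graph.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	records, err := csv.NewReader(f).ReadAll()
	if err != nil {
		log.Fatal(err)
	}

	var edges []edge
	for _, rec := range records[1:] { // skip the "from,to" header row
		edges = append(edges, edge{From: rec[0], To: rec[1]})
	}
	fmt.Printf("loaded %d edges\n", len(edges))
}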
Run wikigraph help for more info.
Worker Pool. The program uses a pool of workers to parallelize the parsing workload.
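To illustrate the idea (a simplified sketch, not the project's actual code): a fixed number of goroutines consume article URLs from a shared channel, with parseArticle standing in for the real fetch-and-parse step.

package main

import (
	"fmt"
	"sync"
)

// runWorkers starts n workers that consume article URLs from the jobs channel.
// parseArticle is a placeholder for the real fetch-and-parse step.
func runWorkers(n int, jobs <-chan string, parseArticle func(url string)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				parseArticle(url)
			}
		}()
	}
	wg.Wait() // returns once the jobs channel is closed and drained
}

func main() {
	jobs := make(chan string, 2)
	jobs <- "https://en.wikipedia.org/wiki/Kingdom_of_Greece"
	jobs <- "https://en.wikipedia.org/wiki/Christmas"
	close(jobs)

	runWorkers(2, jobs, func(url string) { fmt.Println("parsing", url) })
}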
Job Queueing. Every fetched child link is pushed to the job queue with
PENDING status, so another worker can later grab it, parse it, and produce
more child articles. The job queue is basically an article
table in the SQLite DB;
you can explore it with any suitable DB client. If the program exits, the whole
queue stays in the database, so you can restart easily.
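You can also peek at the queue from Go. This is only a sketch: it assumes an article table with a status column (PENDING etc.), which matches the description above but may not be the exact schema, and it uses the mattn/go-sqlite3 driver as one possible choice.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

func main() {
	db, err := sql.Open("sqlite3", "my_graph.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count queued jobs per status; table and column names are assumed here.
	rows, err := db.Query(`SELECT status, COUNT(*) FROM article GROUP BY status`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var status string
		var count int
		if err := rows.Scan(&status, &count); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d\n", status, count)
	}
}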
Graceful Shutdown. At any time while the program is running, you can press
<Ctrl-c>. The program will wait until all workers have parsed and saved their
results, and only then exit.
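The usual Go pattern for this (again, a sketch under assumptions, not necessarily the project's exact code) is to cancel a context on SIGINT and wait for the workers with a WaitGroup:

package main

import (
	"context"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// Ctrl-C (SIGINT) cancels ctx instead of killing the process immediately.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				// Fetch the next PENDING article, parse it, and save the result.
				// The placeholder sleep stands in for that work; the loop only
				// exits between jobs, so nothing is lost mid-save.
				time.Sleep(100 * time.Millisecond)
			}
		}()
	}

	// Exit only after every worker has finished and saved its current job.
	wg.Wait()
}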
Rate Limiter. The program uses an internal HTTP client with rate limiting,
set to 20 RPS by default (it's hardcoded). I found this rate to be optimal. If you
get 429 Too Many Requests,
just wait a bit and try again. Or you can
change the rate in code (in cmd/wikigraph/main.go)
and recompile the program.
I'm too lazy to add an RPS flag (just look at how I handle CLI arguments in main.go,
lol).
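For reference, here is one common way to build such a rate-limited client with golang.org/x/time/rate. The transport wrapper is an illustration of the pattern, not a claim about how the project wires it up; only the 20 RPS figure comes from above.

package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// throttledTransport delays each request until the limiter grants a token.
type throttledTransport struct {
	limiter *rate.Limiter
	next    http.RoundTripper
}

func (t *throttledTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Block until a token is available, respecting request cancellation.
	if err := t.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return t.next.RoundTrip(req)
}

func newClient(rps float64) *http.Client {
	return &http.Client{
		Transport: &throttledTransport{
			limiter: rate.NewLimiter(rate.Limit(rps), 1),
			next:    http.DefaultTransport,
		},
	}
}

func main() {
	client := newClient(20) // 20 requests per second, as mentioned above
	resp, err := client.Get("https://en.wikipedia.org/wiki/Kingdom_of_Greece")
	if err == nil {
		resp.Body.Close()
	}
}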
The program was tested with the English, Russian, and Wolof Wikipedias, so articles in other non-ASCII languages should also be supported, as long as they share a similar HTML layout.
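As a purely hypothetical illustration of that kind of layout-based scraping (not the project's actual selectors or dependencies), extracting a title and internal links with goquery could look like this. The #firstHeading and #bodyContent IDs are part of Wikipedia's standard skin, which is why articles from other language editions can be handled the same way.

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://en.wikipedia.org/wiki/Kingdom_of_Greece")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Article title and internal /wiki/ links (selectors are assumptions).
	fmt.Println("title:", doc.Find("#firstHeading").Text())
	doc.Find(`#bodyContent a[href^="/wiki/"]`).Each(func(_ int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Println("child:", href)
	})
}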
- Use the official MediaWiki API instead of parsing the whole HTML of every article.