A tool to transfer an extract of a Wikidata dump into a CouchDB database
This tool was a rather naive implementation; if I were to do this today, I would do it differently and make sure to use CouchDB's bulk mode (a sketch follows the list):
- Get a wikidata json dump
- Optionally, filter it to get the desired subset. In any case, turn the dump into valid NDJSON (drop the first and last lines and the comma at the end of each line).
- Pass each entity through a function to move the "id" attribute to "_id", using https://github.com/maxlath/ndjson-apply, to match CouchDB requirements.
- Bulk upload the result to CouchDB using https://github.com/maxlath/couchdb-bulk2
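A minimal sketch of that pipeline, assuming a decompressed dump at `dump.json`, a local CouchDB with a `wikidata` database, and a hypothetical `move-id.js` transform file:

```sh
# Turn the JSON array dump into NDJSON: drop the wrapping [ ] lines
# and the comma at the end of each entity line
sed '1d;$d;s/,$//' dump.json > entities.ndjson

# move-id.js (hypothetical) would contain something like:
#   module.exports = entity => {
#     entity._id = entity.id
#     delete entity.id
#     return entity
#   }
ndjson-apply move-id.js < entities.ndjson > docs.ndjson

# Bulk-upload the resulting documents to CouchDB
cat docs.ndjson | couchdb-bulk2 http://localhost:5984/wikidata
```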
- NodeJS >= v6. If your distribution doesn't provide a recent version of NodeJS, you might want to uninstall NodeJS and reinstall it using NVM
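For instance, once NVM itself is installed, something like this should do:

```sh
nvm install 6   # install the latest Node 6.x release
nvm use 6       # use it in the current shell
```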
```sh
git clone https://github.com/maxlath/import-wikidata-dump-to-couchdb
cd import-wikidata-dump-to-couchdb
npm install
```
Now you can customize `./config/default.js` to your needs.
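For illustration only, the config might look something like this (a hedged sketch; the actual keys are defined in the real `./config/default.js`):

```js
// Hypothetical shape of ./config/default.js; check the actual file for the exact keys
module.exports = {
  db: {
    host: 'localhost',
    port: 5984,
    name: 'wikidata',
    username: 'yourcouchdbuser',
    password: 'yourcouchdbpassword'
  },
  // conflict behavior, documented below: 'update', 'pass', or 'exit'
  onConflict: 'update'
}
```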
Download the latest Wikidata dump
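Wikidata publishes its entity dumps at https://dumps.wikimedia.org/wikidatawiki/entities/; for example:

```sh
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
gzip -d latest-all.json.gz   # decompress to latest-all.json
```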
Extract the subset of the dump that fits your needs, as you might not want to throw ~40 GB at your database's face.
For instance, for the needs of the authors-birthday bot, I wanted to keep only Wikidata entities of writers:
As each line of the dump is an entity, you could do something like this with grep:

```sh
cat dump.json | grep '36180\,' > isWriter.json
```
The trick here is that every entity with occupation->writer (P106->Q36180) will have 36180 somewhere in the line (as a claim numeric-id). And tadaa, you went from a 39 GB dump to a much nicer 384 MB subset.
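That works because a claim's value stores the target entity as a numeric id in the dump's JSON; abridged, a P106 claim looks roughly like this:

```json
"P106": [{
  "mainsnak": {
    "datavalue": {
      "value": { "entity-type": "item", "numeric-id": 36180 }
    }
  }
}]
```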
But now we can do something cleaner using wikidata-filter:

```sh
cat dump.json | wikidata-filter --claim P106:Q36180 > isWriter.json
```
This new file isn't valid JSON as a whole (it's line-delimited JSON), but every line is, once you remove any trailing comma. So here is the plan: take every line, remove the comma, and PUT it into your database:

```sh
./import.js ./isWriter.json
```
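For the curious, here is roughly what that amounts to for a single line, sketched with jq and curl (a hedged illustration, not the actual implementation; it assumes a local CouchDB with a `wikidata` database):

```sh
# Take one entity line and strip the trailing comma, if any
line=$(head -n 1 ./isWriter.json | sed 's/,$//')
# Use the Wikidata id as the CouchDB document id
id=$(echo "$line" | jq -r '.id')
curl -X PUT "http://localhost:5984/wikidata/$id" \
  -H 'Content-Type: application/json' \
  -d "$line"
```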
You can also import only a slice of the file by passing a start line and an end line:

```sh
startline=5
# line 10 will be included
endline=10
./import.js ./isWriter.json $startline $endline
```
In the config file (`./config/default.js`), you can set the behavior on conflict, that is, when the importer tries to add an entity that was already added to CouchDB:
- `update` (default): update the document if there is a change, otherwise pass
- `pass`: always pass
- `exit`: exit the process at the first conflict
- wikidata-filter: a command-line tool to filter a Wikidata dump by claim
- wikidata-subset-search-engine: tools to set up an ElasticSearch instance fed with subsets of Wikidata
- wikidata-sdk: a JavaScript tool suite to query Wikidata and simplify its results
- wikidata-cli: read and edit Wikidata from the command line
MIT