wd2duckdb
is a tool transforming
Wikidata JSON dumps
into a fully indexed DuckDB database ~80% smaller than the original
dump, yet contains most of its information. Note that only the English version of
the Wikidata items are stored. For you to change so, please refer to
this line in the code.
The resulting database enables high-performance queries to be executed on commodity
hardware without the need to install and configure specialized triplestore software.
This project is heavily based on wd2sql.
Make sure that you install the latest stable version of Rust; that is, as of May the 5th, version 1.69 or later, then run:
cargo install wd2duckdb
This will compile wd2duckdb
for your native architecture, increasing the performance.
wd2duckdb --json <JSON_FILE> --database <DUCKDB_FILE>
Use -
as <JSON_FILE>
to read from standard input instead of from a file.
This makes it possible to build a pipeline that processes JSON data as it is
being decompressed, without having to decompress the full dump to disk. In case
of a .bz2
file, you can use the following instruction:
bzcat latest-all.json.bz2 | wd2duckdb --json - --database <DUCKDB_FILE>
In case of a .gz
compressed file, the following is required:
gunzip latest-all.json.gz | wd2duckdb --json - --database <DUCKDB_FILE>
In case you want to write changes directly to the standard ouput; that is, without
creating a file for the uncompressed .json
, you can do the following:
gunzip -c latest-all.json.gz | wd2duckdb --json - --database <DUCKDB_FILE>
If you are working with large dumps where the uncompressed .json
file size is in
the order of Terabytes, it is best to choose the last option. The .duckdb
file,
which is more memory-efficient, may thus be created immediately.
Without the efforts of the countless people who built Wikidata and its
contents, wd2duckdb
would be useless. It's truly impossible to praise
this amazing open data project enough.
- wd2sql is this project's main inspiration.
Copyright © 2023 Ángel Iglesias Préstamo (angel.iglesias.prestamo@gmail.com)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
By contributing to this project, you agree to release your contributions under the same license.