Extracts SVG logos from Wikipedia InfoBoxes.
I already extract SVG logos from Wikipedia if they have "logo" in the file name, but there are valid SVG logos that do not. This is a way to get more of them.
The Wikipedia data is licensed CC-BY-SA.
They provide regular data dumps, which can be found on dumps.wikimedia.org. There is a latest page, and there is a dated page for each dump (for example, 20240920).
Example URLs:
- https://dumps.wikimedia.org/enwiki/20240920/enwiki-20240920-pages-articles-multistream1.xml-p1p41242.bz2
- https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream4.xml-p311330p558391.bz2
Parsing is non-trivial.
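To give a rough idea of what the parsing involves, here is a minimal sketch, not the actual code in this repo. It assumes the .bz2 file decompresses to a single MediaWiki XML export, and that the logo is given in an infobox parameter named "logo". The names find_svg_logo, iter_svg_logos and LOGO_RE are made up for this example, and real infoboxes use more parameter names and layouts than this regex covers.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches an infobox line such as "| logo = Example.svg" or
# "| logo = [[File:Example.svg|200px]]" and captures the .svg file name.
# Assumption: the parameter is literally named "logo".
LOGO_RE = re.compile(
    r"^[ \t]*\|[ \t]*logo[ \t]*=[ \t]*(?:\[\[)?(?:(?:File|Image):)?[ \t]*"
    r"([^|\]\n]+?\.svg)",
    re.IGNORECASE | re.MULTILINE,
)


def find_svg_logo(wikitext):
    """Return the SVG file name from a 'logo' infobox field, or None."""
    match = LOGO_RE.search(wikitext)
    return match.group(1).strip() if match else None


def iter_svg_logos(dump_path):
    """Yield (page title, SVG file name) pairs from a .bz2 dump file."""
    with bz2.open(dump_path, "rb") as stream:
        title = None
        # iterparse streams the XML so the whole dump is never held in memory.
        for _, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                logo = find_svg_logo(elem.text or "")
                if logo:
                    yield title, logo
            elif tag == "page":
                elem.clear()  # release the contents of a finished page
```

Usage would look something like this, with the path pointing at one of the downloaded dump files:

```python
for title, logo in iter_svg_logos("enwiki-20240920-pages-articles-multistream1.xml-p1p41242.bz2"):
    print(title, logo)
```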
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt