Addresses a bug in dumpgenerator/util.py. See mediawiki-client-tools#29:
TypeError: cannot use a string pattern on a bytes-like object
When this issue is addressed, this fork should be deleted and the DS-Import repo, which installs on Wikiteam3, should be updated to install elsiehupp/wikiteam3.
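For readers unfamiliar with this error, the following is a minimal sketch of how Python 3 raises it and two common ways around it. It is not the code from dumpgenerator/util.py; the sample payload and the pattern are made up for illustration.

```python
import re

# Hypothetical payload: in Python 3, data read from the network is often bytes.
raw = b"<title>Main Page</title>"
pattern = r"<title>(.*?)</title>"  # a str pattern

# Mixing a str pattern with bytes input raises the TypeError quoted above.
try:
    re.search(pattern, raw)
    message = None
except TypeError as error:
    message = str(error)

# Option 1: decode the bytes so pattern and input are both str.
title_from_text = re.search(pattern, raw.decode("utf-8")).group(1)

# Option 2: encode the pattern so pattern and input are both bytes.
title_from_bytes = re.search(pattern.encode("utf-8"), raw).group(1)

print(message)
print(title_from_text, title_from_bytes)
```

Which option fits depends on whether the surrounding code expects text or bytes; the actual fix in this fork may differ.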
We archive wikis, from Wikipedia to the tiniest wikis
wikiteam3 is an ongoing project to port the legacy wikiteam toolset to Python 3 and PyPI to make it more accessible for today's archivers. Most of the focus has been on the core dumpgenerator tool, but Python 3 versions of the other wikiteam tools may be added over time.
wikiteam3 is a set of tools for archiving wikis. The tools work on MediaWiki wikis, but the team hopes to expand to other wiki engines. As of 2020, WikiTeam has preserved more than 250,000 wikis, several wikifarms, regular Wikipedia dumps and 34 TB of Wikimedia Commons images.
The main general-purpose module of wikiteam3 is dumpgenerator, which can download XML dumps of MediaWiki sites that can then be parsed or redeployed elsewhere.
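As a rough idea of what "parsed" can look like, here is a minimal sketch that counts revisions per page using only the Python standard library. The inline sample is a hand-written stand-in for a MediaWiki XML export, not real dumpgenerator output, and the helper names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of a MediaWiki XML export.
# The namespace version in a real dump may differ.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Main Page</title>
    <revision><id>1</id><text>Hello</text></revision>
    <revision><id>2</id><text>Hello, world</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(xml_file):
    """Stream through a dump and return {page title: number of revisions}."""
    counts = {}
    title = None
    for _, element in ET.iterparse(xml_file, events=("end",)):
        name = local_name(element.tag)
        if name == "title":
            title = element.text
        elif name == "revision":
            counts[title] = counts.get(title, 0) + 1
        elif name == "page":
            element.clear()  # free memory once a page has been processed
    return counts


revision_counts = count_revisions(io.BytesIO(SAMPLE_DUMP.encode("utf-8")))
print(revision_counts)
```

For a real dump you would pass an open file instead of the in-memory sample; streaming with iterparse avoids loading a large dump into memory at once.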
wikiteam3 requires Python 3.8 or later (less than 4.0), but you may be able to get it to run with earlier versions of Python 3. On recent versions of Linux and macOS, Python 3.8 should come preinstalled, but on Windows you will need to install it from python.org.
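If you are unsure which Python you have, a quick check along these lines may help. This is only a sketch of the version range stated above; the function name is hypothetical and is not part of wikiteam3.

```python
import sys


def is_supported(version=sys.version_info):
    """Return True if the version falls in the stated range: >= 3.8 and < 4.0."""
    return (3, 8) <= tuple(version[:2]) < (4, 0)


print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Within the stated range:", is_supported())
```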
wikiteam3 has been tested on Linux, macOS, Windows and Android. If you are connecting to Linux or macOS via ssh, you can continue using the bash or zsh command prompt in the same terminal, but if you are starting in a desktop environment and don't already have a preferred Terminal environment you can try one of the following.
NOTE: You may need to update and pre-install dependencies in order for wikiteam3 to work properly. Shell commands for these dependencies appear below each item in the list. (Also note that while installing and running wikiteam3 itself should not require administrative privileges, installing dependencies usually will.)
-
On desktop Linux you can use the default terminal application such as Konsole or GNOME Terminal.
Linux Dependencies
While most Linux distributions will have Python 3 preinstalled, if you are cloning wikiteam3 rather than downloading it directly you may need to install git.
On Debian, Ubuntu, and the like:
sudo apt update && sudo apt upgrade && sudo apt install git
(On Fedora, Arch, etc., use dnf, pacman, etc., instead.)
-
On macOS you can use the built-in application Terminal, which is found in Applications/Utilities.
macOS Dependencies
While macOS will have Python 3 preinstalled, if you are cloning wikiteam3 rather than downloading it directly and you are using an older version of macOS, you may need to install git.
If git is not preinstalled, however, macOS will prompt you to install it the first time you run the command. Therefore, to check whether you have git installed or to install git, simply run git (with no arguments) in Terminal:
git
If git is already installed, it will print its usage instructions. If git is not preinstalled, the command will pop up a window asking if you want to install Apple's command line developer tools, and clicking "Install" in the popup window will install git.
-
On Windows 10 or Windows 11 you can use Windows Terminal.
Windows Dependencies
If you are already using the Windows Subsystem for Linux, you can follow the Linux instructions above. If you don't want to install a full WSL distribution, Git for Windows provides Bash emulation, so you can use it as a more lightweight option instead.
When installing Python 3.8 (from python.org), be sure to check "Add Python to PATH" so that installed Python scripts are accessible from any location. If for some reason installed Python scripts, e.g. pip, are not available from any location, you can add Python to the PATH environment variable using the instructions here.
Doing so should not be necessary if you follow the instructions further down and install wikiteam3 using pip. However, if you'd prefer that Windows store installed Python scripts somewhere other than the default Python folder under %appdata%, you can also add your preferred alternative path, such as C:\Program Files\Python3\Scripts\ or a subfolder of My Documents. (You will need to restart any terminal sessions in order for this to take effect.)
Whenever you'd like to run a Bash session, you can open a Bash terminal prompt from any folder in Windows Explorer by right-clicking and choosing the option from the context menu. (For some purposes you may wish to run Bash as an administrator.) This way you can open a Bash prompt and clone the wikiteam3 repository in one location, and later open another Bash prompt and run wikiteam3 to dump a wiki wherever else you'd like without having to browse to the directory manually using Bash.
-
On Android you can use Termux.
Termux Dependencies
pkg update && pkg upgrade && pkg install git libxslt python
-
On iOS you can use iSH.
iSH Dependencies
apk update && apk upgrade && apk add git py3-pip
Note: iSH may automatically quit if your iOS device goes to sleep, and it may lose its status if you switch to another app. You can disable auto-sleep while iSH is running by clicking the gear icon and toggling "Disable Screen Dimming". (You may wish to connect your device to a charger while running iSH.)
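Whichever platform you are on, you can confirm that the dependencies above ended up on your PATH before continuing. The following is a small sketch using only the standard library; the function name and the default tool list are assumptions for illustration (on Windows, for example, the interpreter may be named python rather than python3).

```python
import shutil


def missing_tools(tools=("git", "python3", "pip")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```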
The Python 3 port of the dumpgenerator module of wikiteam3 is largely functional and can be installed from a downloaded or cloned copy of this repository.
There are two versions of these instructions:
- If you just want to use a version that mostly works
- If you want to follow my progress and help me test my latest commit
If you run into a problem with the version that mostly works, you can open an Issue. Be sure to include the following:
- The operating system you're using
- What command you ran that didn't work
- What output was printed to your terminal
In whatever folder you use for cloned repositories:
git clone https://github.com/elsiehupp/wikiteam3.git
cd wikiteam3
git checkout --track origin/python3
pip install --force-reinstall dist/*.whl
dumpgenerator [args]
pip uninstall wikiteam3
rm -r [cloned_wikiteam3_folder]
If you'd like to manually build and install wikiteam3 from a cloned or downloaded copy of this repository, run the following commands from the downloaded base directory:
curl -sSL https://install.python-poetry.org | python3 -
poetry install
poetry build
pip install --force-reinstall dist/*.whl
In either case, to uninstall wikiteam3 run this command (from any local directory):
pip uninstall wikiteam3
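To confirm whether an install or uninstall took effect, you can ask Python which version of the package it sees. This sketch assumes the distribution is registered under the name wikiteam3 (the name used with pip above); the function name is hypothetical.

```python
from importlib import metadata


def installed_version(distribution="wikiteam3"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("wikiteam3 is not installed" if version is None else f"wikiteam3 {version}")
```

Run it with the same Python interpreter that pip installs into; otherwise the result may not reflect what you just installed.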
Note: this branch may not actually work at any given time!
1. Install Python Poetry
curl -sSL https://install.python-poetry.org | python3 -
Note: if you get an SSL error, you may need to follow the instructions here.
git clone git@github.com:elsiehupp/wikiteam3.git
or
git clone https://github.com/elsiehupp/wikiteam3.git
then:
cd wikiteam3
git checkout --track origin/prepare-for-publication
Note: Re-run the following steps to reinstall each time the wikiteam3 branch is updated.
git pull
poetry update && poetry install && poetry build
pip install --force-reinstall dist/*.whl
dumpgenerator [args]
To run the test suite, run:
test-dumpgenerator
pip uninstall wikiteam3
rm -r [cloned_wikiteam3_folder]
After installing wikiteam3 using pip you should be able to use the dumpgenerator command from any local directory.
For basic usage, you can run dumpgenerator in the directory where you'd like the download to be.
For a brief summary of the dumpgenerator command-line options:
dumpgenerator --help
Several examples follow.
Note: the \ and line breaks in the examples below are for legibility in this documentation. dumpgenerator can also be run with the arguments in a single line and separated by a single space each.
dumpgenerator \
http://wiki.domain.org \
--xml \
--images
If the script can't find the api.php and/or index.php paths by itself, you can provide them:
dumpgenerator \
--api http://wiki.domain.org/w/api.php \
--xml \
--images
dumpgenerator \
--api http://wiki.domain.org/w/api.php \
--index http://wiki.domain.org/w/index.php \
--xml \
--images
If you only want the XML histories, just use --xml. For only the images, just --images. For only the current version of every page, --xml --current.
dumpgenerator \
--api http://wiki.domain.org/w/api.php \
--xml \
--images \
--resume \
--path=/path/to/incomplete-dump
In the above example, --path is only necessary if the download path is not the default.
dumpgenerator will also ask you if you want to resume if it finds an incomplete dump in the path where it is downloading.
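If you want to drive dumpgenerator from a script, for example to archive several wikis in a row, one approach is to assemble the argument list in Python. The sketch below only uses the options shown in the examples above; the helper name and its parameters are hypothetical, and it builds the command without running it.

```python
import shlex


def build_command(url=None, api=None, index=None, xml=False, images=False,
                  current=False, resume=False, path=None):
    """Assemble a dumpgenerator argument list from the options shown above."""
    command = ["dumpgenerator"]
    if url:
        command.append(url)
    if api:
        command += ["--api", api]
    if index:
        command += ["--index", index]
    for enabled, flag in ((xml, "--xml"), (images, "--images"),
                          (current, "--current"), (resume, "--resume")):
        if enabled:
            command.append(flag)
    if path:
        command.append(f"--path={path}")
    return command


resume_command = build_command(
    api="http://wiki.domain.org/w/api.php",
    xml=True,
    images=True,
    resume=True,
    path="/path/to/incomplete-dump",
)

# The single-line form of the same command, with shell quoting where needed.
one_line = shlex.join(resume_command)
print(one_line)
```

To actually run it, you could pass the list to subprocess.run(resume_command, check=True) from the directory where you want the dump to be stored.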
WikiTeam is the Archive Team [GitHub] subcommittee on wikis.
It was founded and originally developed by Emilio J. Rodríguez-Posada, a veteran Wikipedia editor and amateur archivist. Thanks to everyone who has helped, especially Federico Leva, Alex Buie, Scott Boyd, Hydriz, Platonides, Ian McEwen, Mike Dupont, balr0g and PiRSquared17.
The Python 3 initiative is currently being led by Elsie Hupp, with contributions from Victor Gambier and Thomas Karcher.