docs: update README with installation instructions and available Make targets
simon-clematide committed Jan 2, 2025
1 parent a52bdb4 commit 7396216
Showing 1 changed file with 37 additions and 18 deletions.
55 changes: 37 additions & 18 deletions README.md
@@ -20,20 +20,14 @@ und unknown licencing status. We set the version and name in the model's meta da
 ## Prerequisites

 The build process has been tested on modern Linux and macOS systems and requires
-Python 3.11. Under Ubuntu/Debian
-, make sure to have the following packages installed:
+Python 3.11. Under Ubuntu/Debian, make sure to have the following packages installed:

 ```sh
-# install python3.11 according to your OS
-sudo apt update
-sudo apt upgrade -y
-sudo add-apt-repository ppa:deadsnakes/ppa
-sudo apt update
-sudo apt install python3.11 -y
-sudo apt install python3.11-distutils -y
-curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
-sudo apt install git git-lfs make moreutils coreutils parallel # needed for building
-sudo apt jq # needed for computing statistics
+# install linux tools and python3.11 according on Debian/Ubuntu
+sudo bash cookbook/install_apt.sh
+
+# on macos with brew
+sudo bash cookbook/install_brew.sh
 ```

 This repository uses `pipenv`.
@@ -74,21 +68,40 @@ cp config.local.mk.sample config.local.mk
 edit config.local.mk
 ```

-## Running the pipeline
+## Available Make targets
+
+The build process is controlled by the `Makefile`. Main targets include:
+
+```sh
+make help # show available targets
+make setup # initialize development environment
+make newspaper -j N # process specific newspaper/year pairs in parallel
+make collection # process all newspapers
+make clean # clean build artifacts
+make distclean # remove all generated files
+```
+
+## Processing options

-The build process is controlled by the `Makefile`.
+For newspaper processing, several options are available:

 ```sh
-make help # show available targets
+# Process with specific parallelism
+make newspaper MAKE_PARALLEL_OPTION=16
+
+# Process specific newspapers
+make newspaper NEWSPAPERS="GDL IMP"

-make newspaper -j N # process specific newspaper/year pairs in parallel typically for testing
+# Process specific years
+make newspaper YEARS="1900 1901"

-make collection MAKE_PARALLEL_OPTION=16 # process all newspapers using parallel processing within newspaper/year pairs
+# Combine options
+make newspaper NEWSPAPERS="GDL" YEARS="1900" MAKE_PARALLEL_OPTION=8
 ```

 ## Command-Line Options for `spacy_linguistic_processing.py`

-The `spacy_linguistic_processing.py` script supports several command-line options:
+The `lib/spacy_linguistic_processing.py` script supports several command-line options:

 - `--lid`: Path to the language identification file.
 - `--language`: Specify a language code to use for all items.
@@ -101,6 +114,12 @@ The `spacy_linguistic_processing.py` script supports several command-line option
 - `--s3-output-path`: S3 path to upload the output file after processing or check if it already exists.
 - `--keep-timestamp-only`: After uploading to S3, keep only the timestamp of the local output file for data efficiency.

+## Build System Structure
+
+The build system is organized into several make include files:
+
+- `config.local.mk`: Local configuration overrides (not in the repository)
+
 # Uploading to impresso S3 bucket

 Ensure that the environment variables `SE_ACCESS_KEY` and `SE_SECRET_KEY` for access to the

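Note on the S3 section at the end of this diff: the README asks that `SE_ACCESS_KEY` and `SE_SECRET_KEY` be set before uploading. A minimal pre-flight sketch of that check might look like the following. `check_s3_env` is a hypothetical helper, not something defined in the repository, and it only verifies that both variables are non-empty, not that the credentials are valid.

```sh
#!/bin/sh
# Sketch only: check_s3_env is a hypothetical helper, not part of the repository.
# It reports which of the two variables named in the README are unset or empty.
check_s3_env() {
    missing=""
    for name in SE_ACCESS_KEY SE_SECRET_KEY; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing environment variables:$missing" >&2
        return 1
    fi
    return 0
}
```

One possible use, assuming the run uploads to S3, is to chain the check before a build, for example `check_s3_env && make newspaper NEWSPAPERS="GDL" YEARS="1900"`, so that a long run does not fail only at upload time.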