PIXL

PIXL Image eXtraction Laboratory

PIXL is a system for extracting, linking and de-identifying DICOM imaging data, structured EHR data and free-text data from radiology reports at UCLH. Please see the rolling-skeleton for more details.

PIXL is intended to run on one of the GAEs (General Application Environments) and comprises several services orchestrated by Docker Compose.

To get access to the GAE, see the documentation on Slab. Please request access to Slab and add further details in a new blank issue.

Installation

Install the PIXL Python modules by running the following commands from the top-level PIXL/ directory:

python -m pip install -e pixl_core/
python -m pip install -e cli/

Note, the CLI currently needs to be installed in editable mode.

Development

Follow the developer setup instructions.

Before raising a PR, make sure to run the tests for every PIXL module, not just the one you have been working on. In addition, make sure to have pre-commit installed to automatically check your code before committing.

Design

docs/design contains the design documentation for the PIXL system.

Services

PIXL core

The core module contains the functionality shared by the other PIXL modules.

PIXL CLI

Primary interface to the PIXL system.

Hasher API

HTTP API to securely hash an identifier using a key stored in Azure Key Vault.

Orthanc

Orthanc Raw

A DICOM node which receives images from the upstream hospital systems and acts as a cache for PIXL.

Orthanc Anon

A DICOM node which wraps our de-identification process and uploads the images to their final destination.

PIXL DICOM de-identifier

Provides helper functions for de-identifying DICOM data.

PostgreSQL

RDBMS which stores DICOM metadata, application data and anonymised patient record data.

Export API

HTTP API to export files (parquet and DICOM) from UCLH to endpoints.

Image Extractor

HTTP API to process messages from the imaging queue and populate the raw Orthanc instance with images from PACS/VNA.

Setup `PIXL` in GAE

0. UCLH infrastructure setup

1. Choose deployment environment

This is one of dev|test|staging|prod and referred to as <environment> in the docs.

2. Initialise environment configuration

Create a local .env file in the PIXL directory:

cp .env.sample .env

Add the missing configuration values to the new file:

Credentials

PIXL_DB_*: credentials for the containerised PostgreSQL service, which are set in the official PostgreSQL image. Use a strong password for prod deployments. For other environments the only requirement is consistency, because several services interact with the database.

Ports

Most services need to expose ports that must be mapped to ports on the host. The host port is specified in .env. Ports need to be configured so that they don't clash with any other application running on that GAE.
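
As a quick sanity check before starting the services, duplicated host ports inside .env itself can be spotted with a short script. This is only a sketch: it assumes port variables are named with a `_PORT` suffix (as ORTHANC_RAW_WEB_PORT is), the sample values are made up, and it cannot detect ports already taken by other applications on the GAE.

```python
from collections import defaultdict


def find_port_clashes(env_text: str) -> dict[str, list[str]]:
    """Return {port: [variable names]} for ports assigned to more than one variable."""
    ports = defaultdict(list)
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("\"'")
        # Assumption: every port variable name ends in _PORT.
        if key.endswith("_PORT") and value:
            ports[value].append(key)
    return {port: names for port, names in ports.items() if len(names) > 1}


# Hypothetical .env content, for illustration only.
sample = """
ORTHANC_RAW_WEB_PORT=7001
EXAMPLE_API_PORT=7001
OTHER_PORT=7002
"""
clashes = find_port_clashes(sample)
```

In practice the text would come from reading the .env file in the PIXL directory rather than from an inline string.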

Storage size

The maximum storage size of the orthanc-raw instance can be configured through the ORTHANC_RAW_MAXIMUM_STORAGE_SIZE environment variable in .env. This limits the storage size to the specified value (in MB). When the storage is full Orthanc will automatically recycle older studies in favour of new ones.

3. Configure a new project

To configure a new project, follow these steps:

Create a new git branch from main

git checkout main
git pull
git switch -c <branch-name>

Copy the template_config.yaml file to a new file in the projects/config directory and fill in the details.
The filename of the project config should be <project-slug>.yaml

[!NOTE] The project slug should match the slugify-ed project name in the extract_summary.json log file!

Open a PR in PIXL to merge the new project config into main
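
To illustrate what a slugify-ed name looks like, the sketch below derives a config filename from a project name using only the standard library. It approximates common slugification rules (ASCII only, lowercase, hyphen-separated); the slugify implementation PIXL actually uses may treat edge cases differently, and `expected_config_filename` is a hypothetical helper, not part of PIXL.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Approximate slugification: ASCII only, lowercase, hyphen-separated."""
    # Decompose accented characters and drop anything outside ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def expected_config_filename(project_name: str) -> str:
    """Hypothetical helper: the <project-slug>.yaml name for a project."""
    return f"{slugify(project_name)}.yaml"


filename = expected_config_filename("My Test Project")  # "my-test-project.yaml"
```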

The config YAML file

The configuration file defines:

Project name: the <project-slug> name of the Project
The DICOM dataset modalities to retain (e.g. ["DX", "CR"] for X-Ray studies)
The anonymisation operations to be applied to the DICOM tags, by providing a file path to one or multiple YAML files. We currently allow two types of files:
- base: the base set of DICOM tags to be retained in the anonymised dataset
- manufacturer_overrides: any manufacturer-specific overrides to the base set of DICOM tags. This is useful for manufacturers that store sensitive information in non-standard DICOM tags. Multiple manufacturers can be specified in the YAML file as follows:
```
- manufacturer: "Philips"
  tags:
  - group: 0x2001
    element: 0x1003
    op: "keep"
    # ...
- manufacturer: "Siemens"
  tags:
  - group: 0x0019
    element: 0x100c
    op: "keep"
    # ...
```
The endpoints used to upload the anonymised DICOM data and the public and radiology parquet files. We currently support the following endpoints:
- "none": no upload
- "ftps": a secure FTP server (for both DICOM and parquet files)
- "dicomweb": a DICOMweb server (for DICOM files only). Requires the DICOMWEB_* environment variables to be set in .env
- "xnat": an XNAT instance (for DICOM files only)
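
The relationship between the base and manufacturer_overrides files can be pictured as a merge keyed on each tag's (group, element) pair. The sketch below is not PIXL's implementation. It assumes that the base file holds a flat list of tag entries shaped like those in the overrides example, that manufacturers are matched by exact name, and that an override replaces a base entry for the same tag.

```python
def resolve_tag_operations(base, overrides, manufacturer):
    """Combine base tag operations with the overrides for one manufacturer."""
    # Key each operation by (group, element) so an override can replace a base entry.
    resolved = {(tag["group"], tag["element"]): tag["op"] for tag in base}
    for entry in overrides:
        # Assumption: exact string match on the manufacturer name.
        if entry["manufacturer"] == manufacturer:
            for tag in entry["tags"]:
                resolved[(tag["group"], tag["element"])] = tag["op"]
    return resolved


# Hypothetical base entry; the overrides mirror the YAML example in this section.
base = [{"group": 0x0008, "element": 0x0060, "op": "keep"}]
overrides = [
    {"manufacturer": "Philips", "tags": [{"group": 0x2001, "element": 0x1003, "op": "keep"}]},
    {"manufacturer": "Siemens", "tags": [{"group": 0x0019, "element": 0x100C, "op": "keep"}]},
]
philips_ops = resolve_tag_operations(base, overrides, "Philips")
```

Here `philips_ops` contains the base tag plus the Philips private tag, while the Siemens entry is ignored.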

Project secrets

Any credentials required for uploading the project's results should be stored in an Azure Key Vault (set up instructions below). PIXL will query this key vault for the required secrets at runtime. This requires the following environment variables to be set so that PIXL can connect to the key vault:

EXPORT_AZ_CLIENT_ID: the service principal's client ID, mapped to AZURE_CLIENT_ID in docker-compose
EXPORT_AZ_CLIENT_PASSWORD: the password, mapped to AZURE_CLIENT_SECRET in docker-compose
EXPORT_AZ_TENANT_ID: ID of the service principal's tenant. Also called its 'directory' ID. Mapped to AZURE_TENANT_ID in docker-compose
EXPORT_AZ_KEY_VAULT_NAME: the name of the key vault, used to connect to the correct key vault

These variables can be set in the .env file. For testing, they can be set in the test/.secrets.env file. For dev purposes find the pixl-dev-secrets.env note on LastPass for the necessary values.

If an Azure Key Vault hasn't been set up yet, follow these instructions.

A second Azure Key Vault is used to store hashing keys and salts for the hasher service. This key vault is configured with the following environment variables:

HASHER_API_AZ_CLIENT_ID: the service principal's client ID, mapped to AZURE_CLIENT_ID in docker-compose
HASHER_API_AZ_CLIENT_PASSWORD: the password, mapped to AZURE_CLIENT_SECRET in docker-compose
HASHER_API_AZ_TENANT_ID: ID of the service principal's tenant. Also called its 'directory' ID. Mapped to AZURE_TENANT_ID in docker-compose
HASHER_API_AZ_KEY_VAULT_NAME: the name of the key vault, used to connect to the correct key vault

See the hasher documentation for more information.
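
The variable mappings listed above for both key vaults follow the same pattern, which the sketch below reproduces in Python as a way to check a set of values before deployment. It is illustrative only: in PIXL the mapping is done in docker-compose, and `azure_env` is a hypothetical helper.

```python
AZURE_SUFFIXES = {
    "CLIENT_ID": "AZURE_CLIENT_ID",
    "CLIENT_PASSWORD": "AZURE_CLIENT_SECRET",
    "TENANT_ID": "AZURE_TENANT_ID",
}


def azure_env(prefix: str, env: dict[str, str]) -> dict[str, str]:
    """Translate <prefix>_* variables into AZURE_* names, failing on missing values."""
    # The key vault name has no AZURE_* counterpart but must still be present.
    required = [f"{prefix}_{suffix}" for suffix in (*AZURE_SUFFIXES, "KEY_VAULT_NAME")]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {target: env[f"{prefix}_{suffix}"] for suffix, target in AZURE_SUFFIXES.items()}


# Placeholder values for illustration; real values belong in .env.
example_env = {
    "EXPORT_AZ_CLIENT_ID": "client-id",
    "EXPORT_AZ_CLIENT_PASSWORD": "client-secret",
    "EXPORT_AZ_TENANT_ID": "tenant-id",
    "EXPORT_AZ_KEY_VAULT_NAME": "vault-name",
}
mapped = azure_env("EXPORT_AZ", example_env)
```

Passing `"HASHER_API_AZ"` as the prefix (with `os.environ` as the second argument) would perform the same check for the hasher's key vault.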

Run `PIXL` in GAE

Start

From the PIXL directory:

pixl dc up

Once the services are running, you can interact with the services using the pixl CLI.

Stop

From the PIXL directory:

pixl dc down  # --volumes to remove all data volumes

Analysis

The number of DICOM instances in the raw Orthanc instance can be accessed from http://<pixl_host>:<ORTHANC_RAW_WEB_PORT>/ui/app/#/settings and similarly with the Orthanc Anon instance, where pixl_host is the host of the PIXL services and ORTHANC_RAW_WEB_PORT is defined in .env.

The imaging export progress can be interrogated by connecting to the PIXL database with a database client (e.g. DBeaver), using the connection parameters defined in .env.

Assumptions

PIXL data extracts rely on the following assumptions:

(MRN, Accession number) is a unique identifier for a report/DICOM study pair
Patients have a single relevant MRN
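
The first assumption can be checked on extract data before it is passed to PIXL. The following is a minimal sketch, assuming the rows are available as dictionaries with hypothetical `mrn` and `accession_number` keys; the column names in a real extract may differ.

```python
from collections import Counter


def duplicated_study_keys(rows):
    """Return the (MRN, accession number) pairs that occur more than once."""
    counts = Counter((row["mrn"], row["accession_number"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up identifiers, for illustration only.
rows = [
    {"mrn": "100", "accession_number": "A1"},
    {"mrn": "100", "accession_number": "A2"},
    {"mrn": "100", "accession_number": "A1"},
]
duplicates = duplicated_study_keys(rows)  # [("100", "A1")]
```

An empty result means every pair identifies exactly one report/DICOM study pair in the data that was checked.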

File journey overview

Files that are present at each step of the pipeline.

A more detailed description of the relevant file types is available in docs/file_types/parquet_files.md.

Resources in source repo (for test only)

test/resources/omop/public /*.parquet
....................private/*.parquet
....................extract_summary.json

OMOP ES extract dir (input to PIXL)

EXTRACT_DIR is the directory passed to pixl populate as the input PARQUET_PATH argument.

EXTRACT_DIR/public /*.parquet
............private/*.parquet
............extract_summary.json

PIXL Export dir (PIXL intermediate)

The directory where PIXL will copy the public OMOP extract files (which now contain the radiology reports) to. These files will subsequently be uploaded to the parquet destination specified in the project config.

EXPORT_ROOT/PROJECT_SLUG/all_extracts/EXTRACT_DATETIME/radiology/radiology.parquet
....................................................../omop/public/*.parquet

Destination

FTP server

If the parquet destination is set to ftps, the public extract files and radiology report will be uploaded to the FTP server at the following path:

FTPROOT/PROJECT_SLUG/EXTRACT_DATETIME/parquet/radiology/radiology.parquet
..............................................omop/public/*.parquet
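
The export and FTP layouts shown above differ only in their intermediate directories, which the sketch below makes explicit. The function names are hypothetical and the extract datetime is treated as an opaque string, because its exact format is not specified here.

```python
from pathlib import PurePosixPath


def export_paths(export_root: str, project_slug: str, extract_datetime: str) -> dict:
    """Paths inside the PIXL export directory for one extract."""
    base = PurePosixPath(export_root) / project_slug / "all_extracts" / extract_datetime
    return {
        "radiology": base / "radiology" / "radiology.parquet",
        "omop_public": base / "omop" / "public",
    }


def ftp_paths(ftp_root: str, project_slug: str, extract_datetime: str) -> dict:
    """Paths on the FTP server for the same extract."""
    base = PurePosixPath(ftp_root) / project_slug / extract_datetime / "parquet"
    return {
        "radiology": base / "radiology" / "radiology.parquet",
        "omop_public": base / "omop" / "public",
    }


# Placeholder arguments, for illustration only.
local = export_paths("EXPORT_ROOT", "my-project", "EXTRACT_DATETIME")
remote = ftp_paths("FTPROOT", "my-project", "EXTRACT_DATETIME")
```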

Cloning repository

Generate your SSH keys as suggested here
Clone the repository by typing (or copying) the following lines in a terminal

git clone git@github.com:SAFEHR-data/PIXL.git
