PIXL Image eXtraction Laboratory
PIXL is a system for extracting, linking and de-identifying DICOM imaging data, structured EHR data and free-text data from radiology reports at UCLH.
Please see the rolling-skeleton for more details.
PIXL is intended to run on one of the GAEs (General Application Environments) and comprises several services orchestrated by Docker Compose.
To get access to the GAE, see the documentation on Slab. Please request access to Slab and add further details in a new blank issue.
Install the PIXL Python modules by running the following commands from the top-level PIXL/
directory:
python -m pip install -e pixl_core/
python -m pip install -e cli/
Note, the CLI currently needs to be installed in editable mode.
Follow the developer setup instructions.
Before raising a PR, make sure to run the tests for every PIXL module, not just the one you
have been working on. In addition, make sure to have pre-commit installed
to automatically check your code before committing.
docs/design
contains the design documentation for the PIXL system.
The core
module contains the functionality shared by the other PIXL modules.
Primary interface to the PIXL system.
HTTP API to securely hash an identifier using a key stored in Azure Key Vault.
A DICOM node which receives images from the upstream hospital systems and acts as cache for PIXL.
A DICOM node which wraps our de-identification process and the uploading of the images to their final destination.
Provides helper functions for de-identifying DICOM data.
RDBMS which stores DICOM metadata, application data and anonymised patient record data.
HTTP API to export files (parquet and DICOM) from UCLH to endpoints.
HTTP API to process messages from the imaging
queue and populate the raw orthanc instance with images from PACS/VNA.
The deployment environment is one of dev|test|staging|prod and is referred to as <environment> in the docs.
Create a local .env
file in the PIXL directory:
cp .env.sample .env
Add the missing configuration values to the new file:
PIXL_DB_*: These are the credentials for the containerised PostgreSQL service and are set in the official PostgreSQL image. Use a strong password for prod deployment; for other environments the only requirement is consistency, as several services interact with the database.
Most services need to expose ports that must be mapped to ports on the host. The host port is specified in .env.
Ports need to be configured so that they don't clash with any other application running on that GAE.
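Before settling on host ports, it can help to confirm that nothing on the GAE already occupies them. The following is a minimal sketch that uses only the Python standard library and is not part of PIXL; the port number in the usage example is an arbitrary placeholder, not a PIXL default.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError if another process holds the port.
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    candidate_port = 7001  # hypothetical value you intend to put in .env
    state = "free" if port_is_free(candidate_port) else "already in use"
    print(f"Port {candidate_port} is {state}")
```

A port that is free now can still be claimed by another application later, so treat the result as a quick check rather than a guarantee.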
The maximum storage size of the orthanc-raw instance can be configured through the
ORTHANC_RAW_MAXIMUM_STORAGE_SIZE environment variable in .env. This limits the storage size to
the specified value (in MB). When the storage is full, Orthanc will automatically recycle older
studies in favour of new ones.
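Since .env starts out as a copy of .env.sample, it is easy to leave a value blank. The sketch below lists keys with empty values in .env-style text. It is a deliberately simple parser and not part of PIXL: it does not handle inline comments or multi-line values, and the PIXL_DB_USER and PIXL_DB_PASSWORD names in the usage example are hypothetical stand-ins for the PIXL_DB_* variables.

```python
def blank_env_keys(env_text: str) -> list[str]:
    """Return the keys in .env-style text whose value is empty."""
    blank = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip empty lines, comments and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat KEY=, KEY="" and KEY='' as blank.
        if not value.strip().strip("'\""):
            blank.append(key.strip())
    return blank


if __name__ == "__main__":
    # Hypothetical excerpt; real variable names come from .env.sample.
    sample = "PIXL_DB_USER=pixl\nPIXL_DB_PASSWORD=\nORTHANC_RAW_MAXIMUM_STORAGE_SIZE=100\n"
    print(blank_env_keys(sample))
```

To check a real file, pass the contents of .env, for example `blank_env_keys(open(".env").read())`.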
To configure a new project, follow these steps:
- Create a new git branch from main:
  git checkout main
  git pull
  git switch -c <branch-name>
- Copy the template_config.yaml file to a new file in the projects/config directory and fill in the details.
- The filename of the project config should be <project-slug>.yaml.
  Note: the project slug should match the slugify-ed project name in the extract_summary.json log file!
- Open a PR in PIXL to merge the new project config into main.
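To illustrate the slug rule above, here is a minimal sketch of how a project name might be turned into a slug and a config filename. It approximates common slugify behaviour using the standard library only; the slugify function that PIXL actually uses may treat edge cases differently, so the slug recorded in extract_summary.json remains the reference. The project name in the usage example is made up.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lower-case the name and join its alphanumeric runs with hyphens."""
    # Decompose accented characters, then drop anything outside ASCII.
    normalised = unicodedata.normalize("NFKD", name)
    ascii_only = normalised.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def config_filename(project_name: str) -> str:
    """Return the expected config filename, <project-slug>.yaml."""
    return f"{slugify(project_name)}.yaml"


if __name__ == "__main__":
    print(config_filename("My Test Project"))  # my-test-project.yaml
```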
The config YAML file
The configuration file defines:
- Project name: the <project-slug> name of the project
- The DICOM dataset modalities to retain (e.g. ["DX", "CR"] for X-Ray studies)
- The anonymisation operations to be applied to the DICOM tags, by providing a file path to one or multiple YAML files. We currently allow two types of files:
  - base: the base set of DICOM tags to be retained in the anonymised dataset
  - manufacturer_overrides: any manufacturer-specific overrides to the base set of DICOM tags. This is useful for manufacturers that store sensitive information in non-standard DICOM tags. Multiple manufacturers can be specified in the YAML file as follows:

    - manufacturer: "Philips"
      tags:
        - group: 0x2001
          element: 0x1003
          op: "keep"
        # ...
    - manufacturer: "Siemens"
      tags:
        - group: 0x0019
          element: 0x100c
          op: "keep"
        # ...

- The endpoints used to upload the anonymised DICOM data and the public and radiology parquet files. We currently support the following endpoints:
  - "none": no upload
  - "ftps": a secure FTP server (for both DICOM and parquet files)
  - "dicomweb": a DICOMweb server (for DICOM files only). Requires the DICOMWEB_* environment variables to be set in .env
  - "xnat": an XNAT instance (for DICOM files only)
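As an illustration of how the manufacturer overrides structure can be read once the YAML has been loaded, the sketch below looks up the operation for a given tag. The data mirrors the example entries above as plain Python objects. The override_op helper and its exact-match comparison on the manufacturer name are assumptions made for this example and do not describe PIXL's own matching logic.

```python
# Parsed form of the manufacturer_overrides example shown above.
OVERRIDES = [
    {
        "manufacturer": "Philips",
        "tags": [{"group": 0x2001, "element": 0x1003, "op": "keep"}],
    },
    {
        "manufacturer": "Siemens",
        "tags": [{"group": 0x0019, "element": 0x100C, "op": "keep"}],
    },
]


def override_op(overrides, manufacturer, group, element):
    """Return the override operation for a tag, or None if there is none.

    Hypothetical helper: the manufacturer is compared by exact string match.
    """
    for entry in overrides:
        if entry["manufacturer"] != manufacturer:
            continue
        for tag in entry["tags"]:
            if tag["group"] == group and tag["element"] == element:
                return tag["op"]
    return None


if __name__ == "__main__":
    print(override_op(OVERRIDES, "Philips", 0x2001, 0x1003))  # keep
    print(override_op(OVERRIDES, "Siemens", 0x2001, 0x1003))  # None
```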
Project secrets
Any credentials required for uploading the project's results should be stored in an Azure Key Vault (set up instructions below). PIXL will query this key vault for the required secrets at runtime. This requires the following environment variables to be set so that PIXL can connect to the key vault:
- EXPORT_AZ_CLIENT_ID: the service principal's client ID, mapped to AZURE_CLIENT_ID in docker-compose
- EXPORT_AZ_CLIENT_PASSWORD: the password, mapped to AZURE_CLIENT_SECRET in docker-compose
- EXPORT_AZ_TENANT_ID: ID of the service principal's tenant. Also called its 'directory' ID. Mapped to AZURE_TENANT_ID in docker-compose
- EXPORT_AZ_KEY_VAULT_NAME: the name of the key vault, used to connect to the correct key vault
These variables can be set in the .env
file.
For testing, they can be set in the test/.secrets.env
file.
For dev purposes find the pixl-dev-secrets.env
note on LastPass for the necessary values.
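A quick way to confirm that the key vault variables are present before starting the services is sketched below. It only checks that each variable is set to a non-empty value; it does not contact Azure or validate the credentials. The missing_vars helper is an illustrative name and not part of the PIXL CLI.

```python
import os

# Variable names as listed above for the export key vault.
EXPORT_KEY_VAULT_VARS = (
    "EXPORT_AZ_CLIENT_ID",
    "EXPORT_AZ_CLIENT_PASSWORD",
    "EXPORT_AZ_TENANT_ID",
    "EXPORT_AZ_KEY_VAULT_NAME",
)


def missing_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(EXPORT_KEY_VAULT_VARS)
    if missing:
        print("Missing key vault settings:", ", ".join(missing))
    else:
        print("All export key vault settings are present")
```

Note that this reads the process environment, so the values from .env need to be exported into the shell (or passed in as a dictionary) for the check to see them.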
If an Azure Key Vault hasn't been set up yet, follow these instructions.
A second Azure Key Vault is used to store hashing keys and salts for the hasher service.
This key vault is configured with the following environment variables:
- HASHER_API_AZ_CLIENT_ID: the service principal's client ID, mapped to AZURE_CLIENT_ID in docker-compose
- HASHER_API_AZ_CLIENT_PASSWORD: the password, mapped to AZURE_CLIENT_SECRET in docker-compose
- HASHER_API_AZ_TENANT_ID: ID of the service principal's tenant. Also called its 'directory' ID. Mapped to AZURE_TENANT_ID in docker-compose
- HASHER_API_AZ_KEY_VAULT_NAME: the name of the key vault, used to connect to the correct key vault
See the hasher documentation for more information.
To start the services, run from the PIXL directory:
pixl dc up
Once the services are running, you can interact with the services using the pixl
CLI.
To stop the services, run from the PIXL directory:
pixl dc down # --volumes to remove all data volumes
The number of DICOM instances in the raw Orthanc instance can be accessed from
http://<pixl_host>:<ORTHANC_RAW_WEB_PORT>/ui/app/#/settings
and similarly for the Orthanc Anon instance, where pixl_host is the host of the PIXL services
and ORTHANC_RAW_WEB_PORT is defined in .env.
The imaging export progress can be interrogated by connecting to the PIXL
database with a database client (e.g. DBeaver), using
the connection parameters defined in .env.
PIXL data extracts rely on the following assumptions:
- (MRN, Accession number) is a unique identifier for a report/DICOM study pair
- Patients have a single relevant MRN
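The first assumption can be checked on tabular data before running an extract. The sketch below reports any (MRN, accession number) pair that appears more than once. The mrn and accession_number keys and the sample rows are hypothetical and may not match the column names in a real extract.

```python
from collections import Counter


def duplicate_study_keys(rows):
    """Return the (MRN, accession number) pairs that occur more than once."""
    counts = Counter((row["mrn"], row["accession_number"]) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Made-up rows; the column names are assumptions for illustration.
    rows = [
        {"mrn": "123", "accession_number": "A1"},
        {"mrn": "123", "accession_number": "A2"},
        {"mrn": "123", "accession_number": "A1"},
    ]
    print(duplicate_study_keys(rows))  # [('123', 'A1')]
```

An empty result means every pair is unique in the rows provided, which is what the pipeline assumes.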
The following files are present at each step of the pipeline.
A more detailed description of the relevant file types is available in docs/file_types/parquet_files.md.
test/resources/omop/public /*.parquet
....................private/*.parquet
....................extract_summary.json
EXTRACT_DIR is the directory passed to pixl populate
as the input PARQUET_PATH
argument.
EXTRACT_DIR/public /*.parquet
............private/*.parquet
............extract_summary.json
The directory where PIXL will copy the public OMOP extract files (which now contain
the radiology reports) to.
These files will subsequently be uploaded to the parquet destination specified in the
project config.
EXPORT_ROOT/PROJECT_SLUG/all_extracts/EXTRACT_DATETIME/radiology/radiology.parquet
....................................................../omop/public/*.parquet
If the parquet destination is set to ftps, the public extract files and radiology report will
be uploaded to the FTP server at the following path:
FTPROOT/PROJECT_SLUG/EXTRACT_DATETIME/parquet/radiology/radiology.parquet
..............................................omop/public/*.parquet
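The two layouts above can be expressed as path-building helpers, which makes the difference between the local export tree and the FTP tree easier to see. This is a sketch based only on the layouts shown here; the function names are made up, and the root, slug and datetime values in the usage example are placeholders (the real EXTRACT_DATETIME format is not specified in this section).

```python
from pathlib import PurePosixPath


def export_paths(export_root, project_slug, extract_datetime):
    """Build the local export locations for one extract."""
    base = PurePosixPath(export_root) / project_slug / "all_extracts" / extract_datetime
    return {
        "radiology": base / "radiology" / "radiology.parquet",
        "omop_public_dir": base / "omop" / "public",
    }


def ftps_paths(ftp_root, project_slug, extract_datetime):
    """Build the upload locations on the FTP server for one extract."""
    base = PurePosixPath(ftp_root) / project_slug / extract_datetime / "parquet"
    return {
        "radiology": base / "radiology" / "radiology.parquet",
        "omop_public_dir": base / "omop" / "public",
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for name, path in ftps_paths("/ftproot", "my-project", "EXTRACT_DATETIME").items():
        print(name, path)
```

Note that the local tree has an all_extracts level that the FTP tree lacks, while the FTP tree adds a parquet level beneath the extract datetime.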
- Generate your SSH keys as suggested here
- Clone the repository by typing (or copying) the following lines in a terminal
git clone git@github.com:SAFEHR-data/PIXL.git