-
Notifications
You must be signed in to change notification settings - Fork 5
5.3.Accessing Imaging Data
The UK Biobank data can be divided into two types:
- Normal access.
- Apptainer/SquashFS access.
Files which can be accessed normally include the tabular files and genetics data. The imaging data is made available through a series of SquashFS files on the Beluga system of Alliance. These SquashFS files are filesystems, and need to be mounted via an Apptainer container. (If you're asking yourself, "Why would they do this to us?", see the last section of this page.)
To access the data, you will first need to create an Apptainer image. Your image should have your analysis software installed within it. A sample Apptainer image is provided on Beluga (/project/rpp-aevans-ab/neurohub/ukbb/example_apptainer.sif
).
You will first need to load the Apptainer module on Beluga:
module load apptainer
Apptainer images are normally used through either their run
or exec
commands. run
will run a set of commands that are defined when the container is first built. exec
will run the specified command from within the image. For example:
apptainer run SING_IMAGE.sif datafile.nii.gz # Executes the set of commands defined under 'run' for the image 'SING_IMAGE' on the input 'datafile.nii.gz'
apptainer exec SING_IMAGE.sif ls -lth / # Uses 'SING_IMAGE.sif' to execute the command, 'ls -lth /'
You can access the shell of an image using shell
:
apptainer shell SING_IMAGE.sif
The terminal is now in your image's environment; software that was installed in the image is now accessible to you. The interactive shell is a good way to test and debug pipelines, but its use for large-scale analysis is discouraged.
Accessing SquashFS files is done through the use of overlays. When using any of the Apptainer commands, you can use the --overlay
to mount files. Using one of the SquashFS files in the imaging directory as an example:
cd /project/rpp-aevans-ab/neurohub/ukbb/
module load apptainer
apptainer shell --overlay imaging/neurohub_ukbb_participants.squashfs example_apptainer.sif
# The SquashFS file is mounted and is now accessible. NeuroHub data is kept in /neurohub/; and in this case
ls -R /neurohub/ # You should now see dataset_description.json and participants.tsv in /neurohub/ukbb/imaging/
The UKB data is divided by modality and session, and due to technical concerns some modalities are divided across multiple SquashFS files. We can mount multiple SquashFS files the same way as with a single overlay:
cd /project/rpp-aevans-ab/neurohub/ukbb/
module load apptainer
apptainer shell --overlay imaging/neurohub_ukbb_participants.squashfs --overlay imaging/neurohub_ukbb_t1_ses2_0_bids.squashfs example_apptainer.sif
# We can now access the T1 data. Let's take a look at the first 100 files:
ls /neurohub/ukbb/imaging/ | head -n 100
Depending on your modality of interest, you will likely need several images, and typing "--overlay ..." is tedious, cumbersome, and difficult to read. Instead, we can use a combination of globbing and command substitution. Let's say we want to access all of the resting-state fMRI (rfMRI); there are 7 .squashfs matching, with the naming convention: neurohub_ukbb_rfmri_ses2_?_bids.squashfs
(note the ?). We know that our command will have to end up looking like:
apptainer shell --overlay neurohub_ukbb_rfmri_ses2_0_bids.squashfs --overlay neurohub_ukbb_rfmri_ses2_1_bids.squashfs (etc.)
We just need to create a loop to gather all the images:
cd /project/rpp-aevans-ab/neurohub/ukbb/
over=''
for ov in `printf "%s\n" "imaging/neurohub_ukbb_rfmri_ses2_?_bids.squashfs"`; do
over+=" --overlay ${ov}"
done
echo "${over}" # "over" should now contain every rfmri ses2 squashfs image.
Now that we've gathered all images, we can mount them all:
apptainer shell ${over} example_apptainer.sif # Note that you can't use quotes around ${over}
ls /neurohub/ukbb/imaging/ | wc -l # Count number of files in the directory
The supplied SquashFS images for the imaging data are in a few different formats. Most of the raw data is in BIDS and can be used with BIDS Apps or any other tools that can read from a BIDS-compliant structure.
We acknowledge that the Apptainer/SquashFS adds a technical hurdle for users not already familiar with Apptainer. However, the use of a read-only filesystem offers a number of technical advantages (see Rioux et al., 2020), and deployment of the full UKB dataset would not be possible without it.
Beyond the space required to store data, Unix-style filesystems use inodes to identify files. Much like the physical space on a disk, the number of inodes in a filesystem are finite; running out of either storage or inodes makes it impossible to store additional data. It is unusual to run out of inodes before storage because typical usage uses proportionally more storage than inodes. In the case of computing clusters such as those provided by Alliance, project allocations have both a storage quota (e.g. 10TB) and a filecount quota (e.g. 1M); the latter is the limit of inodes a project can use.
The UK Biobank has published 40k imaging subjects with a target of 100k subjects. Any given subject may have about 500 imaging-related files, giving us 100k subjects * 500 files/subject = 50M files
. Even limiting ourselves to only the most-pertinent files (~50 per subject), we would still need to use 5M inodes. At the time of the design decision, NeuroHub's allocation had an inode quota of 15M.
Cluster-scale filesystems are distributed across several devices to accomodate the large number of users and files. Betwen the number of users and its distributed nature, the filesystem can be slow when accessing a large of files. SquashFS bypasses that by storing multiple files as a single file; the host filesystem only needs to find a single file, reducing its burden and speeding up file access.
Apptainer is a container system adapted for execution in shared computing resources. Containers provide virtual environments for running code in different software and hardware environments. By running code through a container, researchers make it easier for others to reproduce their results: it is not necessary to provide the full software stack of a system or to have others reproduce it on their own system. Instead, containers (or their build file) can be shared and will produce the same results across different systems. Since the UKB data is widely-distributed, it is one of the few cases where different labs can fully verify that results can be reproduced.
For any questions or issues related to the registration process for accessing the UK Biobank data through the McGill agreement, please contact the NeuroHub team by email at support@neurohub.ca.
For any questions or issues specifically related to accessing the NeuroHub portal or using any of the NeuroHub features and functionalities, please contact the NeuroHub team by email at support@neurohub.ca.
- Available Tools and Datasets
- Access NeuroHub
- Creating a project
- LORIS Data Query TooL(DQT)
- Jupyter Notebooks in NeuroHub
- Messaging in NeuroHub
- Accessing the Canadian Open Neuroscience Platform(CONP) datasets
- Accessing the HCP(Human Connectome Project) datasets
- Accessing the UK Biobank dataset
- Accessing the ABCD dataset
- Accessing the OpenNeuro dataset