diff --git a/docker/vm_boot_images/config/ubuntu.sh b/docker/vm_boot_images/config/ubuntu.sh
index 0c13eee87..b7bcca68a 100755
--- a/docker/vm_boot_images/config/ubuntu.sh
+++ b/docker/vm_boot_images/config/ubuntu.sh
@@ -3,4 +3,4 @@
 # Other necessities
 apt-get update
 echo "ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections
-apt-get install -y wget unzip curl python3-pydot python3-pydot-ng graphviz ttf-mscorefonts-installer git pip ffmpeg
\ No newline at end of file
+apt-get install -y wget unzip curl python3-pydot python3-pydot-ng graphviz ttf-mscorefonts-installer git pip ffmpeg hdf5-tools
\ No newline at end of file
diff --git a/ml4h/tensorize/TENSORIZE.md b/ml4h/tensorize/TENSORIZE.md
index 7a2fe40d5..364801ec1 100755
--- a/ml4h/tensorize/TENSORIZE.md
+++ b/ml4h/tensorize/TENSORIZE.md
@@ -148,11 +148,52 @@ scripts/tensorize.sh -t /mnt/disks/my-disk/ -i log -n 4 -s 1000000 -e 1030000 -
 ```
 This creates and writes tensors to `/mnt/disks/my-disk` (logs can be found at `/mnt/disks/my-disk/log`)
-by running `4` jobs in parallel (recommended to match that number to your VM's number of vCPUs)
-starting with the sample ID `1000000` and ending with the sample ID `1004000` using both the `EKG` (field ID `20205`)
-and the `MRI` (field ID `20209`) data. The tensors will be `.hd5` files named after corresponding sample IDs
-(e.g. `/mnt/disks/my-disk/1002798.hd5`).
-
+by running `4` jobs in parallel (the `-n 4`) starting with the sample ID `1000000` and ending with the sample ID `1004000`,
+using both the `EKG` (field ID `20205`) and the `MRI` (field ID `20209`) data. The tensors will be `.hd5`
+files named after corresponding sample IDs (e.g. `/mnt/disks/my-disk/1002798.hd5`).
+
+### Determining machine size
+It is recommended to match the `-n` value to your VM's number of vCPUs, since each job makes efficient use of a CPU.
+If you run too many jobs in parallel, you will run up against other limits on the machine, most notably disk speed.
+An SSD is highly recommended to make these jobs run faster, but past a certain number of vCPUs (and matching `-n` value)
+disk speed becomes the bottleneck. You can check the resource usage for your VM on the
+Google Cloud Console's VM Observability page, which you can access from the Observability tab within the VM
+instance details page. It is recommended to install the Ops Agent so that you can also check memory usage, whether you
+are looking to optimize speed and resource usage or trying to figure out why these jobs are taking longer than expected.
+
+As an example to help gauge usage and size the VM properly: we had ~100k nifti
+zips to process, producing ~50k tensors. On a machine with 22 cores and the data on an SSD, this took ~20 hours to run.
+The instance type was a `c3-standard-22`, which has 88 GB of memory. Tensorization never went above 20% memory usage,
+but SSD and CPU utilization were near 100%, suggesting the run would finish in the same time on a machine with 22 cores and
+~18 GB of RAM.
+
+### Validating and checking progress on tensors
+While tensorization runs on the VM, each tensorization job prints its own output, but because the jobs run in parallel
+it is hard to tell from that output how far along the run is. If you leave the `tmux` session where this is running, you
+can check progress (i.e. how many hd5s have been generated) by running something like
+`ls /mnt/disks/output_tensors_directory | wc -l`.
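+For a quick look inside any of the generated tensors, you can use the `hdf5-tools` utilities installed by the
+`ubuntu.sh` change above. A minimal sketch, assuming the example sample ID `1002798` and the output directory used here:
+```
+# List every group and dataset stored in one tensor file (h5ls ships with hdf5-tools)
+h5ls -r /mnt/disks/output_tensors_directory/1002798.hd5
+```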
+In order to validate the hd5s that we've generated, you can
+use the following script: `./scripts/validate_tensors.sh /mnt/disks/output_tensors_directory 20 | tee completed_tensors.txt`,
+where the number 20 should be the same value you used for tensorization above (usually the number of vCPUs). This
+will output a file called completed_tensors.txt that marks each tensor "OK" or "BAD" - you can then filter down to the
+BAD files (e.g. `grep BAD completed_tensors.txt`) and determine what went wrong.
+
+### Notes on tensorizing UKBB data
+The `DOWNLOAD.bulk` file or downloaded zips will tell us what samples/files we're going to process; you can get the
+unique set of sample ids with a command like this: `cat DOWNLOAD.bulk | cut -d ' ' -f 1 | sort | uniq`. This can be
+helpful for validating that we have all of our output tensors.
+We can check whether all input data has been turned into tensors with a command like this:
+`comm -23 <(cat DOWNLOAD.bulk | cut -d ' ' -f 1 | sort | uniq) <(ls /mnt/disks/output_tensors_directory | cut -d '.' -f 1 | sort | uniq)`
+That compares all sample ids in the bulk download file against the output tensors directory, outputting only the
+samples for which no tensor was found.
+
+Note that for ECGs and MRIs there are some samples within the UKBB where data
+was only collected at instance 3 and not instance 2 (whereas most samples have instance 2 and sometimes instance 3) - tensorization
+will skip these cases where only instance 3 data exists. You can check whether a given sample only has instance 3 data using a command
+like this: `comm -23 <(cat DOWNLOAD.bulk | grep "3_0" | cut -d ' ' -f 1 | sort | uniq) <(cat DOWNLOAD.bulk | grep "2_0" | cut -d ' ' -f 1 | sort | uniq)`
+This will output sample ids that only have instance 3 data.
+
+### Managing the generated tensors
 If you're happy with the results and would like to share them with your collaborators, don't forget to make your
 disk `read-only` so they can attach their VMs to it as well.
 You can do this by
diff --git a/ml4h/tensorize/tensor_writer_ukbb.py b/ml4h/tensorize/tensor_writer_ukbb.py
index 99cba9004..b7eba0e2c 100755
--- a/ml4h/tensorize/tensor_writer_ukbb.py
+++ b/ml4h/tensorize/tensor_writer_ukbb.py
@@ -132,12 +132,17 @@ def write_tensors(
         start_time = timer()  # Keep track of elapsed execution time
         tp = os.path.join(tensors, str(sample_id) + TENSOR_EXT)
+
+        if os.path.exists(tp):
+            raise Exception(f"File already exists: {tp} - please use merge_hd5s.sh to merge data from two hd5 files.")
+
         if not os.path.exists(os.path.dirname(tp)):
             os.makedirs(os.path.dirname(tp))
         if _prune_sample(sample_id, min_sample_id, max_sample_id, mri_field_ids, xml_field_ids, zip_folder, xml_folder):
             continue
+
         try:
-            with h5py.File(tp, 'a') as hd5:
+            with h5py.File(tp, 'w') as hd5:
                 _write_tensors_from_zipped_dicoms(write_pngs, tensors, mri_unzip, mri_field_ids, zip_folder, hd5, sample_id, stats)
                 _write_tensors_from_zipped_niftis(zip_folder, mri_field_ids, hd5, sample_id, stats)
                 _write_tensors_from_xml(xml_field_ids, xml_folder, hd5, sample_id, write_pngs, stats, continuous_stats)
diff --git a/scripts/tensorize.sh b/scripts/tensorize.sh
index f586bccd1..6491c0766 100755
--- a/scripts/tensorize.sh
+++ b/scripts/tensorize.sh
@@ -5,7 +5,7 @@
 ################### VARIABLES ############################################
 
 TENSOR_PATH=
-NUM_JOBS=96
+NUM_JOBS=20
 SAMPLE_IDS_START=1000000
 SAMPLE_IDS_END=6030000
 XML_FIELD=    # exclude ecg data
@@ -116,43 +116,65 @@ shift $((OPTIND - 1))
 START_TIME=$(date +%s)
 
 # Variables used to bin sample IDs so we can tensorize them in parallel
-INCREMENT=$(( ( $SAMPLE_IDS_END - $SAMPLE_IDS_START ) / $NUM_JOBS ))
-COUNTER=1
 MIN_SAMPLE_ID=$SAMPLE_IDS_START
-MAX_SAMPLE_ID=$(( $MIN_SAMPLE_ID + $INCREMENT - 1 ))
-
-# Run every parallel job within its own container -- 'tf.sh' handles the Docker launching
-while [[ $COUNTER -lt $(( $NUM_JOBS + 1 )) ]]; do
-    echo -e "\nLaunching job for sample IDs starting with $MIN_SAMPLE_ID and ending with $MAX_SAMPLE_ID via:"
-
-    cat <&2
+  exit 1
+fi
-
-    $HOME/ml4h/scripts/tf.sh -c $HOME/ml4h/ml4h/recipes.py \
-        --mode $TENSORIZE_MODE \
-        --tensors $TENSOR_PATH \
-        --output_folder $TENSOR_PATH \
-        $PYTHON_ARGS \
-        --min_sample_id $MIN_SAMPLE_ID \
-        --max_sample_id $MAX_SAMPLE_ID &
+
+# create a directory in the /tmp/ folder to store some utilities for use later
+mkdir -p /tmp/ml4h
+
+# Write out a file with the ids of every sample in the input folder
+echo "Gathering list of input zips to process between $MIN_SAMPLE_ID and $MAX_SAMPLE_ID, this takes several seconds..."
+find $ZIP_FOLDER -name '*.zip' | xargs -I {} basename {} | cut -d '_' -f 1 \
+  | awk -v min="$MIN_SAMPLE_ID" -v max="$MAX_SAMPLE_ID" '$1 > min && $1 < max' \
+  | sort | uniq > /tmp/ml4h/sample_ids_trimmed.txt
+
+NUM_SAMPLES_TO_PROCESS=$(cat /tmp/ml4h/sample_ids_trimmed.txt | wc -l)
+echo "Including $NUM_SAMPLES_TO_PROCESS samples in this tensorization job."
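+# Note: the sample ids above come from the UKBB zip naming convention sampleid_fieldid_instance_index.zip;
+# e.g. a (hypothetical) zip named 1002798_20209_2_0.zip yields sample id 1002798 via the `cut -d '_' -f 1` above.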
+
+
+echo -e "\nLaunching job for sample IDs starting with $MIN_SAMPLE_ID and ending with $MAX_SAMPLE_ID via:"
+
+# we need to run a command using xargs in parallel, and it gets rather complex and messy unless we can just run
+# a shell script - the string below is written to a shell script that takes in positional arguments and sets
+# min and max sample id to be the incoming sample id (min) and the incoming sample id + 1 (max) - this lets us
+# run on a single sample at a time
+SINGLE_SAMPLE_SCRIPT='HOME=$1
+  TENSORIZE_MODE=$2
+  TENSOR_PATH=$3
+  SAMPLE_ID=$4
+  # drop those first 4 above and get all the rest of the arguments
+  shift 4
+  PYTHON_ARGS="$@"
+  python $HOME/ml4h/ml4h/recipes.py --mode $TENSORIZE_MODE \
+    --tensors $TENSOR_PATH \
+    --output_folder $TENSOR_PATH \
+    $PYTHON_ARGS \
+    --min_sample_id $SAMPLE_ID \
+    --max_sample_id $(($SAMPLE_ID+1))'
+echo "$SINGLE_SAMPLE_SCRIPT" > /tmp/ml4h/tensorize_single_sample.sh
+chmod +x /tmp/ml4h/tensorize_single_sample.sh
+
+# NOTE: the < " --version; > below is very much a hack - it's a way to escape tf.sh's running "python" followed by
+# whatever you pass with -c. This causes it to run "python --version; " and then whatever you have after the semicolon.
+read -r -d '' TF_COMMAND <