Ab tensorize speed up (#552)
* Changed parallelization within tensorize.sh so it runs in parallel within docker rather than parallel docker - about a 10x speed up.  Added hdf5-tools to the docker for testing the hd5 files.  Added validate_tensors.sh that can test that a generated hd5 file is complete.

* Made parallel use the NUM_JOBS flag to see how many run in parallel.

* Fixed defaults for min and max sample id when not provided.

* Fixed permissions on validate_tensors.sh so it is executable

* Removed redundant line in tensorize.sh and update documentation about scaling instance types and debugging tensorization.
abaumann committed Feb 8, 2024
1 parent 4975c72 commit 6fda09c
Showing 6 changed files with 123 additions and 40 deletions.
2 changes: 1 addition & 1 deletion docker/vm_boot_images/config/ubuntu.sh
@@ -3,4 +3,4 @@
# Other necessities
apt-get update
echo "ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections
apt-get install -y wget unzip curl python3-pydot python3-pydot-ng graphviz ttf-mscorefonts-installer git pip ffmpeg
apt-get install -y wget unzip curl python3-pydot python3-pydot-ng graphviz ttf-mscorefonts-installer git pip ffmpeg hdf5-tools
51 changes: 46 additions & 5 deletions ml4h/tensorize/TENSORIZE.md
@@ -148,11 +148,52 @@ scripts/tensorize.sh -t /mnt/disks/my-disk/ -i log -n 4 -s 1000000 -e 1030000 -
```

This creates and writes tensors to `/mnt/disks/my-disk` (logs can be found at `/mnt/disks/my-disk/log`)
by running `4` jobs in parallel (the `-n 4`), starting with sample ID `1000000` and ending with sample ID `1004000`,
using both the `EKG` (field ID `20205`) and the `MRI` (field ID `20209`) data. The tensors will be `.hd5`
files named after the corresponding sample IDs (e.g. `/mnt/disks/my-disk/1002798.hd5`).

### Determining machine size
It is recommended to match the `-n` value to your VM's number of vCPUs, since each of these jobs keeps a core busy.
If you run too many jobs in parallel, you will run up against other limits on the machine, most notably disk speed.
An SSD is highly recommended to make these jobs run faster, but at a certain number of vCPUs (and corresponding `-n` value),
disk speed becomes the bottleneck. You can check the resource usage for your VM on the
Google Cloud Console's VM Observability page, accessible from the Observability tab within the VM
instance details page. It is recommended to install the Ops Agent so you can check memory usage, whether you are looking
to optimize speed and resource usage or trying to figure out why these jobs are taking longer than expected.
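As a quick alternative to the Cloud Console, you can snapshot resource headroom from the VM's own shell. A minimal sketch using standard Linux tools (the mount point shown is illustrative; substitute your tensor disk):

```shell
# Snapshot of the resources that bound tensorization throughput
echo "vCPUs available: $(nproc)"   # a reasonable starting point for -n
free -h | head -n 2                # total vs. used memory
df -h /                            # disk capacity (check your tensor disk's mount instead)
uptime                             # load average far above the vCPU count means oversubscription
```

This won't show historical graphs the way the Observability page does, but it is enough to confirm whether CPU, memory, or disk is the current constraint.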

As an example of how to interpret usage and size the VM appropriately: we had ~100k nifti
zips to process, producing ~50k tensors. On a machine with 22 cores and the data on an SSD, this took ~20 hours.
The instance type was a `c3-standard-22`, which has 88 GB of memory. Tensorization never went above 20% of memory usage,
while SSD throughput and CPU were near 100%, suggesting the run would have finished in the same time on a machine with 22 cores and
~18 GB of RAM (20% of 88 GB).

### Validating and checking progress on tensors
While tensorization runs on the VM, you will see output from each tensorization job, but because these jobs run in parallel
it is hard to tell how far along you are. If you leave the `tmux` session where tensorization is running, you
can check progress (how many hd5s have been generated) by running something like
`ls /mnt/disks/output_tensors_directory | wc -l`. To validate the hd5s that have been generated, you can
use the following script: `./scripts/validate_tensors.sh /mnt/disks/output_tensors_directory 20 | tee completed_tensors.txt`,
where `20` should be the same value you used for tensorization above (usually the number of vCPUs). This
writes a file called `completed_tensors.txt` that marks each tensor "OK" or "BAD"; you can then filter down to the BAD files and
determine what went wrong.
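For example, to pull out just the failed files, a sketch against hypothetical validator output (the paths below are made up):

```shell
# Hypothetical completed_tensors.txt content, mimicking validate_tensors.sh output
printf 'OK - /mnt/disks/tensors/1000001.hd5\nBAD - /mnt/disks/tensors/1000002.hd5\n' > /tmp/completed_tensors.txt
# Keep only the failed files and strip the prefix to get bare paths
grep '^BAD' /tmp/completed_tensors.txt | sed 's/^BAD - //'
```

A file usually shows up as BAD when its job was interrupted mid-write, so rerunning tensorization for just those sample IDs is the typical remedy.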

### Notes on tensorizing UKBB data
The `DOWNLOAD.bulk` file (or the downloaded zips) tells us which samples/files are going to be processed. You can get the
unique set of sample ids with a command like `cat DOWNLOAD.bulk | cut -d ' ' -f 1 | sort | uniq`, which is
helpful for validating that all output tensors are present.
You can check whether all input data has been turned into tensors with a command like this:
`comm -23 <(cat DOWNLOAD.bulk | cut -d ' ' -f 1 | sort | uniq) <(ls /mnt/disks/output_tensors_directory | cut -d '.' -f 1 | sort | uniq)`
This takes all sample ids in the bulk download file, compares them against the output tensors directory, and outputs only
the samples for which no tensor was found.
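A toy run of that `comm` pattern, with made-up sample ids and temp directories, shows what the output looks like (temp files are used here instead of process substitution so the sketch runs under plain `sh`):

```shell
# Toy example: find bulk-file samples that have no output tensor
mkdir -p /tmp/toy_tensors
printf '1000001 20209_2_0\n1000002 20209_2_0\n1000001 20205_2_0\n' > /tmp/DOWNLOAD.bulk
touch /tmp/toy_tensors/1000001.hd5               # only sample 1000001 was tensorized
cut -d ' ' -f 1 /tmp/DOWNLOAD.bulk | sort -u > /tmp/bulk_ids.txt
ls /tmp/toy_tensors | cut -d '.' -f 1 | sort -u > /tmp/tensor_ids.txt
comm -23 /tmp/bulk_ids.txt /tmp/tensor_ids.txt   # prints 1000002, the missing sample
```

`comm -23` suppresses lines unique to the second file (`-3`) and lines common to both (`-2`), leaving only ids that appear in the bulk file but not in the tensors directory; both inputs must be sorted.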

Note that for ECGs and MRIs there are some cases within the UKBB where data
was only collected at instance 3 and not instance 2 (whereas most samples have instance 2 and sometimes also instance 3); tensorization
will skip these cases where only instance 3 data exists. You can check whether a given sample has only instance 3 data with a command
like this: `comm -23 <(cat DOWNLOAD.bulk | grep "3_0" | cut -d ' ' -f 1 | sort | uniq) <(cat DOWNLOAD.bulk | grep "2_0" | cut -d ' ' -f 1 | sort | uniq)`
This will output the sample ids that have only instance 3 data.

### Managing the generated tensors
If you're happy with the results and would like to share them with your collaborators, don't forget to make your disk
`read-only` so they can attach their VMs to it as well. You can do this by

7 changes: 6 additions & 1 deletion ml4h/tensorize/tensor_writer_ukbb.py
@@ -132,12 +132,17 @@ def write_tensors(

start_time = timer() # Keep track of elapsed execution time
tp = os.path.join(tensors, str(sample_id) + TENSOR_EXT)

if os.path.exists(tp):
raise Exception(f"File already exists: {tp} - please use merge_hd5s.sh to merge data from two hd5 files.")

if not os.path.exists(os.path.dirname(tp)):
os.makedirs(os.path.dirname(tp))
if _prune_sample(sample_id, min_sample_id, max_sample_id, mri_field_ids, xml_field_ids, zip_folder, xml_folder):
continue

try:
with h5py.File(tp, 'a') as hd5:
with h5py.File(tp, 'w') as hd5:
_write_tensors_from_zipped_dicoms(write_pngs, tensors, mri_unzip, mri_field_ids, zip_folder, hd5, sample_id, stats)
_write_tensors_from_zipped_niftis(zip_folder, mri_field_ids, hd5, sample_id, stats)
_write_tensors_from_xml(xml_field_ids, xml_folder, hd5, sample_id, write_pngs, stats, continuous_stats)
84 changes: 53 additions & 31 deletions scripts/tensorize.sh
@@ -5,7 +5,7 @@
################### VARIABLES ############################################

TENSOR_PATH=
NUM_JOBS=96
NUM_JOBS=20
SAMPLE_IDS_START=1000000
SAMPLE_IDS_END=6030000
XML_FIELD= # exclude ecg data
@@ -116,43 +116,65 @@ shift $((OPTIND - 1))
START_TIME=$(date +%s)

# Variables used to bin sample IDs so we can tensorize them in parallel
INCREMENT=$(( ( $SAMPLE_IDS_END - $SAMPLE_IDS_START ) / $NUM_JOBS ))
COUNTER=1
MIN_SAMPLE_ID=$SAMPLE_IDS_START
MAX_SAMPLE_ID=$(( $MIN_SAMPLE_ID + $INCREMENT - 1 ))

# Run every parallel job within its own container -- 'tf.sh' handles the Docker launching
while [[ $COUNTER -lt $(( $NUM_JOBS + 1 )) ]]; do
echo -e "\nLaunching job for sample IDs starting with $MIN_SAMPLE_ID and ending with $MAX_SAMPLE_ID via:"

cat <<LAUNCH_CMDLINE_MESSAGE
$HOME/ml4h/scripts/tf.sh -c $HOME/ml4h/ml4h/recipes.py
--mode $TENSORIZE_MODE \
--tensors $TENSOR_PATH \
--output_folder $TENSOR_PATH \
$PYTHON_ARGS \
--min_sample_id $MIN_SAMPLE_ID \
--max_sample_id $MAX_SAMPLE_ID &
LAUNCH_CMDLINE_MESSAGE
MAX_SAMPLE_ID=$SAMPLE_IDS_END

# We want to get the zip folder that was passed to recipes.py - look for the --zip_folder argument and extract the value that follows it
ZIP_FOLDER=$(echo ${PYTHON_ARGS} | sed 's/.*--zip_folder \([^ ]*\).*/\1/')
if [ ! -e $ZIP_FOLDER ]; then
echo "ERROR: Zip folder passed was not valid, found $ZIP_FOLDER but expected folder path." 1>&2
exit 1
fi

$HOME/ml4h/scripts/tf.sh -c $HOME/ml4h/ml4h/recipes.py \
--mode $TENSORIZE_MODE \
--tensors $TENSOR_PATH \
--output_folder $TENSOR_PATH \
$PYTHON_ARGS \
--min_sample_id $MIN_SAMPLE_ID \
--max_sample_id $MAX_SAMPLE_ID &
# create a directory in the /tmp/ folder to store some utilities for use later
mkdir -p /tmp/ml4h
# Write out a file with the ids of every sample in the input folder
echo "Gathering list of input zips to process between $MIN_SAMPLE_ID and $MAX_SAMPLE_ID, this takes several seconds..."
find $ZIP_FOLDER -name '*.zip' | xargs -I {} basename {} | cut -d '_' -f 1 \
| awk -v min="$MIN_SAMPLE_ID" -v max="$MAX_SAMPLE_ID" '$1 >= min && $1 <= max' \
| sort | uniq > /tmp/ml4h/sample_ids_trimmed.txt

NUM_SAMPLES_TO_PROCESS=$(cat /tmp/ml4h/sample_ids_trimmed.txt | wc -l)
echo "Including $NUM_SAMPLES_TO_PROCESS samples in this tensorization job."


echo -e "\nLaunching job for sample IDs starting with $MIN_SAMPLE_ID and ending with $MAX_SAMPLE_ID via:"

# we need to run a command using xargs in parallel, and it gets rather complex and messy unless we can just run
# a shell script - the string below is written to a shell script that takes in positional arguments and sets
# min and max sample id to be the incoming sample id (min) to incoming sample id + 1 (max) - this lets us
# run on a single sample at a time
SINGLE_SAMPLE_SCRIPT='HOME=$1
TENSORIZE_MODE=$2
TENSOR_PATH=$3
SAMPLE_ID=$4
# drop those first 4 above and get all the rest of the arguments
shift 4
PYTHON_ARGS="$@"
python $HOME/ml4h/ml4h/recipes.py --mode $TENSORIZE_MODE \
--tensors $TENSOR_PATH \
--output_folder $TENSOR_PATH \
$PYTHON_ARGS \
--min_sample_id $SAMPLE_ID \
--max_sample_id $(($SAMPLE_ID+1))'
echo "$SINGLE_SAMPLE_SCRIPT" > /tmp/ml4h/tensorize_single_sample.sh
chmod +x /tmp/ml4h/tensorize_single_sample.sh

# NOTE: the < " --version; > below is very much a hack - it's a way to escape tf.sh's running "python" followed by
# whatever you pass with -c. This causes it to run "python --version; " and then whatever you have after the semicolon.
read -r -d '' TF_COMMAND <<LAUNCH_CMDLINE_MESSAGE
$HOME/ml4h/scripts/tf.sh -m "/tmp/ml4h/" -c " --version; \
cat /tmp/ml4h/sample_ids_trimmed.txt | \
xargs -P $NUM_JOBS -I {} /tmp/ml4h/tensorize_single_sample.sh $HOME $TENSORIZE_MODE $TENSOR_PATH {} $PYTHON_ARGS"
LAUNCH_CMDLINE_MESSAGE

let COUNTER=COUNTER+1
let MIN_SAMPLE_ID=MIN_SAMPLE_ID+INCREMENT
let MAX_SAMPLE_ID=MAX_SAMPLE_ID+INCREMENT
echo "Executing command within tf.sh: $TF_COMMAND"
bash -c "$TF_COMMAND"

sleep 4s
done

################### DISPLAY TIME #########################################

END_TIME=$(date +%s)
ELAPSED_TIME=$(($END_TIME - $START_TIME))
printf "\nDispatched $((COUNTER - 1)) tensorization jobs in "
printf "\nDispatched $NUM_SAMPLES_TO_PROCESS tensorization jobs in "
display_time $ELAPSED_TIME
4 changes: 2 additions & 2 deletions scripts/tf.sh
@@ -184,6 +184,6 @@ ${GPU_DEVICE} \
-v ${WORKDIR}/:${WORKDIR}/ \
-v ${HOME}/:${HOME}/ \
${MOUNTS} \
${DOCKER_IMAGE} /bin/bash -c "pip3 install --upgrade pip
pip install ${WORKDIR};
${DOCKER_IMAGE} /bin/bash -c "pip install --quiet --upgrade pip
pip install --quiet ${WORKDIR};
eval ${CALL_DOCKER_AS_USER} ${PYTHON_COMMAND} ${PYTHON_ARGS}"
15 changes: 15 additions & 0 deletions scripts/validate_tensors.sh
@@ -0,0 +1,15 @@
# use this script to validate the tensors created by the tensorize.sh script
# expects two positional arguments: directory containing the tensors and the number of threads to use
# example: ./validate_tensors.sh /mnt/disks/tensors/ 20 | tee completed_tensors.txt
# the output will be in the following form:
# OK - /mnt/disks/tensors/ukb1234.hd5
# BAD - /mnt/disks/tensors/ukb5678.hd5


INPUT_TENSORS_DIR=$1
NUMBER_OF_THREADS=$2


find "${INPUT_TENSORS_DIR}" -name '*.hd5' | \
xargs -P ${NUMBER_OF_THREADS} -I {} \
bash -c "h5dump -n {} | (grep -q 'HDF5 \"{}\"' && echo 'OK - {}' || echo 'BAD - {}')"
