diff --git a/README.md b/README.md
index f501e021..e446473e 100644
--- a/README.md
+++ b/README.md
@@ -35,11 +35,10 @@ There are a number of different computational tasks to complete once a MIBI run
 - 3a: real time monitoring. The [MIBI monitoring](./templates/3a_monitor_MIBI_run.ipynb) notebook will monitor an ongoing MIBI run, and begin processing the image data as soon as it is generated. This notebook is being continually updated as we move more of our processing pipeline to happen in real time as the data is generated.
 - 3b: post-run monitoring. For each step in the monitoring notebook, we have a dedicated notebook that can perform the same tasks once a run is complete. This includes [the image extraction notebook](./templates/extract_bin_file.ipynb) and the [qc metrics notebook](./templates/3b_generate_qc_metrics.ipynb).
+### 4. Processing MIBI data
+Once your run has finished, you can begin to process the data to make it ready for analysis. To remove background signal contamination, as well as compensate for channel crosstalk, you can use the [compensation](./templates/4a_compensate_image_data.ipynb) notebook. This will guide you through the Rosetta algorithm, which uses a flow-cytometry style compensation approach to remove spurious signal.
-### 4. Processing MIBI Data
-Once your run has finished, you can begin to process the data to make it ready for analysis. To remove background signal contamination, as well as compensate for channel crosstalk, you can use the [compensation](./templates/4_compensate_image_data.ipynb) notebook. This will guide you through the Rosetta algorithm, which uses a flow-cytometry style compensation approach to remove spurious signal.
-
-Following compensation, you will want to normalize your images to ensure consistent intensity across the run. This functionality is currently in the works, and we'll have a beta version available to test soon.
+Following compensation, you will want to normalize your images to ensure consistent intensity across the run. You can use the [normalization](./templates/4b_normalize_image_data.ipynb) notebook to perform this step.
 
 ## Installation
 In order to get toffy working, you'll need to first install the repo.
diff --git a/environment.yml b/environment.yml
index cd5d0f30..62835228 100644
--- a/environment.yml
+++ b/environment.yml
@@ -3,10 +3,11 @@ dependencies:
   - python=3.8
   - pip
   - pip:
-    - git+https://github.com/angelolab/mibi-bin-tools.git@02b4549731c2204e727073d6f23d3a91123e69d8
-    - git+https://github.com/angelolab/ark-analysis.git@master
+    - git+https://github.com/angelolab/ark-analysis.git@master
+    - git+https://github.com/angelolab/mibi-bin-tools.git@master
   - jupyter>=1.0.0,<2
   - jupyter_contrib_nbextensions>=0.5.1,<1
   - jupyterlab>=3.1.5,<4
   - watchdog>=2.1.6,<3
+  - natsort>=0.8
   - numpy>=1.22,<2
diff --git a/requirements.txt b/requirements.txt
index 40808e40..e2316972 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,6 +3,7 @@ mibi-bin-tools @ git+https://github.com/angelolab/mibi-bin-tools.git@master
 jupyter>=1.0.0,<2
 jupyter_contrib_nbextensions>=0.5.1,<1
 jupyterlab>=3.1.5,<4
+natsort>=0.8
 numpy>=1.22,<2
 watchdog>=2.1.6,<3
 traitlets==5.2.2.post1
diff --git a/templates/1_set_up_toffy.ipynb b/templates/1_set_up_toffy.ipynb
index 59c4f23c..09f3d7f8 100644
--- a/templates/1_set_up_toffy.ipynb
+++ b/templates/1_set_up_toffy.ipynb
@@ -15,9 +15,10 @@
    "id": "e36293c5-aa89-4029-a3fa-e8ea841bb8b5",
    "metadata": {},
    "source": [
-    "There are two parts to this notebook. 
\n", + "There are three parts to this notebook. \n", "1. The first part creates the necessary folders that toffy is expecting, and only needs to be run the first time you install it on a new CAC. \n", - "2. The second part updates the co-registration parameters between the slide image (optical image) and the stage coordinates. This needs to be run anytime Ionpath changes the co-registration" + "2. The second part updates the co-registration parameters between the slide image (optical image) and the stage coordinates. This needs to be run anytime Ionpath changes the co-registration\n", + "3. The third part generates a tuning curve to correct for shifts in instrument sensitivity, and only needs to be run once per instrument" ] }, { @@ -37,7 +38,8 @@ "import os\n", "from sklearn.linear_model import LinearRegression\n", "\n", - "from toffy import tiling_utils" + "from toffy import tiling_utils, normalize\n", + "from ark.utils import io_utils" ] }, { @@ -55,9 +57,12 @@ "metadata": {}, "outputs": [], "source": [ - "folders = ['D:\\\\Extracted_Images', 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\run_metrics', 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\watcher_logs',\n", - " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\tiled_run_jsons', 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\autolabeled_tma_jsons', \n", - " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\panel_files']\n", + "folders = ['D:\\\\Extracted_Images', 'D:\\\\Rosetta_Compensated_Images', 'D:\\\\Normalized_Images', \n", + " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\run_metrics', 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\watcher_logs',\n", + " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\tiled_run_jsons', \n", + " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\autolabeled_tma_jsons', \n", + " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\panel_files', 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\normalization_curve', \n", + " 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\mph_files']\n", "\n", "for folder in folders:\n", " if not os.path.exists(folder):\n", @@ -178,6 +183,89 @@ "source": [ "tiling_utils.save_coreg_params(coreg_params)" ] + }, + { + "cell_type": "markdown", + "id": "14c82566-d6f5-4096-a249-92fae371ab39", + "metadata": {}, + "source": [ + "## 3. Generate sensitivity tuning curve\n", + "Depending on when an FOV was acquired with respect to the last time the detector was tuned, you will see variable levels of antibody signal. These differences in sensitivity can result in differences in marker intensity, when in fact there is no underlying biological difference in the real signal. In order to correct for this, and ensure that samples which have the same expression levels of a given marker record the same intensity, we need to normalize the images. \n", + "\n", + "The normalization process relies on constructing a tuning curve, which accurately tracks the relationship between increasing detector gain and antibody signal. We can use a detector sweep to figure out this relationship. We can then correct each image to ensure that there are consistent levels of antibody signal. " + ] + }, + { + "cell_type": "markdown", + "id": "b4170c03-1619-479c-99f0-ea03ac13d76b", + "metadata": {}, + "source": [ + "### Identify detector sweep\n", + "The first step is selecting a detector sweep. The goal is for this sweep to cover the range of values most often seen during image acqusition. Therefore, it's best to pick a sweep where the suggested change in voltage following the sweep was less than 50V." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "525de367-672a-416c-9c1a-1cb20a397cda", + "metadata": {}, + "outputs": [], + "source": [ + "# pick a name for the sweep, such as the date it was run\n", + "sweep_name = '20220417_pmma'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e661bba8-6657-4f9f-ba8b-ebb9eab874bc", + "metadata": {}, + "outputs": [], + "source": [ + "# create a new folder with the sweep name\n", + "normalization_dir = 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\normalization_curve'\n", + "sweep_path = os.path.join(normalization_dir, sweep_name)\n", + "os.makedirs(sweep_path)" + ] + }, + { + "cell_type": "markdown", + "id": "9cab012d-9092-4136-b81b-9c8b3a969e15", + "metadata": {}, + "source": [ + "Now, copy all of the FOVs from the sweep into the newly created folder, which can be found in *C:\\\\Users\\\\Customer.ION\\\\Documents\\\\normalization_curve*" + ] + }, + { + "cell_type": "markdown", + "id": "86941169-09a1-43fe-9299-f9f1ca8766b3", + "metadata": {}, + "source": [ + "### Create tuning curve\n", + "We'll then use these FOVs in order to create the curve" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7d0aa94-a481-4628-b41a-fa2b74d489d9", + "metadata": {}, + "outputs": [], + "source": [ + "# define masses to use\n", + "normalize.create_tuning_function(sweep_path=sweep_path)" + ] + }, + { + "cell_type": "markdown", + "id": "762d6e3b-41d1-4064-9578-2bae599e3f99", + "metadata": {}, + "source": [ + "Your curve should look like the image below. It's okay if your values are a bit different, but the shape of the curve should be qualitatively the same. The curve will be saved in the *sweep_path* folder you defined above\n", + "
\n",
+    "*(expected tuning curve image)*\n",
+    "
\n" + ] } ], "metadata": { diff --git a/templates/4_compensate_image_data.ipynb b/templates/4a_compensate_image_data.ipynb similarity index 76% rename from templates/4_compensate_image_data.ipynb rename to templates/4a_compensate_image_data.ipynb index 64991b0d..a478d255 100644 --- a/templates/4_compensate_image_data.ipynb +++ b/templates/4a_compensate_image_data.ipynb @@ -34,7 +34,7 @@ "id": "d1db8364", "metadata": {}, "source": [ - "### First, make a folder to hold all of the files related to rosetta processing, and put the full path below" + "### First, make a folder for evaluating rosetta normalization, and put the full path below" ] }, { @@ -54,7 +54,7 @@ "source": [ "### Next, copy over the .bin files for the ~10 FOVs will you use for testing. In addition to the .bin files, make sure to copy over the .JSON files with the same name into this folder. Place them in a folder named *example_bins*.\n", "\n", - "#### For example, fov-1-scan-1.bin, fov-1-scan-1.json, fov-23-scan-1.bin, fov-23-scan-1.json, etc" + "### For example, fov-1-scan-1.bin, fov-1-scan-1.json, fov-23-scan-1.bin, fov-23-scan-1.json, etc" ] }, { @@ -200,7 +200,7 @@ "id": "bd4f51e0-8bcc-4b5b-b2d3-6bc00140e0ca", "metadata": {}, "source": [ - "### Once you're satisfied that the Rosetta is working appropriately, you can use it to process your entire dataset" + "### Once you're satisfied that the Rosetta is working appropriately, you can use it to process your run. First select the run you want to process, and define the relevant top-level folders" ] }, { @@ -210,13 +210,25 @@ "metadata": {}, "outputs": [], "source": [ - "# Specify necessary folders\n", + "# Put the name of your run here\n", + "run_name = '20220101_my_run'\n", "\n", - "# This should be a folder of run folders. Each folder within bin_file_dir should contain all of the .bin and .json files for that run\n", - "bin_file_dir = 'path/to/cohort/all_runs'\n", + "# The path to the folder containing raw run data\n", + "bin_file_dir = 'D:\\\\Data'\n", "\n", "# This folder is where all of the extracted images will get saved\n", - "extracted_image_dir = 'path/to/cohort/extracted_runs'" + "extracted_image_dir = 'D:\\\\Extracted_Images'\n", + "\n", + "# This folder will hold the post-rosetta images\n", + "rosetta_image_dir = 'D:\\\\Rosetta_Compensated_Images'" + ] + }, + { + "cell_type": "markdown", + "id": "7f6161fb-3050-44e1-ab9a-7b9c59675d89", + "metadata": {}, + "source": [ + "### Prior to running compensation, you'll need to extract your data if you haven't already" ] }, { @@ -226,36 +238,27 @@ "metadata": {}, "outputs": [], "source": [ - "# If you only want to extract a subset of your runs, specify their names here; otherwise, leave as None\n", - "runs = None\n", - "if runs is None:\n", - " runs = list_folders(bin_file_dir)\n", + "# set run-specific folders\n", + "run_bin_dir = os.path.join(bin_file_dir, run_name)\n", + "run_extracted_dir = os.path.join(extracted_image_dir, run_name)\n", + "if not os.path.exists(run_extracted_dir):\n", + " os.makedirs(run_extracted_dir)\n", "\n", - "for run in runs:\n", - " print(\"processing run {}\".format(run))\n", - " current_bin = os.path.join(bin_file_dir, run)\n", - " current_out = os.path.join(extracted_image_dir, run)\n", - " if not os.path.exists(current_out):\n", - " os.makedirs(current_out)\n", - " \n", - " # extract bins and replace gold image\n", - " bin_files.extract_bin_files(current_bin, current_out, panel=panel, intensities=['Au', 'chan_39'])\n", - " rosetta.replace_with_intensity_image(run_dir=current_out, 
channel='Au')\n", - " rosetta.replace_with_intensity_image(run_dir=current_out, channel='chan_39')\n", - " \n", - " # clean up dirs\n", - " rosetta.remove_sub_dirs(run_dir=current_out, sub_dirs=['intensities', 'intensity_times_width'])" + "# extract bins\n", + "bin_files.extract_bin_files(run_bin_dir, run_extracted_dir, panel=panel, intensities=['Au', 'chan_39'])\n", + "rosetta.replace_with_intensity_image(run_dir=run_extracted_dir, channel='Au')\n", + "rosetta.replace_with_intensity_image(run_dir=run_extracted_dir, channel='chan_39')\n", + "\n", + "# clean up dirs\n", + "rosetta.remove_sub_dirs(run_dir=run_extracted_dir, sub_dirs=['intensities', 'intensity_times_width'])" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "ce583c8d-2fff-4a87-9d41-3c02246eb56d", + "cell_type": "markdown", + "id": "c4df6224-9839-43f7-9122-62bca0668fbc", "metadata": {}, - "outputs": [], "source": [ - "# specify path to save rosetta images\n", - "rosetta_image_dir = base_dir + 'rosetta_run_output'" + "### Then, you can compensate the data using rosetta" ] }, { @@ -266,14 +269,12 @@ "outputs": [], "source": [ "# Perform rosetta on extracted images\n", - "for run in runs:\n", - " print(\"processing run {}\".format(run))\n", - " raw_img_dir = os.path.join(extracted_image_dir, run)\n", - " out_dir = os.path.join(rosetta_image_dir, run)\n", - " if not os.path.exists(out_dir):\n", - " os.makedirs(out_dir)\n", - " rosetta.compensate_image_data(raw_data_dir=raw_img_dir, comp_data_dir=out_dir, \n", - " comp_mat_path=rosetta_mat_path, panel_info=panel, batch_size=1)" + "run_rosetta_dir = os.path.join(rosetta_image_dir, run_name)\n", + "if not os.path.exists(run_rosetta_dir):\n", + " os.makedirs(run_rosetta_dir)\n", + "\n", + "rosetta.compensate_image_data(raw_data_dir=run_extracted_dir, comp_data_dir=run_rosetta_dir, \n", + " comp_mat_path=rosetta_mat_path, panel_info=panel, batch_size=1)" ] } ], diff --git a/templates/4b_normalize_image_data.ipynb b/templates/4b_normalize_image_data.ipynb new file mode 100644 index 00000000..8ee53ee5 --- /dev/null +++ b/templates/4b_normalize_image_data.ipynb @@ -0,0 +1,158 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8e012798-3632-4107-8d5a-5f47e7671a9d", + "metadata": { + "tags": [] + }, + "source": [ + "## This notebook is an example: create a copy before running it or you will get merge conflicts!\n", + "\n", + "This notebook will walk you through the process of normalizing your image data. This notebook uses information about the sensitivity of the detector to calculate the correct normalization value for each channel in each image. Before running through the notebook, make sure you've completed section 3 of `1_set_up_toffy.ipynb`, which is used to create the necessary normalization curve. 
In addition, you should have already compensated your data with rosetta using `4_compensate_image_data.ipynb`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9751084-7855-4005-b152-55895ff40823", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "sys.path.append('../')\n", + "\n", + "import os\n", + "import pandas as pd\n", + "\n", + "from toffy import normalize\n", + "from ark.utils.io_utils import list_files, list_folders" + ] + }, + { + "cell_type": "markdown", + "id": "dbe49e0b-34c2-4f86-b786-403c24b2f678", + "metadata": {}, + "source": [ + "### You'll first need to specify the location of the relevant files to enable image normalization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f48127b8-84da-4763-9c44-01a5bde4c4ed", + "metadata": {}, + "outputs": [], + "source": [ + "# First specify the name of the run that you'll be normalizing\n", + "run_name = '20220101_run_to_be_processed'\n", + "\n", + "# Then provide the path to your panel\n", + "panel_path = 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\panel_files\\\\my_cool_panel.csv'\n", + "panel = pd.read_csv(panel_path)\n", + "\n", + "# These paths should point to the folders containing each step of the processing pipeline\n", + "bin_base_dir = 'D:\\\\Data'\n", + "rosetta_base_dir = 'D:\\\\Rosetta_Compensated_Images'\n", + "normalized_base_dir = 'D:\\\\Normalized_Images'\n", + "mph_base_dir = 'C:\\\\Users\\\\Customer.ION\\\\Documents\\\\mph_files'" + ] + }, + { + "cell_type": "markdown", + "id": "62142cdc-fb78-4c21-8cb2-77b3fb60d399", + "metadata": {}, + "source": [ + "### Within the defined directories, we'll specify the relevant run_dir based on the run_name provided above" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e323824-38a1-4c67-8088-a1725ecbc910", + "metadata": {}, + "outputs": [], + "source": [ + "# specify sub-folder for rosetta images\n", + "img_sub_folder = 'normalized'\n", + "\n", + "# create directory to hold normalized images\n", + "normalized_run_dir = os.path.join(normalized_base_dir, run_name)\n", + "if not os.path.exists(normalized_run_dir):\n", + " os.makedirs(normalized_run_dir)\n", + " \n", + "# create directory to hold associated processing files\n", + "mph_run_dir = os.path.join(mph_base_dir, run_name)\n", + "if not os.path.exists(mph_run_dir):\n", + " os.makedirs(mph_run_dir)" + ] + }, + { + "cell_type": "markdown", + "id": "748a2e7c-ecb0-421a-8192-d4bfaf889f37", + "metadata": {}, + "source": [ + "### Then, we'll loop over each FOV, generating the necessary normalization files if they weren't already created" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfe67691-1676-4c3a-93fc-491e3abd6a24", + "metadata": {}, + "outputs": [], + "source": [ + "# get all FOVs\n", + "fovs = list_folders(os.path.join(rosetta_base_dir, run_name))\n", + "\n", + "# loop over each FOV\n", + "for fov in fovs:\n", + " # generate mph values\n", + " mph_file_path = os.path.join(mph_run_dir, fov + '_pulse_heights.csv')\n", + " if not os.path.exists(mph_file_path):\n", + " normalize.write_mph_per_mass(base_dir=os.path.join(bin_base_dir, run_name), output_dir=mph_run_dir, \n", + " fov=fov, masses=panel['Mass'].values, start_offset=0.3, stop_offset=0)" + ] + }, + { + "cell_type": "markdown", + "id": "e60fc2eb-9072-4ffe-a302-7ddb175bf4e7", + "metadata": {}, + "source": [ + "### Finally, we'll normalize the images, and save them to the output folder" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"59f49779-5e77-4017-8e68-50afce126883", + "metadata": {}, + "outputs": [], + "source": [ + "normalize.normalize_image_data(img_dir=os.path.join(rosetta_base_dir, run_name), norm_dir=normalized_run_dir, pulse_heights_dir=mph_run_dir,\n", + " panel_info=panel, img_sub_folder=img_sub_folder)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/toffy/normalize.py b/toffy/normalize.py index 3e937f25..650f5f47 100644 --- a/toffy/normalize.py +++ b/toffy/normalize.py @@ -1,5 +1,5 @@ # adapted from https://machinelearningmastery.com/curve-fitting-with-python/ - +import copy import json import os import shutil @@ -9,9 +9,73 @@ from scipy.optimize import curve_fit import skimage.io as io import matplotlib.pyplot as plt +import natsort as ns import pandas as pd -from ark.utils import io_utils, load_utils +from seaborn import algorithms as algo +from seaborn.utils import ci + +from ark.utils import io_utils, load_utils, misc_utils +from mibi_bin_tools.bin_files import extract_bin_files, get_median_pulse_height +from mibi_bin_tools.panel_utils import make_panel + + +def write_counts_per_mass(base_dir, output_dir, fov, masses, start_offset=0.5, + stop_offset=0.5): + """Records the total counts per mass for the specified FOV + + Args: + base_dir (str): the directory containing the FOV + output_dir (str): the directory where the csv file will be saved + fov (str): the name of the fov to extract + masses (list): the list of masses to extract counts from + start_offset (float): beginning value for integrating mass peak + stop_offset (float): ending value for integrating mass peak + """ + + # create panel with extraction criteria + panel = make_panel(mass=masses, low_range=start_offset, high_range=stop_offset) + + array = extract_bin_files(data_dir=base_dir, out_dir=None, include_fovs=[fov], panel=panel, + intensities=False) + # we only care about pulse counts, not intensities + array = array.loc[fov, 'pulse', :, :, :] + channel_count = np.sum(array, axis=(0, 1)) + + # create df to hold output + fovs = np.repeat(fov, len(masses)) + out_df = pd.DataFrame({'mass': masses, + 'fov': fovs, + 'channel_count': channel_count}) + out_df.to_csv(os.path.join(output_dir, fov + '_channel_counts.csv'), index=False) + + +def write_mph_per_mass(base_dir, output_dir, fov, masses, start_offset=0.5, stop_offset=0.5): + """Records the mean pulse height (MPH) per mass for the specified FOV + + Args: + base_dir (str): the directory containing the FOV + output_dir (str): the directory where the csv file will be saved + fov (str): the name of the fov to extract + masses (list): the list of masses to extract MPH from + start_offset (float): beginning value for calculating mph values + stop_offset (float): ending value for calculating mph values + """ + # hold computed values + mph_vals = [] + + # compute pulse heights + panel = make_panel(mass=masses, target_name=masses, low_range=start_offset, + high_range=stop_offset) + for mass in masses: + mph_vals.append(get_median_pulse_height(data_dir=base_dir, fov=fov, channel=mass, + panel=panel)) + # create df to hold output + fovs = np.repeat(fov, len(masses)) + out_df = pd.DataFrame({'mass': 
masses, + 'fov': fovs, + 'pulse_height': mph_vals}) + out_df.to_csv(os.path.join(output_dir, fov + '_pulse_heights.csv'), index=False) def create_objective_function(obj_func): @@ -54,14 +118,16 @@ def exp(x, a, b, c, d): return objectives[obj_func] -def fit_calibration_curve(x_vals, y_vals, obj_func, plot_fit=False): +def fit_calibration_curve(x_vals, y_vals, obj_func, outliers=None, plot_fit=False, save_path=None): """Finds the optimal weights to fit the supplied values for the specified function Args: x_vals (list): the x values to be fit y_vals (list): the y value to be fit obj_func (str): the name of the function that will be fit to the data + outliers (tuple or None): optional tuple of ([x_coords], [y_coords]) to plot plot_fit (bool): whether or not to plot the fit of the function vs the values + save_path (str or None): location to save the plot of the fitted values Returns: list: the weights of the fitted function""" @@ -79,6 +145,13 @@ def fit_calibration_curve(x_vals, y_vals, obj_func, plot_fit=False): y_line = objective(x_line, *popt) plt.plot(x_line, y_line, '--', color='red') + if outliers is not None: + plt.scatter(outliers[0], outliers[1]) + + if save_path is not None: + plt.savefig(save_path) + plt.close() + return popt @@ -103,30 +176,25 @@ def pred_func(x): return pred_func -def combine_run_metrics(run_dir, file_prefix): +def combine_run_metrics(run_dir, substring): """Combines the specified metrics from different FOVs into a single file Args: run_dir (str): the directory containing the files - file_prefix (str): the prefix of the files to be combined""" + substring(str): the substring contained within the files to be combined""" - files = io_utils.list_files(run_dir, file_prefix) - bins = io_utils.list_files(run_dir, '.bin') + files = io_utils.list_files(run_dir, substring) # validate inputs - if len(bins) == 0: - raise ValueError('No bin files found in {}'.format(run_dir)) + if len(files) == 0: + raise ValueError('No files found in {}'.format(run_dir)) - if file_prefix + '_combined.csv' in files: - warnings.warn('removing previously generated ' - 'combined {} file in {}'.format(file_prefix, run_dir)) - os.remove(os.path.join(run_dir, file_prefix + '_combined.csv')) + if substring + '_combined.csv' in files: + warnings.warn('Removing previously generated ' + 'combined {} file in {}'.format(substring, run_dir)) + os.remove(os.path.join(run_dir, substring + '_combined.csv')) files = [file for file in files if 'combined' not in file] - if len(bins) != len(files): - raise ValueError('Mismatch between the number of bins and number ' - 'of {} files'.format(file_prefix)) - # collect all metrics files metrics = [] for file in files: @@ -138,11 +206,11 @@ def combine_run_metrics(run_dir, file_prefix): for i in range(1, len(metrics)): if len(metrics[i]) != base_len: raise ValueError('Not all {} files are the same length: file {} does not match' - 'file {}'.format(file_prefix, files[0], files[i])) + 'file {}'.format(substring, files[0], files[i])) metrics = pd.concat(metrics) - metrics.to_csv(os.path.join(run_dir, file_prefix + '_combined.csv'), index=False) + metrics.to_csv(os.path.join(run_dir, substring + '_combined.csv'), index=False) def combine_tuning_curve_metrics(dir_list): @@ -157,18 +225,13 @@ def combine_tuning_curve_metrics(dir_list): # create list to hold all extracted data all_dirs = [] - # loop through directories, and if present, multiple fovs within directories + # loop through each run folder for dir in dir_list: - # generate aggregated table if it doesn't already 
exist - for prefix in ['pulse_heights', 'channel_counts']: - if not os.path.exists(os.path.join(dir, prefix + '_combined.csv')): - combine_run_metrics(dir, prefix) - # combine tables together - pulse_heights = pd.read_csv(os.path.join(dir, 'pulse_heights_combined.csv')) - channel_counts = pd.read_csv(os.path.join(dir, 'channel_counts_combined.csv')) - combined = pulse_heights.merge(channel_counts, 'outer', on=['fovs', 'masses']) + pulse_heights = pd.read_csv(os.path.join(dir, 'fov-1-scan-1_pulse_heights.csv')) + channel_counts = pd.read_csv(os.path.join(dir, 'fov-1-scan-1_channel_counts.csv')) + combined = pulse_heights.merge(channel_counts, 'outer', on=['fov', 'mass']) if len(combined) != len(pulse_heights): raise ValueError("Pulse heights and channel counts must be generated for the same " @@ -184,69 +247,347 @@ def combine_tuning_curve_metrics(dir_list): all_data.reset_index() # create normalized counts column - subset = all_data[['channel_counts', 'masses']] - all_data['norm_channel_counts'] = subset.groupby('masses').transform(lambda x: (x / x.max())) + subset = all_data[['channel_count', 'mass']] + all_data['norm_channel_count'] = subset.groupby('mass').transform(lambda x: (x / x.max())) return all_data -def normalize_image_data(data_dir, output_dir, fovs, pulse_heights, panel_info, - norm_func_path, mph_func_type='poly_2', extreme_vals=(0.5, 1)): - """Normalizes image data based on median pulse height from the run and a tuning curve +def create_tuning_function(sweep_path, moly_masses=[92, 94, 95, 96, 97, 98, 100], + save_path=os.path.join('..', 'toffy', 'norm_func.json')): + """Creates a tuning curve for an instrument based on the provided moly sweep Args: - data_dir (str): directory with the image data - output_dir (str): directory where the normalized images will be saved - fovs (list or None): which fovs to include in normalization. 
If None, uses all fovs - pulse_heights (pd.DataFrame): pulse heights per mass per fov - panel_info (pd.DataFrame): mapping between channels and masses - norm_func_path (str): file containing the saved weights for the normalization function - mph_func_type (str): name of the function to use for fitting the mass vs mph curve - extreme_vals (tuple): determines the range for norm vals which will raise a warning + sweep_path (str): path to folder containing a detector sweep + moly_masses (list): list of masses to use for fitting the curve + save_path (str): path to save the weights of the tuning curve """ - # get FOVs to loop over - if fovs is None: - fovs = io_utils.list_folders(data_dir) + # get all folders from the sweep + sweep_fovs = io_utils.list_folders(sweep_path) + sweep_fov_paths = [os.path.join(sweep_path, fov) for fov in sweep_fovs] - # load calibration function - with open(norm_func_path, 'r') as cf: - norm_json = json.load(cf) + # compute pulse heights and channel counts for each FOV + for fov_path in sweep_fov_paths: + write_mph_per_mass(base_dir=fov_path, output_dir=fov_path, fov='fov-1-scan-1', + masses=moly_masses) + write_counts_per_mass(base_dir=fov_path, output_dir=fov_path, fov='fov-1-scan-1', + masses=moly_masses) - norm_weights, norm_name = norm_json['weights'], norm_json['name'] + # combine data together into single df + tuning_data = combine_tuning_curve_metrics(sweep_fov_paths) - channels = panel_info['targets'].values + # generate fitted curve + coeffs = fit_calibration_curve(tuning_data['pulse_height'].values, + tuning_data['norm_channel_count'].values, 'exp', plot_fit=True, + save_path=os.path.join(sweep_path, 'function_fit.jpg')) - # instantiate function which translates pulse height to a normalization constant - norm_func = create_prediction_function(norm_name, norm_weights) + # save the fitted curve + norm_json = {'name': 'exp', 'weights': coeffs.tolist()} + + with open(save_path, 'w') as sp: + json.dump(norm_json, sp) + + +def identify_outliers(x_vals, y_vals, obj_func, outlier_fraction=0.1): + """Finds the indices of outliers in the provided data to prune for subsequent curve fitting + + Args: + x_vals (np.array): the x values of the data being analyzed + y_vals (np.array): the y values of the data being analyzed + obj_func (str): the objective function to use for curve fitting to determine outliers + outlier_fraction (float): the fractional deviation from predicted value required in + order to classify a data point as an outlier + + Returns: + np.array: the indices of the identified outliers""" + + # get objective function + objective = create_objective_function(obj_func) + + # get fitted values + popt, _ = curve_fit(objective, x_vals, y_vals) + + # create generate function + func = create_prediction_function(name=obj_func, weights=popt) + + # generate predictions + preds = func(x_vals) + + # specify outlier bounds based on multiple of predicted value + upper_bound = preds * (1 + outlier_fraction) + lower_bound = preds * (1 - outlier_fraction) + + # identify outliers + outlier_mask = np.logical_or(y_vals > upper_bound, y_vals < lower_bound) + outlier_idx = np.where(outlier_mask)[0] + + return outlier_idx + + +def smooth_outliers(vals, outlier_idx, smooth_range=2): + """Performs local smoothing on the provided outliers - for fov in fovs: - output_fov_dir = os.path.join(output_dir, fov) + Args: + vals (np.array): the complete list of values to be smoothed + outlier_idx (np.array): the indices of the outliers in *vals* argument + smooth_range (int): the number 
of adjacent values in each direction to use for smoothing + + Returns: + np.array: the smoothed version of the provided vals""" + + smoothed_vals = copy.deepcopy(vals) + vals = np.array(vals) + + for outlier in outlier_idx: + previous_vals = smoothed_vals[(outlier - smooth_range):outlier] + + if outlier == len(vals): + # last value in list, can't average using subsequent values + subsequent_vals = [] + else: + # not the last value, we can use remaining values to get an estimate + subsequent_indices = np.arange(outlier + 1, len(vals)) + valid_subs_indices = [idx for idx in subsequent_indices if idx not in outlier_idx] + subsequent_indices = np.array(valid_subs_indices)[:smooth_range] + + # check to make sure there are valid subsequent indices + if len(subsequent_indices) > 0: + subsequent_vals = vals[subsequent_indices] + else: + subsequent_vals = np.array([]) + + new_val = np.mean(np.concatenate([previous_vals, subsequent_vals])) + smoothed_vals[outlier] = new_val + + return smoothed_vals + + +def fit_mass_mph_curve(mph_vals, mass, save_dir, obj_func, min_obs=10): + """Fits a curve for the MPH over time for the specified mass + + Args: + mph_vals (list): mph for each FOV in the run + mass (str or int): the mass being fit + save_dir (str): the directory to save the fit parameters + obj_func (str): the function to use for constructing the fit + min_obs (int): the minimum number of observations to fit a curve, otherwise uses median""" + + fov_order = np.linspace(0, len(mph_vals) - 1, len(mph_vals)) + save_path = os.path.join(save_dir, str(mass) + '_mph_fit.jpg') + + if len(mph_vals) > min_obs: + # find outliers in the MPH vals + outlier_idx = identify_outliers(x_vals=fov_order, y_vals=mph_vals, obj_func=obj_func) + + # replace with local smoothing around that point + smoothed_vals = smooth_outliers(vals=mph_vals, outlier_idx=outlier_idx) + + # if outliers identified, generate tuple to pass to plotting function + if len(outlier_idx) > 0: + outlier_x = fov_order[outlier_idx] + outlier_y = mph_vals[outlier_idx] + outlier_tup = (outlier_x, outlier_y) + else: + outlier_tup = None + + # fit curve + weights = fit_calibration_curve(x_vals=fov_order, y_vals=smoothed_vals, obj_func=obj_func, + outliers=outlier_tup, plot_fit=True, save_path=save_path) + + else: + # default to using the median instead for short runs with small number of FOVs + mph_median = np.median(mph_vals) + if obj_func == 'poly_2': + weight_len = 3 + elif obj_func == 'poly_3': + weight_len = 4 + else: + raise ValueError("Unsupported objective function provided: {}".format(obj_func)) + + # plot median + plt.axhline(y=mph_median, color='r', linestyle='-') + plt.plot(fov_order, mph_vals, '.') + plt.savefig(save_path) + plt.close() + + # all coefficients except intercept are 0 + weights = np.zeros(weight_len) + weights[-1] = mph_median + + mass_json = {'name': obj_func, 'weights': weights.tolist()} + mass_path = os.path.join(save_dir, str(mass) + '_norm_func.json') + + with open(mass_path, 'w') as mp: + json.dump(mass_json, mp) + + +def create_fitted_mass_mph_vals(pulse_height_df, obj_func_dir): + """Uses the mph curves for each mass to generate a smoothed mph estimate + + Args: + pulse_height_df (pd.DataFrame): contains the MPH value per mass for all FOVs + obj_func_dir (str): directory containing the curves generated for each mass + + Returns: + pd.DataFrame: updated dataframe with fitted version of each MPH value for each mass""" + + # get all masses + masses = np.unique(pulse_height_df['mass'].values) + + # create column to hold 
fitted values + pulse_height_df['pulse_height_fit'] = 0 + + # create x axis values + num_fovs = len(np.unique(pulse_height_df['fov'])) + fov_order = np.linspace(0, num_fovs - 1, num_fovs) + + for mass in masses: + # load channel-specific prediction function + mass_path = os.path.join(obj_func_dir, str(mass) + '_norm_func.json') + + with open(mass_path, 'r') as mp: + mass_json = json.load(mp) + + # compute predicted MPH + name, weights = mass_json['name'], mass_json['weights'] + pred_func = create_prediction_function(name=name, weights=weights) + pred_vals = pred_func(fov_order) + + # update df + mass_idx = pulse_height_df['mass'] == mass + pulse_height_df.loc[mass_idx, 'pulse_height_fit'] = pred_vals + + return pulse_height_df + + +def create_fitted_pulse_heights_file(pulse_height_dir, panel_info, norm_dir, mass_obj_func): + """Create a single file containing the pulse heights after fitting a curve per mass + + Args: + pulse_height_dir (str): path to directory containing pulse height csvs + panel_info (pd.DataFrame): the panel for this dataset + norm_dir (str): the directory where normalized images will be saved + mass_obj_func (str): the objective function used to fit the MPH over time per mass + + Returns: + pd.DataFrame: the combined pulse heights file""" + + # create variables for mass fitting + masses = panel_info['Mass'].values + fit_dir = os.path.join(norm_dir, 'curve_fits') + os.makedirs(fit_dir) + + # combine fov-level files together + combine_run_metrics(run_dir=pulse_height_dir, substring='pulse_heights') + pulse_height_df = pd.read_csv(os.path.join(pulse_height_dir, 'pulse_heights_combined.csv')) + + # order by FOV + ordering = ns.natsorted((pulse_height_df['fov'].unique())) + pulse_height_df['fov'] = pd.Categorical(pulse_height_df['fov'], + ordered=True, + categories=ordering) + pulse_height_df = pulse_height_df.sort_values('fov') + + # loop over each mass, and fit a curve for MPH over the course of the run + for mass in masses: + mph_vals = pulse_height_df.loc[pulse_height_df['mass'] == mass, 'pulse_height'].values + fit_mass_mph_curve(mph_vals=mph_vals, mass=mass, save_dir=fit_dir, + obj_func=mass_obj_func) + + # update pulse_height_df to include fitted mph values + pulse_height_df = create_fitted_mass_mph_vals(pulse_height_df=pulse_height_df, + obj_func_dir=fit_dir) + + return pulse_height_df + + +def normalize_fov(img_data, norm_vals, norm_dir, fov, channels, extreme_vals): + """Normalize a single FOV with provided normalization constants for each channel""" + + # create directory to hold normalized images + output_fov_dir = os.path.join(norm_dir, fov) + if os.path.exists(output_fov_dir): + print("output directory {} already exists, " + "data will be overwritten".format(output_fov_dir)) + else: os.makedirs(output_fov_dir) - # get images and pulse heights for current fov - images = load_utils.load_imgs_from_tree(data_dir, fovs=[fov], channels=channels, - dtype='float32') - fov_pulse_heights = pulse_heights.loc[pulse_heights['fov'] == fov, :] + # check if any values are outside expected range + extreme_mask = np.logical_or(norm_vals < extreme_vals[0], norm_vals > extreme_vals[1]) + if np.any(extreme_mask): + bad_channels = np.array(channels)[extreme_mask] + warnings.warn('The following channel(s) had an extreme normalization ' + 'value for fov {}. 
Manual inspection for accuracy is ' + 'recommended: {}'.format(fov, bad_channels)) + + # correct images and save + normalized_images = img_data / norm_vals.reshape((1, 1, 1, len(norm_vals))) - # fit a function to model pulse height as a function of mass - mph_weights = fit_calibration_curve(x_vals=fov_pulse_heights['masses'].values, - y_vals=fov_pulse_heights['mphs'].values, - obj_func=mph_func_type) + for idx, chan in enumerate(channels): + io.imsave(os.path.join(output_fov_dir, chan + '.tiff'), + normalized_images[0, :, :, idx], check_contrast=False) - # predict mph for each mass in the panel - mph_func = create_prediction_function(name=mph_func_type, weights=mph_weights) - mph_vals = mph_func(panel_info['masses'].values) + # save logs + log_df = pd.DataFrame({'channels': channels, + 'norm_vals': norm_vals}) + log_df.to_csv(os.path.join(output_fov_dir, 'normalization_coefs.csv'), index=False) - # predict normalization for each mph in the panel - norm_vals = norm_func(mph_vals) - if np.any(norm_vals < extreme_vals[0]) or np.any(norm_vals > extreme_vals[1]): - warnings.warn('The following FOV had an extreme normalization value. Manually ' - 'inspection for accuracy is recommended: fov {}'.format(fov)) +def normalize_image_data(img_dir, norm_dir, pulse_height_dir, panel_info, + img_sub_folder='', mass_obj_func='poly_2', extreme_vals=(0.4, 1.1), + norm_func_path=os.path.join('..', 'toffy', 'norm_func.json')): + """Normalizes image data based on median pulse height from the run and a tuning curve - normalized_images = images / norm_vals.reshape((1, 1, 1, len(channels))) + Args: + img_dir (str): directory with the image data + norm_dir (str): directory where the normalized images will be saved + pulse_height_dir (str): directory containing per-fov pulse heights + panel_info (pd.DataFrame): mapping between channels and masses + mass_obj_func (str): class of function to use for modeling MPH over time per mass + extreme_vals (tuple): determines the range for norm vals which will raise a warning + norm_func_path (str): file containing the saved weights for the normalization function + """ + + # error checks + if not os.path.exists(norm_func_path): + raise ValueError("No normalization function found. 
You will need to run " + "section 3 of the 1_set_up_toffy.ipynb notebook to generate the " + "necessary function before you can normalize your data") + + # create normalization function for mapping MPH to counts + with open(norm_func_path, 'r') as cf: + norm_json = json.load(cf) + + img_fovs = io_utils.list_folders(img_dir, 'fov') + + norm_weights, norm_name = norm_json['weights'], norm_json['name'] + norm_func = create_prediction_function(norm_name, norm_weights) - for idx, chan in enumerate(channels): - io.imsave(os.path.join(output_fov_dir, chan + '.tiff'), - normalized_images[0, :, :, idx], check_contrast=False) + # combine pulse heights together into single df + pulse_height_df = create_fitted_pulse_heights_file(pulse_height_dir=pulse_height_dir, + panel_info=panel_info, norm_dir=norm_dir, + mass_obj_func=mass_obj_func) + # add channel name to pulse_height_df + renamed_panel = panel_info.rename({'Mass': 'mass'}, axis=1) + pulse_height_df = pulse_height_df.merge(renamed_panel, how='left', on=['mass']) + pulse_height_df = pulse_height_df.sort_values('Target') + + # make sure FOVs used to construct tuning curve are same ones being normalized + pulse_fovs = np.unique(pulse_height_df['fov']) + misc_utils.verify_same_elements(image_data_fovs=img_fovs, pulse_height_csv_files=pulse_fovs) + + # loop over each fov + for fov in img_fovs: + # compute per-mass normalization constant + pulse_height_fov = pulse_height_df.loc[pulse_height_df['fov'] == fov, :] + norm_vals = norm_func(pulse_height_fov['pulse_height_fit'].values) + channels = pulse_height_fov['Target'].values + + # get images + images = load_utils.load_imgs_from_tree(img_dir, fovs=[fov], channels=channels, + dtype='float32', img_sub_folder=img_sub_folder) + + # normalize and save + normalize_fov(img_data=images, norm_vals=norm_vals, norm_dir=norm_dir, fov=fov, + channels=channels, extreme_vals=extreme_vals) diff --git a/toffy/normalize_test.py b/toffy/normalize_test.py index 476c513e..9bd8b3d3 100644 --- a/toffy/normalize_test.py +++ b/toffy/normalize_test.py @@ -1,20 +1,83 @@ import json -import os -import pytest +import shutil +import natsort import numpy as np +import os import pandas as pd +import pytest import tempfile +import xarray as xr from pytest_cases import parametrize_with_cases -from ark.utils import test_utils, load_utils +from ark.utils import test_utils, load_utils, io_utils from toffy import normalize import toffy.normalize_test_cases as test_cases parametrize = pytest.mark.parametrize +def mocked_extract_bin_file(data_dir, include_fovs, panel, out_dir, intensities): + mass_num = len(panel) + + base_img = np.ones((3, 4, 4)) + + all_imgs = [] + for i in range(1, mass_num + 1): + all_imgs.append(base_img * i) + + out_img = np.stack(all_imgs, axis=-1) + + out_img = np.expand_dims(out_img, axis=0) + + out_array = xr.DataArray(data=out_img, + coords=[ + [include_fovs[0]], + ['pulse', 'intensity', 'area'], + np.arange(base_img.shape[1]), + np.arange(base_img.shape[2]), + panel['Target'].values, + ], + dims=['fov', 'type', 'x', 'y', 'channel']) + return out_array + + +def mocked_pulse_height(data_dir, fov, panel, channel): + return channel * 2 + + +def test_write_counts_per_mass(mocker): + with tempfile.TemporaryDirectory() as temp_dir: + out_dir = os.path.join(temp_dir, 'out_dir') + os.makedirs(out_dir) + masses = [88, 89, 90] + expected_counts = [16 * i for i in range(1, len(masses) + 1)] + mocker.patch('toffy.normalize.extract_bin_files', mocked_extract_bin_file) + + normalize.write_counts_per_mass(base_dir=temp_dir, 
output_dir=out_dir, fov='fov1', + masses=masses) + output = pd.read_csv(os.path.join(out_dir, 'fov1_channel_counts.csv')) + assert len(output) == len(masses) + assert set(output['mass'].values) == set(masses) + assert set(output['channel_count'].values) == set(expected_counts) + + +def test_write_mph_per_mass(mocker): + with tempfile.TemporaryDirectory() as temp_dir: + out_dir = os.path.join(temp_dir, 'out_dir') + os.makedirs(out_dir) + masses = [88, 89, 90] + mocker.patch('toffy.normalize.get_median_pulse_height', mocked_pulse_height) + + normalize.write_mph_per_mass(base_dir=temp_dir, output_dir=out_dir, fov='fov1', + masses=masses) + output = pd.read_csv(os.path.join(out_dir, 'fov1_pulse_heights.csv')) + assert len(output) == len(masses) + assert set(output['mass'].values) == set(masses) + assert np.all(output['pulse_height'].values == output['mass'].values * 2) + + # TODO: move to toolbox repo once created def _make_blank_file(folder, name): with open(os.path.join(folder, name), 'w'): @@ -50,27 +113,24 @@ def test_create_prediction_function(obj_func, num_params): _ = pred_func(np.random.rand(10)) -@parametrize_with_cases('bins, metrics', cases=test_cases.CombineRunMetricFiles) -def test_combine_run_metrics(bins, metrics): +@parametrize_with_cases('metrics', cases=test_cases.CombineRunMetricFiles) +def test_combine_run_metrics(metrics): with tempfile.TemporaryDirectory() as temp_dir: - for bin_file in bins: - _make_blank_file(temp_dir, bin_file) - for metric in metrics: name, values_df = metric[0], pd.DataFrame(metric[1]) values_df.to_csv(os.path.join(temp_dir, name), index=False) - normalize.combine_run_metrics(temp_dir, 'example_metric') + normalize.combine_run_metrics(temp_dir, 'pulse_height') - combined_data = pd.read_csv(os.path.join(temp_dir, 'example_metric_combined.csv')) + combined_data = pd.read_csv(os.path.join(temp_dir, 'pulse_height_combined.csv')) - assert np.array_equal(combined_data.columns, ['column_1', 'column_2', 'column_3']) - assert len(combined_data) == len(bins) * 10 + assert np.array_equal(combined_data.columns, ['pulse_height', 'mass', 'fov']) + assert len(combined_data) == len(metrics) * 10 # check that previously generated combined file is removed with warning with pytest.warns(UserWarning, match='previously generated'): - normalize.combine_run_metrics(temp_dir, 'example_metric') + normalize.combine_run_metrics(temp_dir, 'pulse_height') # check that files with different lengths raises error name, bad_vals = metrics[0][0], pd.DataFrame(metrics[0][1]) @@ -78,21 +138,14 @@ def test_combine_run_metrics(bins, metrics): bad_vals.to_csv(os.path.join(temp_dir, name), index=False) with pytest.raises(ValueError, match='files are the same length'): - normalize.combine_run_metrics(temp_dir, 'example_metric') - os.remove(os.path.join(temp_dir, name)) - os.remove(os.path.join(temp_dir, 'example_1.bin')) - - # different number of bins raises error - os.remove(os.path.join(temp_dir, bins[3])) - with pytest.raises(ValueError, match='Mismatch'): - normalize.combine_run_metrics(temp_dir, 'example_metric') + normalize.combine_run_metrics(temp_dir, 'pulse_height') # empty directory raises error empty_dir = os.path.join(temp_dir, 'empty') os.makedirs(empty_dir) - with pytest.raises(ValueError, match='No bin files'): - normalize.combine_run_metrics(empty_dir, 'example_metric') + with pytest.raises(ValueError, match='No files'): + normalize.combine_run_metrics(empty_dir, 'pulse_height') @parametrize_with_cases('dir_names, mph_dfs, count_dfs', test_cases.TuningCurveFiles) @@ -106,78 
+159,297 @@ def test_combine_tuning_curve_metrics(dir_names, mph_dfs, count_dfs): for i in range(len(dir_names)): full_path = os.path.join(temp_dir, dir_names[i]) os.makedirs(full_path) - mph_dfs[i].to_csv(os.path.join(full_path, 'pulse_heights_combined.csv'), index=False) - all_mph.extend(mph_dfs[i]['mph']) + mph_dfs[i].to_csv(os.path.join(full_path, 'fov-1-scan-1_pulse_heights.csv'), + index=False) + all_mph.extend(mph_dfs[i]['pulse_height']) - count_dfs[i].to_csv(os.path.join(full_path, 'channel_counts_combined.csv'), + count_dfs[i].to_csv(os.path.join(full_path, 'fov-1-scan-1_channel_counts.csv'), index=False) - all_counts.extend(count_dfs[i]['channel_counts']) + all_counts.extend(count_dfs[i]['channel_count']) dir_paths.append(os.path.join(temp_dir, dir_names[i])) combined = normalize.combine_tuning_curve_metrics(dir_paths) # data may be in a different order due to matching dfs, but all values should be present - assert set(all_mph) == set(combined['mph']) - assert set(all_counts) == set(combined['channel_counts']) + assert set(all_mph) == set(combined['pulse_height']) + assert set(all_counts) == set(combined['channel_count']) saved_dir_names = [name.split('/')[-1] for name in np.unique(combined['directory'])] assert set(saved_dir_names) == set(dir_names) # check that normalized value is 1 for maximum in each channel - for mass in np.unique(combined['masses']): - subset = combined.loc[combined['masses'] == mass, :] - max = np.max(subset[['channel_counts']].values) - norm_vals = subset.loc[subset['channel_counts'] == max, 'norm_channel_counts'].values + for mass in np.unique(combined['mass']): + subset = combined.loc[combined['mass'] == mass, :] + max = np.max(subset[['channel_count']].values) + norm_vals = subset.loc[subset['channel_count'] == max, 'norm_channel_count'].values assert np.all(norm_vals == 1) -def test_normalize_image_data(): - with tempfile.TemporaryDirectory() as top_level_dir: - data_dir = os.path.join(top_level_dir, 'data_dir') - os.makedirs(data_dir) +def test_smooth_outliers(): - output_dir = os.path.join(top_level_dir, 'output_dir') - os.makedirs(output_dir) + # Check for outliers which are separated by smoothing_range + smooth_range = 2 - # make fake data for testing - fovs, chans = test_utils.gen_fov_chan_names(num_fovs=2, num_chans=10) - filelocs, data_xr = test_utils.create_paired_xarray_fovs( - data_dir, fovs, chans, img_shape=(10, 10), fills=True) + vals = np.arange(20, 40).astype('float') + outlier1 = np.random.randint(3, 7) + outlier2 = np.random.randint(9, 13) + outlier3 = np.random.randint(15, 18) + outliers = np.array([outlier1, outlier2, outlier3]) + smoothed_vals = normalize.smooth_outliers(vals=vals, outlier_idx=outliers, + smooth_range=smooth_range) - # weights of mph to norm const func: 0.1x + 0x^2 + 0.5 - weights = [0.1, 0, 0.5] - name = 'poly_2' - func_json = {'name': name, 'weights': weights} - func_path = os.path.join(top_level_dir, 'norm_func.json') + assert np.array_equal(vals, smoothed_vals) - with open(func_path, 'w') as fp: - json.dump(func_json, fp) + # check for outliers which are next to one another + outliers = np.array([5, 6]) + smoothed_vals = normalize.smooth_outliers(vals=vals, outlier_idx=outliers, + smooth_range=smooth_range) - masses = np.array(range(1, len(chans) + 1)) - panel_info_file = pd.DataFrame({'masses': masses, 'targets': chans}) + # 5th entry is two below, plus first two non-outliers above + smooth_5 = np.mean(np.concatenate([vals[3:5], vals[7:9]])) - # fov1 mass to mph function: line with 20% slope starting at 1.2 - 
mph_0 = masses * 0.2 + 1 - fov_0 = ['fov0'] * len(chans) + # 6th entry is two below (one original and one smoothed from previous step), plus two above + smooth_6 = np.mean(np.concatenate([vals[4:5], [smooth_5], vals[7:9]])) + np.array_equal(smoothed_vals[outliers], [smooth_5, smooth_6]) - # fov1 mass to mph function: line with 10% slope starting at 4 - mph_1 = masses * .1 + 4 - fov_1 = ['fov1'] * len(chans) + # check for outliers which are at the ends of the list + outliers = np.array([0, 19]) - pulse_heights = pd.DataFrame({'masses': np.concatenate([masses, masses]), - 'fov': fov_0 + fov_1, - 'mphs': np.concatenate([mph_0, mph_1])}) + smoothed_vals = normalize.smooth_outliers(vals=vals, outlier_idx=outliers, + smooth_range=smooth_range) + # first entry is the mean of two above it + outlier_0 = np.mean(vals[1:3]) - normalize.normalize_image_data(data_dir, output_dir, fovs=None, - pulse_heights=pulse_heights, panel_info=panel_info_file, - norm_func_path=func_path) + # second entry is mean of two below + outlier_18 = np.mean(vals[17:19]) + + assert np.allclose(smoothed_vals[outliers], np.array([outlier_0, outlier_18])) + + +def test_create_tuning_function(tmpdir, mocker): + # create directory to hold the sweep + sweep_dir = os.path.join(tmpdir, 'sweep_1') + os.makedirs(sweep_dir) + + # create individual runs each with a single FOV + for voltage in ['25V', '50V', '75V']: + run_dir = os.path.join(sweep_dir, '20220101_{}'.format(voltage)) + os.makedirs(run_dir) + os.makedirs(os.path.join(run_dir, 'fov-1-scan-1')) + + # mock functions that interact with bin files directly + mocker.patch('toffy.normalize.get_median_pulse_height', mocked_pulse_height) + mocker.patch('toffy.normalize.extract_bin_files', mocked_extract_bin_file) + + # define paths for generated outputs + save_path = os.path.join(tmpdir, 'norm_func.json') + plot_path = os.path.join(sweep_dir, 'function_fit.jpg') + + normalize.create_tuning_function(sweep_path=sweep_dir, save_path=save_path) + assert os.path.exists(save_path) + assert os.path.exists(plot_path) + + +def test_identify_outliers(): + # create dataset with specified outliers + y_vals = np.arange(10, 30) + x_vals = np.linspace(0, len(y_vals) - 1, len(y_vals)) + outlier_idx = [5, 10, 15] + y_vals[outlier_idx] = [7, 32, 12] + + pred_outliers = normalize.identify_outliers(x_vals=x_vals, y_vals=y_vals, obj_func='poly_2') + # check that outliers are correctly identified + assert np.array_equal(outlier_idx, pred_outliers) + + +@parametrize('min_obs', [5, 12]) +def test_fit_mass_mph_curve(tmpdir, min_obs): + # create random data with single outlier + mph_vals = np.random.randint(0, 3, 10) + np.arange(10) + mph_vals[4] = 12 + + mass_name = '88' + obj_func = 'poly_2' + + normalize.fit_mass_mph_curve(mph_vals=mph_vals, mass=mass_name, save_dir=tmpdir, + obj_func=obj_func, min_obs=min_obs) + + # make sure plot was created + plot_path = os.path.join(tmpdir, mass_name + '_mph_fit.jpg') + assert os.path.exists(plot_path) + + # make sure json with weights was created + weights_path = os.path.join(tmpdir, mass_name + '_norm_func.json') + + with open(weights_path, 'r') as wp: + mass_json = json.load(wp) - normalized = load_utils.load_imgs_from_tree(output_dir, fovs=fovs, channels=chans) + # load weights into prediction function + weights = mass_json['weights'] + pred_func = normalize.create_prediction_function(name=obj_func, weights=weights) - # compute expected multipliers for each mass in each fov - mults = [mph_array * weights[0] + weights[2] for mph_array in [mph_0, mph_1]] + # generate 
predictions + preds = pred_func(np.arange(10)) - for idx, mult in enumerate(mults): - mult = mult.reshape(1, 1, len(mult)) - assert np.allclose(data_xr.values[idx, :, :, :], - normalized.values[idx, :, :, :] * mult) + if min_obs == 5: + # check that prediction function generates unique output + assert len(np.unique(preds)) == len(preds) + else: + # check that prediction function generates same output for all + assert len(np.unique(preds)) == 1 + assert np.allclose(preds[0], np.median(mph_vals)) + + +def test_create_fitted_mass_mph_vals(tmpdir): + masses = ['88', '100', '120'] + fovs = ['fov1', 'fov2', 'fov3', 'fov4'] + obj_func = 'poly_2' + + # each mass has a unique multiplier for fitted function + mass_mults = [1, 2, 3] + + # create json for each channel + for mass_idx in range(len(masses)): + weights = [mass_mults[mass_idx], 0, 0] + mass_json = {'name': obj_func, 'weights': weights} + mass_path = os.path.join(tmpdir, masses[mass_idx] + '_norm_func.json') + + with open(mass_path, 'w') as mp: + json.dump(mass_json, mp) + + # create combined mph_df + pulse_height_list = np.random.rand(len(masses) * len(fovs)) + mass_list = np.tile(masses, len(fovs)) + fov_list = np.repeat(fovs, len(masses)) + + pulse_height_df = pd.DataFrame({'pulse_height': pulse_height_list, + 'mass': mass_list, 'fov': fov_list}) + + modified_df = normalize.create_fitted_mass_mph_vals(pulse_height_df=pulse_height_df, + obj_func_dir=tmpdir) + + # check that fitted values are correct multiplier of FOV + for mass_idx in range(len(masses)): + mass = masses[mass_idx] + mult = mass_mults[mass_idx] + + fov_order = np.linspace(0, len(fovs) - 1, len(fovs)) + fitted_vals = modified_df.loc[modified_df['mass'] == mass, 'pulse_height_fit'].values + + assert np.array_equal(fov_order * mult, fitted_vals) + + +@parametrize_with_cases('metrics', cases=test_cases.CombineRunMetricFiles) +def test_create_fitted_pulse_heights_file(tmpdir, metrics): + + # create metric files + pulse_dir = os.path.join(tmpdir, 'pulse_heights') + os.makedirs(pulse_dir) + for metric in metrics: + name, values_df = metric[0], pd.DataFrame(metric[1]) + values_df.to_csv(os.path.join(pulse_dir, name), index=False) + + panel = test_cases.panel + fovs = natsort.natsorted(test_cases.fovs) + + df = normalize.create_fitted_pulse_heights_file(pulse_height_dir=pulse_dir, panel_info=panel, + norm_dir=tmpdir, mass_obj_func='poly_3') + + # all four FOVs included + assert len(np.unique(df['fov'].values)) == 4 + + # FOVs are ordered in proper order + ordered_fovs = df.loc[df['mass'] == 10, 'fov'].values.astype('str') + assert np.array_equal(ordered_fovs, fovs) + + # fitted values are distinct from original + assert np.all(df['pulse_height'].values != df['pulse_height_fit']) + + +def test_normalize_fov(tmpdir): + # create image data + fovs, chans = test_utils.gen_fov_chan_names(num_fovs=1, num_chans=3) + _, data_xr = test_utils.create_paired_xarray_fovs( + tmpdir, fovs, chans, img_shape=(10, 10)) + + # create inputs + norm_vals = np.random.rand(len(chans)) + extreme_vals = (-1, 1) + norm_dir = os.path.join(tmpdir, 'norm_dir') + os.makedirs(norm_dir) + + # normalize fov + normalize.normalize_fov(img_data=data_xr, norm_vals=norm_vals, norm_dir=norm_dir, + fov=fovs[0], channels=chans, extreme_vals=extreme_vals) + + # check that normalized images were modified by correct amount + norm_imgs = load_utils.load_imgs_from_tree(norm_dir, channels=chans) + assert np.allclose(data_xr.values, norm_imgs.values * norm_vals) + + # check that log file has correct values + log_file = 
pd.read_csv(os.path.join(norm_dir, 'fov0', 'normalization_coefs.csv')) + assert np.array_equal(log_file['channels'], chans) + assert np.allclose(log_file['norm_vals'], norm_vals) + + # check that warning is raised for extreme values + with pytest.warns(UserWarning, match='inspection for accuracy is recommended'): + norm_vals[0] = 1.5 + normalize.normalize_fov(img_data=data_xr, norm_vals=norm_vals, norm_dir=norm_dir, + fov=fovs[0], channels=chans, extreme_vals=extreme_vals) + + +@parametrize_with_cases('metrics', cases=test_cases.CombineRunMetricFiles) +def test_normalize_image_data(tmpdir, metrics): + + # create directory of pulse height csvs + pulse_height_dir = os.path.join(tmpdir, 'pulse_height_dir') + os.makedirs(pulse_height_dir) + + for metric in metrics: + name, values_df = metric[0], pd.DataFrame(metric[1]) + values_df.to_csv(os.path.join(pulse_height_dir, name), index=False) + + # create directory with image data + img_dir = os.path.join(tmpdir, 'img_dir') + os.makedirs(img_dir) + + fovs, chans = test_cases.fovs, test_cases.channels + filelocs, data_xr = test_utils.create_paired_xarray_fovs( + img_dir, fovs, chans, img_shape=(10, 10)) + + # create mph norm func + weights = np.random.rand(3) + name = 'poly_2' + func_json = {'name': name, 'weights': weights.tolist()} + func_path = os.path.join(tmpdir, 'norm_func.json') + + with open(func_path, 'w') as fp: + json.dump(func_json, fp) + + # get panel + panel = test_cases.panel + + norm_dir = os.path.join(tmpdir, 'norm_dir') + os.makedirs(norm_dir) + + # normalize images + normalize.normalize_image_data(img_dir=img_dir, norm_dir=norm_dir, + pulse_height_dir=pulse_height_dir, panel_info=panel, + norm_func_path=func_path) + + assert np.array_equal(io_utils.list_folders(norm_dir, 'fov').sort(), fovs.sort()) + + # no normalization function + with pytest.raises(ValueError, match='section 3 of the 1_set_up_toffy'): + normalize.normalize_image_data(img_dir=img_dir, norm_dir=norm_dir, + pulse_height_dir=pulse_height_dir, panel_info=panel, + norm_func_path='bad_path') + + # mismatch between FOVs + shutil.rmtree(os.path.join(img_dir, fovs[0])) + shutil.rmtree(norm_dir) + os.makedirs(norm_dir) + with pytest.raises(ValueError, match='image data fovs'): + normalize.normalize_image_data(img_dir=img_dir, norm_dir=norm_dir, + pulse_height_dir=pulse_height_dir, panel_info=panel, + norm_func_path=func_path) diff --git a/toffy/normalize_test_cases.py b/toffy/normalize_test_cases.py index e1d848d0..640ed1e6 100644 --- a/toffy/normalize_test_cases.py +++ b/toffy/normalize_test_cases.py @@ -3,13 +3,16 @@ import numpy as np import pandas as pd -masses = np.arange(5, 20) +masses = np.arange(5, 15) +channels = ['chan_{}'.format(i) for i in range(len(masses))] +panel = pd.DataFrame({'Mass': masses, 'Target': channels}) +fovs = ['fov{}'.format(12 - i) for i in range(4)] class TuningCurveFiles: def case_default_combined_files(self): - dirs = ['dir_{}'.format(i) for i in range(1, 4)] + dirs = ['Detector_202{}v_2022-01-13_13-30-5{}'.format(i, i) for i in range(1, 4)] # create lists to hold dfs from each directory mph_dfs = [] @@ -18,24 +21,15 @@ def case_default_combined_files(self): for dir in dirs: # create lists to hold values from each fov in directory - mph_vals = [] - channel_counts = [] - fovs = [] - num_fovs = np.random.randint(1, 5) - - for i in range(1, num_fovs + 1): - # initialize random columns for each fov - mph_vals.extend(np.random.randint(1, 200, len(masses))) - channel_counts.extend(np.random.randint(3, 100, len(masses))) - fovs.extend(np.repeat(i, 
len(masses))) + mph_vals = np.random.randint(1, 200, len(masses)) + channel_counts = np.random.randint(3, 100, len(masses)) + fovs = np.repeat(1, len(masses)) # create dfs from current directory - mph_df = pd.DataFrame({'masses': np.tile(masses, num_fovs), - 'fovs': fovs, 'mph': mph_vals}) + mph_df = pd.DataFrame({'mass': masses, 'fov': fovs, 'pulse_height': mph_vals}) # count_df has fields in different order to check that matching is working - count_df = pd.DataFrame({'masses': np.tile(masses, num_fovs), - 'channel_counts': channel_counts, 'fovs': fovs}) + count_df = pd.DataFrame({'mass': masses, 'channel_count': channel_counts, 'fov': fovs}) mph_dfs.append(mph_df) count_dfs.append(count_df) @@ -43,18 +37,15 @@ def case_default_combined_files(self): return dirs, mph_dfs, count_dfs -class CombineRunMetricFiles(): +class CombineRunMetricFiles: def case_default_metrics(self): # create full directory of files - bins = [] metrics = [] - for i in range(1, 5): - bin_name = 'example_{}.bin'.format(i) - bins.append(bin_name) - metric_name = 'example_metric_{}.csv'.format(i) - metric_values = {'column_1': np.random.rand(10), - 'column_2': np.random.rand(10), - 'column_3': np.random.rand(10)} + for i in range(0, 4): + metric_name = 'pulse_heights_{}.csv'.format(i) + metric_values = {'pulse_height': np.random.rand(10), + 'mass': masses, + 'fov': [fovs[i]] * 10} metrics.append([metric_name, metric_values]) - return bins, metrics + return metrics
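
As a quick orientation, here is a minimal sketch of how the pieces added in this diff fit together, mirroring section 3 of `1_set_up_toffy.ipynb` and the new `4b_normalize_image_data.ipynb`. It only calls functions introduced in `toffy/normalize.py`; the run name, panel file, sweep folder, and drive layout are hypothetical placeholders that follow the default folders created in section 1 of the setup notebook.

```python
# Minimal usage sketch of the new normalization workflow (all names and paths are placeholders).
import os

import pandas as pd

from toffy import normalize
from ark.utils.io_utils import list_folders

# 1. One-time setup per instrument: fit the sensitivity tuning curve from a detector sweep.
#    By default the fitted weights are written to ../toffy/norm_func.json, as in the setup notebook.
sweep_path = 'C:\\Users\\Customer.ION\\Documents\\normalization_curve\\20220417_pmma'
normalize.create_tuning_function(sweep_path=sweep_path)

# 2. Per-run inputs (hypothetical run name and panel file).
run_name = '20220101_my_run'
panel = pd.read_csv('C:\\Users\\Customer.ION\\Documents\\panel_files\\my_cool_panel.csv')

bin_dir = os.path.join('D:\\Data', run_name)
rosetta_dir = os.path.join('D:\\Rosetta_Compensated_Images', run_name)
norm_dir = os.path.join('D:\\Normalized_Images', run_name)
mph_dir = os.path.join('C:\\Users\\Customer.ION\\Documents\\mph_files', run_name)

for folder in [norm_dir, mph_dir]:
    if not os.path.exists(folder):
        os.makedirs(folder)

# 3. Record the median pulse height per mass for every FOV in the run, skipping existing files.
for fov in list_folders(rosetta_dir):
    mph_path = os.path.join(mph_dir, fov + '_pulse_heights.csv')
    if not os.path.exists(mph_path):
        normalize.write_mph_per_mass(base_dir=bin_dir, output_dir=mph_dir, fov=fov,
                                     masses=panel['Mass'].values, start_offset=0.3, stop_offset=0)

# 4. Fit per-mass MPH curves over the run and normalize the rosetta-compensated images.
normalize.normalize_image_data(img_dir=rosetta_dir, norm_dir=norm_dir,
                               pulse_height_dir=mph_dir, panel_info=panel,
                               img_sub_folder='normalized')
```

Note that `normalize_image_data` reads the tuning curve from the same default `norm_func_path` that `create_tuning_function` writes, so the sweep fitting step only needs to be run once per instrument.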