Merge pull request #126 from BodenmillerGroup/develop
Allow handling of MCD files with missing channel labels
nilseling authored Mar 8, 2023

2 parents f318a90 + bbc329a commit 2609835
Showing 16 changed files with 396 additions and 271 deletions.
3 changes: 3 additions & 0 deletions .flake8
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 88
+extend-ignore = E203
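The 88-character limit matches black's default line length, and E203 ("whitespace before ':'") is ignored because black deliberately formats complex slices that way. A small illustrative snippet, not part of the repository, showing the kind of line this combination accepts:

```python
# Illustration only: black formats the slice below with spaces around ":",
# which flake8 would flag as E203 unless that check is ignored.
def trim(values: list, offset: int) -> list:
    return values[offset + 1 : len(values) - offset]
```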
12 changes: 6 additions & 6 deletions .github/workflows/docs.yml
@@ -1,6 +1,6 @@
-on:
-  push:
-    branches: [main]
+on:
+  push:
+    branches: [main]
   pull_request:
     branches: [main]

@@ -10,9 +10,9 @@ jobs:
   deploy:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
-      - uses: actions/setup-python@v2
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
         with:
-          python-version: 3.x
+          python-version: "3.x"
       - run: pip install mkdocs-material
       - run: mkdocs gh-deploy --force
2 changes: 2 additions & 0 deletions .isort.cfg
@@ -0,0 +1,2 @@
+[settings]
+profile=black
43 changes: 43 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,43 @@
+exclude: ^(\.vscode/.*|scripts/.*|mkdocs.yml|docs/.*)$
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.4.0
+    hooks:
+      - id: check-added-large-files
+      - id: check-case-conflict
+      - id: check-docstring-first
+      - id: check-executables-have-shebangs
+      - id: check-merge-conflict
+      - id: check-shebang-scripts-are-executable
+      - id: check-toml
+      - id: check-yaml
+      - id: debug-statements
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+  - repo: https://github.com/PyCQA/isort
+    rev: "5.12.0"
+    hooks:
+      - id: isort
+  - repo: https://github.com/PyCQA/autoflake
+    rev: v2.0.1
+    hooks:
+      - id: autoflake
+        args: [--in-place, --remove-all-unused-imports]
+  - repo: https://github.com/psf/black
+    rev: '23.1.0'
+    hooks:
+      - id: black
+  - repo: https://github.com/PyCQA/flake8
+    rev: "6.0.0"
+    hooks:
+      - id: flake8
+        additional_dependencies: [flake8-typing-imports]
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v0.991
+    hooks:
+      - id: mypy
+        additional_dependencies: [types-requests, types-PyYAML]
+ci:
+  autoupdate_branch: develop
+  skip: [flake8, mypy]
25 changes: 20 additions & 5 deletions CHANGELOG.md
@@ -1,5 +1,23 @@
 # Changelog

+## [3.6, 08-03-2023]
+
+- allow handling MCD files with missing channel label entries
+- updated links to raw data on Zenodo
+- switched from `MCDFile.metadata` to `MCDFile.schema_xml` to keep up with the latest version of `readimc`
+
+## [3.5, 07-11-2022]
+
+- exclude hidden files from processing
+
+## [3.4, 02-06-2022]
+
+- removed `tifffile` version pinning
+
+## [3.3, 27-04-2022]
+
+- fixed `tifffile` version
+
 ## [3.2]

 - sort channels by metal tag when creating the ilastik and full stacks
@@ -20,22 +38,19 @@
 - segmentation masks are directly written out to `cpout/masks` in the second pipeline and read in as objects in the last pipeline
 - pixel probabilities are downscaled in the second pipeline and directly written into `cpout/probabilites`
 - cell segmentation is performed on downscaled pixel probabilities

 ## [2.3]

 - Bugfixes: `1_prepare_ilastik`: Removed special characters from pipeline comments as this caused encoding issues.

 ## [2.1]

 - Bugfixes: `1_prepare_ilastik`: Fix range to 0-1 for mean image, preventing out of range errors

 ## [2.0]

 - Change to imctools v2: Changes the structure of the folder to the new format, changing the naming of the .ome.tiff files
 - Change to Cellprofiler v4: Requires the use of the ImcPluginsCP master branch or a release > v.4.1
 - Updated documentation
 - Adds var_Cells.csv containing metadata for the measurements
 - Adds panel to cpout folder
-
-
-
38 changes: 17 additions & 21 deletions README.md
@@ -3,14 +3,11 @@

 ## Introduction

-The pipeline is based on [CellProfiler](http://cellprofiler.org/) (tested v4.2.1) for segmentation and [Ilastik](http://ilastik.org/) (tested v1.3.3post3) for pixel classification.
-It is streamlined by using the `imcsegpipe` python package available via this repository as well as custom CellProfiler modules ([ImcPluginsCP](https://github.com/BodenmillerGroup/ImcPluginsCP), release v4.2.1).
+The pipeline is based on [CellProfiler](http://cellprofiler.org/) (tested v4.2.1) for segmentation and [Ilastik](http://ilastik.org/) (tested v1.3.3post3) for pixel classification. It is streamlined by using the `imcsegpipe` python package available via this repository as well as custom CellProfiler modules ([ImcPluginsCP](https://github.com/BodenmillerGroup/ImcPluginsCP), release v4.2.1).

-This repository showcases the basis of the workflow with step-by-step instructions.
-As an alternative and dockerized version of the pipeline, check out [steinbock](https://github.com/BodenmillerGroup/steinbock).
+This repository showcases the basis of the workflow with step-by-step instructions. As an alternative and dockerized version of the pipeline, check out [steinbock](https://github.com/BodenmillerGroup/steinbock).

-This pipeline was developed in the Bodenmiller laboratory at the University of Zurich ([www.bodenmillerlab.com](https://www.bodenmillerlab.com/)) to segment hundreds of highly multiplexed imaging mass cytometry (IMC) images.
-The concepts applied here to IMC data can also be transfered to data generated by other highly multiplexed imaging modalities.
+This pipeline was developed in the Bodenmiller laboratory at the University of Zurich ([www.bodenmillerlab.com](https://www.bodenmillerlab.com/)) to segment hundreds of highly multiplexed imaging mass cytometry (IMC) images. The concepts applied here to IMC data can also be transfered to data generated by other highly multiplexed imaging modalities.

 For a general overview on IMC as technology and data processing tasks, please refer to [bodenmillergroup.github.io/IMCWorkflow](https://bodenmillergroup.github.io/IMCWorkflow/).

@@ -22,13 +19,13 @@ Before being able to pre-process the data, you will need to setup the environmen

 1. [Install conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/)

-2. Clone the repository:
+2. Clone the repository:

 ```bash
 git clone --recursive https://github.com/BodenmillerGroup/ImcSegmentationPipeline.git
 ```

-3. Setup the conda environment:
+3. Setup the conda environment:

 ```bash
 cd ImcSegmentationPipeline
@@ -44,15 +41,14 @@ conda activate imcsegpipe
 jupyter-lab
 ```

-This will automatically open a jupyter instance at `http://localhost:8888/lab` in your browser.
-From there, you can open the `scripts/imc_preprocessing.ipynb` notebook and start the data pre-processing.
+This will automatically open a jupyter instance at `http://localhost:8888/lab` in your browser. From there, you can open the `scripts/imc_preprocessing.ipynb` notebook and start the data pre-processing.

 In brief, the main analysis steps include:

-1. Pre-processing of the raw images to create `.ome.tiffs` and `.tiff` stacks for ilastik training and measurement (python).
-2. Ilastik pixel classification based on random crops of the images (CellProfiler, Ilastik).
-3. Image segmentation based on the classification probabilities (CellProfiler).
-4. Measurement and export of cell-specific features, such as marker expression (CellProfiler).
+1. Pre-processing of the raw images to create `.ome.tiffs` and `.tiff` stacks for ilastik training and measurement (python).
+2. Ilastik pixel classification based on random crops of the images (CellProfiler, Ilastik).
+3. Image segmentation based on the classification probabilities (CellProfiler).
+4. Measurement and export of cell-specific features, such as marker expression (CellProfiler).

 ## Example data

@@ -69,21 +65,22 @@ The slides briefly explain why we chose this approach to image segmentation and
 ## Changelog

 For changes in specific releases, please refer to the [CHANGELOG](CHANGELOG.md).

 ## License

-We [freely share](LICENSE) this pipeline in the hope that it will be useful for others to perform high quality image segmentation and serve as a basis to develop more complicated open source IMC image processing workflows.
-In return we would like you to be considerate and give us and others feedback if you find a bug/issue and [raise a GitHub Issue](https://github.com/BodenmillerGroup/ImcSegmentationPipeline/issues) on the affected projects or on this page.
+We [freely share](LICENSE) this pipeline in the hope that it will be useful for others to perform high quality image segmentation and serve as a basis to develop more complicated open source IMC image processing workflows. In return we would like you to be considerate and give us and others feedback if you find a bug/issue and [raise a GitHub Issue](https://github.com/BodenmillerGroup/ImcSegmentationPipeline/issues) on the affected projects or on this page.

 ## Contributing

 To contribute to this work, please fork the repository, make changes to it and open a pull request.

 ## Contributors

-**Creator:** Vito Zanotelli
-**Contributor:** Jonas Windhager, Nils Eling
-**Maintainer:** Nils Eling
+**Creator:** Vito Zanotelli
+
+**Contributor:** Jonas Windhager, Nils Eling
+
+**Maintainer:** Nils Eling

 ## Citation

@@ -100,4 +97,3 @@ If you use this workflow for your research, please cite us:
 url = {https://doi.org/10.5281/zenodo.3841961}
 }
 ```
-
28 changes: 17 additions & 11 deletions docs/index.md
@@ -40,25 +40,31 @@ Furthermore, before running the analysis, you will need to setup a `conda` envir

 2. Clone the repository:

-```bash
-git clone --recursive https://github.com/BodenmillerGroup/ImcSegmentationPipeline.git
-```
+```
+git clone --recursive https://github.com/BodenmillerGroup/ImcSegmentationPipeline.git
+```
 3. Setup the conda environment:
-```bash
-cd ImcSegmentationPipeline
-conda env create -f environment.yml
-```
+```
+cd ImcSegmentationPipeline
+```
+```
+conda env create -f environment.yml
+```
 4. Configure CellProfiler to use the plugins by opening the CellProfiler GUI, selecting `Preferences` and setting the `CellProfiler plugins directory` to `path/to/ImcSegmentationPipeline/resources/ImcPluginsCP/plugins` and **restart CellProfiler**. Alternatively you can clone the `ImcPluginsCP` repository individually and set the path correctly in CellProfiler.
 5. Activate the environment created in 3. and start a jupyter instance
-```bash
-conda activate imcsegpipe
-jupyter-lab
-```
+```
+conda activate imcsegpipe
+```
+```
+jupyter-lab
+```
 This will automatically open a jupyter instance at `http://localhost:8888/lab` in your browser.
 From there, you can open the `scripts/imc_preprocessing.ipynb` notebook and start the data pre-processing.
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -22,11 +22,11 @@ nav:
   - Cell segmentation: segmentation.md
   - Cell measurement: measurement.md
   - Output files: output.md

 markdown_extensions:
   - footnotes
   - attr_list
   - md_in_html
   - pymdownx.emoji:
       emoji_index: !!python/name:materialx.emoji.twemoji
-      emoji_generator: !!python/name:materialx.emoji.to_svg
+      emoji_generator: !!python/name:materialx.emoji.to_svg
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,3 +1,3 @@
 [build-system]
-requires = ["setuptools", "wheel"]
+requires = ["setuptools>=64", "wheel"]
 build-backend = "setuptools.build_meta"
224 changes: 121 additions & 103 deletions scripts/download_examples.ipynb

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions scripts/download_examples.py
@@ -17,23 +17,23 @@
 for example_file_name, example_file_url in [
     (
         "Patient1.zip",
-        "https://zenodo.org/record/5949116/files/Patient1.zip",
+        "https://zenodo.org/record/7575859/files/Patient1.zip",
     ),
     (
         "Patient2.zip",
-        "https://zenodo.org/record/5949116/files/Patient2.zip",
+        "https://zenodo.org/record/7575859/files/Patient2.zip",
     ),
     (
         "Patient3.zip",
-        "https://zenodo.org/record/5949116/files/Patient3.zip",
+        "https://zenodo.org/record/7575859/files/Patient3.zip",
     ),
     (
         "Patient4.zip",
-        "https://zenodo.org/record/5949116/files/Patient4.zip",
+        "https://zenodo.org/record/7575859/files/Patient4.zip",
     ),
     (
         "panel.csv",
-        "https://zenodo.org/record/5949116/files/panel.csv",
+        "https://zenodo.org/record/7575859/files/panel.csv",
     )
 ]:
     example_file = raw_folder / example_file_name
@@ -48,7 +48,7 @@
 # Sample metadata
 sample_metadata = Path("..") / "sample_metadata.xlsx"
 if not sample_metadata.exists():
-    request.urlretrieve("https://zenodo.org/record/5949116/files/sample_metadata.xlsx", sample_metadata)
+    request.urlretrieve("https://zenodo.org/record/7575859/files/sample_metadata.csv", sample_metadata)

 # %%
 # !conda list
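For readers who want to fetch the example data outside the notebook, the download logic of this script boils down to the pattern sketched below. This is an illustrative condensation, not part of the repository: `raw_folder` stands in for the directory the script creates, and only files that are not already present locally are fetched from the updated Zenodo record.

```python
from pathlib import Path
from urllib import request

raw_folder = Path("analysis/raw")  # assumed location; the script defines its own raw_folder
raw_folder.mkdir(parents=True, exist_ok=True)

for name in ["Patient1.zip", "Patient2.zip", "Patient3.zip", "Patient4.zip", "panel.csv"]:
    target = raw_folder / name
    if not target.exists():  # skip files that were already downloaded
        request.urlretrieve(f"https://zenodo.org/record/7575859/files/{name}", target)
```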
232 changes: 135 additions & 97 deletions scripts/imc_preprocessing.ipynb

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions scripts/imc_preprocessing.py
@@ -134,9 +134,11 @@
             imcsegpipe.extract_zip_file(zip_file, temp_dir.name)
 acquisition_metadatas = []
 for raw_dir in raw_dirs + [Path(temp_dir.name) for temp_dir in temp_dirs]:
-    mcd_files = list(raw_dir.rglob("[!.]*.mcd"))
+    mcd_files = list(raw_dir.rglob("*.mcd"))
+    mcd_files=[(i) for i in mcd_files if not i.stem.startswith('.')]
     if len(mcd_files) > 0:
-        txt_files = list(raw_dir.rglob("[!.]*.txt"))
+        txt_files = list(raw_dir.rglob("*.txt"))
+        txt_files=[(i) for i in txt_files if not i.stem.startswith('.')]
         matched_txt_files = imcsegpipe.match_txt_files(mcd_files, txt_files)
         for mcd_file in mcd_files:
             acquisition_metadata = imcsegpipe.extract_mcd_file(
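The change above swaps the `[!.]*.mcd` glob for a plain `*.mcd` glob followed by an explicit filter on the file stem, so hidden files (for example macOS `._*` sidecar files) are still skipped while the matching itself stays simple. A self-contained sketch of the same idea; the `find_visible_files` helper is illustrative and not part of `imcsegpipe`:

```python
from pathlib import Path
from typing import List


def find_visible_files(raw_dir: Path, pattern: str) -> List[Path]:
    """Recursively collect files matching `pattern`, dropping hidden files
    whose stem starts with '.', e.g. macOS '._Patient1.mcd' sidecar files."""
    return [f for f in raw_dir.rglob(pattern) if not f.stem.startswith(".")]


# Mirrors the updated script:
# mcd_files = find_visible_files(raw_dir, "*.mcd")
# txt_files = find_visible_files(raw_dir, "*.txt")
```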
8 changes: 2 additions & 6 deletions setup.cfg
@@ -4,11 +4,11 @@ version = 1.0.0

 [options]
 zip_safe = True
-install_requires =
+install_requires =
     imageio
     numpy
     pandas
-    readimc
+    readimc>=0.6.2
     scipy
     tifffile
     xtiff>=0.7.8
@@ -19,7 +19,3 @@ packages = find:

 [options.packages.find]
 where = src
-
-[flake8]
-max-line-length = 88
-extend-ignore = E203
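The new `readimc>=0.6.2` pin matches the API change applied elsewhere in this commit, where the raw MCD metadata is read from `MCDFile.schema_xml` instead of the older `MCDFile.metadata` attribute. A hedged sketch of the newer usage ("example.mcd" is a placeholder path, not a file shipped with the repository):

```python
from readimc import MCDFile

# Illustration only: "example.mcd" is a placeholder path.
with MCDFile("example.mcd") as f:
    schema_xml = f.schema_xml  # full schema XML, previously exposed as `metadata`
    print(schema_xml[:200])
```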
4 changes: 0 additions & 4 deletions setup.py

This file was deleted.

24 changes: 17 additions & 7 deletions src/imcsegpipe/_imcsegpipe.py
@@ -28,7 +28,7 @@ def match_txt_files(
     mcd_files: Sequence[Union[str, PathLike]], txt_files: Sequence[Union[str, PathLike]]
 ) -> Dict[Union[str, PathLike], List[Path]]:
     unmatched_txt_files = list(txt_files)
-    matched_txt_files: Dict[Union[str, PathLike], List[Union[str, PathLike]]] = {}
+    matched_txt_files: Dict[Union[str, PathLike], List[Path]] = {}
     for mcd_file in sorted(mcd_files, key=lambda x: Path(x).stem, reverse=True):
         matched_txt_files[mcd_file] = []
         i = 0
@@ -80,7 +80,7 @@ def extract_mcd_file(
             acquisition_is_valid = _extract_acquisition(
                 f_mcd, acquisition, acquisition_img_file, acquisition_channels_file
             )
-            if not acquisition_is_valid:
+            if not acquisition_is_valid and txt_files is not None:
                 acquisition_txt_files = [
                     txt_file
                     for txt_file in txt_files
@@ -173,10 +173,13 @@ def export_to_histocat(
         histocat_img_dir.mkdir(exist_ok=True)
         for channel_index, row in acquisition_channels.iterrows():
             acquisition_channel_img: np.ndarray = acquisition_img[channel_index]
-            channel_label = re.sub("[^a-zA-Z0-9()]", "-", row["channel_label"])
             channel_name = row["channel_name"]
+            channel_label = row["channel_label"]
+            if not pd.isnull(channel_label) and not channel_label:
+                channel_label = re.sub("[^a-zA-Z0-9()]", "-", channel_label)
             tifffile.imwrite(
-                histocat_img_dir / f"{channel_label}_{channel_name}.tiff",
+                histocat_img_dir
+                / f"{channel_label or channel_name}_{channel_name}.tiff",
                 data=acquisition_channel_img,
                 imagej=True,
             )
@@ -197,7 +200,7 @@ def export_to_histocat(
 def _extract_schema(mcd_file_handle: MCDFile, schema_xml_file: Path) -> bool:
     try:
         with schema_xml_file.open("w") as f:
-            f.write(mcd_file_handle.metadata)
+            f.write(mcd_file_handle.schema_xml)
         return True
     except Exception as e:
         logging.error(
@@ -218,6 +221,7 @@ def _extract_slide(
         logging.error(
             f"Error reading slide {slide.id} from file {mcd_file_handle.path.name}: {e}"
         )
+        return False


 def _extract_panorama(
@@ -292,13 +296,19 @@ def _write_acquisition_image(
     acquisition_img_file: Path,
     acquisition_channels_file: Path,
 ) -> None:
+    channel_labels_or_names = [
+        channel_label or channel_name
+        for channel_name, channel_label in zip(
+            acquisition.channel_names, acquisition.channel_labels
+        )
+    ]
     xtiff.to_tiff(
         acquisition_img,
         acquisition_img_file,
         ome_xml_fun=get_acquisition_ome_xml,
-        channel_names=acquisition.channel_labels,
+        channel_names=channel_labels_or_names,
         channel_fluors=acquisition.channel_names,
-        xml_metadata=mcd_file_handle.metadata.replace("\r\n", ""),
+        xml_metadata=mcd_file_handle.schema_xml.replace("\r\n", ""),
     )
     pd.DataFrame(
         data={
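The `channel_label or channel_name` fallback above is the core of this commit: when an acquisition channel carries no label in the MCD file, its channel (metal) name is used instead, so OME-TIFF channel names and the per-acquisition channel tables never end up empty. A minimal, self-contained illustration with invented values:

```python
# Illustration only: the channel names and labels below are made up.
channel_names = ["Ir191", "Yb176", "Pt195"]
channel_labels = ["DNA1", None, ""]  # missing or empty labels occur in some MCD files

channel_labels_or_names = [
    channel_label or channel_name
    for channel_name, channel_label in zip(channel_names, channel_labels)
]
assert channel_labels_or_names == ["DNA1", "Yb176", "Pt195"]
```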
