Cell Painting Gallery folder structure

All projects in the Cell Painting Gallery follow a stereotyped folder structure. The parent structure is as follows.

cellpainting-gallery
└── <project>
    └── <project-specific-nesting>
        ├── images
        └── workspace
  • <project>: top level folder for the project. Keep the name short and simple with [a-z0-9_] only
  • <project-specific-nesting>: additional nesting level that is typically an institution identifier. It can be anonymized (e.g. s3://cellpainting-gallery/jump/ contains source_1/, source_2/, etc.). It should be present even if the data is from a single source (e.g. s3://cellpainting-gallery/cpg0003-rosetta/ only contains broad/).
  • images: all images and illumination correction functions
  • workspace: everything else goes here
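As a minimal sketch of how the parent structure maps onto storage paths, the helper below assembles the prefix of the images or workspace folder for a project. The function name `gallery_prefix` is hypothetical; only the bucket name and the example project come from the text above.

```python
BUCKET = "s3://cellpainting-gallery"  # bucket name as it appears in the examples above


def gallery_prefix(project: str, nesting: str, top_level: str) -> str:
    """Return the prefix of the images or workspace folder of a project.

    Hypothetical helper for illustration; it only assembles strings and
    does not contact S3.
    """
    if top_level not in ("images", "workspace"):
        raise ValueError(f"top_level must be 'images' or 'workspace', got {top_level!r}")
    return f"{BUCKET}/{project}/{nesting}/{top_level}/"


print(gallery_prefix("cpg0003-rosetta", "broad", "images"))
# s3://cellpainting-gallery/cpg0003-rosetta/broad/images/
```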

The "completeness" of a project can be checked using this data validation script.

images folder structure

cellpainting-gallery/
└── <project>
    └── <project-specific-nesting>
        ├── images
        │   ├── YYYY_MM_DD_<batch-name>
        │   │   ├── illum
        │   │   │   ├── <plate-name>
        │   │   │   │   ├── <plate-name>_Illum<Channel>.npy
        │   │   │   │   └── <plate-name>_Illum<Channel>.npy
        │   │   │   └── <plate-name>
        │   │   └── images
        │   │       ├── <full-plate-name>
        │   │       └── <full-plate-name>
        │   └── YYYY_MM_DD_<batch-name>
        └── workspace

Within the images folder, there are YYYY_MM_DD_<batch-name> subfolders for each batch. Each batch folder name should start with the date (YYYY_MM_DD) on which image acquisition started, or your best guess thereof. The rest of the batch folder name can be a simple ordinal (e.g. YYYY_MM_DD_Batch1) or more descriptive of its contents (e.g. 2020_01_02_TestPhalloidinConcentration). A single batch typically contains all of the plates that were imaged (or started acquisition) on that day. However, to simplify project tracking and analysis, plates imaged on the same day are sometimes divided into multiple batches where each batch is a different experimental condition (e.g. 2020_01_02_LowPhalloidin and 2020_01_02_HighPhalloidin).
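The batch naming convention can be checked mechanically. The following is a minimal sketch, assuming a batch name is a valid calendar date followed by an underscore and a non-empty label; `parse_batch_name` is a hypothetical helper, not part of any Cell Painting Gallery tooling.

```python
import re
from datetime import date

# YYYY_MM_DD_<batch-name>: three numeric groups, then a non-empty label
BATCH_RE = re.compile(r"^(\d{4})_(\d{2})_(\d{2})_(.+)$")


def parse_batch_name(name: str) -> tuple[date, str]:
    """Split a batch folder name into its acquisition date and label."""
    match = BATCH_RE.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not look like YYYY_MM_DD_<batch-name>")
    year, month, day, label = match.groups()
    # date() raises ValueError for impossible dates such as 2021_13_40
    return date(int(year), int(month), int(day)), label


acquired, label = parse_batch_name("2020_01_02_TestPhalloidinConcentration")
print(acquired.isoformat(), label)
# 2020-01-02 TestPhalloidinConcentration
```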

Within each YYYY_MM_DD_<batch-name> batch subfolder there is an illum and an images folder.

The images folder contains a <full-plate-name> folder for each plate imaged in that batch. The structure beneath the <full-plate-name> folder depends on your imager, but it should contain all the images from the plate, and perhaps some other related metadata.

The illum folder contains a <plate-name> folder for each plate imaged in that batch. The <plate-name> can match the <full-plate-name> or it can be truncated if the <full-plate-name> is long. Note that the relationship between <full-plate-name> and <plate-name> needs to be immediately obvious and the <plate-name> still needs to be a unique identifier. Additionally, the <plate-name> used in the illum folder must match that used in the workspace folder. Within each <plate-name> folder there are illumination correction functions for all channels imaged in that plate, as generated by CellProfiler. The illumination correction functions are named <plate-name>_Illum<Channel>.npy.
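To make the naming rule concrete, here is a minimal sketch that builds the relative path of one illumination correction function. The helper `illum_file`, the plate name `PlateA`, and the channel name `DNA` are hypothetical values chosen for illustration.

```python
from pathlib import PurePosixPath


def illum_file(batch: str, plate: str, channel: str) -> PurePosixPath:
    """Path of an illumination function, relative to <project-specific-nesting>."""
    # File name pattern from the text: <plate-name>_Illum<Channel>.npy
    return PurePosixPath("images", batch, "illum", plate, f"{plate}_Illum{channel}.npy")


print(illum_file("2020_01_02_LowPhalloidin", "PlateA", "DNA"))
# images/2020_01_02_LowPhalloidin/illum/PlateA/PlateA_IllumDNA.npy
```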

Note that the images folder contains the raw images as they come off of the microscope. Though images undergo manipulation before analysis (e.g. application of illumination correction functions), intermediate, processed images are not typically saved or uploaded. However, all of the information necessary to replicate the manipulation should be found in the <project-specific-nesting> folder (this is typically just the illumination correction function). For atypical experiments in which the images undergo more extensive manipulation and for which replicating those manipulations is challenging or prohibitive, additional folders of images may be uploaded. Those folders will follow the format images_<manipulation-description> (e.g. images_corrected_cropped).

An example of what this looks like in practice is below.

cellpainting-gallery
└── jump
    ├── source_1
    ├── source_2
    ├── source_3
    └── source_4
        ├── images
        │   ├── 2021_04_26_Batch1
        │   │   ├── illum
        │   │   │   ├── BR00117035
        │   │   │   │   ├── BR00117035_IllumAGP.npy
        │   │   │   │   ├── BR00117035_IllumBrightfield.npy
        │   │   │   │   ├── BR00117035_IllumBrightfield_H.npy
        │   │   │   │   ├── BR00117035_IllumBrightfield_L.npy
        │   │   │   │   ├── BR00117035_IllumDNA.npy
        │   │   │   │   ├── BR00117035_IllumER.npy
        │   │   │   │   ├── BR00117035_IllumMito.npy
        │   │   │   │   └── BR00117035_IllumRNA.npy
        │   │   │   └── BR00117036
        │   │   └── images
        │   │       ├── BR00117035__2021-05-02T16_02_51-Measurement1
        │   │       └── BR00117036__2021-05-02T18_01_40-Measurement1
        │   └── 2021_05_31_Batch2
        └── workspace
  • jump is the project folder. Note that it differs from most project names in that it doesn't start with cpg.
  • source_4 is the anonymized nesting folder, representing Broad's data. Note that there are multiple sources in this project, though a nesting folder is still required even if your project doesn't have multiple sources.
  • 2021_04_26_Batch1 is the batch folder. Note that there are multiple batches of data acquired on different days in this project.
  • There are two plates in this example. BR00117035__2021-05-02T16_02_51-Measurement1 is the plate name as it comes off the microscope. This naming may differ with different microscopes and different acquisition configurations.
  • BR00117035 is the truncated plate name that we have given to BR00117035__2021-05-02T16_02_51-Measurement1 that is used for naming the plate in the illum folder (and the workspace folder, discussed below).
  • In the illum folder, within the BR00117035 plate folder, there are 8 separate illumination correction functions, one for each of the 8 channels imaged in that plate (e.g. BR00117035_IllumAGP.npy is the correction function for the AGP channel).
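The relationship between full and truncated plate names in this example can be sketched as follows. This assumes the truncated name is everything before the first double underscore, which holds for the two plates shown here but may not hold for other microscopes or acquisition configurations. `build_plate_map` is a hypothetical helper; it also enforces the requirement that truncated names stay unique.

```python
def short_plate_name(full_plate_name: str) -> str:
    """Truncate a full plate name at the first double underscore.

    Assumption for this example only: other imagers may need a different rule.
    """
    return full_plate_name.split("__", 1)[0]


def build_plate_map(full_plate_names: list[str]) -> dict[str, str]:
    """Map each truncated plate name to its full name, rejecting duplicates."""
    plate_map: dict[str, str] = {}
    for full_name in full_plate_names:
        short_name = short_plate_name(full_name)
        if short_name in plate_map:
            raise ValueError(f"truncated plate name {short_name!r} is not unique")
        plate_map[short_name] = full_name
    return plate_map


plates = build_plate_map([
    "BR00117035__2021-05-02T16_02_51-Measurement1",
    "BR00117036__2021-05-02T18_01_40-Measurement1",
])
print(sorted(plates))
# ['BR00117035', 'BR00117036']
```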

workspace folder structure

Let's look under the workspace folder. Everything but images lives here. These folders are produced when following the data processing steps in the Image-based Profiling Handbook. Below are the minimally required top-level folders under workspace. Note that some experiments may generate additional categories of data or metadata, and these should be uploaded to the workspace folder in their own folders.

cellpainting-gallery/
└── jump
    └── source_4
        ├── images
        └── workspace
            ├── analysis
            ├── backend
            ├── load_data_csv
            ├── metadata
            └── profiles
  • analysis: contains the CSV files and outline PNGs generated by CellProfiler
  • backend: contains the single-cell SQLite files (one per plate) and the well-level aggregated profiles CSV files (also one per plate)
  • load_data_csv: contains LoadData CSV files used by CellProfiler to process the data
  • metadata: contains metadata files used to annotate the profiles
  • profiles: contains a set of well-level profiles files (one set per plate). The set comprises different stages of the CSV files produced when running the profiling recipe, as well as other output.
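A quick local check for the minimally required folders might look like the sketch below. It is an illustration only and is not the data validation script mentioned earlier; `missing_workspace_folders` is a hypothetical helper, and the demo builds a throwaway directory rather than touching real data.

```python
import tempfile
from pathlib import Path

# Minimally required top-level folders under workspace, as listed above
REQUIRED_WORKSPACE_FOLDERS = ("analysis", "backend", "load_data_csv", "metadata", "profiles")


def missing_workspace_folders(workspace: Path) -> list[str]:
    """Return the required folder names that are absent from a local workspace."""
    return [name for name in REQUIRED_WORKSPACE_FOLDERS if not (workspace / name).is_dir()]


# Demo on a temporary directory that deliberately lacks two folders
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    for name in ("analysis", "backend", "metadata"):
        (workspace / name).mkdir(parents=True)
    missing = missing_workspace_folders(workspace)

print(missing)
# ['load_data_csv', 'profiles']
```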

Examples of additional folders you may upload to workspace include:

  • assaydev or segmentation: work used to test/optimize segmentation parameters
  • pipelines: the CellProfiler .cppipe or .cpproj files used
  • software: scripts used while handling the batch
  • qc: quality control data

analysis folder structure:

Within the analysis folder is a folder for each batch and within each batch folder is a folder for each plate. Within the plate folder is an additional analysis folder. It is the only folder at this level; it is redundant and somewhat confusingly named but we have kept it for legacy reasons.

Within the nested analysis folder, data is typically saved in <plate>-<well>-<site> subfolders with a .csv for each object measured (e.g. Cells.csv) and for experimental details (Experiment.csv) and whole image measurements (Image.csv) from that single site. However, the grouping can vary depending on how the grouping was performed for the CellProfiler run (e.g. an experiment grouped by well instead of site would generate <plate>-<well> folders with the .csvs containing all of the data from the well in each .csv).

Often there is an additional folder, such as outlines (containing object outlines) or masks (containing object masks).

└── analysis
    ├── 2021_04_26_Batch1
    │   ├── BR00117035
    │   │   └── analysis
    │   │       ├── BR00117035-A01-1
    │   │       │   ├── Cells.csv
    │   │       │   ├── Cytoplasm.csv
    │   │       │   ├── Experiment.csv
    │   │       │   ├── Image.csv
    │   │       │   ├── Nuclei.csv
    │   │       │   └── outlines
    │   │       │       ├── A01_s1--cell_outlines.png
    │   │       │       └── A01_s1--nuclei_outlines.png
    │   │       └── BR00117035-A01-2
    │   └── BR00117036
    └── 2021_05_31_Batch2

In this example batch:

  • 2021_04_26_Batch1 is the batch and BR00117035 is the plate
  • BR00117035-A01-1 is a folder containing CSV files and outline files for site 1 in well A01 in plate BR00117035
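Folder names of the form <plate>-<well>-<site> can be split back into their parts. The sketch below assumes site-level grouping and a numeric site; well-level folders (<plate>-<well>) are rejected rather than guessed at. `SiteFolder` and `parse_site_folder` are hypothetical names.

```python
from typing import NamedTuple


class SiteFolder(NamedTuple):
    plate: str
    well: str
    site: int


def parse_site_folder(name: str) -> SiteFolder:
    """Parse a <plate>-<well>-<site> folder name."""
    # Split from the right so that a plate name containing hyphens stays intact
    parts = name.rsplit("-", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"{name!r} does not look like <plate>-<well>-<site>")
    plate, well, site = parts
    return SiteFolder(plate, well, int(site))


print(parse_site_folder("BR00117035-A01-1"))
# SiteFolder(plate='BR00117035', well='A01', site=1)
```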

backend folder structure:

Within the backend folder is a folder for each batch and within each batch folder is a folder for each plate. Within each plate folder is a single-cell SQLite file, comprising all measurements from all cells in the plate, and a CSV that aggregates the single-cell data into a per-well measurement.

└── backend
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv
        │   └── BR00117035.sqlite
        └── BR00117036

In this example batch:

  • 2021_04_26_Batch1 is the batch and BR00117035 is the plate
  • BR00117035.sqlite is the single-cell SQLite file
  • BR00117035.csv is the aggregated CSV file
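The table layout inside the single-cell SQLite file is not described here, so a reasonable first step after downloading one is to list its tables instead of assuming their names. A minimal sketch using Python's standard sqlite3 module follows; `list_tables` is a hypothetical helper, and the demo creates a throwaway database with a made-up table so the example runs without any gallery data.

```python
import os
import sqlite3
import tempfile


def list_tables(sqlite_path: str) -> list[str]:
    """Return the table names stored in a SQLite file, sorted alphabetically."""
    # sqlite3.connect would silently create a missing file, so check first
    if not os.path.isfile(sqlite_path):
        raise FileNotFoundError(sqlite_path)
    connection = sqlite3.connect(sqlite_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [row[0] for row in rows]


# Demo with a temporary database; "example_table" is a made-up name
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sqlite")
    demo = sqlite3.connect(demo_path)
    demo.execute("CREATE TABLE example_table (id INTEGER)")
    demo.commit()
    demo.close()
    tables = list_tables(demo_path)

print(tables)
# ['example_table']
```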

load_data_csv folder structure:

Within the load_data_csv folder is a folder for each batch and within each batch folder is a folder for each plate. Within the plate folder there are typically two files: a load_data.csv for pipelines that do not use an illumination correction function and a load_data_with_illum.csv for pipelines that do. Atypical workflows can have other arrangements, such as a separate CSV for each pipeline in the workflow.

└── load_data_csv
     └── 2021_04_26_Batch1
         ├── BR00117035
         │   ├── load_data.csv
         │   └── load_data_with_illum.csv
         └── BR00117036

metadata folder structure:

The metadata folder has a slightly different structure, as explained in the profiling recipe.

└── metadata
     ├── external_metadata
     │   └── external_metadata.tsv
     └── platemaps
         └── 2021_04_26_Batch1
             ├── platemap
             │   └── OAA01.02.03.04.A.txt
             └── barcode_platemap.csv

profiles folder structure:

Within the profiles folder is a folder for each batch and within each batch folder is a folder for each plate. Within each plate folder are many files produced by the profiling-recipe that describe well-level morphological profiles. For a full description of the files, see profiling-recipe files generated.

└── profiles
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv.gz
        │   ├── BR00117035_augmented.csv.gz
        │   ├── BR00117035_normalized.csv.gz
        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
        │   └── BR00117035_normalized_negcon.csv.gz
        └── BR00117036
  • 2021_04_26_Batch1 is the batch and BR00117035 is the plate
  • The .csv files are gzip-compressed, producing .csv.gz files
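Because the profiles are gzip-compressed CSVs, they can be read directly without decompressing them to disk first. The sketch below uses only the Python standard library. `read_profile_rows` is a hypothetical helper, and the column names in the demo file (`Metadata_Well`, `Feature_1`) are made up for illustration and are not taken from a real profile.

```python
import csv
import gzip
import os
import tempfile


def read_profile_rows(path: str) -> list[dict[str, str]]:
    """Read a gzip-compressed CSV into a list of row dictionaries."""
    # Text mode ("rt") lets the csv module consume the decompressed stream
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Demo with a tiny stand-in file; the columns are hypothetical
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_profile.csv.gz")
    with gzip.open(demo_path, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Metadata_Well", "Feature_1"])
        writer.writerow(["A01", "0.5"])
    rows = read_profile_rows(demo_path)

print(rows)
# [{'Metadata_Well': 'A01', 'Feature_1': '0.5'}]
```

All values come back as strings with this approach; numeric feature columns would need converting before analysis.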

quality_control folder structure:

The quality_control folder has a slightly different structure. The files are all produced by the profiling-recipe.

└── quality_control
    └── heatmap
        └── 2021_04_26_Batch1
            ├── BR00117035
            │   ├── BR00117035_cell_count.png
            │   ├── BR00117035_correlation.png
            │   ├── BR00117035_position_effect.png
            │   └── and possibly others
            └── BR00117036
  • 2021_04_26_Batch1 is the batch and BR00117035 is the plate

Complete folder structure

Here's the complete folder structure for a sample project.

 └── cellpainting-gallery
    └── jump
        └── source_4
            ├── images
            │   ├── 2021_04_26_Batch1
            │   │   ├── illum
            │   │   │   ├── BR00117035
            │   │   │   │   ├── BR00117035_IllumAGP.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield_H.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield_L.npy
            │   │   │   │   ├── BR00117035_IllumDNA.npy
            │   │   │   │   ├── BR00117035_IllumER.npy
            │   │   │   │   ├── BR00117035_IllumMito.npy
            │   │   │   │   └── BR00117035_IllumRNA.npy
            │   │   │   └── BR00117036
            │   │   └── images
            │   │       ├── BR00117035__2021-05-02T16_02_51-Measurement1
            │   │       └── BR00117036__2021-05-02T18_01_40-Measurement1
            │   └── 2021_05_31_Batch2
            └── workspace
                ├── analysis
                │   ├── 2021_04_26_Batch1
                │   │   ├── BR00117035
                │   │   │   └── analysis
                │   │   │       ├── BR00117035-A01-1
                │   │   │       │   ├── Cells.csv
                │   │   │       │   ├── Cytoplasm.csv
                │   │   │       │   ├── Experiment.csv
                │   │   │       │   ├── Image.csv
                │   │   │       │   ├── Nuclei.csv
                │   │   │       │   └── outlines
                │   │   │       │       ├── A01_s1--cell_outlines.png
                │   │   │       │       └── A01_s1--nuclei_outlines.png
                │   │   │       └── BR00117035-A01-2
                │   │   └── BR00117036
                │   └── 2021_05_31_Batch2
                ├── backend
                │   └── 2021_04_26_Batch1
                │       ├── BR00117035
                │       │   ├── BR00117035.csv
                │       │   └── BR00117035.sqlite
                │       └── BR00117036
                ├── load_data_csv
                │   └── 2021_04_26_Batch1
                │       ├── BR00117035
                │       │   ├── load_data.csv
                │       │   └── load_data_with_illum.csv
                │       └── BR00117036
                ├── metadata
                │   ├── external_metadata
                │   │   └── external_metadata.tsv
                │   └── platemaps
                │       └── 2021_04_26_Batch1
                │           ├── platemap
                │           │   └── OAA01.02.03.04.A.txt
                │           └── barcode_platemap.csv
                ├── quality_control
                │   └── heatmap
                │       └── 2021_04_26_Batch1
                │           ├── BR00117035
                │           │   ├── BR00117035_cell_count.png
                │           │   ├── BR00117035_correlation.png
                │           │   ├── BR00117035_position_effect.png
                │           │   └── and possibly others
                │           └── BR00117036
                └── profiles
                    └── 2021_04_26_Batch1
                        ├── BR00117035
                        │   ├── BR00117035.csv.gz
                        │   ├── BR00117035_augmented.csv.gz
                        │   ├── BR00117035_normalized.csv.gz
                        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
                        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
                        │   ├── BR00117035_normalized_negcon.csv.gz
                        │   └── and others https://github.com/cytomining/profiling-recipe#files-generated
                        └── BR00117036