All projects in the Cell Painting Gallery follow a stereotyped structure. The parent structure is as follows.

    cellpainting-gallery
    └── <project>
        └── <project-specific-nesting>
            ├── images
            └── workspace

- `<project>`: top level folder for the project. Keep the name short and simple, using only `[a-z0-9_]`.
- `<project-specific-nesting>`: additional nesting level that is typically an institution identifier. It can be anonymized (e.g. `s3://cellpainting-gallery/jump/` contains `source_1/`, `source_2/`, etc.). It should be present even if the data is from a single source (e.g. `s3://cellpainting-gallery/cpg0003-rosetta/` only contains `broad/`).
- `images`: all images and illumination correction functions
- `workspace`: everything else goes here
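
The parent structure above maps directly onto S3 prefixes. As a minimal sketch (the helper name `project_prefixes` is hypothetical and not part of any Gallery tooling), the two top-level folders of a project could be addressed like this:

```python
BUCKET = "cellpainting-gallery"


def project_prefixes(project: str, nesting: str) -> dict[str, str]:
    """Return the S3 prefixes of the two top-level folders of a project.

    `project` and `nesting` correspond to <project> and
    <project-specific-nesting> in the tree above.
    """
    base = f"s3://{BUCKET}/{project}/{nesting}"
    return {"images": f"{base}/images/", "workspace": f"{base}/workspace/"}


print(project_prefixes("jump", "source_4")["images"])
# prints: s3://cellpainting-gallery/jump/source_4/images/
```

The function only assembles strings; it does not check that the prefix exists in the bucket.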
The "completeness" of a project can be checked using this data validation script.

    cellpainting-gallery/
    └── <project>
        └── <project-specific-nesting>
            ├── images
            │   ├── YYYY_MM_DD_<batch-name>
            │   │   ├── illum
            │   │   │   ├── <plate-name>
            │   │   │   │   ├── <plate-name>_Illum<Channel>.npy
            │   │   │   │   └── <plate-name>_Illum<Channel>.npy
            │   │   │   └── <plate-name>
            │   │   └── images
            │   │       ├── <full-plate-name>
            │   │       └── <full-plate-name>
            │   └── YYYY_MM_DD_<batch-name>
            └── workspace

Within the `images` folder, there are `YYYY_MM_DD_<batch-name>` subfolders for each batch.
Each batch folder should start with the `YYYY_MM_DD` date on which image acquisition started (or your best guess thereof).
The rest of the batch folder name can be a simple ordinal (e.g. `YYYY_MM_DD_Batch1`) or more descriptive of its contents (e.g. `2020_01_02_TestPhalloidinConcentration`).
A single batch typically contains all of the plates that were imaged (or started acquisition) on that day.
However, to simplify project tracking and analysis, plates imaged on the same day are sometimes divided into multiple batches where each batch is a different experimental condition (e.g. `2020_01_02_LowPhalloidin` and `2020_01_02_HighPhalloidin`).
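
The batch naming rule can be checked mechanically. The sketch below is an illustration, not the Gallery's own validation script; the function name `parse_batch_folder` is hypothetical. It splits a batch folder name into its acquisition date and the free-form remainder, rejecting names that lack a valid `YYYY_MM_DD` prefix.

```python
import re
from datetime import date, datetime

# YYYY_MM_DD, an underscore, then a non-empty batch name.
BATCH_RE = re.compile(r"^(\d{4}_\d{2}_\d{2})_(.+)$")


def parse_batch_folder(name: str) -> tuple[date, str]:
    """Split 'YYYY_MM_DD_<batch-name>' into (acquisition date, batch name)."""
    match = BATCH_RE.match(name)
    if match is None:
        raise ValueError(f"not a YYYY_MM_DD_<batch-name> folder: {name!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    acquired = datetime.strptime(match.group(1), "%Y_%m_%d").date()
    return acquired, match.group(2)


acquired, label = parse_batch_folder("2020_01_02_LowPhalloidin")
print(acquired.isoformat(), label)
# prints: 2020-01-02 LowPhalloidin
```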
Within each `YYYY_MM_DD_<batch-name>` batch subfolder there is an `illum` and an `images` folder.
The `images` folder contains a `<full-plate-name>` folder for each plate imaged in that batch.
The structure beneath the `<full-plate-name>` folder depends on your imager, but it should contain all the images from the plate, and perhaps some other related metadata.
The `illum` folder contains a `<plate-name>` folder for each plate imaged in that batch.
The `<plate-name>` can match the `<full-plate-name>` or it can be truncated if the `<full-plate-name>` is long.
Note that the relationship between `<full-plate-name>` and `<plate-name>` needs to be immediately obvious and the `<plate-name>` still needs to be a unique identifier.
Additionally, the `<plate-name>` used in the `images` folder must match that used in the `workspace` folder.
Within each `<plate-name>` folder there are illumination correction functions for all channels imaged in that plate, as generated by CellProfiler.
The illumination correction functions are named `<plate-name>_Illum<Channel>.npy`.
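
As an illustration of the `<plate-name>_Illum<Channel>.npy` convention, the following sketch builds and parses such file names. Both helper names are hypothetical. Note that channel names may themselves contain underscores (e.g. `Brightfield_H`), so the parser splits on the literal `_Illum` marker rather than on underscores.

```python
import re

ILLUM_RE = re.compile(r"^(?P<plate>.+)_Illum(?P<channel>.+)\.npy$")


def illum_filename(plate: str, channel: str) -> str:
    """Build the file name of an illumination correction function."""
    return f"{plate}_Illum{channel}.npy"


def parse_illum_filename(filename: str) -> tuple[str, str]:
    """Return (plate name, channel) for an illumination function file name."""
    match = ILLUM_RE.match(filename)
    if match is None:
        raise ValueError(f"not an illumination function file: {filename!r}")
    return match.group("plate"), match.group("channel")


print(parse_illum_filename("BR00117035_IllumBrightfield_H.npy"))
# prints: ('BR00117035', 'Brightfield_H')
```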
Note that the `images` folder contains the raw images as they come off of the microscope.
Though images undergo manipulation before analysis (e.g. application of illumination correction functions), intermediate, processed images are not typically saved or uploaded.
However, all of the information necessary to replicate the manipulation should be found in the `<project-specific-nesting>` folder (this is typically just the illumination correction function).
For atypical experiments in which the images undergo more extensive manipulation and for which replicating those manipulations is challenging or prohibitive, additional folders of images may be uploaded.
Those folders will follow the format `images_<manipulation-description>` (e.g. `images_corrected_cropped`).
An example of what this looks like in practice is below.

    cellpainting-gallery
    └── jump
        ├── source_1
        ├── source_2
        ├── source_3
        └── source_4
            ├── images
            │   ├── 2021_04_26_Batch1
            │   │   ├── illum
            │   │   │   ├── BR00117035
            │   │   │   │   ├── BR00117035_IllumAGP.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield_H.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield_L.npy
            │   │   │   │   ├── BR00117035_IllumDNA.npy
            │   │   │   │   ├── BR00117035_IllumER.npy
            │   │   │   │   ├── BR00117035_IllumMito.npy
            │   │   │   │   └── BR00117035_IllumRNA.npy
            │   │   │   └── BR00117036
            │   │   └── images
            │   │       ├── BR00117035__2021-05-02T16_02_51-Measurement1
            │   │       └── BR00117036__2021-05-02T18_01_40-Measurement1
            │   └── 2021_05_31_Batch2
            └── workspace

- `jump` is the project folder. Note that it differs from most project names in that it doesn't start with `cpg`.
- `source_4` is the anonymized nesting folder, representing Broad's data. Note that there are multiple sources in this project, though a nesting folder is still required even if your project doesn't have multiple sources.
- `2021_04_26_Batch1` is the batch folder. Note that there are multiple batches of data acquired on different days in this project.
- There are two plates in this example. `BR00117035__2021-05-02T16_02_51-Measurement1` is the plate name as it comes off the microscope. This naming may differ with different microscopes and different acquisition configurations. `BR00117035` is the truncated plate name that we have given to `BR00117035__2021-05-02T16_02_51-Measurement1` that is used for naming the plate in the `illum` folder (and the `workspace` folder, discussed below).
- In the `illum` folder, within the `BR00117035` plate folder, there are 8 separate illumination correction functions, one for each of the 8 channels imaged in that plate (e.g. `BR00117035_IllumAGP.npy` is the correction function for the AGP channel).
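
The truncation in this example can be sketched in a few lines. The code assumes the `<plate>__<timestamp>-Measurement<N>` pattern seen above, which is specific to this microscope output and will not hold for every imager; the helper names are hypothetical. It also enforces the requirement that the truncated name remain a unique identifier.

```python
def short_plate_name(full_plate_name: str) -> str:
    """Return the part of the full plate name before the first '__'.

    Assumes names shaped like 'BR00117035__2021-05-02T16_02_51-Measurement1'.
    """
    plate, separator, _rest = full_plate_name.partition("__")
    if not separator or not plate:
        raise ValueError(f"unexpected plate folder name: {full_plate_name!r}")
    return plate


def map_short_to_full(full_plate_names: list[str]) -> dict[str, str]:
    """Map truncated plate names to full names, refusing duplicates."""
    mapping: dict[str, str] = {}
    for full in full_plate_names:
        short = short_plate_name(full)
        if short in mapping:
            raise ValueError(f"truncated plate name is not unique: {short!r}")
        mapping[short] = full
    return mapping


plates = map_short_to_full([
    "BR00117035__2021-05-02T16_02_51-Measurement1",
    "BR00117036__2021-05-02T18_01_40-Measurement1",
])
print(sorted(plates))
# prints: ['BR00117035', 'BR00117036']
```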
Let's look under the `workspace` folder.
Everything but images lives here.
These folders are produced when following the data processing steps in the Image-based Profiling Handbook.
Below are the minimally required top-level folders under `workspace`.
Note that some experiments may generate additional categories of data/metadata and these should be uploaded to the `workspace` folder in their own folder/s.

    cellpainting-gallery/
    └── jump
        └── source_4
            ├── images
            └── workspace
                ├── analysis
                ├── backend
                ├── load_data_csv
                ├── metadata
                └── profiles

- `analysis`: contains the CSV files and outline PNGs generated by CellProfiler
- `backend`: contains the single-cell SQLite files (one per plate) and the well-level aggregated profiles CSV files (also one per plate)
- `load_data_csv`: contains LoadData CSV files used by CellProfiler to process the data
- `metadata`: contains metadata files used to annotate the profiles
- `profiles`: contains a set of well-level profiles files (one set per plate). The set comprises different stages of the CSV files produced when running the profiling recipe, as well as other output.
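
For a local copy of a project, the presence of these five folders can be checked with a short script. This is only a sketch under the assumption that the data has been downloaded to disk; it is not the data validation script mentioned earlier, and `missing_workspace_folders` is a hypothetical name.

```python
import tempfile
from pathlib import Path

REQUIRED_WORKSPACE_FOLDERS = (
    "analysis",
    "backend",
    "load_data_csv",
    "metadata",
    "profiles",
)


def missing_workspace_folders(workspace: Path) -> list[str]:
    """Return the required top-level folders that are absent from `workspace`."""
    return [
        name
        for name in REQUIRED_WORKSPACE_FOLDERS
        if not (workspace / name).is_dir()
    ]


# Demo on a throwaway directory that lacks two of the required folders.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    for name in ("analysis", "backend", "metadata"):
        (workspace / name).mkdir(parents=True)
    print(missing_workspace_folders(workspace))
    # prints: ['load_data_csv', 'profiles']
```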
Examples of additional folders you may upload to `workspace` include:
- `assaydev` or `segmentation`: work used to test/optimize segmentation parameters
- `pipelines`: the CellProfiler .cppipe or .cpproj files used
- `software`: scripts used while handling the batch
- `qc`: quality control data
Within the `analysis` folder is a folder for each batch, and within each batch folder is a folder for each plate.
Within the plate folder is an additional `analysis` folder.
It is the only folder at this level; it is redundant and somewhat confusingly named, but we have kept it for legacy reasons.
Within the nested `analysis` folder, data is typically saved in `<plate>-<well>-<site>` subfolders with a .csv for each object measured (e.g. `Cells.csv`), for experimental details (`Experiment.csv`), and for whole image measurements (`Image.csv`) from that single site.
However, the grouping can vary depending on how the grouping was performed for the CellProfiler run (e.g. an experiment grouped by well instead of site would generate `<plate>-<well>` folders, with each .csv containing all of the data from the well).
Often there is an additional folder, such as `outlines` containing object outlines or `masks` containing object masks.

    └── analysis
        ├── 2021_04_26_Batch1
        │   ├── BR00117035
        │   │   └── analysis
        │   │       ├── BR00117035-A01-1
        │   │       │   ├── Cells.csv
        │   │       │   ├── Cytoplasm.csv
        │   │       │   ├── Experiment.csv
        │   │       │   ├── Image.csv
        │   │       │   ├── Nuclei.csv
        │   │       │   └── outlines
        │   │       │       ├── A01_s1--cell_outlines.png
        │   │       │       └── A01_s1--nuclei_outlines.png
        │   │       └── BR00117035-A01-2
        │   └── BR00117036
        └── 2021_05_31_Batch2

In this example batch:
- `2021_04_26_Batch1` is the batch and `BR00117035` is the plate
- `BR00117035-A01-1` is a folder containing CSV files and outline files for site `1` in well `A01` in plate `BR00117035`
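
A site folder name such as `BR00117035-A01-1` can be decomposed as sketched below. The sketch assumes the typical per-site grouping; folders grouped by well (`<plate>-<well>`) are deliberately rejected rather than guessed at. The names `SiteFolder` and `parse_site_folder` are hypothetical.

```python
from typing import NamedTuple


class SiteFolder(NamedTuple):
    plate: str
    well: str
    site: int


def parse_site_folder(name: str) -> SiteFolder:
    """Parse a '<plate>-<well>-<site>' analysis subfolder name."""
    # Split from the right so that a hyphen inside the plate name is tolerated.
    parts = name.rsplit("-", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"not a <plate>-<well>-<site> folder: {name!r}")
    plate, well, site = parts
    return SiteFolder(plate, well, int(site))


print(parse_site_folder("BR00117035-A01-1"))
# prints: SiteFolder(plate='BR00117035', well='A01', site=1)
```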
Within the `backend` folder is a folder for each batch, and within each batch folder is a folder for each plate.
Within each plate folder is a single-cell SQLite file, comprising all measurements from all cells in the plate, and a CSV that aggregates the single-cell data into per-well measurements.

    └── backend
        └── 2021_04_26_Batch1
            ├── BR00117035
            │   ├── BR00117035.csv
            │   └── BR00117035.sqlite
            └── BR00117036

In this example batch:
- `2021_04_26_Batch1` is the batch and `BR00117035` is the plate
- `BR00117035.sqlite` is the single-cell SQLite file
- `BR00117035.csv` is the aggregated CSV file
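
To see what a downloaded backend SQLite file contains without assuming anything about its schema, one option is to list its tables with Python's standard `sqlite3` module. This is a sketch: `list_tables` is a hypothetical helper, and the table names created in the demo are placeholders rather than a statement about the real files.

```python
import contextlib
import sqlite3
import tempfile
from pathlib import Path


def list_tables(sqlite_path: Path) -> list[str]:
    """Return the names of the tables in a SQLite file, sorted by name."""
    # Open read-only so a mistyped path raises instead of creating an empty file.
    uri = f"{sqlite_path.resolve().as_uri()}?mode=ro"
    with contextlib.closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo on a throwaway database; these table names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "BR00117035.sqlite"
    with contextlib.closing(sqlite3.connect(demo)) as conn:
        conn.execute("CREATE TABLE Image (ImageNumber INTEGER)")
        conn.execute("CREATE TABLE Cells (ImageNumber INTEGER, ObjectNumber INTEGER)")
        conn.commit()
    print(list_tables(demo))
    # prints: ['Cells', 'Image']
```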
Within the `load_data_csv` folder is a folder for each batch and within each batch folder is a folder for each plate.
Within the plate folder there are typically two files: a `load_data.csv` for pipelines that do not use an illumination correction function and a `load_data_with_illum.csv` for pipelines that do use an illumination correction function.
However, atypical workflows can have other arrangements, such as a separate CSV for each pipeline in the workflow.

    └── load_data_csv
        └── 2021_04_26_Batch1
            ├── BR00117035
            │   ├── load_data.csv
            │   └── load_data_with_illum.csv
            └── BR00117036

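
For the typical two-file arrangement, picking the right LoadData CSV for a pipeline can be sketched as follows. The helper name `load_data_csv_path` is hypothetical, and atypical workflows with other file arrangements are not covered.

```python
from pathlib import PurePosixPath


def load_data_csv_path(
    workspace: str, batch: str, plate: str, with_illum: bool
) -> PurePosixPath:
    """Return the LoadData CSV path for a plate in the typical layout."""
    filename = "load_data_with_illum.csv" if with_illum else "load_data.csv"
    return PurePosixPath(workspace) / "load_data_csv" / batch / plate / filename


print(load_data_csv_path("workspace", "2021_04_26_Batch1", "BR00117035", with_illum=True))
# prints: workspace/load_data_csv/2021_04_26_Batch1/BR00117035/load_data_with_illum.csv
```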
The `metadata` folder has a slightly different structure, as explained in the profiling recipe.

    └── metadata
        ├── external_metadata
        │   └── external_metadata.tsv
        └── platemaps
            └── 2021_04_26_Batch1
                ├── platemap
                │   └── OAA01.02.03.04.A.txt
                └── barcode_platemap.csv

Within the `profiles` folder is a folder for each batch and within each batch folder is a folder for each plate.
Within each plate folder are many files produced by the profiling-recipe that describe single-cell morphological profiles.
For a full description of the files, see profiling-recipe files generated.

    └── profiles
        └── 2021_04_26_Batch1
            ├── BR00117035
            │   ├── BR00117035.csv.gz
            │   ├── BR00117035_augmented.csv.gz
            │   ├── BR00117035_normalized.csv.gz
            │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
            │   ├── BR00117035_normalized_feature_select_plate.csv.gz
            │   └── BR00117035_normalized_negcon.csv.gz
            └── BR00117036

- `2021_04_26_Batch1` is the batch and `BR00117035` is the plate
- The .csv files undergo gzip compression to be .csv.gz files
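
Because the profiles are gzip-compressed CSVs, they can be read directly with Python's standard library without decompressing them first. In the sketch below, `read_profile_rows` is a hypothetical helper and the column names in the demo file are placeholders, not the real profile columns.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_profile_rows(path: Path) -> list[dict[str, str]]:
    """Read a .csv.gz file into a list of row dictionaries keyed by column name."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Demo on a throwaway file; the columns here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "BR00117035_normalized.csv.gz"
    with gzip.open(demo, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["well", "feature_a"])
        writer.writerow(["A01", "0.5"])
    print(read_profile_rows(demo))
    # prints: [{'well': 'A01', 'feature_a': '0.5'}]
```

All values come back as strings; convert numeric columns explicitly before doing arithmetic on them.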
The `quality_control` folder has a slightly different structure.
The files are all produced by the profiling-recipe.

    └── quality_control
        └── heatmap
            └── 2021_04_26_Batch1
                ├── BR00117035
                │   ├── BR00117035_cell_count.png
                │   ├── BR00117035_correlation.png
                │   ├── BR00117035_position_effect.png
                │   └── and possibly others
                └── BR00117036

- `2021_04_26_Batch1` is the batch and `BR00117035` is the plate
Here's the complete folder structure for a sample project.

    └── cellpainting-gallery
        └── jump
            └── source_4
                ├── images
                │   ├── 2021_04_26_Batch1
                │   │   ├── illum
                │   │   │   ├── BR00117035
                │   │   │   │   ├── BR00117035_IllumAGP.npy
                │   │   │   │   ├── BR00117035_IllumBrightfield.npy
                │   │   │   │   ├── BR00117035_IllumBrightfield_H.npy
                │   │   │   │   ├── BR00117035_IllumBrightfield_L.npy
                │   │   │   │   ├── BR00117035_IllumDNA.npy
                │   │   │   │   ├── BR00117035_IllumER.npy
                │   │   │   │   ├── BR00117035_IllumMito.npy
                │   │   │   │   └── BR00117035_IllumRNA.npy
                │   │   │   └── BR00117036
                │   │   └── images
                │   │       ├── BR00117035__2021-05-02T16_02_51-Measurement1
                │   │       └── BR00117036__2021-05-02T18_01_40-Measurement1
                │   └── 2021_05_31_Batch2
                └── workspace
                    ├── analysis
                    │   ├── 2021_04_26_Batch1
                    │   │   ├── BR00117035
                    │   │   │   └── analysis
                    │   │   │       ├── BR00117035-A01-1
                    │   │   │       │   ├── Cells.csv
                    │   │   │       │   ├── Cytoplasm.csv
                    │   │   │       │   ├── Image.csv
                    │   │   │       │   ├── Nuclei.csv
                    │   │   │       │   └── outlines
                    │   │   │       │       ├── A01_s1--cell_outlines.png
                    │   │   │       │       └── A01_s1--nuclei_outlines.png
                    │   │   │       └── BR00117035-A01-2
                    │   │   └── BR00117036
                    │   └── 2021_05_31_Batch2
                    ├── backend
                    │   └── 2021_04_26_Batch1
                    │       ├── BR00117035
                    │       │   ├── BR00117035.csv
                    │       │   └── BR00117035.sqlite
                    │       └── BR00117036
                    ├── load_data_csv
                    │   └── 2021_04_26_Batch1
                    │       ├── BR00117035
                    │       │   ├── load_data.csv.gz
                    │       │   └── load_data_with_illum.csv.gz
                    │       └── BR00117036
                    ├── metadata
                    │   ├── external_metadata
                    │   │   └── external_metadata.tsv
                    │   └── platemaps
                    │       └── 2021_04_26_Batch1
                    │           ├── platemap
                    │           │   └── OAA01.02.03.04.A.txt
                    │           └── barcode_platemap.csv
                    ├── quality_control
                    │   └── heatmap
                    │       └── 2021_04_26_Batch1
                    │           ├── BR00117035
                    │           │   ├── BR00117035_cell_count.png
                    │           │   ├── BR00117035_correlation.png
                    │           │   ├── BR00117035_position_effect.png
                    │           │   └── and possibly others
                    │           └── BR00117036
                    └── profiles
                        └── 2021_04_26_Batch1
                            ├── BR00117035
                            │   ├── BR00117035.csv.gz
                            │   ├── BR00117035_augmented.csv.gz
                            │   ├── BR00117035_normalized.csv.gz
                            │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
                            │   ├── BR00117035_normalized_feature_select_plate.csv.gz
                            │   ├── BR00117035_normalized_negcon.csv.gz
                            │   └── and others https://github.com/cytomining/profiling-recipe#files-generated
                            └── BR00117036
