The steps below are outlined specifically for running Distributed CellProfiler on AWS. If you are not doing so, the steps will still be more-or-less the same
- Make sure CellProfiler pipelines are accessible
- Make sure CellProfiler knows where the input images are, either via a CSV or a batch file
- Run each CellProfiler pipeline, in sequence, with appropriate input folder, output folder, and grouping
In your project's workspace directory, create a batch specific folder and upload your pipelines there. If there are previous batches from the same project or a similar one, you may find it easiest to copy the files directly. Once uploaded and/or copied, the file structure should look like the below.
└── pipelines
└── 2016_04_01_a549_48hr_batch1
├── illum_without_batchfile.cppipe
└── analysis_without_batchfile.cppipe
Note that [`run_batch_general`](https://github.com/CellProfiler/Distributed-CellProfiler/blob/master/run_batch_general.py) is not required; the Distributed CellProfiler [handbook](https://github.com/CellProfiler/Distributed-CellProfiler/wiki/Step-2%3A-Submit-jobs) lays out a number of different ways of creating jobs. However, we find it the most efficient way to run numerous pipelines on the same data. If you do not wish to use it, you can adjust steps 3 and 4 in the "Run each CellProfiler step" to "Create a job file" and "Execute `python3 run.py submitJob jobFileName.json`"
run_batch_general.py
can be configured once at the beginning of the run of a batch of data, and then can be run for each step simply by uncommenting the name of the step to run. The following variables in the project specific stuff
section of the script should be configured:
topdirname
andbatchsuffix
should match yourPROJECT_NAME
andBATCH_ID
, respectivelyappname
is typically the same astopdirname
, but if that name is long and cumbersome you can create an abbreviated version here (ie2015_10_05_DrugRepurposing
rather than2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
). This will be used in yourconfig.py
filerows
,columns
, andsites
should reflect the imaging conditions usedplatelist
should contain a list of plates, comma separated, ie['SQ00015167','SQ00015168']
- If you are using pipeline files with the LoadData module and CSVs, you should make sure that the pipeline names reflect your pipeline names (or adjust if not). Otherwise, you should make sure that the batch file names reflect your batch file names.
- If following the recommended structures and procedures, none of the
not project specific
section of the script should need to be adapted, but if you are making changes you may need to.
If running in a fresh clone of Distributed-CellProfiler, you will need to configure a single fleet file, which will be used in all subsequent steps. Refer to the manual for instructions.
If running in a fresh clone of Distributed-CellProfiler, you will need to set the AWS_REGION
, SSH_KEY_NAME
, AWS_BUCKET
, and SQS_DEAD_LETTER_QUEUE
settings to appropriate settings for your account. Refer to the manual for instructions.
You may have as many as 5 or as few as 2 CellProfiler steps
- (optional) Z projection
- (optional) QC - see also section 4.6
- illumination correction
- (optional) assay development - see also section 4.6
- analysis
For each step, the steps you will run will be identical:
- Configure the
config.py
file - Execute
python3 run.py setup
- Uncomment the correct step name in your
run_batch_general.py
file (and ensure all other steps are commented out) - Execute
python3 run_batch_general.py
- Execute
python3 run.py startCluster files/yourFleetFileName.json
, where you have set the name of the fleet file previously created or located - Execute
python3 run.py monitor files/APP_NAMESpotFleetRequestId.json
, where APP_NAME matches the APP_NAME variable set in step 1.
Information on all of these steps is available in the Distributed-CellProfiler wiki.
You need only absolutely change the variables stated above and below for Distributed-CellProfiler to function, but other variables may be useful, such as using a non-default profile, adjusting whether or not you would like to pre-download files and/or use plugins and/or restart only parts of a batch of data.
In general, as long as you are running inside a tmux session and it isn't killed, the monitor should destroy any and all infrastructure created on AWS as part of the running Distributed-CellProfiler, but it is the user's responsibility to check that this has completed appropriately; failure to do so may lead to spot fleets generating charges after all useful work has completed.
- Your
APP_NAME
variable should be set to theappname
set inrun_batch_general.py
plus_Zproj
, ie2015_10_05_DrugRepurposing_Zproj
- Your number of
CLUSTER_MACHINES
should be medium-large, ie a hundred or few hundred. - Your
SQS_MESSAGE_VISIBILITY
should be short, such as5*60
(5 minutes)
- Your
APP_NAME
variable should be set to theappname
set inrun_batch_general.py
plus_QC
, ie2015_10_05_DrugRepurposing_QC
- Your number of
CLUSTER_MACHINES
should be medium-large, ie a hundred or few hundred. - Your
SQS_MESSAGE_VISIBILITY
should be short, such as5*60
(5 minutes)
- Your
APP_NAME
variable should be set to theappname
set inrun_batch_general.py
plus_Illum
, ie2015_10_05_DrugRepurposing_Illum
- Your number of
CLUSTER_MACHINES
should be set to the number of plates you have divided by 4 then rounded up, ie 6 for 22 plates - Your
SQS_MESSAGE_VISIBILITY
should be 12 hours720*60
- Your
APP_NAME
variable should be set to theappname
set inrun_batch_general.py
plus_AssayDev
, ie2015_10_05_DrugRepurposing_AssayDev
- Your number of
CLUSTER_MACHINES
should be medium-large, ie a hundred or few hundred. - Your
SQS_MESSAGE_VISIBILITY
should be short, such as5*60
(5 minutes)
- Your
APP_NAME
variable should be set to theappname
set inrun_batch_general.py
plus_Analysis
, ie2015_10_05_DrugRepurposing_Analysis
- Your number of
CLUSTER_MACHINES
should be as many as possible per your account limits, ideally at least a few hundred. - Your
SQS_MESSAGE_VISIBILITY
should be 10-20 minutes for images with a binning of 2, longer (30-120 minutes) for unbinned images and/or CellProfiler 2 or 3 runs. This value is the most potentially variable -once you've run a single analysis workflow, you can adjust this value accordingly based on your log files.
The optional QC and assay development steps have post-CellProfiler components. These pipelines need only be run if the user plans to do these post-CellProfiler steps.
- QC may require visual inspection of images, creation of a machine learning classifier to detect poor quality images, and/or running scripts to evaluate CV. How best to evaluate quality is left to the user.
- The assay development step creates images to be visually evaluated, either individually or after stitching for easier evaluation. If your segmentation is not ideal, you may need to update your assay dev and analysis pipelines by manually tuning the segmentation steps on a local set of representative data until they perform better on your images, then making sure to update the segmentation in the assay dev and analysis pipelines on your cluster accordingly.
- After running the final analysis pipelines, proceed to the next step of this guide.