This issue was moved to a discussion.

2022_09_DD_DeepProfiler (cpg0019) #20

Closed
6 of 9 tasks
shntnu opened this issue Sep 13, 2022 · 7 comments
Labels
cpg0019

Comments

@shntnu
Collaborator

shntnu commented Sep 13, 2022

Segmentation/Feature extraction is being performed by (Cimini lab / Carpenter-Singh lab)
Profile creation is being performed by (Cimini lab / Carpenter-Singh lab)
Data can be public in RODA Immediately

Update as generated:
[Link to profile repo]
https://doi.org/10.1101/2022.08.12.503783
cpg0019-moshkov-deepprofiler

  • Metadata completely filled out in Project Profiler Database (Imaging Platform internal use only)
  • Segmentation/Feature extraction complete
  • Profiling complete

Transfer to CellPainting Gallery:

  • Upload data to RODA (is private by default)
  • Run validation script to ensure completion
  • Update cellpainting-gallery/README.md
  • Make RODA entry public

If data is being published, prepare for publication:
These are only training images (crops)

  • Run Distributed-BioFormats2Raw to create .ome.zarr files
  • Upload (meta)data to IDR (images remain hosted in cellpainting-gallery).

Once published:

  • ~~Make IDR entry public~~
  • Update cellpainting-gallery/README.md and open-data-registry/cellpainting-gallery.yml to reflect publication
  • Move this Issue from cellpainting-gallery-private to cellpainting-gallery. This step can be performed at an earlier point if it needs inputs from an external collaborator.
@shntnu
Collaborator Author

shntnu commented Sep 16, 2022

cellpainting-gallery
└── cpg0019-moshkov-deepprofiler
    └── broad
        ├── training_images
        │   ├── TAORF
        │   │   └── images
        │   │       └── <plate-id>
        │   │           └── <well>
        │   │               └── <site>
        │   │                   ├── <cell1>
        │   │                   ├── <cell2>
        │   │                   ├── ...
        │   │                   └── <celln>
        │   ├── BBBC022
        │   ├── LUAD
        │   ├── CDRP
        │   └── LINCS
        └── workspace

@shntnu shntnu changed the title from "2022_09_DD_DeepProfiler" to "2022_09_DD_DeepProfiler (cpg0019)" Sep 22, 2022
@Arkkienkeli

Hi Shantanu, I have this folder structure now, any concerns or suggestions?

cpg0019-moshkov-deepprofiler
└── broad
    ├── training_images
    │   ├── BBBC022
    │   │   └── A01
    │   │       └── 1
    │   │           └── *.png
    │   ├── BBBC036
    │   ├── BBBC037
    │   ├── BBBC043
    │   └── LINCS
    └── workspace_dl
        ├── collated
        │   └── 105281_zenodo7114558
        │       ├── BBBC022
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       ├── BBBC036
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       └── BBBC037
        │           ├── notspherized.csv
        │           └── spherized.csv
        ├── consensus
        │   └── 105281_zenodo7114558
        │       ├── BBBC022
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       ├── BBBC036
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       └── BBBC037
        │           ├── notspherized.csv
        │           └── spherized.csv
        ├── embeddings
        │   └── 105281_zenodo7114558
        │       ├── BBBC022
        │       │   └── 20585
        │       │       └── A01
        │       │           └── 1
        │       │               └── embedding.npz
        │       ├── BBBC036
        │       └── BBBC037
        └── metadata
            ├── BBBC022_profiling.csv
            ├── BBBC036_profiling.csv
            ├── BBBC037_profiling.csv
            └── sc-metadata.csv
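
As a rough illustration of how the layout above maps to file locations, here is a minimal Python sketch. The helper names (`embedding_path`, `profile_path`) are hypothetical and not part of any gallery tooling; the folder and file names are copied from the tree above, and the assumption that `105281_zenodo7114558` identifies the model is mine.

```python
from pathlib import PurePosixPath

# Folder names copied from the tree above; the helpers are illustrative only.
PROJECT = PurePosixPath("cpg0019-moshkov-deepprofiler/broad")
# Folder name as it appears in the tree (assumed to identify the model).
MODEL_DIR = "105281_zenodo7114558"


def embedding_path(dataset: str, plate: str, well: str, site: str) -> PurePosixPath:
    """Location of the single-cell embeddings for one site."""
    return (PROJECT / "workspace_dl" / "embeddings" / MODEL_DIR
            / dataset / plate / well / site / "embedding.npz")


def profile_path(level: str, dataset: str, spherized: bool) -> PurePosixPath:
    """Location of a well-level ("collated") or treatment-level ("consensus") CSV."""
    if level not in ("collated", "consensus"):
        raise ValueError(f"unknown profile level: {level}")
    name = "spherized.csv" if spherized else "notspherized.csv"
    return PROJECT / "workspace_dl" / level / MODEL_DIR / dataset / name


print(embedding_path("BBBC022", "20585", "A01", "1"))
print(profile_path("consensus", "BBBC037", spherized=True))
```

The paths are relative to the bucket root, so they can be prefixed with whatever local mirror or object-store URI is in use.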

shntnu added a commit that referenced this issue Oct 20, 2022
@shntnu
Collaborator Author

shntnu commented Oct 20, 2022

Looks great @Arkkienkeli!

I've modified c1412d6 to reflect this

@Arkkienkeli

Hi @shntnu and @ErinWeisbart, is the dataset available in the gallery?

@shntnu
Collaborator Author

shntnu commented Nov 7, 2022

It is 🎉

@shntnu shntnu mentioned this issue Nov 10, 2022
@shntnu
Collaborator Author

shntnu commented Nov 10, 2022

@Arkkienkeli Are you happy with this one-liner to summarize cpg0019?

8.3 million single cells from 232 plates, across 488 treatments from 5 public datasets, used for learning representations

Feel free to edit #27 if not.

@shntnu shntnu added the cpg0019 label Dec 14, 2022
@shntnu
Collaborator Author

shntnu commented May 19, 2023

I'm adding our email logs here for our records

Forwarded Conversation
Subject: Posting dataset in the AWS Cell Painting Gallery

From: Juan Caicedo

Hi Shantanu,

We'd like to make the combined dataset of single cells that Nikita created for training publicly available as part of the materials that will support the submission of the DeepProfiler paper. Can you guide us on how to do this?

The dataset takes about 200GB of space. If this is not the best resource for the dataset, do you have any other suggestions for making it public?

Thank you!

Juan C.


From: Shantanu Singh

Hi Juan

The gallery sounds like the right place to store this information (I'd imagine you also want to store the corresponding images, and not just the single cell data, correct?)

We have a process for doing this
https://github.com/broadinstitute/cellpainting-gallery#contributing-to-cell-painting-gallery
I have got us started here
#20

It's worth your skimming the folder structure
https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md
to see how things are organized

In your case, we will need to skip a bunch of folders (I'd imagine), but we can figure that out later

The first thing to figure out is: where in the structure do we store embeddings? This will be the first such dataset, so it will be good to think this through, and your inputs would be great.

This is what I proposed
https://github.com/broadinstitute/cellpainting-gallery/pull/19/files
after chatting with Mike Ando, who will also be producing embeddings (for JUMP).

Let me know what you think (either in the PR or here)

Once we settle on the structure (for the embeddings), we can tackle the next steps

I'm cc'ing Erin to keep her in the loop

-Shantanu


From: Juan Caicedo

Hi Shantanu,

Thank you so much for getting this started and for all the instructions to proceed!

Just to clarify, this dataset is not useful for biological analysis; it is only a training resource. So we don't plan on releasing embeddings, and we don't plan on releasing the original full images. We only want to make the single-cell images and their metadata available for future use in machine learning algorithms. We will document how these single cells were obtained, so we can point to the original sources and list the treatments (wells and plates) that we sampled. Does this make sense?

Regarding the embeddings, we are happy to share the features processed with our technique for existing datasets (e.g. TAORF, CDRP, BBBC022). Can we append these features to existing datasets?

Nikita will follow the instructions to make the dataset public during the next few days. Nikita, please let us know if you have any questions!

Thank you!

Juan C.


From: Shantanu Singh

Hi Juan,

Thanks for the clarifications

All this makes sense.

It would be great if Nikita could ponder an appropriate folder structure for sharing the data, keeping the current structure in view. The structure is essential because it will set a precedent for future datasets of this nature.

Regarding the other embeddings for existing data – that would be fantastic! Nikita, can you organize it in the proposed folder structure for each dataset?
https://github.com/broadinstitute/cellpainting-gallery/pull/19/files
(or proposed changes to the structure)

S


From: Moshkov Nikita

Hi Shantanu,

Most of the folders in the documented structure don't seem to be needed for this dataset.
The images are stored in the following structure: Source dataset -> Plate ID -> Well -> Site.
From the workspace directory, we only need one for metadata.
Outlines are already part of the images.
The example folder structure is in the image attached.
I believe that training resources should have a simple folder structure.

Regarding the embeddings: we have the embeddings for BBBC022, CDRP and TA-ORF.
Should we just put the embeddings in the dataset folders?
BBBC022 does not seem to be in the gallery.
Do we want to share only the embeddings extracted with the Cell Painting CNN model or with other models too?

Note that for the extraction of embeddings we used slightly different metadata
(in short, we did not extract embeddings from all images, only from the ones that passed our QC).
Those are npz files if it matters.

Thank you!

image.png


From: Juan Caicedo

Hi Nikita,

Great that you are looking into this!
I agree that the folder structure for the single cell images may be different. We are following the way other machine learning datasets are organized, to help researchers in the field use it out of the box and to lower the barrier to entry.

On the embeddings side, I think we only need to make the Cell Painting CNN embeddings public, and we should release all levels of profiling (from single cells to sphered well-level and aggregated treatment-level).

Best,

Juan C.


From: Shantanu Singh

Sounds good

Let's go with what you recommended, just that we should call the images folder something else – let's go with training_images?

#20 (comment)

On the embeddings side, I think we only need to make the Cell Painting CNN embeddings public, and we should release all levels of profiling (from single cells to sphered well-level and aggregated treatment-level).

Fantastic! Are we good with the structure proposed here for that https://github.com/broadinstitute/cellpainting-gallery/pull/19/files? (and use npz files instead of parquet)

BBBC022 does not seem to be in the gallery.

That's right but we can get that ready while you are preparing the data

So in summary, you will have just 3 folders:

  • training_images
  • workspace/profiles
  • workspace/metadata

Please LMK if you have any questions.


From: Moshkov Nikita

Hi Shantanu, Rebecca, Erin, Juan,
I have put together the dataset for publishing.

Shantanu and Juan, I guess it makes more sense to put our embeddings in a separate folder instead of putting them in the dataset folders
OR make a single folder for DeepProfiler paper (similarly to M.Rohban's heterogeneity paper) and put everything there.

Currently, the folder structure is the following:

cellpainting-gallery
└── Broad-CP-TrainingSet2022
    └── broad
        ├── training_images
        │   ├── TAORF (same for other datasets)
        │   │   └── Plate Id
        │   │       └── Well
        │   │           └── Site
        │   ├── BBBC022
        │   ├── LUAD
        │   ├── LINCS
        │   └── CDRP
        └── workspace
            └── metadata
                └── sc-metadata.csv

Metadata file is adjusted to have a relative path to images in this folder structure.
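
A minimal sketch of what resolving those relative paths could look like in Python. The actual columns of sc-metadata.csv are not shown in this thread, so the `Image_Path` column, the sample rows, and the assumption that paths are relative to `training_images` are all placeholders.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical excerpt: the column name and row values are placeholders,
# not the real contents of sc-metadata.csv.
SAMPLE = """Image_Path
TAORF/plate1/A01/1/cell1.png
BBBC022/plate2/B02/3/cell7.png
"""


def resolve_image_paths(metadata_file, images_root):
    """Join each relative path from the metadata onto the images folder."""
    root = PurePosixPath(images_root)
    reader = csv.DictReader(metadata_file)
    return [root / row["Image_Path"] for row in reader]


paths = resolve_image_paths(
    io.StringIO(SAMPLE),
    "Broad-CP-TrainingSet2022/broad/training_images",
)
for p in paths:
    print(p)
```

Keeping the paths relative in the metadata means the same file works whether the dataset sits on a local disk or in the gallery; only `images_root` changes.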

If there are no concerns, the dataset is ready to be uploaded and is now available on DGX in the folder:
/raid/data/cellpainting/Broad-CP-TrainingSet2022/

I can prepare the embeddings to be uploaded either as a separate folder or as a part of a single folder later.

Thank you!


From: Shantanu Singh

make a single folder for DeepProfiler paper (similarly to M.Rohban's heterogeneity paper) and put everything there.

I like this idea

Does this structure work for you? If not, please feel free to propose alternatives

https://github.com/broadinstitute/cellpainting-gallery/pull/19/files


From: Moshkov Nikita

Hi Shantanu, thank you for looking into this!
I have added some comments and questions to the PR you shared: https://github.com/broadinstitute/cellpainting-gallery/pull/19/files
We have a slightly different structure in DeepProfiler: /Plate/Well/Site.npz
Maybe we could unify the two?
Thank you!


From: Shantanu Singh

Thanks, Nikita. I have responded; have a look.

/Plate/Well/Site.npz

Would this modification work – plate/well/site/embedding.npz? It favors encoding the structure in the folders rather than in the file name.

Here's what that would look like

└── embeddings
    ├── 2021_04_26_Batch1
    │   ├── BR00117035
    │   │   └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
    │   │       ├── A01
    │   │       │   └── 1
    │   │       │       └── embedding.npz
    │   │       └── A02
    │   └── BR00117036
    └── 2021_05_31_Batch2
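
One consequence of encoding the structure in folders is that a file's batch, plate, well, and site can be recovered from its path alone. A minimal Python sketch of that idea, assuming the batch/plate/model/well/site nesting shown in the tree above; `EmbeddingLocation` and `parse_embedding_path` are hypothetical names, not part of DeepProfiler or any gallery tooling.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class EmbeddingLocation(NamedTuple):
    batch: str
    plate: str
    model: str
    well: str
    site: str


def parse_embedding_path(path: str) -> EmbeddingLocation:
    """Split '.../embeddings/<batch>/<plate>/<model>/<well>/<site>/embedding.npz'."""
    parts = PurePosixPath(path).parts
    if not parts or parts[-1] != "embedding.npz":
        raise ValueError(f"not an embedding file: {path}")
    if len(parts) < 7 or parts[-7] != "embeddings":
        raise ValueError(f"unexpected layout: {path}")
    # The five folders between 'embeddings' and the file name.
    batch, plate, model, well, site = parts[-6:-1]
    return EmbeddingLocation(batch, plate, model, well, site)


loc = parse_embedding_path(
    "embeddings/2021_04_26_Batch1/BR00117035/"
    "efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff/A01/1/embedding.npz"
)
print(loc.plate, loc.well, loc.site)
```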

From: Shantanu Singh

Nikita – I've made a bunch of changes to align with DeepProfiler output

https://github.com/broadinstitute/cellpainting-gallery/blob/embeddings/folder_structure.md#embeddings-folder-structure

If this looks good, I'll go ahead and merge


From: Moshkov Nikita

Hi Shantanu, great, thank you!
I am going to make some adjustments and then share the folder structure with you.

We would like to put well-level and treatment-level profiles in the analysis folder, is it ok if we put just full CSV files without splitting those by plate?

Thank you!


From: Shantanu Singh

We would like to put well-level and treatment-level profiles in the analysis folder, is it ok if we put just full CSV files without splitting those by plate?

As such

But I'm open to suggestions


From: Moshkov Nikita

Hi Shantanu,

  • We’d certainly want to split by plate
    Will do.

  • We don't have a location for treatment-level profiles (it's something we just do on the fly)
    Would it work if I create a folder named "full_profiles" at the same level as the batches and put the concatenated well-level and treatment-level profiles there?

Thank you!


From: Shantanu Singh

Hi Nikita,

Turns out we did decide on a folder structure for those concatenated well-level and treatment-level profiles
From cytomining/profiling-handbook#54 (comment)

  • consensus (treatment-level)
  • collated (well-level)
    There is a single file per batch because it assumes all replicates are in the same batch, but I think it is wise to skip the batch structure and have a single file directly under that folder without any further nesting.

However, taking a step back, I realized it's wisest to split off DL-generated features into a different workspace folder. It's going to get too confusing to have DL-derived data components intermingle with CellProfiler-derived data components. Further, there will likely be several different versions of DL-generated features (vs. CellProfiler, which is relatively stable) – and so the nesting structure should more conveniently allow for this.

Here's what I came up with, hopefully making it easier for you
https://github.com/broadinstitute/cellpainting-gallery/blob/1f999572b7b40f8702a71684de4d145ff2c50674/folder_structure.md#workspace_dl-folder-structure

Diff
1f99957

Aside: We should make our best effort to create a sensible folder structure, but ultimately, the rigidity of folder structures will end up being too constraining, and we may (later) have to rely on configuration files that specify what's where. Just giving you a heads-up that this might happen in the future, but nothing for you to do right now.

Shantanu


From: Moshkov Nikita

Hi Shantanu,

Turns out we did decide on a folder structure for those concatenated well-level and treatment-level profiles
From cytomining/profiling-handbook#54 (comment)

  • consensus (treatment-level)
  • collated (well-level)
    There is a single file per batch because it assumes all replicates are in the same batch, but I think it is wise to skip the batch structure and have a single file directly under that folder without any further nesting.

Got it. Thank you!

I have reorganized the folder (please see it in the related issue: #20 (comment)).
I did not add the additional notebook for reading the features, not sure if it is needed (the folder structure differs a little from DeepProfiler's output).

FYI: LUAD was renamed to BBBC043, though it is not public yet: broadinstitute/imaging-bbbc#52

Thank you!

@broadinstitute broadinstitute locked and limited conversation to collaborators Sep 8, 2023
@ErinWeisbart ErinWeisbart converted this issue into discussion #62 Sep 8, 2023
