
How can I implement it for protein-text retrieval? #1

LTEnjoy opened this issue Aug 15, 2024 · 8 comments

@LTEnjoy

LTEnjoy commented Aug 15, 2024

Dear authors,

Thank you for proposing this cool methodology! Currently I'm working on a cross-modality retrieval system named ProTrek, and I'm very interested in incorporating this metric into our evaluation system. I'm not familiar with conformal risk control, so I would appreciate it if you could give me some suggestions on how to implement this pipeline.

Given a protein x, ProTrek will calculate similarity scores with all natural language descriptions (denoted as y1, y2, ..., yn) in the database and return scores s1, s2, ..., sn. I want to assign a probability to each score to represent how likely it is to be correct. I have a test set containing 4,000 proteins that were not seen during the training of ProTrek, which I think could be used for evaluation and should generalize to proteins input by users. Could you outline a pipeline for how I can implement this?

Thank you in advance and I sincerely look forward to your reply!

Best regards,
Jin

@seyonechithrananda
Collaborator

Hi Jin!

Sorry for the delay, glad to hear you liked our work! I enjoyed reading ProTrek and I think this methodology would be a really complementary fit for producing rigorous evals + guarantees.

Based on what you described, the setting seems directly analogous to the style of risk control we employ in the paper, except here the query q is the protein x, and the lookup database is your set of natural language descriptions y = {y_1, y_2, ..., y_n}, yielding cosine similarity scores in [0, 1]. You should be able to use the exact same pipeline we use, where:

  1. Some subset of the 4,000 proteins (which have ground-truth natural language descriptions, giving you a match function against the lookup set of all language descriptions) needs to go into your calibration set. Ideally, sample uniformly with respect to the natural language label abundances (i.e., so that your calibration set has a distribution of language annotations approximately matching the test set), and pick some n of roughly 300-400 or more for calibration. You then repeat this shuffling of the 4,000 into calibration and test over multiple trials.
  2. From here, you just need to create a distance matrix with all the scores against all proteins, so with shape (4000, num_language_descriptions). Then sort each row in descending order of similarity, and keep a separate matrix that stores the re-arranged indices.
  3. Presuming you have the labels for each of the 4,000: you can score matches by sorting from greatest to least similarity, then find some threshold score $\lambda_s$ that controls the loss you care about (e.g., I want false discoveries of natural language descriptions to be no more than 10% of the retrieved set). A minimal sketch of these three steps follows below.
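This isn't our exact pipeline, just a minimal sketch of steps 1-3 under a few assumptions: a hypothetical `scores` matrix of shape (num_proteins, num_descriptions) with cosine similarities, a boolean `matches` matrix marking which descriptions are acceptable for each protein, and a false-discovery-style risk that is (approximately) monotone in the threshold.

```python
import numpy as np

# Hypothetical inputs (not from the repo): `scores` is (num_proteins, num_descriptions)
# cosine similarities in [0, 1]; `matches` is a boolean matrix that is True where
# description j is an acceptable ground-truth description for protein i.

def empirical_fdp(scores, matches, lam):
    """Mean false discovery proportion per query when retrieving every
    description with similarity >= lam (defined as 0 when nothing is retrieved)."""
    retrieved = scores >= lam
    n_retrieved = retrieved.sum(axis=1)
    n_false = (retrieved & ~matches).sum(axis=1)
    fdp = np.where(n_retrieved > 0, n_false / np.maximum(n_retrieved, 1), 0.0)
    return fdp.mean()

def calibrate_lambda(cal_scores, cal_matches, alpha=0.10, grid=None):
    """Scan thresholds from most to least conservative and keep the lowest one whose
    conformal risk bound, (n/(n+1)) * empirical risk + 1/(n+1), stays <= alpha.
    Assumes the risk is (approximately) monotone in the threshold."""
    n = cal_scores.shape[0]
    grid = np.linspace(0.0, 1.0, 1001) if grid is None else grid
    chosen = 1.0
    for lam in sorted(grid, reverse=True):
        bound = (n / (n + 1)) * empirical_fdp(cal_scores, cal_matches, lam) + 1.0 / (n + 1)
        if bound <= alpha:
            chosen = lam
        else:
            break
    return chosen

# Step 1: random calibration/test split of the 4,000 proteins.
rng = np.random.default_rng(0)
perm = rng.permutation(scores.shape[0])
cal_idx, test_idx = perm[:500], perm[500:]

# Steps 2-3: calibrate the threshold, then check the risk it achieves on test.
lam_hat = calibrate_lambda(scores[cal_idx], matches[cal_idx], alpha=0.10)
test_fdr = empirical_fdp(scores[test_idx], matches[test_idx], lam_hat)
```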

In our code, we store lists where the length is the number of queries, and each element is a dictionary containing the sorted similarity scores, label matches, indices, etc., to make calibration and testing easy, but there are many ways to do so. Here is a notebook that shows how we pre-process the data for calibration in the Pfam study of the paper.
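If it helps, here is one way to build that kind of per-query record; the field names below are illustrative, not the exact keys used in our code.

```python
import numpy as np

def build_records(scores, matches, protein_ids):
    """One dictionary per query, holding its similarities sorted in descending
    order plus the correspondingly re-arranged indices and label matches."""
    records = []
    for i, pid in enumerate(protein_ids):
        order = np.argsort(-scores[i])  # indices that sort row i from high to low
        records.append({
            "query_id": pid,
            "sorted_scores": scores[i][order],
            "sorted_indices": order,
            "label_matches": matches[i][order],
        })
    return records
```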

The main question to decide before using this is what constitutes a match - are there multiple acceptable descriptions for each protein? Happy to help brainstorm further and help you integrate this metric into your models.

@LTEnjoy
Author

LTEnjoy commented Oct 8, 2024

Hi Seyone!

Thank you very much for providing such a detailed implementation tutorial. Now I think I have an outline for incorporating this metric into my work. That said, I still have a few questions:

  1. In step 1, you mentioned that I need to split the test set into 2 parts, one for calibration and the other for test. Suppose I randomly sample 500 proteins for calibration and the remaining 3,500 proteins for test. Should I calculate the λs score only on the calibration set and validate it on the test set? Also, you mentioned that I may repeat the sampling multiple times. Does that mean I also need to repeat steps 2 and 3 after each sampling and finally average the λs scores obtained across the rounds of sampling?

  2. There are multiple acceptable descriptions for each protein. For a protein, we retrieve its related text information from the UniProt website. There are many types of information attached to a protein, such as "Function", "Protein names", "GO annotations", etc. You could check this entry for reference: https://www.uniprot.org/uniprotkb/P05067/entry. So here the text pool is composed of all types of information for the 4,000 test proteins. Does this setting change how the λs score is computed?

Thank you again and looking forward to further discussion!

@seyonechithrananda
Collaborator

Hi @LTEnjoy,

Sorry for the delay! Thanks for asking such great questions - this is really helpful for us as well as we work towards making the tooling more accessible.

  1. Yes, you calculate λs on calibration and validate on test - that split should work! When I refer to sampling multiple times, what I mean is you can run many (10-100) trials where you shuffle the 4,000 randomly into different calibration/test splits and check whether the risk guarantee holds consistently across different partitions of the dataset. Here, you don't need to repeat steps 2 and 3, but rather just re-arrange which indices (row-wise, i.e. which queries) go into calibration and test (see the sketch after this list).
  2. Here, you can set it up in different ways. You can calibrate the similarity scores for each type of description separately if you want, i.e., calibration on protein-"Function" similarities, then protein-"Protein names", then protein-"GO annotation". You could also set it up so that the whole text pool is embedded together; what matters is that you have a way to designate when a retrieved text description (in either setup) is a "match", so you can compute the false negative rate, false discovery rate, etc.
    In the paper we have a similar example: when retrieving enzyme classifications, many proteins have several valid EC assignments. There, we simply use the assignment that penalizes the loss on the retrieval set the most - see Supplementary section 2.2.3 here.
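For point 1, a rough sketch of that trial loop, reusing the hypothetical `scores`/`matches` matrices and the `calibrate_lambda`/`empirical_fdp` helpers from the sketch in my earlier comment:

```python
import numpy as np

def run_trials(scores, matches, n_trials=100, n_cal=500, alpha=0.10, seed=0):
    """Reshuffle which rows (queries) go into calibration vs. test, recalibrate
    the threshold each time, and record the risk realized on the held-out split."""
    rng = np.random.default_rng(seed)
    test_risks = []
    for _ in range(n_trials):
        perm = rng.permutation(scores.shape[0])
        cal_idx, test_idx = perm[:n_cal], perm[n_cal:]
        lam_hat = calibrate_lambda(scores[cal_idx], matches[cal_idx], alpha=alpha)
        test_risks.append(empirical_fdp(scores[test_idx], matches[test_idx], lam_hat))
    return np.array(test_risks)  # distribution of test risk across random splits
```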

Does this help round out the outline? It looks like for your work the utils in this repo should work out of the box - you'll just need to pre-process a distance matrix and define a match function for protein-text pairs.
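As a concrete (hypothetical) example of what that match function could look like when each protein has several acceptable UniProt-derived texts:

```python
from typing import Callable, Dict, Set

def make_match_fn(acceptable: Dict[str, Set[int]]) -> Callable[[str, int], bool]:
    """`acceptable` maps a protein accession (e.g. 'P05067') to the set of indices
    in the text pool that are valid descriptions for it (Function, Protein names,
    GO annotations, ...). Names here are illustrative, not the repo's API."""
    def is_match(protein_id: str, description_idx: int) -> bool:
        return description_idx in acceptable.get(protein_id, set())
    return is_match
```

From this you can build the boolean matches matrix used in the calibration sketch; if you calibrate each description type separately, build one such matrix (and one λs) per type.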

@LTEnjoy
Author

LTEnjoy commented Oct 12, 2024

Hi Seyone,

Thank you for the explanation! I will refer to your work to implement the metric. Hope we can have further discussion if I run into other questions :)

@seyonechithrananda
Collaborator

Yes, of course! Let me know how I can be helpful - happy to look over code and troubleshoot as well. :) Looking forward to seeing this method in ProTrek!

@seyonechithrananda
Collaborator

seyonechithrananda commented Nov 15, 2024

Hi @LTEnjoy,

How's it going with implementing conformal scoring for ProTrek similarity scores? Let me know if we can help at all.

@LTEnjoy
Author

LTEnjoy commented Nov 16, 2024

Hi,

Thank you for your continued attention to this issue! I'm sorry, we've been busy working on another submission these days. We will keep you informed once we have any progress :)

@seyonechithrananda
Collaborator

> Hi,
>
> Thank you for your continued attention to this issue! I'm sorry, we've been busy working on another submission these days. We will keep you informed once we have any progress :)

No worries - best of luck on the submission!
