How can I implement it for protein-text retrieval? #1
Comments
Hi Jin! Sorry for the delay, and glad to hear you liked our work! I enjoyed reading ProTrek, and I think this methodology would be a really complementary fit for producing rigorous evals + guarantees. Based on what you described, the regime seems complementary to the style of risk control we employ in the paper, except here the query
In our code, we store lists whose length is the number of queries, where each element is a dictionary containing the sorted similarity scores, label matches, indices, etc. This makes calibration and testing easy, but there are many ways to do it. Here is a notebook that shows how we pre-process the data for calibration in the Pfam study of the paper. The main question to decide before using this is what constitutes a match: are there multiple acceptable descriptions for each protein? Happy to brainstorm further and help you integrate this metric into your models.
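As a rough illustration of the per-query layout described above, here is a minimal sketch. It assumes a dense similarity matrix `sims` (queries x candidates) and a boolean `matches` matrix of the same shape; the function name `build_query_records` and the dictionary keys are hypothetical and not taken from the repo.

```python
import numpy as np

def build_query_records(sims, matches):
    """Build one dictionary per query, sorted by descending similarity.

    sims:    (n_queries, n_candidates) float array of similarity scores.
    matches: (n_queries, n_candidates) bool array, True where the
             candidate counts as a correct match for the query.
    Key names are illustrative only.
    """
    sims = np.asarray(sims, dtype=float)
    matches = np.asarray(matches, dtype=bool)
    records = []
    for q in range(sims.shape[0]):
        # Candidate indices ordered from most to least similar.
        order = np.argsort(-sims[q])
        records.append({
            "sorted_scores": sims[q, order],
            "sorted_matches": matches[q, order],
            "sorted_indices": order,
        })
    return records
```

Keeping everything pre-sorted per query means that evaluating a score threshold later only requires scanning a prefix of each record.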
Hi Seyone! Thank you very much for providing such a detailed implementation tutorial. I think I now have an outline for incorporating this metric into my work. I still have a few questions, though:
Thank you again, and I look forward to further discussion!
Hi @LTEnjoy, sorry for the delay! Thanks for asking such great questions; they are really helpful for us as well as we work towards making the tooling more accessible.
Does this help with sketching out an outline? It looks like the utils in this repo should work out of the box for your work; you'll just need to prepare a distance matrix and define a match function for protein-text pairs.
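To make the "distance matrix plus match function" idea concrete, here is one possible sketch of threshold calibration with conformal risk control. It is not the repo's implementation. It assumes a false negative rate loss, a similarity matrix `sims` and a boolean `matches` matrix computed on a held-out calibration set, and the function names `fnr_loss` and `calibrate_threshold` are hypothetical.

```python
import numpy as np

def fnr_loss(sims, matches, t):
    """Per-query fraction of true matches scoring below threshold t."""
    retrieved = sims >= t
    n_true = matches.sum(axis=1)
    missed = (matches & ~retrieved).sum(axis=1)
    # Queries without any true match contribute zero loss.
    return np.where(n_true > 0, missed / np.maximum(n_true, 1), 0.0)

def calibrate_threshold(sims, matches, alpha):
    """Largest threshold whose adjusted calibration risk is <= alpha.

    Uses the conformal risk control bound for a loss bounded by 1:
    (n / (n + 1)) * empirical_risk + 1 / (n + 1) <= alpha.
    """
    sims = np.asarray(sims, dtype=float)
    matches = np.asarray(matches, dtype=bool)
    n = sims.shape[0]
    # The FNR only decreases as the threshold is lowered, so the first
    # candidate (scanning from high to low) that passes is the largest.
    for t in np.unique(sims)[::-1]:
        risk = fnr_loss(sims, matches, t).mean()
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return float(t)
    # No threshold satisfies the bound: fall back to retrieving everything.
    return float("-inf")
```

At test time, every description whose similarity to a new protein is at least the calibrated threshold would be returned. The guarantee is on the expected false negative rate and relies on calibration and test queries being exchangeable.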
Hi Seyone, thank you for the explanation! I will refer to your work to implement the metric. I hope we can discuss further if I run into other questions :)
Yes, of course! Let me know how I can be helpful; I'm happy to look over code or troubleshoot as well. :) Looking forward to seeing this method in ProTrek!
Hi @LTEnjoy, how's it going with implementing conformal scoring for ProTrek similarity scores? Let me know if we can help at all.
Hi, thank you for your continued attention to this issue! I'm sorry, we've been busy working on another submission these days. We will keep you informed once we have any progress :)
No worries - best of luck on the submission!
Dear authors,
Thank you for proposing this cool methodology! I'm currently working on a cross-modality retrieval system named ProTrek, and I'm very interested in integrating this metric into our evaluation system. I'm not familiar with conformal risk control, so I would appreciate any suggestions on how to implement this pipeline.
Given a protein `x`, ProTrek calculates similarity scores with all natural language descriptions in the database (denoted as `y1, y2, ..., yn`) and returns scores `s1, s2, ..., sn`. I want to assign a probability to each score to represent how likely it is to be correct. I have a test set containing 4,000 proteins that were not seen during the training of ProTrek, which I think I could use to evaluate the method and generalize to proteins input by users. Could you give me a pipeline for how I can implement this?
Thank you in advance, and I sincerely look forward to your reply!
Best regards,
Jin