
Low performance when using demonstration retrieval #15

Open

matthewclso opened this issue Jul 4, 2024 · 2 comments

matthewclso commented Jul 4, 2024

I could not find the demonstration retrieval code in code/data_process.py, so I wrote my own implementation using sentence_transformers with the all-mpnet-base-v2 model. I followed the method outlined in the paper: for each utterance I retrieve the top-1 most similar utterance from the training set, using same-label pairing for training and all-labels pairing for dev/test.
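
For reference, here is a minimal sketch of what my retrieval step looks like; the function and variable names (retrieve_top1, train_utts, train_labels, ...) are my own placeholders, not anything from this repo:

```python
# Minimal sketch of my retrieval step (not the repo's code).
# train_utts / train_labels etc. are placeholders for however the
# MELD splits are loaded in code/data_process.py.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def retrieve_top1(query_utts, train_utts, train_labels, query_labels=None):
    """Return the index of the most similar training utterance for each query.

    If query_labels is given (training time, same-label pairing), the search is
    restricted to training utterances with the same emotion label; otherwise
    (dev/test, all-labels pairing) all training utterances compete.
    """
    q_emb = model.encode(query_utts, normalize_embeddings=True)
    t_emb = model.encode(train_utts, normalize_embeddings=True)
    sims = q_emb @ t_emb.T  # cosine similarity, since embeddings are normalized

    train_labels = np.asarray(train_labels)
    top1 = []
    for i, row in enumerate(sims):
        if query_labels is not None:
            # Same-label pairing: mask out training utterances with a different label.
            # (When the queries are the training set itself, the exact same utterance
            # should also be excluded so it cannot retrieve itself.)
            row = np.where(train_labels == query_labels[i], row, -np.inf)
        top1.append(int(row.argmax()))
    return top1
```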

Without demonstration retrieval, I was able to come close to the paper's result. However, with demonstration retrieval, my F1 score on MELD dropped from ~69 all the way to ~25 (with similar results on the other datasets). During training the model simply copies the demonstration's emotion label to minimize the loss, which leads to extreme overfitting; at evaluation time this almost never works, since dev/test use all-labels pairing.

Do you have the demonstration retrieval code we could look at? From my experiments, it does not seem to be effective.


zehuiwu commented Jul 20, 2024

Hi Matthew! How did you manage to get 69? I used the default settings and tried many seeds, but the average F1 is around 65-66; the best seed gives around 67-68.

I ran each seed for 15 epochs instead of 6 to get better scores.

I also tried the auxiliary tasks, but they were not as effective as claimed in the paper. For both datasets the scores are lower than reported, especially for MELD.

What settings did you use to get this score? Is it just a lucky seed, or is 69 your average?

Thank you so much!

matthewclso (Author) commented

Hi Zehui! I must've gotten lucky. My runs also average around 65-66, so it appears that not only can we not reproduce the demonstration retrieval results, we can't reproduce the other reported results either. This is quite concerning.
