I could not find the demonstration retrieval code in code/data_process.py, so I wrote my own implementation using sentence_transformers and the all-mpnet-base-v2 model. I followed the method outlined in the paper: for each utterance, I retrieve the top-1 most similar utterance from the training set, pairing by the same label for training and over all labels for dev/test.
Without demonstration retrieval, I was able to come close to the paper's result. However, with demonstration retrieval, my F1-scores dropped from ~69 to ~25 on MELD (with similar drops on the other datasets). I found that during training the model simply copies the demonstration's emotion label to minimize the loss, which leads to extreme overfitting. In validation this almost never works, since dev/test uses all-labels pairing.
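For reference, here is a minimal sketch of what my retrieval implementation looks like (this is my own code, not the authors'; the function name and loop structure are just illustrative, and it is not optimized):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def retrieve_demonstrations(queries, query_labels, train_utts, train_labels, same_label=True):
    """Return the most similar training utterance for each query utterance.

    same_label=True  -> training-time pairing: search only utterances with the same label.
    same_label=False -> dev/test pairing: search over all training utterances.
    """
    q_emb = encoder.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
    t_emb = encoder.encode(train_utts, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(q_emb, t_emb)  # shape: (num_queries, num_train)

    demos = []
    for i, label in enumerate(query_labels):
        row = sims[i].clone()
        if same_label:
            # Mask training utterances whose label differs from the query's label.
            for j, t_label in enumerate(train_labels):
                if t_label != label:
                    row[j] = float("-inf")
        if queries is train_utts:
            row[i] = float("-inf")  # don't let a training utterance retrieve itself
        best = int(row.argmax())
        demos.append(train_utts[best])
    return demos
```

At training time I call this with same_label=True over the training set itself; for dev/test I call it with same_label=False against the training set.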
Do you have the demonstration retrieval code we can look at? It seems it is not effective.
Hi Matthew! How did you manage to get 69? I used the default settings and tried many seeds, but the average F1 is around 65-66; the best seed gives around 67-68.
I ran each seed for 15 epochs instead of 6 to get better scores.
I also tried the auxiliary tasks, but they were not as effective as claimed in the paper. For both datasets the scores are lower than reported, especially for MELD.
I wonder what settings you used to get this score. Is it just a lucky seed, or is your average really 69?
Thank you so much!
Hi Zehui! I must've gotten lucky. My runs average around 65-66 as well, so it appears that not only can we not reproduce the results for demonstration retrieval, we can't reproduce the other results either. This is quite concerning.