I could not find the demonstration retrieval code in code/data_process.py, so I wrote my own implementation using sentence_transformers and the all-mpnet-base-v2 model. I followed the method outlined in the paper: for each utterance, I retrieve the top-1 most similar utterance from the training set, pairing by the same label for training and over all labels for dev/test.
Without demonstration retrieval, I was able to come close to the paper's result. However, with demonstration retrieval, my F1-scores dropped from ~69 to ~25 on MELD (with similar drops on the other datasets). I found that during training the model simply copies the demonstration's emotion label to minimize the loss, which leads to extreme overfitting. In validation this almost never works, since dev/test uses all-labels pairing.
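For reference, here is a minimal sketch of what my retrieval implementation looks like (this is my own code, not the authors'; the function name and loop structure are just illustrative, and it is not optimized):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def retrieve_demonstrations(queries, query_labels, train_utts, train_labels, same_label=True):
    """Return the most similar training utterance for each query utterance.

    same_label=True  -> training-time pairing: search only utterances with the same label.
    same_label=False -> dev/test pairing: search over all training utterances.
    """
    q_emb = encoder.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
    t_emb = encoder.encode(train_utts, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(q_emb, t_emb)  # shape: (num_queries, num_train)

    demos = []
    for i, label in enumerate(query_labels):
        row = sims[i].clone()
        if same_label:
            # Mask training utterances whose label differs from the query's label.
            for j, t_label in enumerate(train_labels):
                if t_label != label:
                    row[j] = float("-inf")
        if queries is train_utts:
            row[i] = float("-inf")  # don't let a training utterance retrieve itself
        best = int(row.argmax())
        demos.append(train_utts[best])
    return demos
```

At training time I call this with same_label=True over the training set itself; for dev/test I call it with same_label=False against the training set.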
Do you have the demonstration retrieval code we can look at? It seems it is not effective.
Hi Matthew! How did you manage to get 69? I used the default settings and tried many seeds, but the average F1 is around 65-66; the best seed gives around 67-68.
I ran each seed for 15 epochs instead of 6 to get better scores.
I also tried the auxiliary tasks, but they were not as effective as claimed in the paper. For both datasets the scores are lower than reported, especially for MELD.
I wonder what settings you used to get this score. Is it just a lucky seed, or is your average really 69?
Thank you so much!
Hi Zehui! I must've gotten lucky. My runs average around 65-66 as well, so it appears that not only can we not reproduce the results for demonstration retrieval, we can't reproduce the other results either. This is quite concerning.