Describing the full end to end pipeline #9

Closed

kondilidisn opened this issue Nov 18, 2019 · 2 comments

@kondilidisn

Dear authors, thank you very much for your contribution. I know you have improved the code structure, but I am afraid it is still very hard for me to understand some method details.

I thought I should ask here, for the benefit of anyone else with the same questions.

  1. Table 2 in the paper presents the recommender system evaluation. If I understand correctly, you ignore the conversational part while performing these experiments, so that you can properly compare only the recommendation methods.

  2. In Table 3, you evaluate only the conversational part, ignoring the recommendation task. In this case, you calculate the perplexity of the ground-truth sentences, some of which may include UNK tokens that might be predicted correctly.

  3. I do not understand what the Dist-N metric is. Is it the ratio of distinct N-grams divided by the total number of words produced by the model? In that case, I would expect it to be greater than one, since there are far more possible distinct N-grams than distinct 1-grams (distinct single words).

Regarding the big picture of the complete end-to-end model:

  1. Do you identify named entities in real time from the conversation, or do you have a dictionary with all the named entities mentioned in each utterance (similarly to the ReDial authors)?

  2. Do you perform sentiment analysis and use it in your recommendation module, or do you ignore the sentiment regarding the entities and only use them as an ordered "bag of words"?

  3. If you perform sentiment analysis at conversation time, do you only provide the utterances that have been sent up to that point?

  4. You use the same switching technique as the ReDial authors for joining the conversational output space with the recommendation output space (sketched after this list). Do any of your results (maybe Table 3) present a joint evaluation (recommendation and NLG tasks)? If so, when you evaluate the token of a mentioned movie, do you check whether that specific movie was predicted, or simply whether any movie was predicted, counting the latter as a correct NLG prediction?

  5. Does Figure 2 evaluate the recommendation performance of the full end-to-end model or only the performance of the recommendation method? If it is about the full end-to-end model, does the predicted recommended item need to be in the same token position as the ground-truth one, or just mentioned anywhere in the generated response?
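
For reference, here is a minimal sketch of the ReDial-style switching mechanism mentioned in point 4 (PyTorch; the function name, tensor sizes, and exact mixing form are assumptions, not code from this repository):

```python
import torch

def switched_distribution(p_switch, vocab_logits, movie_logits):
    """Sketch of a ReDial-style switch: a learned switch probability
    mixes the dialog vocabulary distribution with the recommender's
    movie distribution into one joint output space."""
    vocab_probs = torch.softmax(vocab_logits, dim=-1)  # over dialog words
    movie_probs = torch.softmax(movie_logits, dim=-1)  # over movie candidates
    # Final distribution over the concatenated [vocabulary ; movies] space.
    return torch.cat([(1 - p_switch) * vocab_probs,
                      p_switch * movie_probs], dim=-1)

# Example: a 10k-word vocabulary, 6924 candidate movies, 30% switch probability.
probs = switched_distribution(torch.tensor(0.3),
                              torch.randn(10_000), torch.randn(6_924))
assert torch.isclose(probs.sum(), torch.tensor(1.0))
```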

I hope my questions will not be too much trouble, and that they will help more of us better understand your work.
Thank you in advance for your time!

Best Regards,
Nikos.

@qibinc
Collaborator

qibinc commented Nov 21, 2019

Dear @kondilidisn,

Thanks for your interest in this work! I apologize that we did not make the points you mentioned clear in the paper.

For Q1, Q2, Q7 and Q8:

First, although the task is named conversational recommendation (following the ReDial authors), it really comes down to two separate parts when using existing automatic evaluation metrics. Based on this, we evaluate the two parts separately, as in Table 2 and Table 3 (as indicated in the first sentence of the table captions), and leave devising new evaluation metrics for joint performance to future work.

Second, it is important to note that the proposed recommender system does consider the conversation by utilizing entities in the dialog contents, although it ignores the dialog model in this work. It is also worth mentioning that the entity linking module should be viewed as part of the dialog system (as shown in Figure 1), which opens up the possibility of adopting many knowledge-aware dialog models.

In contrast, the conversational model depends on the representation provided by the recommender, which is why the recommender can and must be trained first.

Now, to address these four questions:
Q1: That's right. The dialog model is not used during this evaluation. However, the entities linked from previous utterances and the knowledge graph are both used, and both benefit the recommendation performance.
Q2: Yes. We masked the movies to UNK so the results truly reflect the conversation quality (see the sketch after this list). The same holds for the other metrics (BLEU, etc.), and we did the same for the ReDial baseline.
Q7: No. Tables 2 and 3 show separate evaluations of recommendation and conversation, which demonstrate that the two systems can enhance each other.
Q8: As said earlier, Figure 2 shows the recommendation performance, so the position of the mention does not matter.
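
To make the masking in Q2 concrete, here is a minimal sketch; the `@movie_id` mention format (as used in the ReDial transcripts) and the UNK token string are assumptions about the preprocessing, not the repository's exact code:

```python
import re

MOVIE_MENTION = re.compile(r"@\d+")  # ReDial-style "@<movie_id>" mentions

def mask_movies(utterance: str, unk: str = "__unk__") -> str:
    """Replace every movie mention with the UNK token, so perplexity and
    BLEU measure conversation quality rather than recommendation accuracy."""
    return MOVIE_MENTION.sub(unk, utterance)

# Both references and model outputs are masked the same way before scoring.
print(mask_movies("You should watch @111776 , it is great!"))
# -> You should watch __unk__ , it is great!
```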

Q3

Sorry for missing this info in the paper... It is calculated as the number of distinct n-grams produced by the model on the test set, divided by the number of sentences produced on the test set. This roughly captures how many novel n-grams there are per sentence, and it can be smaller than one if the test set is large.
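
In code, the computation described above could look like this minimal sketch (whitespace tokenization is an assumption):

```python
def distinct_n(sentences, n):
    """Dist-N as described above: the number of distinct n-grams across
    all generated test sentences, divided by the number of sentences."""
    ngrams = set()
    for sent in sentences:
        tokens = sent.split()  # assumed whitespace tokenization
        ngrams.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return len(ngrams) / len(sentences)

outputs = ["i love this movie", "i love it", "have you seen it"]
print(distinct_n(outputs, 2))  # 7 distinct bigrams / 3 sentences ≈ 2.33
```

Since the set of distinct n-grams saturates while the sentence count keeps growing, the ratio can indeed drop below one on a large test set.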

Q4

We tried identifying entities on the fly. However, it had high latency (perhaps because the linker is web-based) and became the bottleneck of the training process, which is why we cached and saved {utterance: entities_list}, as you mentioned. This way, identification on the same utterance is not executed over and over again as the training epochs progress. The latency is unnoticeable to humans, however, so it can run in real time in interactive mode.
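
A minimal sketch of that caching scheme (the cache file name and the `link_entities` callable are hypothetical stand-ins for the web-based linker):

```python
import json
import os

CACHE_PATH = "entity_cache.json"  # hypothetical cache location

def load_cache(path: str = CACHE_PATH) -> dict:
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def linked_entities(utterance: str, cache: dict, link_entities) -> list:
    """Call the (slow, web-based) entity linker only on a cache miss, so
    the same utterance is never re-linked as training epochs repeat."""
    if utterance not in cache:
        cache[utterance] = link_entities(utterance)
    return cache[utterance]

def save_cache(cache: dict, path: str = CACHE_PATH) -> None:
    with open(path, "w") as f:
        json.dump(cache, f)
```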

Q5, Q6

We did not perform sentiment analysis. On the one hand, our main objective in this work is to provide a general framework in which recommendation and conversation each truly involve and improve the other. Deciding whether to use sentiment analysis, and how to use it properly, should both be delegated to the recommender system and the dialog system, based on whether sentiment analysis will improve their performance. On the other hand, since ReDial treats sentiment analysis as an auxiliary task and does not show its contribution to the two main tasks, we believe whether and how to add sentiment analysis to improve the whole system is still an open question and an interesting topic to follow.

Best,
Qibin

@kondilidisn
Author

Dear @qibinc,

thank you very much for your thorough analysis; it was very helpful.

I will close this issue, as all my questions have been answered, and I leave it up to you to decide whether or not you want to display these questions and answers in any way.

Thank you again for your contribution and for the time you took to explain some details to me.

Best,
Nikos.
