Skip to content
This repository has been archived by the owner on Dec 19, 2023. It is now read-only.

error in results due to negative sampling data leakage #28

Open
hshahbazi opened this issue Jul 26, 2020 · 13 comments
Open

error in results due to negative sampling data leakage #28

hshahbazi opened this issue Jul 26, 2020 · 13 comments

Comments

@hshahbazi
Copy link

hshahbazi commented Jul 26, 2020

Hi,
I've noticed an issue in this work as follows:  In module create_batch.py

the negative samples are not allowed to be taken from test data. When valid_triples_dict is defined as:
self.valid_triples_dict = {j: i for i, j in enumerate(
            self.train_triples + self.validation_triples + self.test_triples)}

and later used in:

while (random_entities[current_index], self.batch_indices[last_index + current_index, 1],
                               self.batch_indices[last_index + current_index, 2]) in self.valid_triples_dict.keys():

The model is taught during training that the test data are valid triples. We are not allowed to use test data during training. This will make the performance goes higher and higher with more epochs without any good reasons.  

@378egg
Copy link

378egg commented Sep 10, 2020

I noticed also, so the paper is an error?

@hshahbazi
Copy link
Author

Yes.

@378egg
Copy link

378egg commented Sep 13, 2020

Sorry, have you try more epochs? In experiments.

@hshahbazi
Copy link
Author

I experimented more epochs with other models such as tucker to see that the learning approach is not right.

@wbaz4321
Copy link

I experimented more epochs with other models such as tucker to see that the learning approach is not right.

sorry, I notice this too and I found another paper 'A Re-evaluation of Knowledge Graph Completion Methods' also mentions this issue. But I suppose the model structure, i.e. the attention mechanism and the encoder-decoder structure is still valid, the problem is in the training and evaluation protocol right? Because I found the idea in this paper interesting and plan to modify and apply it for my work.

@hshahbazi
Copy link
Author

Not sure what you mean by valid. Experimentally, the model gives top10=33% for fb15k-237 which is very low for the task. So experimentally, we can not reach to any validity conclusion for any parts.

@wbaz4321
Copy link

Not sure what you mean by valid. Experimentally, the model gives top10=33% for fb15k-237 which is very low for the task. So experimentally, we can not reach to any validity conclusion for any parts.

Many thanks for you reply, could you please tell me how you run the experiment? in this code:

while (random_entities[current_index], self.batch_indices[last_index + current_index, 1],
self.batch_indices[last_index + current_index, 2]) in self.valid_triples_dict.keys():

do you change the 'valid_triples_dict' so that it only includes training data?

@chauhanjatin10
Copy link
Collaborator

chauhanjatin10 commented Sep 25, 2020

Hi @wbaz4321 and @hshahbazi ,
Thanks for showing interest in the work. As you guys have pointed out the issues regarding leakage in the negative sampling, we have been doing experiments to rectify this. After extensive experiments we have empirically concluded that the issue was the use of ConvKB as decoder. When we replace ConvKB with ConvE, the Hits@10 on Freebase reach upto 53% with correct training protocol as proposed in "A Re-evaluation of Knowledge Graph Completion Methods" using the ConvE code provided here - https://github.com/svjan5/kg-reeval . Thus, we have evidence supporting the claims of our encoder model which will be updated soon by my co-author @deepakn97 . Also, as shown in the works - https://arxiv.org/pdf/2009.08694.pdf and https://www.aclweb.org/anthology/2020.acl-main.526.pdf , KBGAT's proposed encoder model indeed works well provided that the decoder is strong enough. I hope this clarifies some of your doubts @hshahbazi and @wbaz4321 .
Thanks and regards
Jatin

@jinjinjin9
Copy link

jinjinjin9 commented Sep 25, 2020

So, just to be sure, as we can conclude by now (can we?):

  1. with test leakage the full system provides very good scores
  2. without test leakage the full system provides much worse scores (by removing test triples from validation triples)

I don't see how this should be "rectified" by not using ConvKB as the decoder? Is there a bug in ConvKB then? Can you describe your experiments that "rectify the results" more closely? I mean, the two points above just suggest that the bug was in fact in the training and caused by the data leakage.

I am sorry for the inconvenience but I am working with your code and as of now I am very confused.

@deepakn97
Copy link
Owner

As you can see in this paper (https://www.aclweb.org/anthology/2020.acl-main.489.pdf), we are dealing with two different issues here.

  1. Test data leakage
  2. Wrong evaluation strategy for the decoder part.

Above mentioned paper removes the test leakage and uses Random evaluation strategy to get the updated results for our paper. However, they don't experiment with the hyperparameters. After re-evaluating our model and using updated hyperparameters we were able to get ~44 Hits @ 10 using ConvKB (These results are after removing test data leakage and using correct evaluation protocol).
ConvKB provides significantly less results using the correct evaluation strategy i.e only 42.1 Hits @ 10 compared to reported 51.7 Hits @ 10.

The paper also mentions that ConvE does not suffer from this problem and the results of ConvE is independent of the evaluation protocol. More specifically, many of the neurons in ConvKB become zeros after applying ReLU activation function as a result of which ConvKB predicts same scores for numerous triples which causes the stark difference in the results using different evaluation protocol. Following this reasoning we decided to experiment with ConvE instead of using ConvKB.

I hope this is helpful to you.

@jinjinjin9
Copy link

jinjinjin9 commented Sep 25, 2020

Thanks for the quick response!

But I wonder.... if it's now 53 with the random evaluation strategy, after you freshly tuned hyper-parameters... it would still fall behind TuckeR (53.6) and would be same as RotatE (53.0), in contrast to what's reported in the original result table displayed in https://www.aclweb.org/anthology/2020.acl-main.489.pdf.

Can you also report new results on the original task formulation (no randomized evaluation strategy), where the paper reported 62.2 hits?

@deepakn97
Copy link
Owner

The reported results (with ConvE as decoder) are with randomized evaluation strategy and they are independent of the evaluation strategy. The task formulation is still the same but the evaluation strategy used was wrong. So we don't think you should compare the results with wrong evaluation strategy.
We also understand your concern that in the current scenario this is not a big leap in the results. However, we still feel that the contribution of the paper was the Attention model used to learn graph embeddings. As the research progresses, we can use different decoders in continuation with KBGAT.
We will try to provide a way to use different decoders (TuckeR, RotatE etc) with our attention model. As of now, we would appreciate if you could save the embeddings generated from the Attention model and use them to train the Decoder models.

@wbaz4321
Copy link

Thanks for your efforts to adjust the model, I also personally think the encoding process is useful and I am currently trying to modify it for my work. Thanks very much for providing a good reference for me. I have another question about the classes SepcialSpmmFinal() and SpecialSpmmFinalFunctionalFinal() could you please tell me what do these function do? In the paper, once we have obtained the exp values of the absolute attention values, we can sum them up the get the value of e_row_sum
so why do we fed the values into the special function? Any suggestions are very appreciated

Also, I found that some one else mentions this issue as well, so if you can help us understand this, we will be very grateful.

@chauhanjatin10 chauhanjatin10 changed the title test data is used during training --> results are not valid error in results due to negative sampling data leakage Mar 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants