Difficulties to reproduce XSUM results with BART #1971
Comments
I got the raw text from one of the XSum authors. If you can get it from them, you should get a better number. I am not sure how to revert their released data (which is tokenized) back to raw text.
Hi @colanim, could you do me a favor? After this line
@yinhanliu thanks for the details. After modifying as you mentioned, my score is 1 point higher:
That's great! But it's still 1 point below the paper's results. I asked for the raw XSUM dataset and will update this issue when I receive it (the author hasn't responded yet). In the meantime, any idea where this 1-point difference might come from?
@colanim https://github.com/pytorch/fairseq/blob/d37529ed234ea9173ed35f6797a51a85378ecfca/fairseq/tasks/fairseq_task.py#L350 Let me know how it goes.
@yinhanliu thanks for your help. Here are my results after applying the changes you mentioned:
Thank you!
@yinhanliu According to the XSum author, the provided link points to the same dataset you used. I followed the same train/val/test split as the one provided by the author. I didn't apply any additional processing, and I applied BPE encoding + binarization with exactly the same parameters as for CNN/DM.
Thanks so much for letting me know. I will work on this shortly. The released model is actually supposed to work better than the one in the paper.
@colanim I figured it out. In the original paper we added BOS to each src and tgt during fine-tuning, but we didn't do so when we open-sourced the code. Setting prepend-bos to True in the translation task can improve the result when you fine-tune.
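Concretely, the fix being described amounts to adding the BOS id to the front of every tokenized source and target sequence before fine-tuning. The sketch below is framework-free and illustrative: the BOS value 0 and the BPE ids are assumptions, not taken from the actual fairseq dictionary.

```python
# Illustrative sketch of BOS-prepending; BOS = 0 is an assumption
# (in fairseq, the id would come from the model's dictionary, e.g. dict.bos()).
BOS = 0

def prepend_bos(token_ids):
    """Prepend the BOS id to a tokenized sequence, as described for both
    source and target sequences during the paper's fine-tuning runs."""
    if token_ids and token_ids[0] == BOS:
        return token_ids  # already prepended, leave unchanged
    return [BOS] + token_ids

src = [713, 16, 10, 1296, 2]   # hypothetical BPE ids ending in EOS
tgt = [100, 524, 10, 4819, 2]

print(prepend_bos(src))  # -> [0, 713, 16, 10, 1296, 2]
print(prepend_bos(tgt))  # -> [0, 100, 524, 10, 4819, 2]
```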
@yinhanliu Thanks for your answer, I see. But I didn't fine-tune BART myself; I used your already fine-tuned XSUM checkpoint, so I'm only doing evaluation. Should I modify any parameters in the evaluation script?
I tuned it incorrectly (I didn't add BOS when fine-tuning). @ngoyal2707 can you double-check? I think the current code doesn't have an option to prepend BOS.
@yinhanliu thanks for the fast answer! Do you plan to release a fixed checkpoint?
Hi, I tried to add the above code to sequence_generator.py, and it gives me an error because self.bos does not exist. Do I have to add it manually?
Hi @colanim, when you say without any preprocessing, do you mean even without lowercasing the text? I am also trying to reproduce the XSum results using the uploaded checkpoint, but my ROUGE scores are lower than yours (ROUGE-1 41, ROUGE-2 17, ROUGE-L 32).
Yes, I didn't apply any other processing: just the raw dataset and the checkpoint given by the author.
@colanim Thanks, I managed to get similar results using the raw dataset. I am also wondering how did you use the following piece of code (provided by author above) to further boost ROUGE? if step == 0: It seems that self.bos is not defined in the code. |
@zsquaredz I can't access my code right now, but I think you can't access
Can you try this?
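For anyone hitting the self.bos error, here is a minimal, framework-free sketch of what such a step-0 patch might look like. This is an assumption-laden reconstruction: the real fairseq SequenceGenerator operates on torch tensors, the thread's variant boosts the BOS score to 1000 rather than masking the rest, and the BOS id of 0 is illustrative.

```python
import math

def force_bos_at_step0(lprobs, step, bos_id):
    """At the first decoding step, make BOS the only viable token by giving
    every other token -inf; a hedged reconstruction of the patch discussed
    above (fairseq's actual code manipulates torch tensors, not lists)."""
    if step == 0:
        for row in lprobs:                 # one row per beam hypothesis
            for tok in range(len(row)):
                row[tok] = 0.0 if tok == bos_id else -math.inf
    return lprobs

# one beam, vocab of 4, BOS id 0
lprobs = [[-1.2, -0.3, -2.0, -0.7]]
print(force_bos_at_step0(lprobs, step=0, bos_id=0))  # -> [[0.0, -inf, -inf, -inf]]
```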
@colanim Got it, thanks for the suggestion.
Any update on this?
Hi, can you specify the place that should be revised, so I can try to fix it?
Hi @yinhanliu, I'm trying to reproduce the results too. I tried this code, and it indeed improved the ROUGE scores, but I'm confused about why it works. I'm using fairseq v0.10.2. Here is what I've tried:
However, this is very confusing, because BART already forces the prefix token to be BOS here. I don't understand why setting the score to 1000 can make any difference. My observation is that, with ... Could you help me understand why it works?
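I can't verify fairseq's internals from the thread alone, but one plausible mechanism, assuming hypothesis scores are length-normalized (sum of log-probs divided by length**lenpen): prefix-forcing keeps the model's own (negative) BOS log-prob in the cumulative score, whereas overwriting it with a constant shifts every hypothesis by a length-dependent amount after normalization, which can reorder the beam. A toy illustration with made-up numbers:

```python
def normalized_score(step_logprobs, lenpen=1.0):
    """Length-normalized beam score: sum of per-step log-probs divided by
    length**lenpen (the normalization fairseq applies with lenpen)."""
    return sum(step_logprobs) / (len(step_logprobs) ** lenpen)

# Step 0 is the forced BOS. Keeping the model's own BOS log-prob (-2.5),
# the longer hypothesis wins after normalization:
short = [-2.5, -0.2]
long_ = [-2.5, -0.4, -0.3]
print(normalized_score(short) > normalized_score(long_))  # -> False

# Overwrite the forced step's score with a constant (0.0 here; the thread's
# patch uses +1000) and the ranking flips:
print(normalized_score([0.0] + short[1:]) > normalized_score([0.0] + long_[1:]))  # -> True
```

So the constant does not change which token is generated at step 0, only how that step contributes to the final length-normalized ranking; whether this fully explains the ROUGE gap is speculation on my part.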
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
So, how can we reproduce the BART result? I'm still confused >.<
Hey man, thank you for your issue.
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you! |
I'm trying to reproduce the results of BART on the XSUM dataset.
I followed the README, didn't apply any preprocessing to the XSUM data, and used
beam=6, lenpen=1.0, max_len_b=60, min_len=10
for generation. I got the following results:
which is a bit lower than the reported results:

For the CNN/DM dataset, there were a few details to add in the data preprocessing step, and I'm wondering if I missed similar details for the XSUM dataset.
Adding the missing preprocessing steps led to score improvements there, so I suspect it's the same issue for XSUM. Does someone know where I can find a detailed explanation of how to preprocess the XSUM dataset?
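For reference, the BPE encoding + binarization I ran (with the same parameters as for CNN/DM) follows the fairseq BART README; the sketch below mirrors those instructions, and the file names (train/val/test.{source,target}, encoder.json, vocab.bpe, dict.txt) are assumptions about the local layout, not fixed by the thread.

```shell
# Hedged sketch of the preprocessing pipeline, mirroring the fairseq BART
# README's CNN/DM instructions; paths are placeholders.
TASK=xsum

# GPT-2 BPE-encode each split of raw source/target text.
for SPLIT in train val test; do
  for LANG in source target; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
      --encoder-json encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs "$TASK/$SPLIT.$LANG" \
      --outputs "$TASK/$SPLIT.bpe.$LANG" \
      --workers 60 \
      --keep-empty
  done
done

# Binarize with BART's dictionary so ids match the pretrained checkpoint.
fairseq-preprocess \
  --source-lang source --target-lang target \
  --trainpref "$TASK/train.bpe" \
  --validpref "$TASK/val.bpe" \
  --destdir "$TASK-bin/" \
  --workers 60 \
  --srcdict dict.txt --tgtdict dict.txt
```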
@ngoyal2707 @yinhanliu
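For completeness, the generation setup above corresponds to an evaluation loop along these lines, based on the fairseq BART README. This is a sketch under assumptions: it needs a fairseq install plus the released checkpoint, and the directory name 'bart.large.xsum', file paths, and batch size are placeholders.

```python
# Hedged sketch of the evaluation loop; requires fairseq + the released
# XSUM checkpoint, so it is not runnable as-is.
import torch
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained('bart.large.xsum', checkpoint_file='model.pt')
bart.cuda()
bart.eval()
bart.half()

def write_hypos(batch, fout):
    with torch.no_grad():
        # the beam/lenpen/max_len_b/min_len values are the ones quoted above
        hypos = bart.sample(batch, beam=6, lenpen=1.0, max_len_b=60, min_len=10)
    for h in hypos:
        fout.write(h + '\n')

with open('xsum/test.source') as source, open('test.hypo', 'w') as fout:
    batch = []
    for line in source:
        batch.append(line.strip())
        if len(batch) == 32:      # placeholder batch size
            write_hypos(batch, fout)
            batch = []
    if batch:
        write_hypos(batch, fout)
```

The resulting test.hypo file would then be scored against xsum/test.target with a ROUGE implementation of your choice.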