There is a data function called group_texts.
I understand that this function concatenates the texts and creates blocks of text of a specific block size.
I would like to understand why you do it this way. Why not pad to a specific tokenizer max length to get a rectangular tensor?
Could you please explain why you opted for this approach?
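(For reference, a minimal sketch of what such a group_texts function typically looks like, following Huggingface's language-modeling example; the repo's exact version may differ, and block_size is assumed to be configured elsewhere:)

```python
def group_texts(examples, block_size=1024):
    # Concatenate all tokenized documents into one long stream per field.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every block is exactly block_size long;
    # no pad tokens are needed, so every position carries real text.
    total_length = (total_length // block_size) * block_size
    # Split the concatenated stream into fixed-size blocks.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training, the labels are the inputs themselves
    # (the shift happens inside the model).
    result["labels"] = result["input_ids"].copy()
    return result
```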
Hi Hossam,
I don't think I implemented this function myself; I believe I copied it from Huggingface's language modeling example.
If I remember correctly, it's simply more efficient than padding: you can pack more documents into the same batch, and padding is basically wasted compute.
Another thing I think this function does, if I remember correctly, is build the sliding-window evaluation.
This means there are overlaps between chunks, but every token is predicted only once and then serves as context in the next chunk.
Best,
Uri
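(To make the sliding-window point concrete, here is a generic strided-evaluation sketch, not the repo's actual code; max_length, stride, and a Huggingface-style causal LM returning .loss are assumptions. Windows overlap so later tokens get long context, but overlap tokens are masked to -100 so each token is scored exactly once:)

```python
import math
import torch

@torch.no_grad()
def sliding_window_ppl(model, input_ids, max_length=1024, stride=512, device="cpu"):
    """Perplexity with a sliding window over one long token sequence.
    Windows overlap, but each token contributes to the loss exactly once."""
    seq_len = input_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end               # new tokens scored in this window
        ids = input_ids[:, begin:end].to(device)
        targets = ids.clone()
        targets[:, :-trg_len] = -100           # overlapping prefix is context only
        loss = model(ids, labels=targets).loss # mean NLL over the scored tokens
        nll_sum += loss.item() * trg_len
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_scored)
```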