There is a data function called group_texts.
I understand that this function concatenates the texts and creates blocks of text of a specific block size.
I would like to understand why you do it this way. Why not pad to a specific tokenizer max length to get a rectangular tensor?
Could you please explain why you opted for this approach?
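(For reference, a minimal sketch of what such a group_texts function typically looks like, following Huggingface's language-modeling example; the repo's exact version may differ, and block_size is assumed to be configured elsewhere:)

```python
def group_texts(examples, block_size=1024):
    # Concatenate all tokenized documents into one long stream per field.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every block is exactly block_size long;
    # no pad tokens are needed, so every position carries real text.
    total_length = (total_length // block_size) * block_size
    # Split the concatenated stream into fixed-size blocks.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training, the labels are the inputs themselves
    # (the shift happens inside the model).
    result["labels"] = result["input_ids"].copy()
    return result
```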
Hi Hossam,
I don't think I implemented this function myself; I believe I copied it from Huggingface's language modeling example.
If I remember correctly, it's simply more efficient than padding: you can pack more documents into the same batch, and padding is basically wasted compute.
Another thing I think this function does, if I remember correctly, is build the sliding-window evaluation.
This means there are overlaps between chunks, but every token is predicted only once and then serves as context in the next chunk.
Best,
Uri
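(To make the sliding-window point concrete, here is a generic strided-evaluation sketch, not the repo's actual code; max_length, stride, and a Huggingface-style causal LM returning .loss are assumptions. Windows overlap so later tokens get long context, but overlap tokens are masked to -100 so each token is scored exactly once:)

```python
import math
import torch

@torch.no_grad()
def sliding_window_ppl(model, input_ids, max_length=1024, stride=512, device="cpu"):
    """Perplexity with a sliding window over one long token sequence.
    Windows overlap, but each token contributes to the loss exactly once."""
    seq_len = input_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end               # new tokens scored in this window
        ids = input_ids[:, begin:end].to(device)
        targets = ids.clone()
        targets[:, :-trg_len] = -100           # overlapping prefix is context only
        loss = model(ids, labels=targets).loss # mean NLL over the scored tokens
        nll_sum += loss.item() * trg_len
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_scored)
```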