partition_all outputs padding tokens on sequences with inaccurate `__len__` (#602)

Comments
Thanks for the report! I think it's reasonable to expect […]. Perhaps you would be better served to try using […]. Please let us know if this approach works for you.
Is there a way to do that that's different than what I’m already doing? (In the example, I’m never setting […].)
Yeah, unfortunately it seems like tqdm uses that to set its `__len__`.

Oh I see. Nice digging! Yeah, that seems like poor behavior from tqdm.
OK, I’ll see about filing an issue over there, though I expect “fixing” it might be too much of a breaking change for them. 🤷 Meanwhile, any chance this:

```python
yield prev[:len(seq) % n]
```

could change to something like:

```python
result = prev[:len(seq) % n]
# Or raise RuntimeError or something else if assert doesn't feel appropriate. ¯\_(ツ)_/¯
assert result[-1] is not no_pad, 'The sequence passed to `partition_all` has an invalid length'
yield result
```

This at least would have helped me understand what was going on much quicker. If not, I (or you) can just close this. I totally understand this being too much of an outside-normal-operations case.
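The suggested guard can be sketched as a standalone function so the behavior is easy to try in isolation. Note these are illustrative names, not toolz internals: `claimed_len` stands in for `len(seq)`, and an emptiness check is added because the slice can be empty when the claimed length is a multiple of `n`:

```python
no_pad = '__no__pad__'

def trimmed_tail(prev, claimed_len, n):
    """Slice padding off the final chunk, asserting the result is sane.

    `prev` is the final, padded chunk; `claimed_len` plays the role of
    len(seq), which may be inaccurate.
    """
    result = prev[:claimed_len % n]
    # The suggested assert, plus an emptiness check: this code path only
    # runs when `prev` contains padding, so an empty slice also means the
    # claimed length was wrong.
    assert result and result[-1] is not no_pad, \
        'The sequence passed to `partition_all` has an invalid length'
    return result

print(trimmed_tail((3, no_pad, no_pad), claimed_len=4, n=3))  # → (3,)
```

With an accurate length the padding is trimmed cleanly; with an inaccurate one the assertion fires instead of silently leaking padding.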
This is definitely an unusual edge case, but it really had me scratching my head for a little bit.

I have a long-running process where I use tqdm to display a progress bar. It works through an unknown number of items, but, at the start, I can estimate the total number, even though it might be slightly off. So I set tqdm’s `total` argument to the estimated total. This causes tqdm to set its iterator’s `__len__` to the estimated total. Then `partition_all` uses `__len__` as a shortcut to remove the padding tokens it adds to the batches. Since `__len__` isn’t accurate, a bunch of padding tokens get left in the resulting output.

Here’s a really simplified example:
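The described failure can be reproduced without tqdm or toolz by re-implementing just the quoted `len()` shortcut; `EstimatedLen` and `partition_all_sketch` are illustrative names, where `EstimatedLen` mimics what tqdm does when `total` is set:

```python
from itertools import zip_longest

no_pad = '__no__pad__'

def partition_all_sketch(n, seq):
    """Minimal re-creation of the len() shortcut described in the issue,
    just enough to reproduce the reported behavior."""
    it = zip_longest(*[iter(seq)] * n, fillvalue=no_pad)
    try:
        prev = next(it)
    except StopIteration:
        return
    for item in it:
        yield prev
        prev = item
    if prev[-1] is no_pad:
        # The shortcut: trust len(seq) to know where the padding starts.
        yield prev[:len(seq) % n]
    else:
        yield prev

class EstimatedLen:
    """Stand-in for tqdm with total= set: iteration yields the real items,
    but __len__ reports the (possibly wrong) estimate."""
    def __init__(self, items, estimate):
        self.items, self.estimate = items, estimate
    def __iter__(self):
        return iter(self.items)
    def __len__(self):
        return self.estimate

# 4 real items, an estimated total of 5, chunks of 3:
seq = EstimatedLen(range(4), estimate=5)
print(list(partition_all_sketch(3, seq)))  # → [(0, 1, 2), (3, '__no__pad__')]
```

Because `len(seq) % n` is 2 instead of the true remainder of 1, the slice keeps one padding token in the last chunk.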
Obviously those `'__no__pad__'` strings shouldn’t be there.

Here’s the shortcut code that causes the problem by relying on `__len__`:

toolz/toolz/itertoolz.py, lines 731 to 736 in 08f2604
Again, this is obviously a weird case (I can’t think of many other situations where I’d be dealing with an inaccurate `__len__`), but I’m wondering if it could be solved by checking to make sure `prev[len(seq) % n - 1] is not no_pad`. If that check fails, then you could fall back to the binary search that’s used when `__len__` is not available.
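That proposal might look something like the following sketch. The names (`trim_tail`, `claimed_len`) are illustrative, and a linear scan stands in for toolz’s actual binary search:

```python
no_pad = '__no__pad__'

def first_pad_index(prev):
    """Fallback: locate the first padding token directly.
    (toolz uses a binary search here; a linear scan shows the idea.)"""
    for i, item in enumerate(prev):
        if item is no_pad:
            return i
    return len(prev)

def trim_tail(prev, claimed_len, n):
    """Trim padding from the final chunk, distrusting the claimed length.

    `claimed_len` plays the role of len(seq), which may be inaccurate.
    """
    cut = claimed_len % n
    # Proposed sanity check: the element just before the cut must be a real
    # item. If it's padding (or the cut is 0, impossible when this padded
    # branch runs), claimed_len overestimated the true length, so fall back
    # to locating the padding directly.
    if cut == 0 or prev[cut - 1] is no_pad:
        return prev[:first_pad_index(prev)]
    return prev[:cut]

# 4 real items, n=3, but a claimed length of 5: the naive slice would keep
# one padding token; the check catches it and falls back.
print(trim_tail((3, no_pad, no_pad), claimed_len=5, n=3))  # → (3,)
```

Note this check only catches overestimates, where the cut lands inside the padding; an underestimate that truncates real items would pass it.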