Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

partition_all outputs padding tokens on sequences with inaccurate __len__ #602

Open
Mr0grog opened this issue Feb 4, 2025 · 4 comments
Open

Comments

@Mr0grog
Copy link

Mr0grog commented Feb 4, 2025

This is definitely an unusual edge case, but it really had me scratching my head for a little bit.

I have a long-running process where I use tqdm to display a progress bar. It works through an unknown number of items, but, at the start, I can estimate the total number, even though it might be slightly off. So I set tqdm’s total argument to the estimated total. This causes tqdm to set its iterator’s __len__ to the estimated total. Then partition_all uses __len__ as a shortcut to remove the padding tokens it adds to the batches. Since __len__ isn’t accurate, a bunch of padding tokens get left in the resulting output.

Here’s a really simplified example:

def make_n(n):
    for i in range(n):
        yield i

# This has no __len__
five = make_n(5)

# This has an incorrect __len__ of 7.
progress = tqdm(five, total=7)

for batch in toolz.partition_all(10, progress):
    print(batch)

# Prints: (0, 1, 2, 3, 4, '__no__pad__', '__no__pad__')

Obviously those '__no__pad__' strings shouldn’t be there.

Here’s the shortcut code that causes the problem by relying on __len__:

toolz/toolz/itertoolz.py

Lines 731 to 736 in 08f2604

if prev[-1] is no_pad:
try:
# If seq defines __len__, then
# we can quickly calculate where no_pad starts
yield prev[:len(seq) % n]
except TypeError:

Again, this is obviously a weird case (I can’t think of many other situations where I’d be dealing with an inaccurate __len__), but I’m wondering if it could be solved for by checking to make sure prev[len(seq) % n - 1] is not no_pad. If that check fails, then you could fall back to the binary search that’s used when __len__ is not available.

@eriknw
Copy link
Member

eriknw commented Feb 4, 2025

Thanks for the report!

I think it's reasonable to expect __len__ to be accurate. For approximate length, see PEP 424: https://peps.python.org/pep-0424/

Perhaps you would be better served to try using __length_hint__. I don't know if tqdm uses it directly (it probably should if not), but you can also pass the expected total number of elements when using tqdm.

Please let us know if this approach works for you

@Mr0grog
Copy link
Author

Mr0grog commented Feb 4, 2025

you can also pass the expected total number of elements when using tqdm.

Is there a way to do that that's different than what I’m already doing? (In the example, I’m never setting __len__ directly, just using tqdm's total.) I couldn't find anything in their docs or that jumped out at me in the source. Would love to know if there's a more correct way here that I've always been missing.

For approximate length, see PEP 424: https://peps.python.org/pep-0424/

Yeah, unfortunately it seems like tqdm uses that to set its __len__, which may just be bad behavior on their part: https://github.com/tqdm/tqdm/blob/0ed5d7f18fa3153834cbac0aa57e8092b217cc16/tqdm/std.py#L1117

@eriknw
Copy link
Member

eriknw commented Feb 4, 2025

Oh I see. Nice digging! Yeah, that seems like poor behavior from tqdm.

@Mr0grog
Copy link
Author

Mr0grog commented Feb 4, 2025

OK, I’ll see about filing an issue over there, though I expect “fixing” it might be too much of a breaking change for them. 🤷

Meanwhile, any chance this:

yield prev[:len(seq) % n]

could change to something like:

result = prev[:len(seq) % n]
# Or raise RuntimeError or something else if assert doesn't feel appropriate. ¯\_(ツ)_/¯
assert result[-1] is not no_pad, 'The sequence passed to `partition_all` has an invalid length'
yield result

This at least would have helped me understand what was going on much quicker.

If not, I (or you) can just close this. I totally understand this being too much of an outside-normal-operations case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants