`partition_all` outputs padding tokens on sequences with inaccurate `len` #602

Mr0grog · 2025-02-04T08:50:02Z

This is definitely an unusual edge case, but it really had me scratching my head for a little bit.

I have a long-running process where I use tqdm to display a progress bar. It works through an unknown number of items, but, at the start, I can estimate the total number, even though it might be slightly off. So I set tqdm’s total argument to the estimated total. This causes tqdm to set its iterator’s __len__ to the estimated total. Then partition_all uses __len__ as a shortcut to remove the padding tokens it adds to the batches. Since __len__ isn’t accurate, a bunch of padding tokens get left in the resulting output.

Here’s a really simplified example:

def make_n(n):
    for i in range(n):
        yield i

# This has no __len__
five = make_n(5)

# This has an incorrect __len__ of 7.
progress = tqdm(five, total=7)

for batch in toolz.partition_all(10, progress):
    print(batch)

# Prints: (0, 1, 2, 3, 4, '__no__pad__', '__no__pad__')

Obviously those '__no__pad__' strings shouldn’t be there.

Here’s the shortcut code that causes the problem by relying on __len__:

toolz/toolz/itertoolz.py

Lines 731 to 736 in 08f2604

    
           if prev[-1] is no_pad: 
        
               try: 
        
                   # If seq defines __len__, then 
        
                   # we can quickly calculate where no_pad starts 
        
                   yield prev[:len(seq) % n] 
        
               except TypeError:

Again, this is obviously a weird case (I can’t think of many other situations where I’d be dealing with an inaccurate __len__), but I’m wondering if it could be solved for by checking to make sure prev[len(seq) % n - 1] is not no_pad. If that check fails, then you could fall back to the binary search that’s used when __len__ is not available.

The text was updated successfully, but these errors were encountered:

eriknw · 2025-02-04T10:38:56Z

Thanks for the report!

I think it's reasonable to expect __len__ to be accurate. For approximate length, see PEP 424: https://peps.python.org/pep-0424/

Perhaps you would be better served to try using __length_hint__. I don't know if tqdm uses it directly (it probably should if not), but you can also pass the expected total number of elements when using tqdm.

Please let us know if this approach works for you

Mr0grog · 2025-02-04T14:44:56Z

you can also pass the expected total number of elements when using tqdm.

Is there a way to do that that's different than what I’m already doing? (In the example, I’m never setting __len__ directly, just using tqdm's total.) I couldn't find anything in their docs or that jumped out at me in the source. Would love to know if there's a more correct way here that I've always been missing.

For approximate length, see PEP 424: https://peps.python.org/pep-0424/

Yeah, unfortunately it seems like tqdm uses that to set its __len__, which may just be bad behavior on their part: https://github.com/tqdm/tqdm/blob/0ed5d7f18fa3153834cbac0aa57e8092b217cc16/tqdm/std.py#L1117

eriknw · 2025-02-04T16:10:35Z

Oh I see. Nice digging! Yeah, that seems like poor behavior from tqdm.

Mr0grog · 2025-02-04T17:22:13Z

OK, I’ll see about filing an issue over there, though I expect “fixing” it might be too much of a breaking change for them. 🤷

Meanwhile, any chance this:

yield prev[:len(seq) % n]

could change to something like:

result = prev[:len(seq) % n]
# Or raise RuntimeError or something else if assert doesn't feel appropriate. ¯\_(ツ)_/¯
assert result[-1] is not no_pad, 'The sequence passed to `partition_all` has an invalid length'
yield result

This at least would have helped me understand what was going on much quicker.

If not, I (or you) can just close this. I totally understand this being too much of an outside-normal-operations case.

Mr0grog mentioned this issue Feb 4, 2025

Setting __len__ from __length_hint__ or an *estimated* total can cause downstream problems tqdm/tqdm#1652

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`partition_all` outputs padding tokens on sequences with inaccurate `len` #602

`partition_all` outputs padding tokens on sequences with inaccurate `len` #602

Mr0grog commented Feb 4, 2025

eriknw commented Feb 4, 2025

Mr0grog commented Feb 4, 2025 •

edited

Loading

eriknw commented Feb 4, 2025

Mr0grog commented Feb 4, 2025

partition_all outputs padding tokens on sequences with inaccurate __len__ #602

partition_all outputs padding tokens on sequences with inaccurate __len__ #602

Comments

Mr0grog commented Feb 4, 2025

eriknw commented Feb 4, 2025

Mr0grog commented Feb 4, 2025 • edited Loading

eriknw commented Feb 4, 2025

Mr0grog commented Feb 4, 2025

`partition_all` outputs padding tokens on sequences with inaccurate `len` #602

`partition_all` outputs padding tokens on sequences with inaccurate `len` #602

Mr0grog commented Feb 4, 2025 •

edited

Loading