AttributeError: 'IntegerArray' object has no attribute 'tobytes' #406

Open · JamesCropcho opened this issue Feb 24, 2019 · 8 comments

@JamesCropcho

Hello!

I am hoping you might know what is going on.

Using the latest development versions of Dask and fastparquet on Python 3.7.2, I execute:

df.to_parquet(
    'us_president_tax_returns.parquet', 
    engine='fastparquet', 
    append=False)

Which raises:

AttributeError                            Traceback (most recent call last)
<timed eval> in <module>

~/dask/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
   2828         """ See dd.to_parquet docstring for more information """
   2829         from .io import to_parquet
-> 2830         return to_parquet(self, path, *args, **kwargs)
   2831 
   2832     @derived_from(pd.DataFrame)

~/dask/dask/dataframe/io/parquet.py in to_parquet(df, path, engine, compression, write_index, append, ignore_divisions, partition_on, storage_options, compute, **kwargs)
   1232 
   1233     if compute:
-> 1234         out.compute()
   1235         return None
   1236     return out

~/dask/dask/base.py in compute(self, **kwargs)
    154         dask.base.compute
    155         """
--> 156         (result,) = compute(self, traverse=False, **kwargs)
    157         return result
    158 

~/dask/dask/base.py in compute(*args, **kwargs)
    396     keys = [x.__dask_keys__() for x in collections]
    397     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 398     results = schedule(dsk, keys, **kwargs)
    399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    400 

~/dask/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     74     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     75                         cache=cache, get_id=_thread_get_id,
---> 76                         pack_exception=pack_exception, **kwargs)
     77 
     78     # Cleanup pools associated to dead threads

~/dask/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    458                         _execute_task(task, data)  # Re-execute locally
    459                     else:
--> 460                         raise_exception(exc, tb)
    461                 res, worker_id = loads(res_info)
    462                 state['cache'][key] = res

~/dask/dask/compatibility.py in reraise(exc, tb)
    110         if exc.__traceback__ is not tb:
    111             raise exc.with_traceback(tb)
--> 112         raise exc
    113 
    114     import pickle as cPickle

~/dask/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    228     try:
    229         task, data = loads(task_info)
--> 230         result = _execute_task(task, data)
    231         id = get_id()
    232         result = dumps((result, id))

~/dask/dask/core.py in _execute_task(arg, cache, dsk)
    117         func, args = arg[0], arg[1:]
    118         args2 = [_execute_task(a, cache) for a in args]
--> 119         return func(*args2)
    120     elif not ishashable(arg):
    121         return arg

~/dask/dask/dataframe/io/parquet.py in _write_partition_fastparquet(df, fs, path, filename, fmd, compression, partition_on)
    546         with fs.open(fs.sep.join([path, filename]), 'wb') as fil:
    547             rgs = make_part_file(fil, df, fmd.schema, compression=compression,
--> 548                                  fmd=fmd)
    549     return rgs
    550 

~/.pyenv/versions/3.7.2/lib/python3.7/site-packages/fastparquet/writer.py in make_part_file(f, data, schema, compression, fmd)
    629     with f as f:
    630         f.write(MARKER)
--> 631         rg = make_row_group(f, data, schema, compression=compression)
    632         if fmd is None:
    633             fmd = parquet_thrift.FileMetaData(num_rows=len(data),

~/.pyenv/versions/3.7.2/lib/python3.7/site-packages/fastparquet/writer.py in make_row_group(f, data, schema, compression)
    617                 comp = compression
    618             chunk = write_column(f, data[column.name], column,
--> 619                                  compression=comp)
    620             rg.columns.append(chunk)
    621     rg.total_byte_size = sum([c.meta_data.total_uncompressed_size for c in

~/.pyenv/versions/3.7.2/lib/python3.7/site-packages/fastparquet/writer.py in write_column(f, data, selement, compression)
    511 
    512     bdata = definition_data + repetition_data + encode[encoding](
--> 513             data, selement)
    514     bdata += 8 * b'\x00'
    515     try:

~/.pyenv/versions/3.7.2/lib/python3.7/site-packages/fastparquet/writer.py in encode_plain(data, se)
    254         return pack_byte_array(list(out))
    255     else:
--> 256         return out.tobytes()
    257 
    258 

AttributeError: 'IntegerArray' object has no attribute 'tobytes'

While I can't share the full df.dtypes.values output, here is a partial listing:

array([dtype('uint32'),
       CategoricalDtype(categories=['REDACTED'], ordered=False),
       UInt16Dtype(),
       CategoricalDtype(categories=['REDACTED'], ordered=False),
       dtype('O'),
       dtype('O'), datetime64[ns, UTC], dtype('O'), dtype('uint16'),
       dtype('O'),
       CategoricalDtype(categories=['__UNKNOWN_CATEGORIES__'], ordered=False),
       dtype('bool'),
       dtype('bool'), UInt8Dtype(), datetime64[ns, UTC],
       datetime64[ns, UTC], datetime64[ns, UTC], datetime64[ns, UTC],
       UInt8Dtype(),
       UInt32Dtype(), UInt16Dtype(), UInt32Dtype(), dtype('O'),
       dtype('O'), dtype('O'), dtype('O'), dtype('O'),
       dtype('O'), dtype('O')], dtype=object)

Note my use of the new Nullable Integer data type (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#integer-na).
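
For reference, a minimal reproducer sketch along these lines (the column name and values are hypothetical, not from my real data):

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical column using the nullable UInt16 dtype; the None entry means
    # the column is backed by an IntegerArray rather than a plain numpy array.
    pdf = pd.DataFrame({"filing_year": pd.array([2016, 2017, None], dtype="UInt16")})
    ddf = dd.from_pandas(pdf, npartitions=1)

    ddf.to_parquet("repro.parquet", engine="fastparquet", append=False)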

What do you think?

―James

@martindurant
Member

Note that you would get a much simpler traceback if you tried to write the same data using fastparquet's write() function directly.

Indeed, the code assumes that the data is a numpy array, which used to always be the case for integers. The new integer-with-nulls type should instead follow the path that was previously one of the options for object-type arrays, encoding as integer-with-nulls. This is fixable but would take a little poking around.
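
As a sketch of what the direct call looks like (assuming df is the Dask DataFrame above, so one computed partition stands in for the data):

    import fastparquet

    # Writing a single computed (pandas) partition straight through fastparquet
    # hits the same encode_plain path with a much shorter traceback.
    pdf = df.get_partition(0).compute()
    fastparquet.write("direct_write_test.parquet", pdf)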

@JamesCropcho
Author

JamesCropcho commented Feb 24, 2019 via email

@martindurant
Member

This is certainly fastparquet's remit. I would appreciate any help in fixing it, though. The code is already in place for the object([int, int, None]) case from before, but needs logic to call it correctly in the right place.
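
For context, a rough sketch of the case that already works, assuming object_encoding="int" is used for an object column of Python ints and None:

    import pandas as pd
    import fastparquet

    # An object column of Python ints and None already has a write path: the
    # values are encoded as integers and the None entries become nulls.
    obj_df = pd.DataFrame({"col": pd.Series([1, 2, None], dtype=object)})
    fastparquet.write("object_int_example.parquet", obj_df, object_encoding="int")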

@martindurant
Member

(it is likely that the IntegerArray already has the right structures internally to make writing them to parquet easy - @TomAugspurger )

@TomAugspurger
Member

We basically haven't addressed IO for extension arrays: pandas-dev/pandas#20612. Once that's solved, the idea would be for each ExtensionArray to determine how it should be serialized.

I'm not sure whether fastparquet wants to get ahead of pandas here. The internal representation of IntegerArray is likely to change in the near future.

@martindurant
Member

Quite surprised to see that IntegerArray doesn't have public methods to get the values and mask separately. They are available as attributes _data and _mask, which is what we'll have to use.
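
A rough sketch of what that would look like, assuming the private attributes keep their current (pandas 0.24) meaning:

    import pandas as pd

    arr = pd.array([1, 2, None], dtype="UInt16")  # an IntegerArray

    values = arr._data   # numpy uint16 array; slots behind missing values are undefined
    mask = arr._mask     # numpy bool array, True where the value is missing

    # The raw values can then be encoded the same way plain numpy integer columns
    # already are, with the mask feeding the definition levels for the nulls.
    raw = values.tobytes()
    valid = ~mask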

@TomAugspurger
Member

TomAugspurger commented Feb 24, 2019 via email

@martindurant
Member

Given

       IntegerArray is currently experimental, and its API or internal
       implementation may change without warning.

I think it's reasonable for fastparquet not to support this for the time being. Eventually this should become the standard type that parquet integer columns produce on reading, and, as in this case, it should be valid input when writing - but not yet. I would ask that users keep to the standard object or float representation for now, even though it is less efficient.
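
For anyone hitting this in the meantime, a rough workaround sketch (the column list is hypothetical; adjust for your own data):

    # Cast nullable-integer columns back to float64 so missing values become NaN,
    # which fastparquet already knows how to write.
    nullable_int_cols = ["filing_year"]  # hypothetical; list your own columns here
    for col in nullable_int_cols:
        df[col] = df[col].astype("float64")

    df.to_parquet("us_president_tax_returns.parquet", engine="fastparquet", append=False)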
