ENH: write Table meta-data (non_index_axes) as a CArray (rather than as meta-data) #6245

Open
dragoljub opened this issue Feb 3, 2014 · 13 comments
Labels
Enhancement IO HDF5 read_hdf, HDFStore

Comments

@dragoljub

@jreback

I get this very strange error when writing very wide HDF5 tables. In this case a random float32 array with 1500 columns can't be written as an HDF5 table. However, if I rename the columns, it seems to write fine...

Not sure what's going on.

-Gagi

In [1]: %paste
import numpy as np
import pandas as pd

print np.__version__
print pd.__version__

pd.options.display.large_repr = 'info'

df = pd.DataFrame(np.random.randn(1000,1500), dtype=np.float32)
#df.columns = df.columns.map(lambda x: 'A'+str(x))
df.to_hdf(r'Y:\table.h5', 'df', format='table', nan_rep='')

## -- End pasted text --
1.7.1
0.13.0
---------------------------------------------------------------------------
HDF5ExtError                              Traceback (most recent call last)
<ipython-input-1-ae7bcadf7834> in <module>()
      9 df = pd.DataFrame(np.random.randn(1000,1500), dtype=np.float32)
     10 #df.columns = df.columns.map(lambda x: 'A'+str(x))
---> 11 df.to_hdf(r'Y:\table.h5', 'df', format='table', nan_rep='')

D:\Python27\lib\site-packages\pandas\core\generic.pyc in to_hdf(self, path_or_buf, key, **kwargs)
    860
    861         from pandas.io import pytables
--> 862         return pytables.to_hdf(path_or_buf, key, self, **kwargs)
    863
    864     def to_msgpack(self, path_or_buf=None, **kwargs):

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
    269         with get_store(path_or_buf, mode=mode, complevel=complevel,
    270                        complib=complib) as store:
--> 271             f(store)
    272     else:
    273         f(path_or_buf)

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store)
    264         f = lambda store: store.append(key, value, **kwargs)
    265     else:
--> 266         f = lambda store: store.put(key, value, **kwargs)
    267
    268     if isinstance(path_or_buf, compat.string_types):

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in put(self, key, value, format, append, **kwargs)
    792             format = get_option("io.hdf.default_format") or 'fixed'
    793         kwargs = self._validate_format(format, kwargs)
--> 794         self._write_to_group(key, value, append=append, **kwargs)
    795
    796     def remove(self, key, where=None, start=None, stop=None):

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1244
   1245         # write the object
-> 1246         s.write(obj=value, append=append, complib=complib, **kwargs)
   1247
   1248         if s.is_table and index:

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwa
   3548
   3549             # set the table attributes
-> 3550             self.set_attrs()
   3551
   3552             # create the table

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in set_attrs(self)
   2858         self.attrs.index_cols = self.index_cols()
   2859         self.attrs.values_cols = self.values_cols()
-> 2860         self.attrs.non_index_axes = self.non_index_axes
   2861         self.attrs.data_columns = self.data_columns
   2862         self.attrs.nan_rep = self.nan_rep

D:\Python27\lib\site-packages\tables\attributeset.pyc in __setattr__(self, name, value)
    450
    451         # Set the attribute.
--> 452         self._g__setattr(name, value)
    453
    454         # Log new attribute addition.

D:\Python27\lib\site-packages\tables\attributeset.pyc in _g__setattr(self, name, value)
    393             value = stvalue[()]
    394
--> 395         self._g_setattr(self._v_node, name, stvalue)
    396
    397         # New attribute or value. Introduce it into the local

D:\Python27\lib\site-packages\tables\hdf5extension.pyd in tables.hdf5extension.AttributeSet._g_setattr (tables\hdf5extension.c:6334)()

HDF5ExtError: HDF5 error back trace

  File "..\..\hdf5-1.8.11\src\H5A.c", line 254, in H5Acreate2
    unable to create attribute
  File "..\..\hdf5-1.8.11\src\H5A.c", line 503, in H5A_create
    unable to create attribute in object header
  File "..\..\hdf5-1.8.11\src\H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "..\..\hdf5-1.8.11\src\H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "..\..\hdf5-1.8.11\src\H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "..\..\hdf5-1.8.11\src\H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node:
 /df (Group) u''.

In [2]: df.columns = df.columns.map(lambda x: 'A'+str(x))

In [3]: df.to_hdf(r'Y:\tabl2e3.h5', 'df', format='table', nan_rep='')
@jreback
Contributor

jreback commented Feb 3, 2014

There is a limit on the amount of meta-data that can be stored within a node, and, for example, a string of the columns is stored there (along with lots of other things). So it's not a column limit per se, but an internal limit. As we have discussed before, it is in general not a good idea to store very wide tables. You might want to explore making these 3-D (a Panel); then these become long tables (which is quite efficient).

maybe something like: http://pandas.pydata.org/pandas-docs/dev/reshaping.html#reshaping-by-melt

might help.

e.g. 1500 columns * 4 characters * 8 bytes per character = 48 KB (plus other meta-data whose exact size I don't know)
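For illustration, a minimal sketch of reshaping wide to long before storing (the file name and the use of append here are illustrative, not from this thread):

    import numpy as np
    import pandas as pd

    # a wide frame: 1000 rows x 1500 float32 columns
    df = pd.DataFrame(np.random.randn(1000, 1500), dtype=np.float32)

    # reshape wide -> long; 'variable' holds the original column label,
    # 'value' holds the data, so the stored table has only 3 columns
    long_df = pd.melt(df.reset_index(), id_vars=['index'])

    # a long table appends efficiently and keeps the column meta-data tiny
    long_df.to_hdf('table_long.h5', 'df', format='table', append=True)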

@dragoljub
Copy link
Author

But this does not happen when you prepend 'A' to each integer column in the DataFrame above and write the same data to an HDF5 table, which is strange... Maybe the string representation of the column name is smaller than the int64s...

So we are limited in how wide HDF5 tables can be because the column names might be too long?

I'll look into the performance of storing the data as a Panel. I guess I can also break the DataFrame up into multiple nodes (not-so-wide chunks), as sketched below, but then I have to pay the penalty of concatenating those columns on read...
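A rough sketch of that chunked workaround (the key layout and chunk width are arbitrary choices, not anything pandas provides):

    import pandas as pd

    def write_wide(df, path, key='df', chunk=500):
        # store each slice of columns as its own (narrower) table node
        with pd.HDFStore(path) as store:
            for i in range(0, df.shape[1], chunk):
                store.put('%s/c%06d' % (key, i), df.iloc[:, i:i + chunk],
                          format='table')

    def read_wide(path, key='df'):
        # read the chunks back (zero-padded keys sort correctly) and
        # pay the concatenation penalty once
        with pd.HDFStore(path) as store:
            keys = sorted(k for k in store.keys()
                          if k.startswith('/%s/c' % key))
            return pd.concat([store[k] for k in keys], axis=1)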

@dragoljub
Author

Why is the fixed HDF write (non-table, just serialized) not limited by this?

@jreback
Contributor

jreback commented Feb 3, 2014

the Fixed stores essentially write this column meta-data as a CArray itself. Table could do this too, I think; how about I reopen and make this an enhancement for that.

IIRC I did this so that when you ptdump you 'see' the data. But it should be written as an array.
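Roughly what that looks like at the PyTables level (a sketch only; the node and file names are made up and this is not pandas' internal helper):

    import numpy as np
    import tables

    # write the column labels as a CArray node instead of a node
    # attribute, sidestepping the object-header size limit on attributes
    with tables.open_file('meta.h5', 'w') as h5:
        grp = h5.create_group('/', 'df')
        # HDF5 wants fixed-width bytes, hence the 'S' dtype
        cols = np.array(['col%d' % i for i in range(1500)], dtype='S8')
        h5.create_carray(grp, 'column_names', obj=cols)

    # read them back as a plain list (of bytes strings)
    with tables.open_file('meta.h5', 'r') as h5:
        cols = h5.root.df.column_names[:].tolist()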

@jreback jreback reopened this Feb 3, 2014
@dragoljub
Author

That would be fantastic, Jeff! I know very wide tables are not as efficient. Sometimes, for commonality analysis and optimizations, I have no choice but to use very wide data.

Thanks!

@jreback
Contributor

jreback commented Feb 3, 2014

don't think it's very difficult to do; the only issue is having to deal with some new version stuff (e.g. the version of the table that is written will need to be updated from the current 0.10.1)

@dragoljub
Author

I was actually going to ask about the version thing a week ago but forgot. Fixed frames I wrote today on pandas 0.13.0 show a pandas_version attribute of 0.10.1.

@jreback
Contributor

jreback commented Feb 4, 2014

yep... it hasn't been changed since then. It exists exactly for a change like this. This will be easily backward compatible, but an older version of pandas would not be able to read the new format (though at least it can 'figure out' from the version string that it won't work).

e.g. the new version would be 0.14, so 0.14 would be able to read it (and prior versions); writing to an existing store would have to respect that store's version (e.g. don't auto-upgrade). Maybe we'd have to add a 'compat' flag to write in the original format.

but that's why I did it in the first place, to enable changes.

@dragoljub
Author

Good deal thanks for the explanation. 👍

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@dragoljub
Author

Where in the code should I look to write/read Table column meta-data as a CArray itself? I'm finally in the position where I need to create long and wide HDF5 appendable tables. Would love to see this limit disappear.

I know we have to be careful about backward compatibility with older HDF5 table formats. Any advice is appreciated.

@jreback
Contributor

jreback commented Jul 2, 2014

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L2895

then in the next function you can read it in (and determine whether it's a CArray; if so, convert it to a list).

Shouldn't be difficult to change (even with backward compat), lmk.

@hhuuggoo

@jreback I'm interested in picking this up. To summarize, the approach is:

  • predefine a list of attrs that are saved as CArrays instead of regular attrs
  • on read, check whether the attr exists as a plain attr or as a CArray, and read the appropriate one
  • on write, save the data to a CArray instead of an attr (rough sketch below)
  • should we prefix the key of the CArray with something like attr_ to denote that they are attributes stored as datasets?
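A rough write-side sketch of that idea (the attr_ prefix, the pickling step, and the helper name are assumptions drawn from this discussion, not pandas internals):

    import pickle
    import numpy as np

    # list-valued fields that would move from attributes to CArrays
    CARRAY_ATTRS = ['non_index_axes', 'data_columns',
                    'values_cols', 'index_cols']

    def set_list_attr(handle, group, name, value):
        # serialize the list and store it as a small uint8 CArray child
        # node named 'attr_<name>' (the proposed convention above);
        # handle is an open tables.File, group the table's parent Group
        payload = np.frombuffer(pickle.dumps(value), dtype=np.uint8)
        handle.create_carray(group, 'attr_%s' % name, obj=payload)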

@jreback
Contributor

jreback commented May 29, 2015

@hhuuggoo ohhh gr8!
These are all in Table:

    def set_attrs(self):
        """ set our table type & indexables """
        self.attrs.table_type = str(self.table_type)
        self.attrs.index_cols = self.index_cols()
        self.attrs.values_cols = self.values_cols()
        self.attrs.non_index_axes = self.non_index_axes
        self.attrs.data_columns = self.data_columns
        self.attrs.nan_rep = self.nan_rep
        self.attrs.encoding = self.encoding
        self.attrs.levels = self.levels
        self.attrs.metadata = self.metadata
        self.set_info()

I think you need to worry about the ones that are lists ATM, e.g. non_index_axes, data_columns, values_axes, values_cols, index_cols.

I wouldn't change the names; that would make backward compat really difficult.

So:

  • change the version (to 0.17.0); we don't actually do much with this ATM, but it's for consistency's sake
  • the hardest part of this is making sure that you preserve back-compat. So if it's already a list, then leave it like that.
  • when creating new stores, you can write as a CArray. You should then be able to interpret either a list or a CArray on reads (see the sketch below).

For testing, I would make a couple of test tables in 0.16.1 that comprise different dtypes/indexes, maybe with compression, etc. All can be small. This is to ensure that things can be read from prior versions.

This will be a forward-compat break; IOW, an older version of pandas cannot read a 0.17.0-created HDFStore, but I don't think that's a big deal.
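A read-side fallback sketch to pair with the write-side one above (illustrative only; the helper name and the attr_ prefix follow the convention proposed in this thread, not pandas internals):

    import pickle

    def read_list_attr(group, name):
        # prefer the legacy plain node attribute (pre-0.17 files)
        attrs = group._v_attrs
        if name in attrs._v_attrnames:
            return getattr(attrs, name)
        # otherwise fall back to the CArray child node written by
        # set_list_attr above
        child = 'attr_%s' % name
        if child in group:
            node = group._f_get_child(child)
            return pickle.loads(node[:].tobytes())
        return None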

@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022