ENH: write Table meta-data (non_index_axes) as a CArray (rather than as meta-data) #6245

Open
dragoljub opened this issue Feb 3, 2014 · 13 comments
Labels
Enhancement IO HDF5 read_hdf, HDFStore

Comments

@dragoljub

@jreback

I get this very strange error when writing very wide HDF5 tables. In this case a random float32 array with 1500 columns can't be written as an HDF5 table. However, if I rename the columns, it seems to write fine...

Not sure what's going on.

-Gagi

In [1]: %paste
import numpy as np
import pandas as pd

print np.__version__
print pd.__version__

pd.options.display.large_repr = 'info'

df = pd.DataFrame(np.random.randn(1000,1500), dtype=np.float32)
#df.columns = df.columns.map(lambda x: 'A'+str(x))
df.to_hdf(r'Y:\table.h5', 'df', format='table', nan_rep='')

## -- End pasted text --
1.7.1
0.13.0
---------------------------------------------------------------------------
HDF5ExtError                              Traceback (most recent call last)
<ipython-input-1-ae7bcadf7834> in <module>()
      9 df = pd.DataFrame(np.random.randn(1000,1500), dtype=np.float32)
     10 #df.columns = df.columns.map(lambda x: 'A'+str(x))
---> 11 df.to_hdf(r'Y:\table.h5', 'df', format='table', nan_rep='')

D:\Python27\lib\site-packages\pandas\core\generic.pyc in to_hdf(self, path_or_buf, key, **kwargs)
    860
    861         from pandas.io import pytables
--> 862         return pytables.to_hdf(path_or_buf, key, self, **kwargs)
    863
    864     def to_msgpack(self, path_or_buf=None, **kwargs):

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
    269         with get_store(path_or_buf, mode=mode, complevel=complevel,
    270                        complib=complib) as store:
--> 271             f(store)
    272     else:
    273         f(path_or_buf)

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store)
    264         f = lambda store: store.append(key, value, **kwargs)
    265     else:
--> 266         f = lambda store: store.put(key, value, **kwargs)
    267
    268     if isinstance(path_or_buf, compat.string_types):

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in put(self, key, value, format, append, **kwargs)
    792             format = get_option("io.hdf.default_format") or 'fixed'
    793         kwargs = self._validate_format(format, kwargs)
--> 794         self._write_to_group(key, value, append=append, **kwargs)
    795
    796     def remove(self, key, where=None, start=None, stop=None):

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1244
   1245         # write the object
-> 1246         s.write(obj=value, append=append, complib=complib, **kwargs)
   1247
   1248         if s.is_table and index:

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwa
   3548
   3549             # set the table attributes
-> 3550             self.set_attrs()
   3551
   3552             # create the table

D:\Python27\lib\site-packages\pandas\io\pytables.pyc in set_attrs(self)
   2858         self.attrs.index_cols = self.index_cols()
   2859         self.attrs.values_cols = self.values_cols()
-> 2860         self.attrs.non_index_axes = self.non_index_axes
   2861         self.attrs.data_columns = self.data_columns
   2862         self.attrs.nan_rep = self.nan_rep

D:\Python27\lib\site-packages\tables\attributeset.pyc in __setattr__(self, name, value)
    450
    451         # Set the attribute.
--> 452         self._g__setattr(name, value)
    453
    454         # Log new attribute addition.

D:\Python27\lib\site-packages\tables\attributeset.pyc in _g__setattr(self, name, value)
    393             value = stvalue[()]
    394
--> 395         self._g_setattr(self._v_node, name, stvalue)
    396
    397         # New attribute or value. Introduce it into the local

D:\Python27\lib\site-packages\tables\hdf5extension.pyd in tables.hdf5extension.AttributeSet._g_setattr (tables\hdf5extension.c:6334)()

HDF5ExtError: HDF5 error back trace

  File "..\..\hdf5-1.8.11\src\H5A.c", line 254, in H5Acreate2
    unable to create attribute
  File "..\..\hdf5-1.8.11\src\H5A.c", line 503, in H5A_create
    unable to create attribute in object header
  File "..\..\hdf5-1.8.11\src\H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "..\..\hdf5-1.8.11\src\H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "..\..\hdf5-1.8.11\src\H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "..\..\hdf5-1.8.11\src\H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node:
 /df (Group) u''.

In [2]: df.columns = df.columns.map(lambda x: 'A'+str(x))

In [3]: df.to_hdf(r'Y:\tabl2e3.h5', 'df', format='table', nan_rep='')
@jreback
Contributor

jreback commented Feb 3, 2014

There is a limit on the amount of meta-data that can be stored within a node, and, for example, a string of the columns is stored there (along with lots of other things). So it's not a column limit per se, but an internal limit. As we have discussed before, it is in general not a good idea to store very wide tables. You might want to explore making these 3-D (a Panel); then these become long tables (which is quite efficient).

maybe something like: http://pandas.pydata.org/pandas-docs/dev/reshaping.html#reshaping-by-melt

might help.

e.g. 1500 columns * 4 characters * 8 bytes per character = 48 KB (plus other meta-data whose exact size I don't know)
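For illustration, a minimal sketch of reshaping wide to long before storing (the file name and the use of append here are illustrative, not from this thread):

    import numpy as np
    import pandas as pd

    # a wide frame: 1000 rows x 1500 float32 columns
    df = pd.DataFrame(np.random.randn(1000, 1500), dtype=np.float32)

    # reshape wide -> long; 'variable' holds the original column label,
    # 'value' holds the data, so the stored table has only 3 columns
    long_df = pd.melt(df.reset_index(), id_vars=['index'])

    # a long table appends efficiently and keeps the column meta-data tiny
    long_df.to_hdf('table_long.h5', 'df', format='table', append=True)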

@dragoljub
Copy link
Author

But this does not happen when you prepend 'A' to each integer column in the DataFrame above and write the same data to an HDF5 table, which is strange... Maybe the string representation of the column name is smaller than the int64s...

So we are limited in how wide HDF5 tables can be because the column names might be too long?

I'll look into the performance of storing the data as a Panel. I guess I can also break the DataFrame up into multiple nodes (not-so-wide chunks), as sketched below, but then I have to pay the penalty of concatenating those columns on read...
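A rough sketch of that chunked workaround (the key layout and chunk width are arbitrary choices, not anything pandas provides):

    import pandas as pd

    def write_wide(df, path, key='df', chunk=500):
        # store each slice of columns as its own (narrower) table node
        with pd.HDFStore(path) as store:
            for i in range(0, df.shape[1], chunk):
                store.put('%s/c%06d' % (key, i), df.iloc[:, i:i + chunk],
                          format='table')

    def read_wide(path, key='df'):
        # read the chunks back (zero-padded keys sort correctly) and
        # pay the concatenation penalty once
        with pd.HDFStore(path) as store:
            keys = sorted(k for k in store.keys()
                          if k.startswith('/%s/c' % key))
            return pd.concat([store[k] for k in keys], axis=1)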

@dragoljub
Author

Why is the fixed HDF write (non-table, just serialized) not limited by this?

@jreback
Contributor

jreback commented Feb 3, 2014

the Fixed stores essentially write this column meta-data as a CArray itself. Table could do this too, I think; how about I reopen and make this an enhancement for that.

IIRC I did this so that when you ptdump you 'see' the data. But it should be written as an array.
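Roughly what that looks like at the PyTables level (a sketch only; the node and file names are made up and this is not pandas' internal helper):

    import numpy as np
    import tables

    # write the column labels as a CArray node instead of a node
    # attribute, sidestepping the object-header size limit on attributes
    with tables.open_file('meta.h5', 'w') as h5:
        grp = h5.create_group('/', 'df')
        # HDF5 wants fixed-width bytes, hence the 'S' dtype
        cols = np.array(['col%d' % i for i in range(1500)], dtype='S8')
        h5.create_carray(grp, 'column_names', obj=cols)

    # read them back as a plain list (of bytes strings)
    with tables.open_file('meta.h5', 'r') as h5:
        cols = h5.root.df.column_names[:].tolist()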

@jreback jreback reopened this Feb 3, 2014
@dragoljub
Author

That would be fantastic, Jeff! I know very wide tables are not as efficient. Sometimes, for commonality analysis and optimizations, I have no choice but to use very wide data.

Thanks!

@jreback
Contributor

jreback commented Feb 3, 2014

don't think it's very difficult to do; the only issue is having to deal with some new version stuff (e.g. the version of the table that is written will need to be updated from the current 0.10.1)

@dragoljub
Author

I was actually going to ask about the version thing a week ago but forgot. Fixed frames I wrote today on pandas 0.13.0 show a pandas_version attribute of 0.10.1.

@jreback
Contributor

jreback commented Feb 4, 2014

yep... it hasn't been changed since then. It exists exactly for a change like this. This will be easily backward compatible, but an older version of pandas would not be able to read the new format (though at least it can 'figure out' from the version string that it won't work).

e.g. the new version would be 0.14, so 0.14 would be able to read it (and prior versions); writing to an existing store would have to respect that store's version (e.g. don't auto-upgrade). Maybe we'd have to add a 'compat' flag to write in the original format.

but that's why I did it in the first place, to enable changes.

@dragoljub
Author

Good deal thanks for the explanation. 👍

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@dragoljub
Author

Where in the code should I look to write/read Table column meta-data as a CArray itself? I'm finally in the position where I need to create long and wide HDF5 appendable tables. Would love to see this limit disappear.

I know we have to be careful about backward compatibility with older HDF5 table formats. Any advice is appreciated.

@jreback
Contributor

jreback commented Jul 2, 2014

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L2895

then in the next function you can read it in (and determine whether it's a CArray; if so, convert it to a list).

Shouldn't be difficult to change (even with backward compat), lmk.

@hhuuggoo

@jreback I'm interested in picking this up. To summarize, the approach is:

  • predefine a list of attrs that are saved as CArrays instead of regular attrs
  • on read, check whether the attr exists as a plain attr or as a CArray, and read the appropriate one
  • on write, save the data to a CArray instead of an attr (rough sketch below)
  • should we prefix the key of the CArray with something like attr_ to denote that they are attributes stored as datasets?
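A rough write-side sketch of that idea (the attr_ prefix, the pickling step, and the helper name are assumptions drawn from this discussion, not pandas internals):

    import pickle
    import numpy as np

    # list-valued fields that would move from attributes to CArrays
    CARRAY_ATTRS = ['non_index_axes', 'data_columns',
                    'values_cols', 'index_cols']

    def set_list_attr(handle, group, name, value):
        # serialize the list and store it as a small uint8 CArray child
        # node named 'attr_<name>' (the proposed convention above);
        # handle is an open tables.File, group the table's parent Group
        payload = np.frombuffer(pickle.dumps(value), dtype=np.uint8)
        handle.create_carray(group, 'attr_%s' % name, obj=payload)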

@jreback
Contributor

jreback commented May 29, 2015

@hhuuggoo ohhh gr8!
These are all in Table:

    def set_attrs(self):
        """ set our table type & indexables """
        self.attrs.table_type = str(self.table_type)
        self.attrs.index_cols = self.index_cols()
        self.attrs.values_cols = self.values_cols()
        self.attrs.non_index_axes = self.non_index_axes
        self.attrs.data_columns = self.data_columns
        self.attrs.nan_rep = self.nan_rep
        self.attrs.encoding = self.encoding
        self.attrs.levels = self.levels
        self.attrs.metadata = self.metadata
        self.set_info()

I think you need to worry about the ones that are lists ATM, e.g. non_index_axes, data_columns, values_axes, values_cols, index_cols.

I wouldn't change the names; that would make backward compat really difficult.

So:

  • change the version (to 0.17.0); we don't actually do much with this ATM, but it's for consistency's sake
  • the hardest part of this is making sure that you preserve back-compat. So if it's already a list, then leave it like that.
  • when creating new stores, you can write as a CArray. You should then be able to interpret either a list or a CArray on reads (see the sketch below).

For testing, I would make a couple of test tables in 0.16.1 that comprise different dtypes/indexes, maybe with compression, etc. All can be small. This is to ensure that things can be read from prior versions.

This will be a forward-compat break; IOW, an older version of pandas cannot read a 0.17.0-created HDFStore, but I don't think that's a big deal.
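A read-side fallback sketch to pair with the write-side one above (illustrative only; the helper name and the attr_ prefix follow the convention proposed in this thread, not pandas internals):

    import pickle

    def read_list_attr(group, name):
        # prefer the legacy plain node attribute (pre-0.17 files)
        attrs = group._v_attrs
        if name in attrs._v_attrnames:
            return getattr(attrs, name)
        # otherwise fall back to the CArray child node written by
        # set_list_attr above
        child = 'attr_%s' % name
        if child in group:
            node = group._f_get_child(child)
            return pickle.loads(node[:].tobytes())
        return None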

@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022