Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What to do about str columns in SolutionArray #896

Closed
ischoegl opened this issue Jul 8, 2020 · 6 comments · Fixed by #900
Closed

What to do about str columns in SolutionArray #896

ischoegl opened this issue Jul 8, 2020 · 6 comments · Fixed by #900

Comments

@ischoegl
Copy link
Member

ischoegl commented Jul 8, 2020

Problem description

The current implementation of SolutionArray allows for extra columns containing str (see discussion in #838), which may or may not be an intended use case: built-in attributes are almost exclusively numeric, with the notable exception of state_of_matter. Other implementations did not anticipate non-numeric data, e.g. CSV import and HDF export/import, so str support either needs to be deprecated/disabled or consistently implemented.

One interpretation is that str input was never intended, but also not explicitly checked for. I.e. it just happens to be supported by numpy (similar to sequences not being checked, see #895). In this case, the most appropriate resolution may be to catch and deprecate str columns, while warning that it is not fully supported for CSV/HDF.

Steps to reproduce

In [1]: import cantera as ct
    ...: gas = ct.Solution('h2o2.yaml')
    ...: arr = ct.SolutionArray(gas, 3, extra={'spam': 'eggs'})
    ...: 

In [2]: arr._extra
Out[2]: OrderedDict([('spam', ['eggs', 'eggs', 'eggs'])])

In [3]: arr.write_csv('test.csv')

In [3]: !cat test.csv
spam,T,density,Y_H2,Y_H,Y_O,Y_O2,Y_OH,Y_H2O,Y_HO2,Y_H2O2,Y_AR
eggs,300.0,0.08189392763801234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
eggs,300.0,0.08189392763801234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
eggs,300.0,0.08189392763801234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

In [4]: arr2 = ct.SolutionArray(gas, extra={'spam'})

In [4]: arr2.read_csv('test.csv')

In [5]: arr2._extra
Out[5]: {'spam': array([ nan,  nan,  nan])}

In [6]: arr.write_hdf('test.h5')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-d708cd336db5> in <module>()
----> 1 arr.write_hdf('test.h5')

/usr/local/lib/python3.6/dist-packages/cantera/composite.py in write_hdf(self, filename, cols, group, subgroup, attrs, mode, append, compression, compression_opts, *args, **kwargs)
   1117                 dgroup.attrs[key] = val
   1118             for header, col in data.items():
-> 1119                 dgroup.create_dataset(header, data=col, **hdf_kwargs)
   1120 
   1121         return group

/home/docker/.local/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    134 
    135         with phil:
--> 136             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    137             dset = dataset.Dataset(dsid)
    138             if name is not None:

/home/docker/.local/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl)
    116         else:
    117             dtype = numpy.dtype(dtype)
--> 118         tid = h5t.py_create(dtype, logical=1)
    119 
    120     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: No conversion path for dtype: dtype('<U4')

In [7]: arr.spam
Out[7]: 
array(['eggs', 'eggs', 'eggs'],
      dtype='<U4')

Behavior

The pre-existing write_csv supports str, the newly introduced read_csv fails to import it, and h5py has issues.

I am filing this as a bug report as this presumably needs to be fixed prior to the release of 2.5.

System information

  • Cantera version: 2.5.a4
  • OS: Ubuntu 18.04
  • Python 3.6

Additional context

#838, #895

@bryanwweber
Copy link
Member

Thanks again for investigating @ischoegl. I think it may be appropriate to handle CSV and HDF differently here. In particular, my understanding is that CSV has a much easier time representing heterogeneous data types, so IMO importing strings from CSV should be supported. It's actually a pretty small change to read_csv(), although it does require NumPy >= 1.14:

    def read_csv(self, filename):
        """
        Read a CSV file named *filename* and restore data to the `SolutionArray`
        using `restore_data`. This method allows for recreation of data
        previously exported by `write_csv`.
        """
        data = np.genfromtxt(filename, delimiter=',', dtype=None, names=True, encoding=None)
        data_dict = OrderedDict({l: data[l] for l in data.dtype.names})
        self.restore_data(data_dict)

Specifying dtype=None gives each column its own dtype inferred from the data and names=True reads the column names from the header line. The encoding argument was added in 1.14, without that, a DeprecationWarning is raised and bytestrings are returned instead of regular strings (at least, in 1.19.0).

Since HDF may have more trouble with the heterogeneous data, it might make sense to throw an exception if strings are written to/read from HDF, but I didn't look into how this could be supported.

The lack of a phase_of_matter property on the SolutionArray was due to an oversight on my part when I implemented it. That said, right now, a naive implementation (adding phase_of_matter to the _scalar list) does not work:

In [1]: import cantera as ct
In [2]: water = ct.Water()
In [3]: states = ct.SolutionArray(water, 3)
In [4]: states.phase_of_matter
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-787428bd3c98> in <module>
----> 1 states.phase_of_matter

~/GitHub/cantera/build/python/cantera/composite.py in getter(self)
   1326             for index in self._indices:
   1327                 self._phase.state = self._states[index]
-> 1328                 v[index] = getattr(self._phase, name)
   1329             return v
   1330         return property(getter, doc=getattr(doc_source, name).__doc__)

ValueError: could not convert string to float: 'liquid'

@ischoegl
Copy link
Member Author

ischoegl commented Jul 9, 2020

@bryanwweber ... thanks for the feedback. And 👍 for the CSV import suggestion! Regarding, state_of_matter I wasn’t aware of it either, but the fix is likely not too difficult.

There may be a way to even fix HDF: a quick search resulted in storing-a-list-of-strings-to-a-hdf5-dataset-from-python. So allowing for str may not be too difficult after all.

@ischoegl
Copy link
Member Author

ischoegl commented Jul 9, 2020

As an aside: I’ve seen that the minimum requirements for numpy create issues at multiple levels (e.g. matplotlib in examples). Are there any plans to bump requirements?

@bryanwweber
Copy link
Member

@ischoegl It depends on whether we want to continue supporting NumPy from the Ubuntu 16.04 repositories, which is up for discussion (which should probably be at #838 (comment))

@ischoegl
Copy link
Member Author

Alright. Fix is proposed in #900 for both CSV and HDF. I also added phase_of_matter support

@bryanwweber
Copy link
Member

Totally unrelated, but congrats on filing issue #900 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants