What to do about `str` columns in `SolutionArray` #896

ischoegl · 2020-07-08T21:56:50Z

Problem description

The current implementation of SolutionArray allows for extra columns containing str (see discussion in #838), which may or may not be an intended use case: built-in attributes are almost exclusively numeric, with the notable exception of state_of_matter. Other implementations did not anticipate non-numeric data, e.g. CSV import and HDF export/import, so str support either needs to be deprecated/disabled or consistently implemented.

One interpretation is that str input was never intended, but also not explicitly checked for. I.e. it just happens to be supported by numpy (similar to sequences not being checked, see #895). In this case, the most appropriate resolution may be to catch and deprecate str columns, while warning that it is not fully supported for CSV/HDF.

Steps to reproduce

In [1]: import cantera as ct
    ...: gas = ct.Solution('h2o2.yaml')
    ...: arr = ct.SolutionArray(gas, 3, extra={'spam': 'eggs'})
    ...: 

In [2]: arr._extra
Out[2]: OrderedDict([('spam', ['eggs', 'eggs', 'eggs'])])

In [3]: arr.write_csv('test.csv')

In [3]: !cat test.csv
spam,T,density,Y_H2,Y_H,Y_O,Y_O2,Y_OH,Y_H2O,Y_HO2,Y_H2O2,Y_AR
eggs,300.0,0.08189392763801234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
eggs,300.0,0.08189392763801234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
eggs,300.0,0.08189392763801234,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

In [4]: arr2 = ct.SolutionArray(gas, extra={'spam'})

In [4]: arr2.read_csv('test.csv')

In [5]: arr2._extra
Out[5]: {'spam': array([ nan,  nan,  nan])}

In [6]: arr.write_hdf('test.h5')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-d708cd336db5> in <module>()
----> 1 arr.write_hdf('test.h5')

/usr/local/lib/python3.6/dist-packages/cantera/composite.py in write_hdf(self, filename, cols, group, subgroup, attrs, mode, append, compression, compression_opts, *args, **kwargs)
   1117                 dgroup.attrs[key] = val
   1118             for header, col in data.items():
-> 1119                 dgroup.create_dataset(header, data=col, **hdf_kwargs)
   1120 
   1121         return group

/home/docker/.local/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    134 
    135         with phil:
--> 136             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    137             dset = dataset.Dataset(dsid)
    138             if name is not None:

/home/docker/.local/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl)
    116         else:
    117             dtype = numpy.dtype(dtype)
--> 118         tid = h5t.py_create(dtype, logical=1)
    119 
    120     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: No conversion path for dtype: dtype('<U4')

In [7]: arr.spam
Out[7]: 
array(['eggs', 'eggs', 'eggs'],
      dtype='<U4')

Behavior

The pre-existing write_csv supports str, the newly introduced read_csv fails to import it, and h5py has issues.

I am filing this as a bug report as this presumably needs to be fixed prior to the release of 2.5.

System information

Cantera version: 2.5.a4
OS: Ubuntu 18.04
Python 3.6

Additional context

#838, #895

The text was updated successfully, but these errors were encountered:

bryanwweber · 2020-07-09T17:52:23Z

Thanks again for investigating @ischoegl. I think it may be appropriate to handle CSV and HDF differently here. In particular, my understanding is that CSV has a much easier time representing heterogeneous data types, so IMO importing strings from CSV should be supported. It's actually a pretty small change to read_csv(), although it does require NumPy >= 1.14:

    def read_csv(self, filename):
        """
        Read a CSV file named *filename* and restore data to the `SolutionArray`
        using `restore_data`. This method allows for recreation of data
        previously exported by `write_csv`.
        """
        data = np.genfromtxt(filename, delimiter=',', dtype=None, names=True, encoding=None)
        data_dict = OrderedDict({l: data[l] for l in data.dtype.names})
        self.restore_data(data_dict)

Specifying dtype=None gives each column its own dtype inferred from the data and names=True reads the column names from the header line. The encoding argument was added in 1.14, without that, a DeprecationWarning is raised and bytestrings are returned instead of regular strings (at least, in 1.19.0).

Since HDF may have more trouble with the heterogeneous data, it might make sense to throw an exception if strings are written to/read from HDF, but I didn't look into how this could be supported.

The lack of a phase_of_matter property on the SolutionArray was due to an oversight on my part when I implemented it. That said, right now, a naive implementation (adding phase_of_matter to the _scalar list) does not work:

In [1]: import cantera as ct
In [2]: water = ct.Water()
In [3]: states = ct.SolutionArray(water, 3)
In [4]: states.phase_of_matter
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-787428bd3c98> in <module>
----> 1 states.phase_of_matter

~/GitHub/cantera/build/python/cantera/composite.py in getter(self)
   1326             for index in self._indices:
   1327                 self._phase.state = self._states[index]
-> 1328                 v[index] = getattr(self._phase, name)
   1329             return v
   1330         return property(getter, doc=getattr(doc_source, name).__doc__)

ValueError: could not convert string to float: 'liquid'

ischoegl · 2020-07-09T17:59:35Z

@bryanwweber ... thanks for the feedback. And 👍 for the CSV import suggestion! Regarding, state_of_matter I wasn’t aware of it either, but the fix is likely not too difficult.

There may be a way to even fix HDF: a quick search resulted in storing-a-list-of-strings-to-a-hdf5-dataset-from-python. So allowing for str may not be too difficult after all.

ischoegl · 2020-07-09T18:04:46Z

As an aside: I’ve seen that the minimum requirements for numpy create issues at multiple levels (e.g. matplotlib in examples). Are there any plans to bump requirements?

bryanwweber · 2020-07-09T18:19:52Z

@ischoegl It depends on whether we want to continue supporting NumPy from the Ubuntu 16.04 repositories, which is up for discussion (which should probably be at #838 (comment))

ischoegl · 2020-07-10T22:58:58Z

Alright. Fix is proposed in #900 for both CSV and HDF. I also added phase_of_matter support

bryanwweber · 2020-07-11T00:07:36Z

Totally unrelated, but congrats on filing issue #900 😃

ischoegl mentioned this issue Jul 8, 2020

Fixes Indexed assignment of extra columns in SolutionArray #838

Merged

4 tasks

bryanwweber mentioned this issue Jul 9, 2020

Sequences can be appended as extra items on a SolutionArray #895

Closed

ischoegl mentioned this issue Jul 10, 2020

Support file I/O for string columns in SolutionArray #900

Merged

4 tasks

bryanwweber closed this as completed in #900 Aug 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What to do about `str` columns in `SolutionArray` #896

What to do about `str` columns in `SolutionArray` #896

ischoegl commented Jul 8, 2020 •

edited

Loading

bryanwweber commented Jul 9, 2020

ischoegl commented Jul 9, 2020 •

edited

Loading

ischoegl commented Jul 9, 2020

bryanwweber commented Jul 9, 2020

ischoegl commented Jul 10, 2020

bryanwweber commented Jul 11, 2020

What to do about str columns in SolutionArray #896

What to do about str columns in SolutionArray #896

Comments

ischoegl commented Jul 8, 2020 • edited Loading

bryanwweber commented Jul 9, 2020

ischoegl commented Jul 9, 2020 • edited Loading

ischoegl commented Jul 9, 2020

bryanwweber commented Jul 9, 2020

ischoegl commented Jul 10, 2020

bryanwweber commented Jul 11, 2020

What to do about `str` columns in `SolutionArray` #896

What to do about `str` columns in `SolutionArray` #896

ischoegl commented Jul 8, 2020 •

edited

Loading

ischoegl commented Jul 9, 2020 •

edited

Loading