-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Column API and DataType Containers #180
Conversation
…st of hangar backend module
Combined reader and writer arrayset column methods in order to allow schema type (and specification), along with argument validators, to be dynamically mixed in. This is to support the new "columns" API where different column data types (previously 'arraysets' and 'metadata') are combined into a unified API, each operating on backends independently. Renamed columns/arrayset_nested.py -> columns/nested.py and columns/arrayset_flat.py to columns/flat.py to mark this change in how we are thinking about the accessor methods. Tests are observed to pass.
…of moving in preparation for pulling out column types
…set up special (highly isolated) mock of hangar repo dirs / dbs and initialize the column class for writing to lmdb30 backend (and reading back out)
28bb2e3
to
9838855
Compare
5b962e3
to
051fea3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great so far. I have added a few minor comments. I'll spend more time soon
@hhsecond any idea why this test is failing? I'm not sure what tensorflow wants here... https://travis-ci.com/tensorwerk/hangar-py/jobs/289376731#L423-L506 |
Codecov Report
@@ Coverage Diff @@
## master #180 +/- ##
=======================================
Coverage 95.25% 95.25%
=======================================
Files 97 97
Lines 16175 16175
Branches 1547 1547
=======================================
Hits 15407 15407
Misses 525 525
Partials 243 243 |
|
src/hangar/columns/column.py
Outdated
return column | ||
|
||
@writer_checkout_only | ||
def create_str_column(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rlizzo @lantiga
IMO; these names are verbose and maybe we could try something else. Few ideas on top of my mind (which needs polishing) are
co.columns.create(type='ndarray' / 'str', layout='flat' / 'nested')
where the argument names are arbitrary and could be anything else- Constructors being nouns like
np.array
->co.column.flat
,co.column.nested
,col.column.str
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that verbosity is a feature here; customers need that verbosity when they're trying to learn the API.
I don't think that co.columns.create
is particularly Pythonic, since the intermediate columns
object is tied to and operating on the co
, but presumably is also a dict-like thing. Calling create on a dict to manipulate the dict's parent object? That's weird. Also, flat
has no meaning to me, though perhaps that will be fixed once the v0.5.0 docs are live.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I said verbose, I meant, we are repeating information here. The API in this PR is co.columns.create_xxx_column
and by this, we are repeating the word column
which would be obvious even if we use co.columns.create()
. I am not particularly sentimental about the name create
. It could be anything else, like add
or update
(like dict.update
, thinking aloud here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I misunderstood, sorry.
I think the function should go on the checkout object; putting it on the columns pseudo-container seems odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey guys as a follow up for completeness here...
-
The method has been renamed to
add_ndarray_column()
andadd_string_column()
. This is to reflect the actual workings of the method. Me and @lantiga considereddefine_foo_column()
as well (as suggested by @elistevens) but decided against it because while acting as a definition, the operation also creates a column ready for use. We wanted to make it clear that nothing else needs to be done in order to use the column that was "defined/initialized" -
The method does now live on the
checkout
class (instead ofcheckout.columns
). This is just better API design. ex:
co = repo.checkout(write=True)
col = co.add_ndarray_column('foo', shape=(10, 10), dtype=np.uint8)
col[1] = np.arange(10, dtype=np.uint8)
co.commit('added column')
co.close()
- I'm going to be working on improving docstrings and tutorials shortely, but with the new "generalized" column forms, the options essentially break down into the following:
-
"layout"
: the method by which data in columns are named/indexed/accessed by. This can currently take on the values of"flat"
(for single-level story type mapping) OR"nested"
(for a sample name which contains subsample name/data pairs) -
"type"
: indicates the format of data stored within the column container. Valid values are"ndarray"
or"string"
. These are not accepted as keyword arguments, but are determined by the method name which is actually called to create the column (add_ndarray_column()
for"ndarray"
andadd_str_column()
for"string"
) -
"variable_shape"
: bool indicating if data length / size can be variable sized (will have different defaults and avalability depending on the column"type"). For example,
ndarraycolumns accept both
Trueand
Falsevalues, while
stringcolumn types only accept
True`. -
"backend"
and"backend_options"
: specify specific backend and any filter / compression options to apply to the column. These are advanced arguments which are type & value checked / enforced based on the definition of column values described above. I won't get into the full details here...
Making this all clear is a high priority for me, and i'm going to need input on the explanations we eventually get around to writing. Let me know if you have any questions!
8ba7619
to
ca3bc16
Compare
Motivation and Context
Why is this change required? What problem does it solve?:
This PR implements the new
columns
API, replacing the oldarraysets
andmetadata
terminology. The concept of a column is to act as a highly modular container for potentially arbitrary data types. As a simple proof of concept, astring
type column has been introduced - replacing the functionality ofmetadata
in a much more useful and expandable way.Description
Describe your changes in detail:
More details will be posted soon.
Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:Is this PR ready for review, or a work in progress?
How Has This Been Tested?
Put an
x
in the boxes that apply:Checklist: