Keeping names in *cat #74

ghost · 2018-09-07T13:42:41Z

Currently, *cat of NamedArrays drops the column/row names. It would be nice if the behavior were the same as in R, where names are merged in the dimension perpendicular to the bound and are kept in the other iff they overlap, and dropped otherwise.

This raises the question of what to do if two combined NamedArrays contain the same name for a row/column: warn and drop the names, or throw an error? R ignores this problem and permits named vectors/matrices to have multiple rows/cols with the same name (selecting that name returns only the first instance among the matches).

I have already implemented an alternative hcat function that keeps names, but it's important to define the behavior in case of conflicting names.
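
To make the question concrete, here is a minimal sketch of one possible rule for an hcat-like operation. It assumes that names along the concatenated dimension are appended and that names along the shared dimension are kept only when both inputs agree; in case of conflict it takes the "silently drop" option. The helper `hcat_names` is hypothetical and is not part of NamedArrays or of the alternative hcat mentioned above.

```julia
# Hypothetical helper, not part of NamedArrays: decide the names of the result
# of an hcat-like operation from the names of the two inputs.
# Returns (rownames, colnames); `nothing` means "fall back to default names".
function hcat_names(rows_a, cols_a, rows_b, cols_b)
    # Shared dimension (rows): keep the names only if both inputs agree.
    rownames = rows_a == rows_b ? rows_a : nothing
    # Concatenated dimension (columns): append the names, but drop them
    # all if the combined list contains a duplicate.
    colnames = vcat(cols_a, cols_b)
    if !allunique(colnames)
        colnames = nothing
    end
    return rownames, colnames
end

# Matching row names, distinct column names: everything is kept.
r, c = hcat_names(["x", "y"], ["a", "b"], ["x", "y"], ["c"])

# Different row names and a repeated column name: both are dropped.
r2, c2 = hcat_names(["x", "y"], ["a", "b"], ["x", "z"], ["b"])
```

Replacing the "drop" branch with a warning or an error changes only the duplicate check, which is why the choice between them is mostly a policy decision.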

davidavdav · 2018-09-07T14:03:39Z

In NamedArrays the lookup name->index happens through a dictionary, so names must be unique.
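
As an illustration of why a dictionary-based lookup needs unique names, consider the sketch below. It uses a plain `Dict` and a made-up name vector; it is not the NamedArrays internals, only the general pattern.

```julia
# Illustration only: a name -> index lookup built with a plain Dict.
rownames = ["a", "b", "a"]   # "a" is duplicated on purpose

lookup = Dict(name => i for (i, name) in enumerate(rownames))

n_entries = length(lookup)   # one entry per distinct name
idx_a = lookup["a"]          # the later "a" overwrote the earlier one
```

With a duplicated name the dictionary ends up with fewer entries than there are rows, so one of the rows can no longer be reached by name.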

Indeed, names are always reset in the concatenating direction. This would not always be necessary. Come to think of it, the resetting of names probably has an impact on type stability (I would not know right now whether that would be positive or negative).

A major hurdle in generating better names along the concatenating direction is coming up with a unique naming scheme that would work for any key type. And then there is type stability: the resulting keytype should be computable by the compiler. Writing type-stable code is beyond my capabilities.

But perhaps, for keys of type String and Symbol, we can make specialized versions that do try to combine the keys in a sensible way.

ghost · 2018-09-07T14:13:45Z

I could try to make some test-implementation focusing on the type-stability problem, but my concern was mainly the warning/error question. Following the package "standard" until now, it seems to me that the most consistent solution would be to silently drop names.

nalimilan · 2018-09-07T15:08:11Z

DataFrames has a makeunique keyword argument which has to be set to true to automatically add a suffix to duplicate names. It would make sense to use the same approach here.
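
A minimal sketch of how such a keyword could behave for String names follows. The function `combine_names` and its `_k` suffix scheme are assumptions made for illustration; this is not the DataFrames implementation and not an existing NamedArrays function.

```julia
# Hypothetical helper modelled on the `makeunique` idea; illustration only.
function combine_names(a::Vector{String}, b::Vector{String}; makeunique::Bool=false)
    combined = vcat(a, b)
    # Fast path: nothing to do when there are no duplicates.
    allunique(combined) && return combined
    # Duplicates are an error unless the caller opts in.
    makeunique || throw(ArgumentError("duplicate names; pass makeunique=true to add suffixes"))
    seen = Set{String}()
    result = String[]
    for name in combined
        candidate = name
        k = 0
        # Append _1, _2, ... until the candidate has not been used yet.
        while candidate in seen
            k += 1
            candidate = string(name, "_", k)
        end
        push!(seen, candidate)
        push!(result, candidate)
    end
    return result
end

unique_names = combine_names(["a", "b"], ["b", "c"]; makeunique=true)
# unique_names == ["a", "b", "b_1", "c"]
```

Because the suffixing loop only runs when `makeunique=true` and a duplicate exists, the default path costs a single `allunique` check.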

Regarding the type of the combined names, I'd just call promote_type, which is standard in Julia.

ghost · 2018-09-07T15:26:28Z

@nalimilan I was thinking more about typejoin, to cause less conversions.

nalimilan · 2018-09-07T15:34:57Z

promote_type is really the standard way of combining elements: that's what vcat and merge do (so you get this behavior for free if you call them). That will only make a difference for types which have a reasonable conversion path.
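
The difference between the two can be checked directly in Base Julia. The snippet below is only an illustration with a few built-in types; the key types of a real NamedArray may of course differ.

```julia
# Types with a conversion path: promote_type picks a concrete common type,
# typejoin picks the closest common supertype without converting anything.
p_num = promote_type(Int, Float64)    # Float64
j_num = typejoin(Int, Float64)        # Real

# Types without a conversion path: both fall back to Any.
p_key = promote_type(String, Symbol)  # Any
j_key = typejoin(String, Symbol)      # Any

# vcat follows promote_type: the integers are converted to Float64.
v = vcat([1, 2], [3.0])
```

So the two approaches only disagree when a promotion rule exists, as for `Int` and `Float64`; for `String` and `Symbol` keys the result is `Any` either way.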

davidavdav · 2018-09-07T16:20:56Z

DataFrame keys are always Symbol, right? Then it is probably much easier than the general case.

I didn't know of merge(), this would be extremely useful indeed. Wonder if it works at the type stability level.

ghost · 2018-09-07T16:24:14Z

I think that the makeunique keyword would be quite an expensive operation in the general case; something like an error(/warn)ifdropnames would be much easier.

nalimilan · 2018-09-07T16:41:24Z

> I think that the makeunique keyword would be a quite expensive operation for the general case, something like a error(/warn)ifdropnames would be much easier.

That wouldn't be applied by default, so it would have zero cost.

ghost · 2018-09-07T16:43:19Z

Yes, I meant "when invoked", and mainly in terms of implementation time, but even if feasible

yurivish · 2019-01-04T23:49:49Z

I just ran into what I think is this issue while constructing an aggregated table like this:

ft is a frequency table (named array) whose row labels are names, and column labels are breeds
top_names and other_names together span the set of row labels, and likewise for columns.

[ ft[top_names, top_breeds] sum(ft[top_names, other_breeds], dims=2)
  sum(ft[other_names, top_breeds], dims=1) sum(ft[other_names, other_breeds]) ]

This concatenation works beautifully but sadly the labels are lost. Now that I've typed this out I see that one problem here is that of making up a name for the final row and column, which got reduced out in all three cases. It would still be very nice if the known labels were preserved.

Thanks for a fantastic package at any rate (and FreqTables) — having names around has been an incredibly useful thing!

Edit: I figured out that I can do this:

NamedArray(
    [ ft[top_names, top_breeds] sum(ft[top_names, other_breeds], dims=2)
      sum(ft[other_names, top_breeds], dims=1) sum(ft[other_names, other_breeds]) ],
    [
        [top_names; "other"],
        [top_breeds; "other"]
    ]
)

davidavdav · 2019-01-05T10:46:58Z

Hello, it took me a while before I realized you were using the space/newline array constructor operator, which I always find difficult to parse and matlabish somehow...

Is it that you are trying to make ft[top_names, top_breeds] one row and one column bigger with marginal stats? Offhand, the sums should have something like "sum(...)" in their name. Would it not be better, then, to use for the final element

sum(ft[other_names, other_breeds], dims=(1,2))

which keeps the dimensions and the names, that match the other marginals. The space/newline concatenation might work automatically in that case.
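
For reference, the difference in shape can be seen with a plain matrix. This sketch uses an ordinary `Matrix{Int}` with made-up values, so it shows only the dimensions; whether a NamedArray would also keep usable names here is the open point of this issue.

```julia
# Plain-array illustration; A stands in for the frequency table.
A = [1 2; 3 4]

total_scalar = sum(A)               # a scalar, all dimensions reduced away
total_block = sum(A, dims=(1, 2))   # a 1×1 matrix, dimensions kept

# Table extended with a marginal column, a marginal row and the grand total.
B = [A              sum(A, dims=2)
     sum(A, dims=1) sum(A, dims=(1, 2))]
```

With `dims=(1, 2)` every block in the concatenation is a two-dimensional array, so all four pieces have row and column dimensions that could in principle carry names.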

Well, I just see that the hcat/vcat rewrite the concatenated dimension names, and this is exactly what this Issue is about. I would like to be able to fix this in a sensible way.

arnaudmgh · 2019-03-22T18:46:41Z

Hello, I did run into a similar problem when writing the result of freqtable to a CSV file: I converted to DataFrame and lost all the names. I think it will be quite a common thing that users would like to share tables as CSV with collaborators, and a solution to that would be quite useful to many I believe.

The solution I came up with was to overload CSV.write as follows:

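using CSV, DataFrames, NamedArrays

# Write a 2-d NamedArray to CSV, keeping the row names as a first column.
function CSV.write(file::Union{String, IO}, named::NamedArray)
  nsc = names(named)             # nsc[1]: row names, nsc[2]: column names
  named2 = hcat(nsc[1], named)   # prepend the row names as an extra column
  named2 = DataFrame(named2)
  # names! was the DataFrames API when this was posted
  names!(named2, Symbol.(vcat("row_names", string.(nsc[2]))))
  CSV.write(file, named2)
end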

I'd be willing to help, submit a PR or else, depending on what you would suggest, @davidavdav:

- this solution adds a dependency on DataFrames.jl, and using Tables.jl may be more appropriate.
- freqtables output String or Union{String, Missing} as names; and also, names are unique by definition. Therefore I did not need to worry about complex types for names, or non-unique names (the more general cases you mentioned above).
- That may be a special solution for FreqTables.jl, and therefore I should suggest this solution in a FreqTables issue instead?

Let me know what you think...

nalimilan · 2019-03-24T13:49:36Z

@arnaudmgh This sounds like a completely different problem, please file a separate issue.
