Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keeping names in *cat #74

Open
ghost opened this issue Sep 7, 2018 · 13 comments
Open

Keeping names in *cat #74

ghost opened this issue Sep 7, 2018 · 13 comments

Comments

@ghost
Copy link

ghost commented Sep 7, 2018

Currently *cat of NamedArrays drop the column/row names. It would be nice if the behavior was the same as R, where names are merged in the dimension perpendicular to the bound and are kept in the other iff they overlap, dropped otherwise.

This would raise the question of what doing if two combined NamedArrays contain the same name for a row/column: warning and dropping names or error? R ignores this problem and permit that named vectors / matrices have multiple rows/cols with the same name (when you select that name it will return only the first istance among the cases).

I have already implemented an alternative hcat function that would keep names, but it's important to define the behavior in caso of conflicting names.

@davidavdav
Copy link
Owner

davidavdav commented Sep 7, 2018

In NamedArrays the lookup name->index happens through a dictionary, so names must be unique.

Indeed, names are always reset in the concatenating direction. This would not always be necessary, indeed. Come to think of it, the resetting of names probably has an impact on type stability (would not know now whether that would be positive or negative).

Major hurdle in generating better names along the concatenating direction is coming up with a unique naming scheme, that would work for any key type. And then there is type stability---the resulting keytype should be computable by the compiler. Writing type-stable code is beyond my capabilities.

But perhaps, for keys of type String and Symbol, we can make specialized versions that do try to combine the keys in a sensible way.

@ghost
Copy link
Author

ghost commented Sep 7, 2018

I could try to make some test-implementation focusing on the type-stability problem, but my concern was mainly the warning/error question. Following the package "standard" until now, it seems to me that the most consistent solution would be to silently drop names.

@nalimilan
Copy link
Contributor

DataFrames has a makeunique keyword argument which has to be set to true to automatically add a suffix to duplicate names. It would make sense to use the same approach here.

Regarding the type of the combined names, I'd just call promote_type, which is standard in Julia.

@ghost
Copy link
Author

ghost commented Sep 7, 2018

@nalimilan I was thinking more about typejoin, to cause less conversions.

@nalimilan
Copy link
Contributor

promote_type is really the standard way of combining elements: that's what vcat and merge do (so you get this behavior for free if you call them). That will only make a difference for types which have a reasonable conversion path.

@davidavdav
Copy link
Owner

DataFrame keys are always Symbol, right? Then it is probably much easier than the general case.

I didn't know of merge(), this would be extremely useful indeed. Wonder if it works at the type stability level.

@ghost
Copy link
Author

ghost commented Sep 7, 2018

I think that the makeunique keyword would be a quite expensive operation for the general case, something like a error(/warn)ifdropnames would be much easier.

@nalimilan
Copy link
Contributor

I think that the makeunique keyword would be a quite expensive operation for the general case, something like a error(/warn)ifdropnames would be much easier.

That wouldn't be applied by default, so it would have zero cost.

@ghost
Copy link
Author

ghost commented Sep 7, 2018

Yes, I meant "when invoked", and mainly in term of implementation time, but even if feasible

@yurivish
Copy link

yurivish commented Jan 4, 2019

I jus ran into what I think is this issue while constructing an aggregated table like this:

  • ft is a frequency table (named array) whose row labels are names, and column labels are breeds
  • top_names and other_names together span the set of row labels, and likewise for columns.
[ ft[top_names, top_breeds] sum(ft[top_names, other_breeds], dims=2)
  sum(ft[other_names, top_breeds], dims=1) sum(ft[other_names, other_breeds]) ]

This concatenation works beautifully but sadly the labels are lost. Now that I've typed this out I see that one problem here is that of making up a name for the final row and column, which got reduced out in all three cases. It would still be very nice if the known labels were preserved.

Thanks for a fantastic package at any rate (and FreqTables) — having names around has been an incredibly useful thing!

Edit: I figured out that I can do this:

NamedArray(
    [ ft[top_names, top_breeds] sum(ft[top_names, other_breeds], dims=2)
      sum(ft[other_names, top_breeds], dims=1) sum(ft[other_names, other_breeds]) ],
    [
        [top_names; "other"],
        [top_breeds; "other"]
    ]
)

@davidavdav
Copy link
Owner

Hello, it took me a while before I realized you were using the space/newline array constructor operator, which I always find difficult to parse and matlabish somehow...

Is it like you are trying to make ft[top_names, top_breeds] a row and a column bigger with marginal stats? By heart, the sums should have something like "sum(...)" in their name. Would it not be better then to use for the final element

sum(ft[other_names, other_breeds], dims=(1,2))

which keeps the dimensions and the names, that match the other marginals. The space/newline concatenation might work automatically in that case.

Well, I just see that the hcat/vcat rewrite the concatenated dimension names, and this is exactly what this Issue is about. I would like to be able to fix this in a sensible way.

@arnaudmgh
Copy link

Hello, I did run into a similar problem when writing the result of freqtable to a CSV file: I converted to DataFrame and lost all the names. I think it will be quite a common thing that users would like to share tables as CSV with collaborators, and a solution to that would be quite useful to many I believe.

The solution I came up with was to overwrite CSV.write follows:

function CSV.write(file::Union{String, IO}, named::NamedArray)
  nsc = names(named)
  named2 = hcat(nsc[1], named)
  named2 = DataFrame(named2)
  names!(named2, Symbol.(vcat("row_names", string.(nsc[2]))))
  CSV.write(file, named2)
end

I'd be willing to help, submit a PR or else, depending on what you would suggest, @davidavdav:

  • this solution adds a dependence on DataFrames.jl, and using Tables.jl may-be more appropriate.
  • freqtables output String or Union{String, Missing} as names; and also, names are unique by definition. Therefore I did not need to worry about complex types for names, or non unique names (the more general cases you mentioned above)
  • That may be a special solution for FreqTables.jl, and therefore I should suggest this solution in a FreqTables issue instead?

Let me know what you think...

@nalimilan
Copy link
Contributor

@arnaudmgh This sounds like a completely different problem, please file a separate issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants