Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not store names in two different objects #2

Closed
nalimilan opened this issue Nov 18, 2013 · 6 comments
Closed

Do not store names in two different objects #2

nalimilan opened this issue Nov 18, 2013 · 6 comments

Comments

@nalimilan
Copy link
Contributor

Currently names are duplicated in the names and dicts members. This is suboptimal in the long-term. It is probably possible to only use the dictionaries and to rebuild the ordered vector of names from it when needed. Since that vector is not typically needed in performance-critical operations, that's not a problem (indexing is fast thanks to dictionaries, and that's what matters). What do you think?

@davidavdav
Copy link
Owner

Hi,

On Mon, Nov 18, 2013 at 1:47 PM, Milan Bouchet-Valat <
notifications@github.com> wrote:

Currently names are duplicated in the names and dicts members. This is
suboptimal in the long-term. It is probably possible to only use the
dictionaries and to rebuild the ordered vector of names from it when
needed. Since that vector is not typically needed in performance-critical
operations, that's not a problem (indexing is fast thanks to dictionaries,
and that's what matters). What do you think?

The names in the dictionary are unordered, but you can find them back by
sorting by value, like you say.

We could have a function

names(a::NamedArray, d::Int) =
collect(keys(n.dicts[d]))[sortperm(collect(values(n.dicts[d])))]

that always resolves the current dict.

In slices etc. I now first recompute the names, and then redefine the dict.
This would then be:

  • first recompute names form dict
  • slice the names
  • recompute dict from slice

a bit more overhead. Perhaps it can be done more cleverly.

---david


Reply to this email directly or view it on GitHubhttps://github.com//issues/2
.

David van Leeuwen

@nalimilan
Copy link
Contributor Author

In theory when extracting a slice you only need to create a subset of the dictionary, which could be even more efficient than creating it from a vector (since you do not need to compute hashes again). So what we need is a function to extract a subset of a dict. Probably worth asking the main devs.

There's also OrderedDict (see JuliaLang/julia#2548), which could be more efficient to select some elements by index (but slower when inserting).

@davidavdav
Copy link
Owner

Hello,

On Mon, Nov 18, 2013 at 9:48 PM, Milan Bouchet-Valat <
notifications@github.com> wrote:

In theory when extracting a slice you only need to create a subset of the
dictionary, which could be even more efficient than creating it from a
vector (since you do not need to compute hashes again). So what we need is
a function to extract a subset of a dict. Probably worth asking the main
devs.

This is an idea indeed. I don't know how costly it is to create a new
dict. There is del(dict,key), but I don't know if we want to iterate over
all keys and delete them one by one. Could be done I suppose.

There's also OrderedDict (see JuliaLang/julia#2548JuliaLang/julia#2548),
which could be more efficient to select some elements by index (but slower
when inserting).


Reply to this email directly or view it on GitHubhttps://github.com//issues/2#issuecomment-28736410
.

I worked on more efficient getindex(), which now is pretty fast for Int /
String indices up to 3. I tried to use the type-specific method
dispatching to speed things up. The real killer seems to be a Union(Int,
String).

I also have a small test.jl now. Needs to be expanded.

---david

David van Leeuwen

@nalimilan
Copy link
Contributor Author

I've just found that filter works fine on Dicts. filter((k, i) -> i in new_indexes, dict) should do what we need. Currently, it does exactly what you said: it deletes items that don't match, after copying the whole dict. But it could probably be improved in the future to only copy the items that match. Not sure that will be efficient enough so that we do not need two copies of the names...

Yet another solution would be to avoid storing copies of the strings, and only store in the dict references to the strings in the vector - that would save some space when names are long.

I'll find the time to give a try at your performance improvements.

@davidavdav
Copy link
Owner

Hi Milan,

This morning I removed names::Vector from the type definition and went
through the code to basically remove a.names[i] by names(a,i). Most of
the stuff I checked works. I pushed it to Github.

I didn't do the names slicing cleverly, yet. I just re-define a new dict.

Cheers,

---david

On Tue, Nov 19, 2013 at 11:44 PM, Milan Bouchet-Valat <
notifications@github.com> wrote:

I've just found that filter works fine on Dicts. filter((k, i) -> i in
new_indexes, dict) should do what we need. Currently, it does exactly
what you said: it deletes items that don't match, after copying the whole
dict. But it could probably be improved in the future to only copy the
items that match. Not sure that will be efficient enough so that we do not
need two copies of the names...

Yet another solution would be to avoid storing copies of the strings, and
only store in the dict references to the strings in the vector - that would
save some space when names are long.

I'll find the time to give a try at your performance improvements.


Reply to this email directly or view it on GitHubhttps://github.com//issues/2#issuecomment-28844581
.

David van Leeuwen

@nalimilan
Copy link
Contributor Author

This should probably be revisited when the new version of OrderedDict (JuliaLang/julia#2548) is ready. What we need is really an object allowing both fast access by name and fast extraction of a subset of the names (for cases where a new NamedArray has to be built). OrderedDict might be a good compromise, improving on Dicts with regard to the latter goal since you do not need to sort names every time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants