Do not store names in two different objects #2

nalimilan · 2013-11-18T12:47:43Z

Currently names are duplicated in the names and dicts members. This is suboptimal in the long-term. It is probably possible to only use the dictionaries and to rebuild the ordered vector of names from it when needed. Since that vector is not typically needed in performance-critical operations, that's not a problem (indexing is fast thanks to dictionaries, and that's what matters). What do you think?

The text was updated successfully, but these errors were encountered:

davidavdav · 2013-11-18T13:30:34Z

Hi,

On Mon, Nov 18, 2013 at 1:47 PM, Milan Bouchet-Valat <
notifications@github.com> wrote:

Currently names are duplicated in the names and dicts members. This is
suboptimal in the long-term. It is probably possible to only use the
dictionaries and to rebuild the ordered vector of names from it when
needed. Since that vector is not typically needed in performance-critical
operations, that's not a problem (indexing is fast thanks to dictionaries,
and that's what matters). What do you think?

The names in the dictionary are unordered, but you can find them back by
sorting by value, like you say.

We could have a function

names(a::NamedArray, d::Int) =
collect(keys(n.dicts[d]))[sortperm(collect(values(n.dicts[d])))]

that always resolves the current dict.

In slices etc. I now first recompute the names, and then redefine the dict.
This would then be:

first recompute names form dict
slice the names
recompute dict from slice

a bit more overhead. Perhaps it can be done more cleverly.

---david

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/2
.

David van Leeuwen

nalimilan · 2013-11-18T20:48:25Z

In theory when extracting a slice you only need to create a subset of the dictionary, which could be even more efficient than creating it from a vector (since you do not need to compute hashes again). So what we need is a function to extract a subset of a dict. Probably worth asking the main devs.

There's also OrderedDict (see JuliaLang/julia#2548), which could be more efficient to select some elements by index (but slower when inserting).

davidavdav · 2013-11-18T23:46:53Z

Hello,

On Mon, Nov 18, 2013 at 9:48 PM, Milan Bouchet-Valat <
notifications@github.com> wrote:

In theory when extracting a slice you only need to create a subset of the
dictionary, which could be even more efficient than creating it from a
vector (since you do not need to compute hashes again). So what we need is
a function to extract a subset of a dict. Probably worth asking the main
devs.

This is an idea indeed. I don't know how costly it is to create a new
dict. There is del(dict,key), but I don't know if we want to iterate over
all keys and delete them one by one. Could be done I suppose.

There's also OrderedDict (see JuliaLang/julia#2548 JuliaLang/julia#2548),
which could be more efficient to select some elements by index (but slower
when inserting).

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/2#issuecomment-28736410
.

I worked on more efficient getindex(), which now is pretty fast for Int /
String indices up to 3. I tried to use the type-specific method
dispatching to speed things up. The real killer seems to be a Union(Int,
String).

I also have a small test.jl now. Needs to be expanded.

---david

David van Leeuwen

nalimilan · 2013-11-19T22:44:10Z

I've just found that filter works fine on Dicts. filter((k, i) -> i in new_indexes, dict) should do what we need. Currently, it does exactly what you said: it deletes items that don't match, after copying the whole dict. But it could probably be improved in the future to only copy the items that match. Not sure that will be efficient enough so that we do not need two copies of the names...

Yet another solution would be to avoid storing copies of the strings, and only store in the dict references to the strings in the vector - that would save some space when names are long.

I'll find the time to give a try at your performance improvements.

davidavdav · 2013-11-21T14:08:55Z

Hi Milan,

This morning I removed names::Vector from the type definition and went
through the code to basically remove a.names[i] by names(a,i). Most of
the stuff I checked works. I pushed it to Github.

I didn't do the names slicing cleverly, yet. I just re-define a new dict.

Cheers,

---david

On Tue, Nov 19, 2013 at 11:44 PM, Milan Bouchet-Valat <
notifications@github.com> wrote:

I've just found that filter works fine on Dicts. filter((k, i) -> i in
new_indexes, dict) should do what we need. Currently, it does exactly
what you said: it deletes items that don't match, after copying the whole
dict. But it could probably be improved in the future to only copy the
items that match. Not sure that will be efficient enough so that we do not
need two copies of the names...

Yet another solution would be to avoid storing copies of the strings, and
only store in the dict references to the strings in the vector - that would
save some space when names are long.

I'll find the time to give a try at your performance improvements.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/2#issuecomment-28844581
.

David van Leeuwen

nalimilan · 2013-11-28T09:27:28Z

This should probably be revisited when the new version of OrderedDict (JuliaLang/julia#2548) is ready. What we need is really an object allowing both fast access by name and fast extraction of a subset of the names (for cases where a new NamedArray has to be built). OrderedDict might be a good compromise, improving on Dicts with regard to the latter goal since you do not need to sort names every time.

davidavdav closed this as completed Nov 28, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not store names in two different objects #2

Do not store names in two different objects #2

nalimilan commented Nov 18, 2013

davidavdav commented Nov 18, 2013

nalimilan commented Nov 18, 2013

davidavdav commented Nov 18, 2013

nalimilan commented Nov 19, 2013

davidavdav commented Nov 21, 2013

nalimilan commented Nov 28, 2013

Do not store names in two different objects #2

Do not store names in two different objects #2

Comments

nalimilan commented Nov 18, 2013

davidavdav commented Nov 18, 2013

nalimilan commented Nov 18, 2013

davidavdav commented Nov 18, 2013

nalimilan commented Nov 19, 2013

davidavdav commented Nov 21, 2013

nalimilan commented Nov 28, 2013