Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: OrderedDict, v3, based on AbstractHashDict #2548

Closed
wants to merge 5 commits into from

Conversation

kmsquire
Copy link
Member

EDIT: updated title, description

An alterate OrderedDict. This is essentially the same version of OrderedDict as #2558, but without all of the copied code. As with that version, the delete performance here is better than in #2556 (~10% overhead, vs ~50% overhead in #2556), but the code in #2556 is a bit cleaner.

(Copied from #2558)
If either of these move forward, I would love some specific feedback on whether to allow integer indexing for OrderedDicts with nonnumerical keys (e.g., od["key"] looks up by hashing "key", and od[1] returns the first (key,value) pair, but only for nonnumerical keys).

There is also getitem(od, 1) for doing numerical-index based lookup, and indexof(key) to get the numerical index of a particular key. Suggestions for better names for these are welcome.

Original AbstractHashDict description

This generalizes most Dict functions to work on an abstract hash type.

Rationale: I needed an OrderedDict class for a project, and found no way to create one by wrapping Dict that was in any way efficient. The only implementation that did not have severe performance issues was to copy almost all of Dict and modify it directly to keep track of the order of the keys. The result was the near duplication of most functions. See https://github.com/kmsquire/OrderedCollections.jl.

I'm not sure if any other Dict subclass will ever need this, and I don't know if an OrderedDict needs to be in base (it probably belongs in a Collections or OrderedCollections package). So I'm just putting this out there to get some feedback.

If it goes in, I have a version of OrderedDict on a branch which will use it (and reduce in size greatly). If not, I'll publish the version which copies most of the Dict functions.

If nothing else, this patch does show the potential utility of abstract classes with fields, as most of the functions for the abstract class depend on the existence of most of the fields of Dict (and which were copied in OrderedDict).

@StefanKarpinski
Copy link
Sponsor Member

This is definitely much needed. I would be in favor of having all the basic collection types in Base. We've already got PriorityQueue in there at this point and OrderedDict seems like a good standard addition as well. If nothing else, having these things in Base will force us to get the type hierarchy right.

Once question: why AbstractHashDict rather then just AbstractDict? Is this abstraction specific to using a hash table?

@ViralBShah
Copy link
Member

Yup - it is great to see Base slowly becoming richer with more collection types.

@kmsquire
Copy link
Member Author

So, regarding AbstractDict vs AbstractHashDict: the OrderedDict I created is a hash table backed by an array which maintains an order on the elements. By default, this is the insertion order, but the order can be manipulated (via most dequeue operations and sorting, currently). I really wanted these things to be separate and simply wrap a Dict, but couldn't find a way to do it without incurring a huge performance penalty. The version I came up with is basically a Dict with 3 additional fields and minor modifications to a few of the Dict functions, so it makes sense to reuse most of the Dict functions, and (EDIT:) have an AbstractHashDict.

OrderedDict is a little presumptuous (as is Dict), as technically tree-based dictionaries (as were discussed a while back) are ordered dictionaries. But I couldn't come up with a better name without getting a little unwieldy (OrderedHashDict maybe?).

For comparison:

  • Java has a LinkedHashMap (but the OrderedDict here is backed by an array, not a linked list)
  • Python has OrderedDict (also backed by a doubly linked list)
  • Ruby's default OrderedHash implementation seems to use a doubly linked list (no special name)

The benefit of using an array as backing, rather than a linked list, are that things like sort!, push!, pop!, and access by numerical index become possible. The disadvantage are that some operations become rather expensive as the dictionary grows (but, e.g., delete!() is amortized).

Anyway, as long as this sounds okay, I'll add the OrderedDict to this commit.

@nolta
Copy link
Member

nolta commented Mar 12, 2013

Slightly off-topic, but why are we planning on putting PriorityQueue and OrderedDict in base and not extras? I thought the goal was to make Base as small as possible.

@StefanKarpinski
Copy link
Sponsor Member

Well, there are conflicting goals: keep it small but also provide lots of useful functionality.

@StefanKarpinski
Copy link
Sponsor Member

One reason to have at least some of these collection types in base is to ensure that we get the hierarchy right. That's hard to do without having actual concrete implementations of different abstractions.

@kmsquire kmsquire closed this Mar 12, 2013
@kmsquire kmsquire reopened this Mar 12, 2013
@nolta
Copy link
Member

nolta commented Mar 12, 2013

Sure, but we can have both: small Base and lots of functionality in the standard library.

@kmsquire
Copy link
Member Author

My impression was that Base and the standard library were the same thing at this point...

@StefanKarpinski
Copy link
Sponsor Member

They are the same.

@JeffBezanson
Copy link
Sponsor Member

I find this AbstractHashDict type very strange: it assumes a very specific representation of a dictionary, which we have in fact changed several times. Seems like a hack. I'm more interested in the "severe performance issues". There obviously needs to be a way to do stuff without severe performance issues. I'd like to look at that code.

@kmsquire
Copy link
Member Author

@JeffBezanson, well, it is a hack, if by hack you mean "attempting to subclass a concrete class in julia". ;-)

I've put together a gist with 3 different versions of OrderedDict, including my final version, benchmark code and benchmarks, here:

https://gist.github.com/kmsquire/5147894

The gist gives a brief description of the versions. You should be able to require, use, and modify each directly.

I did profile the first two versions, and didn't find much that I could do to improve performance, but maybe there's something I missed. I think there's more of a chance of an improvement by changing the representation than by optimizing the implementation or julia--the difference in delete!() between my second best version and the version where I copied Dict is 20x. But prove me wrong (or right)--good luck!

@JeffBezanson
Copy link
Sponsor Member

Great, thank you!

@JeffBezanson
Copy link
Sponsor Member

Isn't the entire problem here that the slow versions do some O(n) operation on an array for each deletion?

@kmsquire
Copy link
Member Author

That shouldn't be the case. Delete only marks array positions, and they're cleaned up only upon a call to _compact().

@JeffBezanson
Copy link
Sponsor Member

It calls findfirst, which is O(n).

@kmsquire
Copy link
Member Author

Yes, but it's calling findfirst on a bit array which is at least 1/4 full, so the expected number of steps is 3.

(Unless it's not actually being compacted...)

@kmsquire
Copy link
Member Author

Okay, that is a problem--thanks! I only am compacting on assignment. I thought that was a good thing, but maybe things will improve if I compact sooner.

@JeffBezanson
Copy link
Sponsor Member

Another couple things, not sure how important they are:

  • DictItem inside OrderedDict should be DictItem{K,V}
  • There is a try/catch happening on each delete!

@kmsquire
Copy link
Member Author

The findfirst issue was only with the second version. I'll make that chance and fix the others you just suggested and rerun the timings.

@kmsquire
Copy link
Member Author

I was wrong about the findfirst. For the second version, delete is bad for just that reason, and I shouldn't even have put it up. For the first version, that's not a problem. But I will address the other comments you just made.

@kmsquire
Copy link
Member Author

Okay, the try/catch was the big problem--changing that made up most of the performance gap. Thanks!

test_del
========
Dict{K,V}: [0.584101, 0.565818, 0.565554, 0.564916, 0.565354], median=0.5655537869999999
OrderedDict{K,V}: [0.797462, 0.785816, 0.781939, 0.783949, 0.784663], median=0.784663365

(Previously, the median for OrderedDictv1 was 11.775718537000001.) For v3, the timing is still better (median=0.609851745), but can I read your previous comments as preferring the (slightly) slower, less invasive version?

@kmsquire
Copy link
Member Author

I updated the gist with the new code and bindings timings.

@kmsquire kmsquire mentioned this pull request Mar 13, 2013
@kmsquire
Copy link
Member Author

Closing in lieu of #2556.

@kmsquire kmsquire closed this Mar 13, 2013
@kmsquire
Copy link
Member Author

Or rather, that one is in lieu of this one...

@kmsquire kmsquire reopened this Mar 14, 2013
Minor updates to OrderedDict to fix tests.
@kmsquire
Copy link
Member Author

Added a version of OrderedDict here, and retitled.

One additional issue with this version, beyond anything already mentioned, is that serialize/deserialize don't work.

@kmsquire
Copy link
Member Author

While they work, both this and the other OrderedDict version in #2556 are a little too complex. I'm working on a simpler version, so I'm going to withdraw them both for now, and submit the simpler version soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants