`==` incorrect for tables with repeated keys #13499

Comments
I'm still of the opinion that comparing tables for equality mostly comes up in tests. So in my book, this should be a low priority, and if fixing this causes any slowdowns, we shouldn't do it.
I realized just now that tables in Nim support duplicate keys. :-/ Now I wonder why, in the only use case I can think of (HTTP headers), we take the trouble of defining …
It's not that bad, the only use case for … table.
First, I think a better exhibition of this defect is simply:

```nim
import tables # watch me fail!
assert {1: 2, 1: 2, 3: 4}.toTable != {1: 2, 3: 4, 3: 4}.toTable
```

I would also qualify (different from the issue title) that (key, value) pairs need to be repeated, not only keys. I.e., …

Second, another way to go here that does not penalize any other operations is, during/after the …, to count the (key, value) pairs in each table and compare the counts at the end. This does A) use a bit more than twice the memory and B) require a …. The compiler will inform people when they need to define ….

I would think if C) is acceptable then this counting approach is the easy fix here. Though I doubt many know about the dup-key (and dup (key, val)!) powers of Table, it is also probably very rare to compare two tables at all, and more rare still to expect them to be equal except, as @Araq mentioned, in tests. Getting something like this right in tests is good "project optics" for a language advertising "safety". If, OTOH, C) is unacceptable then I think you should close this issue as "won't fix".

For completeness, I should also mention that we could also prefix the final count compare with a commutative hash like …. Indeed, if we were willing to make every edit a bit slower, a Robin Hood re-organization could actually go further and break key ties with …
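A minimal sketch of this counting approach, under the assumption that `hash`/`==` are available for both the key and value types (std/hashes provides a tuple `hash` in recent Nim versions); the `dupAwareEqual` name is made up here and is not stdlib API:

```nim
import tables, hashes

# Hypothetical helper: treat each table as a multiset of (key, value) pairs
# and compare pair counts.  Uses one CountTable entry per distinct pair,
# i.e. "a bit more than twice the memory" in the worst case.
proc dupAwareEqual[K, V](s, t: Table[K, V]): bool =
  if s.len != t.len: return false
  var cs = initCountTable[(K, V)]()
  var ct = initCountTable[(K, V)]()
  for k, v in s.pairs: cs.inc((k, v))
  for k, v in t.pairs: ct.inc((k, v))
  result = cs == ct

when isMainModule:
  var a = initTable[int, int]()
  var b = initTable[int, int]()
  a.add(1, 2); a.add(1, 2); a.add(3, 4)
  b.add(1, 2); b.add(3, 4); b.add(3, 4)
  assert not dupAwareEqual(a, b)   # same keys, different pair multisets
```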
@c-blake your test seems like an unrelated bug: .toTable() always creates tables w/o duplicates. Shown with: …

.toTable currently does …. Making that change, it fails because of the original problem: …
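A small illustration of the distinction being pointed out here, assuming the Table API of the time (`toTable` assigns, so repeated pairs collapse, while `add` inserts unconditionally):

```nim
import tables

# toTable effectively builds the table with `[]=`, so duplicates collapse:
let u = {1: 2, 1: 2, 3: 4}.toTable
echo u.len          # 2 -- no duplicates survive

# `add` inserts without overwriting, so duplicates are kept:
var t = initTable[int, int]()
t.add(1, 2)
t.add(1, 2)
t.add(3, 4)
echo t.len          # 3 -- the (1, 2) pair is stored twice
```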
Seems like it would be a good idea to have a setting like `t.dups = true/false`, with the default being false, as most people (my opinion) expect for hash tables. Or maybe the first time someone calls `add`, it sets the flag to true, for compatibility.
@hashbackup - Oops! Good catch. Thanks!

I think if we continue to support dup keys then we should change …. If we wanted any kind of table constructor flags, then we should also be able to pass them to …. Another inconsistency is that while …
One common use of hash tables / sets is to eliminate duplicates from a list. Making the default "no dups", like most other languages / hash tables, makes more sense to me.
@c-blake Can this be done with one count table each? It seems like there are two different cases: …

It seems like you'd need one count table for keys and another for values, for both tables, so 4.
I may not have described it clearly, but I was imagining two count tables keyed on (key, value) pairs (one for each table being compared).
Ah, ok. My way is dumb and won't work. :-)

For regular Tables (not OrderedTables) the hash table data is already a `seq[tuple[key: K, val: V]]`. We could make a copy of each seq, sort it, compare it. OrderedTables could be copied, `ot.sort()` both, compare the two seqs. Or, build two new seqs that don't have the hcode/linked-list junk, sort and compare them. Not sure this would be better than your way, but it's another option.

Edit: they'd have to be rebuilt to remove the hash code (and links). If you know there are duplicates, the initial membership test can be skipped with either method (I think).
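A sketch of this copy-sort-compare idea, assuming `<` and `==` exist for both key and value types; the `sortedPairsEqual` name is invented for illustration:

```nim
import tables, algorithm

# Rebuild each table as a bare seq of (key, value) pairs (dropping the
# hcode/link bookkeeping), sort both, then compare.  O(n log n) time and
# O(n) extra memory, but cache-friendly.
proc sortedPairsEqual[K, V](s, t: Table[K, V]): bool =
  if s.len != t.len: return false
  var ps, pt: seq[(K, V)]
  for k, v in s.pairs: ps.add((k, v))
  for k, v in t.pairs: pt.add((k, v))
  ps.sort()
  pt.sort()
  result = ps == pt
```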
Sorted seqs are another option. My approach uses less memory (e.g. avg num dups = 3 means 3x less memory) and less "operation" time at scale (linear), but the stdlib merge sort might be sufficiently more cache friendly to use less "wall" time in spite of being O(n*lg n). So, as usual, which is better/faster "just depends". ;-)

As mentioned, the initial membership test makes ….

A whole other possibility, with zero impact on current code and also no need of an instance-level flag, is to provide …. We could also have …
Good points
The better algorithm here, which I should have thought of immediately, would be to work like the …:

```nim
for key, val in s:
  for (sv, tv, diffLens) in zip(s.allValues(key), t.allValues(key)):
    if diffLens or sv != tv: return false
```

but with a …. When no duplicate pairs are involved this approach would require no real memory (an empty …). Now, …

It's a bit of loopy logic to work out the details, but I think a fix along these lines is the way to resolve this issue.
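A runnable (if unoptimized) sketch along these lines, using the existing `allValues` iterator; `multisetEqual` is a hypothetical name, and it naively re-checks keys that repeat in `s` rather than tracking which keys were already compared:

```nim
import tables

# For each key in s, compare the multiset of values stored under that key
# in s against the multiset stored under it in t.  Combined with the length
# check this handles duplicate (key, value) pairs; it needs `hash` for the
# value type and does redundant work when a key repeats in s.
proc multisetEqual[K, V](s, t: Table[K, V]): bool =
  if s.len != t.len: return false
  for key in s.keys:
    var cs = initCountTable[V]()
    var ct = initCountTable[V]()
    for v in s.allValues(key): cs.inc(v)
    for v in t.allValues(key): ct.inc(v)
    if cs != ct: return false
  result = true
```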
There is consensus on phasing out the "multi" aspect of Tables, aka deprecating `add`.
When I was first looking for the equivalent of a Python dict, Nim's syntax was different enough that I had to read the Tables docs a couple of times to understand how they worked vs Python's dict. For example, ….

First thing that threw me off was that the tables page starts with a lot of examples of how to create populated tables, but no example of how to create an empty table. Then there was some confusion about whether I had to call initTable. I wanted to set the initial table size (since I read about it), and ….

So I stumbled upon add from reading the tables doc page a few times. If add wasn't there, I would have never known or expected that duplicates could be added to a table.

It's hard to realize how strange new things can seem to newcomers when they are like 2nd nature to others who have been using Nim for years. It only takes a few days for them not to be strange of course, but the initial encounter is usually a little jarring because: …
There is? Last I saw, there was a TC RFC that literally no one even engaged with (no emojis even!) besides me and @hlaaftana's name suggestion, where TC was at least partly misarguing performance implications.

Looking into other channels: as recently as 2019-02-10 12:55:09, @narimiran on IRC said on the topic: "it doesn't matter what somebody thinks, jesus christ. it is part of nim v1.0." Then he reversed himself a year later on 2020-03-09 in IRC. I guess @Araq made this pronouncement on IRC, 2020-03-09 14:47:56: "I consider this solved, we will deprecate the 'add' proc". You got immediate push back from @disruptek that seemed to simply fade. This consensus was news to me. Decision/discussion should really be reflected on TC's RFC. I get that ….

In any event, if the plan is now to instead introduce a …. This doubling of all related types has not seemed to arise much in prior discussions. That also represents cognitive load on both maintainers and users -- not dissimilar to knowing WTF ….

A (viable?) alternative to "deprecation & type duplication" might be keeping the superset-capability types with an …
Regarding the confusing …
Anyway, I'm happy to do some PR to fix …
`==` incorrect for tables with repeated keys

Example

Current Output

Expected Output

Possible Solution

fix this to handle duplicate keys:

Additional Information

Proposal
Correctly handling duplicated keys may make certain operations slower for the common case (no repeated keys).
If fixing this issue makes `==` slower (ditto with other operations?), maybe we should consider adding a runtime flag (not compile time, which adds more complexity) `enableDups = false`, and for all operations where duplicate keys could be handled, check this flag and throw if needed for some operations (e.g. `add`). This should not result in a slowdown, as it's a single bool flag that should be easy on branch predictors.
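A sketch of what such a runtime flag could look like as a wrapper type, purely illustrative: neither `DupTable` nor `enableDups` exists in the stdlib.

```nim
import tables

type
  DupTable*[K, V] = object
    data: Table[K, V]
    enableDups*: bool            # false by default: duplicate keys rejected

proc `[]=`*[K, V](t: var DupTable[K, V], key: K, val: V) =
  t.data[key] = val

proc add*[K, V](t: var DupTable[K, V], key: K, val: V) =
  # A single predictable branch, so the common no-duplicates path stays cheap.
  if not t.enableDups and key in t.data:
    raise newException(KeyError, "duplicate key not allowed; set enableDups")
  t.data.add(key, val)

proc len*[K, V](t: DupTable[K, V]): int = t.data.len
```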