Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variable hashes have too many collisions #1594

Closed
arkocal opened this issue Oct 4, 2024 · 15 comments · Fixed by #1602
Closed

Variable hashes have too many collisions #1594

arkocal opened this issue Oct 4, 2024 · 15 comments · Fixed by #1602
Labels
bug performance Analysis time, memory usage
Milestone

Comments

@arkocal
Copy link

arkocal commented Oct 4, 2024

Summary: With about 60k variables encountered, we have about 1.04 entries per hash. With an ideal hash function the chances for even a single collision would have been very low.

The experiment can be repeated using the topdown solver on arkocal/hash_collision_demo branch, which includes a minimal module for counting how often each hash value appears.

In particular, variables that differ by a single string are mapped to the same hash.

The following changes already significantly reduce collisions (1.0008 keys per hash, no obvious similarities between colliding keys):
In cdomains/stringDomain.ml (https://github.com/goblint/analyzer/blob/master/src/cdomain/value/cdomains/stringDomain.ml) :

let hash x =
  if get_string_domain () = Disjoint then
    hash x
  else
    (* 13859 <- old value*)
    hash x (* new value *)

In cdomains/intDomain.ml, add to line 3680, after let to_yojson ...

let hash = Hashtbl.hash
@michael-schwarz michael-schwarz added performance Analysis time, memory usage bug labels Oct 4, 2024
@michael-schwarz
Copy link
Member

michael-schwarz commented Oct 4, 2024

Nice catch!

@sim642: I think for stringDomain.ml this is simply an oversight from when we only had Disjoint and Unit? Or was there some reasoning behind it?

For the hashing in cdomains/intDomain.ml, I guess this comes from https://github.com/sim642/ppx_deriving_hash? Maybe something can also be tweaked there? Iirc the example that @arkocal had, the hash for T and Excluded{0} were the same, which is maybe not ideal as both are quite common.

@sim642
Copy link
Member

sim642 commented Oct 7, 2024

I think for stringDomain.ml this is simply an oversight from when we only had Disjoint and Unit? Or was there some reasoning behind it?

That seems possible. All uses of get_string_domain () there are in ifs, not matches, so it's error-prone indeed when something is added.

In cdomains/intDomain.ml, add to line 3680, after let to_yojson ...

let hash = Hashtbl.hash

This is wrong because some of the int domains (e.g. def_exc, enums) contain OCaml Sets where structural equality and hashing lead to problems because sets don't have a canonical representation (equivalent sets may have different Hashtbl.hash values).
So it may be fast, but wrong things can be arbitrarily fast.

Iirc the example that @arkocal had, the hash for T and Excluded{0} were the same, which is maybe not ideal as both are quite common.

I just tried (e.g. in base's init)

    let i1 = ID.top_of IInt in
    let i2 = ID.of_excl_list IInt [Z.zero] in
    Logs.error "%d %d" (ID.hash i1) (ID.hash i2)

and this prints out different hash values [Error] -75963065849016984 39336082039394930 at 0deedb9.

@michael-schwarz
Copy link
Member

Maybe I also misremembered some of the details. @arkocal do you still have the example at hand?

@arkocal
Copy link
Author

arkocal commented Oct 7, 2024

I am not sure if the suggested hashes above are correct or performant. They are not meant to be solution suggestions, just illustrative examples for the source of collisions, sorry for the confusion.

Testing with dd_comb.c from coreutils, without overwriting the hash for intDomain, I get the following:

%%% coll: Colliding variables:
 L:node 34387 "prev = (uintmax_t )prev_nread;" on dd.c:918:11-918:40
and
 L:node 34387 "prev = (uintmax_t )prev_nread;" on dd.c:918:11-918:40
%%% coll: Hash of a: 4529258937874703681,
 Hash of b: 4529258937874703681
%%% coll: Values are equal: false

Coming from:

      let interesting = 4529258937874703681 in
      let colliding = HM.keys rho |> List.of_enum |> List.filter (fun k -> S.Var.hash k = interesting) in
      match colliding with
      | a::b::[] -> 
          if tracing then trace "coll" "Colliding variables:\n %a \nand\n %a" S.Var.pretty_trace a S.Var.pretty_trace b;
          if tracing then trace "coll" "Hash of a: %d,\n Hash of b: %d" (S.Var.hash a) (S.Var.hash b);
          if tracing then trace "coll" "Values are equal: %b" (S.Var.equal a b);
      | _ -> ();
      ;

Since the string representations are the same, this might also be an issue with S.Var.equal.

@sim642
Copy link
Member

sim642 commented Oct 7, 2024

Enabling dbg.trace.context would print out the entire unknown with the context. The difference hopefully is visible there.

@arkocal
Copy link
Author

arkocal commented Oct 7, 2024

This is very strange, trying:

    let i1 = ID.top_of IInt in
    let i2 = ID.of_excl_list IInt [Z.zero] in
    Logs.error "%d %d" (ID.hash i1) (ID.hash i2)

Outputs

[Error] -75963065849016984 -75963065849016984

on my machine on 0deedb9

@arkocal
Copy link
Author

arkocal commented Oct 7, 2024

With dbg.trace.context the only diffence is in base.size with T vs. (Not {0}((0,64)).

For completeness:

 L:(node 34387 "prev = (uintmax_t )prev_nread;", [expRelation:(),
                                                  mallocWrapper:(wrapper call:Unknown node, unique calls:{}),
                                                  base:({
                                                          Parameter {
                                                            fd ->   0
                                                            buf___1 ->   {?, &(alloc@sid:35301@tid:[main])[⊤], &(alloc@sid:35315@tid:[main])[⊤]}
                                                            size ->   ⊤
                                                          }
                                                        }, {}, {}, {}),
                                                  threadid:(wrapper call:unknown node, Thread:[main], created:(current function:created (once) * created (multiple times):(created (once):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}, created (multiple times):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}), callees:bot)),
                                                  threadflag:Multithreaded (main),
                                                  threadreturn:false,
                                                  escape:{},
                                                  mutexEvents:(),
                                                  access:(),
                                                  mutex:(lockset:{}, multiplicity:{}),
                                                  race:(),
                                                  mhp:(),
                                                  assert:(),
                                                  pthreadMutexType:()]) on dd.c:918:11-918:40
and

 L:(node 34387 "prev = (uintmax_t )prev_nread;", [expRelation:(),
                                                  mallocWrapper:(wrapper call:Unknown node, unique calls:{}),
                                                  base:({
                                                          Parameter {
                                                            fd ->   0
                                                            buf___1 ->   {?, &(alloc@sid:35301@tid:[main])[⊤], &(alloc@sid:35315@tid:[main])[⊤]}
                                                            size ->   (Not {0}([0,64]))
                                                          }
                                                        }, {}, {}, {}),
                                                  threadid:(wrapper call:unknown node, Thread:[main], created:(current function:created (once) * created (multiple times):(created (once):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}, created (multiple times):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}), callees:bot)),
                                                  threadflag:Multithreaded (main),
                                                  threadreturn:false,
                                                  escape:{},
                                                  mutexEvents:(),
                                                  access:(),
                                                  mutex:(lockset:{}, multiplicity:{}),
                                                  race:(),
                                                  mhp:(),
                                                  assert:(),
                                                  pthreadMutexType:()]) on dd.c:918:11-918:40

@michael-schwarz
Copy link
Member

Do you maybe have different versions of ppx_deriving_hash?

@arkocal
Copy link
Author

arkocal commented Oct 7, 2024

This would explain it.

I am getting the same values from a docker build as well. Looking at the Dockerfile it looks like my local state should not somehow creep into the container.

@sim642
Copy link
Member

sim642 commented Oct 7, 2024

ppx_deriving_hash is not at fault because Goblint has a lower bound on the latest version. You couldn't even compile with a different version.

The problem is our fork of Zarith where we have version 1.12 with custom patches: https://github.com/goblint/Zarith/tree/goblint-1.12.
Meanwhile, Zarith 1.14 improved the hash functions (ocaml/Zarith#150) but our fork hasn't been adapted. Thus, depending on Gobview opam pins, the hash function used for the zero in the exclusion set could be different. The unlucky case happens when Z.hash Z.zero = 0.

That patch probably needs to be adapted if we want to benefit from Zarith improvements and keep Gobview.

@arkocal
Copy link
Author

arkocal commented Oct 7, 2024

With Zarith 1.14 I get the same values, and the few remaining conflicts have no obvious similarities.

@michael-schwarz
Copy link
Member

Yeah, potentially GobView can be retired at some point --- currently no one seems interested in it (correct me if I'm wrong). Though it feels like a waste to have invested so much effort to then not get a paper out of it.

@sim642
Copy link
Member

sim642 commented Oct 8, 2024

It's not necessarily a blocker for now: we could use Zarith 1.14 by default, but use our Zarith 1.12 fork with Gobview. One just needs to be careful when benchmarking to not use Gobview opam pins.

There might be other places where we should also use Hashtbl.hash for hashing int rather than making it identity to avoid 0 from annihilating some hash mixing. In particular, that's where ppx_deriving_hash might have to be improved. It currently uses very simple (universal) hashing. OCaml runtime has something based on MurmurHash 3, but annoyingly the low-level primitives for mixing are not exposed.

sim642 added a commit that referenced this issue Oct 18, 2024
This is more robust against changes to the possible choices of domain.
It would have avoided issue #1594.
@sim642 sim642 added this to the v2.5.0 milestone Oct 18, 2024
@sim642
Copy link
Member

sim642 commented Oct 24, 2024

@arkocal feel free to reopen this issue if you encounter any more instances of this.

@arkocal
Copy link
Author

arkocal commented Oct 24, 2024

Thanks, I think this should be enough for all practical purposes for now. And of course, will do :)

sim642 added a commit to sim642/opam-repository that referenced this issue Nov 28, 2024
CHANGES:

Functionally equivalent to Goblint in SV-COMP 2025.

* Add 32bit vs 64bit architecture support (goblint/analyzer#54, goblint/analyzer#1574).
* Add per-function context gas analysis (goblint/analyzer#1569, goblint/analyzer#1570, goblint/analyzer#1598).
* Adapt automatic static loop unrolling (goblint/analyzer#1516, goblint/analyzer#1582, goblint/analyzer#1583, goblint/analyzer#1584, goblint/analyzer#1590, goblint/analyzer#1595, goblint/analyzer#1599).
* Adapt automatic configuration tuning (goblint/analyzer#1450, goblint/analyzer#1612, goblint/analyzer#1181, goblint/analyzer#1604).
* Simplify non-relational integer invariants in witnesses (goblint/analyzer#1517).
* Fix excessive hash collisions (goblint/analyzer#1594, goblint/analyzer#1602).
* Clean up various code (goblint/analyzer#1095, goblint/analyzer#1523, goblint/analyzer#1554, goblint/analyzer#1575, goblint/analyzer#1588, goblint/analyzer#1597, goblint/analyzer#1614).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug performance Analysis time, memory usage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants