Variable hashes have too many collisions #1594

arkocal · 2024-10-04T15:23:20Z

Summary: With about 60k variables encountered, we have about 1.04 entries per hash. With an ideal hash function the chances for even a single collision would have been very low.

The experiment can be repeated using the topdown solver on arkocal/hash_collision_demo branch, which includes a minimal module for counting how often each hash value appears.

In particular, variables that differ by a single string are mapped to the same hash.

The following changes already significantly reduce collisions (1.0008 keys per hash, no obvious similarities between colliding keys):
In cdomains/stringDomain.ml (https://github.com/goblint/analyzer/blob/master/src/cdomain/value/cdomains/stringDomain.ml) :

let hash x =
  if get_string_domain () = Disjoint then
    hash x
  else
    (* 13859 <- old value*)
    hash x (* new value *)

In cdomains/intDomain.ml, add to line 3680, after let to_yojson ...

let hash = Hashtbl.hash

The text was updated successfully, but these errors were encountered:

michael-schwarz · 2024-10-04T16:34:15Z

Nice catch!

@sim642: I think for stringDomain.ml this is simply an oversight from when we only had Disjoint and Unit? Or was there some reasoning behind it?

For the hashing in cdomains/intDomain.ml, I guess this comes from https://github.com/sim642/ppx_deriving_hash? Maybe something can also be tweaked there? Iirc the example that @arkocal had, the hash for T and Excluded{0} were the same, which is maybe not ideal as both are quite common.

sim642 · 2024-10-07T07:39:41Z

I think for stringDomain.ml this is simply an oversight from when we only had Disjoint and Unit? Or was there some reasoning behind it?

That seems possible. All uses of get_string_domain () there are in ifs, not matches, so it's error-prone indeed when something is added.

In cdomains/intDomain.ml, add to line 3680, after let to_yojson ...
let hash = Hashtbl.hash

This is wrong because some of the int domains (e.g. def_exc, enums) contain OCaml Sets where structural equality and hashing lead to problems because sets don't have a canonical representation (equivalent sets may have different Hashtbl.hash values).
So it may be fast, but wrong things can be arbitrarily fast.

Iirc the example that @arkocal had, the hash for T and Excluded{0} were the same, which is maybe not ideal as both are quite common.

I just tried (e.g. in base's init)

    let i1 = ID.top_of IInt in
    let i2 = ID.of_excl_list IInt [Z.zero] in
    Logs.error "%d %d" (ID.hash i1) (ID.hash i2)

and this prints out different hash values [Error] -75963065849016984 39336082039394930 at 0deedb9.

michael-schwarz · 2024-10-07T08:57:42Z

Maybe I also misremembered some of the details. @arkocal do you still have the example at hand?

arkocal · 2024-10-07T11:05:09Z

I am not sure if the suggested hashes above are correct or performant. They are not meant to be solution suggestions, just illustrative examples for the source of collisions, sorry for the confusion.

Testing with dd_comb.c from coreutils, without overwriting the hash for intDomain, I get the following:

%%% coll: Colliding variables:
 L:node 34387 "prev = (uintmax_t )prev_nread;" on dd.c:918:11-918:40
and
 L:node 34387 "prev = (uintmax_t )prev_nread;" on dd.c:918:11-918:40
%%% coll: Hash of a: 4529258937874703681,
 Hash of b: 4529258937874703681
%%% coll: Values are equal: false

Coming from:

      let interesting = 4529258937874703681 in
      let colliding = HM.keys rho |> List.of_enum |> List.filter (fun k -> S.Var.hash k = interesting) in
      match colliding with
      | a::b::[] -> 
          if tracing then trace "coll" "Colliding variables:\n %a \nand\n %a" S.Var.pretty_trace a S.Var.pretty_trace b;
          if tracing then trace "coll" "Hash of a: %d,\n Hash of b: %d" (S.Var.hash a) (S.Var.hash b);
          if tracing then trace "coll" "Values are equal: %b" (S.Var.equal a b);
      | _ -> ();
      ;

Since the string representations are the same, this might also be an issue with S.Var.equal.

sim642 · 2024-10-07T11:16:51Z

Enabling dbg.trace.context would print out the entire unknown with the context. The difference hopefully is visible there.

arkocal · 2024-10-07T12:47:39Z

This is very strange, trying:

    let i1 = ID.top_of IInt in
    let i2 = ID.of_excl_list IInt [Z.zero] in
    Logs.error "%d %d" (ID.hash i1) (ID.hash i2)

Outputs

[Error] -75963065849016984 -75963065849016984

on my machine on 0deedb9

arkocal · 2024-10-07T12:51:26Z

With dbg.trace.context the only diffence is in base.size with T vs. (Not {0}((0,64)).

For completeness:

 L:(node 34387 "prev = (uintmax_t )prev_nread;", [expRelation:(),
                                                  mallocWrapper:(wrapper call:Unknown node, unique calls:{}),
                                                  base:({
                                                          Parameter {
                                                            fd ->   0
                                                            buf___1 ->   {?, &(alloc@sid:35301@tid:[main])[⊤], &(alloc@sid:35315@tid:[main])[⊤]}
                                                            size ->   ⊤
                                                          }
                                                        }, {}, {}, {}),
                                                  threadid:(wrapper call:unknown node, Thread:[main], created:(current function:created (once) * created (multiple times):(created (once):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}, created (multiple times):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}), callees:bot)),
                                                  threadflag:Multithreaded (main),
                                                  threadreturn:false,
                                                  escape:{},
                                                  mutexEvents:(),
                                                  access:(),
                                                  mutex:(lockset:{}, multiplicity:{}),
                                                  race:(),
                                                  mhp:(),
                                                  assert:(),
                                                  pthreadMutexType:()]) on dd.c:918:11-918:40
and

 L:(node 34387 "prev = (uintmax_t )prev_nread;", [expRelation:(),
                                                  mallocWrapper:(wrapper call:Unknown node, unique calls:{}),
                                                  base:({
                                                          Parameter {
                                                            fd ->   0
                                                            buf___1 ->   {?, &(alloc@sid:35301@tid:[main])[⊤], &(alloc@sid:35315@tid:[main])[⊤]}
                                                            size ->   (Not {0}([0,64]))
                                                          }
                                                        }, {}, {}, {}),
                                                  threadid:(wrapper call:unknown node, Thread:[main], created:(current function:created (once) * created (multiple times):(created (once):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}, created (multiple times):{maybe_close_stdout@dd.c:2015:3-2015:31, interrupt_handler@dd.c:736:5-736:109, siginfo_handler@dd.c:729:5-729:110}), callees:bot)),
                                                  threadflag:Multithreaded (main),
                                                  threadreturn:false,
                                                  escape:{},
                                                  mutexEvents:(),
                                                  access:(),
                                                  mutex:(lockset:{}, multiplicity:{}),
                                                  race:(),
                                                  mhp:(),
                                                  assert:(),
                                                  pthreadMutexType:()]) on dd.c:918:11-918:40

michael-schwarz · 2024-10-07T13:37:47Z

Do you maybe have different versions of ppx_deriving_hash?

arkocal · 2024-10-07T13:53:54Z

This would explain it.

I am getting the same values from a docker build as well. Looking at the Dockerfile it looks like my local state should not somehow creep into the container.

sim642 · 2024-10-07T13:56:19Z

ppx_deriving_hash is not at fault because Goblint has a lower bound on the latest version. You couldn't even compile with a different version.

The problem is our fork of Zarith where we have version 1.12 with custom patches: https://github.com/goblint/Zarith/tree/goblint-1.12.
Meanwhile, Zarith 1.14 improved the hash functions (ocaml/Zarith#150) but our fork hasn't been adapted. Thus, depending on Gobview opam pins, the hash function used for the zero in the exclusion set could be different. The unlucky case happens when Z.hash Z.zero = 0.

That patch probably needs to be adapted if we want to benefit from Zarith improvements and keep Gobview.

arkocal · 2024-10-07T14:11:54Z

With Zarith 1.14 I get the same values, and the few remaining conflicts have no obvious similarities.

michael-schwarz · 2024-10-07T14:26:35Z

Yeah, potentially GobView can be retired at some point --- currently no one seems interested in it (correct me if I'm wrong). Though it feels like a waste to have invested so much effort to then not get a paper out of it.

sim642 · 2024-10-08T12:42:08Z

It's not necessarily a blocker for now: we could use Zarith 1.14 by default, but use our Zarith 1.12 fork with Gobview. One just needs to be careful when benchmarking to not use Gobview opam pins.

There might be other places where we should also use Hashtbl.hash for hashing int rather than making it identity to avoid 0 from annihilating some hash mixing. In particular, that's where ppx_deriving_hash might have to be improved. It currently uses very simple (universal) hashing. OCaml runtime has something based on MurmurHash 3, but annoyingly the low-level primitives for mixing are not exposed.

This is more robust against changes to the possible choices of domain. It would have avoided issue #1594.

sim642 · 2024-10-24T07:26:35Z

@arkocal feel free to reopen this issue if you encounter any more instances of this.

arkocal · 2024-10-24T07:52:47Z

Thanks, I think this should be enough for all practical purposes for now. And of course, will do :)

michael-schwarz added performance Analysis time, memory usage bug labels Oct 4, 2024

sim642 added a commit that referenced this issue Oct 17, 2024

Lock zarith 1.14 for better hash (issue #1594)

732b69a

sim642 added a commit that referenced this issue Oct 18, 2024

Improve StringDomain type safety by matching all domains

2284da8

This is more robust against changes to the possible choices of domain. It would have avoided issue #1594.

sim642 added this to the v2.5.0 milestone Oct 18, 2024

sim642 mentioned this issue Oct 18, 2024

Improve flat string domain hash #1602

Merged

sim642 closed this as completed in #1602 Oct 24, 2024

sim642 mentioned this issue Nov 28, 2024

[new release] goblint (2.5.0) ocaml/opam-repository#26987

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variable hashes have too many collisions #1594

Variable hashes have too many collisions #1594

arkocal commented Oct 4, 2024 •

edited by michael-schwarz

Loading

michael-schwarz commented Oct 4, 2024 •

edited

Loading

sim642 commented Oct 7, 2024

michael-schwarz commented Oct 7, 2024

arkocal commented Oct 7, 2024

sim642 commented Oct 7, 2024

arkocal commented Oct 7, 2024

arkocal commented Oct 7, 2024

michael-schwarz commented Oct 7, 2024

arkocal commented Oct 7, 2024

sim642 commented Oct 7, 2024

arkocal commented Oct 7, 2024

michael-schwarz commented Oct 7, 2024

sim642 commented Oct 8, 2024

sim642 commented Oct 24, 2024

arkocal commented Oct 24, 2024

Variable hashes have too many collisions #1594

Variable hashes have too many collisions #1594

Comments

arkocal commented Oct 4, 2024 • edited by michael-schwarz Loading

michael-schwarz commented Oct 4, 2024 • edited Loading

sim642 commented Oct 7, 2024

michael-schwarz commented Oct 7, 2024

arkocal commented Oct 7, 2024

sim642 commented Oct 7, 2024

arkocal commented Oct 7, 2024

arkocal commented Oct 7, 2024

michael-schwarz commented Oct 7, 2024

arkocal commented Oct 7, 2024

sim642 commented Oct 7, 2024

arkocal commented Oct 7, 2024

michael-schwarz commented Oct 7, 2024

sim642 commented Oct 8, 2024

sim642 commented Oct 24, 2024

arkocal commented Oct 24, 2024

arkocal commented Oct 4, 2024 •

edited by michael-schwarz

Loading

michael-schwarz commented Oct 4, 2024 •

edited

Loading