
HashmapAccumulator has several unused members, misnamed parameters #508

Closed
brian-kelley opened this issue Dec 4, 2019 · 11 comments · Fixed by #731
Labels: Cleanup (code maintenance that isn't a bugfix or new feature), InDevelop

Comments

@brian-kelley
Contributor

brian-kelley commented Dec 4, 2019

The HashmapAccumulator has three members (hash_key_size, max_value_size, and used_size) that are set by the constructors but never used internally. Instead, used_size_ and max_value_size_ (with trailing underscores) are passed to the insert() routines.

hash_key_size_ is not passed to any insert routine. It is the number of possible hash outputs, i.e., one plus the bit mask that forms the hash function: hash(value) = value & (hash_key_size - 1).
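For example, a minimal standalone sketch of that masking hash, assuming hash_key_size is a power of two (hash_index below is a hypothetical helper, not part of HashmapAccumulator):

```c++
// Minimal sketch of the masking hash, assuming hash_key_size is a power of two.
// hash_index is a hypothetical helper, not part of HashmapAccumulator.
#include <cassert>
#include <cstdint>

inline uint32_t hash_index(uint32_t value, uint32_t hash_key_size) {
  assert((hash_key_size & (hash_key_size - 1)) == 0);  // only valid for power-of-two sizes
  return value & (hash_key_size - 1);  // same result as value % hash_key_size, but cheaper
}
```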

I think these changes would make sense and make things a lot more readable and debuggable (a rough interface sketch follows the list):

  • max_value_size is the length of the keys and nexts arrays, so it should be an immutable member, and max_value_size_ should not be passed into the insert routines. In fact, what SPGEMM symbolic does now is read the member out of the hashmap only to pass it back in.
  • max_value_size should really be called capacity or something similar: its meaning is the length of the two arrays (keys, nexts).
  • used_size is an atomically incremented counter that measures how full the hashmap is. It is updated in parallel, so it usually lives in shared/global memory. It should either not be a member and just be passed by reference to insert, or be kept as a member but as a pointer.
  • 3 ways to deal with hash_key_size:
    • Remove the member, since it is unused
    • Keep it as an immutable member, and have KOKKOS_FORCEINLINE_FUNCTION key_type hash() be a member of Hashmap. The SPGEMM code wouldn't need to keep the hash mask as a local variable anymore.
    • Keep it as an immutable member, and instead of passing both the key and the hash to insert(), just pass the key and compute the hash internally.
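
To make the list above concrete, here is a rough, hypothetical sketch of what the resulting interface could look like; the names (capacity, hash_key_mask, used_size as a pointer) and the simplified insert() signature are illustrative assumptions, not the actual KokkosKernels class:

```c++
// Rough sketch only: renamed members plus an insert() that computes the hash
// internally. Not the actual KokkosKernels HashmapAccumulator.
#include <Kokkos_Core.hpp>

template <typename size_type, typename key_type, typename value_type>
struct HashmapAccumulatorSketch {
  size_type capacity;        // was max_value_size: length of keys/values/hash_nexts
  size_type hash_key_mask;   // hash_key_size - 1, assuming a power-of-two table
  size_type *used_size;      // shared, atomically incremented fill counter
  size_type *hash_begins;    // head index per bucket, -1 when the bucket is empty
  size_type *hash_nexts;     // per-entry next index forming each bucket's linked list
  key_type *keys;
  value_type *values;

  KOKKOS_FORCEINLINE_FUNCTION
  size_type hash(key_type key) const { return key & hash_key_mask; }

  // Callers pass only the key; the hash and the capacity check happen internally.
  KOKKOS_INLINE_FUNCTION
  int insert(key_type key, value_type value) const;
};
```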
@ndellingwood
Contributor

I'll mark this cleanup as an enhancement since this isn't a bug

@brian-kelley brian-kelley added the Cleanup label and removed the enhancement label on Apr 7, 2020
@e10harvey e10harvey self-assigned this May 14, 2020
@e10harvey
Contributor

e10harvey commented May 20, 2020

Keep it as an immutable member, and instead of passing both the key and the hash to insert(), just pass the key and compute the hash internally.

@brian-kelley: Although removing hash_key_size is the simplest solution, I will attempt to refactor all the code to compute the hash internally (within HashmapAccumulator). Computing the hash internally is a step in the right direction. There is one special use case of

// Used in kkmem's numeric phase to insert into first-level hashmaps.
// Function to be called from device.
// Accumulation is an Add operation. It is not atomicAdd, as this is for the
// cases where we know that none of the simultaneous insertions will have the
// same key. Insertion is simultaneous for the vector lanes of a thread.
// used_size should be a pointer shared among the thread's vector lanes.
template <typename team_member_t>
KOKKOS_INLINE_FUNCTION
int vector_atomic_insert_into_hash_mergeAdd(
    const team_member_t & /* teamMember */,
    const int /* vector_size */,
    size_type &hash,
    const key_type key,
    const value_type value,
    volatile size_type *used_size_,
    const size_type max_value_size_)
{
  if (hash != -1) {
    // Search the bucket's linked list; if the key already exists, accumulate into it.
    size_type i = hash_begins[hash];
    for (; i != -1; i = hash_nexts[i]) {
      if (keys[i] == key) {
        values[i] = values[i] + value;
        return INSERT_SUCCESS;
      }
    }
  } else {
    return INSERT_SUCCESS;
  }
  if (used_size_[0] >= max_value_size_) {
    return INSERT_FULL;
  }
  size_type my_write_index = Kokkos::atomic_fetch_add(used_size_, size_type(1));
  if (my_write_index >= max_value_size_) {
    return INSERT_FULL;
  } else {
    keys[my_write_index] = key;
    values[my_write_index] = value;
#if defined(KOKKOS_ARCH_VOLTA) || defined(KOKKOS_ARCH_VOLTA70) || defined(KOKKOS_ARCH_VOLTA72)
    // This is an issue on VOLTA because warps no longer execute in SIMD fashion.
    // While one thread inserts my_write_index into the linked list, another thread
    // in the warp might be reading keys in the loop above. Before inserting the new
    // value into the linked list (done with the atomic exchange below), we make sure
    // the linked list is complete by assigning hash_nexts to the current head. The
    // head might be different when we do the atomic exchange; that would temporarily
    // skip a key in the linked list until hash_nexts is updated a second time below.
    // This is okay for SpGEMM, because no two identical keys are inserted into the
    // hashmap at the same time, as rows have unique columns.
    hash_nexts[my_write_index] = hash_begins[hash];
#endif
    size_type hashbeginning = Kokkos::atomic_exchange(hash_begins + hash, my_write_index);
    hash_nexts[my_write_index] = hashbeginning;
    return INSERT_SUCCESS;
  }
}
in
if (overall_num_unsuccess) {
  nnz_lno_t hash_ = -1;
  if (num_unsuccess) {
    hash_ = b_col_ind % global_memory_hash_size;
  }
  //int insertion =
  hm2.vector_atomic_insert_into_hash_mergeAdd(
      teamMember, vector_size,
      hash_, b_col_ind, b_val,
      used_hash_sizes + 1, hm2.max_value_size);
}
which calculates the hash using modulo instead of bitwise AND.

This same use case makes it challenging to remove max_value_size as a parameter entirely.

@srajama1
Contributor

Keep in mind this is used at the very low level. Performance of this data structure is important for the performance of the kernels.

@e10harvey
Contributor

Keep in mind this is used at the very low level. Performance of this data structure is important for the performance of the kernels.

Thanks, Siva. What are the reasons for calculating the hash outside of the inlined HashmapAccumulator insertion routines?

@srajama1
Contributor

I don't see a particular use case for that.

@e10harvey
Contributor

I mean, we are currently calculating the hash outside of the inlined HashmapAccumulator insertion routines. Why are we doing this?

@brian-kelley
Contributor Author

@e10harvey I guess you could make bool powerOfTwoHashSize a template parameter, and that way no flags need to be stored and there's no runtime overhead. However you do it, we definitely need to use bitwise-AND masking where possible because integer div/mod are expensive on CUDA.
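
A minimal sketch of that idea, assuming a bool template parameter chooses the hash at compile time (the names PowerOfTwoHashSize and HashSketch are illustrative, not the actual refactor):

```c++
// Minimal sketch: the hash strategy is fixed at compile time by a bool template
// parameter, so no runtime flag or function pointer is needed.
#include <Kokkos_Core.hpp>

template <typename size_type, typename key_type, bool PowerOfTwoHashSize>
struct HashSketch {
  size_type hash_key_size;  // table size; a power of two when PowerOfTwoHashSize is true

  KOKKOS_FORCEINLINE_FUNCTION
  size_type hash(key_type key) const {
    // The condition is a compile-time constant, so the compiler keeps only one branch.
    if (PowerOfTwoHashSize)
      return key & (hash_key_size - 1);  // cheap bitwise-AND mask
    else
      return key % hash_key_size;        // integer mod: expensive on CUDA
  }
};
```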

@e10harvey
Contributor

@e10harvey I guess you could make bool powerOfTwoHashSize a template parameter, and that way no flags need to be stored and there's no runtime overhead. However you do it, we definitely need to use bitwise-AND masking where possible because integer div/mod are expensive on CUDA.

Thanks @brian-kelley. This is exactly what I've started.

@e10harvey
Contributor

@brian-kelley: Actually, thinking about this more, is there any reason against adding the following to the constructor bodies:

compute_hash_func = compute_hash_func_table[__hash_op];

Then the insertion routines will invoke:

hash = compute_hash_func(arg1, arg2);
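
A minimal sketch of that proposal (compute_hash_func_table, the hash-op index, and the two hash functions below are illustrative assumptions, not existing KokkosKernels code):

```c++
// Hypothetical sketch of a constructor-selected hash function table.
// None of these names are the actual KokkosKernels API.
#include <cstdint>

using size_type   = uint32_t;
using key_type    = uint32_t;
using hash_func_t = size_type (*)(key_type, size_type);

inline size_type hash_bitwise_and(key_type key, size_type mask) { return key & mask; }
inline size_type hash_modulo(key_type key, size_type size) { return key % size; }

// Indexed by a hash-op value passed to the constructor (the __hash_op above).
static const hash_func_t compute_hash_func_table[] = {hash_bitwise_and, hash_modulo};

struct AccumulatorHashSketch {
  hash_func_t compute_hash_func;
  size_type size_or_mask;  // bit mask for hash_bitwise_and, table size for hash_modulo

  AccumulatorHashSketch(int hash_op, size_type size_or_mask_)
      : compute_hash_func(compute_hash_func_table[hash_op]), size_or_mask(size_or_mask_) {}

  size_type hash(key_type key) const { return compute_hash_func(key, size_or_mask); }
};
```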

@brian-kelley
Contributor Author

@e10harvey Function pointers/first-class functions incur overhead on CPUs and I don't think they can be used at all in CUDA. I would just go with the template.

@brian-kelley
Contributor Author

Actually function pointers are supported in all CUDA versions supported by Kokkos, but they still have a performance cost.
