Parallelization Opportunity #73
There is a dependency in this loop, yes. Some of the loops in the generation can be parallelized, however, and if the logic is changed, then more of the loops can be parallelized. But not all loops are easy to change, and the speedup wouldn't always be linear: the algorithm would have to be changed a bit. Parallelization is much easier (and much closer to linear) if the set is split into multiple sub-sets; basically, if multiple filters are generated that are independent of each other. The disadvantage of that approach is that lookup would be slower.
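A minimal sketch of that splitting approach, assuming the library's public `binary_fuse8_*` API; the shard count, the modulo partitioning, and the `build_sharded` helper are illustrative choices, not part of the library:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include "binaryfusefilter.h"

#define NSHARDS 16 /* illustrative shard count */

/* The partitioning hash must be fixed before any shard is built and
 * reused unchanged at query time. Plain modulo is enough here because
 * the keys fed to binary fuse filters are already 64-bit hashes. */
static inline uint32_t shard_of(uint64_t key) { return (uint32_t)(key % NSHARDS); }

/* Build NSHARDS independent filters. Each shard sees only its own keys,
 * so the populate calls are independent of each other and could run
 * concurrently without touching the library's internal construction loop. */
static bool build_sharded(const uint64_t *keys, size_t n,
                          binary_fuse8_t shards[NSHARDS]) {
  uint64_t *buckets[NSHARDS];
  size_t counts[NSHARDS] = {0}, fill[NSHARDS] = {0};
  for (size_t i = 0; i < n; i++) counts[shard_of(keys[i])]++;
  for (int s = 0; s < NSHARDS; s++) {
    buckets[s] = (uint64_t *)malloc(counts[s] * sizeof(uint64_t));
    if (buckets[s] == NULL) return false; /* cleanup omitted for brevity */
  }
  for (size_t i = 0; i < n; i++) {
    uint32_t s = shard_of(keys[i]);
    buckets[s][fill[s]++] = keys[i];
  }
  bool ok = true;
  /* This loop is the parallelization point: iterations touch disjoint
   * shards, so they can be handed to separate threads. */
  for (int s = 0; s < NSHARDS; s++) {
    if (!binary_fuse8_allocate((uint32_t)counts[s], &shards[s]) ||
        !binary_fuse8_populate(buckets[s], (uint32_t)counts[s], &shards[s]))
      ok = false;
    free(buckets[s]);
  }
  return ok;
}
```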
Is that necessarily true? Suppose I have 10 filters. I get an input, I hash it to a value n between 0 and 9. The hash function is fixed and is selected before any of the filters are constructed. Then I check filter n. There is an additional step, but if the filter is large enough to warrant splitting it up, it should be a negligible cost. (I have a cold, so maybe I am not thinking straight.)
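That routing step can be made concrete with a small sketch, reusing the hypothetical `shard_of` and `NSHARDS` from the snippet above; the only extra work per query is one hash-and-index operation:

```c
/* Lookup against the sharded structure: one extra hash to pick the
 * shard, then a normal single-filter query. */
static inline bool sharded_contain(uint64_t key,
                                   const binary_fuse8_t shards[NSHARDS]) {
  return binary_fuse8_contain(key, &shards[shard_of(key)]);
}
```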
Let me give you some more context on why I want to parallelize that specific loop. In ChalametPIR (https://github.com/itzmeanjan/ChalametPIR), we extend binary fuse filters to key-value maps. Think of a key-value map that we encode as a matrix such that each column of the matrix is a binary fuse filter. The rows represent encoded key-value pairs such that each element of a row keeps only a few significant bits, say 8 (it must be < 32 bits). So when constructing the filters, we do the common work of computing reverse_order, reverse_h, etc. only once (https://github.com/itzmeanjan/ChalametPIR/blob/befe5afbf7c99a58a5a4cf6c9347ece369ef22aa/src/pir_internals/matrix.rs#L338-L339), because that part is the same for all the vertical binary fuse filters. The computation of the fingerprints for a specific key-value pair (encoded as a row of the resulting matrix) is done inside the loop (https://github.com/itzmeanjan/ChalametPIR/blob/befe5afbf7c99a58a5a4cf6c9347ece369ef22aa/src/pir_internals/matrix.rs#L353). Now, if we have a way to parallelize that loop, we can enjoy a huge boost in the performance of the server-setup function of the key-value PIR scheme.
@itzmeanjan Can you elaborate on why you think that this particular loop can be parallelized for a huge boost in performance? It is not at all obvious to me what you have in mind. Could you sketch the algorithm you envision?
@lemire let me know if this helps. Forgive my handwriting 🙏 😆
@itzmeanjan I am afraid that this does not help me. I assume that you have an algorithm in mind and some cost model. Can you spell it out?
Yes: there is one more step in the calculation. But as you wrote, it is most likely a negligible cost; it is just one operation. I think for the problem at hand, it wouldn't matter. I have to admit I do not fully understand the problem to be solved, but I will still try to comment:
So all the filters have the same order. The question is then: instead of multiple filters with the same order, why not use one filter that has more bits per key? I can understand that if the number of bits is not a power of 2 (e.g., 240 bits), then it might be easier to use multiple filters, but that is slower to query. But actually, that is not my main point... I just want to understand why you are using multiple filters that have the same ordering. If I understood correctly, you are interested in performance, so this doesn't match my expectation. It might just show that I don't fully understand the problem... The real point is: I think splitting the filter (or the filters) into multiple smaller filters would resolve the parallelization problem. Let's say you split the large filter into 16 smaller filters: you can build all of them in parallel, without having to change the code in the filters. At the end, you might then have a multi-dimensional array of filters. But I think the problem is solved, and performance would still be very good: actually, construction would be better, because building 16 (or 256, ...) independent filters concurrently is certainly easier (and likely faster) than trying to parallelize the construction of one filter. I think.
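Continuing the earlier hypothetical sketch, one way to realize "build all of them in parallel" without modifying the library is a single OpenMP pragma over the shard loop (error handling omitted):

```c
/* Same shard loop as in the construction sketch above; because each
 * iteration writes only to its own shard and bucket, the iterations
 * can safely run concurrently. Compile with -fopenmp. */
#pragma omp parallel for schedule(dynamic)
for (int s = 0; s < NSHARDS; s++) {
  binary_fuse8_allocate((uint32_t)counts[s], &shards[s]);
  binary_fuse8_populate(buckets[s], (uint32_t)counts[s], &shards[s]);
}
```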
What would probably help me understand the problem you want to solve is if you explained the "end user" lookup/query API and the algorithm you use. I assume it is "get(key) => value / error". The key is a byte array... is the size of the byte array fixed? The value is a byte array... is its size variable? The errors are just probabilistic, right? So with a very, very low probability, you could get a value even if you asked for a key that is not stored?
Hi @lemire, good day.
Is it possible to parallelize this loop?
xor_singleheader/include/binaryfusefilter.h, lines 733 to 747 at commit a5a3630
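For readers without the source open: the referenced lines are the final fingerprint-assignment pass of binary fuse construction. The following is only a rough schematic of its shape, with `fingerprint`, `h`, and `slot_for` as stand-ins for internal helpers (it is not the literal code at those lines), to show why each iteration depends on earlier ones:

```c
/* Keys are visited in the reverse of the peeling order. The slot
 * assigned for key i is derived by XORing fingerprint slots that
 * earlier iterations of this same loop may already have written,
 * so the reads below can observe writes from previous iterations:
 * a loop-carried dependency that blocks naive parallelization. */
for (uint32_t i = size - 1; i != UINT32_MAX; i--) {
  uint64_t hash = reverse_order[i];
  uint8_t f = fingerprint(hash);
  uint32_t h0 = h(0, hash), h1 = h(1, hash), h2 = h(2, hash);
  Fingerprints[slot_for(i)] =
      f ^ Fingerprints[h0] ^ Fingerprints[h1] ^ Fingerprints[h2];
}
```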