-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get rid of state
to improve performance (…failed?)
#4
Comments
Interesting. Did you use the extra two bits to increase the LUT size before you benchmarked? The performance drop can be explained by the bit operation overhead (which is low, but not zero). A LUT lookup on a modern CPU with a hot cache is very faster. But that said, my implementation might be slower on a system without a cache such as a low cost microcontoller. On the other hand on such a system you might better also want to avoid a 256 Byte LUT anyways... Actually I ran the benchmark again on a full 256 * 256 2D space and the performance gain was more in the range of 5 times faster. See the updated readme. But again, it also depends on the system you are benchmarking on. And the 2.5 times was before your pr which did also increase the performance because counting zeros is faster than the iteration over the LUT and saves time for smaller numbers. |
state
to improve performance (FAILED)state
to improve performance (…failed?)
I figured out how to make it go faster. However, if I make it calculate the new state imperatively, it merely matches your performance; to make it go faster (by about a third) I had to also extract the state into a new LUT. Please find an example implementation for const STATE_LUT: [u8; 256] = [0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 2, 3, 3, 0, 3, 2, 2, 1, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 3, 2, 2, 1, 0, 1, 1, 2, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0];
const LUT_4_BPC: [u8; 256] = [
000, 001, 014, 015, 016, 019, 020, 021, 234, 235, 236, 239, 240, 241, 254, 255,
003, 002, 013, 012, 017, 018, 023, 022, 233, 232, 237, 238, 243, 242, 253, 252,
004, 007, 008, 011, 030, 029, 024, 025, 230, 231, 226, 225, 244, 247, 248, 251,
005, 006, 009, 010, 031, 028, 027, 026, 229, 228, 227, 224, 245, 246, 249, 250,
058, 057, 054, 053, 032, 035, 036, 037, 218, 219, 220, 223, 202, 201, 198, 197,
059, 056, 055, 052, 033, 034, 039, 038, 217, 216, 221, 222, 203, 200, 199, 196,
060, 061, 050, 051, 046, 045, 040, 041, 214, 215, 210, 209, 204, 205, 194, 195,
063, 062, 049, 048, 047, 044, 043, 042, 213, 212, 211, 208, 207, 206, 193, 192,
064, 067, 068, 069, 122, 123, 124, 127, 128, 131, 132, 133, 186, 187, 188, 191,
065, 066, 071, 070, 121, 120, 125, 126, 129, 130, 135, 134, 185, 184, 189, 190,
078, 077, 072, 073, 118, 119, 114, 113, 142, 141, 136, 137, 182, 183, 178, 177,
079, 076, 075, 074, 117, 116, 115, 112, 143, 140, 139, 138, 181, 180, 179, 176,
080, 081, 094, 095, 096, 097, 110, 111, 144, 145, 158, 159, 160, 161, 174, 175,
083, 082, 093, 092, 099, 098, 109, 108, 147, 146, 157, 156, 163, 162, 173, 172,
084, 087, 088, 091, 100, 103, 104, 107, 148, 151, 152, 155, 164, 167, 168, 171,
085, 086, 089, 090, 101, 102, 105, 106, 149, 150, 153, 154, 165, 166, 169, 170,
];
fn hilbert_index(&self) -> u64 {
fn separate_nybbles(x: u32) -> u64 {
let mut x = x as u64;
x |= x<<16;
x &= 0x0000_FFFF_0000_FFFF;
x |= x<<8;
x &= 0x00FF_00FF_00FF_00FF;
x |= x<<4;
x &= 0x0F0F_0F0F_0F0F_0F0F;
x
}
let bloated_1 = separate_nybbles(self[0]);
let bloated_2 = separate_nybbles(self[1]);
let key_0 = bloated_2 << 4 | bloated_1;
// Now, each byte of key_0 has one nybble of x and one nybble of y.
let key_1 = bloated_1 << 4 | bloated_2;
let key_2 = !key_0;
let key_3 = !key_1;
let combineds = [key_0, key_1, key_2, key_3];
let mut state = 0u8;
let mut result = [0u8; 8];
for i in 0..8 {
let index = combineds[state as usize].to_be_bytes()[i];
let new_thing = LUT_4_BPC[index as usize];
result[i] = new_thing;
state ^= STATE_LUT[new_thing as usize];
}
u64::from_be_bytes(result)
} Since we have the state LUT already, and since it's indexed by the Hilbert index's keys rather than X and Y, I think it would probably be possible to also use the same LUT to also hep accelerate the inverse transformation. Any ideas? |
I am not 100% sure how you crated the state LUT. But you still have 6 "free" bits in the state LUT since you only used 2 so far for the state. So one pragmatic approach could be to compute the inverser LUT and add the result in the existing LUT as well. Does this make sense? |
I thought I'd try to figure out a way to get rid of the
state
parametre in both the LUT's index and the results it outputs. I succeeded in achieving that, but to my astonishment it ended up consistently ~50% slower! I tried optimising things further here and there, but no consistent result came up. Here is themain.rs
file that explains how I did all those things; in thelib.rs
file you'll see the actual methods I used.That said…
You said in your benchmarks that
fast_hilbert
is around 2-2.5 times faster than other comparable methods. My own benchmarks show the improvement to be on the order of 8-10 times faster! I strongly suspect that our benchmarks lead to different results. Thus, if this seems interesting to you, feel free to experiment as well and see; otherwise, just close the issue.PS: The main reason I ultimately decided to show you the work I did is because I think that this will help illuminate the way to generalise to other dimensions.
The text was updated successfully, but these errors were encountered: